Statistical Aggregate Functions

The ANSI SQL/OLAP extensions provide a number of additional aggregate functions that permit statistical analysis of numeric data. This support includes functions to compute variance, standard deviation, correlation, and linear regression.

Standard deviation and variance

The SQL/OLAP general set functions that take one argument include those appearing in bold in this syntax statement:

<SIMPLE WINDOW AGGREGATE FUNCTION TYPE> ::=
  <BASIC AGGREGATE FUNCTION TYPE>
  | STDDEV | STDDEV_POP | STDDEV_SAMP
  | VARIANCE | VARIANCE_POP | VARIANCE_SAMP
  • STDDEV_POP – computes the population standard deviation of the provided value expression evaluated for each row of the group or partition (if DISTINCT is specified, each row that remains after duplicates are eliminated), defined as the square root of the population variance.

  • STDDEV_SAMP – computes the population standard deviation of the provided value expression evaluated for each row of the group or partition (if DISTINCT is specified, each row that remains after duplicates are eliminated), defined as the square root of the sample variance.

  • VAR_POP – computes the population variance of value expression evaluated for each row of the group or partition (if DISTINCT is specified, each row that remains after duplicates are eliminated), defined as the sum of squares of the difference of value expression from the mean of value expression, divided by the number of rows (remaining) in the group or partition.

  • VAR_SAMP – computes the sample variance of value expression evaluated for each row of the group or partition (if DISTINCT is specified, each row that remains after duplicates are eliminated), defined as the sum of squares of the difference of value expression, divided by one less than the number of rows (remaining) in the group or partition.

These functions, including STDDEV and VARIANCE, are true aggregate functions in that they can compute values for a partition of rows as determined by the query’s ORDER BY clause. As with other basic aggregate functions such as MAX or MIN, their computation ignores NULL values in the input. Also, regardless of the domain of the expression being analyzed, all variance and standard deviation computation uses IEEE double-precision floating point. If the input to any variance or standard deviation function is the empty set, then each function returns NULL as its result. If VAR_SAMP is computed for a single row, it returns NULL, while VAR_POP returns the value 0.

Correlation

The SQL/OLAP function that computes a correlation coefficient is:

  • CORR – returns the correlation coefficient of a set of number pairs.

You can use the CORR function either as a windowing aggregate function (where you specify a window function type over a window name or specification) or as a simple aggregate function with no OVER clause.

Covariance

The SQL/OLAP functions that compute covariances include:

  • COVAR_POP – returns the population covariance of a set of number pairs.

  • COVAR_SAMP – returns the sample covariance of a set of number pairs.

The covariance functions eliminate all pairs where expression1 or expression2 has a null value.

You can use the covariance functions either as windowing aggregate functions (where you specify a window function type over a window name or specification) or as simple aggregate functions with no OVER clause.

Cumulative distribution

The SQL/OLAP function that calculates the relative position of a single value among a group of rows is CUME_DIST.

The window specification must contain an ORDER_BY clause.

Composite sort keys are not allowed in the CUME_DIST function.

Regression analysis

The regression analysis functions calculate the relationship between an independent variable and a dependent variable using a linear regression equation. The SQL/OLAP linear regression functions include:

  • REGR_AVGX – computes the average of the independent variable of the regression line.

  • REGR_AVGY – computes the average of the dependent variable of the regression line.

  • REGR_COUNT – returns an integer representing the number of nonnull number pairs used to fit the regression line.

  • REGR_INTERCEPT – computes the y-intercept of the regression line that best fits the dependent and independent variables.

  • REGR_R2 – computes the coefficient of determination (the goodness-of-fir statistic) for the regression line.

  • REGR_SLOPE – computes the slope of the linear regression line fitted to nonnull pairs.

  • REGR_SXX – returns the sum of squares of the independent expressions used in a linear regression model. Use this function to evaluate the statistical validity of the regression model.

  • REGR_SXY – returns the sum of products of the dependent and independent variables. Use this function to evaluate the statistical validity of the regression model.

  • REGR_SYY – returns values that can evaluate the statistical validity of a regression model.

You can use the regression analysis functions either as windowing aggregate functions (where you specify a window function type over a window name or specification) or as simple aggregate functions with no OVER clause.

Weighted OLAP aggregates

The weighted OLAP aggregate functions calculate weighted moving averages:

  • EXP_WEIGHTED_AVG – calculates an exponentially weighted moving average. Weightings determine the relative importance of each quantity comprising the average. Weights in EXP_WEIGHTED_AVG decrease exponentially. Exponential weighting applies more weight to the most recent values and decreases the weight for older values, while still applying some weight

  • WEIGHTED_AVG – calculates a linearly weighted moving average where weights decrease arithmetically over time. Weights decrease from the highest weight for the most recent data points, down to zero for the oldest data point.

The window specification must contain an ORDER_BY clause.

Nonstandard database industry extensions

Non-ANSI SQL/OLAP aggregate function extensions used in the database industry include FIRST_VALUE, MEDIAN, and LAST_VALUE.

  • FIRST_VALUE – returns the first value from a set of values.

  • MEDIAN – returns the median from an expression.

  • LAST_VALUE – returns the last value from a set of values.

The FIRST_VALUE and LAST_VALUE functions require a window specification. You can use the MEDIAN function either as windowing aggregate function (where you specify a window function type over a window name or specification) or as a simple aggregate function with no OVER clause.