Improving Aggregation Performance

Aggregation functions typically require the server to iterate over every element in a group. For this reason, the performance of the aggregation operator is inversely proportional to the size of the group.

Aggregation functions can be used in a SELECT statement along with a GROUP BY clause or over event caches in SPLASH inside UDFs and FLEX operators.

For the SUM, COUNT, AVG, and valueInserted aggregation functions, the server can perform additive optimization, where the function executes in constant time. In such cases, the time it takes to perform an operation is the same regardless of group size.

In a SELECT statement, the server can perform additive optimization provided functions eligible for optimization are used in all values being selected, with the exception of the columns referenced in the GROUP BY clause.

The following SELECT is optimized for additive optimization since all non-GROUP BY columns (name, counter, summary) only use additive aggregation functions (that is, valueInserted, SUM, and COUNT).

CREATE OUTPUT WINDOW AggResult 
	SCHEMA (id INTEGER, name STRING, counter INTEGER, summary FLOAT)
	PRIMARY KEY DEDUCED
AS
	SELECT BaseInput.intData_1 AS id,
		valueInserted(BaseInput.strData_1) AS name,
		count(BaseInput.intData_1) AS counter,
		sum(BaseInput.dblData_1) AS summary
FROM BaseInput
GROUP BY BaseInput.intData_1
;

Note: For optimal peformance, when selecting only the column in a SELECT statement with a GROUP BY clause, use the valueInserted function, where feasible.

The following SELECT is not optimized for additive optimization since one of the non-GROUP BY columns (name) directly selects a column which cannot be computed additively.

CREATE OUTPUT WINDOW AggResult 
	SCHEMA (id INTEGER, name STRING, counter INTEGER, summary FLOAT)
	PRIMARY KEY DEDUCED
AS
	SELECT BaseInput.intData_1 AS id,
		BaseInput.strData_1 AS name,
		count(BaseInput.intData_1) AS counter,
		sum(BaseInput.dblData_1) AS summary
FROM BaseInput
GROUP BY BaseInput.intData_1
;

When applying aggregation functions over an event cache, additive optimization is turned on when using the SUM, COUNT, AVG, or valueInserted functions only in the ON clause of a FLEX operator. The additive optimization does not apply when functions are used inside a UDF.

The following Flex stream computes the SUM in the ON clause additively, since the SUM function is computed additively and the used EventCaches (e0,e1) are declared locally.

CREATE INPUT WINDOW In1 
	SCHEMA (c1 INTEGER, c2 STRING, c3 INTEGER, summary FLOAT)
	PRIMARY KEY (c1, c2);

CREATE FLEX MyFlex
	IN In1
	OUT OUTPUT WINDOW FlexOut 
	SCHEMA (c1 INTEGER, c2 INTEGER, c3 INTEGER, c4 INTEGER)
	PRIMARY KEY (c1, c2)
BEGIN 
	declare
		eventCache(In1, coalesce) e0;
		eventCache(In1, coalesce) e1;
	end;

	ON In1 {
	 {
		output setOpcode([c1=In1.c1;c2=In1.c2;|c3=sum(e0.c1);c4=sum(e1.c3);],getOpcode(In1));
	 }
	};
END;

The following Flex stream is not computed additively , since the STDDEV function cannot be computed additively.

CREATE INPUT WINDOW In1 
	SCHEMA (c1 INTEGER, c2 STRING, c3 INTEGER)
	PRIMARY KEY (c1, c2);

CREATE FLEX MyFlex
	IN In1
	OUT OUTPUT WINDOW FlexOut 
	SCHEMA (c1 INTEGER, c2 INTEGER, c3 INTEGER, c4 FLOAT)
	PRIMARY KEY (c1, c2)
BEGIN 
	declare
		eventCache(In1, coalesce) e0;
		eventCache(In1, coalesce) e1;
	end;

	ON In1 {
	 {
		output setOpcode([c1=In1.c1;c2=In1.c2;|c3=sum(e0.c1);c4=stddev(e1.c3);],getOpcode(In1));
	 }
	};
END;

Another restriction is that additive optimizations are disabled when functions are used inside nonlinear statements (if, while, for, and case statements). To enable additive optimizations when using a function within a nonlinear statement, assign the result of the function to a variable outside of the statement. Then use the variable inside the nonlinear statement.

Note: The function used within the nonlinear statement must be from the set of functions eligible for additive optimization.

The following SELECT is not optimized for additive optimization since one of the expressions (CASE) in the SELECT list is a nonlinear expression.

CREATE OUTPUT WINDOW AggResult 
	SCHEMA (id INTEGER, name STRING, counter INTEGER, summary FLOAT)
	PRIMARY KEY DEDUCED
AS
	SELECT BaseInput.intData_1 AS id,
		valueInserted(BaseInput.strData_1) AS name,
		CASE WHEN (count(BaseInput.intDATA_1) < 100) THEN 0 ELSE 1 END AS counter,
		sum(BaseInput.dblData_1) AS summary
FROM BaseInput
GROUP BY BaseInput.intData_1
;