Sizing a Log Store

Calculate the size of the log store your project requires. Correctly sizing your log store is important: stores that are too small or too large can cause performance issues.

You will start this procedure by calculating your project’s internal record size. An internal record represents a row in an Event Stream Processor window. Each row contains a fixed-size header plus a variable-size payload containing the column offsets, column data, and any optional fields. Use this formula for the calculation in step 1:


Log store recordSize formula:

  recordSize = 56 + (4 * M) + (PS1 + PS2 + ... + PSM), rounded up as described in step 1.d

In the formula,
  • M represents the number of columns
  • PS represents the primitive datatype size for each of the M columns
Primitive datatypes are the building blocks that make up more complex structures such as records, dictionaries, vectors, and event caches. This table gives the size of each datatype.
Primitive Datatype Sizes

Datatype      Size in Bytes                             Notes
Boolean       1
Decimal       18
Integer       4
Long          8
String        1 + number of characters in the string    Estimate an average length
Float         8
Money(n)      8
Date          8
Time          8
Timestamp     8
BigDateTime   8
Binary        4 + number of bytes in the binary value   Estimate an average length
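
For illustration only, the formula and the table above can be combined into a short calculation. The following Python sketch is not part of ESP; the helper names and the idea of passing an estimated average length for STRING and BINARY columns are assumptions made for this example.

    # Primitive sizes from the Primitive Datatype Sizes table (illustrative sketch only).
    PRIMITIVE_SIZE = {
        "boolean": 1, "decimal": 18, "integer": 4, "long": 8,
        "float": 8, "money": 8, "date": 8, "time": 8,
        "timestamp": 8, "bigdatetime": 8,
    }

    def column_size(datatype, avg_len=0):
        # avg_len is an estimated average length for variable-size columns.
        if datatype == "string":
            return 1 + avg_len
        if datatype == "binary":
            return 4 + avg_len
        return PRIMITIVE_SIZE[datatype]

    def record_size(column_sizes, align=8):
        # 56-byte row header + 4-byte offset per column + column data,
        # rounded up to the platform alignment (8 on 64-bit, 4 on 32-bit).
        unrounded = 56 + 4 * len(column_sizes) + sum(column_sizes)
        remainder = unrounded % align
        return unrounded if remainder == 0 else unrounded + (align - remainder)
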
Note: Guaranteed delivery (GD) logs hold events stored for delivery. If no GD logs are stored in the log store, you can skip steps 1 through 3; instead, compute the dataSize by using the Playback feature in Studio or the esp_playback utility to record and play back real data to get a better idea of the amount of data you need to store. (See the Studio Users Guide for details on Playback or the Utilities Guide for details on esp_playback.) The log store reports “liveSize” in the server log when the project exits (with log level three or higher) or after every compaction (with log level six or higher). Use the “liveSize” value for the dataSize referenced in step 2 and beyond.
  1. For each window, calculate the size of an internal record. If the window supports GD, compute the size for the GD logs separately.
    For purposes of illustration, use this schema:
    CREATE SCHEMA TradesSchema AS (
           TradeId    LONG,
           Symbol     STRING,
           Price      MONEY(4),
           Volume     INTEGER,
           TradeDate  BIGDATETIME
    );
    
    1. Using the primitive sizes from the Primitive Datatype Sizes table, compute the column values—the total size in bytes for the datatypes in the schema. For the sample schema, assuming an average STRING length of 4, the calculation is:
      8 + (4 + 1) + 8 + 4 + 8 = 33 bytes
    2. Add the size of the offsets to the size of the column values. The offsets are calculated as (4 * M) where M is the number of columns. Plugging in the sample schema’s five columns, we get:
      (4 * 5) + 33 = 53 bytes
    3. Add the size of the row header, which is always 56 bytes:
      56 + 53 = 109 bytes
    4. Round up to the nearest number divisible by:
      • 8 if ESP is running on a 64-bit architecture
      • 4 if ESP is running on a 32-bit architecture
      For a 64-bit installation, use this formula:
      URS + (8 - (URS modulo 8))
      where URS is the unrounded record size value you computed in step 1.c. (For a 32-bit installation, substitute a 4 for each 8 in the formula. If URS is already a multiple of 8, no rounding is needed.) Continuing with our example, where we assume ESP is running on a 64-bit machine:
      109 + (8 - (109 modulo 8)) = 109 + 3 = 112 bytes
    5. Label your result recordSize and make a note of it.
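
    Reusing the hypothetical helpers from the sketch after the datatype table, steps 1.a through 1.e for the sample TradesSchema work out like this (the average STRING length of 4 is the same assumption used above):

      # TradesSchema columns: LONG, STRING (avg length 4), MONEY(4), INTEGER, BIGDATETIME
      columns = [
          column_size("long"),                # TradeId    -> 8
          column_size("string", avg_len=4),   # Symbol     -> 1 + 4 = 5
          column_size("money"),               # Price      -> 8
          column_size("integer"),             # Volume     -> 4
          column_size("bigdatetime"),         # TradeDate  -> 8
      ]
      print(sum(columns))                     # 33  (step 1.a: column values)
      print(4 * len(columns) + sum(columns))  # 53  (step 1.b: plus offsets)
      print(56 + 53)                          # 109 (step 1.c: plus row header)
      print(record_size(columns, align=8))    # 112 (step 1.d: rounded up) = recordSize
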
  2. Estimate the maximum amount of data, in bytes, that you expect to collect in the log store. To do this you must determine the maximum number of records each window assigned to the log store will contain. If the window supports guaranteed delivery, treat the GD logs as a separate window, and for the record count use the maximum number of uncommitted rows you expect the GD logs to contain for this window. Add 1000 to this value because GD logs are purged only when there are at least 1000 fully committed events.

    Next, for each window, determine the data size by multiplying the expected record count by the recordSize you computed in step 1.e. Sum the data size for all the windows and GD logs to get the total size of the data that will be stored in the log store. Label this value dataSize.

    Also sum the record counts for each window and GD log assigned to this log store and label that value recordCount.
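
    As an illustration of step 2, the sketch below totals dataSize and recordCount for a hypothetical log store holding two windows and one GD log. The window names, record counts, and record sizes are invented for this example; use your own expected maximums.

      # (maximum records expected, recordSize in bytes) for each window or GD log
      # assigned to this log store -- all values below are hypothetical.
      windows = {
          "TradesWindow":      (500_000, 112),
          "PositionsWindow":   (100_000, 160),
          # GD log: maximum uncommitted rows + 1000, because GD logs are purged
          # only when there are at least 1000 fully committed events.
          "TradesWindow_GD":   (20_000 + 1000, 112),
      }

      dataSize    = sum(count * size for count, size in windows.values())  # 74,352,000
      recordCount = sum(count for count, _ in windows.values())            # 621,000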

  3. To calculate the basic indexing overhead, multiply the recordCount from step 2 by 96 bytes. Add the result to the dataSize value.
  4. Choose the value of the reservePct parameter. The required store size, in bytes, including the reserve, is calculated as:

    storeBytes = dataSize * 100 / (100 - reservePct)

    where dataSize is the value you computed in step 3.

    Round storeBytes up to the next megabyte.
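
    Carrying the hypothetical totals from the step 2 sketch forward, steps 3 and 4 can be written out as below; the reservePct value of 20 is only an example.

      import math

      MB          = 1024 * 1024
      dataSize    = 74_352_000   # hypothetical total from the step 2 sketch
      recordCount = 621_000      # hypothetical total from the step 2 sketch
      reservePct  = 20           # example value only

      dataSize  += recordCount * 96                      # step 3: indexing overhead
      storeBytes = dataSize * 100 / (100 - reservePct)   # step 4: include the reserve
      storeBytes = math.ceil(storeBytes / MB) * MB       # round up to the next megabyte
      print(storeBytes // MB)                            # 160 (MB) for these numbers
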
  5. Ensure the reserve cannot be overrun by the uncheckpointed data.
    Estimate the maximum amount of uncheckpointed data that is produced when the input queues of all the streams, except source streams, are full. Records in queues located early in the sequence must be counted together with any records they produce as they are processed through the project, so include the number of output records each stream produces for each of its input records.
    This example shows the stream queue depth set to the default of 1024, for a log that contains four streams ordered like this:
    source --> derived1 --> derived2 --> derived3
    1. Determine the number of records that are produced by each stream as it consumes the contents of its queue:
      • 1024 records may end up in derived1's input queue. Assuming derived1 produces one output record for each input record, it produces 1024 records.
      • 2048 records may end up in derived2's input queue (1024 that are already collected on its own queue, and 1024 more from derived1). Assuming that derived2 is a join and generates on average 2 output records for each input record, it produces 4096 records ([1024 + 1024] * 2).
      • 5120 records may end up in derived3 (1024 from its own queue and 4096 from derived2). Assuming a pass-through ratio of 1, derived3 produces 5120 records.
      When the project’s topology is not linear, you must take all branches into account. The pass-through ratio may differ for data coming from different parent streams, so add up the data from all the input paths. Each stream has only one input queue, so its depth is fixed regardless of how many parent streams it is connected to; however, the mix of records in each queue may vary. Assume the entire queue is composed of the records that produce the highest amount of output. Some input streams may contain static data that is loaded once and never changes during normal operation; you do not need to count these inputs. In the example, derived2 is a join stream and has static data as its second input.
    2. Calculate the space required by multiplying the number of records from each stream by that stream's average record size, and adding the results.
      For example, if the records in derived1 average 100 bytes; derived2, 200 bytes; and derived3, 150 bytes, the calculation is:

      (1024 * 100) + (4096 * 200) + (5120 * 150) = 1,689,600

      Trace the record count through the entire project, starting from the source streams down to all the streams in the log store. Add the data sizes from the streams located in the log store. (The sketch following this step works through this example.)
    3. Multiply the record count by 96 bytes to calculate the indexing overhead and add the result to the volume in bytes:

      (1024 + 4096 + 5120) * 96 = 983,040

      1,689,600 + 983,040 = 2,672,640

      Verify that this result is no larger than one quarter of the reserve size:

      uncheckpointedBytes < storeBytes * (reservePct / 4) / 100

      If the result is larger than one quarter of the reserve size, increase the reserve percent and repeat the store size calculation. Uncheckpointed data is mainly a concern for smaller stores. Other than through the uncheckpointed data size, this overhead does not significantly affect the store size calculation, because the cleaning cycle removes it and compacts the data.
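
    The worked example in step 5 can be written out as follows; the queue depth of 1024, the pass-through ratios, and the average record sizes are the assumptions stated above, and the final check reuses the storeBytes and reservePct values from step 4.

      # Records each derived stream may produce while draining a full input queue
      # (queue depth 1024; topology: source -> derived1 -> derived2 -> derived3).
      derived1 = 1024 * 1                 # pass-through ratio of 1
      derived2 = (1024 + derived1) * 2    # join: ~2 output records per input record
      derived3 = (1024 + derived2) * 1    # pass-through ratio of 1

      streams = {"derived1": (derived1, 100),   # (record count, average record size)
                 "derived2": (derived2, 200),
                 "derived3": (derived3, 150)}

      uncheckpointedBytes  = sum(n * size for n, size in streams.values())  # 1,689,600
      uncheckpointedBytes += sum(n for n, _ in streams.values()) * 96       # + 983,040
                                                                            # = 2,672,640
      # The reserve must not be overrun:
      #   uncheckpointedBytes < storeBytes * (reservePct / 4) / 100
      # If the check fails, increase reservePct and repeat the store size calculation.
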
  6. When you create the log store, place storeBytes, the log store size value you arrive at here, in the CREATE LOG STORE statement’s maxfilesize parameter.