Sizing a Log Store

Calculate the size of the log store you require. Sizing the log store correctly is important: a store that is too small or too large can cause performance issues.

  1. Estimate the maximum amount of data that you collect in the log store, both as a record count and as a volume in bytes. If you know both the number of records arriving in the source streams and the size of those records, simply perform the calculation. If not, use the Playback feature to record and play back real data to get a better idea of how much data you need to store.
    The log store reports "liveSize" in the server log when the project exits (with log level three or higher) and after every compaction (with log level six or higher).
    Note: If you use the liveSize value reported in the server log, skip step 2; liveSize already includes the indexing overhead.
  2. To calculate the basic indexing overhead, multiply the record count by 96 bytes. Add the result to the volume.
  3. Choose the value of the reservePct parameter. The required store size, in bytes, including the reserve, is calculated as:

    storeBytes = dataBytes * 100 / (100 - reservePct)

    Round this calculation up to the next megabyte. For a worked sketch of steps 2 and 3, see the first example following this procedure.
  4. Ensure the reserve cannot be overrun by the uncheckpointed data.
    Estimate the maximum amount of uncheckpointed data that is produced when the input queues of all the streams, except source streams, are full. Records in queues early in the sequence must be counted together with any records they produce as they are processed through the project: for each stream, include the number of output records it produces for each of its input records.
    This example assumes the stream queue depth is set to the default of 1024, for a log store that contains four streams ordered like this:
    source --> derived1 --> derived2 --> derived3
    1. Determine the number of records that are produced by each stream as it consumes the contents of its input queue:
      • 1024 records may end up in derived1's input queue. Assuming the stream produces one output record for each input record, derived1 produces 1024 records.
      • 2048 records may end up in derived2's input queue (1024 that are already collected on its own queue, and 1024 more from derived1). Assuming that derived2 is a join and generates on average 2 output records for each input record, it produces 4096 records ([1024 + 1024] * 2).
      • 5120 records may end up in derived3's input queue (1024 from its own queue and 4096 from derived2). Assuming a passthrough ratio of 1, derived3 produces 5120 records.
      When the project's topology is not linear, take all branches into account. The passthrough ratio may differ for data coming from different parent streams, so add up the data from all the input paths. Each stream has only one input queue, so its depth is fixed regardless of how many parent streams it is connected to; however, the mix of records in the queue may vary. Assume the entire queue is composed of the records that produce the highest amount of output. Some input streams may contain static data that is loaded once and never changes during normal operation; you do not need to count these inputs. In the example, derived2 is a join stream with static data as its second input.
    2. Calculate the space required by multiplying the number of records produced by each stream by that stream's average record size.
      For example, if the records in derived1 average 100 bytes; derived2, 200 bytes; and derived3, 150 bytes, the calculation is:

      (1024 * 100) + (4096 * 200) + (5120 * 150) = 1,689,600

      Trace the record counts through the entire project, from the source streams down to all the streams in the log store, and add up the data sizes of the streams located in the log store.
    3. Multiply the record count by 96 bytes to calculate the indexing overhead and add the result to the volume in bytes:

      (1024 + 4096 + 5120) * 96 = 983,040

      1,689,600 + 983,040 = 2,672,640

      Verify that this result is no larger than one quarter of the reserve size:

      uncheckpointedBytes < storeBytes * (reservePct / 4) / 100

      If the result is larger than one quarter of the reserve size, increase reservePct and repeat the store size calculation. Uncheckpointed data is mainly a concern for smaller stores; beyond this check, the overhead does not significantly affect the store size calculation, because the cleaning cycle removes uncheckpointed data and compacts the store. For a sketch of this estimate, see the second example following this procedure.
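
The following sketch pulls steps 2 and 3 together. The record count, record size, and reservePct values are placeholders chosen purely for illustration; substitute your own estimates.

    import math

    MB = 1024 * 1024

    # Placeholder estimates: 2,000,000 records averaging 300 bytes each,
    # with reservePct set to 40.
    record_count = 2_000_000
    record_volume = record_count * 300
    reserve_pct = 40

    # Step 2: add the basic indexing overhead of 96 bytes per record.
    data_bytes = record_volume + record_count * 96

    # Step 3: apply the reserve formula, then round up to the next megabyte.
    store_bytes = math.ceil(data_bytes * 100 / (100 - reserve_pct) / MB) * MB

    print(store_bytes)  # 1320157184 bytes (1259 MB)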
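
This second sketch estimates the uncheckpointed data for the example chain in step 4 (queue depth 1024, passthrough ratios of 1, 2, and 1, and average record sizes of 100, 200, and 150 bytes) and checks it against the reserve. The 64 MB store size and reservePct of 20 used in the check are assumptions made only for the comparison.

    QUEUE_DEPTH = 1024   # default stream input queue depth
    INDEX_OVERHEAD = 96  # indexing overhead per record, in bytes

    # (stream name, passthrough ratio, average record size in bytes) for the
    # chain source -> derived1 -> derived2 -> derived3 from the example above.
    chain = [
        ("derived1", 1, 100),
        ("derived2", 2, 200),
        ("derived3", 1, 150),
    ]

    uncheckpointed_bytes = 0
    from_upstream = 0  # records arriving from the upstream stream
    for name, passthrough, avg_size in chain:
        produced = (QUEUE_DEPTH + from_upstream) * passthrough
        uncheckpointed_bytes += produced * (avg_size + INDEX_OVERHEAD)
        from_upstream = produced

    print(uncheckpointed_bytes)  # 2672640 bytes, matching the example

    # Verify the result fits in one quarter of the reserve, assuming a
    # 64 MB store with reservePct = 20 (placeholder values).
    store_bytes = 64 * 1024 * 1024
    reserve_pct = 20
    assert uncheckpointed_bytes < store_bytes * reserve_pct / 4 / 100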
Related concepts
Data Backup
Data Restoration