Histograms for columns with highly duplicated values

Histograms for columns with highly duplicated values look very different from histograms for columns with a large number of discrete values. In histograms for columns with highly duplicated values, a single cell, called a frequency cell, represents the duplicated value.

The weight of the frequency cell shows the percentage of columns that have matching values.

Histogram output for frequency cells varies, depending on whether the column values represent one of the following:

A dense frequency count, where values in the column are contiguous in the domain. For example, 1, 2, 3 are contiguous integer values.
A sparse frequency count, where the domain of possible values contains values not represented by the discrete set of values in the table.
A mix of dense and sparse frequency counts.

Histogram output for some columns includes a mix of frequency cells and range cells.

Histograms for dense frequency counts

The following output shows the histogram for a column that has 6 distinct integer values, 1 – 6, and some null values:

Step     Weight                Value

   1     0.13043478     <=     1
   2     0.04347826     <=     1
   3     0.17391305     <=     2
   4     0.30434781     <=     3
   5     0.13043478     <=     4
   6     0.17391305     <=     5
   7     0.04347826      <=     6

The histogram above shows a dense frequency count, because all the values for the column are contiguous integers.

The first cell represents null values (described in section below). Since there are null values, the weight for this cell represents the percentage of null values in the column.

The “Value” column for the first step displays the minimum column value in the table and the < operator.

Histograms for sparse frequency counts

In a histogram representing a column with a sparse frequency count, the highly duplicated values are represented by a step showing the discrete values with the = operator and the weight for the cell.

Preceding each step, there is a step with a weight of 0.0, the same value, and the < operator, indicating that there are no rows in the table with intervening values. For columns with null values, the first step has a nonzero weight if there are null values in the table.

The following histogram represents the type column of the titles table. Since there are only 9 distinct types, they are represented by 18 steps.

Step      Weight                     Value
 
    1     0.00000000        <        "UNDECIDED   "
    2     0.11500000        =        "UNDECIDED   "    
    3     0.00000000        <        "adventure   "
    4     0.11000000        =        "adventure   "
    5     0.00000000        <       "business    "
    6     0.11040000        =       "business    "
    7     0.00000000        <       "computer    "
    8     0.11640000        =       "computer    "
    9     0.00000000        <       "cooking     "
   10     0.11080000        =       "cooking     "
   11     0.00000000        <       "news        "
   12     0.10660000        =       "news        "
   13     0.00000000        <       "psychology  "
   14     0.11180000        =       "psychology  "
   15     0.00000000        <       "romance     "
   16     0.10800000        =       "romance     "
   17     0.00000000        <       "travel      "
   18     0.11100000         =       "travel      "

For example, 10.66% of the values in the type column are “news,” so for a table with 5000 rows, the optimizer estimates that 533 rows will be returned.

Histograms for columns with sparse and dense values

For tables with some values that are highly duplicated, and others that have distributed values, the histogram output shows a combination of operators and a mix of frequency cells and range cells.

The column represented in the histogram below has a value of 30.0 for a large percentage of rows, a value of 50.0 for a large percentage of rows, and a value 100.0 for another large percentage of rows.

There are two steps in the histogram for each of these values: one step representing the highly duplicated value has the = operator and a weight showing the percentage of columns that match the value. The other step for each highly duplicated value has the < operator and a weight of 0.0. The datatype for this column is numeric(5,1).

Step      Weight                Value
    1     0.00000000     <=     0.9
    2     0.04456094     <=     20.0
    3     0.00000000     <      30.0
    4     0.29488859     =      30.0
    5     0.05996068     <=     37.0
    6     0.04292267     <=     49.0
    7     0.00000000     <      50.0
    8     0.19659241     =      50.0
    9     0.06028834     <=     75.0
   10     0.05570118     <=     95.0
   11     0.01572739     <=     99.0
   12     0.00000000     <     100.0
   13     0.22935779      =     100.0

Since the lowest value in the column is 1.0, the step for the null values is represented by 0.9.