Using sampling for update statistics

The optimizer for Adaptive Server uses the statistics on a database to set up and optimize queries. The statistics must be as current as possible to generate optimal results.

Run the update statistics commands against data sets, such as tables, to update information about the distribution of key values in specified indexes or columns, for all columns in an index, or for all columns in a table. The commands revise histograms and density values for column-level statistics. The results are then used by the optimizer to calculate the best way to set up a query plan.

update statistics requires table scans or leaf-level scans of indexes, may increase I/O contention, may use the CPU to perform sorts, and uses data and procedure caches. Use of these resources can adversely affect queries running on the server if you run update statistics when usage is high. In addition, some update statistics commands require shared locks, which can block updates.

To reduce I/O contention and resources, run update statistics using a sampling method, which can reduce the I/O and time when your maintenance window is small and the data set is large. If you are updating a large data set or table that is in constant use, being truncated and repopulated, you may want to do a statistical sampling to reduce the time and the size of the I/O.

You must use caution with sampling since the results are not fully accurate. Balance changes to histogram values against the savings in I/O.

Although a sampling of the data set may not be completely accurate, usually the histograms and density values are reasonable within an acceptable range.

When you are deciding whether or not to use sampling, consider the size of the data set, the time constraints you are working with, and if the histogram produced is as accurate as needed.

The percentage to use when sampling depends on your needs. Test various percentages until you receive a result that reflects the most accurate information on a particular data set.

Statistics are stored in the system tables systabstats and sysstatistics.

To update statistics, use:

update statistics table_name 
    [ [index_name] | [( column_list ) ] ]
    [using step values]
    [with consumers = consumers ] [, sampling = percent]

update index statistics table_name [index_name] 
    [using step values]
    [with consumers = consumers] [, sampling = percent

update all statistics table_name [index_name] [using step values]
[with consumers = consumers] [, sampling = percent]

Where:

table_name – generates statistics for the leading column in each index on the table.
table_name [index_name] – generates statistics for all leading columns on the specified index.
table_name (column_list) – generates statistics for only this column.
table_name (column_name, column_name...) – generates a histogram for the first column in the list, and multi-column density values for the prefix subsets.
sampling = percent – the numeric value of the sampling percentage, such as 05 for 5%, 10 for 10%, and so on. The sampling integer is between zero (0) and one hundred (100).

Example:

update statistics authors(auth_id) with sampling = 5 percent

The serverwide sampling percent can be set using the configuration paramater:

sp_configure ‘sampling percent’, 5

This command sets a serverwide sampling of 5% for update statistics that allows you to do the update statistics without the sampling syntax. The percentage can be anywhere from zero (0) to one hundred (100) percent.