statistics question

Heisenberg 261 Reputation points
2022-04-22T22:53:58.95+00:00

hi folks,
A question related to statistics, in one of the post online i read that sample_pct plays big role in how accurately statistics are utilized. When i run following query in my environment, i get almost all the statistics sample_pct close to 1.0 , i dont see any sample_pct beyond 1.0. Is this true that this number should be close to 100? I'm using ola hellengren's script with the parameters mentioned below.

SELECT OBJECT_SCHEMA_NAME(st.object_id) + '.' + OBJECT_NAME(st.object_id) AS TableName
, col.name AS ColumnName
, st.name AS StatsName
, sp.last_updated
, sp.rows_sampled
, sp.rows
, (1.0*sp.rows_sampled)/(1.0*sp.rows) AS sample_pct
FROM sys.stats st
INNER JOIN sys.stats_columns st_col
ON st.object_id = st_col.object_id
AND st.stats_id = st_col.stats_id
INNER JOIN sys.columns col
ON st_col.object_id = col.object_id
AND st_col.column_id = col.column_id
CROSS APPLY sys.dm_db_stats_properties (st.object_id, st.stats_id) sp
ORDER BY 7

--
Ola hellengren sp.

EXECUTE dbo.IndexOptimize
@Databases = 'DB',
@FragmentationLow = NULL,
@FragmentationMedium = 'INDEX_REORGANIZE,INDEX_REBUILD_ONLINE,INDEX_REBUILD_OFFLINE',
@FragmentationHigh = 'INDEX_REBUILD_ONLINE,INDEX_REBUILD_OFFLINE', -- keep this offline
@FragmentationLevel1 = 5,
@FragmentationLevel2 = 30,
@UpdateStatistics = 'ALL', -- remove this
@OnlyModifiedStatistics = 'N',
@StatisticsSample=100,
@LogToTable ='Y'

SQL Server
SQL Server
A family of Microsoft relational database management and analysis systems for e-commerce, line-of-business, and data warehousing solutions.
13,322 questions
Transact-SQL
Transact-SQL
A Microsoft extension to the ANSI SQL language that includes procedural programming, local variables, and various support functions.
4,597 questions
{count} votes

2 answers

Sort by: Most helpful
  1. Erland Sommarskog 106.5K Reputation points
    2022-04-23T08:48:27.317+00:00

    To get a number close to 100, change

    , (1.0*sp.rows_sampled)/(1.0*sp.rows) AS sample_pct
    

    to

    , (100.0*sp.rows_sampled)/(1.0*sp.rows) AS sample_pct
    

    That is, you query computes fractions but calls the column pct which is misleading.

    As for how important it is to have fullscan statistics (as a sample rate of 100% is referred to), there is no discussion that sampled statistics can sometimes result in less optimal plans. This is particularly true if you query for a condition at the edges. To take a simple example, consider a query like SELECT * FROM tbl WHERE col IS NULL. The table has a sizable number of rows where col indeed is NULL, but when statistics were sampled - and for a big table that may actually be around 1% of the rows - no NULL were found, and the optimizer says "I guess 1 row!" when in fact there were 98234 rows. This can lead to a plan which is not good at all if you are joining to other table. (Since NULL values sort first, they are the edge, so to speak.)

    The problem with fullscan statistics is that they are costly to produce for big tables. If you are rebuilding the index, you get the fullscan for free. It can also be fairly cheap to get fullscan statistics for non-clustered indexes, since they are smaller than the base table. But if you say

    UPDATE STATISTICS WideTable WITH FULLSCAN

    And that table has hundreds of millions of columns, and lots non of non-indexed columns, this can take many hours as each non-indexed column requires a full scan of the table.

    1 person found this answer helpful.
    0 comments No comments

  2. Bert Zhou-msft 3,421 Reputation points
    2022-04-25T05:59:37.017+00:00

    Hi,@Heisenberg

    Welcome to Microsoft T-SQL Q&A Forum!

    When SQL Server creates or updates statistics and the sample rate is not manually specified , SQL Server calculates the default sample rate.
    But sometimes the default sampling rate may not accurately represent the data distribution , in this case you need to manually update the statistics, which is also the content of Ola hellengren's sp implementation.

    You just need to do this:

     (rows_sampled * 100)/(1.0*rows) AS sample_percent  
    FROM sys.stats ss  
    INNER JOIN sys.stats_columns sc   
    ON ss.stats_id = sc.stats_id AND ss.object_id = sc.object_id  
    INNER JOIN sys.all_columns ac   
    ON ac.column_id = sc.column_id AND ac.object_id = sc.object_id  
    CROSS APPLY sys.dm_db_stats_properties(ss.object_id, ss.stats_id) shr  
    

    For details, please refer to the official documentation , which has a detailed case description for this issue.

    Best regards,
    Bert Zhou


    If the answer is the right solution, please click "Accept Answer" and kindly upvote it. If you have extra questions about this answer, please click "Comment".
    Note: Please follow the steps in our Documentation to enable e-mail notifications if you want to receive the related email notification for this thread.

    0 comments No comments