External tables: Why create statistics?

Anuw Malik 1 Reputation point
2022-08-23T08:57:26.147+00:00

The documentation for Polybase says that creating statistics for External Tables will help getting better execution plans. This question is about when the External Data source is another SQL Server.

I guess the better execution plan is on the External Data Source, i.e. for pushdown computations (for where clauses, joins and aggregates).

To get a good execution plan the optimizer also has to know what indexes are on the table. Can the optimizer use the statistics on the remote server (the External Data Source)?

If yes, what are the benefits of creating the statistics locally on the External Table? If the optimizer has too look at the indexes on the remote server, then it might as well look at the statistics...

Please provide references in the answer.

SQL Server
SQL Server
A family of Microsoft relational database management and analysis systems for e-commerce, line-of-business, and data warehousing solutions.
13,361 questions
0 comments No comments
{count} votes

1 answer

Sort by: Most helpful
  1. YufeiShao-msft 7,091 Reputation points
    2022-08-24T08:07:28.273+00:00

    Hi @Anuw Malik

    First of all, it is clear that the function of PolyBase is to join the data from external sources to relational tables in an instance of SQL Server, and the statistics allow the SQL Server query optimizer to make the best decision possible on how to execute a query, as far as the creation of statistics itself is concerned, it is beneficial.

    But the doc on CREATE STATISTICS, it also said

    Statistics for external tables
    When creating external table statistics, SQL Server imports the external table into a temporary SQL Server table, and then creates the statistics. For samples statistics, only the sampled rows are imported. If you have a large external table, it will be much faster to use the default sampling instead of the full scan option.

    And the limitations:

    Updating statistics is not supported on external tables. To update statistics on an external table, drop and re-create the statistics.

    If you have enough permissions, the remote optimizer has full access to the statistics on that server, but it may require you to have the plan guides and maintenance procedures to get what you want

    But in any case, statistics is useful, but there are more troubles with external data sources than with local

    -------------

    If the answer is the right solution, please click "Accept Answer" and kindly upvote it. If you have extra questions about this answer, please click "Comment".

    Note: Please follow the steps in our documentation to enable e-mail notifications if you want to receive the related email notification for this thread.