Generate statistics on Spark created external table

Ryan Abbey 1,186 Reputation points
2023-02-21T01:03:23.9533333+00:00

We have a Spark (lake) external table based on a set of CSV files. The documentation talks about how statistics on a CSV file within Serverless need to be manually maintained; would something similar need to be done for a lake table? If so, how?

For a serverless external table, the documentation provides the syntax as

CREATE STATISTICS statistics_name
ON { external_table } ( column )
    WITH
        { FULLSCAN
          | [ SAMPLE number PERCENT ] }
        , { NORECOMPUTE }
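
For illustration, a concrete invocation of that syntax (with hypothetical table and column names) would look like:

CREATE STATISTICS stats_state
ON my_external_table (state_province)
    WITH FULLSCAN, NORECOMPUTE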

however, trying to use this within Spark gives

Error: no viable alternative at input 'CREATE STATISTICS'(line 2, pos 7)

== SQL ==
CREATE STATISTICS test
-------^^^

If we can create statistics on an external table within Spark, any ideas how? Pointers to documentation would suffice.

Thanks

PS Can't create these within Serverless either; that gives an error about the statistics operation not being allowed for this type of table.

Azure Synapse Analytics

Accepted answer
    AnnuKumari-MSFT 34,556 Reputation points Microsoft Employee Moderator
    2023-02-22T09:18:26.1566667+00:00

    Hi Ryan Abbey,

    Thank you for using the Microsoft Q&A platform and for posting your question here.

    As I understand your query, you want to know how to maintain statistics on Spark (lake) database tables, since the CREATE STATISTICS command fails in a lake database. Please let me know if that is not the ask.

    The CREATE STATISTICS command you mentioned is not valid Spark SQL, which is why the parser rejects it; use ANALYZE TABLE instead.

    In Spark, statistics for external tables in a data lake are computed with the ANALYZE TABLE command. It scans the data files in the table and computes table-level statistics such as row count and total size, and, when columns are listed, column-level statistics such as minimum and maximum values, null counts, and distinct counts. These statistics are then stored in the Spark metastore.

    To analyze a Spark external table, you can use the following syntax:

    
    ANALYZE TABLE table_name COMPUTE STATISTICS [FOR COLUMNS col1, col2, ... | FOR ALL COLUMNS]
    

    This computes column-level statistics for the columns you list. If you omit the FOR COLUMNS clause, Spark computes only table-level statistics (row count and size in bytes); to gather column statistics for every column, use FOR ALL COLUMNS.
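
    For example, assuming a lake table named sales_csv with columns order_date and amount (hypothetical names), the following computes both table-level and column-level statistics:

    ANALYZE TABLE sales_csv COMPUTE STATISTICS
    ANALYZE TABLE sales_csv COMPUTE STATISTICS FOR COLUMNS order_date, amount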

    You can also use the NOSCAN option to update statistics without scanning the data files; in that case Spark collects only the table's size in bytes, so row counts and column statistics are not gathered. Here's an example:

    
    ANALYZE TABLE table_name COMPUTE STATISTICS NOSCAN
    

    For more information, you can refer to the Spark SQL documentation on the ANALYZE TABLE command: https://spark.apache.org/docs/3.0.0-preview/sql-ref-syntax-aux-analyze-table.html
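
    To verify what was collected, you can inspect the metastore entries with DESCRIBE EXTENDED (again using the hypothetical table name from the example above):

    DESCRIBE EXTENDED sales_csv                 -- table-level stats appear in the Statistics row
    DESCRIBE EXTENDED sales_csv order_date      -- column-level stats: min, max, num_nulls, distinct_count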


    Hope it helps. Please consider clicking Accept Answer, as accepted answers help the community as well. Also, please click Yes on the 'Was the answer helpful' survey.

