Generate statistics on Spark created external table

Ryan Abbey 1,186 Reputation points
2023-02-21T01:03:23.9533333+00:00

We have a Spark (lake) external table based on a set of CSV files. The documentation talks about how statistics on a CSV file within Serverless need to be manually maintained; would something similar need to be done for a lake table? If so, how?

For a serverless external table, the documentation provides the syntax as

CREATE STATISTICS statistics_name
ON { external_table } ( column )
    WITH
        { FULLSCAN
          | [ SAMPLE number PERCENT ] }
        , { NORECOMPUTE }
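
For illustration, a concrete invocation of that syntax (with hypothetical table and column names) would look like:

CREATE STATISTICS stats_state
ON my_external_table (state_province)
    WITH FULLSCAN, NORECOMPUTE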

however, trying to use this within Spark gives

Error: no viable alternative at input 'CREATE STATISTICS'(line 2, pos 7)

== SQL ==
CREATE STATISTICS test
-------^^^

If we can create statistics on an external table within Spark, any ideas how? Pointers to documentation would suffice.

Thanks

PS Can't create these within Serverless either; that gives an error about the statistics operation not being allowed for this type of table.

Azure Synapse Analytics

Accepted answer
    AnnuKumari-MSFT 34,556 Reputation points Microsoft Employee Moderator
    2023-02-22T09:18:26.1566667+00:00

    Hi Ryan Abbey,

    Thank you for using the Microsoft Q&A platform and for posting your question here.

    As I understand your query, you want to know how to maintain statistics on Spark (lake) database tables, since the CREATE STATISTICS command fails in a lake database. Please let me know if that is not the ask.

    The CREATE STATISTICS command you mentioned is not valid Spark SQL, which is why the parser rejects it; use ANALYZE TABLE instead.

    In Spark, statistics for external tables in a data lake are computed with the ANALYZE TABLE command. It scans the data files in the table and computes table-level statistics such as row count and total size, and, when columns are listed, column-level statistics such as minimum and maximum values, null counts, and distinct counts. These statistics are then stored in the Spark metastore.

    To analyze a Spark external table, you can use the following syntax:

    
    ANALYZE TABLE table_name COMPUTE STATISTICS [FOR COLUMNS col1, col2, ... | FOR ALL COLUMNS]
    

    This computes column-level statistics for the columns you list. If you omit the FOR COLUMNS clause, Spark computes only table-level statistics (row count and size in bytes); to gather column statistics for every column, use FOR ALL COLUMNS.
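
    For example, assuming a lake table named sales_csv with columns order_date and amount (hypothetical names), the following computes both table-level and column-level statistics:

    ANALYZE TABLE sales_csv COMPUTE STATISTICS
    ANALYZE TABLE sales_csv COMPUTE STATISTICS FOR COLUMNS order_date, amount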

    You can also use the NOSCAN option to update statistics without scanning the data files; in that case Spark collects only the table's size in bytes, so row counts and column statistics are not gathered. Here's an example:

    
    ANALYZE TABLE table_name COMPUTE STATISTICS NOSCAN
    

    For more information, you can refer to the Spark SQL documentation on the ANALYZE TABLE command: https://spark.apache.org/docs/3.0.0-preview/sql-ref-syntax-aux-analyze-table.html
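
    To verify what was collected, you can inspect the metastore entries with DESCRIBE EXTENDED (again using the hypothetical table name from the example above):

    DESCRIBE EXTENDED sales_csv                 -- table-level stats appear in the Statistics row
    DESCRIBE EXTENDED sales_csv order_date      -- column-level stats: min, max, num_nulls, distinct_count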


    Hope it helps. Please consider clicking Accept Answer, as accepted answers help the community as well. Also, please click Yes on the 'Was the answer helpful' survey.

