Should I worry about a high number of extents?

Question

Hello ADX experts.

I am currently doing a POC to see if ADX will work for us. So far I enjoy the ADX developer experience.

Since I haven't found a clear recommendation for how to partition my table, I have created two tables where I ingest identical data. Then I do query tests on both tables.

Since this approach requires more resources and effort, I would like to conclude which partitioning settings to use (and drop one of the tables) before doing further tests.

Here is the table schema:

.create table [table] (
    MetricId: int,
    Timestamp: datetime,
    Value: real
)

Table A has two partitioning keys: MetricId (Hash, Uniform) and Timestamp Table B has one partitioning key: Timestamp Table A must define MetricId as string since a Hash key only accepts string and guid.

The query performance is noticeably better on Table A for the simple tests I have done so far with the current data. This is probably because all queries have the following filter:

| where MetricId in (...)

This will also be the case for future queries.

Here are the extent stats for both tables:

TableName  Rows            Extents       OriginalSize  CompressedSize  IndexSize  ExtentSize
Table_A    14,813,100,000  432,054       1.28 TB       48.37 GB          1.63 GB  50.00 GB
Table_B    14,813,100,000   13,299       1.30 TB       54.20 GB        256.25 MB  54.45 GB

As you can see, the number of extents and index size are considerably higher for Table A, but the total size is smaller.

Although it seems that Table A wins with regards to both query performance and storage size, there are also some cons with Table A:

MetricId must be string and has to be converted to/from int in application code
Range filters like between can't be used on strings in queries. This is however something I can live with.
Some control commands, e.g. .show table [table] extents takes about 30x longer than for Table B. This is probably due to the high number of extents.

So finally, here are my questions:

Are the high number of extents and large index size for Table A something to worry about? In this post we can see that a warning was given when the number of extents exceeded 500K.
The test tables currently contain 500 unique MetricIds with 30M unique timestamps each. If we decide to use this in production, the number of metrics will be in the range of 200K - 1M, but the number of timestamps per metric will be the same. I guess this will change the extent stats quite a bit. Based on this, is it possible to foresee whether we should go for Table A or Table B?
Why can't a hash partition key be a numeric type (int)?

Any feedback is greatly appreciated.

Accepted Answer

@Pål Kristian Halle

Thanks for using MS Q&A platform and posting your query.

I'll address your questions one by one:

Should you worry about a high number of extents?

In general, a high number of extents can lead to: * Increased storage overhead due to additional metadata and index storage. * Slower performance for certain operations, like .show table [table] extents, as you've observed. * Potential issues with query performance, especially if the extent count grows exponentially.

However, in your case, the query performance is better on Table A, which has a higher extent count. This is likely due to the efficient filtering on MetricId using the hash partition key.

The warning threshold of 500K extents mentioned in the post you linked is not a hard limit. It's a guideline to encourage optimization and monitoring of extent growth. If your extent count continues to grow, it's essential to monitor performance and adjust your partitioning strategy as needed.

How will the extent stats change with more metrics and timestamps?

With more metrics (200K - 1M) and the same number of timestamps per metric (30M), the extent count will likely increase. However, the growth rate will depend on the distribution of data across the metrics.

If the data is evenly distributed across metrics, the extent count might not grow exponentially. On the other hand, if some metrics have a significantly larger number of timestamps, the extent count could increase more rapidly.

To mitigate this, you can consider: * Using a more granular partition key, like a combination of MetricId and a timestamp bucket (e.g., MetricId + Timestamp truncated to the hour or day level). * Implementing a data retention policy to remove older data and maintain a stable extent count.

Why can't a hash partition key be a numeric type (int)?

Hash partition keys in ADX are designed to work with string or GUID types. This is because the hash function used to distribute data across partitions is optimized for these data types.

Using a numeric type like int as a hash partition key could lead to uneven data distribution, as the hash function might not be able to effectively spread the data across partitions. This could result in poor query performance and increased storage overhead.

If you need to use a numeric type as a partition key, consider using a range-based partition key (like Timestamp) or a combination of columns (like MetricId + Timestamp).

In conclusion, while a high extent count can be a concern, it's not necessarily a deal-breaker in your case, given the better query performance on Table A. Monitor your extent growth and adjust your partitioning strategy as needed to ensure optimal performance and storage efficiency.

Hope this helps. Do let us know if you any further queries.

If this answers your query, do click Accept Answer and Yes for was this answer helpful. And, if you have any further query do let us know.

Share via

Should I worry about a high number of extents?

0 additional answers

Your answer