How to optimize an ADX cluster with very high CPU usage?

Ori Bandel 30 Reputation points
2024-06-03T08:36:02.5766667+00:00

We're currently running 3 nodes in our ADX cluster, and they are under extreme load: very heavy calculations push all nodes to 100% CPU for long periods (30-60 minutes).

The calculation can, and will, be optimized, but that is not the issue here. The main question is: what is the best way (or ways) to configure the cluster to reduce the CPU load? This is critical because ADX is our analytics system, and when CPU sits at ~100% the dashboards fail, simple queries time out, and so on.

Another important factor: we're only using ~55-60% of our cache, and to speed up analytics we're considering increasing the hot cache period from 30 days to 60 days. We have no ingestion issues whatsoever. We're currently using 3 Standard_L8as_v3 nodes.
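For context, the cache change we have in mind is simply extending the hot cache policy at the database level, roughly along these lines (the database name is a placeholder; each command is run on its own):

    // Check the current hot cache policy (placeholder database name)
    .show database MyDatabase policy caching

    // Extend the hot cache window from 30 to 60 days
    .alter database MyDatabase policy caching hot = 60d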

We can approach this from a few angles:

  • Scaling Up (SKU) - a stronger or more suitable SKU
  • Scaling Out (Nodes) - more nodes or autoscaling
  • Other solutions?

From the research I've done, scaling up seems the more relevant option, but it would be great to hear what you think, and about your experience scaling your ADX cluster resources :)

Another critical question is whether there is any risk to data ingestion (I assume not), downtime, or any other data-related risk during the process, and how the two approaches differ in that respect.

Many thanks!

p.s.

We tried increasing the number of nodes to 6 for approximately 1.5-2 hours, but this had no visible effect. We're constantly optimizing our functions/syntax to be more efficient, but user volume is growing faster than our optimization rate, which is why we're considering more resources.
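To keep hunting for the heaviest queries, something along these lines can help (a rough sketch, not our exact diagnostic; the TotalCpu and MemoryPeak columns come from the .show queries output and are worth verifying on your cluster version):

    // List the ten heaviest recently completed queries by CPU time
    .show queries
    | where StartedOn > ago(1h) and State == "Completed"
    | top 10 by TotalCpu desc
    | project StartedOn, Duration, TotalCpu, MemoryPeak, Text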

Azure Data Explorer

Accepted answer
  1. PRADEEPCHEEKATLA-MSFT 90,221 Reputation points Microsoft Employee
    2024-06-04T05:13:19.3433333+00:00

    @Ori Bandel - Thanks for the question and using MS Q&A platform.

    Based on the information you provided, it seems that you are experiencing high CPU usage on your ADX cluster due to heavy calculations. You are considering scaling up (SKU) or scaling out (nodes) to optimize your cluster and reduce the CPU workload.

    Scaling up to a stronger or more suitable SKU gives each node more CPU and memory, which tends to help most when individual queries or calculations are heavy. However, it may not be the most cost-effective solution. Scaling out by adding more nodes or enabling autoscale helps distribute the workload, especially under high query concurrency, and can also reduce CPU usage.

    In terms of data ingestion, there should not be any risk as long as you follow the best practices for scaling your cluster. It is important to monitor your cluster and adjust the resources as needed to ensure optimal performance.

    Before scaling up or out, you may want to consider optimizing your calculations to reduce the workload on your cluster. Additionally, you mentioned that you are using 55%-60% of your cache and are considering increasing the hot cache period from 30 to 60 days. This can also help improve the performance of your analytics system.

    Overall, it is important to find the right balance between performance and cost when optimizing your ADX cluster. You may want to experiment with different configurations and monitor the performance to determine the best solution for your specific needs.

    I hope this helps! Let me know if you have any other questions.

    1 person found this answer helpful.

2 additional answers

  1. Ori Bandel 30 Reputation points
    2024-06-26T06:40:11.2666667+00:00

    Following up on the solution:

    • We decided to scale UP - based on the documentation and on the answer above, scaling up seemed the more suitable solution for high CPU load
    • It turned out quite well - we moved one SKU size up, which doubled our CPU power
    • The >99% CPU periods we experienced were reduced by tens of percent (to less than half of what they were), which is a great outcome for our data infrastructure
    • Clearly the price is not cheap, but it has proven worthwhile: we experience fewer failures and better data stability
    • This can't be our final solution; as our user base grows we will explore further options - other scaling approaches, code improvements, better integrations, etc.

    I marked the relevant answer as 'accepted answer' :)

    Reader of the future - if you read this and want to consult, please do! You can reach out here or via the usual professional channels :)

    Thanks!

    1 person found this answer helpful.

  2. Ori Bandel 30 Reputation points
    2024-08-02T20:08:38.02+00:00

    Adding:

    A few weeks ago we also scaled OUT and increased the number of nodes. This seems to have had only a minor impact on overall CPU usage. It did improve the overall experience of the ADX users with dashboards/queries, but nothing I can attach a metric or number to.

    IMPORTANT -- the major change that almost completely eliminated the 100% CPU periods was restructuring and optimizing the summary tables we build via ETLs inside the cluster (a process that creates summary tables, for example per time period, out of the raw tables).

    When we restructured that process to run much less frequently (AND only over the relevant data), it made a huge difference.

    One lesson here is that ADX is NOT optimized for building complex summary tables via a straight .set-or-replace (or a similar command); if you do use it, it is much better to wrap it in smarter logic, for example appending only new data instead of rebuilding everything each run.
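    For readers exploring alternatives: a materialized view lets the cluster maintain an aggregation incrementally instead of having a job recompute the whole summary. To be clear, this is not what we did (we restructured our ETL logic); it is just a sketch with illustrative table and column names:

        // Hourly summary maintained incrementally by the cluster (illustrative names)
        .create async materialized-view with (backfill=true) HourlySummary on table RawEvents
        {
            RawEvents
            | summarize EventCount = count(), AvgValue = avg(Value) by DeviceId, bin(Timestamp, 1h)
        }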

    1 person found this answer helpful.
