Azure Data Lake Store Gen2 performance considerations (IOPS) and single/multiple accounts

Question

I've seen data lake implementations within Azure take two routes: centralising around a single storage account, or distributing a logical data lake across multiple accounts (e.g. based on layers/zones within the lake).

Such a decision needs to be made rather early on, and reversing it is going to be non-trivial and costly. However, I can find very little information to help evaluate the correct approach. Even the guidance here leaves room for interpretation, and any retroactive change to the structure is going to be painful, especially in a centralised model (i.e. not a mesh). The series of articles does list some criteria for why/when to split into multiple accounts, but I did not pick up performance among them.

Furthermore, the document here mentions "high levels of IOPS", but the numbers here do not single out ADLS specifically. Is ADLS therefore somehow unique in this respect? And is there a good reason, in practice, to split the lake among multiple accounts based solely on (even if premature) performance optimisation?

To reiterate, my questions are:

Is it an anti-pattern to concentrate any data lake implementation around a single storage account (even with hierarchical namespaces enabled)? Is this going to present any practical risk concerning performance in reality - and if so, when?
Is there any publicly available information regarding the IOPS expectations for Azure Data Lake Store Gen2? Or do the numbers published for regular storage accounts still apply, making the claim regarding "high levels of IOPS" (in the context of ADLS, specifically) misleading? If so, there seems to be very little practical benefit from ADLS apart from POSIX support and perhaps the efficiency of some directory-level operations.

Accepted Answer

@Veli-Jussi Raitila This decision is based purely on your own requirements and objectives: separation due to different business units, different geo-political boundaries, or different governance requirements for different zones. However, there is a management overhead and a minor loss of discoverability associated with such a federated approach, so we also see customers going for a single, centralized model. This is especially true if the customer has a long history with Hadoop/HDFS on-premises.

Here is some more guidance on the matter, albeit still not being categorical about when one approach is superior to the other: https://azure.github.io/Storage/docs/analytics/hitchhikers-guide-to-the-datalake/#do-i-want-a-centralized-or-a-federated-data-lake-implementation
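To make the two layouts concrete, here is a minimal sketch of how the same zones might be addressed in each model. The `abfss://<container>@<account>.dfs.core.windows.net/<path>` URI shape is the standard ADLS Gen2 form; the zone names, the account names and the `data` container are purely hypothetical, and one container per zone is only one of several possible ways to organise a centralized lake.

```python
from typing import Dict

# Hypothetical zone names, for illustration only.
ZONES = ["raw", "enriched", "curated"]


def abfss_uri(account: str, container: str, path: str = "") -> str:
    """Build an ABFS URI for a location in an ADLS Gen2 account."""
    return f"abfss://{container}@{account}.dfs.core.windows.net/{path.lstrip('/')}"


def centralized_layout(account: str) -> Dict[str, str]:
    """One storage account; each zone is a container inside it (assumed layout)."""
    return {zone: abfss_uri(account, zone) for zone in ZONES}


def federated_layout(prefix: str) -> Dict[str, str]:
    """One storage account per zone, each with a single 'data' container (assumed layout)."""
    return {zone: abfss_uri(f"{prefix}{zone}", "data") for zone in ZONES}


if __name__ == "__main__":
    for zone, uri in centralized_layout("contosolake").items():
        print("centralized", zone, uri)
    for zone, uri in federated_layout("contoso").items():
        print("federated  ", zone, uri)
```

In the centralized case every zone shares the limits of one account; in the federated case each zone has its own account, at the cost of more accounts to manage and discover.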

ADLS accounts are limited in the same manner as normal storage accounts. However, we have a high degree of flexibility in where those limits are set, and they generally reflect the customer’s desire, as it impacts our capacity planning. However, for very large data lake installations (e.g. > 50 PB) or data lakes that are subject to a very specific IO pattern that unavoidably leads to large IOPS loads (generally outside the range that analytics frameworks such as Spark and Hadoop are capable of generating; think > 100,000 IOPS),
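The two rough figures mentioned above can be turned into a quick sanity check on your own estimates. This is only a sketch: the thresholds are the rules of thumb from this answer, not published service limits, and the `Workload` type and `in_exceptional_range` function are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass

# Rules of thumb quoted in the answer above; NOT official service limits.
LARGE_LAKE_PB = 50
HIGH_IOPS = 100_000


@dataclass
class Workload:
    size_pb: float    # estimated total lake size, in petabytes
    peak_iops: float  # estimated peak IO operations per second


def in_exceptional_range(w: Workload) -> bool:
    """True if the estimate exceeds either rough threshold from the answer."""
    return w.size_pb > LARGE_LAKE_PB or w.peak_iops > HIGH_IOPS


if __name__ == "__main__":
    typical = Workload(size_pb=2, peak_iops=20_000)
    print(in_exceptional_range(typical))
```

A workload that stays under both figures falls within the range the answer treats as ordinary for analytics frameworks.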

Please let us know if you have any further queries. I’m happy to assist you further.

