Data Lake and Environments - Best Practice

HimanshuSinha-msft 601 Reputation points

Hello All,

Is it a best practice to have one big data lake for all environments (Dev, Stage, QA and Prod), or to have one data lake for Prod and another for Non-Prod, etc.?

If we choose to share a data lake across environments, then auditing will play a major role. It would really help if others could share their experience and guidance.


[Note: As we migrate from MSDN, this question has been posted by an Azure Cloud Engineer as a frequently asked question]

MSDN Source: DataLake and Environments - Best Practice

Azure Data Lake Storage
An Azure service that provides an enterprise-wide hyper-scale repository for big data analytic workloads and is integrated with Azure Blob Storage.

Accepted answer
  1. KranthiPakala-MSFT 46,442 Reputation points Microsoft Employee

    Welcome to the Microsoft Q&A (Preview) platform.

    Happy to answer your query.

    You may want to check out “FAQs about organizing a Data Lake”, which addresses your question.

    If I need a separate dev, test, prod environment, how would this usually be handled?

    Usually separate environments are handled with separate services. For instance, in Azure, that would be 3 separate Azure Data Lake Storage resources (which might be in the same subscription or different subscriptions).

    We wouldn’t usually separate out dev/test/prod with a folder structure in the same data lake. It can be done (just like you could use the same database with a different schema for dev/test/prod) but it’s not the typical recommended way of handling the separation. We prefer having the exact same folder structure across all 3 environments. If you must get by with it being within one data lake (one service), then the environment should be the top-level node.
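    To make the two layouts above concrete, here is a minimal Python sketch that builds ADLS Gen2 `abfss://` paths for each option. The account names, the `raw` file system and the `sales/2020` folder are hypothetical placeholders, not names from the FAQ; only the idea (same folder structure per environment, or environment as the top-level node) comes from the answer.

```python
# Sketch only: account names below are hypothetical placeholders.
ENVIRONMENTS = ("dev", "test", "prod")

# Option 1 (preferred): one storage account per environment.
ACCOUNTS = {
    "dev": "contosodatalakedev",
    "test": "contosodatalaketest",
    "prod": "contosodatalakeprod",
}


def _check_env(env: str) -> None:
    if env not in ENVIRONMENTS:
        raise ValueError(f"unknown environment: {env!r}")


def separate_account_url(env: str, filesystem: str, path: str) -> str:
    """Same folder structure everywhere; only the account changes."""
    _check_env(env)
    account = ACCOUNTS[env]
    return f"abfss://{filesystem}@{account}.dfs.core.windows.net/{path.strip('/')}"


def shared_account_url(env: str, filesystem: str, path: str,
                       account: str = "contosodatalake") -> str:
    """Option 2 (fallback): one account, environment as the top-level node."""
    _check_env(env)
    return (f"abfss://{filesystem}@{account}.dfs.core.windows.net/"
            f"{env}/{path.strip('/')}")


if __name__ == "__main__":
    for env in ENVIRONMENTS:
        print(separate_account_url(env, "raw", "sales/2020"))
        print(shared_account_url(env, "raw", "sales/2020"))
```

    With option 1 the path after the host is identical in every environment, so jobs can be promoted by changing only the account name. With option 2 every path carries the environment prefix, so access control and auditing have to be scoped per top-level folder.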

    Regarding monitoring in ADLS Gen2:

    Azure Data Lake Storage Gen2 provides metrics in the Azure portal under the Data Lake Storage Gen2 account and in Azure Monitor. Availability of Data Lake Storage Gen2 is displayed in the Azure portal. To get the most up-to-date availability of a Data Lake Storage Gen2 account, you must run your own synthetic tests to validate availability. Other metrics such as total storage utilization, read/write requests, and ingress/egress are available to be leveraged by monitoring applications and can also trigger alerts when thresholds (for example, Average latency or # of errors per minute) are exceeded.
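    As a rough illustration of the "run your own synthetic tests" idea, the sketch below times repeated calls to a probe and compares the results against thresholds. The `probe` callable, the function names and the threshold values are all assumptions made for this example; in practice the probe would be a small read or write against the storage account using whatever client you already have.

```python
import time
from typing import Callable, List, Optional, Tuple


def run_synthetic_test(probe: Callable[[], None],
                       attempts: int = 5) -> Tuple[float, Optional[float]]:
    """Call `probe` several times; return (availability, average latency in s).

    A probe call counts as a failure if it raises any exception.
    """
    if attempts <= 0:
        raise ValueError("attempts must be positive")
    latencies: List[float] = []
    for _ in range(attempts):
        start = time.perf_counter()
        try:
            probe()
        except Exception:
            continue  # failed attempt: no latency recorded
        latencies.append(time.perf_counter() - start)
    availability = len(latencies) / attempts
    avg_latency = sum(latencies) / len(latencies) if latencies else None
    return availability, avg_latency


def breached_thresholds(availability: float,
                        avg_latency: Optional[float],
                        min_availability: float = 0.99,
                        max_latency_s: float = 1.0) -> List[str]:
    """Return the names of the (hypothetical) thresholds that were exceeded."""
    alerts = []
    if availability < min_availability:
        alerts.append("availability")
    if avg_latency is not None and avg_latency > max_latency_s:
        alerts.append("latency")
    return alerts
```

    A scheduler could run this every few minutes per environment and raise an alert whenever `breached_thresholds` returns a non-empty list, alongside the built-in metrics mentioned above.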

    For more details, refer to “Best practices for using Azure Data Lake Storage Gen2”.

    Hope this helps. Do let us know if you have any further queries.
