data lake and database migration

azure_learner 615 Reputation points
2025-05-28T15:13:49.77+00:00

Hi friends, we are developing a data lake and have built it up to the bronze layer, and now I am in a dilemma. We have an on-premises Postgres database (a one-time load followed by incremental loads) whose data we need to move into the data lake.

Some colleagues are suggesting it would be better to move it into a Postgres instance running on a VM in Azure. I am fairly sure that, since we are building a data lake, it is better to move the Postgres data into the landing zone; we will be doing schema design in the data lakehouse later anyway.

Please advise on the best approach: which approach to go for in which use case, along with the pros and cons. I certainly do not want multiple data stores in Azure, such as both ADLS and Postgres.

Thank you in advance for your help.

Azure Data Lake Storage

Accepted answer
  Marcin Policht • 49,640 Reputation points • MVP • Volunteer Moderator
    2025-05-28T15:27:02.4+00:00

    My suggestion would be to land the data directly in the data lake (ADLS).

    1. Landing directly into the Data Lake (ADLS)

    Pros:

    • Simplifies architecture: only one storage system in Azure (ADLS)
    • Aligns with data lakehouse best practices: raw data in bronze, refined layers after
    • Cheaper storage costs than VM disk or Postgres managed storage
    • Scalability and durability of cloud native storage
    • Easier to track incremental loads via metadata or partitioning
    • Easier integration with downstream lakehouse engines (Databricks, Synapse, etc.)
    • No need to maintain a Postgres instance in the cloud

    Cons:

    • Initial data extraction can be complex if you have to build a pipeline to extract from on-prem Postgres to files (e.g., CSV, Parquet); a minimal sketch of such an extract follows this list
    • May need a custom CDC/incremental pipeline if you want near real-time updates
    • Requires more upfront pipeline development (e.g., using tools like Azure Data Factory, Databricks, or custom jobs)
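
    As a rough illustration of what that extraction pipeline involves outside of ADF/Databricks, below is a minimal Python sketch of a one-time bulk extract from on-prem Postgres to Parquet, landed in the ADLS bronze container. It assumes psycopg2, pandas, pyarrow, and azure-storage-file-datalake are available; every connection detail, table name, and path is a placeholder, not part of the original question.

    ```python
    # Minimal sketch (not a production pipeline): one-time bulk extract from
    # on-prem Postgres to Parquet, uploaded into the ADLS bronze container.
    # All connection details, table names, and paths below are placeholders.
    import psycopg2
    import pandas as pd
    from azure.storage.filedatalake import DataLakeServiceClient

    # Read the whole source table into a DataFrame (for very large tables,
    # chunk the read with pandas.read_sql(..., chunksize=...))
    conn = psycopg2.connect(host="onprem-postgres.internal", dbname="sales",
                            user="etl_user", password="***")
    df = pd.read_sql("SELECT * FROM public.orders", conn)
    conn.close()

    # Write a local Parquet file (pyarrow engine)
    df.to_parquet("orders_full.parquet", index=False)

    # Upload into the bronze zone of an ADLS Gen2 account (hierarchical namespace)
    service = DataLakeServiceClient(
        account_url="https://<storage-account>.dfs.core.windows.net",
        credential="<account-key-or-sas-token>",
    )
    file_client = service.get_file_system_client("bronze").get_file_client(
        "postgres/orders/full/orders_full.parquet"
    )
    with open("orders_full.parquet", "rb") as data:
        file_client.upload_data(data, overwrite=True)
    ```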

    Best Use-case:

    • When you want a scalable, cost-efficient raw data storage
    • When your incremental load can be managed via ETL pipelines and lakehouse metadata
    • When you want to avoid multiple managed DBs/storage in cloud
    • When your downstream workloads will use the data lake files directly
    2. Landing data into a Postgres instance on an Azure VM, then moving it to the data lake

    Pros:

    • Postgres handles ACID transactions and data consistency inherently
    • Easier to run complex queries, transformations, or CDC on the Postgres side before landing
    • Incremental logic can be easier if built in the DB (e.g., triggers, replication slots, logical decoding; see the sketch after this list)
    • Easier for teams familiar with SQL and Postgres tooling to validate data
    • Can leverage existing Postgres tools and ecosystem
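
    To make the replication-slot/logical-decoding point concrete, here is a minimal sketch of Postgres-side change capture. It assumes wal_level = logical, a user with the REPLICATION privilege, and the built-in test_decoding output plugin; the slot name and connection details are placeholders.

    ```python
    # Minimal sketch: Postgres-side change capture via a logical replication slot.
    # Assumptions: wal_level = logical, the user has REPLICATION privilege, and
    # the built-in test_decoding output plugin is used; names are placeholders.
    import psycopg2

    conn = psycopg2.connect(host="onprem-postgres.internal", dbname="sales",
                            user="etl_user", password="***")
    conn.autocommit = True
    cur = conn.cursor()

    # Create the slot once; it persists and accumulates changes until consumed
    cur.execute("SELECT * FROM pg_create_logical_replication_slot(%s, %s)",
                ("lake_cdc_slot", "test_decoding"))

    # On each pipeline run, consume the changes accumulated since the last run
    # (pg_logical_slot_get_changes advances the slot; peek_changes would not)
    cur.execute("SELECT lsn, xid, data FROM pg_logical_slot_get_changes(%s, NULL, NULL)",
                ("lake_cdc_slot",))
    for lsn, xid, change in cur.fetchall():
        # e.g. "table public.orders: INSERT: id[integer]:42 amount[numeric]:19.99"
        print(lsn, xid, change)

    cur.close()
    conn.close()
    ```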

    Cons:

    • Additional infrastructure cost and management overhead for Azure VM and Postgres
    • Data duplication and increased latency: data is stored twice (on the Postgres VM and in the data lake)
    • Potentially higher cost than cloud storage (disk + VM cost)
    • Adds complexity to your architecture
    • Moving data from the Postgres VM to the data lake still requires ETL pipelines
    • Could become a bottleneck if not sized properly

    Best Use-case:

    • When your source is a relational DB and you want to leverage DB capabilities for incremental loads or transformations
    • When you need low-latency access to data in Postgres for other applications
    • When you have heavy SQL workloads pre-processing data before it lands in the lake
    • When you want a staging area with transactional guarantees before landing

    Since your source is on-prem Postgres with a one-time bulk load plus incremental loads, and you are building a data lake with schema design in the lakehouse downstream, it's cleaner to extract data directly from on-prem Postgres to ADLS (landing zone / bronze layer). To accomplish this, you can use Azure Data Factory (ADF) or Databricks to orchestrate:

    • One-time bulk extract: dump data as Parquet/CSV into the bronze zone
    • Incremental load: CDC or timestamp-based incremental extracts from Postgres to landing-zone files (a minimal sketch follows this list)
    • This avoids maintaining an additional Postgres instance in Azure.
    • Later, in the silver/gold layers of your lakehouse, you can define schema, transform data, and create structured tables for consumption.
    • This approach also reduces operational overhead.
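
    If you choose timestamp-based increments rather than CDC, a single incremental run could look roughly like the sketch below. It assumes the source table has an updated_at column and that the last successful watermark is kept in a small control file in the bronze container; every account, table, column, and path name is a placeholder.

    ```python
    # Minimal sketch: timestamp-based incremental extract from Postgres into
    # date-partitioned Parquet files in the ADLS bronze zone.
    # Assumptions: the source table has an updated_at column and the last
    # watermark is kept in a small control file; all names are placeholders.
    from datetime import datetime, timezone
    import psycopg2
    import pandas as pd
    from azure.storage.filedatalake import DataLakeServiceClient

    service = DataLakeServiceClient(
        account_url="https://<storage-account>.dfs.core.windows.net",
        credential="<account-key-or-sas-token>",
    )
    bronze = service.get_file_system_client("bronze")

    # 1. Read the previous watermark (fall back to the epoch on the first run)
    wm_client = bronze.get_file_client("postgres/orders/_watermark.txt")
    try:
        last_watermark = wm_client.download_file().readall().decode()
    except Exception:
        last_watermark = "1970-01-01T00:00:00+00:00"

    # 2. Pull only the rows changed since the watermark
    conn = psycopg2.connect(host="onprem-postgres.internal", dbname="sales",
                            user="etl_user", password="***")
    df = pd.read_sql(
        "SELECT * FROM public.orders WHERE updated_at > %(wm)s ORDER BY updated_at",
        conn, params={"wm": last_watermark},
    )
    conn.close()

    # 3. Land the increment as a date-partitioned Parquet file in bronze
    if not df.empty:
        run_ts = datetime.now(timezone.utc)
        path = (f"postgres/orders/ingest_date={run_ts:%Y-%m-%d}/"
                f"orders_{run_ts:%H%M%S}.parquet")
        df.to_parquet("increment.parquet", index=False)
        with open("increment.parquet", "rb") as data:
            bronze.get_file_client(path).upload_data(data, overwrite=True)

        # 4. Advance the watermark only after the upload succeeded
        wm_client.upload_data(df["updated_at"].max().isoformat().encode(),
                              overwrite=True)
    ```

    In practice you would wrap the same steps in an ADF copy activity with a watermark variable, or in a Databricks notebook, but the logic stays the same.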

    If the above response helps answer your question, remember to "Accept Answer" so that others in the community facing similar issues can easily find the solution. Your contribution is highly appreciated.

    hth

    Marcin

