My suggestion would be to land the data directly in the data lake (ADLS). Here is a comparison of the two approaches:
- Landing directly into the Data Lake (ADLS)
Pros:
- Simplifies architecture: only one storage system in Azure (ADLS)
- Aligns with data lakehouse best practices: raw data in bronze, refined layers after
- Lower storage costs than VM disks or managed Postgres storage
- Scalability and durability of cloud-native storage
- Easier to track incremental loads via metadata or partitioning
- Easier integration with downstream lakehouse engines (Databricks, Synapse, etc.)
- No need to maintain a Postgres instance in the cloud
Cons:
- Initial data extraction can be complex: you have to build a pipeline that extracts from on-prem Postgres into files (e.g., CSV, Parquet)
- May need a custom CDC/incremental pipeline if you want near-real-time updates (see the sketch after this list)
- Requires more upfront pipeline development (e.g., using Azure Data Factory, Databricks, or custom jobs)
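Below is a minimal sketch of what such an extraction job could look like, assuming a hypothetical `orders` table with an `updated_at` column; the host name, credentials, and output path are placeholders, and a separate step (azcopy, ADF, or the Azure SDK) would upload the file to the ADLS bronze zone.

```python
# Minimal sketch: timestamp-based incremental extract from on-prem Postgres
# into a Parquet file. Table, column, and connection details below are
# placeholder assumptions, not taken from your environment.
from datetime import datetime, timezone

import pandas as pd
import psycopg2

# Watermark of the last successful load, typically read from a control table or file.
WATERMARK = "2024-01-01T00:00:00+00:00"

conn = psycopg2.connect(host="onprem-pg.example.local", dbname="sales",
                        user="etl_user", password="***")

query = """
    SELECT *
    FROM orders
    WHERE updated_at > %(watermark)s
    ORDER BY updated_at
"""
df = pd.read_sql(query, conn, params={"watermark": WATERMARK})
conn.close()

# One Parquet file per run; an upload step then lands it in the bronze zone.
run_ts = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
df.to_parquet(f"orders_incremental_{run_ts}.parquet", index=False)
```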
Best use cases:
- When you want scalable, cost-efficient storage for raw data
- When your incremental load can be managed via ETL pipelines and lakehouse metadata
- When you want to avoid multiple managed DBs/storage in cloud
- When your downstream workloads will use the data lake files directly
- Landing data into a Postgres instance on Azure VM, then moving to the data lake
Pros:
- Postgres handles ACID transactions and data consistency inherently
- Easier to run complex queries, transformations, or CDC on the Postgres side before landing
- Incremental logic can be easier when built in the database (e.g., triggers, replication slots, logical decoding; see the sketch after this list)
- Easier for teams familiar with SQL and Postgres tooling to validate data
- Can leverage existing Postgres tools and ecosystem
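If you went this route, logical decoding is the usual way to capture incremental changes on the Postgres side. A minimal psycopg2 sketch is below; it assumes `wal_level=logical`, the `wal2json` output plugin installed on the server, and a replication-capable user (slot and connection names are placeholders).

```python
# Minimal sketch: consuming Postgres logical-decoding changes with psycopg2.
# All names (host, db, user, slot) are placeholder assumptions.
import psycopg2
import psycopg2.extras

conn = psycopg2.connect(
    "host=onprem-pg.example.local dbname=sales user=repl_user password=***",
    connection_factory=psycopg2.extras.LogicalReplicationConnection,
)
cur = conn.cursor()

# Create the slot once; skip this on subsequent runs.
cur.create_replication_slot("lake_cdc_slot", output_plugin="wal2json")
cur.start_replication(slot_name="lake_cdc_slot", decode=True)

def handle_change(msg):
    # msg.payload is a JSON string describing inserts/updates/deletes;
    # a real pipeline would buffer these and write them out as files.
    print(msg.payload)
    msg.cursor.send_feedback(flush_lsn=msg.data_start)

cur.consume_stream(handle_change)
```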
Cons:
- Additional infrastructure cost and management overhead for Azure VM and Postgres
- Data duplication and increased latency: data is stored twice (on the Postgres VM and in the data lake)
- Potentially higher cost than cloud storage alone (VM plus disk costs)
- Adds complexity to your architecture
- Moving data from Postgres VM to data lake still requires ETL pipelines
- Could become a bottleneck if not sized properly
Best use cases:
- When your source is a relational DB and you want to leverage DB capabilities for incremental loads or transformations
- When you need low-latency access to data in Postgres for other applications
- When you have heavy SQL workloads that pre-process data before it lands in the lake
- When you want a staging area with transactional guarantees before landing
Since your source is on-prem Postgres with a one-time bulk load plus incremental loads, and you are building a data lake with schema design in the lakehouse downstream, it's cleaner to extract data directly from on-prem Postgres to ADLS (landing zone / bronze layer). To accomplish this, you can use Azure Data Factory (ADF) or Databricks to orchestrate:
- One-time bulk extract: dump data as Parquet/CSV into the bronze zone
- Incremental load: CDC or timestamp-based incremental extracts from Postgres to landing-zone files (a minimal Databricks sketch follows below)
- This avoids maintaining an additional Postgres instance in Azure.
- Later, in the silver/gold layers of your lakehouse, you can define schema, transform data, and create structured tables for consumption.
- This approach also reduces operational overhead.
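As a rough illustration, an incremental Databricks job could look like the sketch below. The storage account, container, table, watermark column, and secret scope names are placeholder assumptions, and the cluster needs network connectivity to the on-prem database (VPN/ExpressRoute, or a self-hosted integration runtime if you use ADF instead).

```python
# Minimal Databricks notebook sketch: incremental JDBC pull from on-prem
# Postgres into the ADLS bronze zone. All names are placeholder assumptions.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Watermark of the last successful load, normally read from a control table.
last_watermark = "2024-01-01 00:00:00"

incremental_query = f"""
    (SELECT * FROM orders
     WHERE updated_at > '{last_watermark}') AS incr
"""

df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://onprem-pg.example.local:5432/sales")
    .option("dbtable", incremental_query)
    .option("user", "etl_user")
    .option("password", dbutils.secrets.get("etl-scope", "pg-password"))
    .option("driver", "org.postgresql.Driver")
    .load()
)

# Land raw data in the bronze layer, partitioned by load date so incremental
# batches are easy to track and reprocess.
(
    df.withColumn("load_date", F.current_date())
    .write.mode("append")
    .partitionBy("load_date")
    .parquet("abfss://bronze@yourlakeaccount.dfs.core.windows.net/postgres/orders/")
)
```

The same pattern works for the one-time bulk load: drop the WHERE clause (or point `dbtable` at the full table) and run it once before switching to incremental extracts.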
If the above response helps answer your question, remember to "Accept Answer" so that others in the community facing similar issues can easily find the solution. Your contribution is highly appreciated.
hth
Marcin