My suggestion would be to land the data directly in the data lake (ADLS). Here is a comparison of the two approaches:
- Landing directly into the Data Lake (ADLS)
Pros:
- Simplifies architecture: only one storage system in Azure (ADLS)
- Aligns with data lakehouse best practices: raw data in bronze, refined layers after
- Lower storage costs than VM disks or managed Postgres storage
- Scalability and durability of cloud-native storage
- Easier to track incremental loads via metadata or partitioning
- Easier integration with downstream lakehouse engines (Databricks, Synapse, etc.)
- No need to maintain a Postgres instance in the cloud
Cons:
- Initial data extraction can be complex: you have to build a pipeline that extracts from on-prem Postgres into files (e.g., CSV, Parquet)
- May need a custom CDC/incremental pipeline if you want near-real-time updates (see the sketch after this list)
- Requires more upfront pipeline development (e.g., using Azure Data Factory, Databricks, or custom jobs)
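Below is a minimal sketch of what such an extraction job could look like, assuming a hypothetical `orders` table with an `updated_at` column; the host name, credentials, and output path are placeholders, and a separate step (azcopy, ADF, or the Azure SDK) would upload the file to the ADLS bronze zone.

```python
# Minimal sketch: timestamp-based incremental extract from on-prem Postgres
# into a Parquet file. Table, column, and connection details below are
# placeholder assumptions, not taken from your environment.
from datetime import datetime, timezone

import pandas as pd
import psycopg2

# Watermark of the last successful load, typically read from a control table or file.
WATERMARK = "2024-01-01T00:00:00+00:00"

conn = psycopg2.connect(host="onprem-pg.example.local", dbname="sales",
                        user="etl_user", password="***")

query = """
    SELECT *
    FROM orders
    WHERE updated_at > %(watermark)s
    ORDER BY updated_at
"""
df = pd.read_sql(query, conn, params={"watermark": WATERMARK})
conn.close()

# One Parquet file per run; an upload step then lands it in the bronze zone.
run_ts = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
df.to_parquet(f"orders_incremental_{run_ts}.parquet", index=False)
```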
Best use cases:
- When you want scalable, cost-efficient storage for raw data
- When your incremental load can be managed via ETL pipelines and lakehouse metadata
- When you want to avoid multiple managed DBs/storage in cloud
- When your downstream workloads will use the data lake files directly
- Landing data into a Postgres instance on Azure VM, then moving to the data lake
Pros:
- Postgres handles ACID transactions and data consistency inherently
- Easier to run complex queries, transformations, or CDC on the Postgres side before landing
- Incremental logic can be easier when built in the database (e.g., triggers, replication slots, logical decoding; see the sketch after this list)
- Easier for teams familiar with SQL and Postgres tooling to validate data
- Can leverage existing Postgres tools and ecosystem
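If you went this route, logical decoding is the usual way to capture incremental changes on the Postgres side. A minimal psycopg2 sketch is below; it assumes `wal_level=logical`, the `wal2json` output plugin installed on the server, and a replication-capable user (slot and connection names are placeholders).

```python
# Minimal sketch: consuming Postgres logical-decoding changes with psycopg2.
# All names (host, db, user, slot) are placeholder assumptions.
import psycopg2
import psycopg2.extras

conn = psycopg2.connect(
    "host=onprem-pg.example.local dbname=sales user=repl_user password=***",
    connection_factory=psycopg2.extras.LogicalReplicationConnection,
)
cur = conn.cursor()

# Create the slot once; skip this on subsequent runs.
cur.create_replication_slot("lake_cdc_slot", output_plugin="wal2json")
cur.start_replication(slot_name="lake_cdc_slot", decode=True)

def handle_change(msg):
    # msg.payload is a JSON string describing inserts/updates/deletes;
    # a real pipeline would buffer these and write them out as files.
    print(msg.payload)
    msg.cursor.send_feedback(flush_lsn=msg.data_start)

cur.consume_stream(handle_change)
```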
Cons:
- Additional infrastructure cost and management overhead for Azure VM and Postgres
- Data duplication and increased latency: data is stored twice (on the Postgres VM and in the data lake)
- Potentially higher cost than cloud storage alone (VM plus disk costs)
- Adds complexity to your architecture
- Moving data from Postgres VM to data lake still requires ETL pipelines
- Could become a bottleneck if not sized properly
Best use cases:
- When your source is a relational DB and you want to leverage DB capabilities for incremental loads or transformations
- When you need low-latency access to data in Postgres for other applications
- When you have heavy SQL workloads that pre-process data before it lands in the lake
- When you want a staging area with transactional guarantees before landing
Since your source is on-prem Postgres with a one-time bulk load plus incremental loads, and you are building a data lake with schema design in the lakehouse downstream, it's cleaner to extract data directly from on-prem Postgres to ADLS (landing zone / bronze layer). To accomplish this, you can use Azure Data Factory (ADF) or Databricks to orchestrate:
- One-time bulk extract: dump data as Parquet/CSV into the bronze zone
- Incremental load: CDC or timestamp-based incremental extracts from Postgres to landing-zone files (a minimal Databricks sketch follows below)
- This avoids maintaining an additional Postgres instance in Azure.
- Later, in the silver/gold layers of your lakehouse, you can define schema, transform data, and create structured tables for consumption.
- This approach also reduces operational overhead.
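As a rough illustration, an incremental Databricks job could look like the sketch below. The storage account, container, table, watermark column, and secret scope names are placeholder assumptions, and the cluster needs network connectivity to the on-prem database (VPN/ExpressRoute, or a self-hosted integration runtime if you use ADF instead).

```python
# Minimal Databricks notebook sketch: incremental JDBC pull from on-prem
# Postgres into the ADLS bronze zone. All names are placeholder assumptions.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Watermark of the last successful load, normally read from a control table.
last_watermark = "2024-01-01 00:00:00"

incremental_query = f"""
    (SELECT * FROM orders
     WHERE updated_at > '{last_watermark}') AS incr
"""

df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://onprem-pg.example.local:5432/sales")
    .option("dbtable", incremental_query)
    .option("user", "etl_user")
    .option("password", dbutils.secrets.get("etl-scope", "pg-password"))
    .option("driver", "org.postgresql.Driver")
    .load()
)

# Land raw data in the bronze layer, partitioned by load date so incremental
# batches are easy to track and reprocess.
(
    df.withColumn("load_date", F.current_date())
    .write.mode("append")
    .partitionBy("load_date")
    .parquet("abfss://bronze@yourlakeaccount.dfs.core.windows.net/postgres/orders/")
)
```

The same pattern works for the one-time bulk load: drop the WHERE clause (or point `dbtable` at the full table) and run it once before switching to incremental extracts.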
If the above response helps answer your question, remember to "Accept Answer" so that others in the community facing similar issues can easily find the solution. Your contribution is highly appreciated.
hth
Marcin