How to write data to an on-prem PostgreSQL database from Synapse?

Anonymous
2024-08-14T17:52:20.7866667+00:00

I am trying to write data to an on-premises PostgreSQL database from Azure Synapse. I have tried two approaches, but neither has been successful.

  1. Created a data flow for ETL, but an on-prem PostgreSQL database is not supported as a sink by Synapse (link).
  2. Tried performing the ETL in a PySpark notebook, using psycopg2 to write data to the PostgreSQL database, but I get the following error: (psycopg2.OperationalError) connection to server at "xx.x.xx.xx", port 5432 failed: Connection timed out. Is the server running on that host and accepting TCP/IP connections?

Is there any workaround I can use to write data to the on-prem PostgreSQL database from Synapse?

Azure Synapse Analytics
Azure Data Factory

1 answer

  1. Amira Bedhiafi 33,071 Reputation points Volunteer Moderator
    2024-08-15T11:25:53.67+00:00

    The error message suggests that the connection between Azure Synapse and your on-premises PostgreSQL server is being blocked or cannot be established, which is usually a network-configuration issue. You need to ensure that the PostgreSQL server is reachable from the Azure Synapse environment; typically, this involves configuring the network to allow inbound traffic to the PostgreSQL server from the Azure Synapse IP addresses.
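
    As a quick sanity check, you can run something like the following from a Synapse notebook. This is only a sketch; the host, database name, and credentials are placeholders you would replace with your own values.

    ```python
    import socket

    import psycopg2

    PG_HOST = "xx.x.xx.xx"  # placeholder: your on-prem PostgreSQL host/IP
    PG_PORT = 5432

    # Check raw TCP reachability first: a timeout here means the problem is
    # at the network level (firewall, routing, missing VPN), not PostgreSQL.
    try:
        with socket.create_connection((PG_HOST, PG_PORT), timeout=5):
            print("TCP connection succeeded")
    except OSError as exc:
        print(f"TCP connection failed: {exc}")

    # If TCP works, try an actual PostgreSQL login with a short timeout.
    try:
        conn = psycopg2.connect(
            host=PG_HOST,
            port=PG_PORT,
            dbname="mydb",          # placeholder database name
            user="myuser",          # placeholder user
            password="mypassword",  # placeholder password
            connect_timeout=5,
        )
        conn.close()
        print("PostgreSQL login succeeded")
    except psycopg2.OperationalError as exc:
        print(f"PostgreSQL connection failed: {exc}")
    ```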

    • VPN or ExpressRoute: One common approach is to set up a site-to-site VPN or use Azure ExpressRoute to establish a secure, private connection between your Azure resources and your on-premises network. This lets your Synapse environment communicate with your on-prem PostgreSQL database as if they were on the same local network.
    • Firewall rules: Make sure that any firewalls or network security groups are configured to allow traffic on port 5432 from the IP range of your Azure Synapse environment.
    • PostgreSQL configuration: Even if the network configuration is correct, the PostgreSQL server itself might not accept remote connections; by default, PostgreSQL only allows local connections. Modify postgresql.conf so the server listens on all available IP addresses (listen_addresses = '*') and adjust pg_hba.conf to allow connections from the IP addresses of your Azure Synapse environment (see the example after this list).
    • Authentication: Ensure that the authentication method configured in pg_hba.conf supports connections from Azure Synapse. For instance, you may need to set the method to md5 or another appropriate authentication mechanism.
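
    For illustration, the relevant edits might look like this. The 203.0.113.0/24 range is a placeholder; substitute the address range your Synapse traffic actually arrives from (for example, your VPN gateway's range).

    ```
    # postgresql.conf -- listen on all interfaces, not only localhost
    listen_addresses = '*'

    # pg_hba.conf -- allow password-authenticated connections from the
    # placeholder range 203.0.113.0/24
    # TYPE  DATABASE  USER  ADDRESS          METHOD
    host    all       all   203.0.113.0/24   md5
    ```

    Note that changing listen_addresses requires a server restart, while pg_hba.conf changes only need a configuration reload.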

    Since you mentioned that using a dataflow in Synapse was not successful because the on-prem PostgreSQL database is not supported as a sink, and you faced challenges with PySpark and psycopg2, you might want to consider alternative approaches:

    • Using ADF: Azure Data Factory is a more robust tool for ETL operations and supports a wider range of connectors, including those for on-prem databases. You can create an ADF pipeline that moves data from Synapse to your on-prem PostgreSQL database. ADF integrates with your on-prem network through a Self-Hosted Integration Runtime, which allows it to connect securely to on-prem resources.
    • Using a gateway: Another approach is to use an on-premises data gateway, which bridges the gap between your on-prem environment and Azure Synapse and allows secure data transfer.

    If direct writing from Synapse to the on-prem PostgreSQL database continues to be problematic, consider the following:

    • Staging in Azure: Write the data to an intermediate storage location within Azure, such as Azure Blob Storage or an Azure SQL Database, and then use a separate process (ADF, custom scripts) to move the data from there to your on-prem PostgreSQL (see the sketch below).
    • Batch processing: If real-time data transfer is not required, consider batching the data and transferring it during off-peak hours when network traffic is lower.
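
    As a rough sketch of the staging approach from a Synapse notebook, assuming the workspace has access to the storage account ("mystorageaccount", the "staging" container, and the source table name are placeholders):

    ```python
    # Stage a Spark DataFrame as Parquet in ADLS Gen2; a downstream process
    # (an ADF pipeline with a Self-Hosted Integration Runtime, or a script
    # running on-prem) then loads the files into PostgreSQL.
    df = spark.read.table("my_source_table")  # placeholder source table

    (
        df.write
          .mode("overwrite")
          .parquet("abfss://staging@mystorageaccount.dfs.core.windows.net/postgres_export/")
    )
    ```

    On the on-prem side, the staged files can then be loaded with whatever tooling you prefer, for example PostgreSQL's COPY command or a small psycopg2 script.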
