In Azure databricks writing pyspark dataframe to eventhub is taking too long (8hrs) as there 3 Million records in dataframe

Shivasai 21 Reputation points
2022-04-08T04:26:00.52+00:00

Oracle database table has 3 million records. I need to read it into dataframe and then convert it to json format and send it to eventhub for downstream systems.

Below is my pyspark code to connect and read oracle db table as dataframe

df = spark.read \
.format("jdbc") \
.option("url", databaseurl) \
.option("query","select * from tablename") \
.option("user", loginusername) \
.option("password", password) \
.option("driver", "oracle.jdbc.driver.OracleDriver") \
.option("oracle.jdbc.timezoneAsRegion", "false") \
.load()

then I am converting the column names and values of each row into json (placing under a new column named body) and then sending it to Eventhub.

I have defined ehconf and eventhub connection string. Below is my write to eventhub code

df.select("body") \
   .write\
   .format("eventhubs") \
   .options(**ehconf) \    
   .save()

my pyspark code is taking 8 hours to send 3 million records to eventhub.

My Eventhub is created under eventhub cluster which has 1 CU in capacity

Could you please suggest how to write pyspark dataframe to eventhub faster ?

Databricks cluster config :
mode: Standard
runtime: 10.3
worker type: Standard_D16as_v4 64GB Memory,16 cores (min workers :1, max workers:5)
driver type: Standard_D16as_v4 64GB Memory,16 cores

Azure Event Hubs
Azure Event Hubs
An Azure real-time data ingestion service.
580 questions
Azure Databricks
Azure Databricks
An Apache Spark-based analytics platform optimized for Azure.
2,017 questions
{count} votes

1 answer

Sort by: Most helpful
  1. PRADEEPCHEEKATLA-MSFT 82,271 Reputation points Microsoft Employee
    2022-04-11T04:58:37.007+00:00

    Hello @Shivasai ,

    Thanks for the question and using MS Q&A platform.

    You may checkout the below options to increase the write performance:

    Option1: Increase the throughput unit capacity or enable Auto-Inflate future.

    Auto-Inflate automatically scales the number of Throughput Units assigned to your Standard Tier Event Hubs Namespace when your traffic exceeds the capacity of the Throughput Units assigned to it. You can specify a limit to which the Namespace will automatically scale.

    191630-image.png

    Option2: Setup maxEventsPerTrigger option in your query.

    Rate limit on maximum number of events processed per trigger interval. The specified total number of events will be proportionally split across partitions of different volume.
    After adding below code in the data bricks job to consume more event hub records throughput has been improved.

    `.option("maxEventsPerTrigger", 5000000)`  
    

    For more information, refer to Azure Databricks – Event Hubs.

    Hope this will help. Please let us know if any further queries.


    • Please don't forget to click on 130616-image.png or upvote 130671-image.png button whenever the information provided helps you. Original posters help the community find answers faster by identifying the correct answer. Here is how
    • Want a reminder to come back and check responses? Here is how to subscribe to a notification
    • If you are interested in joining the VM program and help shape the future of Q&A: Here is how you can be part of Q&A Volunteer Moderators