In Azure databricks writing pyspark dataframe to eventhub is taking too long (8hrs) as there 3 Million records in dataframe

Question

In Azure databricks writing pyspark dataframe to eventhub is taking too long (8hrs) as there 3 Million records in dataframe

Shivasai 21

Oracle database table has 3 million records. I need to read it into dataframe and then convert it to json format and send it to eventhub for downstream systems.

Below is my pyspark code to connect and read oracle db table as dataframe

df = spark.read \
.format("jdbc") \
.option("url", databaseurl) \
.option("query","select * from tablename") \
.option("user", loginusername) \
.option("password", password) \
.option("driver", "oracle.jdbc.driver.OracleDriver") \
.option("oracle.jdbc.timezoneAsRegion", "false") \
.load()

then I am converting the column names and values of each row into json (placing under a new column named body) and then sending it to Eventhub.

I have defined ehconf and eventhub connection string. Below is my write to eventhub code

df.select("body") \
   .write\
   .format("eventhubs") \
   .options(**ehconf) \    
   .save()

my pyspark code is taking 8 hours to send 3 million records to eventhub.

My Eventhub is created under eventhub cluster which has 1 CU in capacity

Could you please suggest how to write pyspark dataframe to eventhub faster ?

Databricks cluster config :
mode: Standard
runtime: 10.3
worker type: Standard_D16as_v4 64GB Memory,16 cores (min workers :1, max workers:5)
driver type: Standard_D16as_v4 64GB Memory,16 cores

PRADEEPCHEEKATLA 90,641 Reputation points Moderator

2022-04-13T08:04:49.867+00:00
Hello @Shivasai ,

Following up to see if the below suggestion was helpful. And, if you have any further query do let us know.

------------------------------

Please don't forget to click on or upvote button whenever the information provided helps you.

1 answer

Your answer

PRADEEPCHEEKATLA 90,641 Reputation points Moderator

2022-04-13T08:04:49.867+00:00

Hello @Shivasai ,

Following up to see if the below suggestion was helpful. And, if you have any further query do let us know.

------------------------------

Please don't forget to click on or upvote button whenever the information provided helps you.

Answer 1

Hello @Shivasai ,

Thanks for the question and using MS Q&A platform.

You may checkout the below options to increase the write performance:

Option1: Increase the throughput unit capacity or enable Auto-Inflate future.

Auto-Inflate automatically scales the number of Throughput Units assigned to your Standard Tier Event Hubs Namespace when your traffic exceeds the capacity of the Throughput Units assigned to it. You can specify a limit to which the Namespace will automatically scale.

Option2: Setup maxEventsPerTrigger option in your query.

Rate limit on maximum number of events processed per trigger interval. The specified total number of events will be proportionally split across partitions of different volume.
After adding below code in the data bricks job to consume more event hub records throughput has been improved.

`.option("maxEventsPerTrigger", 5000000)`

For more information, refer to Azure Databricks – Event Hubs.

Hope this will help. Please let us know if any further queries.

Please don't forget to click on or upvote button whenever the information provided helps you. Original posters help the community find answers faster by identifying the correct answer. Here is how
Want a reminder to come back and check responses? Here is how to subscribe to a notification
If you are interested in joining the VM program and help shape the future of Q&A: Here is how you can be part of Q&A Volunteer Moderators

PRADEEPCHEEKATLA 90,641 Reputation points Moderator

2022-04-18T05:52:47.143+00:00

Hello @Shivasai ,

Just checking in to see if the above answer helped. If this answers your query, do click Accept Answer and Up-Vote for the same. And, if you have any further query do let us know.

Share via

In Azure databricks writing pyspark dataframe to eventhub is taking too long (8hrs) as there 3 Million records in dataframe

1 answer

Your answer