How do I pass a pandas DataFrame from a Databricks step to a later step in a Data Factory pipeline?

Jeff vG 96 Reputation points
2021-07-25T13:22:12.667+00:00

In reference to this question, is it preferred to pass these records to a DataFactory pipeline, and a dataflow activity that ends in a SQL Table sink?

If so, how can I pass the DataFrame from the Databricks activity step to a Dataflow?

Azure Databricks
An Apache Spark-based analytics platform optimized for Azure.

Azure Data Factory
An Azure service for ingesting, preparing, and transforming data at scale.

SQL Server Integration Services
A Microsoft platform for building enterprise-level data integration and data transformation solutions.

2 answers

  1. Ryan Abbey 1,181 Reputation points
    2021-07-25T19:39:19.847+00:00

    Not entirely sure of the question, but Databricks should be able to load to a table directly, so why the need to pass it to Data Factory?

    To pass it to Data Factory, I'd write the records to a file and pass a reference to that file back to Data Factory. If it were a small DataFrame, stringifying it has potential (although I'm not sure how easily Data Factory can reconstruct it).
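    A minimal sketch of the stringify option for a small DataFrame: serialize it to a JSON string, return that string from the notebook, and reconstruct it on the receiving side. The `dbutils.notebook.exit` call and the Data Factory expression only work inside a Databricks notebook activity, so they are shown as comments; the DataFrame contents and activity name are illustrative.

```python
import pandas as pd
from io import StringIO

df = pd.DataFrame({"id": [1, 2, 3], "name": ["a", "b", "c"]})

# Serialize the (small) DataFrame to a JSON string.
payload = df.to_json(orient="records")

# In a Databricks notebook activity, return the string to Data Factory with:
#   dbutils.notebook.exit(payload)
# A later pipeline activity can then read it via the expression:
#   @activity('Notebook1').output.runOutput

# The receiving side can reconstruct the DataFrame from the string:
restored = pd.read_json(StringIO(payload), orient="records")
print(restored.equals(df))  # → True
```

    Note that notebook exit values are intended for small payloads; for anything larger, the file-reference approach above is the safer route.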


  2. PRADEEPCHEEKATLA 90,541 Reputation points
    2021-07-27T10:41:06.913+00:00

    Hi @Jeff vG ,

    Thanks for the question and using MS Q&A platform.

    If you want to write a pandas DataFrame to Azure SQL Server using Spark, you should first convert the pandas DataFrame to a Spark DataFrame (using the spark.createDataFrame method); after that, there should not be any particular problem writing it to an Azure SQL Server table.

    The document SQL databases using JDBC - Azure Databricks - Workspace | Microsoft Learn shows how you can read from and write to SQL databases using JDBC; it uses Spark DataFrames, however. I would argue that doing the same with pandas DataFrames is not a core use case of Databricks: if you run pandas, your code will only run on the driver. If you really wish to use pandas, you can use SQLAlchemy to create a connection and then call .to_sql() on your pandas DataFrame.
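    A sketch of the SQLAlchemy route. An in-memory SQLite database is used here as a stand-in so the example is self-contained; against Azure SQL Server you would build the engine from an mssql+pyodbc connection string instead (shown as a comment, with placeholder values):

```python
import pandas as pd
from sqlalchemy import create_engine

# Stand-in engine so the example runs anywhere.
# For Azure SQL Server the URL would look like (placeholders):
#   create_engine("mssql+pyodbc://<user>:<password>@<server>.database.windows.net:1433/"
#                 "<database>?driver=ODBC+Driver+17+for+SQL+Server")
engine = create_engine("sqlite:///:memory:")

df = pd.DataFrame({"id": [1, 2, 3], "name": ["a", "b", "c"]})

# Write the pandas DataFrame to a table; replace it if it already exists.
df.to_sql("my_table", engine, if_exists="replace", index=False)

# Read it back to confirm the round trip.
out = pd.read_sql("SELECT id, name FROM my_table ORDER BY id", engine)
print(len(out))  # → 3
```

    Remember that this runs entirely on the driver node, which is fine for small DataFrames but forfeits Spark's parallelism.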

    That is not to criticize pandas; it is simply not always the right tool for the task. pandas lacks multiprocessing support, and other libraries are better at handling big data.

    For more details, refer to Loading large datasets in Pandas

    Hope this helps. Do let us know if you have any further queries.

    ---------------------------------------------------------------------------

    Please "Accept the answer" if the information helped you. This will help us and others in the community as well.

