How do I pass a pandas DataFrame from a Databricks step to a later step in a Data Factory pipeline?

Jeff vG 96 Reputation points
2021-07-25T13:22:12.667+00:00

In reference to this question, is it preferred to pass these records to a DataFactory pipeline, and a dataflow activity that ends in a SQL Table sink?

If so, how can I pass the DataFrame from the Databricks activity step to a Dataflow?

Azure Databricks
An Apache Spark-based analytics platform optimized for Azure.

Azure Data Factory
An Azure service for ingesting, preparing, and transforming data at scale.

SQL Server Integration Services
A Microsoft platform for building enterprise-level data integration and data transformation solutions.

2 answers

  1. Ryan Abbey 1,181 Reputation points
    2021-07-25T19:39:19.847+00:00

    Not entirely sure of the question, but Databricks should be able to load to a table directly, so why the need to pass it to Data Factory?

    To pass it to Data Factory, I'd write the records to a file and pass a reference to that file back to Data Factory. If it were a small DataFrame, stringifying it has potential (although I'm not sure how easily Data Factory can reconstruct it).
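    A minimal sketch of the stringify option for a small DataFrame: serialize it to a JSON string, return that string from the notebook, and reconstruct it on the receiving side. The `dbutils.notebook.exit` call and the Data Factory expression only work inside a Databricks notebook activity, so they are shown as comments; the DataFrame contents and activity name are illustrative.

```python
import pandas as pd
from io import StringIO

df = pd.DataFrame({"id": [1, 2, 3], "name": ["a", "b", "c"]})

# Serialize the (small) DataFrame to a JSON string.
payload = df.to_json(orient="records")

# In a Databricks notebook activity, return the string to Data Factory with:
#   dbutils.notebook.exit(payload)
# A later pipeline activity can then read it via the expression:
#   @activity('Notebook1').output.runOutput

# The receiving side can reconstruct the DataFrame from the string:
restored = pd.read_json(StringIO(payload), orient="records")
print(restored.equals(df))  # → True
```

    Note that notebook exit values are intended for small payloads; for anything larger, the file-reference approach above is the safer route.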


  2. PRADEEPCHEEKATLA 90,541 Reputation points
    2021-07-27T10:41:06.913+00:00

    Hi @Jeff vG ,

    Thanks for the question and using MS Q&A platform.

    If you want to write a pandas DataFrame to Azure SQL Server using Spark, you should first convert the pandas DataFrame to a Spark DataFrame (using the spark.createDataFrame method); after that, there should not be any particular problem writing it to an Azure SQL Server table.

    The document SQL databases using JDBC - Azure Databricks - Workspace | Microsoft Learn shows how you can read from and write to SQL databases using JDBC; it uses Spark DataFrames, however. I would argue that doing the same with pandas DataFrames is not a core use case of Databricks: if you run pandas, your code will only run on the driver. If you really wish to use pandas, you can use SQLAlchemy to create a connection and then call .to_sql() on your pandas DataFrame.
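    A sketch of the SQLAlchemy route. An in-memory SQLite database is used here as a stand-in so the example is self-contained; against Azure SQL Server you would build the engine from an mssql+pyodbc connection string instead (shown as a comment, with placeholder values):

```python
import pandas as pd
from sqlalchemy import create_engine

# Stand-in engine so the example runs anywhere.
# For Azure SQL Server the URL would look like (placeholders):
#   create_engine("mssql+pyodbc://<user>:<password>@<server>.database.windows.net:1433/"
#                 "<database>?driver=ODBC+Driver+17+for+SQL+Server")
engine = create_engine("sqlite:///:memory:")

df = pd.DataFrame({"id": [1, 2, 3], "name": ["a", "b", "c"]})

# Write the pandas DataFrame to a table; replace it if it already exists.
df.to_sql("my_table", engine, if_exists="replace", index=False)

# Read it back to confirm the round trip.
out = pd.read_sql("SELECT id, name FROM my_table ORDER BY id", engine)
print(len(out))  # → 3
```

    Remember that this runs entirely on the driver node, which is fine for small DataFrames but forfeits Spark's parallelism.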

    That is not to criticize pandas; it is simply not always the right tool for the task. pandas lacks multiprocessing support, and other libraries are better at handling big data.

    For more details, refer to Loading large datasets in Pandas

    Hope this helps. Do let us know if you have any further queries.

    ---------------------------------------------------------------------------

    Please "Accept the answer" if the information helped you. This will help us and others in the community as well.

