A community member has associated this post with a similar question:
Is it better to filter in spark.sql text or from dataframe

Only moderators can edit this content.

Is it better to join tables in spark.sql text or as dataframes

Huzaifa Tapal 21 Reputation points
2022-12-05T17:59:16.683+00:00

We had a contractor setup the azure synapse analytics pipelines for us to ELT our source OLTP data using ADF through azure synapse analytics, store it in ADLS2 in databricks tables then ETL from databricks tables into our analytics store in snowflake schema form.

The notebooks created to perform the ETL of the data in azure databricks table to make dimension tables reads data from related tables separately into dataframes then joins the dataframes into one dataframe by the relation keys.

For example:

    lastUpdatedTime = datetime.utcnow() - timedelta(days=1)  
      
    # get departments  
    departmentsDF = spark.sql("select from departments")  
      
    departmentsDF = departmentsDF.select([col(i).alias("dept_"+i) for i in departmentsDF.columns])  
      
    # get employees updated in the past 1 day  
    employeesDF = spark.sql("select from employees")  
      
    employeesDF = employeesDF.where(col("LastUpdated") > lastUpdatedTime)  
      
    employeesDF = employeesDF.select([col(i).alias("emp_"+i) for i in employeesDF.columns])  
      
    concatEmployeesDF = employeesDF.join(departmentsDF, (employeesDF.emp_DepartmentID == departmentsDF.dept_ID) , how="left").drop(departmentsDF.dept_ID).dropDuplicates()  
      
    print(concatEmployeesDF.count())  

Is this the most efficient way of joining these two databrick tables? Would it be more efficient to do the joins in the SQL text when getting the employees?

.NET
.NET
Microsoft Technologies based on the .NET software framework.
3,367 questions
Azure Synapse Analytics
Azure Synapse Analytics
An Azure analytics service that brings together data integration, enterprise data warehousing, and big data analytics. Previously known as Azure SQL Data Warehouse.
4,362 questions
Azure Databricks
Azure Databricks
An Apache Spark-based analytics platform optimized for Azure.
1,916 questions
{count} votes