Is it better to join tables in spark.sql text or as dataframes
We had a contractor set up Azure Synapse Analytics pipelines for us that ELT our source OLTP data using ADF (through Azure Synapse Analytics), store it in ADLS Gen2 as Databricks tables, and then ETL from those Databricks tables into our analytics store in snowflake-schema form.
The notebooks created to perform the ETL from the Azure Databricks tables into dimension tables read each related table separately into its own DataFrame, then join the DataFrames into a single DataFrame on the relation keys.
For example:
from datetime import datetime, timedelta
from pyspark.sql.functions import col

lastUpdatedTime = datetime.utcnow() - timedelta(days=1)
# get departments, prefixing every column with "dept_"
departmentsDF = spark.sql("select * from departments")
departmentsDF = departmentsDF.select([col(i).alias("dept_" + i) for i in departmentsDF.columns])
# get employees updated in the past day, prefixing every column with "emp_"
employeesDF = spark.sql("select * from employees")
employeesDF = employeesDF.where(col("LastUpdated") > lastUpdatedTime)
employeesDF = employeesDF.select([col(i).alias("emp_" + i) for i in employeesDF.columns])
# left-join on the department key, drop the duplicated key column, and dedupe
concatEmployeesDF = (
    employeesDF.join(departmentsDF, employeesDF.emp_DepartmentID == departmentsDF.dept_ID, how="left")
    .drop(departmentsDF.dept_ID)
    .dropDuplicates()
)
print(concatEmployeesDF.count())
Is this the most efficient way of joining these two Databricks tables? Would it be more efficient to do the join in the spark.sql text when selecting the employees, something like the sketch below?
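For comparison, here is a minimal sketch of what pushing the filter and join into the spark.sql text might look like. The column names in the SELECT list (ID, Name, DepartmentID, LastUpdated) are assumed for illustration, since the real schemas aren't shown above; the explicit aliases stand in for the emp_/dept_ prefixing:

from datetime import datetime, timedelta

# same one-day cutoff as above, rendered as a SQL-friendly timestamp string
lastUpdatedTime = datetime.utcnow() - timedelta(days=1)
cutoff = lastUpdatedTime.strftime("%Y-%m-%d %H:%M:%S")

# hypothetical single-query version: the filter and join live in the SQL text
joinedDF = spark.sql(f"""
    SELECT e.ID           AS emp_ID,
           e.Name         AS emp_Name,
           e.DepartmentID AS emp_DepartmentID,
           d.Name         AS dept_Name
    FROM employees e
    LEFT JOIN departments d
      ON e.DepartmentID = d.ID
    WHERE e.LastUpdated > '{cutoff}'
""").dropDuplicates()
print(joinedDF.count())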