Is it better to join tables in spark.sql text or as dataframes
We had a contractor set up Azure Synapse Analytics pipelines for us that ELT our source OLTP data using ADF (through Azure Synapse Analytics), store it in ADLS Gen2 as Databricks tables, and then ETL from those Databricks tables into our analytics store in snowflake-schema form.
The notebooks created to perform the ETL from the Azure Databricks tables into dimension tables read each related table separately into its own DataFrame, then join the DataFrames into a single DataFrame on the relation keys.
For example:
from datetime import datetime, timedelta
from pyspark.sql.functions import col

lastUpdatedTime = datetime.utcnow() - timedelta(days=1)
# get departments, prefixing every column with "dept_"
departmentsDF = spark.sql("select * from departments")
departmentsDF = departmentsDF.select([col(i).alias("dept_" + i) for i in departmentsDF.columns])
# get employees updated in the past day, prefixing every column with "emp_"
employeesDF = spark.sql("select * from employees")
employeesDF = employeesDF.where(col("LastUpdated") > lastUpdatedTime)
employeesDF = employeesDF.select([col(i).alias("emp_" + i) for i in employeesDF.columns])
# left-join on the department key, drop the duplicated key column, and dedupe
concatEmployeesDF = (
    employeesDF.join(departmentsDF, employeesDF.emp_DepartmentID == departmentsDF.dept_ID, how="left")
    .drop(departmentsDF.dept_ID)
    .dropDuplicates()
)
print(concatEmployeesDF.count())
Is this the most efficient way of joining these two Databricks tables? Would it be more efficient to do the join in the spark.sql text when selecting the employees, something like the sketch below?
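For comparison, here is a minimal sketch of what pushing the filter and join into the spark.sql text might look like. The column names in the SELECT list (ID, Name, DepartmentID, LastUpdated) are assumed for illustration, since the real schemas aren't shown above; the explicit aliases stand in for the emp_/dept_ prefixing:

from datetime import datetime, timedelta

# same one-day cutoff as above, rendered as a SQL-friendly timestamp string
lastUpdatedTime = datetime.utcnow() - timedelta(days=1)
cutoff = lastUpdatedTime.strftime("%Y-%m-%d %H:%M:%S")

# hypothetical single-query version: the filter and join live in the SQL text
joinedDF = spark.sql(f"""
    SELECT e.ID           AS emp_ID,
           e.Name         AS emp_Name,
           e.DepartmentID AS emp_DepartmentID,
           d.Name         AS dept_Name
    FROM employees e
    LEFT JOIN departments d
      ON e.DepartmentID = d.ID
    WHERE e.LastUpdated > '{cutoff}'
""").dropDuplicates()
print(joinedDF.count())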