Is it better to filter in the spark.sql query text or on the DataFrame?

Huzaifa Tapal 21 Reputation points
2022-12-03T17:24:07.623+00:00

We had a contractor set up Azure Synapse Analytics pipelines for us that ELT our source OLTP data using ADF, store it in ADLS Gen2 as Databricks tables, and then ETL it from the Databricks tables into our analytics store in snowflake-schema form.

The notebooks created to perform the ETL first read what looks like all of the data from an Azure Databricks table into a DataFrame, then apply a where clause to keep only records that were updated in the past day.

For example:

from datetime import datetime, timedelta
from pyspark.sql.functions import col

lastUpdatedTime = datetime.utcnow() - timedelta(days=1)

concatEmployeesDF = spark.sql("select * from employees")

concatEmployeesDF = concatEmployeesDF.where(col("LastUpdated") > lastUpdatedTime)

Would it be more efficient to put the where clause in the Spark SQL text itself, like:

lastUpdatedTime = datetime.utcnow() - timedelta(days=1)

concatEmployeesDF = spark.sql(
    "select * from employees where LastUpdated > '{lu}'".format(lu=lastUpdatedTime))

Or are both the same and it doesn't matter?


Accepted answer
  HimanshuSinha-msft 19,376 Reputation points Microsoft Employee
  2022-12-05T22:41:56.487+00:00

    Hello @Huzaifa Tapal ,
    Thanks for the question and for using the MS Q&A platform.
    As we understand it, the ask here is whether loading all the data into a dataframe and then filtering is better vs. applying the filter while reading the data; please do let us know if that is not accurate.
    A dataframe is a two-dimensional data structure, and in Spark the data it holds is materialized in memory once it is evaluated. Personally, if you do not need all the data, I would not pull it into the dataframe in the first place. We have seen out-of-memory exceptions on bigger datasets; on a small dataset you may not see much of a difference.
    So I would go ahead with adding the where filter to the SQL text.
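
    Because Spark evaluates dataframes lazily, the Catalyst optimizer will often push a .where() applied right after spark.sql(...) down into the scan anyway, so the two forms frequently compile to the same physical plan; you can check this on your own tables with explain(). Below is a minimal sketch of that comparison, assuming (as in your example) an employees table with a LastUpdated timestamp column is already registered:

    from datetime import datetime, timedelta
    from pyspark.sql.functions import col

    lastUpdatedTime = datetime.utcnow() - timedelta(days=1)

    # Form 1: filter applied on the dataframe after spark.sql(...)
    planA = spark.sql("select * from employees") \
        .where(col("LastUpdated") > lastUpdatedTime)

    # Form 2: filter embedded in the SQL text itself
    planB = spark.sql(
        "select * from employees where LastUpdated > '{lu}'".format(lu=lastUpdatedTime))

    # Compare the physical plans; if the filter appears inside the scan
    # (e.g. as PushedFilters) in both, the two forms are equivalent.
    planA.explain()
    planB.explain()

    Either way, the important part is that the filter is in the plan before any action (write, count, etc.) materializes the data, which is what keeps memory usage down.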

    Please do let me know if you have any queries.
    Thanks
    Himanshu


