Is it better to filter in the spark.sql query text or on the DataFrame?

Huzaifa Tapal 21 Reputation points
2022-12-03T17:24:07.623+00:00

We had a contractor set up Azure Synapse Analytics pipelines for us that ELT our source OLTP data using ADF, store it in ADLS Gen2 as Databricks tables, and then ETL it from the Databricks tables into our analytics store in snowflake-schema form.

The notebooks created to perform the ETL first read what looks like all of the data from an Azure Databricks table into a DataFrame, then apply a where clause to keep only records that were updated in the past day.

For example:

from datetime import datetime, timedelta
from pyspark.sql.functions import col

lastUpdatedTime = datetime.utcnow() - timedelta(days=1)

concatEmployeesDF = spark.sql("select * from employees")

concatEmployeesDF = concatEmployeesDF.where(col("LastUpdated") > lastUpdatedTime)

Would it be more efficient to put the where clause in the Spark SQL text itself, like:

lastUpdatedTime = datetime.utcnow() - timedelta(days=1)

concatEmployeesDF = spark.sql(
    "select * from employees where LastUpdated > '{lu}'".format(lu=lastUpdatedTime))

Or are both the same and it doesn't matter?


Accepted answer
  HimanshuSinha-msft 19,376 Reputation points Microsoft Employee
  2022-12-05T22:41:56.487+00:00

    Hello @Huzaifa Tapal ,
    Thanks for the question and for using the MS Q&A platform.
    As we understand it, the ask here is whether loading all the data into a dataframe and then filtering is better vs. applying the filter while reading the data; please do let us know if that is not accurate.
    A dataframe is a two-dimensional data structure, and in Spark the data it holds is materialized in memory once it is evaluated. Personally, if you do not need all the data, I would not pull it into the dataframe in the first place. We have seen out-of-memory exceptions on bigger datasets; on a small dataset you may not see much of a difference.
    So I would go ahead with adding the where filter to the SQL text.
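
    Because Spark evaluates dataframes lazily, the Catalyst optimizer will often push a .where() applied right after spark.sql(...) down into the scan anyway, so the two forms frequently compile to the same physical plan; you can check this on your own tables with explain(). Below is a minimal sketch of that comparison, assuming (as in your example) an employees table with a LastUpdated timestamp column is already registered:

    from datetime import datetime, timedelta
    from pyspark.sql.functions import col

    lastUpdatedTime = datetime.utcnow() - timedelta(days=1)

    # Form 1: filter applied on the dataframe after spark.sql(...)
    planA = spark.sql("select * from employees") \
        .where(col("LastUpdated") > lastUpdatedTime)

    # Form 2: filter embedded in the SQL text itself
    planB = spark.sql(
        "select * from employees where LastUpdated > '{lu}'".format(lu=lastUpdatedTime))

    # Compare the physical plans; if the filter appears inside the scan
    # (e.g. as PushedFilters) in both, the two forms are equivalent.
    planA.explain()
    planB.explain()

    Either way, the important part is that the filter is in the plan before any action (write, count, etc.) materializes the data, which is what keeps memory usage down.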

    Please do let me know if you have any queries.
    Thanks
    Himanshu


