reading delta-lake

arkiboys 9,686 Reputation points
2022-11-16T15:22:11.243+00:00

hello,
in databricks I get the data as follows:

column_1 _year _month _day

----------------------------------------------------

xxx 2022 11 14
yyy 2022 11 14
aaa 2022 11 15
...
ttt 2022 11 16
...

I would like to get the data into partitions in delta-lake as follows

delta
_delta_log
_year=2022
_month=11
_day=14
_day=15
...

to do this, I am using the following pyspark in databricks
folder_path = "/merged/presentation/delta/"
df_present.write.mode("overwrite").format("delta").partitionBy('_year', '_month', '_day').save(f"/mnt/storage_dev/curated{folder_path}")

so then I can query the data-lake as follows:
display(spark.read.format("delta").load(f'/mnt/storage_dev/curated{aspect_cu_folder_path}'))

The problem I have is that the data is displayed for the latest _day whereas I want to see all data in _day=14 and _day=15 and ...
do you see what I am doing wrong in the pyspark?
thank you

Azure Databricks
Azure Databricks
An Apache Spark-based analytics platform optimized for Azure.
2,076 questions
{count} votes

Accepted answer
  1. MartinJaffer-MSFT 26,061 Reputation points
    2022-11-18T17:31:32.137+00:00

    Hello @arkiboys ,
    Thanks for the question and using MS Q&A platform.

    As I understand you want to read from a range of partitions, not just a single partition.

    I admit, delta is my weakness. From what I see, reading a single file like that is not necessary because of the table metadata. Due to "Partition discovery," we don't have to specify individual files, and can use a different method. All this is done for you if you access through the table rather than file directly. We can verify this using .explain at the end.

    try

    spark.sql("select * from tablename where year = '2022' and month = '11' and day > 13")

    or

    spark.read.table("tablename").where("year = '2022' and month = '11' and day > 13")

    Excerpt from best-practices

    Load a single partition: Reading partitions directly is not necessary. For example, you don’t need to run spark.read.format("parquet").load("/data/date=2017-01-01"). Instead, use a WHERE clause for data skipping, such as spark.read.table("<table_name>").where("date = '2017-01-01'").

    Please do let me if you have any queries.

    Thanks
    Martin


    • Please don't forget to click on 130616-image.png or upvote 130671-image.png button whenever the information provided helps you. Original posters help the community find answers faster by identifying the correct answer. Here is how
    • Want a reminder to come back and check responses? Here is how to subscribe to a notification
      • If you are interested in joining the VM program and help shape the future of Q&A: Here is how you can be part of Q&A Volunteer Moderators
    0 comments No comments

0 additional answers

Sort by: Most helpful