Data size of Databricks Delta tables

NIKHIL KUMAR 101 Reputation points
2024-05-02T09:39:01.4133333+00:00

It has been observed that the reported size of a Delta table is much smaller than the size of its underlying Delta files in the storage account.

For example, a Databricks Delta table raw.deltaTableA is reported as 2 MB, but the underlying Delta files in the data lake add up to 50 MB.

How is this data size calculated?

Azure Data Lake Storage
Azure Databricks

Accepted answer
  1. Vinodh247 22,951 Reputation points MVP
    2024-05-02T13:46:37.8+00:00

    Hi NIKHIL KUMAR,

    Thanks for reaching out to Microsoft Q&A.

    A Delta table is a high-level abstraction that represents your data in a structured format. It is defined by its schema, its metadata, and a transaction log that records which data files make up each version of the table. The Delta files, on the other hand, are the physical files stored in the underlying storage such as Azure Data Lake Storage: the data itself in columnar Parquet format, plus the transaction log in the `_delta_log` folder. The storage folder can also hold data files that the current version of the table no longer references, so the size you see there can be higher than the size reported for the table.

    Please 'Upvote'(Thumbs-up) and 'Accept' as an answer if the reply was helpful. This will benefit other community members who face the same issue.


1 additional answer

  1. 2024-08-23T07:05:10.3+00:00

    The Delta table size reported by Databricks is the size of the latest snapshot (version) of that table, that is, only the data files the current version references.
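One way to read the snapshot figure is `DESCRIBE DETAIL`, whose result includes `sizeInBytes` and `numFiles` for the current version. The following is a minimal sketch, assuming an active Spark session is passed in as `spark` (as in a Databricks notebook); the helper name `snapshot_size` is hypothetical and used only for illustration.

```python
def snapshot_size(spark, table_name):
    """Return (size_in_bytes, num_files) for the current version of a Delta table."""
    # DESCRIBE DETAIL returns a single row describing the latest snapshot only;
    # files kept for older versions are not counted in sizeInBytes.
    row = spark.sql(f"DESCRIBE DETAIL {table_name}").collect()[0]
    return row["sizeInBytes"], row["numFiles"]
```

In a notebook this could be called as `snapshot_size(spark, "raw.deltaTableA")` and the result compared with the folder size shown in the storage account.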

    The underlying storage also keeps the data files for previous versions of the table. This is what allows time travel to an earlier version, which is a standard feature of Delta tables. When we run VACUUM regularly, data files that the current version no longer references and that are older than the retention threshold (7 days by default) are deleted, which limits time travel to 7 days in the past. This reduces the storage occupied and is a recommended approach.
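The gap between the two numbers can be illustrated by replaying the transaction log by hand. This is only a sketch under stated assumptions: the table folder is reachable as a local path (for example through a mount), all JSON commit files are still present in `_delta_log`, and checkpoint files are ignored. The function names `live_files` and `storage_bytes` are hypothetical.

```python
import json
import os


def live_files(table_dir):
    """Replay the JSON commits in _delta_log; return {path: size} for the latest version."""
    log_dir = os.path.join(table_dir, "_delta_log")
    # Commit files are named with zero-padded version numbers,
    # so a lexical sort gives version order.
    commits = sorted(
        name for name in os.listdir(log_dir)
        if name.endswith(".json") and name[:-5].isdigit()
    )
    files = {}
    for name in commits:
        with open(os.path.join(log_dir, name)) as fh:
            for line in fh:
                if not line.strip():
                    continue
                action = json.loads(line)
                if "add" in action:
                    # File becomes part of the table at this version.
                    files[action["add"]["path"]] = action["add"]["size"]
                elif "remove" in action:
                    # File is logically deleted but stays on disk until VACUUM.
                    files.pop(action["remove"]["path"], None)
    return files


def storage_bytes(table_dir):
    """Total size of every file physically present under the table folder, log included."""
    total = 0
    for root, _dirs, names in os.walk(table_dir):
        for name in names:
            total += os.path.getsize(os.path.join(root, name))
    return total
```

Comparing `sum(live_files(path).values())` with `storage_bytes(path)` shows how much of the folder belongs to older versions and the log. On Databricks, running `VACUUM raw.deltaTableA` is what removes the unreferenced files once they pass the retention threshold.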
