Compute Cluster cache

Shambhu Rai 1,406 Reputation points
2023-08-24T12:42:49.7933333+00:00

Hi Expert,

How do I configure the compute cluster cache in Databricks?

I tried the settings below, but they are not working as expected:

spark.databricks.io.cache.maxDiskUsage 50g
spark.databricks.io.cache.maxMetaDataCache 1g
spark.databricks.io.cache.compression.enabled false

Accepted answer
  Amira Bedhiafi 14,881 Reputation points
  2023-08-25T00:31:55.39+00:00

    I think you are missing some steps:

    It looks like you're trying to configure the disk cache for a compute cluster in Azure Databricks. The keys you listed are valid Databricks cache settings, but they have to be applied as Spark configuration properties before the cluster starts (for example, in the cluster's Spark config in the UI), which could be why setting them afterwards is not working as expected.

    If you're not already working with a cluster, you'll need to create one first.

    The Databricks Runtime has built-in support for caching, and you can configure it through Spark configurations. Here's an example snippet that sets the caching configurations using SparkConf:

       import org.apache.spark.SparkConf
       import org.apache.spark.sql.SparkSession

       // Note the capitalization: maxMetaDataCache, not maxMetadataCache.
       val conf = new SparkConf()
         .set("spark.databricks.io.cache.enabled", "true")
         .set("spark.databricks.io.cache.maxDiskUsage", "50g")
         .set("spark.databricks.io.cache.maxMetaDataCache", "1g")
         .set("spark.databricks.io.cache.compression.enabled", "false")
       val spark = SparkSession.builder().config(conf).getOrCreate()
    
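    On an already-running cluster, only the enable/disable flag can be changed at runtime; the sizing options above take effect only at cluster startup. A minimal sketch, assuming the `spark` session that Databricks notebooks provide:

    ```scala
    // Toggle the disk cache on a running cluster.
    // maxDiskUsage / maxMetaDataCache cannot be changed here; they are
    // read only when the cluster starts.
    spark.conf.set("spark.databricks.io.cache.enabled", "true")
    ```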

    You can also cache individual DataFrames in Databricks using the cache() method. Caching is lazy, so the data is only materialized the first time an action runs over the DataFrame:

       val df = spark.read.parquet("path/to/your/data")
       df.cache()   // lazy: marks the DataFrame for caching
       df.count()   // the first action materializes the cache
    

    You can use the Databricks UI to monitor and manage caching. This allows you to view the cache status, storage levels, and more.


0 additional answers
