Compute Cluster cache

Shambhu Rai 1,411 Reputation points
2023-08-24T12:42:49.7933333+00:00

Hi Expert,

How do I enable caching for a compute cluster in Databricks?

I tried the settings below, but they are not working as expected:

spark.databricks.io.cache.maxDiskUsage 50g
spark.databricks.io.cache.maxMetaDataCache 1g
spark.databricks.io.cache.compression.enabled false

Accepted answer
  1. Amira Bedhiafi 33,071 Reputation points Volunteer Moderator
    2023-08-25T00:31:55.39+00:00

    I think you are missing a step:

    It looks like you're trying to configure the disk cache for a compute cluster in Azure Databricks. The settings you posted are valid disk-cache configurations, but spark.databricks.io.cache.enabled is not among them, so the cache is never actually turned on; that is most likely why it isn't working as expected. Also note that the sizing settings (maxDiskUsage, maxMetaDataCache, compression) are read when the cluster starts, so they belong in the cluster's Spark config at creation time.

    If you're not already working with a cluster, you'll need to create one first.

    The Databricks Runtime has built-in support for disk caching, and you can configure it through Spark configurations. Here's an example snippet that sets the caching configurations using SparkConf:

       import org.apache.spark.SparkConf
       import org.apache.spark.sql.SparkSession

       val conf = new SparkConf()
         .set("spark.databricks.io.cache.enabled", "true")              // turn the disk cache on
         .set("spark.databricks.io.cache.maxDiskUsage", "50g")          // per-node disk space reserved for cached data
         .set("spark.databricks.io.cache.maxMetaDataCache", "1g")       // per-node disk space reserved for cached metadata
         .set("spark.databricks.io.cache.compression.enabled", "false") // store cached data uncompressed
       val spark = SparkSession.builder().config(conf).getOrCreate()
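
    On an existing cluster, the enabled flag can also be toggled at runtime from a notebook; a minimal sketch (the sizing settings above are only read at cluster startup, so only the flag is changed here):

       // enable the disk cache for subsequent reads in this session
       spark.conf.set("spark.databricks.io.cache.enabled", "true")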
    

    Separately from the disk cache, you can cache individual DataFrames in Spark's memory using the cache() method:

       // read a Parquet dataset and mark it for caching
       val df = spark.read.parquet("path/to/your/data")
       df.cache() // lazy: nothing is cached until an action runs
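
    Because cache() is lazy, you need an action to materialize the cache; a short sketch, continuing from the df above:

       df.count()     // first action populates the cache
       df.unpersist() // release the cached data when you're done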
    

    You can use the Spark UI on the cluster to monitor and manage caching: the Storage tab shows what is cached, the storage levels, and the memory and disk space used.
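
    To quickly confirm from a notebook that the disk cache is on, you can read the flag back (a minimal check, assuming the session from above):

       // read back the current value of the disk-cache flag for this session
       println(spark.conf.get("spark.databricks.io.cache.enabled"))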

