Reading Data as Spark DataFrame using mltable

Abdelkhalek Hamdi 0 Reputation points
2025-02-04T09:34:46.4233333+00:00

The Microsoft documentation for mltable mentions support for reading data as a Spark DataFrame, but specific examples or references are hard to find.

Docs: https://learn.microsoft.com/en-us/azure/machine-learning/how-to-mltable?view=azureml-api-2&tabs=cli#:~:text=Azure%20Machine%20Learning%20supports%20a%20Table%20type%20(mltable).%20This%20allows%20for%20the%20creation%20of%20a%20blueprint%20that%20defines%20how%20to%20load%20data%20files%20into%20memory%20as%20a%20Pandas%20or%20Spark%20data%20frame.%20In%20this%20article%20you%20learn%3A

Has anyone successfully implemented this?

Additionally, is it possible to use a Spark Serverless job to read data directly from an mltable YAML file?

Azure Machine Learning

1 answer

  1. Vikram Singh 1,460 Reputation points Microsoft Employee
    2025-02-05T06:34:13.45+00:00

    Hi Abdelkhalek Hamdi,

    Thanks for troubleshooting this.

    The error you're encountering appears to occur because the MLTable object has no method called to_spark_dataframe. Instead, you can convert the MLTable to a Pandas DataFrame and then convert that to a Spark DataFrame. Note that this route loads the whole table into the driver's memory first, so it suits data that fits on a single node. Here's an example of how you can do this:

    from mltable import load
    from pyspark.sql import SparkSession
    
    # Reuse the active Spark session, or start one if none exists
    spark = SparkSession.builder.appName("MySparkApp").getOrCreate()
    
    # Load the MLTable definition from the folder that contains the MLTable YAML file
    path = "./mltable-test/"
    tbl = load(path)
    
    # Materialize the table as a Pandas DataFrame (all rows are loaded into driver memory)
    pandas_df = tbl.to_pandas_dataframe()
    
    # Convert the Pandas DataFrame to a Spark DataFrame
    spark_df = spark.createDataFrame(pandas_df)
    spark_df.show()
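
If you want the conversion in one reusable place, the Pandas fallback can be wrapped in a small helper. This is only a sketch: to_spark is a hypothetical name, not part of mltable or pyspark, and it does not assume that to_spark_dataframe exists. It simply checks for that method at run time and otherwise goes through to_pandas_dataframe and spark.createDataFrame as in the example above.

```python
def to_spark(table, spark):
    """Return a Spark DataFrame for an MLTable-like object.

    Hypothetical helper: uses a native to_spark_dataframe() method only if the
    object actually exposes one, otherwise falls back to the Pandas route.
    """
    native = getattr(table, "to_spark_dataframe", None)
    if callable(native):
        return native()
    # Fallback: materialize in driver memory, then hand the result to Spark
    return spark.createDataFrame(table.to_pandas_dataframe())
```

With the objects from the example, the call would be spark_df = to_spark(tbl, spark). Because the fallback goes through Pandas, the same memory caveat applies.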
    

    Regarding the installation issue, you are right: the package to install is mltable, not azureml-mltable. The correct command is:

    pip install mltable pyspark
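
To confirm what actually got installed in the environment your job runs in, a quick check with the standard library's importlib.metadata may help. This is a minimal sketch; installed_version is a hypothetical helper name, and the package names are the ones from the command above.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of a package, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Report the two packages used in this answer
for name in ("mltable", "pyspark"):
    print(name, installed_version(name) or "not installed")
```

Running this inside the same kernel or job as your Spark code shows whether the packages are visible there, which is not always the same environment where pip was run.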
    

    I hope this resolves the version error you encountered.

    If you have any further questions or need additional assistance, feel free to ask!

    Thanks.

