Reading Data as Spark DataFrame using mltable

Question

Abdelkhalek Hamdi 40

The Microsoft documentation for mltable mentions support for reading data as a Spark DataFrame, but specific examples or references are hard to find.

Docs: https://learn.microsoft.com/en-us/azure/machine-learning/how-to-mltable?view=azureml-api-2&tabs=cli#:~:text=Azure%20Machine%20Learning%20supports%20a%20Table%20type%20(mltable).%20This%20allows%20for%20the%20creation%20of%20a%20blueprint%20that%20defines%20how%20to%20load%20data%20files%20into%20memory%20as%20a%20Pandas%20or%20Spark%20data%20frame.%20In%20this%20article%20you%20learn%3A

Has anyone successfully implemented this?

Additionally, is it possible to use a Spark Serverless job to read data directly from an mltable YAML file?

Vikram Singh 2,585 Reputation points Microsoft Employee Moderator

2025-02-04T10:20:59.9166667+00:00
Hi Abdelkhalek Hamdi,

Thanks for posting your question on Microsoft Q&A.

Yes, Azure Machine Learning's MLTable supports reading data into a Spark DataFrame. While specific examples are limited, the general approach involves defining your data schema using an MLTable YAML file and then loading this data within a Spark environment. Here is an example of how you can achieve this:

Loading MLTable as Spark DataFrame: MLTable YAML definitions can be loaded directly into Spark using the Azure ML SDK. Here’s a minimal sample implementation:
from mltable import load
from azureml.core import Workspace

ws = Workspace.from_config()
path = "./mltable.yaml"  # Path to your mltable YAML file

# Load as Spark DataFrame
spark_df = load(path).to_spark_dataframe()
spark_df.show()

Reference Docs: MLTable Supported Operations → "Working with Spark"
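
The snippet above assumes an MLTable definition already exists at the given path. As a rough sketch of what such a definition might look like, the following writes a minimal one for a single CSV file. The folder name, the `data.csv` file name, and the helper `write_minimal_mltable` are hypothetical. The `paths` and `read_delimited` keys, and the extensionless file name `MLTable`, follow the MLTable format as I understand it from the docs, so verify them against the current schema before relying on them.

```python
from pathlib import Path
import tempfile

# Assumed minimal definition for one comma-delimited file (illustrative only).
MLTABLE_TEXT = """\
type: mltable
paths:
  - file: ./data.csv
transformations:
  - read_delimited:
      delimiter: ','
      header: all_files_same_headers
"""


def write_minimal_mltable(folder: Path) -> Path:
    """Write the definition to <folder>/MLTable and return the file path."""
    folder.mkdir(parents=True, exist_ok=True)
    # Assumption: the definition file is named MLTable, with no extension.
    target = folder / "MLTable"
    target.write_text(MLTABLE_TEXT, encoding="utf-8")
    return target


if __name__ == "__main__":
    demo_folder = Path(tempfile.mkdtemp()) / "mltable-test"
    print(write_minimal_mltable(demo_folder))
```

The folder that holds this file, rather than the file itself, is what would then be passed to `load`.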

is it possible to use a Spark Serverless job to read data directly from an mltable YAML file?

Yes, Spark Serverless (Azure Synapse) can read MLTable YAML files directly if:

Dependency Inclusion: Attach the azureml-mltable and PySpark libraries to your cluster/job:
pip install azureml-mltable pyspark

Structured Read: Use the MLTable SDK in a PySpark script:
from pyspark.sql import SparkSession
from mltable import load

spark = SparkSession.builder.getOrCreate()
mltable_path = "abfss://<filesystem>@<account>.dfs.core.windows.net/your-data/mltable.yaml"
mltable = load(str(mltable_path))  # Supports Azure Blob/Azure Data Lake paths
spark_df = mltable.to_spark_dataframe()

Key Challenges & Workarounds: Current docs focus on Pandas examples. For Spark:

Use to_spark_dataframe() explicitly (shown above).

Load via explicit Spark data source format:
spark.read.format("mltable").load(path)

(Requires azureml-mltable available in cluster environment)

For an example of a standalone Spark job configuration and detailed submission instructions, see the Azure Machine Learning documentation on submitting Spark jobs.

If the reply was helpful, please don't forget to upvote and/or accept it as an answer. Let me know if you have any other queries.

Thank you!

Abdelkhalek Hamdi 40 Reputation points

2025-02-04T13:27:23.5533333+00:00
Hi Vikram, thank you for your answer.
First, running the following code on an Azure ML compute instance returned the following error:

from mltable import load
from azureml.core import Workspace

ws = Workspace.from_config()
path = "./mltable-test/"  # Path to your mltable YAML file

# Load as Spark DataFrame
spark_df = load(path).to_spark_dataframe()
spark_df.show()

Error:

The same error was returned when running on Spark Serverless instead of a compute instance.

The other thing:
pip install azureml-mltable pyspark returned:
I guess you meant pip install mltable instead of azureml-mltable.

Accepted answer

Answer 1

Vikram Singh 2,585 Microsoft Employee Moderator

Hi Abdelkhalek Hamdi,

Thanks for troubleshooting on this.

It looks like the error you're encountering is due to the MLTable object not having a method called to_spark_dataframe. Instead, you can convert the MLTable to a Pandas DataFrame and then convert it to a Spark DataFrame. Here's an example of how you can do this:

from mltable import load
from azureml.core import Workspace
import pandas as pd
from pyspark.sql import SparkSession

# Initialize Spark session
spark = SparkSession.builder.appName("MySparkApp").getOrCreate()

# Load MLTable
ws = Workspace.from_config()
path = "./mltable-test/"  # Folder containing the MLTable file
mltable = load(path)

# Convert MLTable to Pandas DataFrame
pandas_df = mltable.to_pandas_dataframe()

# Convert Pandas DataFrame to Spark DataFrame
spark_df = spark.createDataFrame(pandas_df)
spark_df.show()
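
Since the thread shows that to_spark_dataframe was not present on the loaded object, one cautious option is to check for it at run time and fall back to the pandas route only when it is absent. The helper name `mltable_to_spark` is hypothetical, and the sketch is duck-typed: it assumes only `to_pandas_dataframe()` on the table and `createDataFrame()` on the Spark session, both of which appear in the code above.

```python
def mltable_to_spark(table, spark):
    """Return a Spark DataFrame for a loaded MLTable-like object.

    `table` is whatever mltable.load(...) returned; `spark` is a SparkSession.
    """
    # Assumption: a native converter may or may not exist depending on the
    # installed mltable version, so probe for it instead of calling it blindly.
    native = getattr(table, "to_spark_dataframe", None)
    if callable(native):
        return native()

    # Fallback from the thread: materialize in pandas, then hand over to Spark.
    pandas_df = table.to_pandas_dataframe()
    return spark.createDataFrame(pandas_df)
```

Note that the pandas fallback collects the whole table in the driver's memory before Spark sees it, so it only suits data that fits on a single node.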

Regarding the installation issue, you are right: the package to install is mltable, not azureml-mltable. The correct command is:

pip install mltable pyspark
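
To confirm which of the two package names is actually installed in an environment before submitting a job, a small check along these lines may help. It uses only the standard library's `importlib.metadata`; the helper name `installed_version` is hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(distribution: str):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(distribution)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    # Distribution names taken from the thread.
    for name in ("mltable", "azureml-mltable", "pyspark"):
        print(f"{name}: {installed_version(name) or 'not installed'}")
```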

I hope this resolves the version error you encountered.

If you have any further questions or need additional assistance, feel free to ask!

Thanks.

Vikram Singh 2,585 Reputation points Microsoft Employee Moderator

2025-02-07T03:58:39.3933333+00:00

Hi Abdelkhalek Hamdi

Greetings.

Just following up to check whether my suggestion helped. Please do not forget to "Accept the answer" and "up-vote" wherever the information provided helps you; this can be beneficial to other community members.

Thank you

Vikram Singh 2,585 Reputation points Microsoft Employee Moderator

2025-02-11T06:45:52.8866667+00:00

Hi Abdelkhalek Hamdi

Awaiting your reply.

If the response helped, please click "Accept Answer" and "Yes" for "Was this answer helpful".

Doing so would help other community members with a similar issue identify the solution. I highly appreciate your contribution to the community.

Thank You.

Vikram Singh 2,585 Reputation points Microsoft Employee Moderator

2025-02-17T04:45:33.86+00:00

Hi Abdelkhalek Hamdi,

Greetings!

We haven't heard from you on the last response and were just checking back to see if you got a chance to try the above suggestions.

Please feel free to click the 'Upvote' (Thumbs-up) button and 'Accept as Answer'. This helps the community by allowing others with similar queries to easily find the solution.

Thank you.
