Azure TabularDataset wrongly loads Parquet?

MaciejS 11 Reputation points
2022-03-12T21:49:21.043+00:00

Below I give a concrete example where the azureml Python API fails to read Parquet files correctly.
More precisely, the data gets displaced. It may help clarify the issue posted by @Kengo Wada, where the data could not be shared publicly.

Setup: Python 3.8 + azureml-core=1.36.0 + azureml-dataprep=2.26.0 + pyarrow=7.0.0
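
To confirm that an environment matches the setup above before trying to reproduce the problem, a small check like the following may help. It is only a sketch: the helper name `installed_versions` and the package list are illustrative choices, and it relies only on the standard-library `importlib.metadata` module (available from Python 3.8).

```python
from importlib import metadata

# Packages named in the setup line; pandas is added because the script uses it.
PACKAGES = ("azureml-core", "azureml-dataprep", "pyarrow", "pandas")


def installed_versions(names):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for package, version in installed_versions(PACKAGES).items():
        print(f"{package}: {version or 'not installed'}")
```

Including this output in a bug report makes it easier for others to tell whether they are testing the same combination of versions.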

The data is attached: 182488-error.log

The code demonstrating the issue is given below.
It uses the input data to create a table of strings with some None values, stores it in Parquet format, and reads it back either directly or through TabularDataset.

from azureml.core import Workspace, Dataset
import tempfile
import pandas as pd
import hashlib

# prepare data: a column of SHA-512 hashes with some None values
df = pd.read_csv("error.log")
mask = df.isna().any(axis=1)  # rows with at least one missing value
df['id'] = df['id'].map(lambda s: hashlib.sha512(str(s).encode()).hexdigest() if s else None)
df.loc[mask, 'id'] = None

# configure Azure storage
ws = Workspace.from_config()
dstore = ws.datastores.get('my_datastore')
dstore_path = 'my_path'
target = (dstore, dstore_path)

# write to Azure storage
with tempfile.TemporaryDirectory() as tmpdir:
    df.to_parquet(f'{tmpdir}/df.parquet')
    ds = Dataset.File.upload_directory(tmpdir, target, overwrite=True)

# read in two ways: download and open with pandas, or use the Azure connector
with tempfile.TemporaryDirectory() as tmpdir:
    ds = Dataset.File.from_files(target)
    ds.download(tmpdir)
    df1 = pd.read_parquet(tmpdir)
    ds = Dataset.Tabular.from_parquet_files(target)
    df2 = ds.to_pandas_dataframe()

# the comparison fails: the data seems displaced :-(
pd.testing.assert_frame_equal(df1, df2)
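
Since `assert_frame_equal` only reports that the frames differ, a small helper can show where they differ and whether the non-null values are merely shifted. The following is a minimal sketch, not part of the original report: the helper names `locate_mismatches` and `same_non_null_sequence` are hypothetical, the toy frames stand in for `df1` and `df2`, and the idea that the values keep their order while the None entries move is an assumption to test, not a confirmed diagnosis.

```python
import pandas as pd


def locate_mismatches(expected, actual, column):
    """Return the rows where `column` differs, treating two missing values as equal."""
    left = expected[column].reset_index(drop=True)
    right = actual[column].reset_index(drop=True)
    if len(left) != len(right):
        raise ValueError(f"row counts differ: {len(left)} vs {len(right)}")
    both_missing = left.isna() & right.isna()
    differs = ~both_missing & (left != right)
    return pd.DataFrame({"expected": left[differs], "actual": right[differs]})


def same_non_null_sequence(expected, actual, column):
    """True if both columns hold the same non-null values in the same order."""
    return expected[column].dropna().tolist() == actual[column].dropna().tolist()


# Toy stand-ins for df1 (direct read) and df2 (TabularDataset read).
expected = pd.DataFrame({"id": ["a", None, "b", "c"]})
actual = pd.DataFrame({"id": ["a", "b", None, "c"]})

mismatches = locate_mismatches(expected, actual, "id")
print(mismatches)
print("values preserved, only positions shifted:",
      same_non_null_sequence(expected, actual, "id"))
```

With the real data, calling `locate_mismatches(df1, df2, 'id')` would list the affected rows, which is usually more useful in a bug report than the bare assertion error.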

FWD: @Ramr-msft

Azure Machine Learning

1 answer

  1. Ramr-msft 17,731 Reputation points
    2022-03-14T10:59:31.277+00:00

    @MaciejS Thanks for the question. Please raise an issue at the following link so that the product team can investigate.
    https://github.com/Azure/MachineLearningNotebooks/issues