Azure TabularDataset wrongly loads Parquet?

Question

Below is a concrete example in which the azureml Python API fails to read Parquet files correctly.
More precisely, the data gets displaced. This example may help clarify the related issue posted by @Kengo Wada, where the data could not be shared publicly.

Setup: Python 3.8 + azureml-core=1.36.0 + azureml-dataprep=2.26.0 + pyarrow=7.0.0

The data is attached: 182488-error.log

The code demonstrating the issue is given below.
It uses the input data to build a table of strings with some None values, stores it in Parquet format, and reads it back either directly or through TabularDataset.

from azureml.core import Workspace, Dataset
import tempfile
import pandas as pd
import hashlib

# prepare data: list of hashes with some None values
df = pd.read_csv("error.log")
# rows with at least one missing value (axis passed by keyword; positional form is deprecated)
mask = df.isna().any(axis=1)
df.id = df.id.map(lambda s: hashlib.sha512(str(s).encode()).hexdigest() if s else None)
df.loc[mask, 'id'] = None

# configure Azure storage
ws = Workspace.from_config()
dstore = ws.datastores.get('my_datastore')
dstore_path = 'my_path'
target = (dstore, dstore_path)

# write to Azure storage
with tempfile.TemporaryDirectory() as tmpdir:
    df.to_parquet(f'{tmpdir}/df.parquet')
    ds = Dataset.File.upload_directory(tmpdir, target, overwrite=True)

# read in two ways: download and open in pandas, or use the Azure connector
with tempfile.TemporaryDirectory() as tmpdir:
    ds = Dataset.File.from_files(target)
    ds.download(tmpdir)
    df1 = pd.read_parquet(tmpdir)
    ds = Dataset.Tabular.from_parquet_files(target)
    df2 = ds.to_pandas_dataframe()

# comparison fails, the data seems displaced :-(
pd.testing.assert_frame_equal(df1, df2)
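
To narrow down what "displaced" means here, a check along the following lines could be run on df1 and df2. It is only a minimal sketch using plain pandas, not part of the azureml API: the helper name describe_mismatch and the toy frames are hypothetical, and it assumes the column holds strings and missing values only. It reports where the two columns first differ and whether they still contain the same values, which would suggest that rows were moved rather than altered.

```python
import pandas as pd


def describe_mismatch(expected: pd.DataFrame, actual: pd.DataFrame, column: str) -> dict:
    """Summarise how actual[column] differs from expected[column] (hypothetical helper)."""
    exp = expected[column].reset_index(drop=True)
    act = actual[column].reset_index(drop=True)
    if len(exp) != len(act):
        return {"same_length": False, "n_expected": len(exp), "n_actual": len(act)}

    # Two missing values (None/NaN) at the same position count as equal.
    equal = (exp == act) | (exp.isna() & act.isna())
    mismatched = equal[~equal].index

    return {
        "same_length": True,
        "n_mismatched": len(mismatched),
        "first_mismatch": int(mismatched[0]) if len(mismatched) else None,
        # Same values overall suggests rows were shifted, not corrupted.
        "same_non_null_values": sorted(exp.dropna()) == sorted(act.dropna()),
        "same_null_count": int(exp.isna().sum()) == int(act.isna().sum()),
    }


# Toy data standing in for df1 (direct read) and df2 (TabularDataset read).
toy_expected = pd.DataFrame({"id": ["a", None, "b", "c"]})
toy_actual = pd.DataFrame({"id": ["a", "b", None, "c"]})
report = describe_mismatch(toy_expected, toy_actual, "id")
print(report)
```

With the real frames, the call would be describe_mismatch(df1, df2, 'id'). If same_non_null_values and same_null_count are both true while n_mismatched is non-zero, the values are intact but sit in different rows.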

FWD: @Ramr-msft

Answer

@MaciejS Thanks for the question. Please raise an issue at the following link so that the product team can look into it.
https://github.com/Azure/MachineLearningNotebooks/issues
