@MaciejS Thanks for the question. Please raise an issue at the following link so that the product team can look into it.
https://github.com/Azure/MachineLearningNotebooks/issues
Azure TabularDataset wrongly loads Parquet?
Below I give a concrete example where the azureml Python API fails to read Parquet files correctly.
More precisely, the data gets displaced. This example may help clarify the issue posted by @Kengo Wada, where the data could not be shared publicly.
Setup: Python 3.8 + azureml-core=1.36.0 + azureml-dataprep=2.26.0 + pyarrow=7.0.0
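To make a report like this easier to reproduce, the installed package versions can be printed from the same environment. The following is a minimal sketch using only the standard library; the package list is an assumption based on the setup line above (pandas is added because the code below uses it), and `installed_versions` is a hypothetical helper name.

```python
from importlib import metadata

# Packages named in the setup line, plus pandas (an assumption: it is used below).
PACKAGES = ["azureml-core", "azureml-dataprep", "pyarrow", "pandas"]


def installed_versions(packages):
    """Return a dict mapping each package name to its installed version string."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Report missing packages instead of raising.
            versions[name] = "not installed"
    return versions


for name, version in installed_versions(PACKAGES).items():
    print(f"{name}=={version}")
```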
The data is attached: 182488-error.log
The code demonstrating the issue is given below.
It uses the input data to build a table of strings with some None values, stores it in Parquet format, and reads it back either directly with pandas or through TabularDataset.
from azureml.core import Workspace, Dataset
import tempfile
import pandas as pd
import hashlib
# prepare data: list of hashes with some None values
df = pd.read_csv("error.log")
mask = df.isna().any(axis=1)
df.id = df.id.map(lambda s:hashlib.sha512(str(s).encode()).hexdigest() if s else None)
df.loc[mask,'id'] = None
# configure Azure storage
ws = Workspace.from_config()
dstore = ws.datastores.get('my_datastore')
dstore_path = 'my_path'
target = (dstore,dstore_path)
# write to Azure storage
with tempfile.TemporaryDirectory() as tmpdir:
    df.to_parquet(f'{tmpdir}/df.parquet')
    ds = Dataset.File.upload_directory(tmpdir, target, overwrite=True)
# read in two ways: download and open in pandas, or use the Azure connector
with tempfile.TemporaryDirectory() as tmpdir:
    ds = Dataset.File.from_files(target)
    ds.download(tmpdir)
    df1 = pd.read_parquet(tmpdir)
ds = Dataset.Tabular.from_parquet_files(target)
df2 = ds.to_pandas_dataframe()
# comparison fails, the data seems displaced :-(
pd.testing.assert_frame_equal(df1,df2)
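
Since assert_frame_equal only reports that the frames differ, a small helper can show whether the values are shifted relative to the None entries. This is a rough sketch using pandas only; `describe_mismatch` is a hypothetical helper, and the two short series are made-up stand-ins for `df1.id` and `df2.id`, not the attached data.

```python
import pandas as pd


def describe_mismatch(expected: pd.Series, actual: pd.Series) -> dict:
    """Summarize how two columns differ, treating missing values as equal."""
    expected = expected.reset_index(drop=True)
    actual = actual.reset_index(drop=True)
    if len(expected) != len(actual):
        return {"same_length": False}

    both_missing = expected.isna() & actual.isna()
    # Rows where both values are present and equal.
    equal = (expected == actual).fillna(False).astype(bool)
    differs = ~(equal | both_missing)
    mismatch_rows = differs[differs].index.tolist()

    return {
        "same_length": True,
        "mismatch_count": len(mismatch_rows),
        "first_mismatch": mismatch_rows[0] if mismatch_rows else None,
        "null_count_expected": int(expected.isna().sum()),
        "null_count_actual": int(actual.isna().sum()),
        # True when the non-null values match in order, which suggests
        # that only the positions of the missing values moved.
        "same_non_null_sequence": expected.dropna().tolist() == actual.dropna().tolist(),
    }


# Made-up stand-ins for df1.id and df2.id.
expected = pd.Series(["a", None, "b", "c", None], dtype=object)
actual = pd.Series(["a", "b", None, "c", None], dtype=object)
report = describe_mismatch(expected, actual)
print(report)
```

If `same_non_null_sequence` is true while `mismatch_count` is non-zero, the values themselves survived the round trip and only their alignment with the None entries changed, which is the kind of displacement described above.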
FWD: @Ramr-msft
1 answer
Ramr-msft 17,731 Reputation points
2022-03-14T10:59:31.277+00:00