Python Azure SDK: trouble importing TSV data; pandas error

Adrian Antico (TEKsystems, Inc.) 51 Reputation points
2022-02-10T23:38:29.46+00:00

I'm trying to get some Cosmos data into an Azure ML compute instance. I've done this many times, but this particular dataset is giving me trouble and I'm not sure why. I've removed all punctuation and special characters from the source data, and the data is small enough for pandas to handle. I tried to download an entire stream, but that failed, so I downloaded a partial stream instead. The only glitch was a partial row of data at the end of the file, but that is an artifact of the partial download, not of the underlying data. Here's a link to one of the source files on Cosmos that has caused the issue:

https://aad.cosmos15.osdinfra.net/cosmos/xbox.quality.prod/shares/IEBKS.PartnerProd/cooked/xcloud/xCloudBi/AdrianAntico/Retention-Engagement/LatencyServerFrameV2Raw_2022_01-12_14.tsv?property=info

Here's the code I'm running in a compute instance to transfer data from blob storage to the compute instance directory:

import os
import azureml
from azureml.core import Workspace, Dataset
import pandas as pd

# Root Path
RootPath = os.getcwd()

# Metadata (values hidden here, but they work)
subscription_id = '' 
resource_group = ''
workspace_name = ''

# Create workspace 
workspace = Workspace(subscription_id, resource_group, workspace_name)

inlist = [
  'LatencyServerFrameV2Raw_2022_01-12_14',
  'LatencyServerFrameV2Raw_2022_01-15_18',
  'LatencyServerFrameV2Raw_2022_01-19_22',
  'LatencyServerFrameV2Raw_2022_01-23_26',
  'LatencyServerFrameV2Raw_2022_01-27_30',
  'LatencyServerFrameV2Raw_2022_01-31_03',
  'LatencyServerFrameV2Raw_2022_02-04_07']


# Import all data
for dd in inlist:
  dataset = Dataset.get_by_name(workspace, name=f"{dd}.tsv")
  Path1 = os.path.join(RootPath, "Latency", "NanoLatencyRawData", f"{dd}.csv")
  # The error occurs on the next line
  df = dataset.to_pandas_dataframe()
  del dataset
  df.to_csv(Path1)
  del df

UserErrorException: UserErrorException:
Message: Execution failed in operation 'to_pandas_dataframe' for Dataset(id='7cea1b1e-30df-4536-a859-d0931e52962a', name='LatencyServerFrameV2Raw_2022_02-04_07.tsv', version=3, error_code=ScriptExecution.StreamAccess.Validation,error_message=ScriptExecutionException was caused by StreamAccessException.
StreamAccessException was caused by ValidationException.
Unable to read file using Unicode (UTF-8). Attempted read range 230686720:251658240. Lines read in the range 5597. Decoding error: [REDACTED]
Failed due to inner exception of type: DecoderFallbackException
| session_id=c27e97a1-bdc7-4216-ba31-c804c5570ae7) ErrorCode: ScriptExecution.StreamAccess.Validation
InnerException
Error Code: ScriptExecution.StreamAccess.Validation

>> dataset

{
"source": [
"('retention_engagement_dimention', '/local/data/cooked/xcloud/xCloudBi/AdrianAntico/Retention-Engagement/LatencyServerFrameV2Raw_2022_02-04_07.tsv')"
],
"definition": [
"GetDatastoreFiles",
"ParseDelimited",
"DropColumns",
"SetColumnTypes"
],
"registration": {
"id": "7cea1b1e-30df-4536-a859-d0931e52962a",
"name": "LatencyServerFrameV2Raw_2022_02-04_07.tsv",
"version": 3,
"workspace": "Workspace.create(name='xCloudML', subscription_id='09b5fdb3-165d-4e2b-8ca0-34f998d176d5', resource_group='xCloudData')"
}
}

Validation Error Code: InvalidEncoding
Validation Target: TextFile
Failed Step: 10a002a3-6c2b-4173-9b00-43cb4d8d0011
Error Message: ScriptExecutionException was caused by StreamAccessException.
StreamAccessException was caused by ValidationException.
Unable to read file using Unicode (UTF-8). Attempted read range 230686720:251658240. Lines read in the range 5597. Decoding error: Unable to translate bytes [EF] at index 382 from specified code page to Unicode.
Unable to translate bytes [EF] at index 382 from specified code page to Unicode.
| session_id=c27e97a1-bdc7-4216-ba31-c804c5570ae7
ErrorResponse
{
"error": {
"code": "UserError",
"message": "Execution failed in operation 'to_pandas_dataframe' for Dataset(id='7cea1b1e-30df-4536-a859-d0931e52962a', name='LatencyServerFrameV2Raw_2022_02-04_07.tsv', version=3, error_code=ScriptExecution.StreamAccess.Validation,error_message=ScriptExecutionException was caused by StreamAccessException.\n StreamAccessException was caused by ValidationException.\n Unable to read file using Unicode (UTF-8). Attempted read range 230686720:251658240. Lines read in the range 5597. Decoding error: [REDACTED]\n Failed due to inner exception of type: DecoderFallbackException\n| session_id=c27e97a1-bdc7-4216-ba31-c804c5570ae7) ErrorCode: ScriptExecution.StreamAccess.Validation"
}
}
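
The decoding error above points at a byte, EF, that the reader could not interpret as UTF-8 within the range 230686720:251658240. One way to narrow this down, assuming a local copy of the file is available, is to scan that byte range for sequences that do not decode as UTF-8. The sketch below is illustrative only: the helper names read_range and find_invalid_utf8 and the sample bytes are hypothetical, and offsets reported by the Azure ML reader may not line up exactly with offsets in a local copy.

```python
def read_range(path, start, end):
    """Read the raw bytes in [start, end) from a local file."""
    with open(path, "rb") as fh:
        fh.seek(start)
        return fh.read(end - start)


def find_invalid_utf8(data, base_offset=0, limit=10):
    """Return up to `limit` (absolute_offset, byte_value) pairs for bytes
    that cannot be decoded as UTF-8.

    Note: if `data` starts or ends in the middle of a multi-byte character,
    the first or last bytes of the range can be reported as invalid even
    though the full file is fine at that position.
    """
    bad = []
    pos = 0
    while pos < len(data) and len(bad) < limit:
        try:
            data[pos:].decode("utf-8")
            break  # the rest of the range decodes cleanly
        except UnicodeDecodeError as exc:
            bad_pos = pos + exc.start
            bad.append((base_offset + bad_pos, data[bad_pos]))
            pos = bad_pos + 1  # skip the offending byte and keep scanning
    return bad


if __name__ == "__main__":
    # Hypothetical sample: one valid row, then a row with a stray 0xEF byte.
    sample = b"ok\tcaf\xc3\xa9\nbad\xefrow\n"
    for offset, value in find_invalid_utf8(sample):
        print(f"invalid byte 0x{value:02X} at offset {offset}")
```

For the file in question, the call would look something like find_invalid_utf8(read_range(local_path, 230686720, 251658240), base_offset=230686720), where local_path is a placeholder for wherever the file was downloaded. Note that 0xEF is a valid lead byte for a three-byte UTF-8 character, so the report is most useful when read together with the bytes that follow it.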

Azure Machine Learning

1 answer

  1. Ramr-msft 17,611 Reputation points
    2022-02-14T04:48:26.917+00:00

    @Adrian Antico (TEKsystems, Inc.) Thanks for the question. Could you please add more details about the Azure SDK version you are using? Could you also check which version of azureml-dataprep is installed in your Python environment?
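
    For reference, one way to report those versions from inside the compute instance is sketched below. It assumes Python 3.8 or later (for importlib.metadata); the helper name installed_versions and the list of packages checked are illustrative choices, not something taken from the question.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each package name to its installed version, or None if missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    # Packages relevant to the question; extend the list as needed.
    wanted = ["azureml-core", "azureml-dataprep", "pandas"]
    for name, version in installed_versions(wanted).items():
        print(f"{name}: {version or 'not installed'}")
```

    Running this in the same kernel or environment as the failing notebook matters, since a compute instance can have several Python environments with different package versions.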