Error tokenizing data when uploading to blob storage

Quentin Tschofen 0 Reputation points
2024-04-18T15:43:32.0666667+00:00

I have a function that downloads attachments from emails, uploads them to blob storage, and applies a tag to the email after upload. Recently, one of the attachments has started erroring out when uploading to blob storage with the error "Error tokenizing data. C error: Expected 1 fields in line 3, saw 4". So far, I've determined that the error happens in the final step, where the blob is uploaded; every other step completes without error. I've tried altering the function to upload the attachment as an .xlsx instead of a .csv, but the same error occurs with this one file. No other file is affected. Interestingly, the file is uploaded successfully even though the error is returned. When inspecting the file manually, I'm not finding any characters that would scramble the .csv, and Synapse reads the file without issue.

Any ideas?


for attachment in item.attachments:
    if isinstance(attachment, FileAttachment) and attachment.name.endswith('.csv'):
        logging.info("\n\nReading CSV\n\n")
        df = pd.read_csv(BytesIO(attachment.content), sep='|')
        logging.info("\n\nCleaning problem characters\n\n")
        # removes double quotes, newlines, tabs, pipes, and carriage returns
        df_cleaned = df.applymap(clean_problem_characters)
        logging.info("\n\nAdding metadata\n\n")
        df_export = addMetadata(df_cleaned, metadata={
            "__sent_time": metadata["__sent_time"],
            "__source_landing_time": str(item.datetime_received.isoformat()).replace('+00:00', ''),
            "__source_file_path": attachment.name
        })
        logging.info("\n\nCreating buffer\n\n")
        buffer = BytesIO()
        logging.info("\n\nWrite file to buffer\n\n")
        df_export.to_csv(buffer, index=False)
        logging.info("\n\nUploading blob\n\n")
        container_client.upload_blob(
            name=f"raw/email/{email_folder}/{attachment.name}",
            data=buffer.getvalue(),
            overwrite=True)

2 answers

  1. Amira Bedhiafi 26,186 Reputation points
    2024-04-18T18:12:02.3266667+00:00

    I think the error you're seeing occurs when the format of the CSV file doesn't match the schema the parser expects. This can happen if a row contains more columns than the header declares, or if the delimiter isn't applied consistently across the data.

    Since you mentioned the file reads correctly in other tools, it may still contain special characters or unescaped quotes that the pandas parser interprets differently.
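
    For illustration, here is a minimal, self-contained sketch (with made-up data, not yours) of how a single stray delimiter produces exactly this kind of tokenizing error:

       import pandas as pd
       from io import BytesIO

       # Hypothetical pipe-delimited content: line 3 carries an extra '|' inside a value,
       # so pandas sees 4 fields where the header only declares 3.
       raw = b"col_a|col_b|col_c\n1|foo|bar\n2|needs|extra|field\n"

       try:
           pd.read_csv(BytesIO(raw), sep='|')
       except pd.errors.ParserError as exc:
           print(exc)  # Error tokenizing data. C error: Expected 3 fields in line 3, saw 4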

    You can modify your read operation to handle parsing errors more gracefully, either skipping problematic lines or emitting warnings that help identify the exact row at fault.

    
       # pandas 1.3+ uses on_bad_lines; older versions used error_bad_lines=False, warn_bad_lines=True
       df = pd.read_csv(BytesIO(attachment.content), sep='|', on_bad_lines='warn')
    
    

    Then add logging right before the parsing step to output the first few hundred bytes of the raw data.

    
       raw_content = BytesIO(attachment.content)
    
       print(raw_content.getvalue()[:500])  # print the first 500 bytes of the file
    
       df = pd.read_csv(raw_content, sep='|')
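
    If you want to pinpoint exactly which row carries the extra field before parsing, a quick sanity check that counts delimiters per line can help. This is only a rough sketch (a naive count that ignores quoted fields), reusing attachment.content and logging from your function:

       # Naive field-count check: flags lines whose field count differs from the header's.
       content = attachment.content.decode('utf-8', errors='replace')
       lines = content.splitlines()
       expected = lines[0].count('|') + 1
       for lineno, line in enumerate(lines, start=1):
           fields = line.count('|') + 1
           if fields != expected:
               logging.warning("Line %d has %d fields, expected %d: %r",
                               lineno, fields, expected, line[:200])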
    
    

  2. Anand Prakash Yadav 7,795 Reputation points Microsoft Vendor
    2024-04-22T06:50:54.1533333+00:00

    Hello Quentin Tschofen,

    Thank you for posting your query here!

    Azure Blob Storage itself doesn’t parse the contents of the files being uploaded, so the error you’re seeing is likely not due to Blob Storage looking for something specific in the file. However, there are a few things that could potentially cause an upload to fail:

    · Azure Blob Storage has a limit on the size of the blocks being uploaded. If the block size is too large, the upload could fail.

    · There’s also a limit on the number of uncommitted blocks that can be associated with a blob (100,000 blocks). If a previous upload operation was canceled or failed, there might be uncommitted blocks left over.

    · If multiple PUT operations occur simultaneously against the same blob, that could also cause issues.

    · If the file is large, the upload might be timing out. You could potentially solve this by setting max_single_put_size to something smaller when you create the client (see the sketch after this list).
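
    For illustration only, here is a minimal sketch of how max_single_put_size (and the related max_block_size) can be passed when constructing the client with the azure-storage-blob v12 SDK; the connection string, container, and blob names are placeholders:

       from azure.storage.blob import BlobServiceClient

       # Placeholder connection details; substitute your own values.
       service_client = BlobServiceClient.from_connection_string(
           "<your-connection-string>",
           max_single_put_size=4 * 1024 * 1024,  # switch to chunked upload above 4 MiB
           max_block_size=4 * 1024 * 1024,       # size of each staged block
       )
       container_client = service_client.get_container_client("<container-name>")
       container_client.upload_blob(name="raw/email/example.csv", data=b"...", overwrite=True)

    The same keyword arguments are accepted when you construct a BlobClient directly.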

    For reference: https://learn.microsoft.com/en-us/troubleshoot/azure/azure-storage/blobs/connectivity/invalid-blob-or-block-content-or-invalid-block-list

    I hope this helps! Please let me know if the issue persists or if you have any other questions.

    Please do not forget to "Accept the answer" and "up-vote" wherever the information provided helps you, as this can be beneficial to other community members.

