CSV string to dataframe

Ryan Abbey 1,181 Reputation points
2022-10-27T21:00:08.043+00:00

We receive a base64-encoded file that, when decoded, is a comma-delimited CSV file. It's not a particularly big file (40K rows, 7MB), but converting it to a dataframe is not going well.

Since it's base64, I decode it into a string and then use pandas read_csv to parse it, but read_csv runs for quite a long time (5 minutes) before failing with:

LivyHttpRequestFailure: Something went wrong while processing your request

There isn't much to the code

b = b64.b64decode(z).decode()  
c = pd.read_csv(b, sep=',', quotechar='"')  

Where "z" is the base64 string

My current feeling is that everything is being done on the driver and, for whatever reason, is overconsuming resources. Since the file is small, though, it shouldn't have a problem with memory, surely?

Can I get some parallelisation in there? Is there a better way to parse the string?
TIA

Azure Synapse Analytics
Azure Databricks

1 answer

  1. Ryan Abbey 1,181 Reputation points
    2022-11-07T22:01:28.933+00:00

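    I did manage to get a performance improvement by wrapping the decoded string in StringIO before passing it to read_csv, so that read_csv receives a file-like object rather than a plain string:
    c = pd.read_csv(io.StringIO(b), sep=',', quotechar='"')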
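
For anyone landing here later, the StringIO approach can be sketched end to end as follows. pandas read_csv treats a plain str argument as a file path or URL rather than as CSV content, which is probably why passing the decoded text directly misbehaved; wrapping it in io.StringIO hands it a file-like buffer instead. The helper name csv_b64_to_dataframe, the UTF-8 encoding, and the sample data are assumptions for illustration only and are not from the original post.

```python
import base64
import io

import pandas as pd


def csv_b64_to_dataframe(encoded: str) -> pd.DataFrame:
    """Decode a base64 string holding CSV text and parse it with pandas.

    Hypothetical helper; assumes the payload is UTF-8 encoded CSV.
    """
    # Decode base64 to bytes, then bytes to text.
    text = base64.b64decode(encoded).decode("utf-8")
    # Wrap the text in a file-like object so read_csv parses it as
    # content instead of interpreting the string as a path.
    return pd.read_csv(io.StringIO(text), sep=",", quotechar='"')


# Made-up sample standing in for the real payload "z".
sample_csv = 'id,name\n1,"Smith, J"\n2,"Doe, A"\n'
z = base64.b64encode(sample_csv.encode("utf-8")).decode("ascii")

df = csv_b64_to_dataframe(z)
print(df)
```

If the decoded text is not needed for anything else, passing io.BytesIO(base64.b64decode(z)) to read_csv should also work and skips the separate decode step. On the parallelisation question: this parse runs in pandas on the driver, and at 40K rows that is usually fine; the resulting pandas dataframe can be handed to Spark afterwards if distributed processing is needed downstream.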