CSV string to dataframe

Update: I did manage to get a performance improvement by passing the decoded string through StringIO, i.e. io.StringIO(b), before handing it to read_csv (see the sketch after the code below).

We receive a base64 encoded file that, when decoded, is a comma-delimited CSV file. It's not a particularly big file (40K rows, about 7 MB), but converting it to a dataframe is not going well. Since it's base64, I decode it into a string and then use pandas read_csv to parse it, but read_csv takes quite a long time (around 5 minutes) before failing with:
LivyHttpRequestFailure: Something went wrong while processing your request
There isn't much to the code:
b = b64.b64decode(z).decode()               # decode the base64 payload into a CSV string
c = pd.read_csv(b, sep=',', quotechar='"')  # attempt to parse that string into a dataframe
Where "z" is the base64 string
My current feeling is that everything is being done on the driver and is, for whatever reason, over-consuming resources, although since the file is small, surely it shouldn't have a problem with memory?
Can I get some parallelisation in there? Is there a better way to parse the string?
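For context, one way parallelisation is sometimes approached in a Spark/Livy environment (which the LivyHttpRequestFailure suggests this is) would be roughly the following. This is only a hedged sketch of the idea being asked about, not a confirmed fix; spark is assumed to be the notebook's existing SparkSession, and z is the base64 payload as above.

import base64

# assumed: `spark` is the SparkSession already available in the notebook
csv_text = base64.b64decode(z).decode()

# distribute the raw CSV rows so Spark's CSV reader parses them on the executors
# rather than on the driver; spark.read.csv accepts an RDD of CSV-row strings
rows = spark.sparkContext.parallelize(csv_text.splitlines())
sdf = spark.read.csv(rows, header=True, quote='"')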
TIA
1 answer
Ryan Abbey 1,136 Reputation points
2022-11-07T22:01:28.933+00:00