Azure Blob SDK (Python azure-storage-blob) does not parse rows with tab-separated columns in a .txt file

Volochy Grigory 16 Reputation points
2020-10-22T17:08:04.31+00:00

I have .txt files pushed by Microsoft Academic Graph to Azure Blob storage.

I'm building a Python app that uses the "azure-storage-blob" SDK to query the .txt files and search for entries by column value. For this, I'm following this documentation:
https://learn.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-query-acceleration-how-to?tabs=python%2Cpowershell

I tested it with .csv files and it works fine: the columns are searchable using the "query_blob" method of the "BlobClient" class. In those files, the columns in each row are separated by a comma (',').

But when I try to use it on the .txt files, whose columns are separated by the '\t' character, the query returns each row as a single column.
For example, if the file contains a row like:

95198407 14607 helpage international HelpAge International

I expect all four columns to be searchable and the response to contain an object with four columns, as it does for similar .csv files.
Instead, I get the whole row back as a single column.
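To make the expected result concrete, here is a minimal local sketch using only Python's standard "csv" module. It does not call Azure at all; it only illustrates the four-column split I expect. The tab characters in the sample string are an assumption (they are not visible in the row as pasted above), and "split_columns" is a hypothetical helper name.

```python
import csv
import io

# Sample row from the question. The "\t" separators are an assumption:
# the pasted row above only shows spaces.
sample = "95198407\t14607\thelpage international\tHelpAge International\n"


def split_columns(text, delimiter="\t"):
    """Parse delimited text locally and return a list of rows (lists of columns)."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    return [row for row in reader]


# Splitting on the tab gives the four columns I expect.
tab_rows = split_columns(sample)
print(len(tab_rows[0]), tab_rows[0])

# Splitting on a comma reproduces the symptom: the whole row is one column.
comma_rows = split_columns(sample, delimiter=",")
print(len(comma_rows[0]))
```

The second call mirrors what I see from the service: when the delimiter is not recognized, the whole line comes back as one field.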

The live example of what I have in code:
34414-screenshot-from-2020-10-22-19-49-12.png

And what I have in response:
34415-screenshot-from-2020-10-22-19-49-01.png
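Since the code above is only available as a screenshot, here is a minimal sketch of the kind of call I am describing. It is not a copy of my code. It assumes the "azure-storage-blob" package is installed; the connection string, container name, blob name and query are placeholders, and "tab_dialect_kwargs" and "query_tsv_blob" are hypothetical helper names. Whether the service actually honors the tab delimiter passed this way is exactly what this question is about.

```python
def tab_dialect_kwargs():
    """Keyword arguments for DelimitedTextDialect with a tab as column separator."""
    return {
        "delimiter": "\t",        # a single tab character
        "quotechar": '"',
        "lineterminator": "\n",
        "escapechar": "",
        "has_header": False,
    }


def query_tsv_blob(conn_str, container_name, blob_name, query):
    """Run a query-acceleration query against one blob and return (bytes, errors)."""
    # Imported here so the sketch can be read and loaded without the SDK installed.
    from azure.storage.blob import BlobClient, DelimitedTextDialect

    blob_client = BlobClient.from_connection_string(
        conn_str, container_name=container_name, blob_name=blob_name
    )
    errors = []  # non-fatal errors reported by the service are collected here
    reader = blob_client.query_blob(
        query,
        blob_format=DelimitedTextDialect(**tab_dialect_kwargs()),
        on_error=errors.append,
    )
    return reader.readall(), errors


# Hypothetical usage (placeholder values, not real names):
# data, errors = query_tsv_blob(CONN_STR, "my-container", "my-file.txt",
#                               "SELECT * FROM BlobStorage WHERE _1 = '95198407'")
```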

I ran multiple tests with the "delimiter" parameter equal to:
'\t'
'\t'
'/\t'
'\t\t\t\t'

And similar values. But every time the result is either the same, or it sometimes throws an error like:
34340-screenshot-from-2020-10-22-19-54-14.png

Then I tried setting the "delimiter" parameter to '\t\t\t\tt' and got the following response:
34319-screenshot-from-2020-10-22-19-57-26.png

So it looks like it does not matter how many '\t' characters I specify for the "delimiter" parameter: they are all filtered out, and in this case the columns end up being split on the 't' character instead.
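To rule out a plain Python escaping mistake on my side, a quick check like the following shows what each candidate string actually contains before it is handed to the SDK. This is only a local sketch; the "raw string" and "escaped backslash" entries are extra variants added for comparison, and "describe" is a hypothetical helper name.

```python
# Candidate delimiter values, keyed by a short description.
candidates = {
    "tab escape": "\t",
    "raw string": r"\t",            # backslash followed by the letter t
    "escaped backslash": "\\t",     # same two characters as the raw string
    "slash + tab": "/\t",
    "four tabs": "\t\t\t\t",
    "four tabs + t": "\t\t\t\tt",
}


def describe(value):
    """Return the length of the string and the hex code point of each character."""
    return len(value), [hex(ord(ch)) for ch in value]


for name, value in candidates.items():
    length, codes = describe(value)
    print(f"{name:18} length={length} codes={codes}")
```

On the Python side, '\t' is a single character (code point 0x9), so if the tabs disappear, it seems to happen somewhere after the string leaves my code.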

It looks like either I can't figure out how to escape the '\t' character properly, which is why it gets filtered out and ignored, or there is some other way to specify the tab separator. I checked the docs for the BlobClient class here:
https://learn.microsoft.com/en-us/python/api/azure-storage-blob/azure.storage.blob.blobclient?view=azure-python

I even looked inside the source code, but I can't figure out how to solve the issue.

Azure Blob Storage