Azure Data Lake Storage Java SDK to read from blob line by line

Anonymous
2024-01-09T09:05:38.99+00:00

Hi,

I am getting started with Azure Data Lake Storage. We are developing an analytics application in Java, with a massive amount of data stored in the data lake. The data is JSON: each event is a JSON object, and the format differs from event to event. We store the events together in a single blob file, separated by newline characters or, in another format, by commas, e.g.

{event1Json}

{event2Json}

OR

{event1Json},{event2Json}

As a result, a single blob file stores many thousands of events.

The real challenge is reading this data with the DataLakeFileClient class from the Java SDK. Its read() API writes the entire blob file into an OutputStream, which may crash the JVM because of the amount of data being loaded.

The questions I have are:

  1. Is it possible to read the event JSON data line by line from a single blob file without loading the entire blob into an OutputStream?
  2. If #1 is not possible with the DataLakeFileClient Java SDK, do we need to split the events into smaller chunks stored across multiple blob files?

1 answer

  1. Bhargava-MSFT 31,261 Reputation points Microsoft Employee Moderator
    2024-01-11T01:13:33.7+00:00

    Hello Deole, Pushkar (Pushkar), I am not an expert on the Java SDK, but after further research I found the following Microsoft documentation, which covers your requirement: https://learn.microsoft.com/en-us/azure/data-lake-store/data-lake-store-get-started-java-sdk#read-a-file

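    import java.io.BufferedReader;
    import java.io.InputStream;
    import java.io.InputStreamReader;
    import java.nio.charset.StandardCharsets;

    // Read the file line by line; client and filename are set up earlier in the linked sample.
    // try-with-resources closes the reader and the stream even if reading fails.
    try (InputStream in = client.getReadStream(filename);
         BufferedReader reader = new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
        String line;
        // readLine() returns null at end of file
        while ((line = reader.readLine()) != null) {
            System.out.println(line);
        }
    }
    System.out.println();
    System.out.println("File contents read.");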
    

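    The code above reads a single file line by line without loading the whole file into memory: the getReadStream method returns an InputStream over the file, and the BufferedReader pulls one line at a time from it. Note that the linked article documents the Data Lake Storage Gen1 SDK, where client is an ADLStoreClient rather than the DataLakeFileClient mentioned in the question, so the same method may not be available on that class. I hope this helps.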
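The same line-by-line pattern works with any InputStream, so it can be sketched independently of the storage client. The example below is a minimal sketch, not an official sample: the class name LineByLineEvents and the helper forEachEvent are hypothetical, and a ByteArrayInputStream stands in for the blob stream. With the Gen2 DataLakeFileClient, recent versions of azure-storage-file-datalake appear to expose openInputStream(), whose result offers getInputStream(); treat that as an assumption to check against the SDK version in use.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.function.Consumer;

class LineByLineEvents {

    /**
     * Hypothetical helper: passes each non-blank line of the stream to the
     * handler and returns how many lines were handled. Only the reader's
     * internal buffer and the current line are held in memory.
     */
    static long forEachEvent(InputStream in, Consumer<String> handler) {
        long count = 0;
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (line.trim().isEmpty()) {
                    continue; // skip blank separator lines between events
                }
                handler.accept(line); // one JSON event per line
                count++;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return count;
    }

    public static void main(String[] args) {
        // Stand-in for the blob stream (assumption: in real code this would
        // come from the storage client instead of an in-memory byte array).
        String sample = "{\"id\":1}\n\n{\"id\":2}\n";
        InputStream in = new ByteArrayInputStream(sample.getBytes(StandardCharsets.UTF_8));

        long n = forEachEvent(in, event -> System.out.println("event: " + event));
        System.out.println(n + " events read");
    }
}
```

This approach relies on the newline-separated layout from the question. The comma-separated layout has no line breaks to split on, so readLine() would return the whole file as one line, and a streaming JSON parser would be needed instead.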
