Working with AzureMachineLearningFileSystem and binary files

Jo Walsh 6 Reputation points
2022-11-21T15:04:04.797+00:00

The guide at https://learn.microsoft.com/en-gb/azure/machine-learning/how-to-access-data-interactive?tabs=adls describes the recommended route in the v2 API for interactive / exploratory data access: rather than calling mount() on a FileDataset object, you use this new filesystem-like interface. So far this works:

   from azureml.fsspec import AzureMachineLearningFileSystem

   uri = f'azureml://subscriptions/{subscription}/resourcegroups/{resource_group}/workspaces/{workspace}/datastores/{datastore_name}/paths/{path_on_datastore}'

   fs = AzureMachineLearningFileSystem(uri)
   fs.ls()

fs.open('path/to/file.tif') returns a pystreaminfo_companion.StreamInfoFileObject, which behaves like io.BytesIO and apparently has no documentation anywhere online.

In this case we are trying to work with the data using the rasterio Python package, which accepts a Python file object or a path as input. This doesn't work; it throws a read buffer error:

   raster_data = rasterio.open(fs.open('path/to/image.tif'))  
   img_arr = raster_data.read()  

As a short-term workaround we can read the byte stream into a rasterio MemoryFile object, but that is inefficient, since the files could be very large.

Downloading a local copy is not an option either: fs.get('path/to/image.tif', 'local_path.tif') throws a NotImplementedError.
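Since fs.get() is not available, one possible stopgap is to copy the stream to a local temporary file in fixed-size chunks and then hand the path to rasterio.open(). The sketch below uses only the standard library. The helper name copy_stream_to_tempfile and the chunk size are made up for illustration, and an io.BytesIO stands in for the object returned by fs.open(). It assumes that sequential read() calls on that object are reliable, which this thread does not confirm for large files.

```python
import io
import os
import shutil
import tempfile


def copy_stream_to_tempfile(src, suffix=".tif", chunk_size=8 * 1024 * 1024):
    """Copy a readable binary file-like object to a local temp file in chunks.

    Hypothetical helper: returns the local path; the caller deletes the file.
    """
    fd, local_path = tempfile.mkstemp(suffix=suffix)
    with os.fdopen(fd, "wb") as dst:
        # copyfileobj moves at most chunk_size bytes per iteration, so memory
        # use stays bounded however large the source file is.
        shutil.copyfileobj(src, dst, chunk_size)
    return local_path


# Stand-in for fs.open('path/to/image.tif'); any object with read() works.
fake_remote = io.BytesIO(b"II*\x00" + bytes(1_000_000))
path = copy_stream_to_tempfile(fake_remote, chunk_size=64 * 1024)
print(os.path.getsize(path))
os.remove(path)
```

With the real filesystem object, the returned path could then be passed to rasterio.open(path), at the cost of local disk space instead of memory.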

We know this interface is only in public preview but is it cooked? Is it mainly a documentation problem?

Azure Machine Learning

1 answer

  1. Jo Walsh 6 Reputation points
    2022-11-21T17:14:03.41+00:00

    What we found when digging into this further: I was reading relatively large files (>100 MB) from Data Lake Gen2, and read() failed with a buffer size error like this:

       ERROR 1: TIFFReadEncodedStrip:Read error at scanline 4294967295; got 4600 bytes, expected 8000  
       ERROR 1: TIFFReadEncodedStrip() failed.  
       ERROR 1: /vsipythonfilelike/6c4028d8-2a05-4b4a-95fd-998f4395afb7/6c4028d8-2a05-4b4a-95fd-998f4395afb7, band 1: IReadBlock failed at X offset 0, Y offset  
    

    My colleague tried it on tiny files in blob storage, and there the StreamInfoFileObject worked with rasterio as you'd hope and expect.
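One possible reading of "got 4600 bytes, expected 8000" is that read(n) on the remote stream sometimes returns fewer bytes than requested before the end of the file. That is an assumption, not something confirmed in this thread. If it holds, a thin wrapper that keeps reading until it has n bytes might be worth trying before passing the object to rasterio. The names FullReader and ShortReadStream below are hypothetical, and ShortReadStream only simulates the suspected behaviour.

```python
import io


class FullReader:
    """Wrap a binary stream so read(n) returns n bytes unless EOF is reached."""

    def __init__(self, raw):
        self._raw = raw

    def read(self, size=-1):
        chunks = []
        if size is None or size < 0:
            # Read to EOF, looping in case the stream returns partial data.
            while True:
                chunk = self._raw.read(1024 * 1024)
                if not chunk:
                    break
                chunks.append(chunk)
            return b"".join(chunks)
        remaining = size
        while remaining > 0:
            chunk = self._raw.read(remaining)
            if not chunk:  # an empty result means EOF
                break
            chunks.append(chunk)
            remaining -= len(chunk)
        return b"".join(chunks)

    def seek(self, offset, whence=io.SEEK_SET):
        return self._raw.seek(offset, whence)

    def tell(self):
        return self._raw.tell()


class ShortReadStream(io.BytesIO):
    """Stand-in stream that never returns more than 4600 bytes per read()."""

    def read(self, size=-1):
        if size is None or size < 0 or size > 4600:
            size = 4600
        return super().read(size)


stream = FullReader(ShortReadStream(bytes(20000)))
print(len(stream.read(8000)))
```

If the wrapped object still fails with rasterio, short reads are probably not the cause and the problem lies elsewhere in the stream implementation.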

