Preparing ML object detection dataset for deep learning in PyTorch or similar

Joe Duncan 106 Reputation points
2020-12-10T02:30:44.3+00:00

The intent of what I'm trying to achieve is:

  1. Export data labelling project as a Dataset
  2. Consume the Dataset in a notebook (converting to a Pandas dataframe)
  3. Perform a custom train / test split that maintains particular file groupings
  4. Register the resulting training and testing dataframes as Datasets
  5. Use these Datasets to train and test a custom object detection model

I need help preparing the data for that final step. I'm familiar with several deep learning libraries, but have never implemented them in the Azure environment before. I've managed to complete steps 1 to 4. For step 4, I ended up writing the data to CSV files and uploading these to the datastore:

# imports (ws is the Workspace object defined earlier in the notebook)
from azureml.core import Dataset, Datastore

# define path for training data file and create new delimited file
train_path = './data/train.csv'
train_dataframe.to_csv(train_path, sep=';', index=False)

# repeat for testing
test_path = './data/test.csv'
test_dataframe.to_csv(test_path, sep=';', index=False)

# get the datastore to upload prepared data
datastore = Datastore.get(ws, datastore_name='learningdata')

# upload the local files from src_dir to the target_path in datastore
datastore.upload(src_dir='data', target_path='train-test', overwrite=True)

# create and register training dataset from datastore files
training_ds = Dataset.Tabular.from_delimited_files(
    path=[(datastore, 'train-test/train.csv')], separator=';')
training_ds = training_ds.register(
    workspace=ws, name='train',
    description='training dataset sampled from labelled data',
    create_new_version=True)

# create and register testing dataset from datastore files
testing_ds = Dataset.Tabular.from_delimited_files(
    path=[(datastore, 'train-test/test.csv')], separator=';')
testing_ds = testing_ds.register(
    workspace=ws, name='test',
    description='testing dataset sampled from labelled data',
    create_new_version=True)

The approach I intended to use for step 5 was to call to_torchvision() to convert the registered Dataset into a Torchvision dataset. This doesn't work; I receive the following error:

UserErrorException: UserErrorException:
 Message: Cannot perform torchvision conversion on dataset without labeled columns defined
 InnerException None
 ErrorResponse 
{
    "error": {
        "code": "UserError",
        "message": "Cannot perform torchvision conversion on dataset without labeled columns defined"
    }
}

I suspect that the issue has to do with DataTypes. The original Dataset (exported from the data labelling project) has the DataTypes listed below. By comparison, every column in the train and test Datasets is parsed as a string, and from my understanding there's no way to convert back to these data types (see the sketch after the list).

  • image_url = Stream
  • label = List
  • label_confidence = List
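
For completeness, the closest thing I've found is the set_column_types argument of from_delimited_files. The sketch below assumes DataType.to_stream() can re-declare image_url as a Stream when re-creating the dataset from CSV; even if it can, it doesn't appear to restore the labelled-column metadata that to_torchvision() complains about, and the label column would still need parsing from its string form.

from azureml.core import Dataset
from azureml.data.dataset_factory import DataType

# sketch only: try to restore the Stream type for image_url when
# reading the CSV back; DataType.to_stream(ws) is my assumption here
column_types = {'image_url': DataType.to_stream(ws)}
training_ds = Dataset.Tabular.from_delimited_files(
    path=[(datastore, 'train-test/train.csv')],
    separator=';',
    set_column_types=column_types)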

Any advice on how to prepare this dataset for use in PyTorch or recommendation for an alternative approach would be greatly appreciated.


Update as per comment below:

  • I'm currently mounting the dataframe rather than downloading it due to data size.
  • I can view images from the originally mounted Dataset, but when loading the newly registered training Dataset I can't access the images, as the '/tmp/tmpog809x4v/[...].jpg' mount path is no longer valid.
  • I can't perform random split because I'm using clustered sampling.
  • I'm working on creating a class object to define the dataset, but I cannot currently create the PIL Image object as required by PyTorch (https://pytorch.org/tutorials/intermediate/torchvision_tutorial.html#defining-the-dataset). A rough sketch of what I'm attempting is below.
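
For reference, this is the shape of the dataset class I'm working towards; a minimal sketch in which the column names and the relative-path handling are placeholders:

import os
from PIL import Image
from torch.utils.data import Dataset as TorchDataset

class MountedImageDataset(TorchDataset):
    """Rough sketch: wraps (image_url, label) rows and resolves images
    against a live mount point instead of a stale /tmp path."""
    def __init__(self, dataframe, mount_point, transforms=None):
        self.df = dataframe.reset_index(drop=True)
        self.mount_point = mount_point  # e.g. from mount_context.mount_point
        self.transforms = transforms

    def __len__(self):
        return len(self.df)

    def __getitem__(self, idx):
        row = self.df.iloc[idx]
        rel_path = str(row['image_url']).lstrip('/')  # placeholder path handling
        img = Image.open(os.path.join(self.mount_point, rel_path)).convert('RGB')
        if self.transforms is not None:
            img = self.transforms(img)
        return img, row['label']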

Accepted answer
  Joe Duncan 106 Reputation points
    2020-12-15T22:30:13.793+00:00

    I modified the methodology and was able to successfully resolve this issue as follows:

    1. Export data labelling project as Dataset
    2. Consume the Dataset in the notebook by creating both a PyTorch dataset and a Pandas dataframe
    3. Use the Pandas dataframe to determine indices for the train / test split based on required sampling
    4. Use the indices as an input to torch.utils.data.Subset() to split the PyTorch dataset into train and test
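
    In code, the flow looks roughly like this; a minimal sketch in which the dataset name is a placeholder and grouping by parent folder stands in for the actual clustered sampling:

    from azureml.core import Workspace, Dataset
    import azureml.contrib.dataset  # enables to_torchvision() on labeled datasets
    from torch.utils.data import Subset

    ws = Workspace.from_config()
    labeled_ds = Dataset.get_by_name(ws, name='labelling-project-export')  # placeholder name

    # step 2: two views of the same labelled data
    torch_ds = labeled_ds.to_torchvision()
    df = labeled_ds.to_pandas_dataframe()

    # step 3: pick training groups so that file groupings stay together;
    # grouping by each image's parent folder is illustrative only
    groups = df['image_url'].astype(str).str.rsplit('/', n=1).str[0]
    train_groups = groups.drop_duplicates().sample(frac=0.8, random_state=0)
    train_mask = groups.isin(train_groups)

    # step 4: the same row indices address both views of the data
    train_ds = Subset(torch_ds, df.index[train_mask].tolist())
    test_ds = Subset(torch_ds, df.index[~train_mask].tolist())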

1 additional answer

  Ramr-msft 17,821 Reputation points
    2020-12-10T13:06:31.287+00:00

    @Joe Duncan Thanks for the great question. This is an end-to-end image detection scenario that leverages training/test datasets created from a Data Labeling project. As you're well aware, you could also ‘solve’ this problem with Custom Vision, but I'd like to showcase how a computer vision problem that Custom Vision may not handle well enough can be handled easily with Azure ML, with full control of the underlying ML algorithms and the power of Data Labeling.

    The best practice for getting back to the images referenced by the dataset is to leverage the Datastore / StreamInfo metadata in the DataFrame extracted from the TabularDataset, and to use that to prepare the data for model training.

    The code below, which I put together, shows one way to retrieve the original image assets from a labeled TabularDataset.

    # azureml-core of version 1.0.72 or higher is required
    # azureml-contrib-dataset of version 1.0.72 or higher is required

    from azureml.core import Workspace, Dataset, Datastore
    import azureml.contrib.dataset
    import azureml.dataprep.native

    subscription_id = '_set_it_to_yours_'
    resource_group = '_set_it_to_yours_'
    workspace_name = '_set_it_to_yours_'

    workspace = Workspace(subscription_id, resource_group, workspace_name)

    # get dataset and extract as a DataFrame
    ds = Dataset.get_by_name(workspace, name='_set_it_to_yours_')
    df = ds.to_pandas_dataframe()

    # download images
    datastore = None
    for index in range(len(df)):
        # image_url is an azureml.dataprep.native.StreamInfo object; convert to dict with to_pod()
        si = df.loc[index].image_url.to_pod()
        if datastore is None:
            # retrieve datastore based on metadata from the first row,
            # assuming all images come from the same store
            # since they come from a single dataset
            datastore = Datastore.get(workspace, si['arguments']['datastoreName'])
        # download image locally
        datastore.download(target_path='.', prefix=si['resourceIdentifier'],
                           overwrite=True, show_progress=True)

    # create training and test sets
    [training, test] = ds.random_split(0.8)

    From there, build your model based on the image assets and labels: construct your train_x, train_y and test_x, test_y datasets…
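
    As an illustration of that last step, here is a sketch that assumes the label column holds the annotations from the labeling project and that the download above mirrored each resourceIdentifier into the working directory:

    from PIL import Image

    def to_xy(frame):
        # pair each locally downloaded image with its annotations
        xs, ys = [], []
        for _, row in frame.iterrows():
            # resourceIdentifier was the prefix used for the download above
            local_path = row.image_url.to_pod()['resourceIdentifier']
            xs.append(Image.open(local_path).convert('RGB'))
            ys.append(row.label)
        return xs, ys

    train_x, train_y = to_xy(training.to_pandas_dataframe())
    test_x, test_y = to_xy(test.to_pandas_dataframe())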

    We have checked a sample notebook about labeled datasets into the public GitHub repo. You can find it here:
    https://github.com/Azure/MachineLearningNotebooks/blob/master/how-to-use-azureml/work-with-data/datasets-tutorial/labeled-datasets/labeled-datasets.ipynb

