Preparing ML object detection dataset for deep learning in PyTorch or similar

Joe Duncan 106 Reputation points
2020-12-10T02:30:44.3+00:00

The intent of what I'm trying to achieve is:

  1. Export data labelling project as a Dataset
  2. Consume the Dataset in a notebook (converting to a Pandas dataframe)
  3. Perform a custom train / test split that maintains particular file groupings
  4. Register the resulting training and testing dataframes as Datasets
  5. Use these Datasets to train and test a custom object detection model

I need help preparing the data for that final step. I'm familiar with several deep learning libraries, but have never implemented them in the Azure environment before. I've managed to complete steps 1 to 4. For step 4, I ended up writing the data to CSV files and uploading these to the datastore:

# imports (ws is the Workspace object defined earlier in the notebook)
from azureml.core import Dataset, Datastore

# define path for training data file and create new delimited file
train_path = './data/train.csv'
train_dataframe.to_csv(train_path, sep=';', index=False)

# repeat for testing
test_path = './data/test.csv'
test_dataframe.to_csv(test_path, sep=';', index=False)

# get the datastore to upload prepared data
datastore = Datastore.get(ws, datastore_name='learningdata')

# upload the local files from src_dir to the target_path in datastore
datastore.upload(src_dir='data', target_path='train-test', overwrite=True)

# create and register training dataset from datastore files
training_ds = Dataset.Tabular.from_delimited_files(
    path=[(datastore, 'train-test/train.csv')], separator=';')
training_ds = training_ds.register(
    workspace=ws, name='train',
    description='training dataset sampled from labelled data',
    create_new_version=True)

# create and register testing dataset from datastore files
testing_ds = Dataset.Tabular.from_delimited_files(
    path=[(datastore, 'train-test/test.csv')], separator=';')
testing_ds = testing_ds.register(
    workspace=ws, name='test',
    description='testing dataset sampled from labelled data',
    create_new_version=True)

The approach I intended to use for step 5 was to call to_torchvision() to convert the registered Dataset into a Torchvision dataset. This doesn't work; I receive the following error:

UserErrorException: UserErrorException:
 Message: Cannot perform torchvision conversion on dataset without labeled columns defined
 InnerException None
 ErrorResponse 
{
    "error": {
        "code": "UserError",
        "message": "Cannot perform torchvision conversion on dataset without labeled columns defined"
    }
}

I suspect that the issue has to do with DataTypes. The original Dataset (exported from the data labelling project) has the DataTypes listed below. By comparison, every column in the train and test Datasets is parsed as a string, and from my understanding there's no way to convert back to these data types (see the sketch after the list).

  • image_url = Stream
  • label = List
  • label_confidence = List
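
For completeness, the closest thing I've found is the set_column_types argument of from_delimited_files. The sketch below assumes DataType.to_stream() can re-declare image_url as a Stream when re-creating the dataset from CSV; even if it can, it doesn't appear to restore the labelled-column metadata that to_torchvision() complains about, and the label column would still need parsing from its string form.

from azureml.core import Dataset
from azureml.data.dataset_factory import DataType

# sketch only: try to restore the Stream type for image_url when
# reading the CSV back; DataType.to_stream(ws) is my assumption here
column_types = {'image_url': DataType.to_stream(ws)}
training_ds = Dataset.Tabular.from_delimited_files(
    path=[(datastore, 'train-test/train.csv')],
    separator=';',
    set_column_types=column_types)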

Any advice on how to prepare this dataset for use in PyTorch or recommendation for an alternative approach would be greatly appreciated.


Update as per comment below:

  • I'm currently mounting the dataframe rather than downloading it due to data size.
  • I can view images from the originally mounted Dataset, but when loading the newly registered training Dataset I can't access the images, as the '/tmp/tmpog809x4v/[...].jpg' mount path is no longer valid.
  • I can't perform random split because I'm using clustered sampling.
  • I'm working on creating a class object to define the dataset, but I cannot currently create the PIL Image object as required by PyTorch (https://pytorch.org/tutorials/intermediate/torchvision_tutorial.html#defining-the-dataset). A rough sketch of what I'm attempting is below.
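
For reference, this is the shape of the dataset class I'm working towards; a minimal sketch in which the column names and the relative-path handling are placeholders:

import os
from PIL import Image
from torch.utils.data import Dataset as TorchDataset

class MountedImageDataset(TorchDataset):
    """Rough sketch: wraps (image_url, label) rows and resolves images
    against a live mount point instead of a stale /tmp path."""
    def __init__(self, dataframe, mount_point, transforms=None):
        self.df = dataframe.reset_index(drop=True)
        self.mount_point = mount_point  # e.g. from mount_context.mount_point
        self.transforms = transforms

    def __len__(self):
        return len(self.df)

    def __getitem__(self, idx):
        row = self.df.iloc[idx]
        rel_path = str(row['image_url']).lstrip('/')  # placeholder path handling
        img = Image.open(os.path.join(self.mount_point, rel_path)).convert('RGB')
        if self.transforms is not None:
            img = self.transforms(img)
        return img, row['label']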

Accepted answer
  Joe Duncan 106 Reputation points
    2020-12-15T22:30:13.793+00:00

    I modified the methodology and was able to successfully resolve this issue as follows:

    1. Export data labelling project as Dataset
    2. Consume the Dataset in the notebook by creating both a PyTorch dataset and a Pandas dataframe
    3. Use the Pandas dataframe to determine indices for the train / test split based on required sampling
    4. Use the indices as an input to torch.utils.data.Subset() to split the PyTorch dataset into train and test
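
    In code, the flow looks roughly like this; a minimal sketch in which the dataset name is a placeholder and grouping by parent folder stands in for the actual clustered sampling:

    from azureml.core import Workspace, Dataset
    import azureml.contrib.dataset  # enables to_torchvision() on labeled datasets
    from torch.utils.data import Subset

    ws = Workspace.from_config()
    labeled_ds = Dataset.get_by_name(ws, name='labelling-project-export')  # placeholder name

    # step 2: two views of the same labelled data
    torch_ds = labeled_ds.to_torchvision()
    df = labeled_ds.to_pandas_dataframe()

    # step 3: pick training groups so that file groupings stay together;
    # grouping by each image's parent folder is illustrative only
    groups = df['image_url'].astype(str).str.rsplit('/', n=1).str[0]
    train_groups = groups.drop_duplicates().sample(frac=0.8, random_state=0)
    train_mask = groups.isin(train_groups)

    # step 4: the same row indices address both views of the data
    train_ds = Subset(torch_ds, df.index[train_mask].tolist())
    test_ds = Subset(torch_ds, df.index[~train_mask].tolist())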

1 additional answer

  Ramr-msft 17,821 Reputation points
    2020-12-10T13:06:31.287+00:00

    @Joe Duncan Thanks for the great question. This is an end-to-end image detection scenario that leverages training/test datasets created from a Data Labeling project. As you're well aware, you could also ‘solve’ this problem with Custom Vision, but I'd like to showcase how a computer vision problem that Custom Vision may not handle well enough can be handled easily with Azure ML, with full control of the underlying ML algorithms and the power of Data Labeling.

    The best practice for getting back to the images referenced by the dataset is to leverage the Datastore / StreamInfo metadata in the DataFrame extracted from the TabularDataset, and to use that to prepare the data for model training.

    The code below, which I put together, shows one way to retrieve the original image assets from a labeled TabularDataset.

    # azureml-core of version 1.0.72 or higher is required
    # azureml-contrib-dataset of version 1.0.72 or higher is required

    from azureml.core import Workspace, Dataset, Datastore
    import azureml.contrib.dataset
    import azureml.dataprep.native

    subscription_id = '_set_it_to_yours_'
    resource_group = '_set_it_to_yours_'
    workspace_name = '_set_it_to_yours_'

    workspace = Workspace(subscription_id, resource_group, workspace_name)

    # get dataset and extract as a DataFrame
    ds = Dataset.get_by_name(workspace, name='_set_it_to_yours_')
    df = ds.to_pandas_dataframe()

    # download images
    datastore = None
    for index in range(len(df)):
        # image_url is an azureml.dataprep.native.StreamInfo object; convert to dict with to_pod()
        si = df.loc[index].image_url.to_pod()
        if datastore is None:
            # retrieve datastore based on metadata from the first row,
            # assuming all images come from the same store
            # since they come from a single dataset
            datastore = Datastore.get(workspace, si['arguments']['datastoreName'])
        # download image locally
        datastore.download(target_path='.', prefix=si['resourceIdentifier'],
                           overwrite=True, show_progress=True)

    # create training and test sets
    [training, test] = ds.random_split(0.8)

    From there, build your model based on the image assets and labels: construct your train_x, train_y and test_x, test_y datasets…
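
    As an illustration of that last step, here is a sketch that assumes the label column holds the annotations from the labeling project and that the download above mirrored each resourceIdentifier into the working directory:

    from PIL import Image

    def to_xy(frame):
        # pair each locally downloaded image with its annotations
        xs, ys = [], []
        for _, row in frame.iterrows():
            # resourceIdentifier was the prefix used for the download above
            local_path = row.image_url.to_pod()['resourceIdentifier']
            xs.append(Image.open(local_path).convert('RGB'))
            ys.append(row.label)
        return xs, ys

    train_x, train_y = to_xy(training.to_pandas_dataframe())
    test_x, test_y = to_xy(test.to_pandas_dataframe())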

    We have checked a sample notebook about labeled datasets into the public GitHub repo. You can find it here:
    https://github.com/Azure/MachineLearningNotebooks/blob/master/how-to-use-azureml/work-with-data/datasets-tutorial/labeled-datasets/labeled-datasets.ipynb

