What I'm trying to achieve:
1. Export a data labelling project as a Dataset
2. Consume the Dataset in a notebook (converting it to a Pandas dataframe)
3. Perform a custom train / test split that maintains particular file groupings
4. Register the resulting training and testing dataframes as Datasets
5. Use these Datasets to train and test a custom object detection model
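For context, the custom split keeps every file of a group on the same side of the split. The sketch below is a simplified illustration of that idea rather than my exact code: the `capture_group` column, the file names, and the `grouped_train_test_split` helper are hypothetical placeholders. scikit-learn's `GroupShuffleSplit` provides similar behaviour.

```python
import numpy as np
import pandas as pd


def grouped_train_test_split(df, group_col, test_fraction=0.2, seed=0):
    """Split df so that all rows sharing a group value land in the same subset."""
    # One entry per distinct group, as a plain NumPy array.
    groups = df[group_col].drop_duplicates().to_numpy()
    # Shuffle the groups (not the rows) reproducibly.
    shuffled = np.random.default_rng(seed).permutation(groups)
    # Hold out at least one whole group for testing.
    n_test = max(1, int(round(len(shuffled) * test_fraction)))
    is_test = df[group_col].isin(shuffled[:n_test])
    train = df[~is_test].reset_index(drop=True)
    test = df[is_test].reset_index(drop=True)
    return train, test


# Toy data: eight images belonging to four hypothetical groups.
frame = pd.DataFrame({
    "image_url": [f"images/img_{i}.jpg" for i in range(8)],
    "capture_group": ["a", "a", "b", "b", "c", "c", "d", "d"],
})
train_dataframe, test_dataframe = grouped_train_test_split(
    frame, "capture_group", test_fraction=0.25, seed=42
)
```

The fraction applies to the number of groups, so the row-level split ratio is only approximate when group sizes differ.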
I need help preparing the data for that final step. I'm familiar with several deep learning libraries, but have never used them in the Azure environment before. I've managed to complete steps 1 to 4. For step 4, I ended up writing the data to CSV files and uploading these to the datastore:
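# imports used below; ws is an existing azureml.core.Workspace object
from azureml.core import Dataset, Datastore
# define path for training data file and create new delimited file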
train_path = './data/train.csv'
train_dataframe.to_csv(train_path, sep = ';', index = False)
# repeat for testing
test_path = './data/test.csv'
test_dataframe.to_csv(test_path, sep = ';', index = False)
# get the datastore to upload prepared data
datastore = Datastore.get(ws, datastore_name='learningdata')
# upload the local files from src_dir to the target_path in datastore
datastore.upload(src_dir='data', target_path='train-test', overwrite=True)
# create and register training dataset from datastore files
training_ds = Dataset.Tabular.from_delimited_files(path = [(datastore, 'train-test/train.csv')], separator=';')
training_ds = training_ds.register(workspace=ws, name = 'train', description = 'training dataset sampled from labelled data', create_new_version=True)
# create and register testing dataset from datastore files
testing_ds = Dataset.Tabular.from_delimited_files(path = [(datastore, 'train-test/test.csv')], separator=';')
testing_ds = testing_ds.register(workspace=ws, name = 'test', description = 'testing dataset sampled from labelled data', create_new_version=True)
For step 5, I intended to use to_torchvision() to convert the Dataset into a Torchvision dataset. This doesn't work; I receive the following error:
UserErrorException: UserErrorException:
Message: Cannot perform torchvision conversion on dataset without labeled columns defined
InnerException None
ErrorResponse
{
"error": {
"code": "UserError",
"message": "Cannot perform torchvision conversion on dataset without labeled columns defined"
}
}
I suspect that the issue has to do with DataTypes. The original Dataset (exported from the data labelling project) has the DataTypes displayed below. By comparison, all column types in the train and test Datasets are parsed as strings. From my understanding, there's no way to convert to these data types.
- image_url = Stream
- label = List
- label_confidence = List
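To illustrate what I mean by the types being lost, here is a minimal sketch with made-up values (the column names mirror the exported Dataset, everything else is a toy stand-in). A round trip through a `;`-delimited CSV turns the list columns into strings. On the pandas side they can be parsed back with `ast.literal_eval`, but as far as I can tell that does not restore whatever labelled-column information `to_torchvision()` expects on the registered Dataset.

```python
import ast
import io

import pandas as pd

# Toy stand-in for the labelled data; values are invented for illustration.
original = pd.DataFrame({
    "image_url": ["images/a.jpg", "images/b.jpg"],
    "label": [[{"label": "cat"}], [{"label": "dog"}, {"label": "cat"}]],
    "label_confidence": [[1.0], [1.0, 0.9]],
})

# Round-trip through a ';'-delimited CSV held in memory.
buffer = io.StringIO()
original.to_csv(buffer, sep=";", index=False)
buffer.seek(0)
restored = pd.read_csv(buffer, sep=";")

# After the round trip the list columns come back as plain strings.
label_type_after_csv = type(restored.loc[0, "label"])

# literal_eval rebuilds the Python lists from their string representation.
for column in ("label", "label_confidence"):
    restored[column] = restored[column].apply(ast.literal_eval)
label_type_parsed = type(restored.loc[0, "label"])
```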
Any advice on how to prepare this dataset for use in PyTorch or recommendation for an alternative approach would be greatly appreciated.
Update as per comment below:
- I'm currently mounting the Dataset rather than downloading it, due to the data size.
- I can view images from the originally mounted Dataset, but when loading the newly registered training Dataset I can't access the images, because the stored paths (e.g. '/tmp/tmpog809x4v/[...].jpg') point to the old temporary mount location and are no longer valid.
- I can't perform a random split because I'm using clustered sampling.
- I'm working on creating a class to define the dataset, but I cannot currently create the PIL Image object required by PyTorch (https://pytorch.org/tutorials/intermediate/torchvision_tutorial.html#defining-the-dataset).
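The direction I'm considering for that class is sketched below, under two assumptions I have not verified: that I can store each image path relative to the mount root instead of the absolute temporary path, and that the Dataset can be mounted again at training time. `MountedImageDataset`, `strip_mount_root`, and all paths are hypothetical names. A real version would subclass `torch.utils.data.Dataset`; that is left out here so the sketch has no PyTorch dependency.

```python
import os


def default_loader(path):
    # Imported lazily so the class can be exercised without Pillow installed.
    from PIL import Image
    with Image.open(path) as img:
        return img.convert("RGB")


def strip_mount_root(absolute_path, old_mount_root):
    """Turn an absolute path under a mount point into a mount-relative path."""
    return os.path.relpath(absolute_path, old_mount_root)


class MountedImageDataset:
    """Map-style dataset exposing __len__ and __getitem__."""

    def __init__(self, relative_paths, targets, mount_root, loader=default_loader):
        if len(relative_paths) != len(targets):
            raise ValueError("relative_paths and targets must have the same length")
        self.relative_paths = list(relative_paths)
        self.targets = list(targets)
        self.mount_root = mount_root
        self.loader = loader

    def __len__(self):
        return len(self.relative_paths)

    def __getitem__(self, index):
        # Resolve the path against the current mount point at access time.
        path = os.path.join(self.mount_root, self.relative_paths[index])
        return self.loader(path), self.targets[index]
```

Resolving paths only inside `__getitem__` means the same registered table could be reused with a different mount point in each run.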