How can I transfer a CSV file from an Azure Machine Learning compute instance directory back to the Datastore?

Adrian Antico (TEKsystems, Inc.) 51 Reputation points
2021-09-20T17:24:01.093+00:00

I posted a similar question last week and haven't received a response yet, so I'm posting another one now.

The code below is what I use to pull data from the Datastore into the compute instance, where I save it to my directory as a CSV. The data originates from a SCOPE script and is transferred from Cosmos to the Datastore via Azure Data Factory.

Once the data is in the directory as a CSV, I use R to load it into an RStudio session and run various tasks that create new data sets, which I also save to the compute instance directory as CSVs. These new data sets are the ones I'd like to push back to the Datastore so they can be transferred elsewhere via Azure Data Factory and later consumed by a Power BI app we're looking to create.

I tried using Designer and it ran for 4 days without completing before I cancelled the job and started looking for an alternative route. I don't know whether it would eventually have completed or whether it ran into memory issues and simply never failed. Pulling data into the compute instance from the Datastore takes less than a few minutes, so I'm not sure why Designer would need multiple days to attempt the reverse operation.

I've looked through a lot of documentation and can't find anything that explains how to transfer data from the compute instance back to the Datastore, aside from Designer, which is either too slow or unable to handle the task.

This task seems like it should be straightforward and a major selling point of Azure Machine Learning, so I'm a bit dumbfounded that figuring out how to do it is a challenge and that the documentation doesn't clearly show users how to achieve it, assuming it's even possible. If it's not possible, I need to figure out a whole new system to get my work done, and the Azure Machine Learning team should enable this functionality as soon as possible.

# Azure management
from azureml.core import Workspace, Dataset

# MetaData
subscription_id = '09b5fdb3-165d-4e2b-8ca0-34f998d176d5'
resource_group = 'xCloudData'
workspace_name = 'xCloudML'

# Connect to the existing workspace
workspace = Workspace(subscription_id, resource_group, workspace_name)

# 1. Retention_Engagement_CombinedData
dataset = Dataset.get_by_name(workspace, name='retention-engagement-combineddata')

# Save data to file
df = dataset.to_pandas_dataframe()
df.to_csv('/mnt/batch/tasks/shared/LS_root/mounts/clusters/v-aantico1/code/RetentionEngagement_CombinedData.csv')

# 2. TitleNameJoin
dataset = Dataset.get_by_name(workspace, name='TitleForJoiningInR')

# Save data to file
df = dataset.to_pandas_dataframe()
df.to_csv('/mnt/batch/tasks/shared/LS_root/mounts/clusters/v-aantico1/code/TitleNameJoin.csv')
Accepted answer
romungi-MSFT 43,696 Reputation points Microsoft Employee
2021-09-21T08:00:14.827+00:00

@Adrian Antico (TEKsystems, Inc.) Have you tried the following to upload data to your datastore?

from azureml.core import Workspace

# from_config() reads the config.json that is pre-populated on a compute instance
ws = Workspace.from_config()
datastore = ws.get_default_datastore()

# Upload everything in the local ./data directory to datasets/ in the datastore
datastore.upload(src_dir='./data',
                 target_path='datasets/',
                 overwrite=True)


I think datastore.upload() should work for you to upload the required data files from your compute instance to the datastore.

