Run a pipeline using csv files from a folder in the datastore

MEZIANE Yani 11 Reputation points
2021-09-01T11:11:12.977+00:00

I want to run a model using as input the CSV files in a folder (UI/date) in the default datastore. I want the model to train on the CSV files and to pick one of them at random, as each file represents an object to be randomly selected.

I already have the pipeline I want to use built in the designer; it's just that I want to run it with the files from the datastore and not from a tabular dataset. I have tried to access this folder from a Python script using os.listdir and then read_csv, but the path to the folder doesn't seem to be valid. I have done the same thing in Python using the path of a folder on my computer and it works, but I don't know how to make it work in Azure.
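Roughly what I tried in the script (illustrative; the same loop works when 'path' points to a folder on my computer):

    import os
    import pandas as pd

    # Fails when 'path' is the datastore folder, works with a local folder
    path = 'UI/date'
    for file in os.listdir(path):
        new_well = pd.read_csv(os.path.join(path, file))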

Thank you for your help.

Azure Machine Learning

2 answers

  1. MEZIANE Yani 11 Reputation points
    2021-09-15T12:34:47.89+00:00

    Hello,

    I have managed to pre-process the data on the files with the following code:

        from azureml.core import Run, Datastore, Dataset

        # Get the workspace from the submitted run's context
        run = Run.get_context()
        ws = run.experiment.workspace
        datastore = Datastore.get(ws, 'workspaceblobstore')

        # Load every CSV under the folder into one tabular dataset
        data_paths = [(datastore, 'UI/08-26-2021_014718_UTC/**/*.csv')]
        tabular = Dataset.Tabular.from_delimited_files(path=data_paths)
        dataframe1 = tabular.to_pandas_dataframe()

    This way I can modify and clean the data as necessary. However, this is the same as creating a tabular dataset, which takes random rows for the training of the model (random selection per frame), while I need to train according to the CSV files (random selection per well/file). Again, that is very simple in Python, but I have yet to manage it in Azure, especially since my workflow is already built in the designer (where the data is pre-processed, trained with hyper-parameter tuning, and evaluated).

    The code from python I want to recreate:

        # One DataFrame per CSV file (per well), not one pooled frame of rows
        for file in listOfFile:
            new_well = pd.read_csv(os.path.join(path, file))

    So I can train with new_well, where each DataFrame represents one CSV file.
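    What I believe the Azure equivalent could look like, though I have not verified it (a sketch using a FileDataset; the folder path is taken from my snippet above, and the random pick of one well is my assumption of the final step):

        import random
        import pandas as pd
        from azureml.core import Run, Datastore, Dataset

        run = Run.get_context()
        ws = run.experiment.workspace
        datastore = Datastore.get(ws, 'workspaceblobstore')

        # A FileDataset keeps the file boundaries, unlike a TabularDataset
        files = Dataset.File.from_files(path=[(datastore, 'UI/08-26-2021_014718_UTC/**/*.csv')])

        # Download the CSVs inside the run, then read one well/file at random
        local_paths = files.download(target_path='./wells', overwrite=True)
        new_well = pd.read_csv(random.choice(local_paths))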

    I am attaching an example of the CSV files I have to process (in total I have over 2,000 files).

    132329-capture.png

    1 person found this answer helpful.

  2. Ramr-msft 17,731 Reputation points
    2021-09-03T02:18:56.367+00:00

    @MEZIANE Yani Thanks for the question. Can you please share the code that you are trying? Create a FileDataset referencing the root folder, mount the FileDataset on a compute instance, and use pandas to read each file from the mounted path. Alternatively, if you're trying to read data into a Pandas dataframe, you can do so directly with Pandas from Azure storage, including Blob, ADLSv1, and ADLSv2. Every pandas.read_* function accepts storage_options; for instance, see pandas.read_table — pandas 1.2.1 documentation (pydata.org).
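    For the FileDataset route, a minimal sketch, assuming the folder layout from your question (untested):

        import glob
        import os
        import pandas as pd
        from azureml.core import Workspace, Dataset

        ws = Workspace.from_config()
        datastore = ws.get_default_datastore()

        # Reference every CSV under the folder as a FileDataset
        files = Dataset.File.from_files(path=[(datastore, 'UI/date/**/*.csv')])

        # Mount on the compute instance and read each file with pandas
        # (mount() needs a Linux compute with azureml-dataprep[fuse] installed)
        mount_context = files.mount()
        mount_context.start()
        for csv_path in glob.glob(os.path.join(mount_context.mount_point, '**', '*.csv'), recursive=True):
            df = pd.read_csv(csv_path)
        mount_context.stop()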
    For the direct Pandas route, you can typically retrieve the storage options from your Azure ML Datastore, e.g. for the default datastore:

        from azureml.core import Workspace
        import pandas as pd

        ws = Workspace.from_config()
        ds = ws.get_default_datastore()  # or ws.datastores["my-datastore-name"]

        # Credentials that Pandas/fsspec will use to reach the blob storage
        storage_options = {"account_name": ds.account_name, "account_key": ds.account_key}

        data_path = "az://mycontainer/path/to/data.csv"

        df = pd.read_csv(data_path, storage_options=storage_options)


    If you want to list each file in the storage account and read them sequentially into Pandas, you can easily do that as well. You'll need to adjust the code for ADLSv1 (change the storage options and the protocol to "adl").
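    For example, a sketch using the adlfs package that implements the "az://" protocol for fsspec/Pandas (container and folder names are illustrative):

        import pandas as pd
        from adlfs import AzureBlobFileSystem

        # Reuse the storage_options dict from the snippet above
        fs = AzureBlobFileSystem(**storage_options)

        # List every CSV in the folder, then read each one into its own DataFrame
        for path in fs.glob("mycontainer/path/to/*.csv"):
            df = pd.read_csv(f"az://{path}", storage_options=storage_options)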