Convert File Dataset into a Dataframe to use in a pipeline

MEZIANE Yani 11 Reputation points
2021-09-01T13:57:17.667+00:00

Hello,

I would like to convert a file dataset into a dataframe using a Python script, so I can use the data in a pipeline. I need to use the file dataset because I want to train my model on the files, not on the table.

Thank you!


4 answers

  1. romungi-MSFT 42,191 Reputation points Microsoft Employee
    2021-09-02T10:17:48.457+00:00

    @MEZIANE Yani I think you could try the following to use the FileDataset as a pandas dataframe: download the files, then use them for your experiment's training.

        from azureml.core import Workspace
        from azureml.opendatasets import MNIST
        import pandas as pd
        import os

        ws = Workspace.from_config()  # assumes a workspace config.json is available

        data_folder = os.path.join(os.getcwd(), 'data')
        os.makedirs(data_folder, exist_ok=True)

        # Download the dataset files locally
        mnist_file_dataset = MNIST.get_file_dataset()
        mnist_file_dataset.download(data_folder, overwrite=True)

        # List the file paths in a dataframe
        df = pd.DataFrame(mnist_file_dataset.to_path())
        print(df)

        # Register the dataset for training
        mnist_file_dataset = mnist_file_dataset.register(workspace=ws,
                                                         name='mnist_opendataset',
                                                         description='training and test dataset',
                                                         create_new_version=True)
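
    For what it's worth, FileDataset.download() also returns the list of local paths it wrote, so when the underlying files are delimited text they can be fed straight into pandas. A minimal sketch, assuming CSV content (the MNIST files above are binary, so this part is illustrative only):

        local_paths = mnist_file_dataset.download(data_folder, overwrite=True)
        frames = [pd.read_csv(p) for p in local_paths]   # one dataframe per file
        data = pd.concat(frames, ignore_index=True)      # single combined table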
    
      
    

  2. MEZIANE Yani 11 Reputation points
    2021-09-06T14:45:29.03+00:00

    My aim is to run a pipeline (pre-process data and tune model hyperparameters) that I already built in the designer. Instead of each row of a table being an input, as happens with a tabular dataset, I want each CSV file, which represents one object (its information, with many rows), to be the unit of input, since the random selection per row is inflating the performance of the model. I have the data both as a tabular dataset and as a file dataset. I have managed to get the path of each CSV file, but I cannot read the files into a new dataframe. The data is in a datastore and a dataset, so I don't know whether I should store it elsewhere to accomplish this (I have not been working long with Azure, so I am not acquainted with all the storage possibilities and how they interact with the ML studio).

    I managed to do this in Python with the following code:

        listOfFile = os.listdir(path)
        for file in listOfFile:
            new_well = pd.read_csv(os.path.join(path, file))

    And in Azure this is as far as I have gotten, without result:

        ds = Dataset.get_by_name(ws, name='well files')
        ds.download(data_folder, overwrite=True)
        df = pd.DataFrame(ds.to_path())
        df = dirr + df  # prepend the local download directory to each relative path
        files = pd.DataFrame(df)
        well = map(pd.read_csv, files)
    

    but I cannot use this `well` output as input to the designer pipeline because it is of class `map`.

    Thank you very much for your help. It is greatly appreciated as I really have no clue whatsoever on how to proceed or solve this.
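
    One way to materialize that `map` into concrete dataframes (a sketch; the glob pattern, `data_folder`, and the `well_frames`/`combined` names are illustrative, not from the thread):

        import glob
        import os
        import pandas as pd
        from azureml.core import Dataset, Workspace

        ws = Workspace.from_config()
        data_folder = os.path.join(os.getcwd(), 'data')

        ds = Dataset.get_by_name(ws, name='well files')
        ds.download(data_folder, overwrite=True)

        # One dataframe per CSV file (i.e. one per well/object)
        csv_paths = glob.glob(os.path.join(data_folder, '**', '*.csv'), recursive=True)
        well_frames = [pd.read_csv(p) for p in csv_paths]

        # Or a single table, with each row tagged by its source file
        combined = pd.concat(
            (pd.read_csv(p).assign(source_file=os.path.basename(p)) for p in csv_paths),
            ignore_index=True,
        )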


  3. MEZIANE Yani 11 Reputation points
    2021-09-09T14:19:01.737+00:00

    @romungi-MSFT

    Is there a way to do this with multiple .csv documents?
    I have a folder full of CSV files I need to read. Is there a way to give the path of the folder and have the program read all of the CSV files within that folder?
    There are a lot, so it is not really feasible to do them one by one.
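
    If it helps, glob patterns also work when defining the dataset itself, so the whole folder can be picked up at once. A sketch, assuming the CSVs sit under a 'well_files' folder (a hypothetical name) on the workspace's default datastore:

        from azureml.core import Dataset, Workspace

        ws = Workspace.from_config()
        datastore = ws.get_default_datastore()

        # '**/*.csv' matches every CSV under the folder, recursively
        file_ds = Dataset.File.from_files(path=(datastore, 'well_files/**/*.csv'))
        local_paths = file_ds.download(target_path='data', overwrite=True)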


  4. MEZIANE Yani 11 Reputation points
    2021-09-10T08:24:33.5+00:00

    OK, I managed with this very simple line:

        tabular_dataset_3 = Dataset.Tabular.from_delimited_files(path=(datastore, 'weather/**/*.csv'))
    

    However, I'm afraid this will not help me accomplish my objective: all the files are now in the same tabular dataset, so training would use random selection per row and not per document as I wanted. I need to pre-process the data and split the training and test sets based on the CSV documents, not on a table containing all the data points.
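
    A sketch of that per-document split (the 80/20 ratio and the seed are my own choices; `csv_paths` is the list of local CSV paths from the earlier download step):

        import random
        import pandas as pd

        random.seed(42)
        random.shuffle(csv_paths)  # shuffle documents, not rows

        cut = int(0.8 * len(csv_paths))
        train_files, test_files = csv_paths[:cut], csv_paths[cut:]

        # Every row of a given file lands in exactly one split
        train_df = pd.concat(map(pd.read_csv, train_files), ignore_index=True)
        test_df = pd.concat(map(pd.read_csv, test_files), ignore_index=True)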

    Thank you for your help!
