Share via


Tutorial: Prepare Data Using Microsoft.DataPrep .Net SDK

In this tutorial, you will

  • Download the Visual Studio project with sample codes.
  • Load two datasets with different field names.
  • Cleanse data to remove anomalies.
  • Merge and write datasets for further machine learning training.

Prerequisites

  • Microsoft .NET Core 2.1
  • Visual Studio 2017 with latest updates

Download sample Visual Studio project

  1. Create a local copy of code sample
git clone https://github.com/Microsoft/DataPrep.Net
  1. Access the Samples folder
cd DataPrep.Net\Samples
  1. Launch DataPrepSample.sln in the sample project.
DataPrepSample.sln

After launching DataPrepSample.sln in the sample project, you will see sample C# code in Program.cs.

Set up the input and output data path

    string FMInfoPath = Path.Combine(Directory.GetCurrentDirectory(), "farmers-markets-info.csv");
    string FMProductsPath = Path.Combine(Directory.GetCurrentDirectory(), "farmers-markets-products.csv");
    string resultPath = Path.Combine(Directory.GetCurrentDirectory(), "result.csv");

We have set up three file paths:

  1. FMInfoPath - A sample CSV file that contains farmers market information.
  2. FMProductsPath - A sample CSV file that contains farmers market produce information.
  3. resultPath - An file path where we will write the clean data to.

Auto Read Input Data and Detect the Column Data Type

    DataFlow dataFlowMarketInfo = Reader.AutoReadFile(FMInfoPath);
    DataFlow dataFlowProducts = Reader.AutoReadFile(FMProductsPath);

AutoReadFile() attemps to analyze the file(s) at the specified path and returns a new DataFlow containing the operations required to read and parse them. The type of the file and the arguments required to read it are inferred automatically.

Select Columns from Input Data Set

    dataFlowMarketInfo = dataFlowMarketInfo.KeepColumns(
        columnNames: new string[] {"FMID", "MarketName", "State", "zip", "Location"});
    dataFlowProducts = dataFlowProducts.KeepColumns(
        columnNames: new string[] {"Credit", "Organic", "Flowers"});

These operations select necessary columns from farmers market data. For farmers market info data, the "FMID", "MarketName", "State", "zip", and "Location" columns are selected. For farmers market products, the "Credit", "Organic" and "Flowers" columns are selected. The remain columns in these two farmers market data sets are dropped.

Convert Column Data Type

    dataFlowMarketInfo = dataFlowMarketInfo.ToString(columnName: "FMID");

This operation sets up the column data type conversion for "FMID" to string data type.

    dataFlowProducts = dataFlowProducts.ToBool(
        columnName: "Organic",
        trueValues: new List<string> {"Y"},
        falseValues: new List<string> {"N"},
        mismatchAs: MismatchAsOption.AsFalse);

This code converts the column "Organic" data type to boolean. For value "Y", it will be converted to true and all others value as "N", or mismathed column value, it will be converted to false.

Replace Invalid Column Values.

    dataFlowMarketInfo = dataFlowMarketInfo.ReplaceNa(
        columnName: "Location",
        customNaList: "Other");

For farmers market information data, this code will set up the conversion for column value as "Other" to NA in the "Location" column.

    dataFlowMarketInfo = dataFlowMarketInfo.ReplaceNa(columnName: "zip");

For farmers market information data, this code will set all invalid values in zip column to NA.

Combine Two DataFlows Into One DataFlow by Appending Columns

    DataFlow dataFlowCombined = dataFlowMarketInfo.AppendColumns(
        dataflows: new List<DataFlow> { dataFlowProducts });

This code will combine two data flow objects into one dataflow object (dataFlowCombined) by appending columns from dataFlowPorducts to dataFlowMarketInfo.

Sort the Combined DataFlow

    List<Tuple<string, bool>> sortOrder = new List<Tuple<string, bool>>
    {
        new Tuple<string, bool>("State", false)
    };
    dataFlowCombined = dataFlowCombined.Sort(sortOrder);

The code above will set up the sorting for the combined dataFlow object acoording the state column in ascending order.

Write the Transformed Data

    DataFlow result = dataFlowCombined.WriteDelimitedFile(resultPath);

This code will set up the output and write the cleaned up data into resultPath.

Execute the dataflow

    IDataView dataView = result.ToDataView();

This code will trigger the dataflow pipeline execution and return a IDataView object which can be used for model training. See IDataView for reference.

Review the Output

FMID MarketName State zip Location Credit Organic Flowers
0 1000632 Helena Market Days Alabama 35080 NA False False ERROR
1 1008842 Tuscaloosa Farmers' Market at the River Market at Manderson Landing Alabama 35401 NA True True True
2 1000113 Tuscaloosa County Truck Growers Association Alabama 35476 NA False False True
3 1002572 GEORGIANA FARMERS MARKET Alabama 36033 NA False False ERROR
4 1006730 Guntersville Farmers Market Alabama 35976 NA True False True

Clean up resources

If you don't need the result.csv file any more, you can delete it in your DataPrepSample project bin folder now.

Next steps

In this tutorial, you learned:

  • Downloaded the Visual Studio project with sample codes.
  • Loaded two datasets with different field names.
  • Cleansed data to remove anomalies.
  • Merged and wrote datasets for further machine learning training.

You're ready to use the Microsoft Data Prep .NET SDK to prepare your data.