Note
Access to this page requires authorization. You can try signing in or changing directories.
Access to this page requires authorization. You can try changing directories.
In this tutorial, you will
- Download the Visual Studio project with sample codes.
- Load two datasets with different field names.
- Cleanse data to remove anomalies.
- Merge and write datasets for further machine learning training.
Prerequisites
- Microsoft .NET Core 2.1
- Visual Studio 2017 with latest updates
Download sample Visual Studio project
- Create a local copy of code sample
git clone https://github.com/Microsoft/DataPrep.Net
- Access the
Samples
folder
cd DataPrep.Net\Samples
- Launch DataPrepSample.sln in the sample project.
DataPrepSample.sln
After launching DataPrepSample.sln
in the sample project, you will see sample C# code in Program.cs.
Set up the input and output data path
string FMInfoPath = Path.Combine(Directory.GetCurrentDirectory(), "farmers-markets-info.csv");
string FMProductsPath = Path.Combine(Directory.GetCurrentDirectory(), "farmers-markets-products.csv");
string resultPath = Path.Combine(Directory.GetCurrentDirectory(), "result.csv");
We have set up three file paths:
- FMInfoPath - A sample CSV file that contains farmers market information.
- FMProductsPath - A sample CSV file that contains farmers market produce information.
- resultPath - An file path where we will write the clean data to.
Auto Read Input Data and Detect the Column Data Type
DataFlow dataFlowMarketInfo = Reader.AutoReadFile(FMInfoPath);
DataFlow dataFlowProducts = Reader.AutoReadFile(FMProductsPath);
AutoReadFile()
attemps to analyze the file(s) at the specified path and returns a new DataFlow containing the operations required to read and parse them. The type of the file and the arguments required to read it are inferred automatically.
Select Columns from Input Data Set
dataFlowMarketInfo = dataFlowMarketInfo.KeepColumns(
columnNames: new string[] {"FMID", "MarketName", "State", "zip", "Location"});
dataFlowProducts = dataFlowProducts.KeepColumns(
columnNames: new string[] {"Credit", "Organic", "Flowers"});
These operations select necessary columns from farmers market data. For farmers market info data, the "FMID", "MarketName", "State", "zip", and "Location" columns are selected. For farmers market products, the "Credit", "Organic" and "Flowers" columns are selected. The remain columns in these two farmers market data sets are dropped.
Convert Column Data Type
dataFlowMarketInfo = dataFlowMarketInfo.ToString(columnName: "FMID");
This operation sets up the column data type conversion for "FMID" to string data type.
dataFlowProducts = dataFlowProducts.ToBool(
columnName: "Organic",
trueValues: new List<string> {"Y"},
falseValues: new List<string> {"N"},
mismatchAs: MismatchAsOption.AsFalse);
This code converts the column "Organic" data type to boolean. For value "Y", it will be converted to true and all others value as "N", or mismathed column value, it will be converted to false.
Replace Invalid Column Values.
dataFlowMarketInfo = dataFlowMarketInfo.ReplaceNa(
columnName: "Location",
customNaList: "Other");
For farmers market information data, this code will set up the conversion for column value as "Other" to NA in the "Location" column.
dataFlowMarketInfo = dataFlowMarketInfo.ReplaceNa(columnName: "zip");
For farmers market information data, this code will set all invalid values in zip column to NA.
Combine Two DataFlows Into One DataFlow by Appending Columns
DataFlow dataFlowCombined = dataFlowMarketInfo.AppendColumns(
dataflows: new List<DataFlow> { dataFlowProducts });
This code will combine two data flow objects into one dataflow object (dataFlowCombined) by appending columns from dataFlowPorducts to dataFlowMarketInfo.
Sort the Combined DataFlow
List<Tuple<string, bool>> sortOrder = new List<Tuple<string, bool>>
{
new Tuple<string, bool>("State", false)
};
dataFlowCombined = dataFlowCombined.Sort(sortOrder);
The code above will set up the sorting for the combined dataFlow object acoording the state column in ascending order.
Write the Transformed Data
DataFlow result = dataFlowCombined.WriteDelimitedFile(resultPath);
This code will set up the output and write the cleaned up data into resultPath.
Execute the dataflow
IDataView dataView = result.ToDataView();
This code will trigger the dataflow pipeline execution and return a IDataView object which can be used for model training. See IDataView for reference.
Review the Output
FMID | MarketName | State | zip | Location | Credit | Organic | Flowers | |
---|---|---|---|---|---|---|---|---|
0 | 1000632 | Helena Market Days | Alabama | 35080 | NA | False | False | ERROR |
1 | 1008842 | Tuscaloosa Farmers' Market at the River Market at Manderson Landing | Alabama | 35401 | NA | True | True | True |
2 | 1000113 | Tuscaloosa County Truck Growers Association | Alabama | 35476 | NA | False | False | True |
3 | 1002572 | GEORGIANA FARMERS MARKET | Alabama | 36033 | NA | False | False | ERROR |
4 | 1006730 | Guntersville Farmers Market | Alabama | 35976 | NA | True | False | True |
Clean up resources
If you don't need the result.csv file any more, you can delete it in your DataPrepSample project bin folder now.
Next steps
In this tutorial, you learned:
- Downloaded the Visual Studio project with sample codes.
- Loaded two datasets with different field names.
- Cleansed data to remove anomalies.
- Merged and wrote datasets for further machine learning training.
You're ready to use the Microsoft Data Prep .NET SDK to prepare your data.