C#: Processing CSV files (Part 1)

Introduction

This series will explore various aspects of importing comma-separated values (.csv) files into a SQL-Server database. CSV files are a common way to share data in plain text format, for example moving data from one database's table(s) to another database, such as from SQL-Server to an Oracle database.

The accompanying source code and code blocks have been kept very simple so that following along and learning the basics is not overwhelming, which can easily happen the deeper into the import process a developer goes.

Even when data is exported from one database into a client database that has matching table(s) with matching columns, the process is not always simple; for example, business rules may dictate that new incoming data can't overwrite existing data, or that incoming data must be merged with existing data.

In the wild a simple import is rarely possible: databases share the same basic data types, yet each database handles them differently. Couple this with the fact that a flat CSV file may need to be split up into multiple database tables.

Part 1 of the series

Considerations for processing data

The following should always be considered when importing CSV files.

  • Any column may be missing altogether or missing in one or more rows.
  • Mixed data types, e.g. a column of dates where some rows have malformed dates or dates formatted for a different culture, or columns that should be numeric where some rows have no value or an unexpected format (see the sketch following this list).
  • Columns with values that are not valid for your business, e.g. a list of products that must map to a product table but contains products you don’t handle.
  • Column values out of range, e.g. a numeric column has a range of 1 through 10 but incoming data has values 1 through 100.
  • The file is in use by another process and is locked.
  • The file is extremely large and processing may take hours; have a plan, such as running a nightly job.
  • Rows/columns that don’t fit into the database; have a plan to handle them, as several examples will be shown in this series.
  • Offering clients a method to review, modify or reject suspect data.
  • Consider an intermediate database table so that processing suspect data can be done over time, especially with a large data set that may take hours or days to process.
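
Several of these checks reduce to culture-aware TryParse calls plus simple range tests. The sketch below illustrates the idea; the FieldChecks helper and the en-US fallback culture are assumptions for illustration, not part of the accompanying source.

using System;
using System.Globalization;

namespace WindowsFormsApp1.Classes
{
    public static class FieldChecks
    {
        /// <summary>
        /// Try the invariant culture first, then a culture the file may
        /// have been exported with (assumed en-US here).
        /// </summary>
        public static bool TryParseDate(string input, out DateTime value)
        {
            return DateTime.TryParse(input, CultureInfo.InvariantCulture,
                       DateTimeStyles.None, out value)
                   || DateTime.TryParse(input, new CultureInfo("en-US"),
                       DateTimeStyles.None, out value);
        }

        /// <summary>
        /// Range test for columns such as the 1 through 10 example above.
        /// </summary>
        public static bool InRange(int value, int min, int max)
        {
            return value >= min && value <= max;
        }
    }
}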

Consider working with CSV files as a puzzle no matter how consistent the structure should be, and expect each file to have its own parsing quirks.

Part one goals

To read a simple CSV file of just over 7,500 records, with nine columns of types ranging from integer, float and date-time to strings, containing malformed data.

To parse data, a TextFieldParser will be used to read and parse the file. Alternatives to a TextFieldParser, when sticking with pure Microsoft classes, are reading data using a Stream (StreamReader) or OleDb. Beyond those there are several libraries that can handle reading CSV files, yet as stated this series is solely about working with Microsoft classes.

During parsing, validation is performed to assert that data is of the proper type, is not empty, and falls within valid ranges. Data read in is placed into a list of a class designed to hold the data read from the CSV file.

The TextFieldParser class does a great job of processing incoming data, which is why it was selected. As with any class there can be unknowns which become known once you have worked with it. With the TextFieldParser, when looping through lines in a file, empty lines are skipped. In the code sample nothing is done about this, but the line count will be off, by the number of empty lines encountered, from what might be seen when opening the file in Notepad++ or a similar text editor. Using OleDb or a Stream, empty lines are not ignored, but nothing is truly gained if the record count is correct, e.g. there are 150 lines where 50 lines are empty and you expect 100 lines of valid data. This means you have received the correct amount of data; there are simply empty lines to filter out.
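
If empty lines simply need to be filtered out rather than counted, deferred iteration over File.ReadLines is enough. A minimal sketch, where fileName stands in for the path used elsewhere in this article (requires System.IO and System.Linq):

// Lazily read all lines, discarding empty or whitespace-only ones.
var dataLines = File.ReadLines(fileName)
    .Where(line => !string.IsNullOrWhiteSpace(line));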

Requires

Microsoft Visual Studio 2017 (or higher)
Microsoft SQL-Server Express or the full SQL-Server.

Visual interface

The interface is done using a Windows Forms project, as these projects are easier to set up than a web project; additionally, a Windows Forms project need not be installed on a user’s machine but instead may be executed from a shared location.

File selection

In the code samples below a hard-coded file is used; in the wild a file may be selected through a file selection dialog, or by reading one or more files from a directory listing, as sketched below. If the process were driven from a directory listing, the results would go directly to an intermediate table for review, while in the code samples provided here they are sent directly to a DataGridView.
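
A sketch of both selection styles follows; the folder path, file filter and ProcessFile handler are placeholders rather than part of the accompanying source (requires System.IO and System.Windows.Forms).

// Selection through a file dialog.
using (var dialog = new OpenFileDialog { Filter = "CSV files (*.csv)|*.csv" })
{
    if (dialog.ShowDialog() == DialogResult.OK)
    {
        ProcessFile(dialog.FileName); // ProcessFile is a hypothetical handler
    }
}

// Selection through a directory listing.
foreach (var fileName in Directory.EnumerateFiles(@"C:\Data", "*.csv"))
{
    ProcessFile(fileName);
}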

Parsing data using StreamReader 

First, check that the file to parse exists. In the following code block, mHasException and mLastException come from a base exception class which the parsing class inherits. The return type is a ValueTuple (installed using the NuGet Package Manager).

if (!File.Exists(_inputFileName))
{
    mHasException = true;
    mLastException = new FileNotFoundException($"Missing {_inputFileName}");
    // Return IsSuccessFul (false here) so the flag means the same thing on both return paths.
    return (IsSuccessFul, new List<DataItem>(), new List<DataItemInvalid>());
}
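
For context, the snippet above lives inside a method that returns the ValueTuple just mentioned. The signature might look like the following sketch; the method name is an assumption rather than taken from the accompanying source.

public (bool Success, List<DataItem> ValidRows, List<DataItemInvalid> InvalidRows) LoadCsvFile()
{
    // ... file existence check above, variable setup and parsing below ...
}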

If the file exists, the next step is to set up several variables used for validation and for the return types which will contain valid and, if present, invalid data read from the CSV file.

var validRows = new List<DataItem>();
var invalidRows = new List<DataItemInvalid>();
var validateBad = 0;
 
int index = 0;
 
int district = 0;
int grid = 0;
int nCode = 0;
float latitude = 0;
float longitude = 0;

The following code block continues from the one above.

A while statement is used to loop through each line in the CSV file. Each line is split on a comma, in this case the most common delimiter, and then validated to ensure there are nine elements in the string array. If there are not nine elements in the array, the line is placed into a possible-reject container.

Note that the first line contains column names, which is skipped by checking the index/line number stored in the variable index.

Following the check for nine elements in a line, seven elements of the string array are checked to ensure they can be converted to the expected data types, ranging from date to numerics, along with checks for empty string values.

Passing the type check above, the section under the comment Questionable fields does several more checks, e.g. does the NCIC field contain data outside an expected range. Note that not all data can be checked here, such as the data in parts[3], as it can be subjective relative to data in other elements of the array; this is left to the review process, which provides a grid with a dropdown of valid selections to choose from. If there are issues requiring review of a record, a property is set to flag the data for manual review and the record is loaded into a list.

try
{
    using (var readFile = new StreamReader(_inputFileName))
    {
        string line;
        string[] parts;
 
        while ((line = readFile.ReadLine()) != null)
        {
            parts = line.Split(',');

            // String.Split never returns null, so only the line counter
            // and per-line flag need updating here.
            index += 1;
            validateBad = 0;
 
            if (parts.Length != 9)
            {
                invalidRows.Add(new DataItemInvalid() { Row = index, Line = string.Join(",", parts) });
                continue;
 
            }
 
            // Skip first row which in this case is a header with column names
            if (index <= 1) continue;
            /*
             * These columns are checked for proper types
             */
            var validRow = DateTime.TryParse(parts[0], out var d) &&
                           float.TryParse(parts[7].Trim(), out latitude) &&
                           float.TryParse(parts[8].Trim(), out longitude) &&
                           int.TryParse(parts[2], out district) &&
                           int.TryParse(parts[4], out grid) &&
                           !string.IsNullOrWhiteSpace(parts[5]) &&
                           int.TryParse(parts[6], out nCode);
 
            /*
             * Questionable fields
             */
            if (string.IsNullOrWhiteSpace(parts[1]))
            {
                validateBad += 1;
            }
            if (string.IsNullOrWhiteSpace(parts[3]))
            {
                validateBad += 1;
            }
 
            // NCIC code must be 909 or greater
            if (nCode < 909)
            {
                validateBad += 1;
            }
 
            if (validRow)
            {
 
                validRows.Add(new DataItem()
                {
                    Id = index,
                    Date = d,
                    Address = parts[1],
                    District = district,
                    Beat = parts[3],
                    Grid = grid,
                    Description = parts[5],
                    NcicCode = nCode,
                    Latitude = latitude,
                    Longitude = longitude,
                    Inspect = validateBad > 0
                });
 
            }
            else
            {
                // fields to review in specific rows
                invalidRows.Add(new DataItemInvalid() { Row = index, Line = string.Join(",", parts) });
            }
        }
    }
}
catch (Exception ex)
{
    mHasException = true;
    mLastException = ex;
}

Once the above code has completed, the following line of code returns the data to the calling form/window as a ValueTuple.

return (IsSuccessFul, validRows, invalidRows);
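
On the calling form the tuple can be deconstructed directly. A minimal usage sketch, where the class and method names are assumptions:

var parser = new CsvParser(); // hypothetical name for the parsing class
var (success, validRows, invalidRows) = parser.LoadCsvFile();

if (success)
{
    // Bind through a BindingSource, matching the pattern used in the review form.
    var bs = new BindingSource { DataSource = validRows };
    dataGridView1.DataSource = bs;
}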

Parsing data using TextFieldParser

This example uses a TextFieldParser to process data. Rather than splitting lines manually as done above, the TextFieldParser.ReadFields method handles the splitting by the delimiter(s) assigned in parser.Delimiters. The remainder of the validation is no different than with StreamReader. One major difference is that empty lines are ignored, unlike with StreamReader.

public (bool Success, List<DataItem> ValidRows, List<DataItemInvalid> InvalidRows, int EmptyLineCount) LoadCsvFileTextFieldParser()
{
    mHasException = false;
 
    var validRows = new List<DataItem>();
    var invalidRows = new List<DataItemInvalid>();
    var validateBad = 0;
 
    int index = 0;
 
    int district = 0;
    int grid = 0;
    int nCode = 0;
    float latitude = 0;
    float longitude = 0;
 
    var emptyLineCount = 0;
    var line = "";
 
    try
    {
        /*
         * If interested in blank line count
         */
        using (var reader = File.OpenText(_inputFileName))
        {
            while ((line = reader.ReadLine()) != null) // EOF
            {
                if (string.IsNullOrWhiteSpace(line))
                {
                    emptyLineCount++;
                }
            }
        }
 
        using (var parser = new TextFieldParser(_inputFileName))
        {
             
            parser.Delimiters = new[] { "," };
            while (true)
            {
 
                string[] parts = parser.ReadFields();
 
                if (parts == null)
                {
                    break;
                }
 
                index += 1;
                validateBad = 0;
 
                if (parts.Length != 9)
                {
                    invalidRows.Add(new DataItemInvalid() { Row = index, Line = string.Join(",", parts) });
                    continue;
 
                }
 
                // Skip first row which in this case is a header with column names
                if (index <= 1) continue;
 
                /*
                 * These columns are checked for proper types
                 */
                var validRow = DateTime.TryParse(parts[0], out var d) &&
                               float.TryParse(parts[7].Trim(), out latitude) &&
                               float.TryParse(parts[8].Trim(), out longitude) &&
                               int.TryParse(parts[2], out district) &&
                               int.TryParse(parts[4], out grid) &&
                               !string.IsNullOrWhiteSpace(parts[5]) &&
                               int.TryParse(parts[6], out nCode);
 
                /*
                 * Questionable fields
                 */
                if (string.IsNullOrWhiteSpace(parts[1]))
                {
                    validateBad += 1;
                }
                if (string.IsNullOrWhiteSpace(parts[3]))
                {
                    validateBad += 1;
                }
 
                // NCIC code must be 909 or greater
                if (nCode < 909)
                {
                    validateBad += 1;
                }
 
                if (validRow)
                {
 
                    validRows.Add(new DataItem()
                    {
                        Id = index,
                        Date = d,
                        Address = parts[1],
                        District = district,
                        Beat = parts[3],
                        Grid = grid,
                        Description = parts[5],
                        NcicCode = nCode,
                        Latitude = latitude,
                        Longitude = longitude,
                        Inspect = validateBad > 0
                    });
 
                }
                else
                {
                    // fields to review in specific rows
                    invalidRows.Add(new DataItemInvalid() { Row = index, Line = string.Join(",", parts) });
                }
 
            }
        }
 
    }
    catch (Exception ex)
    {
        mHasException = true;
        mLastException = ex;
    }
 
    return (IsSuccessFul, validRows, invalidRows, emptyLineCount);
 
}

Parsing data using OleDb

This method "reads" lines from a CSV file with the disadvantage of all fields are not typed and carry more baggage than needed for processing lines from the CSV file which will make a difference in time to process with larger CSV files.

public DataTable LoadCsvFileOleDb()
{
    var connString = $@"Provider=Microsoft.Jet.OleDb.4.0;.....";
 
    var dt = new DataTable();
 
    try
    {
        using (var cn = new OleDbConnection(connString))
        {
            cn.Open();
 
            var selectStatement = "SELECT * FROM [" + Path.GetFileName(_inputFileName) + "]";
 
            using (var adapter = new OleDbDataAdapter(selectStatement, cn))
            {
                var ds = new DataSet("Demo");
                adapter.Fill(ds);
                ds.Tables[0].TableName = Path.GetFileNameWithoutExtension(_inputFileName);
                dt = ds.Tables[0];
            }
        }
    }
    catch (Exception ex)
    {
        mHasException = true;
        mLastException = ex;
    }
 
    return dt;
}
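
Because the method returns a DataTable, the result can be bound straight to a grid. A short usage sketch (names are assumptions):

var parser = new CsvParser(); // hypothetical name for the parsing class
DataTable table = parser.LoadCsvFileOleDb();
dataGridView1.DataSource = table;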

Reviewing

The following window has several buttons at the bottom. The Process button reads the CSV file, in this case using StreamReader. The dropdown contains any line numbers that need to be inspected; pressing the Inspect button moves to that line in the grid, useful for a small number of problem lines or to get a visual on a possibly larger problem. The button labeled Review pops up a child window that permits edits which update the main window below.
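
Moving to a flagged line amounts to setting the grid's current cell, which also scrolls the row into view. A sketch of what the Inspect button's click event might do; the control names are assumptions:

private void InspectButton_Click(object sender, EventArgs e)
{
    // The dropdown is assumed to hold the flagged row indices.
    if (lineComboBox.SelectedItem != null)
    {
        var rowIndex = (int)lineComboBox.SelectedItem;
        dataGridView1.CurrentCell = dataGridView1.Rows[rowIndex].Cells[0];
    }
}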

Child window shown when pressing the "Review" button.

The only true validation done on this window is providing a list of valid values for the beat field through a dropdown backed by a static list. As this series continues, a database reference table will replace the static list.
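
Wiring the Review button to the child window might look like the sketch below; only ReviewForm and its Data property come from the code that follows, the rest is assumed.

using (var review = new ReviewForm(validRows))
{
    review.ShowDialog();
    // Edits made in the child window are exposed through the Data property.
    validRows = review.Data;
}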

Code for validating through a Drop-down in the DataGridView.

using System;
using System.Collections.Generic;
using System.ComponentModel;
using System.Data;
using System.Drawing;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using System.Windows.Forms;
using WindowsFormsApp1.Classes;
 
namespace WindowsFormsApp1
{
    public partial class ReviewForm : Form
    {
        private BindingSource _bs = new BindingSource();
        private List<DataItem> _data;
 
        /// <summary>
        /// Provide access by the calling form to the data presented
        /// </summary>
        public List<DataItem> Data
        {
            get { return _data; }
        }
 
        /// <summary>
        /// Acceptable values for beat field. In part 2 these will be read from a database reference table.
        /// </summary>
        private List<string> _beatList = new List<string>()
            {
                "1A", "1B", "1C", "2A", "2B", "2C", "3A", "3B", "3C", "3M", "4A",
                "4B", "4C", "5A", "5B", "5C", "6A", "6B", "6C"
            };
 
        public ReviewForm()
        {
            InitializeComponent();
        }
 
        public ReviewForm(List<DataItem> pData)
        {
            InitializeComponent();
 
            _data = pData;
            Shown += ReviewForm_Shown;
        }
 
        private void ReviewForm_Shown(object sender, EventArgs e)
        {
            dataGridView1.AutoGenerateColumns = false;
 
            // ReSharper disable once PossibleNullReferenceException
            ((DataGridViewComboBoxColumn) dataGridView1.Columns["beatColumn"]).DataSource = _beatList;
 
            _bs.DataSource = _data;
            dataGridView1.DataSource = _bs;
            dataGridView1.ExpandColumns();
 
            dataGridView1.EditingControlShowing += DataGridView1_EditingControlShowing;
 
        }
        /// <summary>
        /// Setup to provide access to changes to the current row, here we are only interested in the beat field.
        /// Other fields would use similar logic for providing valid selections.
        /// </summary>
        /// <param name="sender"></param>
        /// <param name="e"></param>
        private void DataGridView1_EditingControlShowing(object sender, DataGridViewEditingControlShowingEventArgs e)
        {
            if (dataGridView1.CurrentCell.IsComboBoxCell())
            {
                if (dataGridView1.Columns[dataGridView1.CurrentCell.ColumnIndex].Name == "beatColumn")
                {
                    if (e.Control is ComboBox cb)
                    {
                        cb.SelectionChangeCommitted -= _SelectionChangeCommitted;
                        cb.SelectionChangeCommitted += _SelectionChangeCommitted;
                    }
                }
            }
 
        }
        /// <summary>
        /// Update current row beat field
        /// </summary>
        /// <param name="sender"></param>
        /// <param name="e"></param>
        private void _SelectionChangeCommitted(object sender, EventArgs e)
        {
            if (_bs.Current != null)
            {
                if (!string.IsNullOrWhiteSpace(((DataGridViewComboBoxEditingControl)sender).Text))
                {
                    var currentRow = (DataItem) _bs.Current;
                    currentRow.Beat = ((DataGridViewComboBoxEditingControl) sender).Text;
                    currentRow.Inspect = false;
                }
            }
        }
    }
}

Extension methods used in the above code blocks.

using System.Linq;
using System.Windows.Forms;
 
namespace WindowsFormsApp1.Classes
{
    public static class DataGridViewExtensions
    {
        /// <summary>
        /// Expand all columns to fit their content
        /// </summary>
        /// <param name="sender"></param>
        public static void ExpandColumns(this DataGridView sender)
        {
            sender.Columns.Cast<DataGridViewColumn>().ToList()
                .ForEach(col => col.AutoSizeMode = DataGridViewAutoSizeColumnMode.AllCells);
        }
 
        /// <summary>
        /// Used to determine if the current cell type is a ComboBoxCell
        /// </summary>
        /// <param name="sender"></param>
        /// <returns></returns>
        public static bool IsComboBoxCell(this DataGridViewCell sender)
        {
            var result = false;
            if (sender.EditType != null)
            {
                if (sender.EditType == typeof(DataGridViewComboBoxEditingControl))
                {
                    result = true;
                }
            }
            return result;
        }
    }
}

Data classes to contain data read from the CSV file.

Good/questionable data class

using System;
 
namespace WindowsFormsApp1.Classes
{
    public class DataItem
    {
        public int Id { get; set; }
        public DateTime Date { get; set; }
        public string Address { get; set; }
        public int District { get; set; }
        public string Beat { get; set; }
        public int Grid { get; set; }
        public string Description { get; set; }
        public int NcicCode { get; set; }
        public float Latitude { get; set; }
        public float Longitude { get; set; }
        public bool Inspect { get; set; }
        public string Line => $"{Id},{Date},{Address},{District},{Beat}," +
                              $"{Grid},{Description},{NcicCode},{Latitude},{Longitude}";
        public override string ToString()
        {
            return Id.ToString();
        }
    }
}

Invalid data class.

namespace WindowsFormsApp1.Classes
{
    public class DataItemInvalid
    {
        public int Row { get; set; }
        public string Line { get; set; }
        public override string ToString()
        {
            return $"[{Row}] '{Line}'";
        }
    }
}

Upcoming in part 2

  • Storing data in a SQL-Server table which will then be presented in a grid, rather than loading parsed data directly into a grid.
  • Storing changes performed in the review process back to the intermediate database table.
  • Storing reference data in SQL-Server rather than in static variables.
  • Presenting a move-to-production table in SQL-Server.
  • Presenting several options to push data from the data classes to SQL-Server.

Summary

In this article, thoughts and ideas along with suggestions have been presented for dealing with CSV files. Consider it a building block; the work continues in part 2 of this series.

Source code

https://github.com/karenpayneoregon/ImportingCsvFiles