다음을 통해 공유


C#: Processing CSV files (Part 1A)

Introduction

This article is a continuation of a series on parsing delimited text file leading off with part 1 of the series.  This article will discuss prevalidation ideas rather than performing all parsing and validation together.

What should be considered when parsing delimited files that the unexpected should always be considered, never think too much assertion has been performed.

Base data

Throughout this article the file is delimited by a comma as per below where each line in the file is consider a customer.

Alana,Petty,AF7Y7VM47Z7J2TMF2BXKRWV8PPJ,Fremont,Missouri,Andorra,5442.20,1956-08-31 17:01:39.3333997

Which can be represented in a class for a matching table in a SQL-Server database table.

public class Customer
{
    public int Id { get; set; }
    public string FirstName { get; set; }
    public string LastName { get; set; }
    public string Street { get; set; }
    public string City { get; set; }
    public string State { get; set; }
    public string Country { get; set; }
    public decimal? Balance { get; set; }
    public DateTime? LastPaid { get; set; }
    public override string ToString()
    {
        return $"{FirstName}, {LastName}";
    }
}

Perfect world parsing

When information in a delimited text file "always" perfected, there may be empty lines while all non-empty lines the data is correct data types a file may be parses in one line passing a file name to a method which in turn returns a list of (in this case) a Customer.

public List<Customer> ReadCustomersPerfectWorld(string fileName)
{
 
    return (
        from line in  File.ReadAllLines(fileName)
        where line.Length > 0
        let lineParts = line.Split(',')
        select new  Customer()
        {
            FirstName = lineParts[0],
            LastName = lineParts[1],
            Street = lineParts[2],
            City = lineParts[3],
            State = lineParts[4],
            Country = lineParts[5],
            Balance = Convert.ToDecimal(lineParts[6]),
            LastPaid = Convert.ToDateTime(lineParts[7])
        }).ToList();
 
}

Imperfect world parsing

When there is imperfection a developer may enclose read logic in a try/catch statement.

public List<Customer> ReadCustomersImperfectWorld(string fileName)
{
 
    var customerList = new  List<Customer>();
 
    try
    {
        customerList = (
            from line in  File.ReadAllLines(fileName)
            where line.Length > 0
            let lineParts = line.Split(',')
            select new  Customer()
            {
                FirstName = lineParts[0],
                LastName = lineParts[1],
                Street = lineParts[2],
                City = lineParts[3],
                State = lineParts[4],
                Country = lineParts[5],
                Balance = Convert.ToDecimal(lineParts[6]),
                LastPaid = Convert.ToDateTime(lineParts[7])
            }).ToList();
    }
    catch (Exception ex)
    {
        mHasException = true;
        mLastException = ex;
    }
 
    return customerList;
 
}

Although this might seem okay there is no way to deal with issues like invalid data types and may only resolve a missing file or no permissions to read a file. A missing file should be asserted prior to calling this method while insufficient permissions are to be handled outside of the application usually by moving the file to were permissions are granted or to have the user permissions set in Active Directory. 

Using assertion parsing/processing

There are third party libraries for handling validation and processing of delimited files while many time no matter how good they are a developer must spend time learning them, instead consider writing in house classes for validation.

If the validation is done through a user interface the process should always present both good and invalidated data where in this case good deal is shown in a window which started processing an a child window for invalid data which permits the user to identify what actions to take to correct the issues. In a service which runs without a user interface write to a log file.
 
In the following section valid data is displayed in a grid and invalid data in a secondary window. Although this is done in a desktop application the same applies in web applications where a link would be present to view incorrect data.

The following class performs all validation, stored valid customer data in a list and issues in a separate list. Step one is to validate after splitting a line that the array has a specific column count. If not do not continue as the assertions to follow rely on a field count.

var parts = customerLine.Split(',');
 
const int  fieldCount = 8;
 
if (parts.Length != fieldCount)
{
    IssueList.Add($"Line: {lineIndex} Field count {parts.Length}, expected {fieldCount}.");
    return;                
}

Business rules require first and last name, elements (columns) index 0 and index 1. This is done using string.IsNullOrWhiteSpace.

if (!string.IsNullOrWhiteSpace(parts[0]))
{
    Customer.FirstName = parts[0];
}
else
{
    IssueList.Add($"Line: {lineIndex} FirstName required");
}
 
if (!string.IsNullOrWhiteSpace(parts[1]))
{
    Customer.LastName = parts[1];
}
else
{
    IssueList.Add($"Line: {lineIndex} LastName required");
}

Address columns in indexes 1,3,4 and 5 are optional.

Customer.Street = parts[2] ?? "";
Customer.City = parts[3] ?? "";
Customer.State = parts[4] ?? "";
Customer.Country = parts[5] ?? "";

In column 6, a numeric decimal is expected. Here assertion is performed using decimal.TryParse using C# 7 our var.

if (!string.IsNullOrWhiteSpace(parts[6]))
{
    if (decimal.TryParse(parts[6], out  var balanceResult))
    {
        Customer.Balance = balanceResult;
    }
    else
    {
        IssueList.Add($"Line: {lineIndex} Balance value '{parts[6]}' is invalid decimal");
    }
}
else
{
    IssueList.Add($"Line: {lineIndex} Balance is required");
}

Column 7 is a datetime, similar to the last column TryParse is used against a date.

if (!string.IsNullOrWhiteSpace(parts[7]))
{
    if (DateTime.TryParse(parts[7], out var paidResult))
    {
        Customer.LastPaid = paidResult;
    }
    else
    {
        IssueList.Add($"Line: {lineIndex} Last paid '{parts[7]}' is not valid date");
    }
}
else
{
    IssueList.Add($"Line: {lineIndex} Last paid is required");
}

Finally a constructed Customer and issue list are set.

CustomerResults.Customers.Add(Customer);
CustomerResults.IssuesList.AddRange(IssueList);

 
Entire class for processing a line in a text file.

public class  CustomerCreator
{
    public Customer Customer { get; set; }
    public CustomerResults CustomerResults { get; set; }
    public List<string> IssueList { get; set; }
    public bool  HasIssues => IssueList.Count > 0;
    public string  Issues => string.Join(",", IssueList.ToArray());
    public CustomerCreator(string customerLine, int lineIndex)
    {
        IssueList = new  List<string>();
        Customer = new  Customer();
        CustomerResults = new  CustomerResults();
 
        Build(customerLine, lineIndex);
    }
 
    private void  Build(string  customerLine, int  lineIndex)
    {
        var parts = customerLine.Split(',');
 
        const int  fieldCount = 8;
 
        if (parts.Length != fieldCount)
        {
            IssueList.Add($"Line: {lineIndex} Field count {parts.Length}, expected {fieldCount}.");
            return;                
        }
 
        if (!string.IsNullOrWhiteSpace(parts[0]))
        {
            Customer.FirstName = parts[0];
        }
        else
        {
            IssueList.Add($"Line: {lineIndex} FirstName required");
        }
 
        if (!string.IsNullOrWhiteSpace(parts[1]))
        {
            Customer.LastName = parts[1];
        }
        else
        {
            IssueList.Add($"Line: {lineIndex} LastName required");
        }
 
        Customer.Street = parts[2] ?? "";
        Customer.City = parts[3] ?? "";
        Customer.State = parts[4] ?? "";
        Customer.Country = parts[5] ?? "";
 
        if (!string.IsNullOrWhiteSpace(parts[6]))
        {
            if (decimal.TryParse(parts[6], out  var balanceResult))
            {
                Customer.Balance = balanceResult;
            }
            else
            {
                IssueList.Add($"Line: {lineIndex} Balance value '{parts[6]}' is invalid decimal");
            }
        }
        else
        {
            IssueList.Add($"Line: {lineIndex} Balance is required");
        }
 
        if (!string.IsNullOrWhiteSpace(parts[7]))
        {
            if (DateTime.TryParse(parts[7], out var paidResult))
            {
                Customer.LastPaid = paidResult;
            }
            else
            {
                IssueList.Add($"Line: {lineIndex} Last paid '{parts[7]}' is not valid date");
            }
        }
        else
        {
            IssueList.Add($"Line: {lineIndex} Last paid is required");
        }
 
        CustomerResults.Customers.Add(Customer);
        CustomerResults.IssuesList.AddRange(IssueList);
 
    }
}

The above is called from a method responsible for iterating a file using a stream reader which in this case is called from a button click event in a window. Since there is nothing indicating the window type the project type may be windows forms or WPF.

The return type is a ValueTuple which has been installed using NuGet. Note the last three line prior to return statement would be for a developer to see if there were variations on the total line count, how many empty lines for instance.

public (List<Customer> customers, List<string> issues) ReadCustomersPerfectWorldCreator(string fileName)
{
 
    var customerList = new  List<Customer>();
    var issueList = new  List<string>();
 
 
    var counterValidLines = 0;
    var counterEmptyLines = 0;
    var currentLineIndex = 1;
 
    string line;
 
 
    var file = new  StreamReader(fileName);
    try
    {
        while ((line = file.ReadLine()) != null)
        {
 
            if (!string.IsNullOrWhiteSpace(line))
            {
                counterValidLines++;
                var result = new  CustomerCreator(line, currentLineIndex);
                if (result.HasIssues)
                {
                    issueList.AddRange(result.IssueList);
                }
                else
                {
                    customerList.Add(result.Customer);
                }
 
            }
            else
            {
                counterEmptyLines++;
            }
 
            currentLineIndex++;
 
        }
    }
    catch (Exception exception)
    {
        mHasException = true;
        mLastException = exception;
    }
    finally
    {
        file.Close();
    }
 
 
    Console.WriteLine("There were {0} valid lines.", counterValidLines);
    Console.WriteLine("There were {0} empty lines.", counterEmptyLines);
    Console.WriteLine(counterEmptyLines + counterValidLines);
 
    return (customerList, issueList);
}

In the calling button click event FileOperations class is created, the file is identified then a call is made to the method above. The first assertion is to check if there were unexpected issues, runtime exceptions followed by checking to see if the issue list returned by the method ReadCustomersPerfectWprldCreator has issues.

If there is good data show this data and prompt to view invalid data. If the next step were to import data the imperfect data would be excluded as all lines with issues has been excluded from the good data list.

private void  RealWorldGoodButton_Click(object sender, EventArgs e)
{
    var fileOperations = new  FileOperations();
    var fileName = Path.Combine(AppDomain.CurrentDomain.BaseDirectory, "Import1.txt");
 
    var (customers, issueList) = fileOperations.ReadCustomersPerfectWorldCreator(fileName);
 
    if (fileOperations.IsSuccessFul)
    {
        _bindingSource.DataSource = customers;
        dataGridView1.DataSource = _bindingSource;
    }
    else
    {
        MessageBox.Show(fileOperations.LastExceptionMessage);
    }
 
    if (issueList.Count <= 0) return;
 
    if (Question("There were imperfect data, review?"))
    {
        var issuesForm = new  IssuesForm(issueList);
        issuesForm.Text = "Encountered these issues";
        try
        {
            issuesForm.ShowDialog();
        }
        finally
        {
            issuesForm.Dispose();
        }
    }
}

Preprocessing data

Suppose there may be no empty lines, using a method below would assert there are no empty lines. If the int list has elements then there are one or more lines which are empty, the int list hold line numbers which are empty in the event this needs to be pointed out to the entity supplying the file.

Empty lines

/// <summary>
/// For validating each line in a file has information
/// </summary>
/// <param name="fileName"></param>
/// <returns></returns>
public List<int> EmptyLines(string fileName)
{
    var lineList = new  List<int>();
    var line = "";
    int lineIndex = 1;
 
    using (var reader = new StreamReader(fileName))
    {
        while (!reader.EndOfStream)
        {
            line = reader.ReadLine();
            if (string.IsNullOrWhiteSpace(line))
            {
                lineList.Add(lineIndex);
            }
 
            lineIndex++;
        }
    }
 
    return lineList;
}

Field count

The following method checks to ensure there are the expected field count for each line which would run after the EmptyLines method above. If the returning list has one or more elements they represent the line number in the file which can be used to either manually inspect or report to the entity supplying the file.

/// <summary>
/// Validate each line has the proper field count
/// </summary>
/// <param name="fileName"></param>
/// <returns></returns>
public List<int> HasValidFieldCount(string fileName)
{
    var lineList = new  List<int>();
    var line = "";
    int lineIndex = 1;
 
    const int  fieldCount = 8;
 
    using (var reader = new StreamReader(fileName))
    {
        while (!reader.EndOfStream)
        {
            line = reader.ReadLine();
 
            if (line != null && line.Split(',').Length != fieldCount)
            {                        
                lineList.Add(lineIndex);
            }
 
            lineIndex++;
        }
    }
 
    return lineList;
}

Checking data types

When a value is any other type than a string there needs to be assertion to ensure the value can be stored. In the example below the last element must be a valid date. This would be called after the last two methods.

/// <summary>
/// Used for validating each line has a valid date field
/// </summary>
/// <param name="fileName"></param>
/// <returns></returns>
public List<string> ValidatePaidDate(string fileName)
{
    const int  fieldCount = 8;
    var issues = new  List<string>();
    var line = "";
    int lineIndex = 1;
 
    using (var reader = new StreamReader(fileName))
    {
 
        while (!reader.EndOfStream)
        {
            line = reader.ReadLine();
 
            if (!string.IsNullOrWhiteSpace(line))
            {
                var parts = line.ProcessSpacings().Split(',');
 
                if (parts.Length != fieldCount)
                {
                    issues.Add($"Line {lineIndex} has no data");
                }
 
                if (!string.IsNullOrWhiteSpace(parts[7]))
                {
                    if (!DateTime.TryParse(parts[7], out var _ ))
                    {
                        issues.Add($"Line {lineIndex} '{parts[7]}' is not a valid date");
                    }
                }
            }
 
            lineIndex++;
        }
 
    }
 
    return issues;
}

Dealing with unwanted spaces

Rather than

Alana,Petty,AF7Y7VM47Z7J2TMF2BXKRWV8PPJ,Fremont,Missouri,Andorra,5442.20,1956-08-31 17:01:39.3333997

Each line has unwanted spaces.

Alana,   Petty,      AF7Y7VM47Z7J   ,           Fremont,Missouri,Andorra
Patricia,   Cooke,B60BIJLS36KKU,Louisville,   Utah  ,China   
   Jami,Browning,UY99GBG0YE9LRCQP1XYG889B7L1XU78,             Indianapolis,  Tennessee,Madagascar

This needs to be considered and is easy to fix. The proper way to know is when it's first encountered. To fix this read each line in the file, call ProcessSpaces (shown below) extension and write to a StringBuilder which save the results to the same or different file name.

using System;
using System.Linq;
 
namespace Example1Library.Extensions
{
    public static  class StringExtension
    {
        public static  string ProcessSpacings(this string  line)
        {
            try
            {
                return string.Join(",", Array.ConvertAll(line.Split(','), 
                    field => field.Trim()).Select(items => $"{items}"));
            }
            catch (Exception)
            {
                return line;
            }
        }
    }
}

Example usage while iterating lines via a StreamReader.

while (!reader.EndOfStream)
{
    line = reader.ReadLine();
 
    if (!string.IsNullOrWhiteSpace(line))
    {
        var parts = line.ProcessSpacings().Split(',');

Uneven columns

In some cases one or more lines may have too much or too little data, for example, in the following file the first three columns are to be imported into a DataGridView with three defined columns and the first line is the column/field names.

DataGridView at design time.

DataGridView at runtime

The following code assertions

  • Does the file exists?
    • No, exit method with an empty array, in an actual application the file exists should be done by the caller of the method.
    • Yes, continue to read lines in the file
  • Checks for empty lines, if present ignore.
  • Each line 
    • On splitting a line
      • If length of the line after split is under three or exactly three elements add the array to the list.
      • If the length of the line is greater than expected take only the first three elements

Note more assertion would be required if one or more elements need to be converted to another type other than string which means returning an object array or a list of T for a class which represents the data to be displayed.

Since the first line are used for header text of the DataGridView additional logic should be used to ensure the array length matches the count of DataGridView columns in the DataGridView. Header array is sent back to the caller (the form) using a custom delegate implemented through LineDataArgs class  which is subscribed too after creating an instance of the class FileOperations.

FileOperations class

using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
 
namespace ForumQuestionTabbed.Classes
{
    public class  FileOperations
    {
        public delegate  void LineHandler(object sender,LineDataArgs myArgs);
        public event  LineHandler OnLineHandler;
 
        public List<string[]> ReadLines(string fileName)
        {
            var linesList = new  List<string[]>();
 
            if (!File.Exists(fileName))
            {
                return linesList;
            }
 
            string line;
            var isFirstLine = true;
 
            using (var reader = new StreamReader(fileName))
            {
 
                while (!reader.EndOfStream)
                {
                    line = reader.ReadLine();
 
                    if (isFirstLine)
                    {
                        isFirstLine = false;
                        if (!string.IsNullOrWhiteSpace(line))
                        {
                            var lineDataArgs = new  LineDataArgs(
                                Array.ConvertAll(
                                    line.Split('\t'), p => p.Trim()));
 
                            OnLineHandler(this, lineDataArgs);
 
                        }
 
                        continue;
                    }
 
                    if (string.IsNullOrWhiteSpace(line)) continue;
 
                    var parts = Array.ConvertAll(line.Split('\t'), p => p.Trim());
 
                    if (parts.Length <= 3)
                    {
                        linesList.Add(parts);
                    }
                    else if  (parts.Length > 3)
                    {
                        linesList.Add(parts.Take(3).ToArray());
                    }
                     
                }
 
            }
 
            return linesList;
        }
 
    }
}

Calling form
Creates an instance of FileOperations in form Shown event, using a for-each to iterate lines returned from the file into the DataGridView and on the first line set header text in the following event.

using System;
using System.IO;
using System.Windows.Forms;
using ForumQuestionTabbed.Classes;
 
namespace ForumQuestionTabbed
{
    public partial  class Form1 : Form
    {
        public Form1()
        {
            InitializeComponent();
            dataGridView1.AllowUserToAddRows = false;
            Shown += Form1_Shown;
        }
 
        private void  Form1_Shown(object  sender, EventArgs e)
        {
            var fileName = Path.Combine(AppDomain.CurrentDomain.BaseDirectory, "Import1.txt");
 
            var fileOperations = new  FileOperations();
            fileOperations.OnLineHandler += FileOperations_OnReadingFirstLine;
 
            var data = fileOperations.ReadLines(fileName);
 
            foreach (var rowData in data)
            {
                dataGridView1.Rows.Add(rowData);
            }
 
            ActiveControl = dataGridView1;
            dataGridView1.CurrentCell = dataGridView1.Rows[0].Cells[0];
            dataGridView1.Rows[0].Selected = true;
        }
        /// <summary>
        /// Set column header text to the array elements containing the
        /// first line read from the file.
        /// </summary>
        /// <param name="myObject"></param>
        /// <param name="args"></param>
        private void  FileOperations_OnReadingFirstLine(object myObject, LineDataArgs args)
        {
            for (int index = 0; index < args.LineArray.Length; index++)
            {
                dataGridView1.Columns[index].HeaderText = args.LineArray[index];
            }
        }
    }
}

As with prior examples this one may or may not need additional assertions but the rule of thumb is to assert when unsure, wrap code in try-catch where the catch handles the issue or in some cases (rare) does nothing.

Different containers

In all the examples presented a List<Customer> has been used, if a DataTable is required then replace the code to use a list with code to use a DataTable or simply convert the list to a DataTable using the following method

using System;
using System.Collections.Generic;
using System.ComponentModel;
using System.Data;
 
namespace Example1Library.HelperClasses
{
    public class  ListHelpers
    {
        public static  DataTable ConvertToDataTable<T>(IList<T> data)
        {
            var properties = TypeDescriptor.GetProperties(typeof(T));
            var dataTable = new  DataTable();
 
            foreach (PropertyDescriptor prop in properties)
            {
                dataTable.Columns.Add(prop.Name, 
                    Nullable.GetUnderlyingType(prop.PropertyType) 
                    ?? prop.PropertyType);
            }
 
            foreach (var item in data)
            {
                var row = dataTable.NewRow();
                foreach (PropertyDescriptor prop in properties)
                {
                    row[prop.Name] = prop.GetValue(item) ?? DBNull.Value;
                }
 
                dataTable.Rows.Add(row);
            }
            return dataTable;
 
        }
    }
}

Example usage, ReadCustomerPerfectWorld returns a list<Customer>.

/// <summary>
/// Example where there are zero issues with reading in a delimited file
/// into a strong typed list then finally into a DataTable.
/// </summary>
/// <param name="sender"></param>
/// <param name="e"></param>
private void  PerfectWorldExampleDataTableButton_Click(object sender, EventArgs e)
{
    var fileOperations = new  FileOperations();
    var fileName = Path.Combine(AppDomain.CurrentDomain.BaseDirectory, "Import1.txt");
 
    _bindingSource.DataSource = ListHelpers.ConvertToDataTable(
        fileOperations.ReadCustomersPerfectWorld(fileName));
 
    dataGridView1.DataSource = _bindingSource;
 
}

Visual Studio solution

In the source code class projects are used to do the file operations while a form calls these methods which allows the class project code to be used in other projects. In the Windows Form project there are several text files used to demonstrate good and bad file operations.

Summary

This article has expanded on part 1 of the series for working with delimited files with preprocessing and extension methods. 

See also

C# Processing CSC files part 1

Source code

https://github.com/karenpayneoregon/ImportingCsvFiles1