March 2016

Volume 31 Number 3

[Modern Apps]

Parsing CSV Files in UWP Apps

By Frank La Vigne

Parsing a Comma Separated Value (CSV) file sounds easy enough at first. Quickly, however, the task becomes more and more intricate as the pain points of CSV files become clear. If you’re not familiar with the format, CSV files store data in plain text. Each line in the file constitutes a record. Each record has its fields typically delimited by a comma, hence the name.

Developers today enjoy standards among data exchange formats. The CSV file “format” harkens back to an earlier time in the software industry, before JSON and before XML. While there’s a Request for Comments (RFC) for CSV files (bit.ly/1NsQlvw), it doesn’t enjoy official status. Additionally, it was created in 2005, decades after CSV files started to appear in the 1970s. As a result, there exists quite a bit of variation among CSV files and the rules are a bit murky. For instance, a CSV file might have fields separated by a tab, a semicolon or any other character.

In practical terms, the Excel implementation of CSV import and export has become the de facto standard and is the one seen most widely in the industry, even outside the Microsoft ecosystem. Accordingly, the assumptions I make in this article about what constitutes “correct” parsing and formatting are based on how Excel imports and exports CSV files. While most CSV files will fall in line with the Excel implementation, not every file will. Toward the end of this column, I introduce a strategy to handle such uncertainty.

A fair question to ask is, “Why even write a parser for a decades-old quasi-format in a very new platform?” The answer is simple: Many organizations have legacy data systems. Thanks to the file format’s long life span, nearly all of these legacy data systems can export to CSV. Furthermore, it costs very little in terms of time and effort to export data to CSV. Accordingly, there are plenty of CSV-formatted files in larger enterprise and government data sets.

Designing an All-Purpose CSV Parser

Despite the lack of an official standard, CSV files typically share some common traits.

Generally speaking, CSV files:

  • are plain text,
  • contain one record per line,
  • separate the fields of each record with a delimiter,
  • use one-character delimiters, and
  • present fields in the same order in every record.

These common traits outline a general algorithm, which would consist of three steps:

  1. Split a string along the line delimiter.
  2. Split each line along the field delimiter.
  3. Assign each field value to a variable.

This would be fairly easy to implement. The code in Figure 1 parses the CSV input string into a List<Dictionary<string, string>>.

Figure 1 Parsing the CSV Input String into List<Dictionary<string,string>>

// Split the raw text into records, then each record into fields,
// keyed by the field's zero-based index
var parsedResult = new List<Dictionary<string, string>>();
var records = RawText.Split(this.LineDelimiter);
foreach (var record in records)
{
  var fields = record.Split(this.Delimiter);
  var recordItem = new Dictionary<string, string>();
  var i = 0;
  foreach (var field in fields)
  {
    recordItem.Add(i.ToString(), field);
    i++;
  }
  parsedResult.Add(recordItem);
}

This approach works well for simple input, such as the following office divisions and their sales numbers:

East, 73, 8300
South, 42, 3000
West, 35, 4250
Mid-West, 18, 1200

To retrieve values from the parsed results, you iterate through the List and pull values out of each Dictionary using the zero-based field index. Retrieving the office division field, for example, would be as simple as this:

foreach (var record in parsedData)
{
  string fieldOffice = record["0"];
}

While this works, the code isn’t as readable as it could be.

A Better Dictionary

Many CSV files include a header row for the field name. The parser would be easier for developers to consume if it used the field name as a key for the dictionary. As any given CSV file might not have a header row, you should add a property to convey this information:

public bool HasHeaderRow { get; set; }

For instance, a sample CSV file with a header row might look something like this:

Office Division, Employees, Unit Sales
East, 73, 8300
South, 42, 3000
West, 35, 4250
Mid-West, 18, 1200

Ideally, the CSV parser would be able to take advantage of this piece of metadata. This would make the code more readable. Retrieving the office division field would look something like this:

foreach (var record in parsedData)
{
  string fieldOffice = record["Office Division"];
}
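
Here’s a minimal sketch of how the Figure 1 loop might be extended to honor HasHeaderRow. It assumes every record has the same number of fields as the header row, and it trims the header names because the sample data has a space after each comma:

var parsedResult = new List<Dictionary<string, string>>();
var records = RawText.Split(this.LineDelimiter);
string[] headers = null;
foreach (var record in records)
{
  var fields = record.Split(this.Delimiter);
  if (this.HasHeaderRow && headers == null)
  {
    // The first record supplies the dictionary keys, not data
    headers = fields;
    continue;
  }
  var recordItem = new Dictionary<string, string>();
  for (var i = 0; i < fields.Length; i++)
  {
    // Fall back to the zero-based index when there's no header row
    var key = headers != null ? headers[i].Trim() : i.ToString();
    recordItem.Add(key, fields[i]);
  }
  parsedResult.Add(recordItem);
}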

Blank Fields

Blank fields occur commonly in data sets. In CSV files, a blank field is represented by nothing at all between two delimiters (or between the last delimiter and the end of the record); the delimiter itself is still required. For example, if there were no Employee data for the East office, the record would look like this:

East,,8300

If there were no Unit Sales data, as well as no Employee data, the record would look like this:

East,,

Every organization has its own data-quality standards. Some may choose to place a default value in a blank field to make the CSV file more human-readable. Default values typically would be 0 or NULL for numbers and “” or NULL for strings.
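
If you’d rather have the parser apply such a default itself, a small helper is enough. FieldOrDefault is a hypothetical name of my own, not part of the column’s code:

// Hypothetical helper: substitutes a caller-supplied default for a blank field,
// for example FieldOrDefault(field, "NULL") or FieldOrDefault(field, "0")
private static string FieldOrDefault(string field, string defaultValue)
{
  return string.IsNullOrWhiteSpace(field) ? defaultValue : field;
}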

Staying Flexible

Given all the ambiguities surrounding the CSV file format, the code can’t make any assumptions. There’s no guarantee that the field delimiter will be a comma and there’s no guarantee that the record delimiter will be a new line.

Accordingly, both will be properties of the CsvParser class:

public char Delimiter { get; set; }
public char LineDelimiter { get; set; }

To make life easier for developers consuming this component, you want to provide default settings that will apply in most cases:

private const char DEFAULT_DELIMITER = ',';
private const char DEFAULT_LINE_DELIMITER = '\n';
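
The constructors shown later call an InitializeFields method to apply these defaults. The column doesn’t show its body, but a plausible sketch, assuming the properties and constants above, looks like this:

private void InitializeFields()
{
  // Start from the common-case defaults; callers can override the properties
  this.Delimiter = DEFAULT_DELIMITER;
  this.LineDelimiter = DEFAULT_LINE_DELIMITER;
  this.HasHeaderRow = false;
}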

Should someone wish to change the default delimiter to a tab character, the code is quite simple:

CsvParser csvParser = new CsvParser();
csvParser.Delimiter = '\t';

Escaped Characters

What would happen if the field itself contains the delimiter character, like a comma? For example, instead of referring to sales by region, what if the data had city and state? Typically, CSV files work around this by encasing the entire field in quotes, like so:

Office Division, Employees, Unit Sales
"New York, NY", 73, 8300
"Richmond, VA", 42, 3000
"San Jose, CA", 35, 4250
"Chicago, IL", 18, 1200

The naive splitting algorithm would turn the single field value “New York, NY” into two discrete fields split along the comma, “New York” and “NY.”

In this case, separating the values of city and state might not be detrimental, but there are still extra quote characters polluting the data. While they’re easy enough to remove here, more complex data might not be so easy to clean up.
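
A quick snippet makes the damage concrete; the values here match the sample data above:

// Naive splitting ignores the quotes entirely
var fields = "\"New York, NY\",73,8300".Split(',');
// fields: { "\"New York", " NY\"", "73", "8300" } -- four fields, quotes still attached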

Now It Gets Complicated

This method of escaping commas inside fields introduces another character to be escaped: the quote character. What if, for example, there were quotes in the original data as shown in Figure 2?

Figure 2 Original Data with Quotes

Office Division   Employees   Unit Sales   Office Motto
New York, NY      73          8300         "We sell great products"
Richmond, VA      42          3000         "Try it and you'll want to buy it"
San Jose, CA      35          4250         "Powering Silicon Valley!"
Chicago, IL       18          1200         "Great products at great value"

The raw text in the CSV file itself would look like this:

Office Division, Employees, Unit Sales, Office Motto
"New York, NY",73,8300,"""We sell great products"""
"Richmond, VA",42,3000,"""Try it and you'll want to buy it"""
"San Jose, CA",35,4250,"""Powering Silicon Valley!"""
"Chicago, IL",18,1200,"""Great products at great value"""

The single quotation mark (") in the source data becomes three quotation marks (""") in the file, which adds an interesting twist to the algorithm. Of course, the first reasonable question to ask is this: Why did one quote turn into three? Just as with the Office Division field, the contents of the field get surrounded by quotes. Quote characters that are part of the content are then escaped by doubling them up, so " becomes "". The third quote at each end is simply the field's own opening or closing quote.

Another example (Figure 3) might demonstrate the process more clearly.

Figure 3 Quote Data

Quote
"The only thing we have to fear is fear itself." -President Roosevelt
"Logic will get you from A to B. Imagination will take you everywhere." -Albert Einstein

The data in Figure 3 would be represented in CSV as this:

Quote
"""The only thing we have to fear is fear itself."" -President Roosevelt"
"""Logic will get you from A to B. Imagination will take you everywhere."" -Albert Einstein"

It might be clearer now that the field is wrapped in quotation marks and that the individual quotation marks in the field’s content are doubled up.
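
To make the rule concrete, here’s a minimal quote-aware field splitter. This is a sketch of the technique, not the column’s actual DefaultParserEngine code: it walks the line once and tracks whether the current position is inside a quoted field.

// Requires System.Collections.Generic and System.Text
private static IList<string> SplitQuotedLine(char delimiter, char quote, string csvLine)
{
  var fields = new List<string>();
  var current = new StringBuilder();
  var inQuotes = false;
  for (int i = 0; i < csvLine.Length; i++)
  {
    char c = csvLine[i];
    if (c == quote)
    {
      if (inQuotes && i + 1 < csvLine.Length && csvLine[i + 1] == quote)
      {
        // A doubled quote inside a quoted field is a literal quote
        current.Append(quote);
        i++;
      }
      else
      {
        // Otherwise, the quote opens or closes the quoted field
        inQuotes = !inQuotes;
      }
    }
    else if (c == delimiter && !inQuotes)
    {
      // Delimiters only end a field when outside quotes
      fields.Add(current.ToString());
      current.Clear();
    }
    else
    {
      current.Append(c);
    }
  }
  fields.Add(current.ToString());
  return fields;
}

Running it over the Roosevelt line from the CSV above returns a single field whose inner quotation marks survive while the outer pair is stripped.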

Edge Cases

As I mentioned in the opening section, not all files will adhere to the Excel implementation of CSV. The lack of a true specification for CSV makes it difficult to write one parser to handle every CSV file in existence. Edge cases will most certainly exist and that means the code has to leave a door open to interpretation and customization.

Inversion of Control to the Rescue

Given the CSV format’s hazy standard, it’s not practical to write a comprehensive parser for all imaginable cases. It’s more realistic to write a parser suited to the particular needs of an app. Using Inversion of Control lets you customize the parsing engine for a particular need.

To accomplish this, I’ll create an interface to outline the two core functions of parsing: extracting records and extracting fields. I decided to make the IParserEngine interface asynchronous. This ensures that any app using this component remains responsive, no matter how large the CSV file is:

public interface IParserEngine
{
  // Splits the raw CSV text into individual record lines
  IAsyncOperation<IList<string>> ExtractRecords(char lineDelimiter, string csvText);
  // Splits a single record line into fields, honoring the quote character
  IAsyncOperation<IList<string>> ExtractFields(char delimiter, char quote, string csvLine);
}

Then I add the following property to the CsvParser class:

public IParserEngine ParserEngine { get; private set; }

I then offer developers a choice: use the default parser or inject their own. To make it simple, I’ll overload the constructor:

public CsvParser()
{
  // No engine supplied: fall back to the built-in default
  InitializeFields();
  this.ParserEngine = new ParserEngines.DefaultParserEngine();
}

public CsvParser(IParserEngine parserEngine)
{
  // Caller injects a custom engine
  InitializeFields();
  this.ParserEngine = parserEngine;
}

The CsvParser class now provides the basic infrastructure, but the actual parsing logic lives in the injected IParserEngine implementation. For developers’ convenience, I created the DefaultParserEngine, which can process most CSV files. I took into account the most likely scenarios developers will encounter.
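
Putting the pieces together, consuming code might look like the following. Note that ParseAsync is a hypothetical entry-point name; the column doesn’t show the parse method itself:

// Hypothetical usage; ParseAsync stands in for the parser's real entry point
var csvParser = new CsvParser();
csvParser.HasHeaderRow = true;
var parsedData = await csvParser.ParseAsync(rawCsvText);
foreach (var record in parsedData)
{
  string fieldOffice = record["Office Division"];
}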

Reader Challenge

I have taken into account the bulk of the scenarios developers will encounter with CSV files. However, the indefinite nature of the CSV format makes creating a universal parser for all cases impractical. Factoring in every variation and edge case would add significant cost and complexity, along with impacting performance.

I’m certain there are CSV files out “in the wild” that the DefaultParserEngine won’t be able to handle. This is what makes the dependency injection pattern a great fit. If developers need a parser that can handle an extreme edge case, or want to write something more performant, they’re certainly welcome to do so. Parser engines can be swapped out with no changes to the consuming code.
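
As a sketch of what that swap might look like, here’s a hypothetical engine that does nothing but raw splits, with no quote handling; that trade-off can make sense for feeds known to contain no quoted fields:

using System;
using System.Collections.Generic;
using System.Threading.Tasks;
using Windows.Foundation;

// Hypothetical engine: raw splits only, no quote handling
public class SimpleSplitParserEngine : IParserEngine
{
  public IAsyncOperation<IList<string>> ExtractRecords(char lineDelimiter, string csvText)
  {
    return Task.Run(() => (IList<string>)csvText.Split(lineDelimiter)).AsAsyncOperation();
  }

  public IAsyncOperation<IList<string>> ExtractFields(char delimiter, char quote, string csvLine)
  {
    return Task.Run(() => (IList<string>)csvLine.Split(delimiter)).AsAsyncOperation();
  }
}

// Injected via the constructor overload; consuming code is unchanged
var csvParser = new CsvParser(new SimpleSplitParserEngine());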

The code for this project is available at bit.ly/1To1IVI.

Wrapping Up

CSV files are a leftover from days gone by and, despite the best efforts of XML and JSON, are still a commonly used data exchange format. CSV files lack a common specification or standard; while they often share common traits, none of those traits is guaranteed to be present in any given file. This makes parsing a CSV file a non-trivial exercise.

Given a choice, most developers would probably exclude CSV files from their solutions. However, their widespread presence in legacy enterprise and government data sets may preclude that as an option in many scenarios.

Simply put, there’s a need for a CSV parser for Universal Windows Platform (UWP) apps, and a real-world CSV parser has to be flexible and robust. Along the way, I demonstrated a practical use for dependency injection to provide that flexibility. While this column and its associated code target UWP apps, the concept and code apply to other platforms capable of running C#, such as Microsoft Azure or Windows desktop development.


Frank La Vigne is a technology evangelist on the Microsoft Technology and Civic Engagement team, where he helps users leverage technology in order to create a better community. He blogs regularly at FranksWorld.com and has a YouTube channel called Frank’s World TV (youtube.com/FranksWorldTV).

Thanks to the following technical expert for reviewing this article: Rachel Appel