Writing Large Excel Files with the Open XML SDK

In my last post, I showed you guys how to use the Open XML SDK to read and parse large Excel files. Today, I am going to show you how to write large Excel files with the Open XML SDK. Like in the case of reading, the Open XML SDK provides two approaches for writing Open XML files: 1. using the DOM, 2. using the SAX-like writing capability. All of my examples on this blog have been around using the Open XML SDK DOM to write Open XML files. The DOM makes it easy because everything is strongly typed. Unfortunately, the DOM approach requires loading entire Open XML parts in memory, which can result in Out of Memory exceptions, especially when dealing with very large files. Let's compare the DOM and SAX-like approaches.

If you want to jump straight into the code, feel free to download the solution here.

Scenario

Imagine a scenario where I work for a financial company that deals with very large amounts of data. This data gets updated on a daily basis and is stored in a database. The company's analysts use Excel as a means of mining and understanding the data. To support the analysts' needs, the company needs to create and update Excel spreadsheets every morning. These spreadsheets typically grow to sizes exceeding hundreds of megabytes.

Solution

The scenario described above uses many of the general concepts described in a previous post on writing data from a database into an Excel file. Today, I will show you how to accomplish the scenario using both the DOM and SAX-like approaches. To accomplish the scenario we will need to take the following actions:

  1. Create an empty Spreadsheet template using Excel
  2. Open the template using the Open XML SDK
  3. Connect to the database that contains all the relevant data (for the sake of simplicity, this blog post will instead write random values to cells)
  4. Get the main workbook part
  5. Get the appropriate worksheet
  6. Get the sheetData element, which will contain the data for a given spreadsheet
  7. For the number of rows and columns that represent the dimensions of data to be inserted, add a new cell with appropriate data
  8. Append rows to the sheetData element
  9. Save and close

Step 1 – Setup the right template

As is the case for most of my blog posts, the easiest way to start any Open XML SDK solution is to create the right starting template. In this case, I am going to create a blank Excel spreadsheet with one sheet in Excel. The template will look like the following:

image

Open XML SDK DOM Approach

The SDK DOM approach is pretty straight forward. The method below will write random values to cells within a given set of dimensions specified by the number of rows and columns. The random values will be calculated by Excel using the RAND() formula, which is specified by the CellFormula class.

static void WriteRandomValuesDOM(string filename, int numRows, int numCols) { using (SpreadsheetDocument myDoc = SpreadsheetDocument.Open(filename, true)) { WorkbookPart workbookPart = myDoc.WorkbookPart; WorksheetPart worksheetPart = workbookPart.WorksheetParts.First();   SheetData sheetData = worksheetPart.Worksheet.Elements<SheetData>().First();   for (int row = 0; row < numRows; row++) { Row r = new Row(); for (int col = 0; col < numCols; col++) { Cell c = new Cell(); CellFormula f = new CellFormula(); f.CalculateCell = true; f.Text = "RAND()"; c.Append(f); CellValue v = new CellValue(); c.Append(v); r.Append(c); } sheetData.Append(r); } } }

The method above will definitely work for writing cell values within a worksheet. However, if the dimension specified is quite large then the program's memory footprint will also be quite large. In fact, you are left at the mercy of the garbage collector, which may result in the program throwing an Out of Memory exception.

Open XML SDK SAX-Like Approach

The OpenXmlReader and OpenXmlWriter can be used together to alleviate some of the performance and/or memory concerns when developing an Open XML solution. You can create an OpenXmlReader and an OpenXmlWriter object on an entire part without loading the part in memory. The OpenXmlReader class allows you to iterate the entire xml tree of a part one node at a time. Similarly, the OpenXmlWriter class allows you to write xml nodes within a part one node at a time.

In the case of this solution, we will need to read the original worksheet part and then update the worksheet with the appropriate data. We are going to use an OpenXmlReader object to read the original worksheet part and find the place in the file where we will add data. We are then going to use an OpenXmlWriter object to write values to the worksheet part. You may be tempted to create an OpenXmlReader object and an OpenXmlWriter object on the same part. I would recommend against using this strategy. Writing data to a part that you are reading using the OpenXmlReader and the OpenXmlWriter may result in the OpenXmlReader losing its place in the xml tree. Instead, you should create a temporary worksheet part where you can write values to using the OpenXmlWriter class. The following code creates an OpenXmlReader object and an OpenXmlWriter object for worksheet parts:

static void WriteRandomValuesSAX(string filename, int numRows, int numCols) { using (SpreadsheetDocument myDoc = SpreadsheetDocument.Open(filename, true)) { WorkbookPart workbookPart = myDoc.WorkbookPart; WorksheetPart worksheetPart = workbookPart.WorksheetParts.First(); string origninalSheetId = workbookPart.GetIdOfPart(worksheetPart);   WorksheetPart replacementPart = workbookPart.AddNewPart<WorksheetPart>(); string replacementPartId = workbookPart.GetIdOfPart(replacementPart);   OpenXmlReader reader = OpenXmlReader.Create(worksheetPart); OpenXmlWriter writer = OpenXmlWriter.Create(replacementPart);   ... } }

Once we have the appropriate objects created, the next step is to write everything we read into the new worksheet part until we reach the SheetData object. Once we reach the SheetData object, we can start writing cell values within a given set of dimensions specified by the number of rows and columns. The following code accomplishes this task:

static void WriteRandomValuesSAX(string filename, int numRows, int numCols) { using (SpreadsheetDocument myDoc = SpreadsheetDocument.Open(filename, true)) { ...   Row r = new Row(); Cell c = new Cell(); CellFormula f = new CellFormula(); f.CalculateCell = true; f.Text = "RAND()"; c.Append(f); CellValue v = new CellValue(); c.Append(v);   while (reader.Read()) { if (reader.ElementType == typeof(SheetData)) { if (reader.IsEndElement) continue; writer.WriteStartElement(new SheetData());   for (int row = 0; row < numRows; row++) { writer.WriteStartElement(r); for (int col = 0; col < numCols; col++) { writer.WriteElement(c); } writer.WriteEndElement(); }   writer.WriteEndElement(); } else { if (reader.IsStartElement) { writer.WriteStartElement(reader); } else if (reader.IsEndElement) { writer.WriteEndElement(); } } }   ... } }

Notice that the OpenXmlWriter class requires us to write both the start and end elements. Invoking the method WriteEndElement() off of the OpenXmlWriter class ensures that the appropriate start element is closed.

We're almost done. The last step is to close the reader and writer and replace the original worksheet part with the replacement part. Remember that the main workbook part contains the list of worksheets within the workbook, which means we need to update the xml contained in that part. The following code accomplishes these tasks:

static void WriteRandomValuesSAX(string filename, int numRows, int numCols) { using (SpreadsheetDocument myDoc = SpreadsheetDocument.Open(filename, true)) { ...   reader.Close(); writer.Close();   Sheet sheet = workbookPart.Workbook.Descendants<Sheet>() .Where(s => s.Id.Value.Equals(origninalSheetId)).First(); sheet.Id.Value = replacementPartId; workbookPart.DeletePart(worksheetPart); } }

Conclusion

This post gives you more insight into reading and writing large Open XML files, while avoiding performance or Out of Memory exceptions. Let me know if you guys have other Open XML SDK related post requests.

Zeyad Rajabi