The XML Files

XML Data Migration Case Study: GEDCOM

Aaron Skonnard

Code download available at: XMLFiles0405.exe (143 KB)

Contents

GEDCOM 101
Mapping GEDCOM to XML
Designing the Transformation
GedcomReader Implementation
Parsing GEDCOM 5.5
Using GedcomReader
XSLT for GEDCOM 6.0
Where Are We?

XML's ubiquity and continually improving tool support have created a magnetism that attracts organizations everywhere. As organizations move to XML, they must also provide a coherent data migration strategy that allows their users to bring old files forward. This is a nontrivial problem that typically requires tedious code to implement the transformation process. The System.Xml namespace in the Microsoft® .NET Framework, however, can greatly simplify these data migration challenges through its extensible APIs and support for XSLT.

An example of data migration is what's currently happening around genealogy data formats. Genealogists have long relied on the Genealogical Data Communications (GEDCOM) 5.5 format for sharing genealogical information. GEDCOM 5.5 is text based but not XML based. A beta version of GEDCOM 6.0 is available and is completely based on XML. But what about the gigabytes of genealogical information that can still be found in GEDCOM 5.5 format? This presents an interesting data migration challenge that really should not be ignored.

In this column I'll walk you through solving this data migration problem using System.Xml. This process can serve as a blueprint for other data migration problems you may face.

GEDCOM 101

GEDCOM was developed to facilitate exchanging genealogical data across different genealogy programs and systems. A common format like GEDCOM allows users to share their work with others regardless of the program they're using.

Version 5.5 is the most widely used version of the various GEDCOM specifications (see The GEDCOM Standard Release 5.5). GEDCOM 5.5 relies on a simple text-based grammar that leverages line delimiters and level numbers to structure family tree information. Figure 1 provides a sample GEDCOM 5.5 file.

Figure 1 GEDCOM 5.5

0 HEAD
1 SOUR AcmeGen
2 VERS 5.0
2 NAME Acme Genealogy
1 DATE 13 February 2004
2 TIME 03:50:51
1 GEDC
2 VERS 5.5
2 FORM LINEAGE-LINKED
1 CHAR ANSEL
1 SUBM @SUB01@
1 SUBN @N01@
0 @SUB01@ SUBM
1 NAME Aaron Skonnard
1 ADDR 456 Main
2 CONT Salt Lake City, Utah 84150
0 @N01@ SUBN
1 DESC 1
1 ORDI N
0 @I100294220729@ INDI
1 NAME Bernt Olaf /Skonnord/
1 SEX M
1 BIRT
2 DATE 14 JUN 1860
2 PLAC Snertingdal, Oppland, Norway
1 FAMC @F306458249@
1 SOUR @S01@
•••

Each GEDCOM line contains a level number, a tag, and an optional value. Multiple lines constitute a GEDCOM record. A level of 0 marks the beginning of a new GEDCOM record. Every line that follows is part of the record until you reach another line with a level of 0. The tag name conveys meaning about the information on the line as defined in the specification.
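The record rule described above (a level of 0 opens a record, and every following line belongs to it) can be sketched in a few lines of C#. This is only an illustration and is not part of the downloadable sample; the class name GedcomRecords and its methods are hypothetical, and the sketch assumes every non-blank line starts with a valid level number.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of the article's sample code.
public static class GedcomRecords
{
    // Returns the level number at the start of a line, e.g. 2 for "2 VERS 5.5".
    public static int LevelOf(string line)
    {
        int space = line.IndexOf(' ');
        string levelText = space < 0 ? line : line.Substring(0, space);
        return int.Parse(levelText);
    }

    // Groups lines into records: a level-0 line opens a new record and
    // every following line belongs to it until the next level-0 line.
    public static List<List<string>> SplitIntoRecords(IEnumerable<string> lines)
    {
        var records = new List<List<string>>();
        foreach (string raw in lines)
        {
            string line = raw.Trim();
            if (line.Length == 0) continue; // skip blank lines

            if (LevelOf(line) == 0 || records.Count == 0)
                records.Add(new List<string>());

            records[records.Count - 1].Add(line);
        }
        return records;
    }
}
```

Passing the lines of Figure 1 to SplitIntoRecords would yield one list per record (HEAD, SUBM, SUBN, INDI, and so on), each starting with its level-0 line.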

In the example shown in Figure 1, HEAD is the first record and it contains six children (SOUR, DATE, GEDC, CHAR, SUBM, SUBN). SOUR contains two children (VERS and NAME) while DATE contains one child (TIME). The increasing level numbers indicate parent-to-child relationships. A line may also contain a unique identifier, as shown in the following snippet:

0 @SUB01@ SUBM
1 NAME Aaron Skonnard
1 ADDR 456 Main
2 CONT Salt Lake City, Utah 84150

In this example, the SUBM record has a unique identifier of SUB01. This ID must be unique within the scope of the document. Identifiers make it possible to establish links between records. For example, the HEAD record referenced this SUBM record on the following line:

1 SUBM @SUB01@

These are the basics of the GEDCOM 5.5 grammar. GEDCOM 5.5 is capable of representing sophisticated tree structures through its use of level numbers, identifiers, and cross-referencing capabilities. I don't have enough space in this column to cover GEDCOM semantics in more detail. Suffice it to say that GEDCOM 5.5 makes it possible to express a wide range of genealogy-related information.

Since GEDCOM was primarily designed to represent tree structures (family trees) in an interoperable manner, moving GEDCOM to an XML format seems like a perfect fit.

Mapping GEDCOM to XML

The GEDCOM 5.5 grammar maps nicely to XML. One way to define a simple XML mapping is to convert each GEDCOM line into an XML element with the same name. The line's optional data, if any, can be placed in an attribute named "value". Identifiers can be placed in an attribute named "id" and references to other records can be placed in an attribute named "idref". Figure 2 may help you visualize the mapping. Applying this simple mapping to the GEDCOM 5.5 file shown in Figure 1 produces the XML document shown in Figure 3.

Figure 3 A GEDCOM XML Sample

<GEDCOM>
  <HEAD>
    <SOUR value="AcmeGen">
      <VERS value="5.0"/>
      <NAME value="Acme Genealogy"/>
    </SOUR>
    <DATE value="13 February 2004">
      <TIME value="03:50:51"/>
    </DATE>
    <GEDC>
      <VERS value="5.5"/>
      <FORM value="LINEAGE-LINKED"/>
    </GEDC>
    <CHAR value="ANSEL"/>
    <SUBM idref="SUB01"/>
    <SUBN idref="N01"/>
  </HEAD>
  <SUBM id="SUB01">
    <NAME value="Aaron Skonnard"/>
    <ADDR value="456 Main">
      <CONT value="Salt Lake City, Utah 84150"/>
    </ADDR>
  </SUBM>
  <SUBN id="N01">
    <DESC value="1"/>
    <ORDI value="N"/>
  </SUBN>
  <INDI id="I100294220729">
    <NAME value="Bernt Olaf /Skonnord/"/>
    <SEX value="M"/>
    <BIRT>
      <DATE value="14 JUN 1860"/>
      <PLAC value="Snertingdal, Oppland, Norway"/>
    </BIRT>
    <FAMC idref="F306458249"/>
    <SOUR idref="S01"/>
  </INDI>
  •••
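To make the line-level mapping concrete, here is a minimal sketch that turns a single GEDCOM 5.5 line into an XmlElement carrying the id, idref, and value attributes described earlier. It is not the sample's ParseGedcomLine; the GedcomLineMapper class is a hypothetical name, nesting is ignored, and the sketch assumes a pointer is a value that both starts and ends with "@".

```csharp
using System;
using System.Xml;

// Hypothetical helper, not part of the article's sample code.
public static class GedcomLineMapper
{
    // Maps one GEDCOM line to an element: tag -> element name,
    // @ID@ -> id attribute, @POINTER@ -> idref attribute, other text -> value attribute.
    public static XmlElement ToElement(XmlDocument doc, string line)
    {
        // Split into at most three parts: level, tag (or @ID@), and the remainder.
        string[] parts = line.Trim().Split(new[] { ' ' }, 3);
        string id = "";
        string tag = parts[1];
        string rest = parts.Length > 2 ? parts[2] : "";

        if (tag.StartsWith("@", StringComparison.Ordinal))
        {
            // The line carries an identifier, so the tag is the next token.
            id = tag.Trim('@');
            string[] tagAndRest = rest.Split(new[] { ' ' }, 2);
            tag = tagAndRest[0];
            rest = tagAndRest.Length > 1 ? tagAndRest[1] : "";
        }

        XmlElement element = doc.CreateElement(tag);
        if (id.Length > 0)
            element.SetAttribute("id", id);

        bool isPointer = rest.Length > 2
            && rest.StartsWith("@", StringComparison.Ordinal)
            && rest.EndsWith("@", StringComparison.Ordinal);

        if (isPointer)
            element.SetAttribute("idref", rest.Trim('@'));
        else if (rest.Length > 0)
            element.SetAttribute("value", rest);

        return element;
    }
}
```

For example, calling ToElement with the line "1 SUBM @SUB01@" should produce a SUBM element whose idref attribute is SUB01, matching the corresponding element in Figure 3.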

Figure 2 Mapping GEDCOM 5.5 to GEDCOM XML


The XML version conveys the same information, but now it can be processed by a much wider range of tools and technologies. For example, you could process the XML file with your favorite XML API (such as DOM, SAX, XmlTextReader, or XPathNavigator), a query language like XPath or XQuery, or a transformation language like XSLT. Ultimately, once you have GEDCOM data in XML format, you can do just about anything with it.

Designing the Transformation

So how do you transform the GEDCOM 5.5 file into XML format? You can't use XSLT because the input format, GEDCOM 5.5, is not XML based. Instead, you have to write code that manually parses the GEDCOM 5.5 format and produces the XML format shown in Figure 3. Using the .NET Framework, the easiest way to accomplish both of these tasks at the same time is to write a custom XmlReader class.

I've provided a complete sample that illustrates how this is done (you can download the sample from the link at the top of this article). The sample contains several classes that work together in translating the GEDCOM 5.5 information into a logical XML document. To start, I defined an interface called IGedcomNode that several other classes implement. This interface models a logical XML node that contains information from a GEDCOM source. It allows the consumer to access the node's XML name, value, and type, as well as the GEDCOM 5.5 level number that appears on each GEDCOM line (exposed through the LineNumber property). I'll need the level number later on in order to figure out when a node ends and when a new node begins.

Then I defined several concrete classes to model the different types of XML nodes that will make up the resulting XML document (see Figure 4). These include GedcomElement, GedcomEndElement, GedcomAttribute, and GedcomText. Each of these classes implements the IGedcomNode interface. I also defined a class called GedcomLine whose purpose is to manage a sequence of IGedcomNode objects that came from a particular GEDCOM line.

Figure 4 GEDCOM 5.5 to XML Classes

// standard interface for all XML nodes that we'll be
// creating from the GEDCOM 5.5 file
public interface IGedcomNode
{
    string Name {get;}
    string Value {get;}
    int LineNumber {get;}
    XmlNodeType NodeType {get;}
}

public class GedcomAttribute : IGedcomNode
{
    ... // represents a value from the GEDCOM file that will
        // now become an XML attribute node
}

public class GedcomElement : IGedcomNode
{
    ... // represents a tag from the GEDCOM file that will
        // now become an XML element node
}

public class GedcomEndElement : IGedcomNode
{
    ... // a node that marks the end of an XML element node
}

public class GedcomText : IGedcomNode
{
    ... // represents a value from the GEDCOM file that will
        // now become an XML text node
}

public class GedcomLine
{
    ... // represents a GEDCOM 5.5 line that has been parsed
        // into an array of IGedcomNode objects (see above)
}

GedcomReader Implementation

The meat of the implementation is found in the GedcomReader class. GedcomReader derives from System.Xml.XmlReader and overrides its abstract members. The fundamental purpose of GedcomReader is to parse an underlying GEDCOM 5.5 file and expose it as an XML document. Hence, the consumer doesn't need to know anything about GEDCOM 5.5. The consumer only needs to be familiar with the XML format shown in Figure 3.

My implementation of GedcomReader has a single constructor that takes a GEDCOM 5.5 file name. The constructor opens the supplied GEDCOM 5.5 file with a System.IO.StreamReader object and prepares to begin parsing. It also provides a Close method that closes the underlying file stream when the user is finished.

The Read method (see Figure 5) does the bulk of the work because it advances the logical cursor through the document. Therefore, the implementation of Read must deal with the translation between the GEDCOM 5.5 format and the target XML format that I'm trying to simulate (see Figure 3).

Figure 5 GedcomReader Implementation with GEDCOM 5.5 Parsing

public class GedcomReader : XmlReader
{
    •••
    private GedcomLine ParseGedcomLine(string lineText)
    {
        string xref_id="", tag="", pointer="", linevalue="";
        int nextPart = 0;

        // split GEDCOM line into parts
        string[] lineParts = lineText.Split(' ');

        // first part is always the GEDCOM level number
        // (the sample stores it in a variable named lineNumber)
        int lineNumber = int.Parse(lineParts[nextPart++]);

        // check to see if line has an ID and get tag name
        if (lineParts[nextPart].StartsWith("@"))
        {
            xref_id = lineParts[nextPart++].Replace("@", "");
            tag = lineParts[nextPart++];
        }
        else
            tag = lineParts[nextPart++];

        // check to see if value is a pointer or text
        if (lineParts.Length > nextPart &&
            lineParts[nextPart].StartsWith("@"))
            pointer = lineParts[nextPart++].Replace("@", "");
        if (lineParts.Length > nextPart)
            linevalue = GetRemainingValue(lineParts, nextPart);

        GedcomLine line = new GedcomLine(lineNumber);
        GedcomElement e = new GedcomElement(tag, lineNumber);
        if (xref_id != "")
            e.AddAttribute(new GedcomAttribute("id", xref_id, lineNumber));
        if (pointer != "")
            e.AddAttribute(new GedcomAttribute("idref", pointer, lineNumber));
        line.Add(e);
        if (linevalue != "")
            e.AddAttribute(new GedcomAttribute("value", linevalue, lineNumber));
        return line;
    }

    public override bool Read()
    {
        try
        {
            switch (readState)
            {
                case ReadState.Initial:
                {
                    // first call: push the root element of the target format
                    nodeStack.Push(new GedcomElement("GEDCOM", -1));
                    readState = ReadState.Interactive;
                    return true;
                }
                case ReadState.Interactive:
                {
                    // deal with attributes and end element nodes
                    ...
                    if (currentLineNodes != null && !currentLineNodes.EOF)
                    {
                        // more cached nodes remain from the current line
                        currentLineNodes.MoveNext();
                        nodeStack.Push(currentLineNodes.Current);
                    }
                    else
                    {
                        // need to parse a new GEDCOM line
                        currentLineText = fileReader.ReadLine();
                        if (currentLineText == null)
                        {
                            readState = ReadState.EndOfFile;
                            return false;
                        }
                        currentLineNodes = ParseGedcomLine(currentLineText);
                        ...
                        nodeStack.Push(currentLineNodes.Current);
                    }
                    return true;
                }
                default:
                    return false;
            }
        }
        catch (Exception)
        {
            readState = ReadState.Error;
            throw; // rethrow without resetting the stack trace
        }
    }
    ... // remaining methods omitted
}

The implementation uses a stack to keep track of the current XML node and its ancestors. Read will periodically push nodes onto the stack and eventually pop them off as the cursor moves through the underlying document. The node at the top of the stack is considered the current node.

The implementation of Read first checks the ReadState to see if this is the first call to Read. If so, it creates an XML element named "GEDCOM" (using a GedcomElement object) and pushes it onto the stack—this is the root element of the target format (see Figure 3). Then it sets the ReadState to "Interactive".

On future calls to Read, the implementation will read a line from the underlying GEDCOM 5.5 file and call ParseGedcomLine, which then produces a GedcomLine object containing several IGedcomNode objects. The GedcomLine object serves as a cache. This is necessary because each call to Read only processes a single XML node. So after a line has been parsed, future calls will process the remaining nodes in the GedcomLine object until the end of the cached collection is reached. At this point, a new line is parsed and the process repeats itself.

The other tricky thing that Read has to deal with is inserting XmlNodeType.EndElement nodes in the correct sequence.
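The end-element problem comes down to comparing level numbers: before a new line's element is opened, every open element whose level is greater than or equal to the new level must be closed. The following sketch shows that idea in isolation using XmlWriter and a stack of levels. It is not how the sample's Read method is written; LevelNesting is a hypothetical name, and for brevity the sketch assumes lines of the form "level TAG [value]" with no identifiers or pointers.

```csharp
using System;
using System.Collections.Generic;
using System.Text;
using System.Xml;

// Hypothetical helper, not part of the article's sample code.
public static class LevelNesting
{
    // Writes nested XML for GEDCOM-style lines, closing open elements
    // whenever the next line is not deeper than the element on top of the stack.
    public static string ToXml(IEnumerable<string> lines)
    {
        var output = new StringBuilder();
        var settings = new XmlWriterSettings { OmitXmlDeclaration = true };
        var openLevels = new Stack<int>();

        using (XmlWriter writer = XmlWriter.Create(output, settings))
        {
            writer.WriteStartElement("GEDCOM");

            foreach (string raw in lines)
            {
                string line = raw.Trim();
                if (line.Length == 0) continue; // skip blank lines

                string[] parts = line.Split(new[] { ' ' }, 3);
                int level = int.Parse(parts[0]);

                // Close siblings and deeper descendants before opening the new element.
                while (openLevels.Count > 0 && openLevels.Peek() >= level)
                {
                    writer.WriteEndElement();
                    openLevels.Pop();
                }

                writer.WriteStartElement(parts[1]);
                if (parts.Length > 2)
                    writer.WriteAttributeString("value", parts[2]);
                openLevels.Push(level);
            }

            // End of input: close whatever is still open, then the root.
            while (openLevels.Count > 0)
            {
                writer.WriteEndElement();
                openLevels.Pop();
            }
            writer.WriteEndElement();
        }
        return output.ToString();
    }
}
```

A pull-model reader such as GedcomReader has to make the same comparison, but it surfaces each closing step as a separate EndElement node on successive calls to Read instead of writing them all at once.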

Parsing GEDCOM 5.5

The parsing code (see ParseGedcomLine) takes care of the GEDCOM 5.5 parsing details and generates the proper XML format using the various classes that implement IGedcomNode. You can change the code in this method to customize the target XML format. The code shown in Figure 5 produces the XML format shown in Figure 3.

With this code in place, overriding the remaining XmlReader members becomes quite simple (see Figure 6). For example, the implementations of LocalName and Name pull the current node from the top of the stack and ask for its name through the IGedcomNode interface. The override for NodeType does the same thing. As you can see, the core transformation functionality takes place in the implementation of Read. You can download the complete sample for more details on the remaining implementations.

Figure 6 XmlReader Property Implementation

public class GedcomReader : XmlReader
{
    ... // remaining methods omitted

    public override string LocalName
    {
        get
        {
            if (readState != ReadState.Interactive)
                return String.Empty;
            IGedcomNode node = nodeStack.Peek() as IGedcomNode;
            return node.Name;
        }
    }

    public override string Name
    {
        get
        {
            if (readState != ReadState.Interactive)
                return String.Empty;
            IGedcomNode node = nodeStack.Peek() as IGedcomNode;
            return node.Name;
        }
    }

    public override XmlNodeType NodeType
    {
        get
        {
            if (readState != ReadState.Interactive)
                return XmlNodeType.None;
            IGedcomNode node = nodeStack.Peek() as IGedcomNode;
            return node.NodeType;
        }
    }

    public override string Value
    {
        get
        {
            IGedcomNode node = nodeStack.Peek() as IGedcomNode;
            return node.Value;
        }
    }
}

Using GedcomReader

With the completed GedcomReader implementation, you can now process GEDCOM 5.5 documents as if they were really XML documents structured like the one shown in Figure 3. For example, you can begin writing XmlReader-based code that streams through the file and inspects each XML node one at a time:

GedcomReader gr = new GedcomReader("skonnord.ged");
while (gr.Read())
    Console.WriteLine("{0}: {1}", gr.NodeType, gr.Name);
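Because GedcomReader is an XmlReader, helper code can be written against the base class and reused with any XML source. The sketch below counts elements with a given name; ReaderStats and CountElements are hypothetical names introduced only for illustration.

```csharp
using System;
using System.Xml;

// Hypothetical helper, not part of the article's sample code.
public static class ReaderStats
{
    // Streams through any XmlReader and counts the start tags with the given name.
    public static int CountElements(XmlReader reader, string name)
    {
        int count = 0;
        while (reader.Read())
        {
            if (reader.NodeType == XmlNodeType.Element && reader.Name == name)
                count++;
        }
        return count;
    }
}
```

Passing a GedcomReader and the name "INDI" would count the individuals in a GEDCOM 5.5 file, assuming the reader reports element nodes as described in this column. The same method works unchanged with a reader over a real XML file.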

Or you could load the document into an XmlDocument and process it using XPath. For example, the following code snippet selects the first INDI element that has a NAME element whose value contains 'Bernt':

GedcomReader gr = new GedcomReader("skonnord.ged");
XmlDocument doc = new XmlDocument();
doc.Load(gr);
gr.Close(); // done using GedcomReader
XmlNode indi = doc.SelectSingleNode(
    "/GEDCOM/INDI[contains(NAME/@value, 'Bernt')]");

After loading the GEDCOM 5.5 source into an XmlDocument object, you can serialize it to XML 1.0 using the Save method:

GedcomReader gr = new GedcomReader("skonnord.ged");
XmlDocument doc = new XmlDocument();
doc.Load(gr);
gr.Close(); // done using GedcomReader
doc.Save("skonnord.xml");

The generated skonnord.xml file will look like the file shown in Figure 3. Loading the GEDCOM 5.5 file and saving it achieves the complete file transformation. You can even take the transformation one step further by using XSLT to translate the intermediate XML format into something else.

XSLT for GEDCOM 6.0

The latest GEDCOM specification, version 6.0, defines a full-fledged XML format for GEDCOM information. The specification even provides a Document Type Definition (DTD) that defines the elements and attributes that make up the complete GEDCOM 6.0 vocabulary.

The GEDCOM 6.0 format differs substantially from the one my GedcomReader simulates. To migrate to GEDCOM 6.0, you must either modify the GedcomReader implementation to simulate the new format or write an XSLT transformation that performs the conversion in a subsequent step.

Because GEDCOM 6.0 is also much more complex than the simple mapping used by GedcomReader, simulating it directly in the GedcomReader code would be difficult and error prone. An XSLT transformation makes this step far more tractable.

The sample project includes an XSLT transformation (see Figure 7) that illustrates how to generate a GEDCOM 6.0 file from the intermediate XML format shown in Figure 3. The XSLT covers the most common GEDCOM 6.0 use cases, and it produces files that pass validation against the GEDCOM 6.0 DTD.

Figure 7 XSLT Transformation to GEDCOM 6.0

<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output indent="yes" omit-xml-declaration="yes"/>
  <xsl:template match="GEDCOM">
    <GEDCOM>
      ...
      <xsl:apply-templates select="INDI"/>
      ...
    </GEDCOM>
  </xsl:template>
  <xsl:template match="INDI">
    <IndividualRec Id="{@id}">
      <IndivName>
        <xsl:call-template name="OutputLang"/>
        <xsl:value-of select="NAME/@value"/>
      </IndivName>
      <Gender><xsl:value-of select="SEX/@value"/></Gender>
      <xsl:if test="DEAT|BURI|CREM">
        <DeathStatus>dead</DeathStatus>
      </xsl:if>
      <xsl:call-template name="TransformAttributes"/>
      <xsl:for-each select="ASSO">
        <AssocIndiv>
          <Link Target="IndividualRec" Ref="{@idref}">
            <xsl:choose>
              <xsl:when test="TYPE/@value='FAM'">
                <xsl:attribute name="Target">FamilyRec</xsl:attribute>
              </xsl:when>
              <xsl:when test="TYPE/@value='INDI'">
                <xsl:attribute name="Target">IndividualRec</xsl:attribute>
              </xsl:when>
              ...
            </xsl:choose>
          </Link>
          <Association>
            <xsl:value-of select="RELA/@value"/>
          </Association>
          <xsl:call-template name="TransformNote"/>
        </AssocIndiv>
      </xsl:for-each>
      <xsl:call-template name="TransformRecordCom"/>
    </IndividualRec>
  </xsl:template>
  ...
  <!-- remainder omitted for clarity, download file for more details -->
</xsl:stylesheet>

You can use this XSLT by taking advantage of the System.Xml.Xsl.XslTransform class, as shown here:

GedcomReader gr = new GedcomReader(gedcomFileName);
XmlDocument doc = new XmlDocument();
doc.Load(gr);
gr.Close(); // done using GedcomReader
XslTransform tx = new XslTransform();
tx.Load("gedcom6.xsl");
FileStream fs = new FileStream("skonnord6.xml", FileMode.Create);
tx.Transform(doc, null, fs, null);
fs.Close(); // flush and release the output file

The output file, skonnord6.xml, will now be GEDCOM 6.0-compliant. At this point it's also possible to write other XSLT transformations that change the intermediate XML format into another format of your choosing. For example, you could write an XSLT transformation that produces human-readable HTML pages for viewing and navigating the GEDCOM family tree information.
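As a rough illustration of the HTML idea, the sketch below renders the names of all INDI records as an HTML list. It is not part of the sample download: the NameListReport class and its embedded style sheet are hypothetical, and it uses XslCompiledTransform, which replaced the XslTransform class in later versions of the .NET Framework, so it assumes version 2.0 or later.

```csharp
using System;
using System.IO;
using System.Xml;
using System.Xml.Xsl;

// Hypothetical helper, not part of the article's sample code.
public static class NameListReport
{
    // A tiny style sheet over the intermediate format: one <li> per INDI name.
    private const string Stylesheet = @"<xsl:stylesheet version=""1.0""
    xmlns:xsl=""http://www.w3.org/1999/XSL/Transform"">
  <xsl:output method=""html"" indent=""no""/>
  <xsl:template match=""/GEDCOM"">
    <ul>
      <xsl:for-each select=""INDI"">
        <li><xsl:value-of select=""NAME/@value""/></li>
      </xsl:for-each>
    </ul>
  </xsl:template>
</xsl:stylesheet>";

    // Applies the style sheet to any reader that exposes the intermediate XML.
    public static string Render(XmlReader source)
    {
        var transform = new XslCompiledTransform();
        using (XmlReader xslt = XmlReader.Create(new StringReader(Stylesheet)))
        {
            transform.Load(xslt);
        }

        var html = new StringWriter();
        transform.Transform(source, null, html);
        return html.ToString();
    }
}
```

Because Render accepts any XmlReader, a GedcomReader could in principle be passed to it directly, which would skip the intermediate XmlDocument. Whether that works depends on how completely the custom reader implements the XmlReader members the XSLT processor relies on.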

As you can see, it didn't take much code to implement a complete GEDCOM migration path. The ability to programmatically move between GEDCOM 5.5 and GEDCOM 6.0 greatly simplifies the migration scenarios that are involved in building a complete genealogy system around the new GEDCOM 6.0 data model.

Where Are We?

I've walked through a real-world data migration scenario using the System.Xml classes in .NET. I started with the GEDCOM 5.5 format and moved to an intermediate XML format that can be processed in a variety of ways. This is made possible by a custom XmlReader implementation called GedcomReader.

The intermediate XML format can also be transformed into a variety of other formats. I was able to migrate to GEDCOM 6.0 by using an XSLT transformation (see Figure 8).

Figure 8 Migrating to GEDCOM 6.0


The extensibility points provided by System.Xml facilitate dealing with a variety of data migration scenarios like this one. If you have old data formats lying around that would benefit from migrating to XML, follow the approach discussed in this column to implement your own migration path.

Send your questions and comments for Aaron to xmlfiles@microsoft.com.

Aaron Skonnard teaches at Northface University in Salt Lake City. Aaron coauthored Essential XML Quick Reference (Addison-Wesley, 2001) and Essential XML (Addison-Wesley, 2000), and frequently speaks at conferences. Reach him at https://www.skonnard.com.