Inferring Schemas from Well-Formed XML Documents with Whidbey Beta 2

2007-09-04

Radhakrishnan Srikanth
Microsoft Corporation

May 2005

Applies to:
XML
Schema

Summary: Radhakrishnan Srikanth explains how to infer and tune schemas starting from XML documents. The Microsoft Visual Studio 2005 Beta 2 release of the System.Xml schema inference API is used to illustrate how to automatically infer a schema from an XML instance document. (10 printed pages)

Introduction
Scenario
Generating a Schema from a Well-Formed XML Document
Fine-Tuning the Schema
Putting It All Together
Conclusion

Introduction

In the past few years XML has emerged as one of the prominent data exchange formats. Two features of XML, its semantic markup and its human-readable nature, simplified data exchange and allowed users to break free from proprietary data formats. Schemas represent the structure of the data to be exchanged, and they can be used to validate XML documents. By validating XML instance documents against a schema, applications can enforce constraints on the values of elements and attributes and on the order of elements in the document. However, many applications in use today work with well-formed XML documents that have no XML schema associated with them. The XML schema inference API in the Microsoft .NET Framework allows automatic generation and tuning of schemas from instances of XML documents. This article illustrates the API with example XML documents and walks through the C# code used to generate and tune the schema.

Scenario

ABC Inc.'s customer has told them that their order information would be sent in XML. However, a schema that describes the data is not supplied; instead ABC Inc. is supplied with sample XML documents. ABC Inc. would like to use schemas to validate the XML documents that are sent to them. In addition, ABC Inc. would like to bind the incoming XML data to C# objects (XML data binding). The best way to auto generate C# objects is by means of a schema. As a first step, ABC Inc. wants to infer the schema from the XML documents supplied to them.

Other scenarios can be envisaged where internal customers or different departments that have varying levels of expertise in XML may decide to send data in XML. These users may not have the expertise to create schemas. The schema inference API can be used as a tool to generate an XML schema from a set of XML instance documents.

Another typical scenario would be the use of XML documents as configuration files. Validation of the data in these files would improve the security of the application and remove the onus of validating data from the application programmer. Schemas can be used as convenient mechanisms for these kinds of data validation.

Schemas can also be used as data contracts between partners and, as mentioned earlier, schemas can also be used to validate the constraints and the structure of the data.

Generating a Schema from a Well-Formed XML Document

Consider the following XML document (sample.xml):

            <?xml version="1.0" encoding="utf-8" ?>      <!-- sample.xml -->
            <item xmlns="http://www.anyURI.com/items" productID='123098'>
                <name>hammer</name>
                <price>9.95</price>
                <supplierID>1929</supplierID>
            </item>

This document has a root element named item, an associated XML namespace, and an attribute called productID. It has the child elements name, price, and supplierID.

Looking at the document, the following can be inferred.

The root element name is item, and the attribute name is productID.
The target namespace is http://www.anyURI.com/items.
name, price, and supplierID are child elements.
It is unclear if the child elements are mandatory or optional.
It looks like:
1. name is of type string.
2. price is of type float/double.
3. supplierID is of type integer.
4. productID is of type integer.
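
The type guesses in the list above can be mimicked with a few parse attempts. The following is only a sketch of the idea, not the inference engine's actual algorithm: GuessXsdType is a hypothetical helper introduced here for illustration, and it considers just a handful of XSD types, picking the narrowest one that accepts the value.

```csharp
using System;
using System.Globalization;

public static class TypeGuess
{
    // Hypothetical helper (not part of System.Xml): returns the narrowest of a
    // few XSD type names whose lexical form accepts the given value.
    public static string GuessXsdType(string value)
    {
        CultureInfo inv = CultureInfo.InvariantCulture;
        byte b;
        ushort us;
        uint ui;
        decimal d;

        // NumberStyles.None accepts digits only, so signs and spaces are rejected.
        if (byte.TryParse(value, NumberStyles.None, inv, out b)) return "xs:unsignedByte";
        if (ushort.TryParse(value, NumberStyles.None, inv, out us)) return "xs:unsignedShort";
        if (uint.TryParse(value, NumberStyles.None, inv, out ui)) return "xs:unsignedInt";

        NumberStyles decimalStyle = NumberStyles.AllowLeadingSign | NumberStyles.AllowDecimalPoint;
        if (decimal.TryParse(value, decimalStyle, inv, out d)) return "xs:decimal";

        // Anything that is not recognizably numeric falls back to string.
        return "xs:string";
    }

    public static void Main()
    {
        string[] values = { "hammer", "9.95", "1929", "123098", "A53-246" };
        foreach (string v in values)
        {
            Console.WriteLine("{0} -> {1}", v, GuessXsdType(v));
        }
    }
}
```

A value such as A53-246 fails every numeric parse and therefore falls through to xs:string, which is why one non-numeric sample is enough to widen an inferred type.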

For this sample it is easy to inspect the document visually and infer the salient features of the schema. For larger documents this is tedious and often impractical, and it can be impractical even for small, simple documents when there are several of them to inspect.

The XmlSchemaInference API can be used to infer a schema automatically and then tune it.

To infer the schema, we first instantiate an XmlSchemaInference object and declare a schema set variable to hold the schema once it has been inferred.

            //declare and instantiate a schema inferencer
            XmlSchemaInference infer = new XmlSchemaInference();
            //declare a schema set variable to hold the inferred schema
            XmlSchemaSet sc = null;

The InferSchema method on the schema inference object takes an XmlReader (created here over the corresponding XML file) and returns a schema set object that contains the inferred schema.

sc = infer.InferSchema(XmlReader.Create("sample.xml"));

For the sample above (sample.xml), the following schema is inferred and stored in the schema set object sc. By default, the inference engine assumes that all the elements are required.

              <?xml version="1.0" encoding="utf-8"?>
              <xs:schema attributeFormDefault="unqualified" elementFormDefault="qualified" targetNamespace="http://www.anyURI.com/items" xmlns:xs="http://www.w3.org/2001/XMLSchema">
                  <xs:element name="item">
                      <xs:complexType>
                          <xs:sequence>
                              <xs:element name="name" type="xs:string" />
                              <xs:element name="price" type="xs:decimal" />
                              <xs:element name="supplierID" type="xs:unsignedShort" />
                          </xs:sequence>
                          <xs:attribute name="productID" type="xs:unsignedInt" use="required" />
                      </xs:complexType>
                  </xs:element>
              </xs:schema>
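
To look at an inferred schema such as the one above without writing it to a file, the schemas in the returned set can be written to any TextWriter. The sketch below assumes the sample document is held in a string rather than in sample.xml; the class and method names are illustrative only.

```csharp
using System;
using System.IO;
using System.Xml;
using System.Xml.Schema;

public static class InferToText
{
    // The article's sample.xml, held in memory for this sketch.
    public const string Sample =
        "<item xmlns='http://www.anyURI.com/items' productID='123098'>" +
        "<name>hammer</name><price>9.95</price><supplierID>1929</supplierID></item>";

    // Infers a schema from an XML string and returns the schema as text.
    public static string InferSchemaText(string xml)
    {
        XmlSchemaInference infer = new XmlSchemaInference();
        XmlSchemaSet set;
        using (XmlReader reader = XmlReader.Create(new StringReader(xml)))
        {
            set = infer.InferSchema(reader);
        }

        // A schema set holds one schema per target namespace; write them all.
        StringWriter output = new StringWriter();
        foreach (XmlSchema schema in set.Schemas())
        {
            schema.Write(output);
            output.WriteLine();
        }
        return output.ToString();
    }

    public static void Main()
    {
        Console.WriteLine(InferSchemaText(Sample));
    }
}
```

Iterating over Schemas() avoids sizing an array up front, which can be convenient when the documents use more than one namespace.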

Fine-Tuning the Schema

Inferring Occurrences

Now suppose that other samples of the XML document implicitly indicate that one of the elements in the schema is not mandatory. For instance, consider the following sample of the XML data (sample1.xml).

            <?xml version="1.0" encoding="utf-8" ?>      <!-- sample1.xml -->
            <item xmlns="http://www.anyURI.com/items" productID='A53-246'>
                <name>paint</name>
                <price>12.50</price>
            </item>

The supplierID element is missing in this sample (sample1.xml). We can fine-tune the inferred schema by updating it with the new document. Note that the schema set object sc is passed in as a parameter.

            //update schema with another xml document
            sc = infer.InferSchema(XmlReader.Create("sample1.xml"), sc);

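The schema is now updated to indicate that supplierID is not a mandatory element. The updated schema is shown below; note that minOccurs is set to zero for the supplierID element. The type of the productID attribute also changes from xs:unsignedInt to xs:string, because the value A53-246 in the new sample is not numeric.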

              <?xml version="1.0" encoding="utf-8"?>   
              <xs:schema attributeFormDefault="unqualified" elementFormDefault="qualified" targetNamespace="http://www.anyURI.com/items" xmlns:xs="http://www.w3.org/2001/XMLSchema">
                <xs:element name="item">
                  <xs:complexType>
                    <xs:sequence>
                      <xs:element name="name" type="xs:string" />
                      <xs:element name="price" type="xs:decimal" />
                      <xs:element minOccurs="0" name="supplierID" type="xs:unsignedShort" />
                    </xs:sequence>
                    <xs:attribute name="productID" type="xs:string" use="required" />
                  </xs:complexType>
                </xs:element>
              </xs:schema>
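
One way to see the effect of the refinement is to validate the second sample against the schema before and after the update. The following sketch assumes in-memory copies of the two samples; the class, the helper names, and the error-counting approach are illustrative and not part of the article's listing.

```csharp
using System;
using System.IO;
using System.Xml;
using System.Xml.Schema;

public static class RefineAndValidate
{
    public const string WithSupplier =
        "<item xmlns='http://www.anyURI.com/items' productID='123098'>" +
        "<name>hammer</name><price>9.95</price><supplierID>1929</supplierID></item>";

    public const string WithoutSupplier =
        "<item xmlns='http://www.anyURI.com/items' productID='A53-246'>" +
        "<name>paint</name><price>12.50</price></item>";

    // Infers a schema from the first document and refines it with the rest.
    public static XmlSchemaSet Infer(params string[] documents)
    {
        XmlSchemaInference infer = new XmlSchemaInference();
        XmlSchemaSet set = null;
        foreach (string xml in documents)
        {
            using (XmlReader reader = XmlReader.Create(new StringReader(xml)))
            {
                set = (set == null)
                    ? infer.InferSchema(reader)
                    : infer.InferSchema(reader, set);
            }
        }
        return set;
    }

    // Validates a document against the schema set and counts the
    // validation events that are reported.
    public static int CountErrors(string xml, XmlSchemaSet schemas)
    {
        int errors = 0;
        XmlReaderSettings settings = new XmlReaderSettings();
        settings.ValidationType = ValidationType.Schema;
        settings.Schemas = schemas;
        settings.ValidationEventHandler +=
            delegate(object sender, ValidationEventArgs e) { errors++; };

        using (XmlReader reader = XmlReader.Create(new StringReader(xml), settings))
        {
            while (reader.Read()) { }
        }
        return errors;
    }

    public static void Main()
    {
        XmlSchemaSet first = Infer(WithSupplier);
        Console.WriteLine("Errors before refinement: {0}",
            CountErrors(WithoutSupplier, first));

        XmlSchemaSet refined = Infer(WithSupplier, WithoutSupplier);
        Console.WriteLine("Errors after refinement: {0}",
            CountErrors(WithoutSupplier, refined));
    }
}
```

Against the schema inferred from the first sample alone, the second sample should be rejected (supplierID is missing and productID is not numeric); after refinement, both samples should validate.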

Inferring Choice

Now consider sample2.xml, in which the order of the elements name, price, and supplierID differs from sample.xml (as shown below).

<?xml version="1.0" encoding="utf-8" ?>
<!-- sample2.xml -->
<item xmlns="http://www.anyURI.com/items" productID='A53-246'>
   <name>paint</name>
   <supplierID>1929</supplierID>
   <price>12.50</price>
</item>

On visual inspection it is clear that the schema needs a choice content model. Update the schema with the new XML document, again passing in the schema set object as a parameter.

//update schema with another xml document
//(sc, schemas, settings, and w are declared as in the complete listing
//under Putting It All Together)
sc = infer.InferSchema(XmlReader.Create("sample2.xml"), sc);
sc.CopyTo(schemas, 0);

//This addition will generate a choice in the schema
//write the new schema to file
w = XmlWriter.Create(new StreamWriter("sampleSchema2.xsd"), settings);
schemas[0].Write(w);
w.Close();

The resulting schema is shown below; note the xs:choice element inside the sequence.

<?xml version="1.0" encoding="utf-8"?>
<xs:schema attributeFormDefault="unqualified" elementFormDefault="qualified" targetNamespace="http://www.anyURI.com/items" xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="item">
    <xs:complexType>
      <xs:sequence>
        <xs:choice maxOccurs="unbounded">
          <xs:element name="name" type="xs:string" />
          <xs:element name="price" type="xs:decimal" />
          <xs:element name="supplierID" type="xs:unsignedShort" />
        </xs:choice>
      </xs:sequence>
      <xs:attribute name="productID" type="xs:string" use="required" />
    </xs:complexType>
  </xs:element>
</xs:schema>
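
The same refinement can be tried with in-memory documents. The sketch below feeds two orderings of the child elements to the inference engine and returns the resulting schema as text; the class name, method name, and inline documents are assumptions made for illustration.

```csharp
using System;
using System.IO;
using System.Xml;
using System.Xml.Schema;

public static class InferChoice
{
    public const string Ordered =
        "<item xmlns='http://www.anyURI.com/items' productID='123098'>" +
        "<name>hammer</name><price>9.95</price><supplierID>1929</supplierID></item>";

    public const string Reordered =
        "<item xmlns='http://www.anyURI.com/items' productID='A53-246'>" +
        "<name>paint</name><supplierID>1929</supplierID><price>12.50</price></item>";

    // Infers one schema from several documents and returns it as text.
    public static string InferSchemaText(params string[] documents)
    {
        XmlSchemaInference infer = new XmlSchemaInference();
        XmlSchemaSet set = null;
        foreach (string xml in documents)
        {
            using (XmlReader reader = XmlReader.Create(new StringReader(xml)))
            {
                // The first call creates the schema set; later calls refine it.
                set = (set == null)
                    ? infer.InferSchema(reader)
                    : infer.InferSchema(reader, set);
            }
        }

        StringWriter output = new StringWriter();
        foreach (XmlSchema schema in set.Schemas())
        {
            schema.Write(output);
        }
        return output.ToString();
    }

    public static void Main()
    {
        Console.WriteLine(InferSchemaText(Ordered, Reordered));
    }
}
```

Because the choice is declared with maxOccurs="unbounded", the refined content model accepts the child elements in any order and any number, which is looser than either sample on its own suggests. It may be worth tightening the schema by hand if the order matters.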

Occurrence and TypeInference Options

By default the inference engine marks the elements it finds as required. The API also lets you set the occurrence inference option to relaxed, in which case the elements in the inferred schema are optional rather than required.

To set the occurrence option to relaxed, do the following:
            infer.Occurrence=XmlSchemaInference.InferenceOption.Relaxed;

Furthermore, notice that in the inferred schema a type is also inferred for each element by matching its value against candidate types. If you set the type inference option to relaxed, the element types default to string instead of the inferred xsd datatypes.

To set the type inference option to relaxed, do the following:
            infer.TypeInference = XmlSchemaInference.InferenceOption.Relaxed;
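
The effect of the two options can be checked by inferring a schema from the first sample with both set to relaxed. This is a minimal sketch that assumes an in-memory copy of the sample; the class and method names are illustrative.

```csharp
using System;
using System.IO;
using System.Xml;
using System.Xml.Schema;

public static class RelaxedInference
{
    public const string Sample =
        "<item xmlns='http://www.anyURI.com/items' productID='123098'>" +
        "<name>hammer</name><price>9.95</price><supplierID>1929</supplierID></item>";

    // Infers a schema with both inference options relaxed and returns it as text.
    public static string InferRelaxed(string xml)
    {
        XmlSchemaInference infer = new XmlSchemaInference();
        // Both options apply to the whole inference run, not to single elements.
        infer.Occurrence = XmlSchemaInference.InferenceOption.Relaxed;
        infer.TypeInference = XmlSchemaInference.InferenceOption.Relaxed;

        XmlSchemaSet set;
        using (XmlReader reader = XmlReader.Create(new StringReader(xml)))
        {
            set = infer.InferSchema(reader);
        }

        StringWriter output = new StringWriter();
        foreach (XmlSchema schema in set.Schemas())
        {
            schema.Write(output);
        }
        return output.ToString();
    }

    public static void Main()
    {
        Console.WriteLine(InferRelaxed(Sample));
    }
}
```

With both options relaxed, the output should declare the elements with minOccurs="0" and type them as xs:string rather than as xs:decimal or xs:unsignedShort.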

Putting It All Together

The following code sample illustrates an example of inferring the schema and updating it. It also illustrates the Occurrence and the Type Inference options.

#region Using directives

using System;
using System.Xml;
using System.Xml.Schema;
using System.IO;

#endregion


namespace HowTo.Samples.Xml
{
    public class SchemaInferenceSample
    {
        public static void Main(string[] args)
        {
            //declare and instantiate a schema inferencer
            XmlSchemaInference infer = new XmlSchemaInference();
            //declare a schema set variable to hold the inferred schema
            XmlSchemaSet sc = null;

            //XmlSchemaInference, given an XML fragment, infers the XML Schema and adds it to the Schema set
            sc = infer.InferSchema(XmlReader.Create("sample.xml"));

            XmlSchema[] schemas = new XmlSchema[sc.Count];
            sc.CopyTo(schemas, 0);
            XmlWriterSettings settings = new XmlWriterSettings();
            settings.Indent = true;
            XmlWriter w = XmlWriter.Create(new StreamWriter("sampleSchema.xsd"),settings);
            Console.WriteLine("Writing sampleSchema");
            schemas[0].Write(w);
            w.Close();


            //update schema with another xml fragment
            sc = infer.InferSchema(XmlReader.Create("sample1.xml"),sc);

            //update the schema -- notice the supplierID element is missing in this XML fragment. This is interpreted
            //to mean that the supplierID element is an optional field
            sc.CopyTo(schemas, 0);

            //write the new schema to file
            w = XmlWriter.Create(new StreamWriter("sampleSchema1.xsd"),settings);
            Console.WriteLine("Writing sampleSchema1, updating previous schema to reflect new data, optional element");
            schemas[0].Write(w);
            w.Close();

            //update schema with another xml fragment
            sc = infer.InferSchema(XmlReader.Create("sample2.xml"), sc);
            sc.CopyTo(schemas, 0);

            //This addition will generate a choice in the schema
            //write the new schema to file
            w = XmlWriter.Create(new StreamWriter("sampleSchema2.xsd"), settings);
            Console.WriteLine("Writing sampleSchema2, updating the previously generated schema to reflect the new sample with choice");
            schemas[0].Write(w);
            w.Close();


            //Set the occurrence inference option to relaxed.
            //With this option the inferred elements are optional as opposed to required.
            //Compare sampleSchema1.xsd, which had elements required by default, with sampleSchema3.xsd
            //(generated below), where the elements are optional. This option is a global setting and cannot
            //be set for individual elements.

            infer.Occurrence=XmlSchemaInference.InferenceOption.Relaxed;
            //infer the schema for the first sample (sample.xml) again
            //Note: this does not refine the schema but overwrites the existing schema

            sc = infer.InferSchema(XmlReader.Create("sample.xml"));
            sc.CopyTo(schemas, 0);

            //write the new schema to file
            w = XmlWriter.Create(new StreamWriter("sampleSchema3.xsd"),settings);
            Console.WriteLine("Writing sampleSchema3 where the occurrence option is set to relaxed");
            schemas[0].Write(w);
            w.Close();

            //set the type inference option to relaxed. If this option is set, the types default to string
            //as opposed to inferred xsd datatypes
            //Note: Now both TypeInference and Occurrence are set to relaxed.

            infer.TypeInference = XmlSchemaInference.InferenceOption.Relaxed;
            
            //infer the schema for the first sample (sample.xml) again
            //Note: this does not refine the schema but overwrites the existing schema

            sc = infer.InferSchema(XmlReader.Create("sample.xml"));
            sc.CopyTo(schemas, 0);

            //write the new schema to file
            w = XmlWriter.Create(new StreamWriter("sampleSchema4.xsd"),settings);
            Console.WriteLine("Writing sampleSchema4 inferring sample.xml with type inference and occurrence options relaxed");
            schemas[0].Write(w);
            w.Close();

            Console.WriteLine("Press enter to exit");
            Console.ReadLine();
        }
    }
}

Conclusion

XmlSchemaInference is an API that allows you to infer and fine-tune a schema from XML instance documents. Schemas not only let you validate XML documents but can also help with XML data binding. For example, you can use xsd.exe to generate related C# classes from an XML schema.
