Advice on creating a code generator

I have been spending some time recently creating a custom code generator that outputs C++ and C# code from a custom XML format. This post covers some of what I have learned while working on it.

Advice #1: Use '#line'

There is a really cheap way to get a debugging experience for your custom language: #line. With #line, you can change the debug info that ilasm/csc/vbc/cl writes into the PDB so that it points back to the original source file instead of the file that your code generator creates. This is quite handy if at least parts of your custom language map closely onto the generated code. Notes:

  1. There is not a similar technique to allow you to re-associate variables, so I would advise using the original names of variables if possible.
  2. If a line of the original source expands to multiple lines of the generated code, just repeat the #line before each line of the generated code.
  3. There is an issue with this technique if your source language is XML-based in VS 2008 RTM. Visual Studio hopes to correct this problem in SP1.
  4. In C#, you can combine this technique with #pragma checksum to set the checksum of the source file. This is useful if you expect to have a bunch of source files with the same name (ex: default.aspx) or if you are already grabbing the checksum for some other reason. Otherwise, it is probably overkill for a custom code generator. Note that the GUID in #pragma checksum identifies the hash algorithm (ex: MD5 is {406ea660-64cf-4c82-b6f0-42d48172a799}).

Example:

Generated file:

    static void Main(string[] args)
    {
#line 1 "HelloWorld.ExampleLanguage"
        Console.WriteLine("Hello World");
#line default
    }

HelloWorld.ExampleLanguage (custom language source file):

Hello World
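
To make notes 2 and 4 concrete, here is a minimal sketch of how a generator might emit these directives. The class and method names (LineDirectiveEmitter, EmitMapped, EmitChecksum) are hypothetical and exist only for illustration. The sketch writes the file name as-is and does not deal with quoting or escaping, which a real generator would need to consider.

```csharp
using System;
using System.Collections.Generic;
using System.Security.Cryptography;
using System.Text;

// Hypothetical helper for a code generator; these names are illustrative only.
public static class LineDirectiveEmitter
{
    // GUID that identifies MD5 as the hash algorithm in '#pragma checksum'.
    private const string Md5AlgorithmGuid = "{406ea660-64cf-4c82-b6f0-42d48172a799}";

    // Emits each generated line preceded by a '#line' directive that maps it back
    // to 'sourceLine' in 'sourceFile', then restores the default mapping.
    public static string EmitMapped(string sourceFile, int sourceLine, IEnumerable<string> generatedLines)
    {
        StringBuilder output = new StringBuilder();
        foreach (string line in generatedLines)
        {
            output.Append("#line ").Append(sourceLine).Append(" \"").Append(sourceFile).Append("\"\n");
            output.Append(line).Append('\n');
        }
        output.Append("#line default\n");
        return output.ToString();
    }

    // Builds a '#pragma checksum' line (C# only) from the MD5 hash of the source file bytes.
    public static string EmitChecksum(string sourceFile, byte[] sourceBytes)
    {
        using (MD5 md5 = MD5.Create())
        {
            byte[] hash = md5.ComputeHash(sourceBytes);
            StringBuilder hex = new StringBuilder(hash.Length * 2);
            foreach (byte b in hash)
            {
                hex.Append(b.ToString("x2"));
            }
            return "#pragma checksum \"" + sourceFile + "\" \"" + Md5AlgorithmGuid + "\" \"" + hex + "\"";
        }
    }
}
```

Repeating the directive before every generated line keeps a multi-line expansion pointing at the single source line it came from, at the cost of a slightly noisier generated file.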

Advice #2: Use XML

XML is a great way to build a custom language these days because, at no cost, you get a lexer, a parser, syntax validation, and a language service, and your compiler gets to work with deserialized classes instead of text. Here is the procedure that I would recommend:

  1. Figure out what you would like your language to look like by writing a bunch of examples.
  2. Run xsd.exe over your examples to create the start of a schema.
  3. Open up the generated schema and start making changes. You might need to make changes because the generated schema wasn’t specific enough (ex: you have an enumeration that it has represented as a string), or because your examples had bugs in them, or because you want to add documentation.
  4. Once you have the schema vaguely correct, it is time to use xsd.exe to create class files from your schema. This allows you to use XmlSerializer.Deserialize to create classes from the input XML with little work. You will want a build step for this, since you will be changing your schema often.
  5. Hook up the schema file so that validation runs as part of compilation. This gives you a pretty good set of validation with little effort. I did this by embedding the schema as a resource in my compiler, but obviously you could also leave it as a file that your compiler opens.
  6. (Optional) Use sgen.exe to generate the serialization assembly -- XmlSerializer depends on a generated assembly to perform the serialization/deserialization. By default, this assembly is generated dynamically, but you can also use sgen.exe to generate this assembly up front.
  7. When you edit your language, edit in Visual Studio and make sure that the XML editor has your schema open. The XML editor will pick up schema items that are in your project, and it will also pick up schemas that are in the Visual Studio 'schema directory'.

Example target for running xsd.exe: 
  <!--Generate Example.cs using xsd.exe -->
  <Target Name="GenerateXSDClasses"
    Inputs="Example.xsd"
    Outputs="$(IntermediateOutputPath)\Example.cs">
    <Exec Command="$(RunManagedToolPath) xsd.exe Example.xsd /classes /fields /namespace:ExampleCompiler /out:$(IntermediateOutputPath)"/>
  </Target>

  <ItemGroup>
    <Compile Include="$(IntermediateOutputPath)\Example.cs"/>
  </ItemGroup>

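You would also need to wire the target into the build so that it runs before compilation, for example by adding it to a dependency property such as CoreCompileDependsOn in a property group.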

Example of embedding the schema as a resource:

<EmbeddedResource Include="Example.xsd" />

Example of using the schema:

    // Loads the embedded schema once and compiles it for later validation.
    public static void InitializeSchema()
    {
        if (s_schemaSet != null)
        {
            throw new InvalidOperationException();
        }

        System.Reflection.Assembly thisAssembly = typeof(MyType).Assembly;
        // The resource name is the default namespace plus the file name.
        Stream stream = thisAssembly.GetManifestResourceStream("ExampleCompiler.Example.xsd");
        XmlReader schemaDocument = XmlReader.Create(stream);
        s_schemaSet = new System.Xml.Schema.XmlSchemaSet();
        s_schemaSet.Add("https://schemas.microsoft.com/vstudio/Example/2008", schemaDocument);
        s_schemaSet.Compile();
    }

    // Later, when opening a source file, create a validating reader:
    XmlReaderSettings settings = new XmlReaderSettings();
    settings.ConformanceLevel = ConformanceLevel.Document;
    settings.Schemas = s_schemaSet;
    settings.ValidationEventHandler += MyValidationEventHandler;
    settings.ValidationType = ValidationType.Schema;
    using (XmlReader reader = XmlReader.Create(filename, settings))
    {
        // Read or deserialize the document here; schema violations are
        // reported to MyValidationEventHandler as the reader advances.
    }
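
Steps 4 and 5 fit together roughly as follows. This is a minimal sketch under stated assumptions, not the code from my compiler: Greeting is a hypothetical stand-in for a class that xsd.exe would generate, SourceLoader is an illustrative name, and the validation handler simply collects messages with their line and column.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;
using System.Xml.Schema;
using System.Xml.Serialization;

// Hypothetical stand-in for a class that xsd.exe would generate from the schema.
[XmlRoot("greeting")]
public class Greeting
{
    [XmlAttribute("text")]
    public string Text;
}

// Illustrative loader; the name is an assumption, not part of any library.
public static class SourceLoader
{
    // Reads and deserializes one source document. Schema violations are appended to
    // 'errors' with their position rather than thrown as exceptions.
    public static Greeting Load(TextReader input, XmlSchemaSet schemas, List<string> errors)
    {
        XmlReaderSettings settings = new XmlReaderSettings();
        settings.ConformanceLevel = ConformanceLevel.Document;
        settings.ValidationType = ValidationType.Schema;
        settings.Schemas = schemas;
        settings.ValidationEventHandler += delegate(object sender, ValidationEventArgs e)
        {
            errors.Add(string.Format("({0},{1}): {2}: {3}",
                e.Exception.LineNumber, e.Exception.LinePosition, e.Severity, e.Message));
        };

        XmlSerializer serializer = new XmlSerializer(typeof(Greeting));
        using (XmlReader reader = XmlReader.Create(input, settings))
        {
            // Validation happens as the serializer pulls nodes from the reader.
            return (Greeting)serializer.Deserialize(reader);
        }
    }
}
```

Collecting the messages instead of throwing lets the compiler report every schema problem in a file in one pass; the caller would treat a non-empty error list as a failed compile.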

Advice #3: Check in the custom compiler output

If you are writing a very high-quality compiler that you intend to productize, then the people who use it would wire it into their build process so that they feed in your custom language and get back a DLL or EXE.
But this is not necessarily the right bar for a custom code generator. In my case, I am creating a truly custom compiler. Very few people are going to author the input language, and it doesn’t make sense to spend valuable QA resources testing the compiler directly (rather, they would test the generated code). So instead of building the output of my compiler directly, I check in the compiler output as a baseline, and the build process runs the custom compiler and compares its output to the baseline. If they differ, the build reports an error.

There are a number of valuable properties that I get out of this 'baseline' approach:

  1. If I edit the compiler, testing the compiler becomes very simple. I just do a build and see if I got the same result that I expected.
  2. If I edit the input, I get confirmation that the compiler did what I expected. I need to diff the baseline output against the new output and validate that I got the expected changes.
  3. Compiler bugs turn into build breaks. If my custom compiler has a bug that makes the output incorrect, the bug shows up as a build break instead of going silently unnoticed.
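
The comparison step itself can be very small. The following is a minimal sketch, assuming the baseline and the fresh output have both been read into strings; BaselineCheck and Compare are hypothetical names, and normalizing line endings is my own assumption about what a build would want to ignore.

```csharp
using System;

// Hypothetical baseline check; the names here are illustrative only.
public static class BaselineCheck
{
    // Returns null when the generated text matches the baseline, otherwise a message
    // describing the first difference. Line endings are normalized first so that a
    // CRLF/LF difference alone does not fail the build.
    public static string Compare(string baselineText, string generatedText)
    {
        string[] baseline = SplitLines(baselineText);
        string[] generated = SplitLines(generatedText);

        int common = Math.Min(baseline.Length, generated.Length);
        for (int i = 0; i < common; i++)
        {
            if (!string.Equals(baseline[i], generated[i], StringComparison.Ordinal))
            {
                return string.Format("line {0}: expected '{1}' but generated '{2}'",
                    i + 1, baseline[i], generated[i]);
            }
        }

        if (baseline.Length != generated.Length)
        {
            return string.Format("line count differs: baseline has {0}, generated has {1}",
                baseline.Length, generated.Length);
        }
        return null;
    }

    private static string[] SplitLines(string text)
    {
        return text.Replace("\r\n", "\n").Split('\n');
    }
}
```

A build step could read both files with File.ReadAllText, call Compare, and fail with the returned message when it is not null. Reporting the first differing line keeps the build error short; the full diff is still needed when reviewing an intentional change to the baseline.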