It's full of stars: XNA 2-D Shader Instancing

A lot of my XNA-based programs draw many instances of the same mesh (such as RoundLines), with each instance having different position, rotation, etc. I had been submitting these meshes to Direct3D one instance at a time, which is not great for performance. Instanced drawing is a good way to get many copies of simple meshes on-screen faster, but it takes a bit of extra work. It's not too difficult, though.

The Instanced Model Sample is a great place to start learning about instancing. It's a little intimidating, though, to learn that there are three ways to do instancing, and they all require slightly different code and have different hardware requirements.

| | Hardware Instancing | Shader Instancing | VFetch Instancing |
|---|---|---|---|
| Works on Xbox? | No | Yes | Yes |
| Works on Windows? | Yes (if vs_3_0) | Yes | No |
| Need multiple streams? | Yes | No | No |
| Need modified vdecl? | Yes | Yes | No |
| Need replicated VB? | No | Yes | No |
| Need replicated IB? | No | Yes | Yes |
| Need inline asm? | No | No | Yes |
| Max instances/draw? | Thousands | 50-200 | 50-200 |

If you want maximum performance in every possible situation, you should probably implement all three methods. But, being a bit of a minimalist, I'd rather pick one method that works pretty well in all situations. The "happy medium" method seems to be shader instancing. It works on both Windows and Xbox, it doesn't require multiple vertex streams or inline assembly, and it gives good performance. Hardware instancing and VFetch instancing have some advantages on their respective platforms, but shader instancing is giving me the speed boost I need without a lot of extra code paths.

In the Instanced Model Sample, a full 4x4 transformation matrix is specified for each instance. Since there are only 256 float4 registers available to the vertex shader, this limits the number of instances per draw to about 60. You can often get away with much less per-instance data, especially in 2-D. In my test program, I pass just four float values per instance: x, y, rotation, and hue. That all fits in a single float4, so I am submitting 200 instances per draw (I could have done closer to 256, but I want to leave room for some non-instance constant data). The per-instance data is going to vary depending on what functionality you want -- you might want a scaling factor or an alpha value, for example.
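The register arithmetic above can be sketched as a small helper. This is only an illustration: the class name, the method name, and the reserved-register counts in the usage note are assumptions, not taken from the demo.

```csharp
using System;

static class InstanceBudget
{
    // Number of float4 constant registers the post assumes for the vertex shader.
    public const int TotalFloat4Registers = 256;

    // How many instances fit in one draw, given the float4 registers each
    // instance consumes and the registers set aside for other constants.
    public static int MaxInstancesPerDraw(int float4PerInstance, int reservedFloat4)
    {
        if (float4PerInstance <= 0)
            throw new ArgumentOutOfRangeException("float4PerInstance");
        if (reservedFloat4 < 0 || reservedFloat4 > TotalFloat4Registers)
            throw new ArgumentOutOfRangeException("reservedFloat4");

        return (TotalFloat4Registers - reservedFloat4) / float4PerInstance;
    }
}
```

With a 4x4 matrix (four float4 registers) per instance and a hypothetical 16 registers reserved, this gives 60 instances. With one float4 per instance and a hypothetical 56 registers reserved, it gives 200.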

For shader instancing, the vertex declaration needs to have an element for the instance number, stored in a texture coordinate. The only other element I need for my program is position, so it is a pretty small vdecl:

public static VertexElement[] VertexElements = new VertexElement[]
{
    new VertexElement(0, 0, VertexElementFormat.Vector3, VertexElementMethod.Default, VertexElementUsage.Position, 0),
    new VertexElement(0, 12, VertexElementFormat.Single, VertexElementMethod.Default, VertexElementUsage.TextureCoordinate, 0),
};

When creating a mesh, I need to repeat the vertex data for each instance, for a total of verticesPerInstance * numInstances vertices. Each instance's copy specifies its instance index. Here's the setup for quads:

int iVertex = 0;
for (int iInstance = 0; iInstance < numInstances; iInstance++)
{
    // Four corners of a quad; every vertex carries its instance index.
    vertices[iVertex++] = new MyVertexElement(new Vector3(-1, -1, 0), iInstance);
    vertices[iVertex++] = new MyVertexElement(new Vector3(+1, -1, 0), iInstance);
    vertices[iVertex++] = new MyVertexElement(new Vector3(+1, +1, 0), iInstance);
    vertices[iVertex++] = new MyVertexElement(new Vector3(-1, +1, 0), iInstance);
}

The index data needs to be repeated as well, though you don't need to store the instance index here:

int iIndex = 0;
for (int iInstance = 0; iInstance < numInstances; iInstance++)
{
    // Each instance's indices point into its own block of four vertices.
    int iVertexBase = iInstance * 4;
    // Two triangles per quad: (1, 2, 3) and (3, 0, 1).
    indices[iIndex++] = (short)(iVertexBase + 1);
    indices[iIndex++] = (short)(iVertexBase + 2);
    indices[iIndex++] = (short)(iVertexBase + 3);
    indices[iIndex++] = (short)(iVertexBase + 3);
    indices[iIndex++] = (short)(iVertexBase + 0);
    indices[iIndex++] = (short)(iVertexBase + 1);
}

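The replication feels kind of wasteful, but for these tiny shapes you still end up with a very small VB and IB. With 200 quads and the 16-byte vertex above, that is 800 vertices (12,800 bytes) and 1,200 16-bit indices (2,400 bytes).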
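For shapes other than quads, the same index replication can be written once for any base mesh. The following is a minimal sketch, not code from the demo; MeshReplication and ReplicateIndices are hypothetical names.

```csharp
using System;

static class MeshReplication
{
    // Repeats a base index list once per instance, offsetting each copy
    // so it points at that instance's own block of vertices.
    public static short[] ReplicateIndices(short[] baseIndices, int verticesPerInstance, int numInstances)
    {
        // Stored as signed shorts, indices can address at most 32768 vertices.
        if ((long)verticesPerInstance * numInstances > short.MaxValue + 1)
            throw new ArgumentException("Too many vertices for 16-bit indices.");

        short[] indices = new short[baseIndices.Length * numInstances];
        int iIndex = 0;
        for (int iInstance = 0; iInstance < numInstances; iInstance++)
        {
            int iVertexBase = iInstance * verticesPerInstance;
            for (int i = 0; i < baseIndices.Length; i++)
                indices[iIndex++] = (short)(iVertexBase + baseIndices[i]);
        }
        return indices;
    }
}
```

Calling ReplicateIndices(new short[] { 1, 2, 3, 3, 0, 1 }, 4, numInstances) should produce the same index buffer as the quad loop above.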

Before drawing, set the per-instance data into an array, then hand it to the instanceData effect parameter.
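A rough sketch of that packing step follows. It makes several assumptions: the Shape struct and its field names are hypothetical, and System.Numerics.Vector4 stands in for XNA's Vector4 so the snippet can run outside XNA. The commented-out line shows where the array would be passed to the effect in an XNA project.

```csharp
using System;
using System.Numerics; // stand-in for Microsoft.Xna.Framework.Vector4 (assumption)

// Hypothetical per-shape state; the field names are illustrative only.
struct Shape
{
    public float X, Y, Rotation, Hue;
}

static class InstancePacking
{
    public const int MaxInstancesPerDraw = 200;

    // Allocated once and reused, so packing does not allocate every frame.
    static readonly Vector4[] instanceData = new Vector4[MaxInstancesPerDraw];

    // Packs shapes[start .. start + count) into the reusable array and returns it.
    public static Vector4[] Pack(Shape[] shapes, int start, int count)
    {
        if (count < 0 || count > MaxInstancesPerDraw)
            throw new ArgumentOutOfRangeException("count");

        for (int i = 0; i < count; i++)
        {
            Shape s = shapes[start + i];
            // One float4 per instance: x, y, rotation, hue.
            instanceData[i] = new Vector4(s.X, s.Y, s.Rotation, s.Hue);
        }

        // In an XNA project the array would then be handed to the effect, e.g.:
        // effect.Parameters["instanceData"].SetValue(instanceData);
        return instanceData;
    }
}
```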

In the vertex shader, use the index to pull the per-instance data out of the instanceData array:

VS_OUTPUT MyVS(
    float4 Pos : POSITION,
    float Index : TEXCOORD0 )
{
    VS_OUTPUT Out = (VS_OUTPUT)0;
    float tx = instanceData[Index].x;
    float ty = instanceData[Index].y;
    float rot = instanceData[Index].z;
    float hue = instanceData[Index].w;
    ...

}
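The rest of the shader body is not shown here. As a CPU-side reference only, a standard 2-D rotate-then-translate using tx, ty, and rot might look like the sketch below. This is an assumption about what such a shader typically does, not the demo's actual code, and InstanceTransform is a hypothetical name.

```csharp
using System;

static class InstanceTransform
{
    // Rotates a local-space point by rot radians about the origin,
    // then translates it by (tx, ty).
    public static void Apply(float localX, float localY, float tx, float ty, float rot,
                             out float worldX, out float worldY)
    {
        float c = (float)Math.Cos(rot);
        float s = (float)Math.Sin(rot);
        worldX = localX * c - localY * s + tx;
        worldY = localX * s + localY * c + ty;
    }
}
```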

That's basically it! You can see in my Draw() function how I break up the shapeList into batches of 200, since that's the maximum number I can draw at a time. This is still a huge speedup over drawing them one at a time.
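The batching itself is simple enough to sketch without any XNA types. This is not the demo's Draw() function: InstanceBatcher and DrawInBatches are hypothetical names, and the callback stands in for the real work of setting instanceData and issuing one indexed draw.

```csharp
using System;

static class InstanceBatcher
{
    // Calls drawBatch(start, count) once per batch of at most maxPerDraw
    // instances and returns the number of batches issued.
    public static int DrawInBatches(int totalInstances, int maxPerDraw, Action<int, int> drawBatch)
    {
        if (maxPerDraw <= 0)
            throw new ArgumentOutOfRangeException("maxPerDraw");

        int batches = 0;
        for (int start = 0; start < totalInstances; start += maxPerDraw)
        {
            // The last batch may be smaller than maxPerDraw.
            int count = Math.Min(maxPerDraw, totalInstances - start);
            drawBatch(start, count);
            batches++;
        }
        return batches;
    }
}
```

Inside the callback, an XNA program would presumably upload the per-instance data for that range and make a single GraphicsDevice.DrawIndexedPrimitives call. Because the replicated buffers store instances in order, a partial batch of quads only needs to draw the first count * 4 vertices and count * 2 triangles.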

In the demo, I draw hexagrams just to show you can do something more interesting than quads. And I do some translation, rotation, and color cycling to exercise the per-instance information.

I'm using GameTime.IsRunningSlowly to tell if I'm doing so much work per frame that the system can't keep up with the target rate (60fps). On the Xbox I can maintain 60fps with up to around 15,000 instances. I suspect that there is more graphics power still going unused...perhaps I am causing too much garbage collection to render more. On the PC, I can do over 40,000 instances -- your mileage may vary depending on your hardware configuration, of course.

Instancing is great...I'll be using it a lot in my drawing code from now on.

-Mike

InstancedShapeDemo_1_0.zip