Age of Ascent
An epic space game on Microsoft Azure
By: John Donnelly, Anko Duizer, David Gristwood, and Neil Kidd
HTML5 | WebGL | Azure
Age of Ascent from Illyriad Games is a massively multiplayer online (MMO) epic space game, powered by Microsoft Azure. An important and distinctive aspect of the game, which sets it apart from others of its genre, is its ability to have tens of thousands of players in a single world, engaged in epic battles and exploration. This has been made possible by designing the game from the ground up to run across hundreds of computers, yet create a single, fluid, cohesive universe. Age of Ascent will usher in a new era of ultra-MMOs.
While this scale of battle may not happen every day, it is a fundamental proposition of the game that it can and will occur. The underlying engine that powers it needs to be able to deliver this degree of massive scale, as and when needed, without any advance notice from the game community.
The ability to deliver this scale will be built into the final version of the game, currently under development. The space combat gameplay described here represents an important part of the game that will be released in 2015, but it's only one part.
This article describes the game’s architecture, the journey through its early development, and some of the design decisions that were made along the way to help support these goals.
More details of Age of Ascent can be found at https://www.ageofascent.com.
MMO games, such as EVE Online and World of Warcraft, allow game designers the freedom to create immersive and exciting worlds, from historical and mythical backdrops to vast unexplored galaxies and planets. In such online MMO worlds, users can explore, battle, undertake missions, create teams and clans, and define their own destinies and adventures.
Creating such large and open worlds is difficult, and so game designers put in place game mechanisms to limit the openness of the worlds they create. Typically these either constrain the number of users in an area, or reduce the ability to roam and explore freely. In some cases time needs to be slowed down to prevent the system from overloading.
Often these limits are driven by hardware constraints, particularly the processing power of an individual machine. The classic approach has been to find ever bigger and more powerful machines, a strategy that works only up to a certain point, because there is always an upper limit. It’s also not a cost-effective model, as larger machines have more complex subsystems and are correspondingly more expensive.
Creating a game that can scale out across hundreds or thousands of machines is the only way to break these constraints, but it comes at the cost of a potentially more complex design with more moving parts. A game of this scope requires a mechanism to distribute the world across these machines, combined with a communications system to manage the messages that need to flow quickly and seamlessly between them.
And it all needs to be presented to users as a single, contiguous world.
The Microsoft Azure public cloud provided Illyriad with an economically viable means to create such a world. It has the ability to stand up and tear down hardware on demand, in real time, in response to the number of users in a game and to do so at a very low cost.
Age of Ascent is the first in a new generation of MMO games that break through these barriers, and create truly open worlds on a scale never before seen. We'll look at some of the key design and development aspects of the game, and how it was built from the ground up, by a small development team, to run on Microsoft Azure.
Illyriad Games evaluated a number of hosting options and felt Azure was a perfect fit for Age of Ascent. It designed the game from the outset to fully exploit the platform. One of the key factors that influenced its decision was the Microsoft Azure platform as a service offering, which automates many aspects of the developer operations process, enabling small teams to deliver big ideas quickly and easily. Platform as a service solutions are usually faster to develop because there is less work for developers, so they can move from idea to working system more quickly. There is less administration and management work, so ongoing support costs are lower. And there is less risk: because the platform does more for you, there are fewer opportunities for error, and creating and running applications becomes more reliable.
As Ben Adam, CTO at Illyriad, notes, “Microsoft Azure’s always patched, immediately deployed platform as a service, its on-demand scaling and its per-minute billing mean we pay only for what we need and have no speculative costs. At Illyriad, time not spent on server administration is directly added to game development, so using Microsoft Azure is a huge win for us, and our players.”
Because Illyriad Games was planning to build a game of such ambitious size and scope, it worked closely with the Microsoft Technology Centre and the Microsoft Developer eXperience Technical Evangelism & Development team. These teams have had extensive experience working with customers on a wide range of projects using the Microsoft Azure platform. They are familiar with the challenges facing the design of highly scalable systems and were able to provide real-world guidance and advice as well as interaction with the relevant product teams within Microsoft.
Several architectural design sessions were run during the project to design the key aspects of the architecture. Considerable thought was given to the scale out design needed, the ways in which the universe could be modelled to facilitate that design, and how best to implement it in a robust and resilient fashion.
Age of Ascent leverages many aspects of the Microsoft Azure platform to build and deploy a system on such a scale. In particular:
- The game needed to be responsive, so the ability to partition the game across a number of datacenters, geographically distributed across the globe, offering high-bandwidth, low-latency network access, was critical.
- The game designers didn’t want to explicitly manage the process of connecting gamers to the best point of presence for them, so they used the Microsoft Azure Traffic Manager, which provides geo-routing to the datacenter that offers the best game experience, based on the fastest response time.
- It used the Microsoft Azure Content Delivery Network (CDN), which provides additional “edge” nodes across the globe, to cache frequently downloaded files and higher latency shared data (e.g., distant events).
- To persist game and user data it used the Microsoft Azure storage system, which offers a high-availability, triple-copied, geo-replicated store, to ensure that once data was committed to the system, it would never be lost and would always be available.
Age of Ascent is a browser-based game, in which users connect via a Web socket over SSL to Microsoft Azure. In the game, as users initiate actions, such as moving or firing, they are transmitted down the Web socket. Similarly, external events that affect the users, such as being fired upon or a spaceship coming into view, are transmitted back via the Web socket.
The server-side game logic runs across multiple Microsoft Azure datacenters, and is responsible for managing the universe, sending those messages across the different systems to ensure that the universe is fully connected and acts as a single coherent unit.
When a user joins the game, the browser creates a Web socket that links that instance of the browser with Microsoft Azure. This Web socket is then used as a bidirectional full-duplex link between the browser and the backend system, feeding in user commands and receiving the data packets needed to render the surrounding universe.
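To make the idea of small command and event packets travelling over the Web socket concrete, here is a minimal sketch of packing a player action into a fixed-size binary frame. The wire layout, the action codes and the function names are all assumptions made for illustration; the article does not describe the game's actual message format.

```python
import struct

# Hypothetical wire layout (an assumption, not the game's real protocol):
# 1-byte action code, 4-byte player id, three 32-bit floats for a direction.
ACTION_FORMAT = "<BI3f"
ACTIONS = {"move": 1, "fire": 2}
ACTION_NAMES = {code: name for name, code in ACTIONS.items()}


def encode_action(action: str, player_id: int,
                  direction: tuple[float, float, float]) -> bytes:
    """Pack a player action into a fixed-size binary frame."""
    return struct.pack(ACTION_FORMAT, ACTIONS[action], player_id, *direction)


def decode_action(frame: bytes) -> dict:
    """Unpack a binary frame back into a dictionary."""
    code, player_id, x, y, z = struct.unpack(ACTION_FORMAT, frame)
    return {"action": ACTION_NAMES[code], "player_id": player_id,
            "direction": (x, y, z)}


frame = encode_action("fire", 42, (0.0, 1.0, 0.0))
message = decode_action(frame)
```

A compact binary frame such as this keeps each message to a few bytes, which matters when many actions per second flow over a single connection.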
Unlike more traditional HTTP traffic, which is stateless, the Web socket model is stateful, and so it needs to be anchored to a specific Web server, where the Web socket context information is maintained. The routing service understands the “sticky” nature of Web sockets, and helps manage the way new requests will be balanced across available servers, and how to cope if the Web socket breaks and needs to reconnect. The routing service runs as a set of worker roles in Microsoft Azure, and uses the standard Microsoft Azure load balancer to round-robin requests to an individual role. As users connect to the routing service, any necessary security checks are made at that point, and as messages enter the system, they are checked to ensure they are correctly structured and valid to help handle potential data loss and possible cheat tampering.
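The kind of structural and validity check described above might look something like the following sketch. The message fields, the allowed actions and the speed limit are hypothetical values chosen for illustration, not details taken from the game.

```python
import math

ALLOWED_ACTIONS = {"move", "fire"}   # hypothetical action names
MAX_SPEED = 500.0                    # hypothetical speed limit, in game units per second


def validate_message(message: dict) -> list[str]:
    """Return a list of problems; an empty list means the message is accepted."""
    problems = []
    if message.get("action") not in ALLOWED_ACTIONS:
        problems.append("unknown action")
    velocity = message.get("velocity")
    well_formed = (
        isinstance(velocity, (list, tuple))
        and len(velocity) == 3
        and all(isinstance(v, (int, float)) and math.isfinite(v) for v in velocity)
    )
    if not well_formed:
        problems.append("malformed velocity")
    elif math.sqrt(sum(v * v for v in velocity)) > MAX_SPEED:
        # A structurally valid message can still describe an impossible move.
        problems.append("speed exceeds limit")
    return problems
```

Rejecting a message at the edge, before it reaches the rest of the system, means a corrupted or tampered packet never has to be handled by the game logic.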
Interest management is the process of reducing the vast size of the universe, and the objects that inhabit it, down to those aspects that are directly relevant to a specific user, at a given point in time and space. Focusing on those objects, such as spaceships, explosions and bullets, based on their closeness and movement, makes for a better user experience. A ship in the far distance does not need to be reported back to the user with the same urgency as one close up, so Age of Ascent uses higher and lower latency events to work out how best to display the universe. Rendering and managing the few thousand spaceships that were closest to a user in high fidelity gave the game a very fluid feel, while more distant objects, such as a distant explosion, could be handled with less urgency.
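As a rough illustration of splitting objects by urgency, the sketch below sorts objects by distance from the player and keeps only the closest ones in a low latency set. The radius, the cap and the object structure are assumptions for the example; the game's real selection rules are not described in the article.

```python
import math


def split_by_interest(player_pos, objects, near_radius, max_near):
    """Return (near, far): near objects get low latency updates, the rest are deferred."""
    by_distance = sorted(objects, key=lambda obj: math.dist(player_pos, obj["pos"]))
    near, far = [], []
    for obj in by_distance:
        close = math.dist(player_pos, obj["pos"]) <= near_radius
        if close and len(near) < max_near:
            near.append(obj)
        else:
            # Too distant, or the high-fidelity budget is already used up.
            far.append(obj)
    return near, far


ships = [
    {"id": "a", "pos": (1.0, 0.0, 0.0)},
    {"id": "b", "pos": (50.0, 0.0, 0.0)},
    {"id": "c", "pos": (2.0, 0.0, 0.0)},
]
near, far = split_by_interest((0.0, 0.0, 0.0), ships, near_radius=10.0, max_near=1)
```

Capping the near set bounds the work done per player no matter how crowded the surrounding space becomes.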
The interest management service implements this divide-and-conquer approach, making it possible to render and process the data that is of interest to a specific user in a vast and busy universe. Each worker role owns a particular area of the universe, with awareness of objects and events in nearby areas. These interest management areas of space overlap slightly, and as users move through space, they may be handed off from one interest management worker role to another, based on where they are in space. There is a one-to-one mapping between the routing service and the interest management context for each user, and overlapping areas of space share high-fidelity, time-sensitive data units with their neighbors.
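The overlap and handoff idea can be sketched in one dimension, which is a deliberate simplification of a three-dimensional universe. The area size, the overlap width and the rule of keeping the current owner until the player leaves its extended area are all assumptions made for this example.

```python
REGION_SIZE = 1000.0   # hypothetical width of one area, in game units
OVERLAP = 100.0        # hypothetical overlap shared with each neighbour


def regions_aware_of(x: float) -> list[int]:
    """Indices of all areas (owner plus overlapping neighbours) that should see position x."""
    owner = int(x // REGION_SIZE)
    aware = [owner]
    offset = x - owner * REGION_SIZE
    if offset < OVERLAP:
        aware.insert(0, owner - 1)
    if offset > REGION_SIZE - OVERLAP:
        aware.append(owner + 1)
    return aware


def next_owner(current: int, x: float) -> int:
    """Keep the current owner while x stays inside its extended area, else hand off."""
    low = current * REGION_SIZE - OVERLAP
    high = (current + 1) * REGION_SIZE + OVERLAP
    if low <= x <= high:
        return current
    return int(x // REGION_SIZE)
```

Delaying the handoff until the player has left the overlap avoids bouncing a ship back and forth between two worker roles when it flies along a boundary.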
Microsoft Azure Traffic Manager
The Microsoft Azure Traffic Manager is responsible for initially directing users to the most appropriate Microsoft Azure datacenter. It supports a number of options, but for this game, because one of the key design criteria is to keep the game as interactive as possible, requests are routed to the datacenter that will give the user the lowest latency and therefore the best experience.
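The selection rule itself is simple and can be illustrated as below. Traffic Manager makes this decision at the DNS level, so this sketch only shows the rule, not how the service works; the datacenter names and latency figures are made up for the example.

```python
def pick_datacenter(latency_ms: dict[str, float]) -> str:
    """Return the name of the datacenter with the lowest measured latency."""
    if not latency_ms:
        raise ValueError("no datacenters available")
    return min(latency_ms, key=latency_ms.get)


# Hypothetical round-trip times for one player, in milliseconds.
measured = {"north-europe": 28.0, "east-us": 95.0, "southeast-asia": 210.0}
best = pick_datacenter(measured)
```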
The communications backplane was developed to optimize communications across the system. There are two backplane communication servers for each cluster of routing and interest management servers, to ensure resilience and high availability. They handle all communication between clusters, and ensure that every interest management server gets all the messages it needs to manage its part of the universe. Intelligent filtering further reduces the need to send messages to other subsystems where they are not needed.
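One simple way to picture this kind of filtering is a registry of which servers care about which areas, so that a message is forwarded only where it is needed. The class, the server names and the area identifiers below are hypothetical; the article does not say how the real backplane decides what to forward.

```python
from collections import defaultdict


class Backplane:
    """Forwards each message only to the servers that registered interest in its area."""

    def __init__(self) -> None:
        self._interested = defaultdict(set)  # area id -> names of interested servers

    def register(self, server: str, area: int) -> None:
        self._interested[area].add(server)

    def recipients(self, sender: str, area: int) -> set[str]:
        """Servers that should receive a message about `area`, excluding the sender."""
        return self._interested.get(area, set()) - {sender}


backplane = Backplane()
backplane.register("im-1", area=0)
backplane.register("im-2", area=0)   # im-2 overlaps area 0
backplane.register("im-2", area=1)
backplane.register("im-3", area=2)
```

Here a message about area 0 sent by `im-1` reaches only `im-2`, and `im-3` is never contacted.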
Users joining the game don’t get direct access to it but instead go through a special gatekeeper service. This level of indirection between the outside world and the game offers a number of benefits, not least of which is providing a security boundary to manage invalid or illegal attempts to enter the game. Also, during particularly heavy periods of game activity, the gatekeeper service limits the number of people connected to the game.
The gatekeeper service is fronted by the game’s landing page, and so, while these checks take place in the background, the landing page can provide up-to-the-minute game information, video footage, news and feeds. The game assets that are stored in the Microsoft Azure CDN are downloaded to further reduce the already low startup time on game entry. The client download for the entire game, including all the assets and textures, was less than 9 MB.
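The admission limit applied by the gatekeeper could be sketched as a simple capacity check. The waiting queue is an assumption added for the example, since the article only says that the number of connected players is limited.

```python
from collections import deque


class Gatekeeper:
    """Admits players up to a fixed capacity and queues the rest (queue is an assumption)."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.active = set()
        self.waiting = deque()

    def request_entry(self, player: str) -> bool:
        """Admit the player if there is room, otherwise hold them on the landing page."""
        if len(self.active) < self.capacity:
            self.active.add(player)
            return True
        self.waiting.append(player)
        return False

    def leave(self, player: str) -> None:
        """Free a slot and admit the longest-waiting player, if any."""
        self.active.discard(player)
        if self.waiting and len(self.active) < self.capacity:
            self.active.add(self.waiting.popleft())


gate = Gatekeeper(capacity=2)
results = [gate.request_entry(name) for name in ("ann", "bob", "cyd")]
gate.leave("ann")
```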
Global high latency service
From a user experience perspective, it is important to let players experience the vastness of the Age of Ascent universe, but not to the point that the browser is overwhelmed and unable to cope with the volume of messages or number of objects it needs to handle. The high latency service helps solve this problem by updating the client every few seconds with the major events that occur in the far distance that don’t yet affect the user. These events are rendered several seconds after they occur, a latency that is acceptable given the distances involved, especially in the context of how much time it will take the user to fly to these locations.
The high latency service is also responsible for updating the client with the overall team score, the number of current players and the number of peak players. This data only needs to be eventually consistent, so it doesn’t need to be delivered with low latency.
The high latency service utilizes Microsoft Azure Storage and Microsoft Azure CDN. Every instance of the communications backplane service writes all the major events it tracks to an Azure table storage partition for its own scale unit.
The high latency service then periodically reads these major events from the tables in all scale units and creates a binary file, saved to Microsoft Azure blob storage, containing all the major events that occurred across the whole universe within the past few seconds. This file is set up to be cached and served through the CDN.
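The aggregation step can be sketched as follows, with plain in-memory dictionaries standing in for the table storage partitions and a `bytes` value standing in for the blob. The event fields and the binary layout are assumptions for illustration only.

```python
import struct

# Hypothetical event layout: timestamp (double), event type (uint32), x, y, z (floats).
EVENT_FORMAT = "<dI3f"


def build_snapshot(partitions: dict[str, list[tuple]], now: float, window: float) -> bytes:
    """Merge recent events from every scale unit into one binary snapshot."""
    recent = [
        event
        for events in partitions.values()
        for event in events
        if now - window <= event[0] <= now
    ]
    recent.sort(key=lambda event: event[0])
    header = struct.pack("<I", len(recent))  # number of events that follow
    body = b"".join(
        struct.pack(EVENT_FORMAT, timestamp, kind, *pos)
        for timestamp, kind, pos in recent
    )
    return header + body


partitions = {
    "scale-unit-1": [(95.0, 1, (10.0, 0.0, 0.0))],   # too old for the window
    "scale-unit-2": [(99.0, 2, (0.0, 5.0, 0.0))],
}
snapshot = build_snapshot(partitions, now=100.0, window=3.0)
```

Because every client downloads the same small file, the cost of distributing distant events does not grow with the number of players once the file is cached at the edge.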
Testing is important for all software of this size and complexity, but because scalability was the key requirement for breaking the world record for the largest online game, a special emphasis was placed on discovering and removing bottlenecks, in addition to the normal bug finding process.
The Age of Ascent testing cycle consisted of an ongoing series of private and public beta tests to get feedback from users about game play, usability and performance and acquire important test data. Early on, private beta tests were run on just a single cluster to test how well it performed, while a subsequent test was run across multiple datacenters to flex the various subsystems. Public beta tests were then run every week, to check performance optimizations and gain more real-world experience with users.
Although real-world usage is vital for testing the game, it does require a certain amount of management, not to mention a willing group of volunteers. During some early beta tests, the messages flowing through the system were logged to Azure storage. Those messages represent all the actions in the game from myriad users, enabling elements of the game to be replayed at any time, under controlled conditions. In the early stages of game testing, this was felt to be more valuable than trying to use artificial intelligence to simulate users, because it reflects how real players behave.
To do this, it was necessary to build a test harness, which connected via a Web socket to the game, to simulate a user.
Each harness would typically drive one ship, using data extracted from the logged ship data in Azure storage. Visual Studio Test Manager was used to configure and drive these test harnesses, and to gather and log diagnostic information from the test runs. The harnesses were run directly from within Azure, which provided compute resources with a low-latency, high-quality network connection to the datacenters where the game was running.
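The core of such a replay harness is a loop that resends logged messages while preserving the gaps between them. The sketch below assumes a log of `(timestamp, payload)` pairs and takes the send and sleep functions as parameters, so it can be exercised without a real Web socket; none of these names come from the actual harness.

```python
import time
from typing import Callable, Iterable


def replay(log: Iterable[tuple[float, bytes]],
           send: Callable[[bytes], None],
           speed: float = 1.0,
           sleep: Callable[[float], None] = time.sleep) -> int:
    """Resend logged messages with their original spacing, scaled by `speed`."""
    previous = None
    sent = 0
    for timestamp, payload in log:
        if previous is not None:
            # Wait for the recorded gap between this message and the previous one.
            sleep(max(0.0, (timestamp - previous) / speed))
        send(payload)
        previous = timestamp
        sent += 1
    return sent


# Dry run: record what would be sent and how long the harness would wait.
sent_payloads, delays = [], []
log = [(0.0, b"a"), (0.5, b"b"), (2.0, b"c")]
count = replay(log, sent_payloads.append, speed=2.0, sleep=delays.append)
```

A `speed` above 1.0 compresses the recorded session, which is one way to push more load through the system than the original players generated.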
In the later part of the development cycle, once beta testing was underway, the team introduced artificial players into the game, alongside real ones, to help stress test the system under load.
The other major aspect to testing was to understand the scale characteristics of the system being built. Unlike most other systems, where the designers have some idea up front of how many users they need to support, during the development of Age of Ascent the focus was on understanding how many users a particular configuration could support.
The team designed the system using the concept of scale units. A scale unit is the smallest unit of deployment, which for this project is composed of:
- 12 routing service worker roles
- 12 interest management worker roles
- 2 backplane service worker roles
The high latency service runs as a global service outside the scale unit.
All the services run in large Azure virtual machines, each with 4 cores and 7 GB of memory. Careful monitoring of the virtual machines revealed them to be mainly CPU and network constrained but not memory constrained.
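The figures above make it easy to work out the footprint of a deployment. The short sketch below totals the virtual machines, cores and memory for a given number of scale units, assuming one virtual machine per worker role instance and leaving out the global high latency service.

```python
# Worker role counts per scale unit and VM size, as listed above.
SCALE_UNIT = {"routing": 12, "interest_management": 12, "backplane": 2}
CORES_PER_VM = 4
MEMORY_GB_PER_VM = 7


def scale_unit_totals(units: int = 1) -> dict[str, int]:
    """Total VMs, cores and memory for `units` scale units (one VM per role instance)."""
    vms = units * sum(SCALE_UNIT.values())
    return {
        "vms": vms,
        "cores": vms * CORES_PER_VM,
        "memory_gb": vms * MEMORY_GB_PER_VM,
    }


one_unit = scale_unit_totals(1)
```

A single scale unit therefore comes to 26 virtual machines, 104 cores and 182 GB of memory.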
During the load tests, performance counters for CPU and network were carefully monitored to determine how “hot” machines were running.
During one of the large-scale public betas in early 2014, when all the moving parts and scale units were stood up, the system coped with an amazing 267 million application messages per second, with each message being personalized and unique to each player. This equates to in excess of 100,000 concurrent users in a single contiguous battle.
- Start exploring Microsoft Azure at the official website, and get your free trial
- Microsoft Azure Traffic Manager technical overview
- Microsoft Azure Content Delivery Network (CDN) technical overview
- Find out more about how to test applications using Microsoft Visual Studio Test Manager
- Find out more about developing for Internet Explorer 11 at the Dev Center
- Learn more about the Microsoft Technology Center offerings
While many aspects of the Age of Ascent development story may be unique to its particular design and requirement, some constants hold true:
Iterative design works. The architecture and design of Age of Ascent went through a number of iterations, as the scope and feature set were fleshed out, and various ideas were put forward, debated and analyzed for the trade-offs and benefits. Doing this early, re-evaluating and then prototyping and testing led to significant time savings in the long run.
Know the bottlenecks. The testing and debugging cycle is about finding the current bottleneck, removing it, and then repeating the process until either time or money runs out. Knowing if a specific unit within the system is likely to be bound by CPU, memory, network, etc., is vital. Fortunately, Azure supports a range of different virtual machine sizes and capabilities, so it’s possible to test and determine the optimum specification.
Ship regularly. As Jim McCarthy, ex-Microsoft Visual C++ general manager, advocates in his excellent book “Dynamics of Software Development,” ship regularly and you will get really good at shipping. In this case shipping meant running an end-to-end beta test of the system under load, with real users in play. In the end an ongoing program of betas was shipped, and in each beta the team learned more about the way the system performed, collected more data, explored different configurations, and learned how to stand up and manage hundreds of cores across multiple datacenters.
Unit test. Although unit testing is not specific to large-scale projects such as Age of Ascent, every complex system is built upon smaller units of functionality, and relies on them to work well and perform under a wide range of scenarios, including edge cases, invalid data, etc. Though it can be tempting to skip unit testing due to time constraints, building such tests, as the application is being developed, is a good approach for everyone on the project.