Microsoft uses Cloud Platform System (CPS) in production for engineering workloads

You don’t often think of enterprise private clouds and high scale going together. Surely 45K running VMs, 200K cores and 20K VMs created and deleted per day must be the domain of public cloud hosters? In fact this is Microsoft’s internal private IaaS cloud (called Nebula), which offers compute capacity for development and validation to Microsoft product engineering teams. The challenge for the Nebula team is not just scale but also functional capability, with many engineering teams bringing exacting needs for reliability, compute and networking. With a long list of these needs not being met by existing systems, the Nebula team jumped on board as the first internal partner for Microsoft’s Cloud Platform System (CPS).

Here’s a rundown of the key CPS features the Nebula service relies on:

Windows Azure Pack (WAP) Self-service portal experience

The Nebula service caters for both individual development engineers requiring small environments and test automation systems that often grab hundreds of VMs at a time. For the development engineers the Nebula team offers a self-service experience via the WAP portal. The WAP portal shipping with CPS has been customized for the needs of Nebula users, and it is an example of how an enterprise IT department could tailor the experience.

The following images show the home page view and the creation of a VM in the Nebula WAP portal.

Nebula WAP portal home page view

Creating a CPS VM in the Nebula WAP portal

Reliable, well monitored capacity

One of the things that Microsoft engineering groups really care about is the reliability of their test runs. For production services we all expect high availability and reliability, but for continuous engineering validation, consistency and repeatability are also vital, often without the luxury of resilient components in the mix. Hence the Nebula team has been striving to offer a high single-VM-instance SLA to internal partners. CPS is ideal in this regard since it has been built with redundancy and resiliency from the hardware design up through the storage, networking and management systems. Nebula offers CPS as premium reliability, in contrast to the standard reliability of our existing data center hardware.

Monitoring and troubleshooting abilities are also a key part of the equation for offering customers a comprehensive, reliable service. Historically the Nebula team has found that in many cases an engineering test run is hit not by a VM crashing but by a failure in some aspect of the networking or storage environment. CPS offers a comprehensive view of what is going on with the stamp within Operations Manager. A detailed description of the features used is contained in this blog post:

Specialized workloads

A conventional workload for CPS is SharePoint or SQL Server, or, getting more complex, Microsoft Dynamics. All of these offer a network endpoint through which client applications consume their services, and that endpoint can be connected to a corporate network or exposed to the internet in a secure manner. Now consider testing a system which offers remote Operating System Deployment (OSD) capabilities using PXE boot, including its own DHCP and DNS environment. Not the sort of infrastructure you want connected to a corporate network if you value your laptop getting email on a Monday morning! Thankfully CPS has a flexible solution to this problem using Software Defined Networking.

Nebula solves this problem by offering isolated networking environments via Windows virtual networking within CPS. This allows the flexibility of having one v-net testing an OSD scenario in an isolated network, accessed via a proxy server, right alongside another v-net in the same CPS stamp offering a conventional workload accessible from the corporate network. Previously Nebula had to dedicate specific hardwired data center kit to OSD scenarios, which made poor use of capacity.
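To picture why isolation matters here, the following is a toy model only, not how CPS or Windows virtual networking is implemented. The `VNet` class, the `dhcp_responders` helper and all machine names are hypothetical. It simply assumes that a DHCP broadcast is never heard outside the v-net the client is attached to.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class VNet:
    """Hypothetical stand-in for one virtual network on a stamp."""
    name: str
    dhcp_server: Optional[str] = None          # DHCP server living in this v-net
    members: Set[str] = field(default_factory=set)


def dhcp_responders(client: str, vnets: List[VNet]) -> List[str]:
    """Return the DHCP servers that can hear a broadcast from `client`.

    Assumption of this sketch: broadcasts stay inside the client's own v-net.
    """
    return [
        v.dhcp_server
        for v in vnets
        if client in v.members and v.dhcp_server is not None
    ]


# One isolated v-net running the OSD system under test, one corporate v-net.
osd_lab = VNet("osd-lab", dhcp_server="osd-test-dhcp", members={"pxe-client"})
corp = VNet("corp", dhcp_server="corp-dhcp", members={"laptop"})
stamp = [osd_lab, corp]

laptop_sees = dhcp_responders("laptop", stamp)
pxe_sees = dhcp_responders("pxe-client", stamp)
```

In this model the laptop on the corporate v-net never receives an answer from the DHCP server under test, even though both v-nets share the same stamp.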

Another workload that Nebula is now offering via CPS is code integration. This IO-intensive operation for merging code trees is working well using the CPS shared storage and was not possible on Nebula’s conventional direct-attached storage servers.

Service operations value

The Nebula team have to deal with the challenges of growing capacity needs, balancing Azure and private cloud use, and meeting customers’ specialized feature needs. All this comes on top of day-to-day support for customers and dealing with data center issues. In this environment the hit to the operations team from setting up new hardware and environments is considerable. This is where CPS really helps out.

Ready to connect, integrated support

When new hardware kit arrives, it is standard practice for the Nebula operations team to spend time imaging the servers and setting up management and fabric management systems. When a CPS stamp is handed over to the Nebula operations team, all of this is already done, and all that’s needed is to hook up the CPS interfaces to the Nebula VM request infrastructure and create some accounts. Since the System Center components are all running, it’s easy to manage and monitor the stamp as it is put through essential burn-in testing. Since the Azure Operational Insights data is also available to the CPS support team, they can see any problems in production usage in real time. Because this is done in coordination with Dell, it offers a faster and tighter loop for diagnosing issues.

Patching and upgrade

One bane of the Ops world is OS security updates for customer host servers. Nebula have used Windows Server live migration for some time, but with 4K servers in production running at 90% capacity it’s a fast game of Tetris to shift customer VMs around and complete patching in a couple of days. In short, the administration burden is too great, so patching and rebooting the servers remains the predictable path to ensure a secure private cloud.

Enter CPS with Cluster-Aware Updating, and the operational effort is hugely reduced. CPS takes care of moving VMs around, making continuous operation for customers a practical reality.
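The game of Tetris can be illustrated with a small simulation. This is a minimal sketch of a generic drain, patch and move-on loop, not the actual CPS or Cluster-Aware Updating logic. The `Host` class, the slot-based capacity model and the ten-host cluster are all assumptions made for illustration; only the 90% utilization figure comes from the text above.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Host:
    name: str
    capacity: int                       # VM slots on this host (hypothetical unit)
    vms: List[str] = field(default_factory=list)
    patched: bool = False

    @property
    def free(self) -> int:
        return self.capacity - len(self.vms)


def drain(host: Host, others: List[Host]) -> bool:
    """Move every VM off `host` onto other hosts that have free slots.

    Returns False, moving nothing, if the rest of the cluster lacks room.
    """
    if sum(h.free for h in others) < len(host.vms):
        return False
    for vm in list(host.vms):
        # Pick the host with the most free slots to keep the load spread out.
        target = max(others, key=lambda h: h.free)
        host.vms.remove(vm)
        target.vms.append(vm)
    return True


def rolling_patch(hosts: List[Host]) -> List[str]:
    """Patch hosts one at a time; return the order in which they were patched."""
    order = []
    for host in hosts:
        others = [h for h in hosts if h is not host]
        if not drain(host, others):
            raise RuntimeError(f"not enough headroom to drain {host.name}")
        host.patched = True             # stand-in for installing updates and rebooting
        order.append(host.name)
    return order


# Ten hosts with 10 slots each, every host 90% full.
cluster = [Host(f"host{i}", 10, [f"vm{i}-{j}" for j in range(9)]) for i in range(10)]
patched_order = rolling_patch(cluster)
total_vms = sum(len(h.vms) for h in cluster)
```

With ten hosts at 90% utilization there is exactly one host’s worth of spare slots, so this loop only works if hosts are drained strictly one at a time. One more VM in the cluster and the first drain would fail, which is why doing this by hand at scale is so laborious.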

Well, that’s a brief rundown of how Microsoft’s internal private cloud is gaining huge benefit from CPS adoption. In a future post we’ll look in more detail at the operational aspects of using CPS.

Microsoft Nebula Service team