Unreal Pixel Streaming at Scale in Azure

2022-11-02

Overview

This document goes through an overview on how to deploy Unreal Engine's Pixel Streaming technology in Azure at scale, which is a technology that Epic Games provides in their Unreal Engine to stream remotely deployed interactive 3D applications through a browser (i.e., computer/mobile) without the need for the connecting client to have GPU hardware. Additionally, this document will describe the customizations Azure Engineering has built on top of the existing Pixel Streaming solution to provide additional resiliency, logging/metrics and autoscaling specifically for production workloads in Azure. The additions built for Azure are released on GitHub, which consists of an end-to-end solution deployed via Terraform to spin up a multi-region deployment with only a few Terraform commands. The deployment has many configurations to tailor to your requirements such as which Azure region(s) to deploy to, the SKUs for each VM/GPUs, the size of the deployment, HTTP/HTTPs and autoscaling policies (node count & percentage based).

This solution supports UE4 and UE5 applications, though note that for UE5's Nanite functionality to work, the Operating System for the GPU VM must support the DirectX 12 Agility functionality only available in Azure on Windows 10 and Windows Server 2022 or later.

For a detailed overview of Pixel Streaming and architectures in Azure, see our documentation here. For a more simplified quick-start for the process on manually deploying to a single VM with Matchmaker and Signaling Server in Azure, see the Microsoft documentation here. For the easiest to deploy and the most supported solution for deploying Unreal Pixel Streaming in Azure with Unreal Engine 4 or Unreal Engine 5, please use the Epic Games Azure Marketplace solution, which is where all the latest updates and features are moved to going forward. To jump directly to the documented steps for deploying the older Terraform solution in Azure that we are discussing in this document, click here.

Additions Added by Microsoft

Microsoft has worked with Epic to customize Pixel Streaming for the cloud using Microsoft Azure, which has resulted in many key additions to deploy and monitor a Pixel Streaming solution at scale. Below are the notable additions that have been incorporated into a Fork of Unreal Engine on GitHub:

General
- Azure integration scripts, as the current product only exports out AWS scripts.
- End-To-End deployment in Microsoft Azure with Terraform
- Autoscaling capabilities to enable thresholds for scaling up or scaling back down GPU compute.
- Multi-region deployments in Azure with utility scripts for spin up/down/deployment management.
- Multiple streams per GPU (depending on complexity and resource usage of the 3D app).
- Automated installation of pre-requisites for the VMs (i.e., VC++, DirectX, Node.js, NVIDIA drivers, etc.)
- Auto startup tasks if the services/3D app close unexpectedly or when starting/restarting the VMs.
Matchmaker
- Resiliency for recovering from disconnects with the Signaling Server.
- Removed duplication of redirects causing shared sessions.
- HTTPS capabilities for the Matchmaker Node.js service
- Host and custom application metrics for monitoring in Azure with Application Insights
- Configuration file to avoid hard coding ports and config items in the JavaScript code.
- Autoscaling policies based on total and available connections, which trigger scale operations on the Virtual Machine Scale Set that hosts the Signaling Server and 3D application.
- Added a /ping API to use for Traffic Manager health checks.
Signaling Service
- Resiliency for failed Matchmaker connections
- Retries when WebSocket failures occur.
- Stderror logging to log any unhandled thrown errors.
- js scripting to restart and bring back up the 3D app when a user session ended (via scripts\OnClientDisconnected.ps1), giving new users a fresh session.

Azure GitHub Repo

The public GitHub repo can be found at Azure\Unreal-Pixel-Streaming. If you don't have a GitHub account, create one here and verify your email address.

Architecture

User Flow

Let's walk through the general flow of what is showed in the architecture diagram above when a user connects to the service:

Clients connect to their closest region (1 .. N regions) via Traffic Manager, which does a DNS redirect to the Matchmaking service VM.
The Matchmaking Service redirects to an available node on the paired VMSS which holds the Signaling Service and Unreal 3D app (doesn't use the Load Balancer). The VMSS nodes have public Ips for each VM and not a single private LB IP, otherwise the Matchmaking Service won't be able to redirect to the appropriate VMSS that's available (i.e., a LB would pick a random one)
The Signaling Service streams back the 3D app rendered frames and audio content to the client via WebRTC, brokering any user input back to the 3D app for interactivity.

Azure SKU Recommendations

Below are the recommended compute SKUs for general usage of Pixel Streaming in Azure:

Matchmaker : Standard_F4s_v2 or similar should be sufficient. 4 cores with a smaller memory footprint should be fine for most deployments as there is very little CPU/Memory usage due to the instant redirecting users to Signaling Servers.
Signaling Server : Standard_NV12s_v3 or Standard_NV6 might be the best price per performance GPU VMs in Azure for Pixel Streaming, with the newer NV12s_v3's providing better CPU performance at a similar price-point to the older NV6s. Both have a NVIDIA Tesla M60 GPU. If Ray Tracing is required in your app you'll need to look at the NCas T4 v3 and even more powerful NVadsA10v5 series VMs. For the NVIDIA A10 GPUs, be sure to use the Standard_NV36ads_A10_v5 SKU to leverage a full GPU, as this series is partitioned according to core counts. As GPU SKUs are in high demand, it's important to work with the capacity team early on to request the needed quota for any event or deployment that will be spinning up a great number of GPU VMs.

Important: It is recommended to first deploy your Pixel Streaming executable and run it on your desired GPU SKU to see the performance characteristics around CPU/Memory/GPU usage to ensure no resources are being pegged and frame rates are acceptable. Consider changing resolution and frames per second of the UE4 or UE5 app to achieve acceptable quality per your requirements. Additionally, consider the IOPS / latency requirements for the 3D app when choosing a disk, as SSDs and/or striping disks will be key to gaining the best disk speed (some GPU SKUs might not support Premium SSDs so also consider disk striping for adding IOPS).

Optimizing Pixel Streaming in Azure

Be sure to check out the Pixel Streaming in Azure Overview documentation to learn more about optimizing for Azure VM SKUs, performance and pricing optimizations.

Recommendations for Further Optimizations

The current customized solution in GitHub has many additions that make deploying Pixel Streaming in Azure at scale easier, and below are even more improvements on those customizations which would make it even better:

Add Queuing to Matchmaker : Currently when a Matchmaker can't send a user to any available Signaling Server there is no queuing as users stack up waiting for an available stream and would be good to put in a queuing system in Matchmaker to allow the user waiting the longest to get the next available server.
Matchmaker Database: To have improved resiliency of the MM it could share a common database between one or two MM VMs to keep connection/availability status in case one is restarted/goes down. Currently this connection status and stream availability is not persisted. This could be persisted to a Database like CosmosDB/SQL, or even to the local disk via a simple JSON file.
Multiple Resolutions for Desktop/Mobile: For mobile users there might not need to be a higher resolution for those streams, and so having multiple streams with different resolutions could maximize performance/costs to send mobile users to mobile streams (e.g., 720p) and desktop users go to higher resolution streams (i.e., 1080p). Either this could be lower resolution streams packed on certain GPU VMs, or a mix (e.g., 3 streams with 1080p, 720p, 720p).
Deploying with Azure DevOps or GitHub Actions: Currently much of the orchestration of deploying the code is done through PowerShell scripts in the scripts\ folder, but much of this could be moved into deployment workflows such as Azure DevOps or GitHub Actions. These were left out to reduce extra dependencies and complexity for those not familiar with these technologies, but for a production solution it would be recommended to utilize them (e.g., ADO, GitHub Actions, Jenkins, etc.).

Configurations

Below are notable configurations to consider when deploying the Pixel Streaming solution in Azure.

Terraform Configuration

There was a tremendous amount of work that went into building out the Terraform deployment for Pixel Streaming; however, unless you plan on making major modifications you can focus just on the following 3 files:

iac\terraform.tfvars: This stores the global variable for the deployment_regions, which specify which Azure region(s) will be used (default is "eastus") and their Virtual Network ranges:

iac\region\variables.tf: This is the most important file to be familiar with, as it has the configs for the gitpath (change to your Git fork), pixel_stream_application_name (change to your UE app name), along with other notable parameters such as desired FPS (default 60), resolution (default 1080p), starting instance count (default 1), instances per node (default 1), and Azure VM SKUs for the MM (default Standard_NV6) and SS (Standard_F4s_v2).

iac\variables.tf: This global variables file can mostly be ignored, unless needing to change the global resource group's name (base_resource_group_name), location (global_region, default: eastus), Traffic Manager port or storage account settings (tier/type).

See the Terraform section to learn more about the deployment files.

Deployment Script Configurations

The Git location referenced in the deployment is stored in the iac\region\variables.tf file. Important: You must have read access with a Personal Access Token (PAT) to the specified repository for the deployment to work, since when the VMs are created there is a git clone used to deploy the code to the VMs. Also, you'll want to validate if your organization needs to have Enterprise SSO enabled for your PAT.

Matchmaker Configuration

Below are the configurations available to the Matchmaker, which a config.json file was added to the existing Matchmaker code to reduce hard coding in the Matchmaker.js file:

{
  // The port clients connect to the Matchmaking service over HTTP
  "httpPort": 80,
  // The Matchmaking port the Signaling Service connects to the matchmaker over sockets
  "matchmakerPort": 9999,
  // Instances deployed per node, to be used in the autoscale policy (i.e., 1 unreal app running per GPU VM) – not yet supported
  "instancesPerNode": 1,
  // Amount of available Signaling Service / App instances to be available before we must scale up (0 will ignore)
  "instanceCountBuffer": 5,
  // Percentage amount of available Signaling Service / App instances to be available before we must scale up (0 will ignore)
  "percentBuffer": 25,
  //The amount of minutes of no scaling up activity before we decide we might want to see if we should scale down (i.e., after hours--reduce costs)
  "idleMinutes": 60,
  // % of active connections to total instances that we want to trigger a scale down if idleMinutes passes with no scaleup
  "connectionIdleRatio": 25,
  // Min number of available app instances we want to scale down to during an idle period (idleMinutes passed with no scaleup)
  "minIdleInstanceCount": 0,
  // The total amount of VMSS nodes that we will approve scaling up to
  "maxInstanceScaleCount": 500,
  // The Azure subscription used for autoscaling policy (set by Terraform)
  "subscriptionId": "",
  // The Azure Resource Group where the Azure VMSS is located, used for autoscaling (set by Terraform)
  "resourceGroup": "",
  // The Azure VMSS name used for scaling the Signaling Service / Unreal App compute (set by Terraform)
  "virtualMachineScaleSet": "",
  // Azure App Insights ID for logging and metrics (set by Terraform)
  "appInsightsId": ""
}

Signaling Server Configuration

Below are configs available to the Signaling Server in their config, some added by Microsoft for Azure:

{
  "UseFrontend": false,
  "UseMatchmaker": true, // Set to true if using Matchmaker.
  "UseHTTPS": false,
  "UseAuthentication": false,
  "LogToFile": true,
  "HomepageFile": "player.htm",
  "AdditionalRoutes": {},
  "EnableWebserver": true,
  "matchmakerAddress": "",
  "matchmakerPort": "9999", // The web socket port used to talk to the MM.
  "publicIp": "localhost", // The Public IP of the VM -- set by Terraform.
  "subscriptionId": "", // The Azure subscription -- set by Terraform.
  "resourceGroup": "", // Azure RG -- set by Terraform.
  "virtualMachineScaleSet": "", // Azure VMSS -- set by Terraform.
  "appInsightsId": "" // Azure App Insights ID for logging/metrics -- set by Terraform.
}

TURN / STUN Servers

In some cases, you might need a STUN / TURN server in between the UE4 or UE5 app and the browser to help identify public IPs (STUN) or get around certain NAT'ing/Mobile carrier settings (TURN) that might not support WebRTC. Please refer to Unreal Engine's documentation for details about these options; however, for most users a STUN server should be sufficient. Inside of the SignallingWebServer\ folder there are PowerShell scripts used to spin up the Cirrus.js service which communicates between the user and the UE4 or UE5 app over WebRTC, and Start_Azure_SignallingServer.ps1 or Start_Azure_WithTURN_SignallingServer.ps1 are used to launch with STUN / TURN options. Currently the Start_Azure_SignallingServer.ps1 file points to a public Google STUN server (stun.l.google.com:19302), but it's highly recommended to deploy your own for production. Unreal Engine exports out stunserver.exe and turnserver.exe when packaging up the Pixel Streaming 3D app to setup on your own servers (not included in repo): \Engine\Source\ThirdParty\WebRTC\rev.23789\programs\Win64\VS2017\release\

Start_Azure_SignallingServer.ps1 is called by runAzure.bat when deploying the Terraform solution, so if a TURN server is needed this can be changed in runAzure.bat to call Start_Azure_WithTURN_SignallingServer.ps1 with the right TURN server credentials updated in the PS file.

Unreal 3D App

The Unreal 3D app and dependencies reside in GitHub (Git-LFS enabled) under the Unreal\ folder. The Unreal\ folder structure aligns with what is exported out of Unreal Engine, and below are the specific files\folders you will want to copy over the existing files provided in the example GitHub repository:

Your exported <ProjectName>.exe should replace Unreal\PixelStreamingDemo.exe
<ProjectName>\ folder associated with the <ProjectName>.exe should replace the Unreal\PixelStreaming\ folder.
Replace the ThirdParty folder of the repo with your ThirdParty folder (i.e., \Engine\Binaries\ThirdParty\), as these third-party dlls are specific to what was used in your 3D application. The existing ones were just what was used in the example app provided in the repo. Make sure you can click on your `.exe' to run it locally in your cloned repo folder to ensure all dependencies are copied over. This is the only thing needed to be copied over from your own Engine\ folder to the repo.
Nothing more is needed to copy over unless you've changed any player.htm or specific customizations to the MM or SS web servers. These changes must be merged with the Microsoft special customizations and not replaced over our WebServer\ files to ensure a correct merge.
Important: Be sure to check in any code/app changes back into your forked repo as the Terraform deployment pulls from GitHub on your deployment and not your local resources.

The Unreal application has some key parameters that are passed in upon startup, which the Terraform deployment and PowerShell script (startVMSS.ps1) handles for you:

<PixelStreamingApp>.exe -AudioMixer -PixelStreamingIP=localhost -PixelStreamingPort=8888 -WinX=0 -WinY=0 -ResX=1920 -ResY=1080 -Windowed -RenderOffScreen -ForceRes

Notable app arguments to elaborate on for your understanding (see Unreal docs for others):

-ForceRes: It is important to make sure this argument is used to force the Azure VM's display adapter to use the specified resolution (i.e., ResX/ResY).
-RenderOffScreen: This renders the app in the background of the VM, so it won't be seen if RDP'ing into the box, which ensures that a window won't be minimized and not stream back to the user.
-Windowed: If this flag isn't used the resolution parameters will be ignored (i.e., ResX/ResY).
-PixelStreamingPort: This needs to be the same port specified in the Signaling Server, which is the port on the VM that the communicates with the 3D Unreal app over web sockets.

Autoscaling Configuration

Microsoft has added the ability to autoscale the 3D stream instances up and down, which is done from new logic added to the Matchmaker which evaluates a desired scaling policy and then scales the Virtual Machine Scale Set compute accordingly. This requires that the Matchmaker has a System Assigned Managed Service Identity (MSI) for the VM with permissions to scale up the assigned VMSS resource, which is setup for you already in the Terraform deployment. This eliminates the need to pass in special credentials to the Matchmaker such as a Service Principal, and the MSI is given Contributor access to the region's Resource Group that was created in the deployment—please adjust as needed per your security requirements.

Here are the key parameters in the Matchmaker config.json required to configure on autoscaling for the Signaling Server and 3D app (VMSS nodes). Important: Be sure to check in any config changes back into your forked repo as the Terraform deployment pulls from GitHub on your deployment and not your local resources.

instanceCountBuffer : Min amount of available streams before triggering a scale up (0 will ignore this). For instance, if you have 5 it will only trigger a scale up if only 4 or less streams are available.

percentBuffer : % of available streams before triggering a scale up (0 will ignore this). For instance, if you have 25 it will trigger a scale up if less than 25% of total connected Signaling Servers are available to stream.

idleMinutes : How many minutes of no new scale operations before considering a scale down (e.g., scale down after hours)

connectionIdleRatio : % of active streams to total instances that we want to trigger a scale down after idleMinutes passes.

minIdleInstanceCount : The number of VMSS nodes we want during an idle period (e.g., never go below 10 nodes)

maxInstanceScaleCount : The max number of VMSS nodes to scale out to (e.g., never scale above 250 VMs)

Player HTML & Custom Event Configuration

When Unreal Pixel Streaming is packaged from Unreal Engine the solution contains a \Engine\Source\Programs\PixelStreaming\WebServers\SignallingWebServer\player.htm file to customize the experience, along with the ability to customize JavaScript functions to send custom events between the browser and the 3D Unreal application. Please see Epic's robust documentation on how to make these extra customizations.

Deployment

This section will walk through all the steps necessary to deploy this solution in Azure. Currently the deployment expects a Windows OS as it references powershell.exe directly, though a simple symlink of pwsh to powershell.exe on Linux apparently works (will be added in a future release). Important: Be sure to first follow the guidance in the Configurations section to setup the git repo location.

To deploy the solution, use the steps here:

Install the following prerequisites:
- Install Git and Git LFS as well (i.e., "git lfs install" via PowerShell after installing Git)
- Make sure you have Azure CLI installed.
- Make sure you have terraform installed.
Do a git clone on the repo with a depth of 1 or a git pull if already cloned. Note: If you don't use a --depth 1 it will download the entire Git history for Unreal Engine ( will take a long time ).

  git clone --depth 1 https://github.com/Azure/UnrealEngine.git

Run PowerShell as Administrator (needed for setting PowerShell's execution policy below) and Login to Azure via Azure CLI in the PowerShell window:

  az login

Set the Azure subscription with Azure CLI to deploy into:

az account set --subscription "SUBSCRIPTION_NAME_HERE"

Set PowerShell's execution policy (per your security needs):

Set-ExecutionPolicy -ExecutionPolicy RemoteSigned -Scope LocalMachine

Delete the terraform.tfstate file in the iac\ folder if existing from a previous deployment.
In a PowerShell window navigate to the \iac folder where the repo was cloned and use these 3 commands with your Git PAT since the repo is private (replace PUT_GIT_PAT_HERE with PAT):

terraform init
terraform validate
terraform apply -var 'git-pat=PUT_GIT_PAT_HERE' --auto-approve

This process can take between 15-30 minutes to deploy, depending on resources deployed (i.e., UE app size, regions chosen, etc.). The PowerShell window should finish with a successfully completed message. The deployment creates 2 resources groups in Azure:

<random_prefix>-global-unreal-rg : This stores all global resources such as the Traffic Manager, Key Vault and Application Insights.
<random_prefix>-<region>-unreal-rg : This stores the Virtual Machine Scale Set (VMSS) for the GPU nodes that have the 3D app and Signaling Server, the Matchmaker VM and Virtual Network resources.

Testing the Deployment: Open up a web browser and paste in the DNS name from the Traffic Manager in the global Resource Group (e.g., http://<random_prefix>.trafficmanager.net) to be redirected to an available stream. The DNS name can be found under "DNS name: <link>" in the Overview page of the Traffic Manager resource in the Azure Portal. If you've deployed to multiple regions, you will be redirected to the closet Azure region.

Post the deployment there are processes that Terraform will run on the following solution components in each region upon startup of each VM:

Matchmaker VM
- Script: startMMS.ps1
- Executable: Node.exe
- Scheduled Task calling the script on restart: StartMMS
Backend (Web & Signaling Services) VMSS
- Script: startVMSS.ps1
- Executable: Node.exe
- Executable: <PixelStreamingApp>.exe
- Scheduled Task calling the script on restart: StartVMSS

Redeploying Updates

The easiest way to redeploy during the solution would be to do the following for each piece:

Signaling Servers / 3D app
- Scale down the VMSS to 0 in each region, wait for it to finish, then scale back up to the desired count again (pulls down fresh code from Git), restart the Matchmaker VMs in each region, then restarting the VMSS nodes in each region to give a full reset and redeployment.
Matchmakers
- Without having to redeploy everything from Terraform, the easiest way is to manually log into each of the deployed MM VM's and copy the code changes over from git – unless there needs to be any terraform transformations and, in that case, you will need to redeploy it fully from Terraform. If it's just a JavaScript change, replacing the matchmaker.js file is sufficient. Refer to the Matchmaker section above for the folder location.
Full redeployment
- Use the following steps to redeploy the solution from Terraform:
  - Delete the terraform.tfstate file in the iac\ folder.
  - Do a Git sync and grab the latest updates.
  - In the PowerShell console use these 3 commands (Get a PAT from GitHub):

terraform init
terraform validate
terraform apply -var 'git-pat=PUT_GIT_PAT_HERE' --auto-approve

Shutting Down and Restarting Later

If we need to shut down the solution and start it up later, see below for the process. This is just shutting down the compute for the Matchmaker and the Signaling Servers, which are the costlier resources (especially the SS GPU VMs) vs. deleting all the resources and requiring a time-consuming redeployment.

Shutting down the core compute

Matchmakers
- Go to each regional Resource Group in the Azure Portal (i.e., *-eastus-unreal-rg) and click on the matchmaker VMs (e.g., *-mm-vm0). Once entered the Overview page of the VM choose the Stop button at the top to turn off the VM and not be charged for any further compute. Do this for all regions that are deployed.

Signaling Servers / 3D app
- Go to each regional Resource Group in the Azure Portal (i.e., *-eastus-unreal-rg) and click on the Virtual Machine Scale Set (VMSS) resources (e.g., *vmss). Once entered the Overview page of the VMSS choose the Stop button at the top to turn off the VMSS instances and not be charged for any further compute. Do this for all regions that are deployed. See the Redeploying Updates section to see how to scale down to a specific number of VMSS nodes versus turning them all off.

Starting back up the core compute

Matchmakers
- Go to each regional Resource Group in the Azure Portal (i.e., *-eastus-unreal-rg) and click on the matchmaker VMs (e.g., *-mm-vm0). Once entered the Overview page of the VM choose the Start button at the top to turn on the VM. Do this for all regions that are deployed first before turning on the VMSS nodes so they can connect to the MM cleanly. The MM will come back on and start the MM service from the ScheduledTask StartMMS setup on Windows.

Signaling Servers / 3D app
- Go to each regional Resource Group in the Azure Portal (i.e., *-eastus-unreal-rg) and click on the Virtual Machine Scale Set (VMSS) resources (e.g., *vmss). Once entered the Overview page of the VMSS choose the Start button at the top to turn back on the VMSS instances. Do this for all regions that are deployed. Each instance will come back on and start the SS and 3D automatically from the ScheduledTask StartVMSS setup on Windows.

Monitoring

Currently automated Azure dashboards aren't built when deploying the solution; however, outside of regular host metrics like CPU/Memory, some key metrics will be important to monitor in Azure Monitor/Application Insights such as:

SSPlayerConnected – The most key metric to know when a user connected (use Count)
SSPlayerDisconnected – When a user disconnects from the Signaling Server (use Count)
AvailableConnections – The amount of available Signaling Servers not being used (use Avg)
TotalConnectedClients – Amount of Signaling Servers connected to the Matchmaker (use Avg)
TotalInstances – The total number of VMSS instances
PercentUtilized – The percentage of Signaling Servers (streams) in use (use Avg)
MatchmakerErrors – The number of Matchmaker (use Count)

View a tutorial on creating a dashboard in Azure Monitor here.

Supporting the Solution

In supporting the deployed solution, it is recommended to do a few key things:

Monitor the dashboards created using the recommended metrics and make sure that there are not unusually high or low connections exists, to validate if usage is spiking/pegging or unusual traffic is happening like too low (something is wrong?) or too high (a potential bot?).
Looks for an unusual number of errors being logged, and potentially ignore any syntax error exceptions that say "Unexpected token <token> in Json" as that appears to be hackers trying to send garbage to the Matchmaker.
If anything gets broken or out of whack you can follow the guidance in the Updates section above which is to restart the Matchmakers, then restart the VMSS nodes for each region. That should force everything to come back online fresh again. If anything is corrupted, scaling down the VMSS to 0, scaling back up to the desired count, restarting the MM's then restarting the VMSS nodes will do a full reset and redeploy.

Terraform

Below are the key files in the Terraform setup to understand when altering the code and tweaking the parameters.

Folder Structure

\iac is the root of all infrastructure for the solution.
- Outermain.tf is the primary TF File. This file sets base variables like the 5 character prefix on all assets, takes the optional GitHub PAT for private repos and deploys the Global Resource Group, and "Stamps" which are each of the regional deployments
\iac\region is the folder with the files to deploy a region
- Main.tf is the TF file that handles deployment of all assets in each regional deployment
- Variables.tf is the TF file that has parameters for each of the regional deployments

See these links to learn more:

Share via