XBox: Analytics on petabytes of gaming data with Azure HDInsight

Cross post from https://azure.microsoft.com/en-us/blog/how-xbox-uses-hdinsight-to-drive-analytics-on-petabytes-of-telemetry-data/

Microsoft Studios produces some of the world’s most popular game titles including the Halo, Minecraft, and Forza Motorsport series. The Xbox product services team manage thousands of datasets and hundreds of active pipelines consuming hundreds of gigabytes of data each hour for first party studios. Game developers need to know the health of their game through measuring acquisition, retention, player progression, and general usage over time. This presents a textbook big data problem where data needs to be cleaned, formatted, aggregated and reported on, better known as ETL (Extract Transform Load).

HDInsight - Fully managed, full spectrum open source analytics service for enterprises

Azure HDInsight is a fully-managed cloud service for customers to do analytics at a massive scale using the most popular open-source frameworks such as Hadoop, MapReduce, Hive, LLAP, Presto, Spark, Kafka, and R. HDInsight enables a broad range of customer scenarios such as batch & ETL, data warehousing, machine learning, IoT and streaming over massive volumes of data at a high scale using Open Source Frameworks.

Key HDInsight benefits

  • Cloud native: The only service in the industry to provide an end-to-end SLA on your production workloads. Cloud optimized clusters for Hadoop, Spark, Hive, Interactive Query, HBase Storm, Kafka, and Microsoft R Server backed by a 99.9% SLA.
  • Low cost: Cost-effectively scale workloads up or down through decoupled compute and storage. You pay for only what you use. Spark and Interactive Query users can use SSD memory for interactive performance without additional SSD cost.
  • Secure: Protect your data assets by using virtual networks, encryption, authenticate with Active Directory, authorize users and groups, and role based access control policies for all your enterprise data. HDInsight meets many compliance standards such as HIPAA, PCI, and more.
  • Global: Available in more than 25 regions globally. HDInsight is also available in Azure Government cloud and China which allows you to meet our needs in key geographical areas.
  • Productive: Rich productivity tools for Hadoop and Spark such as Visual Studio, Eclipse, and IntelliJ for Scala, Python, R, Java, and .NET support. Data scientists can also use the two most popular notebooks, Jupyter and Zeppelin. HDInsight is also the only managed-cloud Hadoop solution with integration to Microsoft R Server.
  • Extensible: Seamless integration with leading certified big data applications via an integrated marketplace which provides a one-click deploy experience.

The big data problem

To handle this wide range of uses and varying scale of data, Xbox has harnessed the versatility and power of Azure HDInsight. As raw heterogeneous json data lands in Azure Blob Storage, Hive jobs transform that raw data to more performant and indexed formats such as ORC (Optimized Row Columnar). Studio users can then add additional Hive, Spark, or Azure ML jobs to the pipeline to clean, filter, and aggregate further.

Scalable HDInsight architecture with decoupled compute and underlying storage

Depending on the launch style of a game, Xbox telemetry systems can see huge spikes in data at launch. Outside of an increase in users, the type of analysis and query needed to answer different business questions can vary drastically from game to game and throughout the lifecycle, resulting in shifting the compute needed. Xbox uses the ease of creating HDInsight clusters via the Azure APIs to scale and create new clusters as analytic needs and data fluctuates while maintaining SLA.

Needing a system that scales up and out while offering a variety of isolation levels, Xbox chose to utilize an array of Azure Storage Accounts and a shared Hive metastore. Utilizing an external Azure SQL database as the Hive metastore allows the creation of many clusters while sharing the same metadata, enabling a seamless query experience across dozens of clusters. Utilizing many Azure Storage Accounts, attached with SAS keys to control permissions, allows for a greater degree of consistency and security at the cluster level. Employing this cluster of clusters method greatly increases the scale out ability. In this cluster of clusters configuration, Xbox enabled the separation of processing (ETL) clusters and read-only ad-hoc clusters. Users are able to test queries and read data from these read-only clusters without affecting other users or processing, eliminating noisy neighbor situations. Users can control the scale and how they utilize their read-only cluster while sharing the same underlying data and metadata.

 

xboxmetastore

Shared data and metastore across different analytical engines in Azure HDInsight

We adopted the following best practices while picking up an external metastore with HDInsight for high performance and agility:

  • Use an external metastore. This helped us separate compute and metadata.
  • Ensure that the metastore created for one HDInsight cluster version is not shared across different HDInsight cluster versions. This is due to different Hive versions having different schemas. For example, Hive 1.2 and Hive 2.1 clusters trying to use same metastore.
  • Back-up custom metastore periodically for OOPS recovery and DR needs.
  • Keep metastore, storage accounts, and the HDInsight cluster in same region.

Data flow

Xbox devices generate telemetry data which is consumed by Event Hub for further processing in HDInsight Cluster by running thousands of different Azure Data Factory activities, and finally making data available to users for further insights. The figure below showcases the Xbox telemetry data journey.

xboxarchitecture

Load balancing of Jobs

We use multiple clusters in our architecture to process thousands of jobs. We built our own custom logic to distribute jobs among a number of different clusters, which helped us optimize our job completion time. We typically have long running jobs and interactive high priority jobs that we need to finish. We use the Yarn capacity scheduler to load balance cluster capacity for these jobs. Typically, we set high priority queue at ~80-90% of the cluster with maximum capacity to 100% and a low priority queue with ~10-20% with maximum capacity to 100%. With this distribution, long running jobs can take max cluster capacity until a high priority interactive job shows up. Once that happens, the high priority job can take 80-90% of cluster capacity and finish faster.

Summary

Xbox telemetry processing pipeline, which is based on Azure HDInsight, can be applied to any kind of enterprise trying to solve big data processing at massive scale.

For more information, please reach out to AskHDInsight@Microsoft.com for any questions.