Build a sports analytics architecture on Azure

Data Factory
Data Lake Storage Gen2
Databricks
Event Hubs
Power BI

This article shows a practical architecture that uses Azure services to process and maintain the data used by sports analytics solutions. It gives sports organizations a framework for building highly scalable solutions, with the flexibility to add services that meet the specific requirements of their use cases.

Apache®, Apache Spark®, and the flame logo are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries. No endorsement by The Apache Software Foundation is implied by the use of these marks.

Architecture

Diagram that shows an example workload Azure architecture for sports analytics.

Download a Visio file of this architecture.

Dataflow

  1. Data is ingested from source systems by using one of the following methods:

    • Azure Data Factory ingests raw data from several data sources and stores it in Azure Data Lake Storage for downstream processing.
    • Some raw data sources, like spatial on-court or on-field data, are large, and you might not need to store their raw data in Data Lake Storage first. In these cases, you can use Azure Databricks to ingest the source data and transform it immediately so that it's cleansed, normalized, and saved to Data Lake Storage in an easy-to-digest format.
    • Data that's generated by sensors in real time is ingested as messages by Azure Event Hubs.
  2. Azure Databricks transforms raw data so that it's cleansed of any errors and normalized. With the cloudFiles feature of Azure Databricks Auto Loader, raw files are automatically processed as they land in Data Lake Storage. The transformed data moves back into Data Lake Storage for further curating.

  3. Azure Databricks applies business logic to the transformed data. Stream data is also combined with the transformed data during this process.

  4. Azure Databricks processes stream data from Azure Event Hubs and combines it with static data.

  5. The final processed data is written to Data Lake Storage in Delta format.

  6. Transformed data that's used in the visualization layer, like Power BI, is written to an Azure SQL Database. This database becomes the data source for any reporting needs.

  7. Curated data is visualized and manipulated through Power BI, Power Apps, or a custom web application that's hosted by an Azure App Service.

  8. Azure Machine Learning builds and trains machine learning models by using data imported into Azure Machine Learning Datasets and external sources. The datasets and sources are directly linked to the Azure Machine Learning Workspace. You can control access and authentication for data and the Machine Learning workspace with Azure Active Directory (Azure AD) and Azure Key Vault. Models can also be retrained as necessary in Machine Learning.

  9. As an alternative to storing model results in Data Lake Storage or SQL Database, you can deploy Machine Learning models to containers on Azure Kubernetes Service (AKS) as a web service that's called through a REST API endpoint. The web application deploys by using Azure App Service. From within the web application, you can then send data to the REST API endpoint and receive the prediction that the model returns.
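Step 2 of the dataflow can be sketched in PySpark as follows. This is a minimal illustration, not the reference implementation: the function names, the choice of JSON as the raw file format, the reuse of the checkpoint path as the schema location, and the single cleansing rule are all assumptions made for the example. The `spark` argument is the SparkSession that a Databricks notebook or job provides, so the streaming function only runs on a cluster where Auto Loader is available.

```python
def auto_loader_options(schema_location: str, file_format: str = "json") -> dict:
    """Build the cloudFiles options that Auto Loader uses to discover new files."""
    return {
        # Format of the raw files that land in Data Lake Storage (assumed JSON).
        "cloudFiles.format": file_format,
        # Where Auto Loader stores the schema it infers and tracks over time.
        "cloudFiles.schemaLocation": schema_location,
    }


def start_raw_to_transformed(spark, raw_path: str, transformed_path: str,
                             checkpoint_path: str):
    """Start a stream that reads new raw files and appends them in Delta format.

    Hypothetical helper: it is defined here but only works on Azure Databricks.
    """
    raw = (
        spark.readStream.format("cloudFiles")
        .options(**auto_loader_options(schema_location=checkpoint_path))
        .load(raw_path)
    )

    # Placeholder cleansing rule: drop rows in which every column is null.
    # Real cleansing and normalization logic depends on the source system.
    cleansed = raw.dropna(how="all")

    return (
        cleansed.writeStream.format("delta")
        # The checkpoint records which files were already processed.
        .option("checkpointLocation", checkpoint_path)
        # Process everything that is currently available, then stop.
        .trigger(availableNow=True)
        .start(transformed_path)
    )
```

In a notebook, you might call `start_raw_to_transformed(spark, raw_path, transformed_path, checkpoint_path)` with `abfss://` paths that point to your own storage account. Running the job on a schedule with `availableNow=True` is one way to get incremental batch behavior. Omit the trigger if you want the stream to run continuously.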

Throughout the process:

  • Azure Monitor collects information on events and performance.
  • Key Vault secures passwords, connection strings, and secrets.
  • Azure DevOps manages code repositories and deployment pipelines.
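To illustrate step 9 of the dataflow, the following sketch shows how a web application might call a model's REST endpoint by using only the Python standard library. The endpoint URL, the `{"data": [...]}` payload shape, and the key-based `Bearer` header are assumptions. The actual request format depends on the scoring script and the authentication mode that you configure for the deployed model.

```python
import json
import urllib.request


def build_scoring_request(endpoint_url: str, api_key: str, rows: list) -> urllib.request.Request:
    """Create a POST request that carries feature rows as a JSON body."""
    body = json.dumps({"data": rows}).encode("utf-8")  # assumed payload shape
    headers = {
        "Content-Type": "application/json",
        # Assumes key-based authentication is enabled on the endpoint.
        "Authorization": f"Bearer {api_key}",
    }
    return urllib.request.Request(endpoint_url, data=body, headers=headers, method="POST")


def score(endpoint_url: str, api_key: str, rows: list, timeout: float = 10.0):
    """Send the rows to the endpoint and return the decoded JSON prediction."""
    request = build_scoring_request(endpoint_url, api_key, rows)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Keeping request construction separate from the network call makes the payload easy to unit test. In line with the rest of this architecture, the application would read the API key from Key Vault at runtime rather than store it in code.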

Components

  • Azure Data Lake Storage is a scalable and secure data lake for high-performance analytics workloads. You can use Data Lake Storage to manage petabytes of data with high throughput. It can accommodate multiple, heterogeneous sources and data that's in structured, semi-structured, or unstructured formats.
  • Azure Databricks is a data analytics platform that uses Spark clusters. The clusters are optimized for the Azure platform.
  • Azure Data Factory is a fully managed, scalable, and serverless data integration service. It provides a data integration and transformation layer that works with various data stores.
  • Azure Machine Learning is a cloud service for accelerating and managing the machine learning project lifecycle. Machine learning professionals, data scientists, and engineers can use it in their day-to-day workflows to train and deploy models and to manage MLOps. You can create a model in Machine Learning or use a model built from an open-source platform like PyTorch, TensorFlow, or scikit-learn. MLOps tools help you monitor, retrain, and redeploy models.
  • Azure Event Hubs is a big-data streaming platform and event ingestion service. It can receive and process millions of events per second. Data sent to an event hub can be transformed and stored by using any real-time analytics provider or batching/storage adapters.
  • Azure SQL Database is a fully managed platform as a service (PaaS) database engine that handles most database management functions, like upgrading, patching, backups, and monitoring, without user involvement. SQL Database always runs on the latest stable version of the SQL Server database engine and on a patched OS, with high availability.
  • Power BI is a collection of software services, apps, and connectors that work together to turn your unrelated sources of data into coherent, visually immersive, and interactive insights.
  • Power Apps is a suite of apps, services, connectors, and data platform that provides a rapid development environment to build custom apps for your business needs. By using Power Apps, you can quickly build custom business apps that connect to your data stored either in the underlying data platform or in various online and on-premises data sources like SharePoint, Microsoft 365, and Dynamics 365.
  • Azure App Service is an HTTP-based service for hosting web applications, REST APIs, and mobile back ends. You can use it with your favorite languages, like .NET, .NET Core, and Java.
  • Microsoft Defender for Cloud is a tool for security posture management and threat protection.
  • Azure Cost Management and Billing helps you understand your Azure invoice, manage your billing account and subscriptions, monitor and control Azure spending, and optimize resource use.
  • Azure Monitor delivers a comprehensive solution for collecting, analyzing, and acting on telemetry from your cloud and on-premises environments.
  • Azure Key Vault is a cloud service for securely storing and accessing secrets.
  • Azure Active Directory is an identity service that provides single sign-on, multifactor authentication, and conditional access to guard against most cybersecurity attacks.
  • Azure DevOps provides developer services so that teams can plan work, collaborate on code development, and build and deploy applications. Azure DevOps supports a collaborative culture and set of processes that bring together developers, project managers, and contributors to develop software.
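As a sketch of how a sensor gateway might publish readings to Event Hubs, the following example serializes a reading to JSON and sends a batch by using the `azure-eventhub` Python SDK. The field names (`deviceId`, `playerId`, `recordedAt`, `metrics`) and both function names are hypothetical. The architecture doesn't prescribe a message schema.

```python
import json
from datetime import datetime, timezone


def sensor_reading_to_message(device_id: str, player_id: str, metrics: dict,
                              recorded_at: datetime = None) -> str:
    """Serialize one sensor reading as a compact JSON string (assumed schema)."""
    recorded_at = recorded_at or datetime.now(timezone.utc)
    payload = {
        "deviceId": device_id,
        "playerId": player_id,
        "recordedAt": recorded_at.isoformat(),
        "metrics": metrics,
    }
    return json.dumps(payload, separators=(",", ":"))


def send_readings(connection_string: str, eventhub_name: str, messages: list) -> None:
    """Send the serialized readings to an event hub as a single batch."""
    # Imported here so that the serialization helper works without the SDK installed.
    from azure.eventhub import EventData, EventHubProducerClient

    producer = EventHubProducerClient.from_connection_string(
        connection_string, eventhub_name=eventhub_name
    )
    with producer:
        batch = producer.create_batch()
        for message in messages:
            # add() raises ValueError when the batch is full. A production
            # sender would send the current batch and start a new one.
            batch.add(EventData(message))
        producer.send_batch(batch)
```

The connection string is a secret, so in this architecture it would be retrieved from Key Vault instead of being hard-coded.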

Alternatives

  • You can use Synapse Spark Pools instead of Azure Databricks for sports analytics by using the same open-source frameworks.
  • Instead of Azure SQL Database, you can use Azure SQL Managed Instance to store data that's served to the visualize/interact layer.
  • You can use an Azure Synapse Analytics dedicated SQL pool instead of an Azure SQL Database if the reporting requirements require several terabytes of data stored in the serving layer.
  • If you don't want to use a database as the serving layer for reporting, you can choose to use a semantic lakehouse approach. In this scenario, reporting applications connect to logical tables that are defined by a service like Databricks SQL. These logical tables are used to structure data that's stored in the gold layer (data that's formatted using the Delta format) of Azure Data Lake Storage Gen2, so that the data can be easily read.
  • Instead of Azure Databricks, you can use SQL Database or SQL Managed Instance to query and process data. These databases provide the familiar T-SQL language, which you can use for analysis.
  • You can use Azure Stream Analytics instead of Azure Databricks to process stream data.
  • You can use Azure Machine Learning instead of Azure Databricks to train your machine learning models.
  • You can use GitHub instead of Azure DevOps to manage your code repositories and continuous integration and continuous delivery (CI/CD) pipelines.

Scenario details

Sports analytics is a field that applies data analytics techniques to team or individual performance data. Teams can then use the resulting insights to create a competitive advantage over an opponent. Beyond traditional box score statistics, there has been an explosion of data in recent years that sports teams can use to improve the performance of an individual athlete or an entire team. Examples of such data include player data collected from sensors and spatial data that captures player movement during a game. Traditional systems struggle to process and maintain these data sources because of the large volumes of data that they generate. These data sources also format data in several different ways and deliver it at different speeds, which creates more challenges for traditional data processing solutions.

Potential use cases

This solution is ideal for the sports industry, and applies to the following scenarios:

  • Manage large volumes of data from several source systems in a centralized ecosystem.
  • Analyze player tracking and temporal data to gain insights into individual and team performance.
  • With consideration for spatial metrics, determine the best possible player positioning and strategies during gameplay.
  • Process and evaluate player performance data to optimize athlete training routines.
  • Analyze historical data to make well-informed personnel decisions during the draft or free agency.
  • Store and analyze real-time telemetry from Internet of Things (IoT) devices that are attached to equipment like bats, shoulder pads, and volleyballs.
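As a small illustration of the player tracking use case, the following sketch derives the distance covered and the average speed from timestamped positions. The `TrackingSample` structure and its units (seconds and meters) are assumptions for the example. Real tracking feeds differ by vendor and sport, and at scale this kind of calculation would typically run in Azure Databricks instead of plain Python.

```python
import math
from typing import NamedTuple, Sequence


class TrackingSample(NamedTuple):
    """One position reading for a player (hypothetical schema)."""
    t: float  # seconds since the start of play
    x: float  # meters along the length of the field or court
    y: float  # meters along the width of the field or court


def distance_covered(samples: Sequence[TrackingSample]) -> float:
    """Sum the straight-line distances between consecutive samples."""
    return sum(
        math.hypot(b.x - a.x, b.y - a.y)
        for a, b in zip(samples, samples[1:])
    )


def average_speed(samples: Sequence[TrackingSample]) -> float:
    """Return meters per second over the sampled interval, or 0.0 if undefined."""
    if len(samples) < 2:
        return 0.0
    elapsed = samples[-1].t - samples[0].t
    return distance_covered(samples) / elapsed if elapsed > 0 else 0.0
```

The functions assume that samples are ordered by time. Summing straight-line segments underestimates the true path length when the sampling rate is low, so treat the result as a lower bound.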

Considerations

These considerations implement the pillars of the Azure Well-Architected Framework, which is a set of guiding tenets that you can use to improve the quality of a workload. For more information, see Microsoft Azure Well-Architected Framework.

Follow MLOps guidelines to standardize and manage end-to-end Machine Learning lifecycles that are scalable across multiple workspaces. Before going into production, ensure the implemented solution supports ongoing inference with retraining cycles and automated redeployments of models.

Use the Azure/mlops-v2 GitHub repository as an MLOps resource.

Security

Security provides assurances against deliberate attacks and the abuse of your valuable data and systems. For more information, see Overview of the security pillar.

Consider using the following security resources in this architecture:

Cost optimization

Cost optimization is about looking at ways to reduce unnecessary expenses and improve operational efficiencies. For more information, see Overview of the cost optimization pillar.

  • To estimate the cost of implementing this solution, use the Azure pricing calculator for the services mentioned above.
  • Power BI comes with different licensing offerings. For more information, see Power BI pricing.
  • Depending on the volume of data and the complexity of your geospatial analysis, you might need to scale your Databricks cluster configuration, which affects your cost. Refer to the Databricks cluster sizing examples for best practices on cluster configuration.

Performance efficiency

Performance efficiency is the ability of your workload to scale to meet the demands placed on it by users in an efficient manner. For more information, see Performance efficiency pillar overview.

If you use Azure Data Factory Mapping Data Flows for extract, transform, and load (ETL), follow the performance and tuning guide for mapping data flows. Following the guide helps you optimize your data pipeline and ensure that your data flows meet your performance benchmarks.

Deploy this scenario

To deploy this scenario, follow the steps described in this Azure quickstart, Deploy the Sports Analytics on Azure Architecture. Be sure to read the Prerequisites section in the quickstart before deploying the solution.

Contributors

This article is maintained by Microsoft. It was originally written by the following contributors.
