Apache open-source scenarios on Azure

Microsoft is proud to support open-source projects, initiatives, and foundations and contribute to thousands of open-source communities. By using open-source technologies on Azure, you can run applications your way while optimizing your investments.

This article provides a summary of architectures and solutions that use Azure together with Apache open-source solutions.

Apache®, Apache Ignite, Ignite, and the flame logo are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries. No endorsement by The Apache Software Foundation is implied by the use of these marks.

Apache Cassandra

Architecture | Summary | Technology focus
Data partitioning guidance | View guidance for separating data partitions so they can be managed and accessed separately. Understand horizontal, vertical, and functional partitioning strategies. Cassandra is ideally suited to vertical partitioning. | Databases
High availability in Azure public MEC | Learn how to deploy workloads in active-standby mode to achieve high availability and disaster recovery in Azure public multi-access edge compute. Cassandra can be used to support geo-replication. | Hybrid
N-tier application with Apache Cassandra | Deploy Linux virtual machines and a virtual network configured for an N-tier architecture with Apache Cassandra. | Databases
Non-relational data and NoSQL | Learn about non-relational databases that store data as key-value pairs, graphs, time series, objects, and other storage models, based on data requirements. Azure Cosmos DB for Apache Cassandra is a recommended Azure service. | Databases
Run Apache Cassandra on Azure VMs | Examine performance considerations for running Apache Cassandra on Azure virtual machines. Use these recommendations as a baseline to test against your workload. | Databases
Stream processing with fully managed open-source data engines | Stream events by using fully managed Azure data services. Use open-source technologies like Kafka, Kubernetes, Cassandra, PostgreSQL, and Redis components. | Analytics

Apache CouchDB

Architecture | Summary | Technology focus
Baseline web application with zone redundancy | Use the proven practices in this reference architecture to improve redundancy, scalability, and performance in an Azure App Service web application. CouchDB is a recommended document database. | Web

Apache Hadoop

Architecture | Summary | Technology focus
Big data architectures | Learn about big data architectures that handle the ingestion, processing, and analysis of data that's too large or complex for traditional database systems. Azure HDInsight Hadoop clusters can be used for batch processing. | Databases
Choose a data transfer technology | Learn about Azure data transfer options like Azure Import/Export service, Azure Data Box, Azure Data Factory, and command-line and graphical interface tools. The Hadoop ecosystem provides tools for data transfer. | Databases
Citizen AI with Power Platform | Learn how to use Azure Machine Learning and Power Platform to quickly create a machine learning proof of concept and production version. Azure Data Lake, a Hadoop-compatible file system, stores data. | AI
Data considerations for microservices | Learn about managing data in a microservices architecture. View an example that uses Azure Data Lake Store, a Hadoop file system. | Microservices
Extract, transform, and load | Learn about extract-transform-load (ETL) and extract-load-transform (ELT) data transformation pipelines and how to use control flows and data flows. Hadoop can be used as a destination data store in ELT processes. | Analytics
IoT analyze-and-optimize loops | Learn about analyze-and-optimize loops, an IoT pattern for generating and applying optimization insights based on an entire business context. Hadoop MapReduce processing can be used to process big data. | IoT
Materialized View pattern | Generate prepopulated views over the data in one or more data stores when the data isn't ideally formatted for your required query operations. Use Hadoop for a big data storage mechanism that supports indexing. | Databases
Predict loan charge-offs with HDInsight Spark | Use HDInsight and machine learning to predict the likelihood of loans getting charged off. HDInsight supports Hadoop. | Databases

Apache HBase

Architecture | Summary | Technology focus
Big data architectures | Learn about big data architectures that handle the ingestion, processing, and analysis of data that's too large or complex for traditional database systems. You can use HBase for data presentation in these scenarios. | Databases
Choose a big data storage technology | Compare big data storage technology options in Azure. Includes a discussion of HBase on HDInsight. | Databases
Choose an analytical data store | Learn about using HBase for random access and strong consistency for large amounts of unstructured and semi-structured data. | Analytics
Data partitioning guidance | View guidance for separating data partitions so they can be managed and accessed separately. Understand horizontal, vertical, and functional partitioning strategies. HBase is ideally suited to vertical partitioning. | Databases
Non-relational data and NoSQL | Learn about non-relational databases that store data as key-value pairs, graphs, time series, objects, and other storage models, based on data requirements. HBase can be used for columnar and time series data. | Databases

Apache Hive

Architecture | Summary | Technology focus
Big data architectures | Learn about big data architectures that handle the ingestion, processing, and analysis of data that's too large or complex for traditional database systems. You can use Hive for batch processing and data presentation in these scenarios. | Databases
Choose a batch processing technology | Compare technology choices for big data batch processing in Azure. Learn about the capabilities of Hive. | Analytics
Choose an analytical data store | Evaluate analytical data store options for big data in Azure. Learn about the capabilities of Hive. | Analytics
Extract, transform, and load | Learn about ETL and ELT data transformation pipelines and how to use control flows and data flows. In ELT, you can use Hive to query source data. You can also use it together with Hadoop as a data store. | Databases
Loan charge-off prediction with HDInsight Spark clusters | Use HDInsight and machine learning to predict the likelihood of loans getting charged off. Analytics results are stored in Hive tables. | Analytics
Predictive aircraft engine monitoring | Learn how to combine real-time aircraft data with analytics to create a solution for predictive aircraft engine monitoring and health. Hive scripts provide aggregations on raw events that are archived by Azure Stream Analytics. | Analytics
Predictive insights with vehicle telematics | Learn how car dealerships, manufacturers, and insurance companies can use Azure to get predictive insights on vehicle health and driving habits. In this solution, Azure Data Factory uses HDInsight to run Hive queries to process and load data. | Analytics
Scale AI and machine learning initiatives in regulated industries | Learn about scaling Azure AI and machine learning environments that must comply with extensive security policies. Hive is used to store metadata. | AI

Apache JMeter

Architecture | Summary | Technology focus
Banking system cloud transformation on Azure | Use simulated and actual applications and existing workloads to monitor the reaction of a solution infrastructure for scalability and performance. A custom JMeter solution is used for load testing. | Migration
Patterns and implementations for a banking cloud transformation | Learn about the patterns and implementations used to transform a banking system for the cloud. JMeter is used for load testing. | Migration
Scalable cloud applications and SRE | Build scalable cloud applications by using performance modeling and other principles and practices of site reliability engineering (SRE). JMeter is used for load testing. | Web

Apache Kafka

Architecture | Summary | Technology focus
Application data protection for AKS workloads on Azure NetApp Files | Deploy Astra Control Service with Azure NetApp Files for data protection, disaster recovery, and mobility for Azure Kubernetes Service (AKS) applications, including Kafka applications. | Containers
Asynchronous messaging options | Learn about asynchronous messaging options in Azure, including support for Kafka clients. | Integration
Automated guided vehicles fleet control | Learn about an end-to-end approach for an automotive original equipment manufacturer (OEM). Includes several open-source libraries that you can reuse. Back-end services in this architecture can connect to Kafka. | Web
Azure Data Explorer monitoring | Use Azure Data Explorer in a hybrid monitoring solution that ingests streamed and batched logs from Kafka and other sources. | Analytics
Banking system cloud transformation on Azure | Use simulated and actual applications and existing workloads to monitor the reaction of a solution infrastructure for scalability and performance. Events from Event Hubs for Kafka feed into the system. | Containers
Choose a stream processing technology | Compare options for real-time message stream processing in Azure, including the Kafka Streams API. | Analytics
Claim-Check pattern | Examine the Claim-Check pattern, which splits a large message into a claim check and a payload to avoid overwhelming a message bus. Learn about an example that uses Kafka for claim-check generation. | Integration
Data streaming with AKS | Use AKS to easily ingest and process a real-time data stream with millions of data points collected via sensors. Kafka stores data for analysis. | Containers
Ingestion, ETL, and stream processing pipelines with Azure Databricks | Create ETL pipelines for batch and streaming data with Azure Databricks to simplify data lake ingestion at any scale. Kafka is one option for ingesting data. | Analytics
Integrate Event Hubs with Azure Functions | Learn how to architect, develop, and deploy efficient and scalable code that runs on Azure Functions and responds to Azure Event Hubs events. Learn how events can be persisted in Kafka topics. | Serverless
IoT analytics with Azure Data Explorer | Use Azure Data Explorer for near real-time IoT telemetry analytics on fast-flowing, high-volume streaming data from a variety of data sources, including Kafka. | Analytics
Mainframe and midrange data replication to Azure using Qlik | Use Qlik Replicate to migrate mainframe and midrange systems to the cloud, or to extend such systems with cloud applications. In this solution, Kafka stores change log information that's used to replicate the data stores. | Mainframe
Patterns and implementations for a banking cloud transformation | Learn about the patterns and implementations used to transform a banking system for the cloud. A Kafka scaler is used to detect whether the solution needs to activate or deactivate application deployment. | Serverless
Publisher-Subscriber pattern | Learn about the Publisher-Subscriber pattern, which enables an application to announce events to many interested consumers asynchronously. Kafka is recommended for messaging. | Integration
Rate Limiting pattern | Use a rate limiting pattern to avoid or minimize throttling errors. This pattern can use Kafka for messaging. | Integration
Refactor mainframe applications with Advanced | Learn how to use the automated COBOL refactoring solution from Advanced to modernize your mainframe COBOL applications, run them on Azure, and reduce costs. Kafka can be used as a data source. | Mainframe
Stream processing with fully managed open-source data engines | Stream events by using fully managed Azure data services. Use open-source technologies like Kafka, Kubernetes, Cassandra, PostgreSQL, and Redis components. | Analytics
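
The Claim-Check pattern listed above can be illustrated with a minimal sketch. It is not taken from the linked architecture: the `payload_store` dictionary and `message_bus` list are hypothetical in-memory stand-ins for an external data store and a message broker such as Kafka, and the function names are assumptions made for illustration.

```python
import uuid

# Hypothetical in-memory stand-ins (assumptions, not a real broker or blob store).
payload_store = {}   # plays the role of an external data store
message_bus = []     # plays the role of a message bus such as a Kafka topic


def send_with_claim_check(payload: bytes) -> str:
    """Store the large payload externally and publish only a small claim check."""
    claim_id = str(uuid.uuid4())
    payload_store[claim_id] = payload
    # Only the small reference travels over the bus, never the payload itself.
    message_bus.append({"claim_id": claim_id, "size": len(payload)})
    return claim_id


def receive_next() -> bytes:
    """Take the next claim check off the bus and redeem it for the payload."""
    claim_check = message_bus.pop(0)
    return payload_store.pop(claim_check["claim_id"])


# Usage: a 1 MB payload is exchanged through a message that holds only an ID and a size.
send_with_claim_check(b"x" * 1_000_000)
restored = receive_next()
print(len(restored))
```

In a real deployment, the store and the bus would be separate services, and the consumer would need read access to the store in addition to the bus.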

Apache MapReduce

Architecture | Summary | Technology focus
Asynchronous messaging options | Learn about asynchronous messaging options in Azure. You can use MapReduce to generate reports on events captured by Event Hubs. | Integration
Big data architectures | Learn about big data architectures that handle the ingestion, processing, and analysis of data that's too large or complex for traditional database systems. You can use MapReduce for batch processing and to provide functionality for parallel operations in these scenarios. | Databases
Choose a batch processing technology | Learn about technologies for big data batch processing in Azure, including HDInsight with MapReduce. | Analytics
Geode pattern | Deploy back-end services into a set of geographical nodes, each of which can service any client request in any region. This pattern occurs in big data architectures that use MapReduce to consolidate results across machines. | Databases
Minimize coordination | Follow these recommendations to improve scalability by minimizing coordination between application services. Use MapReduce to split work into independent tasks. | Databases
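
As a rough illustration of how MapReduce splits work into independent tasks and then consolidates the results, the following sketch counts words across text chunks. It runs in a single process with a thread pool rather than on a Hadoop cluster, and the function names are assumptions chosen for this example.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_chunk(chunk):
    """Map step: emit (word, 1) pairs for one chunk, independently of other chunks."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(mapped):
    """Shuffle step: group all emitted values by key."""
    groups = defaultdict(list)
    for pairs in mapped:
        for key, value in pairs:
            groups[key].append(value)
    return groups


def reduce_groups(groups):
    """Reduce step: consolidate the grouped values into one result per key."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(chunks):
    # Each chunk is mapped without coordinating with the others.
    with ThreadPoolExecutor() as pool:
        mapped = list(pool.map(map_chunk, chunks))
    return reduce_groups(shuffle(mapped))


print(word_count(["Azure runs Hadoop", "Hadoop runs MapReduce"]))
```

Because the map tasks share no state, they can be distributed across machines; only the shuffle and reduce steps bring results together.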

Apache NiFi

Architecture | Summary | Technology focus
Apache NiFi on Azure | Automate data flows with Apache NiFi on Azure. Use a scalable, highly available solution to move data into the cloud or storage and between cloud systems. | Analytics
Helm-based deployments for Apache NiFi | Use Helm charts when you deploy NiFi on AKS. Helm streamlines the process of installing and managing Kubernetes applications. | Analytics
Azure Data Explorer monitoring | Use Azure Data Explorer and NiFi in a hybrid monitoring solution that ingests streamed and batched logs from diverse sources. | Analytics

Apache Oozie

Architecture | Summary | Technology focus
Big data architectures | Learn about big data architectures that handle the ingestion, processing, and analysis of data that's too large or complex for traditional database systems. You can use Oozie for orchestration in these scenarios. | Databases
Choose a data pipeline orchestration technology | Learn about the key orchestration capabilities of Oozie. | Databases

Apache Solr

Architecture | Summary | Technology focus
Choose a search data store | Learn about the capabilities of search data stores in Azure and the key criteria for choosing one that best matches your needs. Learn about the key capabilities of HDInsight with Solr. | Databases

Apache Spark

Architecture | Summary | Technology focus
Analytics end-to-end with Azure Synapse | Learn how to use Azure Data Services to build a modern analytics platform capable of handling the most common data challenges. The Spark Pools analytics engine is available from Azure Synapse workspaces. | Analytics
Batch scoring of Spark on Azure Databricks | Build a scalable solution for batch scoring an Apache Spark classification model. | AI
Big data architectures | Learn about big data architectures that handle the ingestion, processing, and analysis of data that's too large or complex for traditional database systems. You can use Spark for batch or stream processing and as an analytical data store. | Databases
Choose a batch processing technology | Compare technology choices for big data batch processing in Azure, including options for implementing Spark. | Analytics
Choose a stream processing technology | Compare options for real-time message stream processing in Azure, including options for implementing Spark. | Analytics
Choose an analytical data store | Evaluate analytical data store options for big data in Azure. Learn about the capabilities of Azure Synapse Spark pools. | Analytics
Data science and machine learning with Azure Databricks | Improve operations by using Azure Databricks, Delta Lake, and MLflow for data science and machine learning. Develop, train, and deploy machine learning models. Azure Databricks provides managed Spark clusters. | AI
Extract, transform, and load | Learn about extract-transform-load (ETL) and extract-load-transform (ELT) data transformation pipelines and how to use control flows and data flows. In ELT, you can use Spark to query source data. You can also use it together with Hadoop as a data store. | Databases
IoT using Azure Cosmos DB | Learn how to use Azure Cosmos DB to accommodate diverse and unpredictable IoT workloads without sacrificing ingestion or query performance. Azure Databricks, running Spark Streaming, processes event data from devices. | IoT
Loan charge-off predictions with HDInsight Spark | Use HDInsight and machine learning to predict the likelihood of loans getting charged off. | Databases
Many models machine learning with Spark | Learn about many models machine learning in Azure. | AI
Microsoft machine learning products | Compare options for building, deploying, and managing your machine learning models, including the Azure Databricks Spark-based analytics platform and SynapseML. | AI
Modern data warehouse for small and medium businesses | Use Azure Synapse, Azure SQL Database, and Azure Data Lake Storage to modernize SMB legacy and on-premises data. Tools in the Azure Synapse workspace can use Spark compute capabilities to process data. | Analytics
Natural language processing technology | Choose a natural language processing service for sentiment analysis, topic and language detection, key phrase extraction, and document categorization. Learn about the key capabilities of Azure HDInsight with Spark. | AI
Observability patterns and metrics | Learn how to use observability patterns and metrics to improve the processing performance of a big data system by using Azure Databricks. The Azure Databricks monitoring library streams Spark events and Spark Structured Streaming metrics from jobs. | Databases
Stream processing with fully managed open-source data engines | Stream events by using fully managed Azure data services. Use open-source technologies like Spark, Kafka, Kubernetes, Cassandra, PostgreSQL, and Redis components. | Analytics

Apache Sqoop

Architecture | Summary | Technology focus
Big data architectures | Learn about big data architectures that handle the ingestion, processing, and analysis of data that's too large or complex for traditional database systems. In these scenarios, you can use Sqoop to automate orchestration workflows. | Databases
Choose a data transfer technology | Learn about data transfer options like Azure Import/Export, Data Box, and Sqoop. | Databases

Apache ZooKeeper

Architecture | Summary | Technology focus
Apache NiFi on Azure | Automate data flows with NiFi on Azure. Use a scalable, highly available solution to move data into the cloud or storage and between cloud systems. In this solution, NiFi uses ZooKeeper to coordinate the flow of data. | Analytics
Helm-based deployments for Apache NiFi | Use Helm charts when you deploy NiFi on AKS. Helm streamlines the process of installing and managing Kubernetes applications. In this architecture, ZooKeeper provides cluster coordination. | Analytics
Rate Limiting pattern | Use a rate limiting pattern to avoid or minimize throttling errors. In this scenario, you can use ZooKeeper to create a system that grants temporary leases to capacity. | Integration
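
The idea of granting temporary leases to capacity, mentioned for the Rate Limiting pattern, can be sketched as follows. This is an in-memory approximation under stated assumptions: the `LeaseManager` class and its methods are hypothetical, and a production system would keep the lease state in a coordination service such as ZooKeeper rather than in a local dictionary.

```python
import time


class LeaseManager:
    """Hypothetical in-memory lease table; stands in for a coordination service."""

    def __init__(self, total_capacity, lease_seconds, clock=time.monotonic):
        self.total_capacity = total_capacity  # for example, requests per second
        self.lease_seconds = lease_seconds    # how long a granted lease stays valid
        self.clock = clock                    # injectable so that expiry can be tested
        self.leases = {}                      # holder -> (units, expires_at)

    def _expire(self):
        """Drop leases whose expiry time has passed, which frees their capacity."""
        now = self.clock()
        self.leases = {h: (u, e) for h, (u, e) in self.leases.items() if e > now}

    def available(self):
        """Return the capacity that is not covered by an active lease."""
        self._expire()
        return self.total_capacity - sum(u for u, _ in self.leases.values())

    def acquire(self, holder, units):
        """Grant or renew a lease if enough capacity remains; return True on success."""
        self._expire()
        held_by_others = sum(u for h, (u, _) in self.leases.items() if h != holder)
        if held_by_others + units > self.total_capacity:
            return False
        self.leases[holder] = (units, self.clock() + self.lease_seconds)
        return True


# Usage: two workers share 100 units of capacity through 10-second leases.
manager = LeaseManager(total_capacity=100, lease_seconds=10)
print(manager.acquire("worker-a", 60), manager.acquire("worker-b", 50))
```

Because leases expire on their own, capacity held by a worker that stops responding returns to the pool without explicit cleanup.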