Azure Cognitive Services for big data

Azure Cognitive Services for big data lets users channel terabytes of data through Cognitive Services using Apache Spark™ and open source libraries for distributed machine learning workloads. With Cognitive Services for big data, it's easy to create large-scale intelligent applications with any datastore.

Using the resources and libraries described in this article, you can embed continuously improving, intelligent models directly into Apache Spark™ and SQL computations. These tools liberate developers from low-level networking details, so that they can focus on creating smart, distributed applications.

Features and benefits

Cognitive Services for big data can use resources from any supported region, as well as containerized Cognitive Services. Containers support low- or no-connectivity deployments with ultra-low latency responses. You can run containerized Cognitive Services locally, directly on the worker nodes of your Spark cluster, or on an external orchestrator like Kubernetes.

Supported services

Cognitive Services, accessed through APIs and SDKs, help developers build intelligent applications without having AI or data science skills. With Cognitive Services, you can make your applications see, hear, speak, and understand. To use Cognitive Services, your application must send data to the service over the network. Once the service receives the data, it returns an intelligent response. The following Cognitive Services resources are available for big data workloads:


Service Name | Service Description
Computer Vision | The Computer Vision service provides you with access to advanced algorithms for processing images and returning information.
Face | The Face service provides access to advanced face algorithms, enabling face attribute detection and recognition.
Speech service | The Speech service provides access to features like speech recognition, speech synthesis, speech translation, and speaker verification and identification.
Anomaly Detector | The Anomaly Detector service allows you to monitor and detect abnormalities in your time series data.
Language service | The Language service provides natural language processing over raw text for sentiment analysis, key-phrase extraction, and language detection.
Bing Image Search | The Bing Image Search service returns a display of images determined to be relevant to the user's query.
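
The request and response exchange described in this section can be sketched in plain Python. This is a minimal sketch, not the big data library itself. It assumes the Text Analytics v3.1 sentiment route and the Ocp-Apim-Subscription-Key header of the Language service REST API, and the function name, endpoint, and key are hypothetical placeholders. The request is built but deliberately not sent.

```python
import json
import urllib.request


def build_sentiment_request(endpoint, key, texts, language="en"):
    """Build (but do not send) a POST request for a sentiment endpoint.

    Assumption: the Text Analytics v3.1 route and header names; check the
    REST reference for your own resource before relying on them.
    """
    # One document per input text, with string ids starting at "1".
    documents = [
        {"id": str(i), "language": language, "text": text}
        for i, text in enumerate(texts, start=1)
    ]
    body = json.dumps({"documents": documents}).encode("utf-8")
    return urllib.request.Request(
        url=endpoint.rstrip("/") + "/text/analytics/v3.1/sentiment",
        data=body,
        headers={
            "Content-Type": "application/json",
            "Ocp-Apim-Subscription-Key": key,
        },
        method="POST",
    )


# Placeholder values; replace them with your own resource endpoint and key.
request = build_sentiment_request(
    "https://YOUR_RESOURCE.cognitiveservices.azure.com",
    "YOUR_KEY",
    ["The service was quick and helpful."],
)
# Sending is left out on purpose: urllib.request.urlopen(request) would make the call.
```

Cognitive Services for big data issues requests like this one for every row of a distributed dataset, so you don't write the networking code yourself.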

Supported programming languages for Cognitive Services for big data

Cognitive Services for big data is built on Apache Spark. Apache Spark is a distributed computing library that supports Java, Scala, Python, R, and many other languages. See SynapseML for documentation, samples, and blog posts.

The following languages are currently supported.


Python

We provide a PySpark API for both current and legacy libraries.

For more information, see the Python Developer API. For usage examples, see the Python Samples.

Scala and Java

We provide a Scala- and Java-based Spark API for both current and legacy libraries.

For more information, see the Scala Developer API. For usage examples, see the Scala Samples.

Supported platforms and connectors

Big data scenarios require Apache Spark. There are several Apache Spark platforms that support Cognitive Services for big data.

Azure Databricks

Azure Databricks is an Apache Spark-based analytics platform optimized for the Microsoft Azure cloud services platform. It provides one-click setup, streamlined workflows, and an interactive workspace that supports collaboration between data scientists, data engineers, and business analysts.

Azure Synapse Analytics

Azure Synapse Analytics is an enterprise data warehouse that uses massively parallel processing. With Synapse Analytics, you can quickly run complex queries across petabytes of data. Azure Synapse Analytics provides managed Spark pools to run Spark jobs with an intuitive Jupyter notebook interface.

Azure Kubernetes Service

Azure Kubernetes Service (AKS) orchestrates Docker containers and distributed applications at massive scale. AKS is a managed Kubernetes offering that simplifies using Kubernetes in Azure. Kubernetes can enable fine-grained control of Cognitive Services scale, latency, and networking. However, we recommend using Azure Databricks or Azure Synapse Analytics if you're unfamiliar with Apache Spark.

Data Connectors

Once you have a Spark cluster, the next step is connecting to your data. Apache Spark has a broad collection of database connectors. These connectors allow applications to work with large datasets no matter where they're stored. For more information about supported databases and connectors, see the list of supported data sources for Azure Databricks.



Apache Spark

Apache Spark™ is a unified analytics engine for large-scale data processing. Its parallel processing framework boosts performance of big data and analytic applications. Spark can operate as both a batch and stream processing system, without changing core application code.

The basis of Spark is the DataFrame: a tabular collection of data distributed across the Apache Spark worker nodes. A Spark DataFrame is like a table in a relational database or a data frame in R or Python, but it scales out with the size of your cluster rather than being limited to a single machine. DataFrames can be constructed from many sources, such as structured data files, tables in Hive, or external databases. Once your data is in a Spark DataFrame, you can:

  • Run SQL-style computations, such as joining and filtering tables.
  • Apply functions to large datasets using MapReduce-style parallelism.
  • Apply distributed machine learning using Microsoft Machine Learning for Apache Spark.
  • Use Cognitive Services for big data to enrich your data with ready-to-use intelligent services.
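
MapReduce-style parallelism, mentioned in the list above, can be illustrated on a single machine with the Python standard library. The sketch below is an analogy only, not Spark code: the helper names are hypothetical, and a thread pool stands in for the worker nodes that Spark would use. Rows are split into partitions, a map step runs on each partition in parallel, and a reduce step merges the partial results.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def partition(rows, num_partitions):
    """Split rows into roughly equal chunks, much as a DataFrame is split across workers."""
    return [rows[i::num_partitions] for i in range(num_partitions)]


def count_words(rows):
    """Map step: count words within a single partition."""
    counts = Counter()
    for text in rows:
        counts.update(text.lower().split())
    return counts


def word_count(rows, num_partitions=4):
    """Run the map step on every partition in parallel, then merge (reduce) the results."""
    parts = partition(rows, num_partitions)
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        partials = list(pool.map(count_words, parts))
    # Reduce step: add the per-partition counters together.
    return reduce(lambda a, b: a + b, partials, Counter())


totals = word_count(["big data", "Big models", "data data"], num_partitions=2)
```

In Spark, the same split, map, and merge steps happen across the cluster, which is why the application code doesn't need to change as the data grows.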

Microsoft Machine Learning for Apache Spark (MMLSpark)

Microsoft Machine Learning for Apache Spark (MMLSpark), since renamed SynapseML, is an open-source, distributed machine learning (ML) library built on Apache Spark. Cognitive Services for big data is included in this package. Additionally, MMLSpark contains several other ML tools for Apache Spark, such as LightGBM, Vowpal Wabbit, OpenCV, LIME, and more. With MMLSpark, you can build powerful predictive and analytical models from any Spark data source.

HTTP on Spark

Cognitive Services for big data is an example of how we can integrate intelligent web services with big data. Web services power many applications across the globe and most services communicate through the Hypertext Transfer Protocol (HTTP). To work with arbitrary web services at large scales, we provide HTTP on Spark. With HTTP on Spark, you can pass terabytes of data through any web service. Under the hood, we use this technology to power Cognitive Services for big data.
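
The idea behind HTTP on Spark, one web service call per row with the response attached as a new column, can be sketched in plain Python. This is not the HTTP on Spark API: the function names, the column names, and the `send` callable are hypothetical, and `send` stands in for the HTTP round trip so that the sketch runs without a network. Failures are recorded next to the row that caused them instead of stopping the whole batch.

```python
from concurrent.futures import ThreadPoolExecutor


def enrich_rows(rows, send, input_col="text", output_col="response",
                error_col="error", concurrency=4):
    """Send one request per row and attach the response (or the error) as new columns."""

    def call(row):
        enriched = dict(row)  # copy so the input rows are left unchanged
        try:
            enriched[output_col] = send({"text": row[input_col]})
            enriched[error_col] = None
        except Exception as exc:
            # Keep the failure with its row instead of aborting the batch.
            enriched[output_col] = None
            enriched[error_col] = str(exc)
        return enriched

    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        return list(pool.map(call, rows))  # pool.map preserves input order


def fake_language_service(payload):
    """Hypothetical stand-in for a web service; a real `send` would make an HTTP call."""
    if not payload["text"]:
        raise ValueError("empty document")
    return {"length": len(payload["text"])}


rows = [{"id": 1, "text": "hello"}, {"id": 2, "text": ""}]
enriched = enrich_rows(rows, fake_language_service)
```

Passing the transport in as a callable keeps the row-handling logic independent of any particular service, which mirrors how the same mechanism can serve arbitrary web services.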
