Azure Cognitive Services for big data
Azure Cognitive Services for big data lets users channel terabytes of data through Cognitive Services using Apache Spark™ and open source libraries for distributed machine learning workloads. With Cognitive Services for big data, it's easy to create large-scale intelligent applications with any datastore.
Using the resources and libraries described in this article, you can embed continuously improving, intelligent models directly into Apache Spark™ and SQL computations. These tools liberate developers from low-level networking details, so that they can focus on creating smart, distributed applications.
Features and benefits
Cognitive Services for big data can use resources from any supported region, as well as containerized Cognitive Services. Containers support low- or no-connectivity deployments with ultra-low latency responses. Containerized Cognitive Services can be run locally, directly on the worker nodes of your Spark cluster, or on an external orchestrator like Kubernetes.
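From the caller's point of view, the main difference between a regional resource and a container is the base URL that requests are sent to. The following is a minimal sketch of that idea. The helper name `service_base_url` is hypothetical, and the URL shapes (a regional `api.cognitive.microsoft.com` host and a container reachable at a host and port you choose) are assumptions to check against your own resource or container configuration.

```python
def service_base_url(region: str = "", container_host: str = "") -> str:
    """Return the base URL to send requests to (hypothetical helper).

    container_host: host[:port] of a locally running container (assumed).
    region: Azure region name of a cloud resource (assumed URL shape).
    """
    if container_host:
        # A container on a worker node or in Kubernetes is addressed directly.
        return f"http://{container_host}"
    if region:
        # A cloud resource is addressed through its regional endpoint.
        return f"https://{region}.api.cognitive.microsoft.com"
    raise ValueError("Provide either a region or a container host")


cloud_url = service_base_url(region="westus")
local_url = service_base_url(container_host="localhost:5000")
```

Keeping the base URL as the only moving part lets the same application code target a cloud resource during development and a container in a low-connectivity deployment.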
Supported services
Cognitive Services, accessed through APIs and SDKs, help developers build intelligent applications without having AI or data science skills. With Cognitive Services you can make your applications see, hear, speak, and understand. To use Cognitive Services, your application must send data to the service over the network. Once received, the service sends an intelligent response in return. The following Cognitive Services resources are available for big data workloads:
Vision
Service Name | Service Description |
---|---|
Computer Vision | The Computer Vision service provides you with access to advanced algorithms for processing images and returning information. |
Face | The Face service provides access to advanced face algorithms, enabling face attribute detection and recognition. |
Speech
Service Name | Service Description |
---|---|
Speech service | The Speech service provides access to features like speech recognition, speech synthesis, speech translation, and speaker verification and identification. |
Decision
Service Name | Service Description |
---|---|
Anomaly Detector | The Anomaly Detector service allows you to monitor and detect abnormalities in your time series data. |
Language
Service Name | Service Description |
---|---|
Language service | The Language service provides natural language processing over raw text for sentiment analysis, key-phrase extraction, and language detection. |
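
As a concrete illustration of the request and response pattern described at the start of this section, the sketch below builds (but doesn't send) an HTTP request for sentiment analysis using only the Python standard library. The endpoint host, the `/text/analytics/v3.0/sentiment` path, the request body shape, and the helper name `build_sentiment_request` are assumptions for illustration; check the REST reference for your resource before relying on them.

```python
import json
import urllib.request


def build_sentiment_request(endpoint: str, key: str, text: str) -> urllib.request.Request:
    """Build a POST request carrying one document (hypothetical helper)."""
    # Assumed body shape: a list of documents, each with an id and text.
    body = {"documents": [{"id": "1", "language": "en", "text": text}]}
    return urllib.request.Request(
        url=endpoint.rstrip("/") + "/text/analytics/v3.0/sentiment",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            # The resource key authenticates the caller.
            "Ocp-Apim-Subscription-Key": key,
        },
        method="POST",
    )


request = build_sentiment_request(
    "https://example-resource.cognitiveservices.azure.com",  # placeholder endpoint
    "<your-key>",  # placeholder key
    "The service was easy to set up.",
)
# Passing `request` to urllib.request.urlopen would send the data over the
# network; the service would then reply with a JSON response.
```

Cognitive Services for big data handles this request construction, sending, and response parsing for every row of a distributed dataset, so you don't write it by hand.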
Search
Service Name | Service Description |
---|---|
Bing Image Search | The Bing Image Search service returns a display of images determined to be relevant to the user's query. |
Supported programming languages for Cognitive Services for big data
Cognitive Services for big data is built on Apache Spark. Apache Spark is a distributed computing library that supports Java, Scala, Python, R, and many other languages. See SynapseML for documentation, samples, and blog posts.
The following languages are currently supported.
Python
We provide a PySpark API for current and legacy libraries.
For more information, see the Python Developer API. For usage examples, see the Python Samples.
Scala and Java
We provide a Scala- and Java-based Spark API for current and legacy libraries.
For more information, see the Scala Developer API. For usage examples, see the Scala Samples.
Supported platforms and connectors
Big data scenarios require Apache Spark. There are several Apache Spark platforms that support Cognitive Services for big data.
Azure Databricks
Azure Databricks is an Apache Spark-based analytics platform optimized for the Microsoft Azure cloud services platform. It provides one-click setup, streamlined workflows, and an interactive workspace that supports collaboration between data scientists, data engineers, and business analysts.
Azure Synapse Analytics
Azure Synapse Analytics is an enterprise data warehouse that uses massively parallel processing. With Synapse Analytics, you can quickly run complex queries across petabytes of data. Azure Synapse Analytics provides managed Spark pools to run Spark jobs with an intuitive Jupyter notebook interface.
Azure Kubernetes Service
Azure Kubernetes Service (AKS) orchestrates Docker Containers and distributed applications at massive scales. AKS is a managed Kubernetes offering that simplifies using Kubernetes in Azure. Kubernetes can enable fine-grained control of Cognitive Service scale, latency, and networking. However, we recommend using Azure Databricks or Azure Synapse Analytics if you're unfamiliar with Apache Spark.
Data Connectors
Once you have a Spark cluster, the next step is connecting to your data. Apache Spark has a broad collection of database connectors. These connectors allow applications to work with large datasets no matter where they're stored. For more information about supported databases and connectors, see the list of supported data sources for Azure Databricks.
Concepts
Spark
Apache Spark™ is a unified analytics engine for large-scale data processing. Its parallel processing framework boosts performance of big data and analytic applications. Spark can operate as both a batch and stream processing system, without changing core application code.
The basis of Spark is the DataFrame: a tabular collection of data distributed across the Apache Spark worker nodes. A Spark DataFrame is like a table in a relational database or a data frame in R/Python, but it can scale beyond the memory of a single machine. DataFrames can be constructed from many sources, such as structured data files, tables in Hive, or external databases. Once your data is in a Spark DataFrame, you can:
- Do SQL-style computations, such as joining and filtering tables.
- Apply functions to large datasets using MapReduce-style parallelism.
- Apply distributed machine learning using Microsoft Machine Learning for Apache Spark.
- Use Cognitive Services for big data to enrich your data with ready-to-use intelligent services.
Microsoft Machine Learning for Apache Spark (MMLSpark)
Microsoft Machine Learning for Apache Spark (MMLSpark) is an open-source, distributed machine learning (ML) library built on Apache Spark. Cognitive Services for big data is included in this package. Additionally, MMLSpark contains several other ML tools for Apache Spark, such as LightGBM, Vowpal Wabbit, OpenCV, LIME, and more. With MMLSpark, you can build powerful predictive and analytical models from any Spark data source.
HTTP on Spark
Cognitive Services for big data is an example of how we can integrate intelligent web services with big data. Web services power many applications across the globe and most services communicate through the Hypertext Transfer Protocol (HTTP). To work with arbitrary web services at large scales, we provide HTTP on Spark. With HTTP on Spark, you can pass terabytes of data through any web service. Under the hood, we use this technology to power Cognitive Services for big data.
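The idea behind HTTP on Spark can be sketched without Spark itself: each row becomes a request, partitions of rows are processed in parallel, and each response is attached to its row as a new column. The example below is a conceptual, standard-library-only sketch, not the MMLSpark API. The function names are hypothetical, a thread pool stands in for Spark worker nodes, and `fake_sentiment_service` is a local stub that replaces a real web service so the example runs without network access.

```python
import json
from concurrent.futures import ThreadPoolExecutor


def fake_sentiment_service(request_body: str) -> str:
    """Stub standing in for a remote web service (no network I/O)."""
    text = json.loads(request_body)["text"]
    label = "positive" if "good" in text.lower() else "neutral"
    return json.dumps({"sentiment": label})


def to_request(row: dict) -> str:
    """Turn one row into a request body."""
    return json.dumps({"text": row["text"]})


def process_partition(rows: list) -> list:
    """Call the service once per row and attach the parsed response."""
    enriched_rows = []
    for row in rows:
        response = json.loads(fake_sentiment_service(to_request(row)))
        enriched_rows.append({**row, "sentiment": response["sentiment"]})
    return enriched_rows


def enrich(rows: list, num_partitions: int = 2) -> list:
    """Split rows into partitions and process the partitions in parallel."""
    partitions = [rows[i::num_partitions] for i in range(num_partitions)]
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        results = list(pool.map(process_partition, partitions))
    # Flatten the partitions and restore the original row order by id.
    flat = [row for part in results for row in part]
    return sorted(flat, key=lambda r: r["id"])


rows = [
    {"id": 1, "text": "Good service"},
    {"id": 2, "text": "The package arrived"},
    {"id": 3, "text": "Really good support"},
]
enriched = enrich(rows)
```

In a real deployment, Spark distributes the partitions across worker nodes and the stub is replaced by calls to an actual service, but the row-to-request and response-to-column steps follow the same shape.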
Developer samples
Blog posts
- Learn more about how Cognitive Services work on Apache Spark™
- Saving Snow Leopards with Deep Learning and Computer Vision on Spark
- Microsoft Research Podcast: MMLSpark, empowering AI for Good with Mark Hamilton
- Academic Whitepaper: Large Scale Intelligent Microservices
Webinars and videos
- The Azure Cognitive Services on Spark: Clusters with Embedded Intelligent Services
- Spark Summit Keynote: Scalable AI for Good
- Cognitive Services for big data in Azure Cosmos DB
- Lightning Talk on Large Scale Intelligent Microservices