Build a content-based recommendation system

Azure Databricks
Azure Machine Learning

Solution ideas

This article is a solution idea. If you'd like us to expand the content with more information, such as potential use cases, alternative services, implementation considerations, or pricing guidance, let us know by providing GitHub feedback.

Recommendations are a key revenue driver for many businesses and are used in different kinds of industries, including retail, news, and media. With the availability of large amounts of data about customer activity, you can provide highly relevant recommendations by using machine learning.


Architectural diagram that shows training, evaluation, and development of a machine learning model for content-based personalization that uses Azure Databricks.


  1. Store. Azure Data Lake Storage stores large amounts of data about user and consumer behavior.

  2. Read. Azure Databricks connects to and reads from Azure Data Lake Storage. Ingesting the data into Azure Databricks enables the preprocessing and training that lead to a registered model.

  3. Preprocess. Data preprocessing cleanses, transforms, and prepares data to be fed to the recommendations system model.

  4. Train. Training has two steps: feature engineering and model training. During model training, Azure Databricks uses the preprocessed dataset to train and explain the behavior of the best recommendation model.

  5. Postprocess. Postprocessing involves model evaluation and selection based on which model performs best.

  6. Deploy. Azure Databricks maintains the model. Batch managed endpoints deploy the model so that front-end applications can display its output. After the model is deployed, new data is accessible through new endpoints. Batch and near-real-time recommendations are supported.

  7. Write. The model runs as batch inference and stores the results in the respective datastore. The results are written to and captured in Azure Synapse Analytics, where user interfaces, such as web applications, can consume them.
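The preprocess, train, postprocess, and batch-scoring steps above can be sketched in a few plain Python functions. This is only an illustration under stated assumptions, not the implementation behind this architecture: the record layout, the function names, the toy per-category click-rate "model", and the Brier score used for model selection are all hypothetical. In practice these steps would run as Spark jobs in Azure Databricks.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical raw record: (user_id, item_category, clicked). Not from the article.
RawRow = Tuple[str, Optional[str], Optional[int]]
CleanRow = Tuple[str, str, int]


def preprocess(raw: List[RawRow]) -> List[CleanRow]:
    """Step 3: drop rows with missing fields, normalize text, remove duplicates."""
    seen = set()
    clean: List[CleanRow] = []
    for user, category, clicked in raw:
        if category is None or clicked is None:
            continue
        row = (user, category.strip().lower(), int(clicked))
        if row not in seen:
            seen.add(row)
            clean.append(row)
    return clean


def train_category_rate(rows: List[CleanRow], smoothing: float) -> Dict[str, float]:
    """Step 4: toy model that estimates a smoothed click rate per category."""
    clicks: Dict[str, float] = {}
    views: Dict[str, float] = {}
    for _, category, clicked in rows:
        clicks[category] = clicks.get(category, 0.0) + clicked
        views[category] = views.get(category, 0.0) + 1.0
    return {c: (clicks[c] + smoothing) / (views[c] + 2 * smoothing) for c in views}


def brier_score(model: Dict[str, float], rows: List[CleanRow], default: float = 0.5) -> float:
    """Step 5: mean squared error between predicted probability and label (lower is better)."""
    errors = [(model.get(cat, default) - clicked) ** 2 for _, cat, clicked in rows]
    return sum(errors) / len(errors)


def batch_score(model: Dict[str, float], items: Dict[str, str], default: float = 0.5) -> Dict[str, float]:
    """Steps 6-7: score every item in one batch; the dict stands in for a datastore table."""
    return {item_id: model.get(cat, default) for item_id, cat in items.items()}


raw = [
    ("u1", "Sports", 1),
    ("u1", "Sports", 1),   # exact duplicate, removed in preprocessing
    ("u2", "sports ", 1),
    ("u2", "News", 0),
    ("u3", "news", 0),
    ("u3", None, 1),       # missing category, removed in preprocessing
]
train_rows = preprocess(raw)
holdout = [("u4", "sports", 1), ("u4", "news", 0)]

# Train two candidate models and keep the one that performs best on the holdout set.
candidates = {
    "smoothing=1.0": train_category_rate(train_rows, 1.0),
    "smoothing=0.1": train_category_rate(train_rows, 0.1),
}
best_name = min(candidates, key=lambda name: brier_score(candidates[name], holdout))
results = batch_score(candidates[best_name], {"item-a": "sports", "item-b": "news", "item-c": "travel"})
```

Here `results` plays the role of the stored batch output that a web application would read. Categories the model has never seen fall back to a neutral default score.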


This architecture makes use of the following components:

  • Azure Data Lake Storage is a set of storage capabilities that are dedicated to big data analytics and that provide file system semantics, file-level security, and scaling.

  • Azure Databricks is a managed Apache Spark cluster for model training and evaluation.

  • Azure Synapse Analytics is a cloud data warehouse that lets you scale compute and storage elastically and independently, with a massively parallel processing architecture.

Scenario details

The approach described in this article focuses on building a content-based recommendation system. For more information about the best practices of building recommendation systems, see the documentation and examples for Recommenders on GitHub.

This example scenario shows how you can use machine learning to automate content-based personalization for your customers. The solution uses Azure Databricks to train a model that predicts the probability that a user will be interested in an item. Batch managed endpoints deploy that model as a prediction service. You can use this service to create personalized recommendations by ranking items based on the content that a user is most likely to be interested in.
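To make the ranking step concrete, here is a minimal sketch of how predicted probabilities could be turned into a top-k list. It assumes a logistic model over named features. The article doesn't specify the model type, so the weights, feature names, and item IDs below are purely illustrative.

```python
import math
from typing import Dict, List, Tuple


def predict_interest(weights: Dict[str, float], features: Dict[str, float], bias: float = 0.0) -> float:
    """Return the probability of interest from an assumed logistic model."""
    z = bias + sum(weights.get(name, 0.0) * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))


def rank_items(
    weights: Dict[str, float],
    candidates: Dict[str, Dict[str, float]],
    k: int,
) -> List[Tuple[str, float]]:
    """Score every candidate item and return the k most probable, best first."""
    scored = [(item_id, predict_interest(weights, feats)) for item_id, feats in candidates.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical learned weights for one user segment.
weights = {"is_sports": 1.5, "is_news": -0.5, "recency": 0.8}

# Hypothetical content features for three candidate items.
candidates = {
    "article-1": {"is_sports": 1.0, "recency": 0.9},
    "article-2": {"is_news": 1.0, "recency": 1.0},
    "article-3": {"is_sports": 1.0, "recency": 0.1},
}

top = rank_items(weights, candidates, k=2)
```

With these made-up numbers, the two sports articles rank ahead of the news article because the assumed weights favor sports content.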

Potential use cases

This solution is ideal for the retail industry. It's relevant to the following use cases:

  • Content recommendations for websites and mobile apps
  • Product recommendations for e-commerce sites
  • Displayed ad recommendations for websites

Types of recommendation systems

There are three main types of recommendation systems:

  • Collaborative filtering. Collaborative filtering identifies similar patterns in customer behavior and recommends items that other similar customers have interacted with. An advantage of collaborative filtering is the ease of generating data: users create data while interacting with listings of items and products. Moreover, customers can discover new items and products other than those that are curated from their historical interactions. However, the downside of collaborative filtering is the cold-start problem: because interactions between users and new offerings are scarce, an algorithm that depends entirely on customer interactions doesn't recommend newly added items.

  • Content-based. Content-based recommendation uses information about the items to learn customer preferences, and it recommends items that share properties with items that a customer has previously interacted with. Content-based recommendation systems aren't hampered by the cold-start problem and can adapt to the introduction of new items. However, the recommendations are limited to the features of the original item that a customer interacted with.

  • Hybrid method. Another approach to building recommendation systems is to blend content-based and collaborative filtering. This system recommends items based on user ratings and on information about items. The hybrid approach has the advantages of both collaborative filtering and content-based recommendation.
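The content-based approach described above can be illustrated with a small sketch that builds a user profile from item attributes and ranks unseen items by cosine similarity. The catalog, attribute names, and helper functions are hypothetical, and cosine similarity is one common choice rather than the method this architecture prescribes. Note that the new item is recommended even though nobody has interacted with it, which is how content-based systems avoid the cold-start problem.

```python
import math
from typing import Dict, List

Vector = Dict[str, float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two sparse attribute vectors."""
    dot = sum(value * b.get(name, 0.0) for name, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def user_profile(history: List[str], catalog: Dict[str, Vector]) -> Vector:
    """Average the attribute vectors of the items the user interacted with."""
    profile: Vector = {}
    for item_id in history:
        for name, value in catalog[item_id].items():
            profile[name] = profile.get(name, 0.0) + value / len(history)
    return profile


def recommend(history: List[str], catalog: Dict[str, Vector], k: int) -> List[str]:
    """Rank items the user hasn't seen by similarity to the user's profile."""
    profile = user_profile(history, catalog)
    unseen = [item_id for item_id in catalog if item_id not in history]
    return sorted(unseen, key=lambda i: cosine(profile, catalog[i]), reverse=True)[:k]


# Hypothetical catalog; "trail-shoes" is newly added and has no interactions yet.
catalog = {
    "running-shoes": {"sports": 1.0, "footwear": 1.0},
    "yoga-mat": {"sports": 1.0, "fitness": 1.0},
    "novel": {"books": 1.0, "fiction": 1.0},
    "trail-shoes": {"sports": 1.0, "footwear": 1.0, "outdoor": 1.0},
}

recs = recommend(["running-shoes"], catalog, k=2)
```

The sketch also shows the limitation noted above: every recommendation shares attributes with the item the user already viewed, so an unrelated item such as the novel never surfaces.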


This article is maintained by Microsoft. It was originally written by the following contributors.

Principal author:

Other contributors:

  • Andrew Ajaluwa | Program Manager
  • Gary Moore | Programmer/Writer

Next steps