Geospatial data processing and analytics

Data Factory
Data Lake Storage
Database for PostgreSQL
Databricks
Event Hubs

This article outlines a manageable solution for making large volumes of geospatial data available for analytics.

Architecture

Architecture diagram showing how geospatial data flows through an Azure system. Various components receive, process, store, analyze, and publish the data.

The diagram contains several gray boxes, each with a different label. From left to right, the labels are Ingest, Prepare, Load, Serve, and Visualize and explore. A final box underneath the others has the label Monitor and secure. Each box contains icons that represent various Azure services. Numbered arrows connect the boxes in the order that the workflow steps describe.

Workflow

  1. IoT data enters the system through Azure Event Hubs, which ingests the incoming data streams.

  2. GIS data enters the system:

    • Azure Data Factory ingests raster GIS data and vector GIS data of any format.

      • Raster data consists of grids of values. Each pixel value represents a characteristic like the temperature or elevation of a geographic area.
      • Vector data represents specific geographic features. Vertices, or discrete geometric locations, make up the vectors and define the shape of each spatial object.
    • Data Factory stores the data in Data Lake Storage.

  3. Spark clusters in Azure Databricks use geospatial code libraries to transform and normalize the data.

  4. Data Factory loads the prepared vector and raster data into Azure Database for PostgreSQL. The solution uses the PostGIS extension with this database.

  5. Data Factory loads the prepared vector and raster data into Azure Data Explorer.

  6. Azure Database for PostgreSQL stores the GIS data. APIs make this data available in standardized formats:

    • GeoJSON is based on JavaScript Object Notation (JSON). GeoJSON represents simple geographical features and their non-spatial properties.
    • Well-known text (WKT) is a text markup language that represents vector geometry objects.
    • Vector tiles are packets of geographic data. Their lightweight format improves mapping performance.

    A Redis cache improves performance by providing quick access to the data.

  7. The Web Apps feature of Azure App Service works with Azure Maps to create visuals of the data.

  8. Users analyze the data with Azure Data Explorer. The geospatial features of this tool produce visualizations such as scatterplots of geospatial data.

  9. Power BI provides customized reports and business intelligence (BI). The Azure Maps visual for Power BI highlights the role of location data in business results.
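
To make the formats in step 6 concrete, the following minimal Python sketch represents one point as a GeoJSON feature and converts its geometry to WKT. The helper name `geometry_to_wkt`, the sample coordinates, and the property names are illustrative assumptions, not part of this solution's APIs. The sketch handles only two-dimensional Point and Polygon geometries.

```python
import json

def geometry_to_wkt(geometry: dict) -> str:
    """Convert a GeoJSON Point or Polygon geometry to WKT (hypothetical helper)."""
    kind = geometry["type"]
    coords = geometry["coordinates"]
    if kind == "Point":
        # GeoJSON and WKT both order coordinates as x (longitude), y (latitude).
        return f"POINT ({coords[0]} {coords[1]})"
    if kind == "Polygon":
        # Each ring is a closed list of positions; the first ring is the exterior.
        rings = ", ".join(
            "(" + ", ".join(f"{x} {y}" for x, y in ring) + ")" for ring in coords
        )
        return f"POLYGON ({rings})"
    raise ValueError(f"Unsupported geometry type: {kind}")

# A GeoJSON feature pairs a geometry with non-spatial properties.
feature = {
    "type": "Feature",
    "geometry": {"type": "Point", "coordinates": [-122.13, 47.64]},
    "properties": {"name": "sample sensor", "temperature_c": 18.5},
}

geojson_text = json.dumps(feature)
wkt_text = geometry_to_wkt(feature["geometry"])
print(geojson_text)
print(wkt_text)
```

In practice, a database extension such as PostGIS or a geospatial library would typically perform this conversion. The hand-written version only shows how the two text formats relate.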

Throughout the process:

  • Azure Monitor collects information on events and performance.
  • Log Analytics runs queries on Monitor logs and analyzes the results.
  • Azure Key Vault secures passwords, connection strings, and secrets.

Components

  • Azure Event Hubs is a fully managed streaming platform for big data. This platform as a service (PaaS) offers a partitioned consumer model. Multiple applications can use this model to process the data stream at the same time.

  • Azure Data Factory is an integration service that works with data from disparate data stores. You can use this fully managed, serverless platform to create, schedule, and orchestrate data transformation workflows.

  • Azure Databricks is a data analytics platform. Its fully managed Spark clusters process large streams of data from multiple sources. Azure Databricks can transform geospatial data at large scale for use in analytics and data visualization.

  • Data Lake Storage is a scalable and secure data lake for high-performance analytics workloads. This service can manage multiple petabytes of information while sustaining hundreds of gigabits of throughput. The data typically comes from multiple, heterogeneous sources and can be structured, semi-structured, or unstructured.

  • Azure Database for PostgreSQL is a fully managed relational database service that's based on the community edition of the open-source PostgreSQL database engine.

  • PostGIS is an extension for the PostgreSQL database that integrates with GIS servers. PostGIS can run SQL location queries that involve geographic objects.

  • Redis is an open-source, in-memory data store. Redis caches keep frequently accessed data in server memory. The caches can then quickly process large volumes of application requests that use the data.

  • Power BI is a collection of software services and apps. You can use Power BI to connect unrelated sources of data and create visuals of them.

  • The Azure Maps visual for Power BI provides a way to enhance maps with spatial data. You can use this visual to show how location data affects business metrics.

  • Azure App Service and its Web Apps feature provide a framework for building, deploying, and scaling web apps. The App Service platform offers built-in infrastructure maintenance, security patching, and scaling.

  • GIS data APIs in Azure Maps store and retrieve map data in formats like GeoJSON and vector tiles.

  • Azure Data Explorer is a fast, fully managed data analytics service that can work with large volumes of data. This service originally focused on time series and log analytics. It now also handles diverse data streams from applications, websites, IoT devices, and other sources. Geospatial functionality in Azure Data Explorer provides options for rendering map data.

  • Azure Monitor collects data on environments and Azure resources. This diagnostic information is helpful for maintaining availability and performance. Two data platforms make up Monitor: Azure Monitor Logs and Azure Monitor Metrics.

  • Log Analytics is an Azure portal tool that runs queries on Monitor log data. Log Analytics also provides features for charting and statistically analyzing query results.

  • Key Vault stores and controls access to secrets such as tokens, passwords, and API keys. Key Vault also creates and controls encryption keys and manages security certificates.

Alternatives

  • Instead of developing your own APIs, consider using Martin. This open-source tile server makes vector tiles available to web apps. Written in Rust, Martin connects to PostgreSQL tables. You can deploy it as a container.

  • If your goal is to provide a standardized interface for GIS data, consider using GeoServer. This open framework implements industry-standard Open Geospatial Consortium (OGC) protocols such as Web Feature Service (WFS). It also integrates with common spatial data sources. You can deploy GeoServer as a container on a virtual machine. When customized web apps and exploratory queries are secondary, GeoServer provides a straightforward way to publish geospatial data.

  • Various Spark libraries are available for working with geospatial data on Azure Databricks. This solution uses these libraries:

    But other solutions also exist for processing and scaling geospatial workloads with Azure Databricks.

  • Vector tiles provide an efficient way to display GIS data on maps. This solution uses PostGIS to dynamically query vector tiles. This approach works well for simple queries and result sets that contain well under 1 million records. But in the following cases, a different approach may be better:

    • Your queries are computationally expensive.
    • Your data doesn't change frequently.
    • You're displaying large data sets.

    In these situations, consider using Tippecanoe to generate vector tiles. You can run Tippecanoe as part of your data processing flow, either as a container or with Azure Functions. You can make the resulting tiles available through APIs.

  • Like Event Hubs, Azure IoT Hub can ingest large amounts of data. But IoT Hub also offers bi-directional communication capabilities with devices. If you receive data directly from devices but also send commands and policies back to devices, consider IoT Hub instead of Event Hubs.

  • To streamline the solution, you can omit these components if you don't need them:

    • Azure Data Explorer
    • Power BI
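
Whether vector tiles are queried dynamically or generated ahead of time, clients typically request them by zoom level and x/y index. As background, the sketch below applies the widely used Web Mercator (XYZ) tiling convention to find the tile that contains a given longitude and latitude. The function names and sample coordinates are illustrative, and this article doesn't state which tiling scheme the solution uses, so treat the convention as an assumption.

```python
import math

def lon_lat_to_tile(lon: float, lat: float, zoom: int) -> tuple[int, int]:
    """Return the (x, y) index of the XYZ tile containing a point (Web Mercator)."""
    n = 2 ** zoom  # number of tiles along each axis at this zoom level
    lat_rad = math.radians(lat)
    x = int((lon + 180.0) / 360.0 * n)
    y = int((1.0 - math.asinh(math.tan(lat_rad)) / math.pi) / 2.0 * n)
    # Clamp so that lon = 180 or extreme latitudes still map to a valid tile.
    return min(max(x, 0), n - 1), min(max(y, 0), n - 1)

def tiles_at_zoom(zoom: int) -> int:
    """Total number of tiles that cover the world at a zoom level."""
    return 4 ** zoom

print(lon_lat_to_tile(0.0, 0.0, 1))
print(tiles_at_zoom(14))
```

The tile count grows by a factor of four with each zoom level, which is one reason the choice between dynamic queries and pre-generated tiles depends on data size and update frequency.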

Scenario details

Many possibilities exist for working with geospatial data, or information that includes a geographic component. For instance, geographic information system (GIS) software and standards are widely available. These technologies can store, process, and provide access to geospatial data. But it's often hard to configure and maintain systems that work with geospatial data. You also need expert knowledge to integrate those systems with other systems.

This article outlines a manageable solution for making large volumes of geospatial data available for analytics. The approach is based on the Advanced Analytics Reference Architecture and uses these Azure services:

  • Azure Databricks with GIS Spark libraries processes data.
  • Azure Database for PostgreSQL queries data that users request through APIs.
  • Azure Data Explorer runs fast exploratory queries.
  • Azure Maps creates visuals of geospatial data in web applications.
  • The Azure Maps visual for Power BI provides customized reports.

Potential use cases

This solution applies to many areas:

  • Processing, storing, and providing access to large amounts of raster data, such as maps or climate data.
  • Identifying the geographic position of enterprise resource planning (ERP) system entities.
  • Combining entity location data with GIS reference data.
  • Storing Internet of Things (IoT) telemetry from moving devices.
  • Running analytical geospatial queries.
  • Embedding curated and contextualized geospatial data in web apps.
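
As one illustration of combining entity locations with GIS reference data, the following sketch assigns each entity to a reference region by using a ray-casting point-in-polygon test. The region names, entity names, and coordinates are made up for the example, and the code treats coordinates as planar, which is only a reasonable approximation over small areas. A production system would more likely rely on a spatial function in PostGIS or a geospatial Spark library.

```python
def point_in_polygon(lon: float, lat: float, ring: list[tuple[float, float]]) -> bool:
    """Ray-casting test: is the point inside the polygon's exterior ring?"""
    inside = False
    j = len(ring) - 1
    for i in range(len(ring)):
        xi, yi = ring[i]
        xj, yj = ring[j]
        # Does the edge from vertex j to vertex i straddle the point's latitude?
        if (yi > lat) != (yj > lat):
            # Longitude at which the edge crosses that latitude.
            crossing = (xj - xi) * (lat - yi) / (yj - yi) + xi
            if lon < crossing:
                inside = not inside
        j = i
    return inside

# Hypothetical reference regions, each a closed ring of (lon, lat) vertices.
regions = {
    "west": [(0, 0), (10, 0), (10, 10), (0, 10), (0, 0)],
    "east": [(10, 0), (20, 0), (20, 10), (10, 10), (10, 0)],
}

# Hypothetical entity locations, for example warehouses from an ERP system.
entities = {"warehouse-a": (5.0, 5.0), "warehouse-b": (15.0, 2.5), "warehouse-c": (25.0, 5.0)}

assignments = {
    name: next((r for r, ring in regions.items() if point_in_polygon(lon, lat, ring)), None)
    for name, (lon, lat) in entities.items()
}
print(assignments)
```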

Considerations

The following considerations, based on the Microsoft Azure Well-Architected Framework, apply to this solution.

Availability

Scalability

This solution's implementation meets these conditions:

  • Processes up to 10 million data sets per day. The data sets include batch or streaming events.
  • Stores 100 million data sets in an Azure Database for PostgreSQL database.
  • Queries 1 million or fewer data sets at the same time. A maximum of 30 users run the queries.
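
As a rough sanity check on these figures, the sketch below converts the daily volume into an average per-second rate and estimates how many days of ingestion at the stated maximum rate would reach the stated storage volume. It assumes evenly distributed load, which real batch or streaming traffic rarely has, so peak rates can be considerably higher than the average shown.

```python
SECONDS_PER_DAY = 24 * 60 * 60

daily_data_sets = 10_000_000    # upper bound processed per day (from the list above)
stored_data_sets = 100_000_000  # data sets held in Azure Database for PostgreSQL

# Average ingestion rate if the load were spread evenly over the day.
average_per_second = daily_data_sets / SECONDS_PER_DAY

# Days of ingestion at the maximum daily rate needed to reach the stored volume.
days_to_fill = stored_data_sets / daily_data_sets

print(f"{average_per_second:.1f} data sets per second on average")
print(f"{days_to_fill:.0f} days at the maximum daily rate")
```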

The environment uses this configuration:

  • An Azure Databricks cluster with four F8s_V2 worker nodes.
  • A memory-optimized instance of Azure Database for PostgreSQL.
  • An App Service plan with two Standard S2 instances.

Consider these factors to determine which adjustments to make for your implementation:

  • Your data ingestion rate.
  • Your volume of data.
  • Your query volume.
  • The number of parallel queries you need to support.

You can scale Azure components independently:

The autoscale feature of Monitor also provides scaling functionality. You can configure this feature to add resources to handle increases in load. It can also remove resources to save money.

Security

Security provides assurances against deliberate attacks and the abuse of your valuable data and systems. For more information, see Overview of the security pillar.

Cost optimization

Cost optimization is about looking at ways to reduce unnecessary expenses and improve operational efficiencies. For more information, see Overview of the cost optimization pillar.

  • To estimate the cost of implementing this solution, see a sample cost profile. This profile is for a single implementation of the environment described in the Scalability section. It doesn't include the cost of Azure Data Explorer.
  • To adjust the parameters and explore the cost of running this solution in your environment, use the Azure pricing calculator.

Contributors

This article is maintained by Microsoft. It was originally written by the following contributors.

Principal author:
