Use repartitioning to optimize processing with Azure Stream Analytics

This article shows you how to use repartitioning to scale your Azure Stream Analytics query for scenarios that can't be fully parallelized.

You might not be able to use parallelization if:

  • You don't control the partition key for your input stream.
  • Your source "sprays" input across multiple partitions that later need to be merged.

Repartitioning, or reshuffling, is required when you process data on a stream that's not sharded according to a natural input scheme, such as PartitionId for Event Hubs. When you repartition, each shard can be processed independently, which allows you to linearly scale out your streaming pipeline.

How to repartition

You can repartition your input in two ways:

  1. Use a separate Stream Analytics job that does the repartitioning
  2. Use a single job but do the repartitioning first before your custom analytics logic

Creating a separate Stream Analytics job to repartition input

You can create a job that reads input and writes to an event hub output using a partition key. This event hub can then serve as input for another Stream Analytics job where you implement your analytics logic. When configuring this event hub output in your job, you must specify the partition key by which Stream Analytics will repartition your data.

-- For compat level 1.2 or higher
INTO output
FROM input

--For compat level 1.1 or lower
INTO output
FROM input PARTITION BY PartitionId

Repartition input within a single Stream Analytics job

You can also introduce a step in your query that first repartitions the input, which can then be used by other steps in your query. For example, if you want to repartition input based on DeviceId, your query would be:

WITH RepartitionedInput AS 
    SELECT * 
    FROM input PARTITION BY DeviceID

SELECT DeviceID, AVG(Reading) as AvgNormalReading  
INTO output
FROM RepartitionedInput  
GROUP BY DeviceId, TumblingWindow(minute, 1)  

The following example query joins two streams of repartitioned data. When you join two streams of repartitioned data, the streams must have the same partition key and count. The outcome is a stream that has the same partition scheme.

WITH step1 AS 
    SELECT * FROM input1 
step2 AS 
    SELECT * FROM input2 

SELECT * INTO output 

The output scheme should match the stream scheme key and count so that each substream can be flushed independently. The stream could also be merged and repartitioned again by a different scheme before flushing, but you should avoid that method because it adds to the general latency of the processing and increases resource utilization.

Streaming Units for repartitions

Experiment and observe the resource usage of your job to determine the exact number of partitions you need. The number of streaming units (SU) must be adjusted according to the physical resources needed for each partition. In general, six SUs are needed for each partition. If there are insufficient resources assigned to the job, the system will only apply the repartition if it benefits the job.

Repartitions for SQL output

When your job uses SQL database for output, use explicit repartitioning to match the optimal partition count to maximize throughput. Since SQL works best with eight writers, repartitioning the flow to eight before flushing, or somewhere further upstream, might benefit job performance.

When there are more than eight input partitions, inheriting the input partitioning scheme might not be an appropriate choice. Consider using INTO in your query to explicitly specify the number of output writers.

The following example reads from the input, regardless of it being naturally partitioned, and repartitions the stream tenfold according to the DeviceID dimension and flushes the data to output.

SELECT * INTO [output] 
FROM [input] 

For more information, see Azure Stream Analytics output to Azure SQL Database.

Next steps