Choosing the correct setup for time-series data

Hannes Caesar 5 Reputation points
2023-07-06T08:11:32.2866667+00:00

I'm seeking advice on how to optimize my time-series database setup, which should handle a large volume of time-series data. I have around 20,000 time-series profiles, each covering one year at a quarter-hourly resolution (4 timestamps per hour). This amounts to approximately 700 million entries. Right now, I am using an Azure PostgreSQL server with the TimescaleDB extension.

Here are the details of my setup:

Hardware Specifications:

• 4 vCores
• 16 GiB memory
• 512 GB storage

Database Structure: I have two tables, one for the load profiles with the columns (id, time, value, sensor_id), and another table with the columns (id, sensor_id). There are two indexes on the load profile table, one on (sensor_id, time), and another on sensor_id.
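
For reference, the two indexes described above would correspond roughly to the following definitions. This is only a sketch: the index names are hypothetical, since the post gives the indexed columns but not the names, and pg_indexes is the standard PostgreSQL catalog view for listing what actually exists.

```sql
-- Hypothetical index names; only the column lists come from the post.
CREATE INDEX IF NOT EXISTS loadprofilepool_sensor_time_idx
    ON public.loadprofilepool (sensor_id, time);
CREATE INDEX IF NOT EXISTS loadprofilepool_sensor_idx
    ON public.loadprofilepool (sensor_id);

-- List the indexes that are actually defined on the table.
SELECT indexname, indexdef
FROM pg_indexes
WHERE schemaname = 'public'
  AND tablename = 'loadprofilepool';
```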

Sample Query: A typical query I use to aggregate data is:

SELECT AVG(value), time
FROM public.loadprofilepool
WHERE sensor_id IN (
    SELECT id 
    FROM public.sensor_table
    ORDER BY RANDOM()
    LIMIT 500
)
GROUP BY time;

Please note that this is a sample query where the list of sensor_ids is generated on the fly to mimic retrieval of different (hence random ordering of ids) sets of sensors. In a real situation, the list of ids would come from elsewhere and could contain from a few to a couple of thousand sensor ids.
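
For the real case, where the ids come from elsewhere, the same aggregation could be written with the id list passed in as a single array parameter. This is a minimal sketch, not a tuned query: the statement name avg_profile is made up, the bigint[] parameter type assumes sensor_id is a bigint column, and the ids in the EXECUTE call are placeholders.

```sql
-- Sketch: $1 is an array of sensor ids supplied by the application.
-- Assumes sensor_id is bigint; adjust the parameter type if it is not.
PREPARE avg_profile(bigint[]) AS
SELECT time, AVG(value) AS avg_value
FROM public.loadprofilepool
WHERE sensor_id = ANY($1)
GROUP BY time
ORDER BY time;

-- Placeholder ids for illustration only.
EXECUTE avg_profile(ARRAY[1, 2, 3]::bigint[]);
```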

Data Distribution: For now, there are 24 * 4 * 365 rows (one year at quarter-hourly resolution) per sensor, and there are 20,000 sensors. In the future, there will also be live sensor data, whose distribution will depend on the specific sensor.
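
As a quick check of the volume quoted above, the arithmetic can be run directly in PostgreSQL. The column aliases are illustrative only.

```sql
-- 24 hours * 4 readings per hour * 365 days, then times 20,000 sensors.
SELECT 24 * 4 * 365         AS rows_per_sensor,  -- 35040
       24 * 4 * 365 * 20000 AS total_rows;       -- 700800000
```

That is 35,040 rows per sensor and about 700.8 million rows in total, which matches the "approximately 700 million entries" figure.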

Performance Metrics: When running these queries, the CPU usage does not exceed 20% and memory usage is constant at about 40%.

Given these details, I'm struggling with query speed. Extracting 10 to 1000 profiles and summing them up per timestamp into a single time series currently takes from about 5 seconds to several minutes, whereas my target is a few seconds for an aggregation of a couple thousand sensors.

My questions are as follows:

  1. Is my current setup the most efficient for handling and querying this volume and type of time-series data? If not, could you suggest alternative methods? I've considered NoSQL databases, cloud storage with Zarr or NetCDF files, but I'm not sure which, if any, would be more suitable.
  2. How can I optimize my current setup to achieve faster query results? Are there specific TimescaleDB or PostgreSQL configurations or optimizations, indexing strategies, or query formulation tactics that would help improve performance?

Thank you in advance for your help. Any suggestions or guidance would be greatly appreciated.

Best regards, Hannes

I have tried to create different indexes and cluster the loadprofilepool table.
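
Whether a given index or clustering change actually affects the plan can be checked with EXPLAIN, which is standard PostgreSQL. The sketch below reuses the sample aggregation with a short, made-up id list in place of the random subquery; the output shows which indexes are scanned and how many buffers are read.

```sql
-- ANALYZE executes the query and reports real timings;
-- BUFFERS adds shared-buffer hit/read counts per plan node.
EXPLAIN (ANALYZE, BUFFERS)
SELECT AVG(value), time
FROM public.loadprofilepool
WHERE sensor_id IN (1, 2, 3)   -- hypothetical ids
GROUP BY time;
```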

Azure Database for PostgreSQL

2 answers

  1. SSingh-MSFT 16,371 Reputation points Moderator
    2023-07-11T09:07:03.1233333+00:00

    Hi Hannes Caesar,

    Refer to the performance tuning documents here:

    https://learn.microsoft.com/en-us/azure/postgresql/single-server/tutorial-monitor-and-tune

    https://azure.microsoft.com/en-us/blog/performance-updates-and-tuning-best-practices-for-using-azure-database-for-postgresql/

    https://learn.microsoft.com/en-us/azure/postgresql/flexible-server/concepts-intelligent-tuning

    The answer to your question depends on many other factors in your design. Read through these documents and apply what is appropriate for your specific use case.

    Thank you.

    1 person found this answer helpful.

  2. SSingh-MSFT 16,371 Reputation points Moderator
    2023-07-07T09:34:50.55+00:00

    Hi Hannes Caesar,

    Thanks for your patience.

    I received the following reply from the internal team, given inline below each of your questions:

    1. Is my current setup the most efficient for handling and querying this volume and type of time-series data? If not, could you suggest alternative methods? I've considered NoSQL databases, cloud storage with Zarr or NetCDF files, but I'm not sure which, if any, would be more suitable.

     

    • TimescaleDB is a third-party PostgreSQL extension that offers time-series database features. All third-party extensions offered in Azure Database for PostgreSQL - Flexible Server, including TimescaleDB, are open-source licensed code.
    • Based on usage, TimescaleDB on PostgreSQL is a popular choice for running time-series workloads; however, whether it suits you largely depends on your specific use case.

     

    2. How can I optimize my current setup to achieve faster query results? Are there specific TimescaleDB or PostgreSQL configurations or optimizations, indexing strategies, or query formulation tactics that would help improve performance?

     

    • As mentioned earlier, this third-party PostgreSQL extension is maintained and supported by its maintainers (Timescale), who deploy their own schema on OSS PostgreSQL. We recommend consulting the TimescaleDB documentation and contacting their support team directly with questions specific to TimescaleDB indexing strategies or query formulation.
    • In addition, you can still use Azure PostgreSQL performance tools, such as Azure Monitor metrics, Query Performance Insight (QPI), and troubleshooting guides (TSG), to monitor performance.

    Hope this helps. If this answers your question, please click Accept Answer and mark it as helpful. If you have any further questions, do let us know.

    Thank you.

