Choosing the correct setup for a timeseries data

Question

Hannes Caesar 5

I'm seeking advice on how to optimize my time-series database setup, which should handle a large volume of time-series data. I have around 20,000 time-series profiles, each covering one year at a quarter-hourly resolution (4 timestamps per hour). This amounts to approximately 700 million entries. Right now, I am using an Azure PostgreSQL server with the timescaledb extension.

Here are the details of my setup:

Hardware Specifications:

4 vCores
16 GiB memory
512 GB storage

Database Structure: I have two tables, one for the load profiles with the columns (id, time, value, sensor_id), and another table with the columns (id, sensor_id). There are two indexes on the load profile table, one on (sensor_id, time), and another on sensor_id.

Sample Query: A typical query I use to aggregate data is:

SELECT AVG(value), time
FROM public.loadprofilepool
WHERE sensor_id IN (
    SELECT id 
    FROM public.sensor_table
    ORDER BY RANDOM()
    LIMIT 500
)
GROUP BY time;

Please note that this is a sample query where the list of sensor_ids is generated on the fly to mimic retrieval of different (hence random ordering of ids) sets of sensors. In a real situation, the list of ids would come from elsewhere and could contain from a few to couple of thousand sensor ids.
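
The real-world case described above, where the IDs arrive as an explicit list rather than from a random subquery, can be sketched roughly as follows. This is only an illustration: it uses Python's built-in sqlite3 module with a tiny in-memory table so that it runs anywhere, which is an assumption and not the PostgreSQL/TimescaleDB setup from this post. The helper average_profile and the sample rows are hypothetical.

```python
import sqlite3

# Assumption: an in-memory SQLite table stands in for public.loadprofilepool.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE loadprofilepool (time TEXT, value REAL, sensor_id INTEGER)"
)

# Hypothetical sample data: three sensors, two quarter-hour timestamps.
sample_rows = [
    ("2023-01-01 00:00", 1.0, 1),
    ("2023-01-01 00:00", 3.0, 2),
    ("2023-01-01 00:00", 100.0, 3),
    ("2023-01-01 00:15", 2.0, 1),
    ("2023-01-01 00:15", 4.0, 2),
    ("2023-01-01 00:15", 200.0, 3),
]
conn.executemany("INSERT INTO loadprofilepool VALUES (?, ?, ?)", sample_rows)


def average_profile(connection, sensor_ids):
    """Average value per timestamp over an explicit list of sensor ids."""
    # One bound parameter per id, so the ids are never pasted into the SQL text.
    placeholders = ", ".join("?" for _ in sensor_ids)
    query = (
        "SELECT time, AVG(value) FROM loadprofilepool "
        f"WHERE sensor_id IN ({placeholders}) "
        "GROUP BY time ORDER BY time"
    )
    return connection.execute(query, list(sensor_ids)).fetchall()


result = average_profile(conn, [1, 2])
print(result)
```

Sensor 3 is left out of the id list on purpose, so its large values do not affect the averages. With a couple of thousand ids, the number of bound parameters may run into driver or engine limits, so the same idea might need a different way of passing the list on a real server.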

Data Distribution: For now, there are 24 * 4 * 365 rows (one year at quarter-hourly resolution) per sensor, and there are 20,000 sensors. In the future, there will also be live sensor data, whose distribution will depend on the specific sensor.
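
As a rough check of these figures, the arithmetic can be sketched as below. It uses only the numbers quoted in this post; the function rows_scanned is a hypothetical helper, and it assumes a query covers the full year for every selected sensor (the sample query has no filter on time).

```python
# Back-of-the-envelope row counts from the figures quoted in the post.
READINGS_PER_HOUR = 4
HOURS_PER_DAY = 24
DAYS_PER_YEAR = 365
SENSOR_COUNT = 20_000

rows_per_sensor = HOURS_PER_DAY * READINGS_PER_HOUR * DAYS_PER_YEAR
total_rows = rows_per_sensor * SENSOR_COUNT


def rows_scanned(selected_sensors: int) -> int:
    """Rows read when the full year is aggregated for each selected sensor."""
    return selected_sensors * rows_per_sensor


print(f"rows per sensor: {rows_per_sensor:,}")
print(f"total rows: {total_rows:,}")
for n in (10, 500, 2_000):
    print(f"{n:>5} sensors -> {rows_scanned(n):,} rows")
```

So an aggregation over 500 sensors reads about 17.5 million rows, and one over 2,000 sensors reads about 70 million, which gives a sense of the work behind the target of a few seconds.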

Performance Metrics: When running these queries, the CPU usage does not exceed 20% and memory usage is constant at about 40%.

Given these details, I'm struggling with query speed. Extracting 10 to 1000 profiles and aggregating them to generate a time series with one value per timestamp currently takes from about 5 seconds to several minutes, whereas my target is a few seconds for an aggregation of a couple of thousand sensors.

My questions are as follows:

Is my current setup the most efficient for handling and querying this volume and type of time-series data? If not, could you suggest alternative methods? I've considered NoSQL databases, cloud storage with Zarr or NetCDF files, but I'm not sure which, if any, would be more suitable.
How can I optimize my current setup to achieve faster query results? Are there specific TimescaleDB or PostgreSQL configurations or optimizations, indexing strategies, or query formulation tactics that would help improve performance?

Thank you in advance for your help. Any suggestions or guidance would be greatly appreciated.

Best regards, Hannes

I have tried to create different indexes and cluster the loadprofilepool table.

SSingh-MSFT 16,371 Reputation points Moderator

2023-07-06T13:12:14.29+00:00

Hi Hannes Caesar,

Welcome to Microsoft Q&A forum and thanks for using Azure Services.

As I understand, you want to know how to choose the correct setup for time-series data.

I am checking on this and will get back to you. Thanks

SSingh-MSFT 16,371 Reputation points Moderator

2023-07-10T08:56:44.24+00:00

Hi Hannes Caesar,

Following up to see if the below suggestion was helpful. If this answers your query, do click Accept Answer and Mark Helpful for the same. And, if you have any further query do let us know.

Answer 1

Hi Hannes Caesar,

Refer to the performance tuning documents here:

https://learn.microsoft.com/en-us/azure/postgresql/single-server/tutorial-monitor-and-tune

https://azure.microsoft.com/en-us/blog/performance-updates-and-tuning-best-practices-for-using-azure-database-for-postgresql/

https://learn.microsoft.com/en-us/azure/postgresql/flexible-server/concepts-intelligent-tuning

The answer to your question depends on many other factors of your design. You can read through these documents and use what is appropriate for your specific use case.

Thank you.

Answer 2

Hi Hannes Caesar,

Thanks for your patience.

I have received the reply below from the internal team, with answers inline to the questions asked above:

1. Is my current setup the most efficient for handling and querying this volume and type of time-series data? If not, could you suggest alternative methods? I've considered NoSQL databases, cloud storage with Zarr or NetCDF files, but I'm not sure which, if any, would be more suitable.

TimescaleDB is a third-party PostgreSQL extension that offers time-series database features. All third-party extensions offered in Azure Database for PostgreSQL - Flexible Server, including TimescaleDB, are open-source licensed code.
Based on usage, TimescaleDB with PostgreSQL is a popular choice for running time-series workloads; however, this largely depends on your specific use case.

2. How can I optimize my current setup to achieve faster query results? Are there specific TimescaleDB or PostgreSQL configurations or optimizations, indexing strategies, or query formulation tactics that would help improve performance?

As mentioned earlier, this third-party PostgreSQL extension is maintained and supported by the extension maintainers (Timescale), and they deploy their own schema on OSS PostgreSQL. We recommend consulting the TimescaleDB documentation and contacting their support team directly for questions specific to TimescaleDB indexing strategies or query formulation.
In addition, customers can still use Azure PostgreSQL performance tools, such as Azure Monitor Metrics, Query Performance Insight (QPI), and Troubleshooting Guides (TSG), for monitoring performance.

Hope this helps. If this answers your query, do click Accept Answer and Mark Helpful for the same. And, if you have any further query do let us know.

Thank you.

Hannes Caesar 5 Reputation points

2023-07-10T11:54:20.3+00:00

Hi and thanks for your answer,

I see that the optimal setup always depends on the requirements. However, my question regarding performance has not quite been answered.
I want to know whether my aim of reducing query times to below 5 seconds (for a table with 700 million rows) is easy to achieve with postgres and timescaledb or not. Am I on the right track, or should I try a different solution for such a performance target?

Here, the answer is referring to other sources of information, which is fine, I guess.

If you could give more specific answers regarding the performance targets I have, this would be very helpful.

Hannes Caesar 5 Reputation points

2023-07-18T09:32:45.04+00:00

Thank you. I will consider these documents.
