Imputation of missing data in time series big data

Sanyal, Mihir 26 Reputation points
2023-03-23T18:22:36.59+00:00

Hi,

I have been working on a solution for time series data analysis and insights using Azure Data Explorer (ADX) for big data analytics. The time series dataset is stored in Azure Data Lake Storage (ADLS), and it has missing data for certain periods that we need to impute for business insights.

Example :

hour  metric_value
1     10
2     12
3     9
4     15
6     7
9     14

In the above example, I need to impute values for hours 5, 7 and 8.

We ingest the dataset from ADLS into ADX. Is there a best practice for imputing this data: should imputation happen before ingestion into ADX, or should we do it in ADX using its native features, if any? We also need to be cost-conscious, as we will ingest the time series big data in batch mode.

If you need more clarity, please let me know. I appreciate your help on this.

Thanks,

Mihir

Azure Data Explorer
An Azure data analytics service for real-time analysis on large volumes of data streaming from sources including applications, websites, and internet of things devices.

Accepted answer
  1. BhargavaGunnam-MSFT 26,496 Reputation points Microsoft Employee
    2023-03-29T00:21:22.7633333+00:00

    Hello Sanyal, Mihir,

    Welcome to the MS Q&A platform.

    Since you are using Azure Data Explorer (ADX) for big data analytics, you can use its native features to impute missing data. ADX provides native functions for time series analysis and data manipulation, including the make-series operator for building regular series and the series_fill_linear(), series_fill_forward(), series_fill_backward() and series_fill_const() functions for filling missing values.
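To make the idea of interpolation concrete, here is a minimal Python sketch of linear gap filling on the example data from the question. It is not ADX code, only an illustration of the kind of result a linear fill produces on a regular hourly grid. The function name `fill_linear` and the choice to leave gaps at the edges unfilled are assumptions made for this sketch.

```python
def fill_linear(points, start, stop, step=1):
    """Return (hour, value) pairs on a regular grid, linearly interpolating gaps.

    `points` is an iterable of (hour, value) pairs with some hours missing.
    Gaps before the first or after the last known hour are left as None
    (an assumption of this sketch, not a statement about ADX behaviour).
    """
    known = dict(points)
    hours = sorted(known)
    filled = []
    for h in range(start, stop + 1, step):
        if h in known:
            filled.append((h, float(known[h])))
            continue
        # Nearest known neighbours on each side of the missing hour.
        left = max((k for k in hours if k < h), default=None)
        right = min((k for k in hours if k > h), default=None)
        if left is None or right is None:
            filled.append((h, None))
            continue
        # Straight-line interpolation between the two neighbours.
        frac = (h - left) / (right - left)
        filled.append((h, known[left] + frac * (known[right] - known[left])))
    return filled


# Example data from the question: hours 5, 7 and 8 are missing.
series = [(1, 10), (2, 12), (3, 9), (4, 15), (6, 7), (9, 14)]
filled = fill_linear(series, start=1, stop=9)

for hour, value in filled:
    print(hour, value)
```

With this data, hour 5 is filled with the midpoint of hours 4 and 6, and hours 7 and 8 lie on the line between hours 6 and 9. Whether linear interpolation is appropriate depends on the metric; for counters or step-like signals a forward fill may be a better fit.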

    Regarding your question about whether to impute the data before ingestion into ADX or in ADX itself, it is generally recommended to impute the data before ingestion if possible. Queries then read series that are already complete and do not have to fill gaps each time they run, which can improve query performance. However, if the missing data is spread out over a long period of time, it may be more efficient to ingest the data as-is and impute it in ADX using the native features.
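If you impute before ingestion, the batch step could look roughly like the following Python sketch. It forward-fills each missing hour with the last seen value and writes CSV text ready for ingestion. The function names and the `is_imputed` column are hypothetical, added here so that imputed rows can be told apart from measured ones after ingestion; forward fill is just one possible strategy.

```python
import csv
import io


def impute_forward(rows, step=1):
    """Yield (hour, value, is_imputed), filling gaps with the last seen value."""
    prev_hour, prev_value = None, None
    for hour, value in sorted(rows):
        if prev_hour is not None:
            # Emit one synthetic row per missing grid point in the gap.
            for missing in range(prev_hour + step, hour, step):
                yield (missing, prev_value, True)
        yield (hour, value, False)
        prev_hour, prev_value = hour, value


def to_csv(rows):
    """Serialize rows to CSV text with a header line."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["hour", "metric_value", "is_imputed"])  # is_imputed is a hypothetical column
    writer.writerows(rows)
    return buf.getvalue()


# Example data from the question: hours 5, 7 and 8 are missing.
raw = [(1, 10), (2, 12), (3, 9), (4, 15), (6, 7), (9, 14)]
imputed = list(impute_forward(raw))
csv_text = to_csv(imputed)
print(csv_text)
```

Keeping a flag such as `is_imputed` lets downstream queries exclude or down-weight synthetic values, at the price of one extra column per row.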

    Regarding cost considerations, imputing the data before ingestion may be more cost-effective, because the imputation runs once per batch instead of on every query. Note that the imputed rows add to the volume that is ingested and stored, so this depends on the specifics of your use case, and you may want to consider the cost of storage and query performance in addition to the cost of ingestion.

    I hope this helps. Please let us know if you have any further questions.

