Hello Peter Fine,
Welcome to the Microsoft Q&A and thank you for posting your questions here.
Problem
I understand that you are experiencing missing data occasionally in your Avro files despite successful message routing from IoT Edge devices to Azure Blob storage.
Solution
Thank you for providing the details about your setup. Based on the issue you described, here are some potential solutions to explore:
- By default, Azure IoT Hub writes messages to Azure Blob storage in Avro format. Avro stores both the message body and the message properties, which can make querying the data challenging. You can specify the message format using the `ContentEncoding` and `ContentType` system properties: for JSON data, set `ContentEncoding` to "utf-8" and `ContentType` to "application/json", and ensure that your device messages are correctly formatted with the appropriate encoding and content type.
- Azure Data Lake Analytics can help you query Avro data efficiently. It follows a "pay-per-query" model, which is well suited to non-relational big data. Configure Azure IoT Hub to route data to an Azure Blob storage endpoint, then register that Blob storage account as an additional store in Data Lake Analytics. You can use U-SQL scripts to query the Avro data and export it to other formats (e.g., CSV) in Azure Blob storage.
- Verify that the data in your Avro files doesn't contain non-JSON payloads. Non-JSON data can cause issues when reading Avro files. If you encounter it, consider using Avro tools to convert the data to a readable format or to filter out the problematic records.
- Although you mentioned not seeing any dropped or orphaned messages in the metrics browser, it's essential to continue monitoring your system. Check the IoT Hub metrics, Blob storage metrics, and any other relevant logs to identify any anomalies or patterns related to missing data.
- Ensure that the data synchronization process between IoT Hub, Blob storage, and your data warehouse (ClickHouse) is robust. Consider implementing retries, error handling, and consistency checks to prevent data loss during ingestion.
- Finally, if you're using multiple partitions, verify that the partitioning strategy aligns with your data flow. Parallelizing data processing can help you handle large volumes efficiently.
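To illustrate the first point, here is a minimal sketch of how a device-side message body and system properties could be shaped so that IoT Hub treats the payload as UTF-8 JSON. The payload field names (`deviceId`, `temperature`) are hypothetical, and the dictionary stands in for however your SDK of choice exposes the system properties:

```python
import json

# Hypothetical telemetry payload; field names are illustrative only.
payload = {"deviceId": "edge-01", "temperature": 21.5}

# Encode the body as UTF-8 JSON, matching the system properties below
# so that IoT Hub can interpret (and route on) the message body.
body = json.dumps(payload).encode("utf-8")

# System properties IoT Hub uses to decide how the body is stored/queried.
system_properties = {
    "contentType": "application/json",
    "contentEncoding": "utf-8",
}

print(body)
print(system_properties)
```

With these properties set, routing queries against the message body become possible, and downstream consumers can decode the Avro `Body` field back into JSON.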
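For the point about non-JSON records, a sketch like the following could be used after extracting the `Body` fields from your Avro files: it keeps the records that parse as JSON and sets aside the rest for inspection. The function name and sample records are illustrative assumptions, not part of any Azure SDK:

```python
import json

def filter_json_records(raw_records):
    """Split raw body strings into parsed-JSON records and problematic ones."""
    good, bad = [], []
    for rec in raw_records:
        try:
            good.append(json.loads(rec))
        except (ValueError, TypeError):
            # Non-JSON payloads are collected rather than dropped silently,
            # so missing data can be investigated instead of disappearing.
            bad.append(rec)
    return good, bad

# Illustrative sample: two valid JSON bodies and one garbled record.
raw = ['{"temperature": 21.5}', 'not-json-bytes', '{"temperature": 22.0}']
good, bad = filter_json_records(raw)
print(len(good), len(bad))
```

Logging the `bad` list alongside IoT Hub metrics can help correlate malformed payloads with the gaps you are seeing.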
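For the retry/error-handling suggestion, one common pattern is a small wrapper with exponential backoff around the ingestion call into your warehouse. This is a generic sketch; `ingest_fn` is a placeholder for whatever function actually writes a batch to ClickHouse:

```python
import time

def ingest_with_retries(batch, ingest_fn, max_attempts=3, base_delay=0.1):
    """Call ingest_fn(batch), retrying with exponential backoff on failure.

    Raises the last exception if all attempts are exhausted, so failed
    batches are surfaced rather than silently lost.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return ingest_fn(batch)
        except Exception:
            if attempt == max_attempts:
                raise
            time.sleep(base_delay * 2 ** (attempt - 1))

# Illustrative usage: a fake ingest function that fails twice, then succeeds.
attempts = {"n": 0}

def flaky_ingest(batch):
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise RuntimeError("transient failure")
    return len(batch)

result = ingest_with_retries([1, 2, 3], flaky_ingest)
print(result)
```

A consistency check (e.g., comparing row counts per blob against the warehouse) can then be layered on top to catch batches that were never ingested at all.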
References
For more reading and information, kindly use the additional resources provided on the right side of this page and the following:
Source: Azure IoT Apache Avro format - Stack Overflow. Accessed 7/22/2024.
Source: Query Avro data by using Azure Data Lake Analytics. Accessed 7/22/2024.
Accept Answer
I hope this is helpful! Do not hesitate to let me know if you have any other questions.
**Please don't forget to close the thread by upvoting and accepting this as an answer if it is helpful**, so that others in the community facing similar issues can easily find the solution.
Best Regards,
Sina Salam