Understand the fruit classification model
This solution combines computer vision and speech capabilities from Foundry Tools.
Note
The linked GitHub sample repository (MicrosoftDocs/mslearn-oxford.create-image-recognition-with-azure-iot-edge) is archived as of September 2025 and is in read-only state. You can still browse and clone the linked module folders, but they no longer accept issues or pull requests. If you intend to make code changes for your own use, fork the repository to your own GitHub account and work in the fork.
Azure AI Custom Vision
The Azure AI Custom Vision service lets you create an image classification machine learning model without being a data science or machine learning expert. In a Custom Vision project, you upload multiple collections of labeled images. For example, you could upload a collection of banana images and label them as 'banana'.
In this module, you use a provided, prebuilt classification model. Azure AI Custom Vision was used before the exercise to create and export the model used by the Image Classification module; you don't create, train, or export a model as part of this module.
If you want to replace the provided model after the exercise, create and train your own Azure AI Custom Vision project and export the trained model, but keep those creation, training, and export steps outside this exercise. For an IoT Edge-exportable fruit classifier, set the project type to Classification and the classification type to Multiclass (single tag per image), choose the General (compact) domain, and under Export Capabilities, select Basic platforms. Custom Vision exports only trained iterations that use a compact domain, so don't choose a non-compact domain such as Food or General for local IoT Edge use. See Build an image classifier with Custom Vision.
Training-image guidance for that optional replacement model comes from the Custom Vision classifier guidance: use at least 30 images per tag in the initial training set, and collect a few extra images to test the trained model. Choose images that vary in camera angle, lighting, background, visual style, individual or grouped subjects, subject size, and type. Training images must be .jpg, .png, .bmp, or .gif files no larger than 6 MB; prediction images must be no larger than 4 MB. Prefer source images that are at least 256 pixels on the shortest edge. Custom Vision accepts smaller training images and automatically upscales them, but upscaling can reduce useful detail and affect model quality. See Choose training images and Custom Vision limits and quotas.
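The format, size, and count guidance above can be checked locally before you upload anything. The following is a minimal sketch, not part of the linked sample: the function name check_tag_folder is hypothetical, it assumes one folder of images per tag, and it assumes that 6 MB means 6 * 1024 * 1024 bytes. It doesn't inspect pixel dimensions, because that would need an imaging library.

```python
from pathlib import Path

# Values quoted in the guidance above; confirm them against the current
# Custom Vision limits and quotas page before relying on them.
ALLOWED_EXTENSIONS = {".jpg", ".png", ".bmp", ".gif"}
MAX_TRAINING_BYTES = 6 * 1024 * 1024  # assumption: "6 MB" read as 6 MiB
MIN_IMAGES_PER_TAG = 30


def check_tag_folder(folder):
    """Return a list of problems found in one tag's image folder."""
    problems = []
    usable = 0
    for path in sorted(Path(folder).iterdir()):
        if not path.is_file():
            continue  # ignore subfolders
        if path.suffix.lower() not in ALLOWED_EXTENSIONS:
            problems.append(f"{path.name}: unsupported format")
        elif path.stat().st_size > MAX_TRAINING_BYTES:
            problems.append(f"{path.name}: larger than 6 MB")
        else:
            usable += 1
    if usable < MIN_IMAGES_PER_TAG:
        problems.append(
            f"usable image count {usable} is below the recommended {MIN_IMAGES_PER_TAG}"
        )
    return problems
```

An empty list means the folder passed these three checks only; image variety and resolution still need a manual review.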
Export the optional replacement model from the Performance tab for a trained compact-domain iteration. If Export is unavailable, the selected iteration doesn't use a compact domain; use the Iterations section of the Performance tab to select an iteration that does, or switch to a compact domain and retrain before exporting. For IoT Edge, choose DockerFile as the platform and Linux as the version, then export and download the package. A Docker container, including a Linux container, is one supported export option for offline or edge scenarios.
The extracted .zip package should contain the app and azureml folders and the Dockerfile and README files. Copy the exported model files into the architecture-specific module source tree that the module Dockerfile actually uses, not only into the module root. For the AMD64 lab path in the linked sample, modules\ImageClassifierService\Dockerfile.amd64 copies modules\ImageClassifierService\cv-amd64\app into /app, so replace the contents of modules\ImageClassifierService\cv-amd64\app with the exported app contents. If your export package or target architecture uses a different layout, update the module folder structure and Dockerfile COPY statements together so the built image includes every exported model asset it needs. See Export your model and Build an image classifier with Custom Vision.
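The copy step for the AMD64 path can be scripted so that no stale model files are left behind. This is a minimal sketch under stated assumptions: replace_model_app is a hypothetical helper, not a script from the sample repository, and it assumes the export package has already been extracted to a folder that contains an app subfolder. It deletes the target folder first, so work in a fork or keep a backup.

```python
import shutil
from pathlib import Path


def replace_model_app(export_dir, module_app_dir):
    """Replace module_app_dir with a copy of <export_dir>/app.

    Returns the relative paths of the files now in module_app_dir.
    """
    source = Path(export_dir) / "app"
    target = Path(module_app_dir)
    if not source.is_dir():
        raise FileNotFoundError(f"no app folder in export package: {source}")
    if target.exists():
        shutil.rmtree(target)  # remove the previous model files entirely
    shutil.copytree(source, target)  # recreates target from the export
    return sorted(
        p.relative_to(target).as_posix() for p in target.rglob("*") if p.is_file()
    )
```

For the lab path described above, module_app_dir would be the cv-amd64\app folder under modules\ImageClassifierService. Compare the returned file list with the COPY statements in Dockerfile.amd64 before you rebuild the image.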
Note
Azure AI Custom Vision is planned for retirement on September 25, 2028. Microsoft will support existing Azure Custom Vision customers until then. Current Microsoft Custom Vision migration guidance includes Azure Machine Learning AutoML for traditional image classification and object detection, generative models in Microsoft Foundry, and Azure Content Understanding in Foundry Tools for managed generative-AI classification workflows. Check the migration guidance for current service status and recommendations. New content and production plans should evaluate these paths before relying on Custom Vision model creation or export.
Azure Speech text to speech
Azure Speech in Foundry Tools (also known as Azure AI Speech) provides text to speech, which converts text or Speech Synthesis Markup Language (SSML) into synthesized audio. This solution uses text to speech to announce item labels. Free-tier allowances and pricing can change; check Azure Speech pricing for current text to speech quotas and costs.
For Text-to-speech REST API calls, the Azure Speech key must match the regional endpoint you call, such as https://<region>.tts.speech.microsoft.com/cognitiveservices/v1; keys are valid only in the region where the Speech resource was created. The current linked sample deployment template exposes only azureSpeechServicesKey, and the Camera Capture Module code calls the Southeast Asia (southeastasia) Speech REST endpoints. For this exercise, use a Speech resource in Southeast Asia with the existing template. To use another region, first update the sample code and deployment template to make the Speech region or endpoint configurable.
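If you do make the Speech region configurable, one option is to build the endpoint from a single region setting. The sketch below is illustrative and is not the sample's code: the SPEECH_REGION setting name and both function names are hypothetical, and the URL pattern is the one quoted above. The key you send must still belong to a Speech resource in the same region.

```python
import os
import re

# Region the linked sample calls, per the text above.
DEFAULT_REGION = "southeastasia"


def tts_endpoint(region):
    """Build the regional text to speech REST endpoint for a Speech region."""
    region = region.strip().lower()
    if not re.fullmatch(r"[a-z0-9]+", region):
        raise ValueError(f"unexpected region identifier: {region!r}")
    return f"https://{region}.tts.speech.microsoft.com/cognitiveservices/v1"


def endpoint_from_env(env=None):
    """Read the hypothetical SPEECH_REGION setting, else use the sample's region."""
    env = os.environ if env is None else env
    return tts_endpoint(env.get("SPEECH_REGION", DEFAULT_REGION))
```

Keeping the region in one setting means the deployment template and the module code can't drift apart on which endpoint is called.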
Before you build or modify the linked sample's text-to-speech code, verify the current en-AU neural voice names in the live Azure Speech language and voice support catalog. You can also use the Text-to-speech REST API voices list endpoint for your Speech region. The archived sample contains en-AU-Catherine, a retired non-neural Standard voice; don't deploy with that value. Replace it with a currently supported en-AU neural voice. Voice names such as en-AU-JoanneNeural, en-AU-NatashaNeural, or en-AU-WilliamNeural are examples only; always check the catalog for the latest supported voices before using one.
Also update the SSML generated by modules\CameraCaptureOpenCV\app\azure_text_speech.py. The archived sample hardcodes xml:lang values as en-us or en-US, so replacing only the voice in modules\CameraCaptureOpenCV\app\speech_map_australian.json with an en-AU-*Neural voice leaves the SSML locale inconsistent. Change every SSML xml:lang value used for the <speak> and <voice> elements to match the selected voice locale, such as en-AU, or refactor the code and configuration to derive the locale from selected voice metadata and render that locale into the SSML.
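One way to keep the SSML locale and the voice consistent is to derive the locale once and render it into both elements. The following sketch is an assumption-laden illustration, not the code in azure_text_speech.py: both function names are hypothetical, and the locale is taken from the first two segments of the voice name, which only works for names shaped like en-AU-NatashaNeural. The Locale value returned by the voices list endpoint is a more reliable source.

```python
from xml.sax.saxutils import escape, quoteattr


def locale_from_voice(voice_name):
    """Derive a locale such as en-AU from a name such as en-AU-NatashaNeural.

    Assumption: the name starts with a two-segment locale prefix.
    """
    parts = voice_name.split("-")
    if len(parts) < 3 or not all(parts[:3]):
        raise ValueError(f"cannot derive a locale from {voice_name!r}")
    return "-".join(parts[:2])


def build_ssml(text, voice_name):
    """Render SSML whose <speak> and <voice> elements share one locale."""
    locale = quoteattr(locale_from_voice(voice_name))
    return (
        f'<speak version="1.0" xml:lang={locale}>'
        f"<voice xml:lang={locale} name={quoteattr(voice_name)}>"
        f"{escape(text)}"  # escape &, <, > in the item label
        "</voice></speak>"
    )
```

For example, build_ssml("banana", "en-AU-NatashaNeural") produces a document in which both xml:lang attributes are en-AU, so changing the voice no longer leaves a stale en-US locale behind.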
Warning
The azureSpeechServicesKey value is a secret. Use your own key only as a local lab value when you prepare your deployment, and remove any hardcoded default before you build images or generate a deployment manifest. Don't commit real keys, and rotate the key if it's exposed.
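A common way to avoid a hardcoded default is to read the key from the environment and fail fast when it's missing. This is a minimal sketch with hypothetical names: the AZURE_SPEECH_KEY variable, the placeholder values, and both functions are illustrative and don't come from the sample or its deployment template.

```python
import os

# Illustrative placeholder values that should never reach a deployment.
PLACEHOLDERS = {"", "<your-key>", "changeme"}


def load_speech_key(env=None, name="AZURE_SPEECH_KEY"):
    """Return the key from the environment, or raise if it is unset."""
    env = os.environ if env is None else env
    value = env.get(name, "").strip()
    if value in PLACEHOLDERS:
        raise RuntimeError(f"{name} is not set; supply it outside source control")
    return value


def mask(secret):
    """Show only the last four characters, for safe logging."""
    if len(secret) <= 4:
        return "*" * len(secret)
    return "*" * (len(secret) - 4) + secret[-4:]
```

Log mask(key) rather than the key itself if you need to confirm which key a module picked up.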
In this module, the Camera Capture Module handles scanning items using a camera. It then calls the Image Classification module to identify the item, calls Azure Speech text to speech to convert the item label to speech, and plays the scanned item's name on the attached speaker.
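The order of those steps can be sketched as a single function. This is not the Camera Capture Module's code: announce_item and its three callables (classify, synthesize, play) are hypothetical stand-ins for the Image Classification call, the text to speech call, and speaker playback, and skipping the announcement when no label is returned is an assumption.

```python
def announce_item(frame, classify, synthesize, play):
    """Classify a camera frame, then speak the label through the speaker.

    classify(frame) -> label string, or None when nothing is recognized
    synthesize(label) -> audio bytes
    play(audio) -> plays the audio on the attached speaker
    """
    label = classify(frame)
    if not label:
        return None  # assumption: stay silent when there is no label
    audio = synthesize(label)
    play(audio)
    return label
```

Passing the three steps in as callables keeps the flow testable without a camera, a Speech resource, or a speaker.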