We checked with our product team to understand the inner details of the two fields in the response. Here are more details to clarify this.
The Tags section of the response is produced by a model that is different from the Description[Tags] model, and the two use different threshold settings internally. As a result, they return different sets of tags: some tags appear in both sections, while others appear only in the Description[Tags] section, as in the response below.
{
  "description": {
    "tags": ["outdoor", "road", "grass", "path", "trail", "forest", "tree", "side", "area", "narrow", "country", "track", "train", "street", "traveling", "dirt", "covered", "sign", "riding", "standing", "stop", "man", "red", "snow"],
    "captions": [{
      "text": "a path with trees on the side of a road",
      "confidence": 0.965715635493424
    }]
  },
  "requestId": "<id>",
  "metadata": {
    "width": 800,
    "height": 600,
    "format": "Jpeg"
  }
}
The threshold used for the Tags section is tuned for precision, while the Description[Tags] section is tuned for recall, so that more candidate words are available for captioning an image and generating a sentence.
For more detail on precision and recall, please see the Custom Vision documentation, which explains these concepts.
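To make the trade-off concrete, here is a small illustrative sketch with made-up tag sets (not real output from the service): the high-threshold list gets every tag right but misses concepts, while the low-threshold list covers more concepts at the cost of some incorrect tags.

# Illustrative only: hypothetical ground-truth and predicted tag sets.
ground_truth = {"outdoor", "road", "grass", "path", "tree", "forest", "dirt", "sign"}

precise_tags = {"outdoor", "road", "path", "tree"}                  # high-threshold style
recall_tags = {"outdoor", "road", "grass", "path", "tree",
               "forest", "dirt", "train", "snow", "man"}            # low-threshold style

def precision(predicted, truth):
    # Fraction of predicted tags that are correct.
    return len(predicted & truth) / len(predicted)

def recall(predicted, truth):
    # Fraction of true tags that were predicted.
    return len(predicted & truth) / len(truth)

print(precision(precise_tags, ground_truth), recall(precise_tags, ground_truth))  # 1.0, 0.5
print(precision(recall_tags, ground_truth), recall(recall_tags, ground_truth))    # 0.7, 0.875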
So, the two sections exist to support different customer scenarios: use Tags when precision with higher thresholds is the priority, or use Description[Tags] when recall is the priority, i.e. when generating a sentence or caption for an image is the main objective.
Source: Azure Documentation