I strongly recommend inserting every image with its layout set to In Line with Text. Each image then sits in a paragraph of its own, which you should format as Keep with next so that it stays together with the following paragraph, into which you insert the caption.
If you do that, you should have no problem creating the Table of Contents and, probably more importantly, your document will be less likely to become corrupted by mismatched XML tag errors.
If you need to display images side by side, insert them into the cells of a table.
If you need to annotate or mark up images, thereby creating what I refer to as a compound image, build it in a separate document and then use a screen capture utility to turn it into a single image, which you insert into your document In Line with Text. Keep the separate document in case you need to edit the compound image later.