Creating an Index from a PDF

Question

Anonymous

I have a PDF of a book, created in QuarkXPress, that needs an index. We've converted the PDF to Word, but the page numbers came through as plain text, not as anything Word recognizes as page numbering. Is there any way to generate an index without recognizable page numbering? If not, is there any way to add a page number without disturbing the exact page breaks of a 400-page document; in essence, floating it in the header/footer so it doesn't interfere with the page breaks?

Thank you!

Locked Question. This question was migrated from the Microsoft Support Community. You can vote on whether it's helpful, but you can't add comments or replies or follow the question.

Answer 1

Thank you, Jay! This is so good to know.

I'm helping a writer index a 405-page book that was converted from QuarkXPress. He has a concordance file. Just to test that this would work, I trimmed the concordance file down to a few entries. I've tried the AutoMark feature, but the error I get back is: "No Index Entries Were Marked". One more piece of information: the file is made up entirely of text boxes. That seems to be how Adobe converted the PDF to Word.

That is, unfortunately, the way Adobe and most optical character recognition programs do conversions, in an attempt to keep everything in the same position on the page as in the PDF. Because text boxes are in the graphics layer in Word rather than in the text layer, many features, including AutoMark, can't "see" the text box contents at all. However, the INDEX field will see XE fields that you insert into text boxes, though you have to insert them one at a time instead of using AutoMark.

Trying to move the text out of the text boxes often runs into a problem: each page contains only one empty paragraph mark in the text layer, and all the text boxes are anchored to that paragraph. If a macro moves the text into the text layer, there's no way to be sure of the order in which to place the pieces, so you would have to examine each page to be sure they weren't scrambled.

You might be better off making an index by hand, by adding page numbers to the concordance file. The Find feature in Word's Navigation pane, when set to "Pages", will show you thumbnails of all the pages where a particular term appears in the book's file, and you can put those page numbers into the concordance. When that's done, you can copy/paste or use Insert > Object > Text From File to place the index at the end of the book file.
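If you can get the text of each page as a separate string (for example, by copying it out of the PDF page by page), the same hand-built index can be roughed out in code. This is only a minimal Python sketch under that assumption, not a Word feature: `build_index`, `format_pages`, `render_index`, and the idea of a `pages` list are hypothetical names chosen for illustration. It matches whole words case-insensitively and collapses consecutive pages into ranges.

```python
import re


def build_index(pages, terms):
    """Map each term to the sorted 1-based page numbers where it occurs.

    pages: list of strings, one per page, in page order (assumed input).
    terms: iterable of concordance terms.
    """
    index = {}
    for term in terms:
        # Whole-word, case-insensitive match; re.escape handles punctuation.
        pattern = re.compile(
            r"(?<!\w)" + re.escape(term) + r"(?!\w)", re.IGNORECASE
        )
        hits = [n for n, text in enumerate(pages, start=1) if pattern.search(text)]
        if hits:
            index[term] = hits
    return index


def format_pages(numbers):
    """Collapse sorted page numbers into ranges, e.g. [3, 4, 5, 9] -> '3-5, 9'."""
    parts = []
    start = prev = numbers[0]
    for n in numbers[1:]:
        if n == prev + 1:
            prev = n
            continue
        parts.append(str(start) if start == prev else f"{start}-{prev}")
        start = prev = n
    parts.append(str(start) if start == prev else f"{start}-{prev}")
    return ", ".join(parts)


def render_index(index):
    """Return one line per term, sorted alphabetically without regard to case."""
    ordered = sorted(index.items(), key=lambda kv: kv[0].lower())
    return [f"{term}, {format_pages(nums)}" for term, nums in ordered]
```

The lines returned by `render_index` could be saved to a text file and pasted at the end of the book. The page numbers are only as good as the per-page text you feed in, so spot-check a few entries against the Navigation pane results.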

I wish I had better news.

Answer 2

Word's indexing does work without any visible (or machine-readable) page numbers on the pages. You insert XE fields (which are hidden text, shown only when the ¶ button is activated) into the text next to or near the terms being indexed. When you insert and update an INDEX field at the end of the document, the entries are created with the page numbers that Word *would show if there were a PAGE field in the header*. There does not have to be a PAGE field or any page number actually in place. (And the text numbers from the PDF are completely ignored.)

Bonus tip: After you insert the XE fields and before you update the INDEX field, be sure to turn off the display of hidden text. Otherwise, the visible XE fields will cause the page breaks to change, and some (or many) of the numbers in the index will be incorrect.
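As a rough illustration of what those fields contain, here is a small Python sketch that builds the field-code text for a few terms. The `XE "..."` form, the colon that separates a main entry from a subentry, and the INDEX field's `\c` (columns) switch are Word field syntax; the function names and sample terms are made up for illustration. The output is the text that goes between field braces inserted with Ctrl+F9, not something Word can import directly.

```python
def xe_field(main_entry, subentry=None):
    """Return the field-code text for an index entry (XE) field."""
    entry = main_entry if subentry is None else f"{main_entry}:{subentry}"
    return f'XE "{entry}"'


def index_field(columns=2):
    """Return the field-code text for an INDEX field with a column count."""
    return f'INDEX \\c "{columns}"'


if __name__ == "__main__":
    # Sample terms are placeholders; substitute entries from the concordance.
    for term in ["pagination", "text boxes"]:
        print(xe_field(term))
    print(xe_field("fields", "XE"))
    print(index_field())
```

Seeing the codes spelled out this way can make it easier to check, with field codes displayed, that each XE field sits in the text box next to the term it indexes.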

Answer 3

Thank you, Jay! This is so good to know.

I'm helping a writer index a 405-page book that was converted from QuarkXPress. He has a concordance file. Just to test that this would work, I trimmed the concordance file down to a few entries. I've tried the AutoMark feature, but the error I get back is: "No Index Entries Were Marked". One more piece of information: the file is made up entirely of text boxes. That seems to be how Adobe converted the PDF to Word.

Answer 4

Thank you for your reply, Charles.

I have compared the documents before and after conversion, and they are exactly the same, with the same number of pages. The problem isn't the page numbers generated by QuarkXPress, which were converted to text in the Word document; it's that to get the index to work I need Word's own page numbering. So my requirements are: 1) Word page numbering, 2) numbers appear on every page, 3) numbers cannot modify the existing page breaks.

Putting the number in a header or footer meets the "every page" requirement, but it breaks the "cannot modify existing page breaks" requirement. I've tried making the page number font 1 pt. I've tried a text box containing the page number field, with the text box set to float. Neither works, because the header/footer itself takes up page space and shifts the page breaks.

It seems like indexing should work just from Word knowing its own internal page count, without forcing me to insert a page number?

Answer 5

Word documents from conversions are a royal pain to edit.

Pagination is highly unlikely to be the same in the PDF and the Word document.

Every Word document has page numbering that would be recognized by the Indexing features. You would need to remove the text page numbers from the converted document and then add page numbers in Word.
