Understanding the Word .doc Binary File Format

Summary: Learn about the MS-DOC binary file format that is used in legacy Microsoft Word products, including the basic structures and key concepts for interacting with it programmatically.

Applies to: Office 2007 | Office 2010 | Open XML | Visual Studio Tools for Microsoft Office | Word | Word 2007 | Word 2010

Published:   February 2011

Provided by:   Microsoft Corporation

Contents

  • Overview of the MS-DOC File Format

  • Key Components of the MS-DOC File Format

  • Extracting Text from Word Files

  • Conclusion

  • Additional Resources

This article describes the structures of MS-DOC files and some procedures for working with them. It is part of a series of articles that introduce the binary file formats used by Microsoft Office products. These articles are designed to be used in conjunction with the Microsoft Office File Format Documents available on MSDN.

Overview of the MS-DOC File Format

Microsoft Office Word 2003, Microsoft Word 2002, Microsoft Word 2000, and Microsoft Word 97 all use the MS-DOC binary file format as their default file format. This file format applies to any file that has a .doc or .dot extension. The basic unit of data in a Word document is a character. A character can be a visible text character or a non-visible one, such as a formatting mark, and it can be stored as ANSI or Unicode. All of the character data resides in the WordDocument stream. At the beginning of that stream is a structure called the File Information Block (FIB), which contains pointers to the other data in the file.
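
As a rough illustration of the paragraph above, the following Python sketch checks that a byte string begins with a FIB and reports which Table stream the file uses. The field positions (wIdent at offset 0, nFib at offset 2, a flag word at offset 10 with fEncrypted as bit 8 and fWhichTblStm as bit 9) and the 0xA5EC signature are assumptions taken from the FibBase description in the MS-DOC specification rather than from this article. The function name read_fib_base and the shape of the returned dictionary are hypothetical. The sketch also assumes that the WordDocument stream has already been extracted from the compound file, for example with a third-party package such as olefile.

```python
import struct

FIB_MAGIC = 0xA5EC  # assumed FibBase.wIdent signature (per the MS-DOC specification)


def read_fib_base(word_document: bytes) -> dict:
    """Decode a few FibBase fields from the start of the WordDocument stream."""
    if len(word_document) < 12:
        raise ValueError("stream too short to hold the fields read here")
    # wIdent and nFib are little-endian 16-bit integers at offsets 0 and 2.
    w_ident, n_fib = struct.unpack_from("<HH", word_document, 0)
    # The 16-bit flag word is assumed to sit at offset 10.
    (flags,) = struct.unpack_from("<H", word_document, 10)
    if w_ident != FIB_MAGIC:
        raise ValueError(f"not a Word binary document (wIdent=0x{w_ident:04X})")
    return {
        "nFib": n_fib,
        "encrypted": bool(flags & 0x0100),  # fEncrypted, bit 8
        # fWhichTblStm, bit 9: set means 1Table, clear means 0Table.
        "table_stream": "1Table" if flags & 0x0200 else "0Table",
    }
```

A caller would pass the raw bytes of the WordDocument stream and then open the stream named by the table_stream entry before following any fc/lcb pointer.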

Note

The recommended way to perform most programming tasks in Microsoft Word is to use the Word Primary Interop Assemblies. These are a set of .NET classes that provide a complete object model for working with Microsoft Word. This article series deals only with advanced scenarios, such as where Microsoft Word is not installed.

Key Components of the MS-DOC File Format

Here are some of the most important structures you must know about when you work with .doc files.

  • 2.1.1 WordDocument Stream

    The WordDocument stream is the main stream in a .doc file. It contains the File Information Block and the document text. Most of the structures that the File Information Block points to are stored in the Table stream, which is named either 1Table or 0Table.

    • File Information Block

      The File Information Block starts at offset 0x00 of the WordDocument stream. It specifies the locations of all other data in the file. Each location is specified by a pair of integers: an offset, whose field name is prefixed with fc, and a size in bytes, whose field name is prefixed with lcb. These pairs appear in substructures of the File Information Block, such as the FibRgFcLcb97.

    • Clx structure

      The Clx structure is an array of 0 or more Prc structures, which contain property information, followed by a Pcdt structure, which contains a PlcPcd structure.

  • Character

    A character may be a text character or a non-text character, such as a paragraph mark or an object anchor. Its size may vary according to whether it is ANSI, Unicode, or a control character. Adjacent characters in the document are not necessarily adjacent in the binary file.

    • Character Position (CP)

      A Character Position (CP) is an unsigned, 32-bit integer that gives the index location of a character in the document text.

    • Pcd structure

      A Pcd structure specifies the position of a run of text in the WordDocument stream, together with some properties of that text.

  • Plc

    A PLC structure is an array of CPs, followed by an array of data elements. Different Plc structures have different names and functions, such as the Plcbkf structure, which consists of bookmarks and pointers to bookmarks.

  • PlcPcd structure

    A PlcPcd structure is a PLC structure that maps an array of CPs to Pcd structures. In other words, it maps the positions of characters in the stream to characters in the document text.
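
The Plc layout described above can be sketched in Python as follows. This is a minimal illustration, not the specification's algorithm. It assumes, following the MS-DOC specification, that a Plc with n data elements holds n + 1 four-byte little-endian CPs, and that every data element has the same fixed size (8 bytes for a Pcd). The function name split_plc is hypothetical.

```python
import struct


def split_plc(data: bytes, element_size: int):
    """Split a Plc into its CP array and its array of fixed-size data elements."""
    # Assumed layout: (n + 1) CPs of 4 bytes each, then n data elements, so
    # len(data) == 4 * (n + 1) + element_size * n.
    n, remainder = divmod(len(data) - 4, 4 + element_size)
    if len(data) < 4 or remainder:
        raise ValueError("byte length does not match a Plc with this element size")
    cps = list(struct.unpack_from(f"<{n + 1}I", data, 0))
    start = 4 * (n + 1)  # the data elements begin right after the CP array
    elements = [
        data[start + i * element_size : start + (i + 1) * element_size]
        for i in range(n)
    ]
    return cps, elements
```

For a PlcPcd, calling split_plc(plc_bytes, 8) would return the aCp values and the raw 8-byte Pcd structures, where element i describes the characters from cps[i] up to, but not including, cps[i + 1].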

Extracting Text from Word Files

The formal algorithm for retrieving text is published in the Open Specification documents on MSDN, under 2.4.1 Retrieving Text, and there is an example of part of the process in the Examples section, under 3.1 Example of a Clx. Here is a simplified version of the process.

To extract text from a Word document

  1. Open the .doc file, which is an OLE compound file, and read its WordDocument stream and Table stream into memory.

  2. Begin reading the File Information Block (FIB) at offset 0 of the WordDocument stream. For more information, see 2.5.15 How to read the FIB.

  3. Inside the File Information Block, locate the FibRgFcLcb97 structure. This structure begins at byte 154 of the FIB. It consists of a series of 4-byte fields.

  4. Read the FibRgFcLcb97.fcClx field at byte 264 of the FibRgFcLcb97 structure (offset 0x01A2 from the start of the FIB), and the FibRgFcLcb97.lcbClx field at byte 268 (offset 0x01A6). These specify the offset and the size, in bytes, of the Clx.

  5. Begin reading the Clx structure from the Table stream, at the offset specified by the FibRgFcLcb97.fcClx field.

  6. Inside the Clx structure, locate the Pcdt, which immediately follows the variable-length .RgPrc array of Prc structures.

    For each member of the array:

    1. Read the .clxt attribute, which is byte 0 of the Prc structure. If .clxt = 0x02, you have found the Pcdt.

    2. If .clxt = 0x01, read the next 2 bytes as a signed integer, and then skip ahead that number of bytes to the next member of the array.

  7. Inside the Pcdt structure, locate the PlcPcd structure, which starts at byte 5 of the Pcdt.

  8. Load the PlcPcd.aPcd array and the PlcPcd.aCp array. The members of these arrays correspond to one another by index value.

  9. For each Pcd structure in PlcPcd.aPcd:

    1. Read the value of the Pcd.Fc.fCompressed field at bit 46 of the current Pcd structure. If 0, the Pcd structure refers to a 16-bit Unicode character. If 1, it refers to an 8-bit ANSI character.

    2. Read the value of Pcd.Fc, which is bytes 2-5 of the current Pcd, and the corresponding CP value. Only the low 30 bits of Pcd.Fc hold the offset; bit 30 is the fCompressed flag.

      • If Unicode, the text at the character position specified by the current CP value starts at an offset equal to the value of Pcd.Fc in the WordDocument stream, and occupies two bytes per character.

      • If ANSI, the text at the current CP starts at an offset of half the value of Pcd.Fc in the WordDocument stream, and occupies one byte per character.

      In either case, the number of characters specified by the current CP is equal to the value of the next CP in the array minus that of the current CP.
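
The procedure above can be sketched in Python roughly as follows. This is a simplified illustration under several assumptions, not a complete reader. It assumes that the WordDocument and Table streams have already been extracted from the compound file as byte strings, that fcClx and lcbClx sit at offsets 0x01A2 and 0x01A6 from the start of the FIB (the 34th fc/lcb pair of a FibRgFcLcb97 that begins at byte 154), that bytes 1-4 of the Pcdt hold the size of the PlcPcd, and that 8-bit text can be approximated with the Windows-1252 code page. The function name extract_text and the constant FC_CLX_OFFSET are hypothetical. Encrypted files are not handled.

```python
import struct

FC_CLX_OFFSET = 0x01A2  # assumed FIB offset of fcClx; lcbClx follows 4 bytes later


def extract_text(word_document: bytes, table: bytes) -> str:
    """Return the document text described by the piece table in the Clx."""
    # Steps 2-5: read the fcClx/lcbClx pair and slice the Clx out of the Table stream.
    fc_clx, lcb_clx = struct.unpack_from("<II", word_document, FC_CLX_OFFSET)
    clx = table[fc_clx : fc_clx + lcb_clx]

    # Step 6: skip every Prc (clxt == 0x01) until the Pcdt (clxt == 0x02) is reached.
    pos = 0
    while pos < len(clx) and clx[pos] == 0x01:
        (cb_grpprl,) = struct.unpack_from("<h", clx, pos + 1)  # signed 16-bit size
        pos += 3 + cb_grpprl  # clxt byte + size field + property data
    if pos >= len(clx) or clx[pos] != 0x02:
        raise ValueError("Pcdt not found in Clx")

    # Step 7: the PlcPcd starts at byte 5 of the Pcdt; bytes 1-4 give its size.
    (lcb,) = struct.unpack_from("<I", clx, pos + 1)
    plc = clx[pos + 5 : pos + 5 + lcb]

    # Step 8: n Pcd structures of 8 bytes each are preceded by n + 1 CPs of 4 bytes.
    n = (len(plc) - 4) // 12
    cps = struct.unpack_from(f"<{n + 1}I", plc, 0)

    pieces = []
    for i in range(n):
        # Step 9: bytes 2-5 of the Pcd hold the offset (low 30 bits) and fCompressed (bit 30).
        (fc_raw,) = struct.unpack_from("<I", plc, 4 * (n + 1) + 8 * i + 2)
        compressed = bool(fc_raw & 0x40000000)
        fc = fc_raw & 0x3FFFFFFF
        count = cps[i + 1] - cps[i]  # number of characters in this piece
        if compressed:
            start = fc // 2  # 8-bit text: one byte per character
            raw = word_document[start : start + count]
            pieces.append(raw.decode("cp1252", errors="replace"))
        else:
            raw = word_document[fc : fc + 2 * count]  # 16-bit text: two bytes each
            pieces.append(raw.decode("utf-16-le", errors="replace"))
    return "".join(pieces)
```

The returned string still contains control characters such as paragraph marks, so a caller would typically normalize those before displaying the text.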

Conclusion

This article is just a sample of the MS-DOC format. With the tools provided in this article, simple data recovery should be within your reach. With additional exploration, you can start to recover formatting information and other metadata, and eventually work up to Save operations.

Additional Resources

For more information, see the following resources: