Fetching the FIB of a .doc file

Parth Gupta 20 Reputation points
2023-05-10T14:22:16.14+00:00

I am trying to parse a .doc file (Microsoft Word 97-2003 Document) in order to extract the FIB (File Information Block). I am working from the documentation at this link: https://interoperability.blob.core.windows.net/files/MS-DOC/%5bMS-DOC%5d.pdf

It says that the FIB is located at offset 0 from the start of the WordDocument stream. However, I am unable to locate the WordDocument stream. I am reading the file in Python using with open(filename, 'rb'). Please point me to documentation that explains where the WordDocument stream is, or explain how to fetch the FIB given the file's raw bytes. (I know tools like olefile, oledump and oledir exist, but I do not want to use them; I would rather do this the hard way.)


Accepted answer
  1. Tom Jebo 1,906 Reputation points Microsoft Employee
    2023-05-15T03:41:25.8833333+00:00

    Hi @Parth Gupta ,

    The Office binary file formats use what is known as the Compound File Binary format, described in a specification that is referenced by the section just before the one you cite:

    2.1 File Structure

    A Word Binary File is an OLE compound file as specified by [MS-CFB]. The file consists of the following storages and streams.

    This is also known as "structured storage" and is effectively a FAT file system within a file. That means the file is divided into sectors of data, organized into storages (i.e. folders) which contain streams (i.e. files). The streams hold the actual data; the WordDocument stream, for example, contains the FIB at offset 0.
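The layout described above can be walked by hand in pure Python. The following is only a minimal sketch, not a complete [MS-CFB] reader: the function names read_word_document_stream and parse_fib_header are hypothetical, the header and directory-entry offsets are taken from [MS-CFB], and the first two FIB fields (wIdent, nFib) from the FibBase structure in [MS-DOC]. It assumes the file has at most 109 FAT sectors (so no extra DIFAT sectors) and does very little validation.

```python
import struct

CFB_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")
ENDOFCHAIN = 0xFFFFFFFE
NOSTREAM = 0xFFFFFFFF


def read_word_document_stream(data: bytes) -> bytes:
    """Return the 'WordDocument' stream from the raw bytes of a .doc file."""
    if data[:8] != CFB_MAGIC:
        raise ValueError("not a compound file (bad signature)")
    # Header fields: major version, byte order, sector shift, mini sector shift.
    major, _, sector_shift, mini_shift = struct.unpack_from("<4H", data, 26)
    n_fat, dir_start = struct.unpack_from("<2I", data, 44)
    cutoff, minifat_start = struct.unpack_from("<2I", data, 56)
    if n_fat > 109:  # assumption of this sketch: no DIFAT sectors
        raise NotImplementedError("DIFAT sectors are not handled here")
    sec_size, mini_size = 1 << sector_shift, 1 << mini_shift

    def sector(n):
        # Sector 0 starts right after the header, which occupies one sector.
        return data[(n + 1) * sec_size:(n + 2) * sec_size]

    def chain(start, table):
        # Follow a linked list of sector numbers until ENDOFCHAIN.
        out = []
        while start != ENDOFCHAIN:
            if len(out) > len(table):
                raise ValueError("corrupt sector chain")
            out.append(start)
            start = table[start]
        return out

    # The first 109 FAT sector locations are stored in the header itself.
    fat = []
    for s in struct.unpack_from("<109I", data, 76)[:n_fat]:
        fat.extend(struct.unpack(f"<{sec_size // 4}I", sector(s)))

    # The directory is a chain of sectors holding 128-byte entries.
    directory = b"".join(sector(s) for s in chain(dir_start, fat))
    entries = []
    for off in range(0, len(directory), 128):
        raw = directory[off:off + 128]
        name_len = struct.unpack_from("<H", raw, 64)[0]
        name = raw[:max(name_len - 2, 0)].decode("utf-16-le", "replace")
        left, right, child = struct.unpack_from("<3I", raw, 68)
        start, size = struct.unpack_from("<IQ", raw, 116)
        if major == 3:  # version 3 files only use the low 32 bits of the size
            size &= 0xFFFFFFFF
        entries.append((name, raw[66], left, right, child, start, size))

    # Walk the sibling tree under the root entry (entry 0), without
    # descending into sub-storages, looking for the stream by name.
    root, target = entries[0], None
    pending, seen = [root[4]], set()
    while pending:
        i = pending.pop()
        if i == NOSTREAM or i in seen or i >= len(entries):
            continue
        seen.add(i)
        name, kind, left, right = entries[i][:4]
        if kind == 2 and name == "WordDocument":  # object type 2 = stream
            target = entries[i]
            break
        pending += [left, right]
    if target is None:
        raise ValueError("no WordDocument stream in the root storage")

    start, size = target[5], target[6]
    if size >= cutoff:
        return b"".join(sector(s) for s in chain(start, fat))[:size]
    # Streams below the cutoff live in the mini stream, which is itself
    # stored in the sector chain that starts at the root entry.
    container = b"".join(sector(s) for s in chain(root[5], fat))
    raw_minifat = b"".join(sector(s) for s in chain(minifat_start, fat))
    minifat = struct.unpack(f"<{len(raw_minifat) // 4}I", raw_minifat)
    return b"".join(container[s * mini_size:(s + 1) * mini_size]
                    for s in chain(start, minifat))[:size]


def parse_fib_header(stream: bytes):
    """Return (wIdent, nFib), the first two fields of the FIB at offset 0."""
    w_ident, n_fib = struct.unpack_from("<HH", stream, 0)
    if w_ident != 0xA5EC:
        raise ValueError("unexpected wIdent; not a Word binary document")
    return w_ident, n_fib


# Usage, with `filename` pointing at a .doc file:
#     with open(filename, "rb") as f:
#         stream = read_word_document_stream(f.read())
#     print(parse_fib_header(stream))
```

Even this reduced version has to handle the FAT, the directory tree and the mini stream separately, and it still skips DIFAT sectors and most error checking.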

    All this to say that parsing this without an API library to assist in navigating the internal FAT organization to get to this stream is extremely tedious. We typically don't do this but instead use such an API. On Windows, we use the Windows SDK's structured storage API. This API can be found in our developer docs here: https://learn.microsoft.com/en-us/windows/win32/stg/structured-storage-start-page.

    Effectively, you start by calling StgOpenStorageEx to obtain the root storage for the .doc file that you're trying to parse. This gives you an IStorage pointer (in the ppObjectOpen out parameter). Then you call OpenStream on that IStorage, passing "WordDocument" as the name of the stream you want and receiving the IStream pointer in the ppstm out parameter. With that, you call Seek, Read and Write as you would on typical byte streams in other architectures.

    Having said all that, I assume that your use of Python could mean you do not have access to the Windows Structured Storage API set (i.e. you are not on Windows or not in a context that can call these APIs). If that's the case, you would need to find a library to assist in parsing the compound file binary architecture. I see some hits when I search for libraries like that, but I can't recommend any of them as I've not tried them. However, the first hits I see are olefile and oletools. These look promising.

    I hope this helps.

    Best regards,
    Tom Jebo
    Microsoft Open Specifications Support

