Clarification on 'WordDocument' stream in OLE file format

Question

Parth Gupta 180 Reputation points

Hi,

I am trying to parse a '.doc' file (OLE file) in Python. I am trying to understand the structure of the 'WordDocument' stream inside the file.
With reference to [MS-DOC] and [MS-CFB], it is known that this stream must be present and have the FIB (File Information Block) at zero offset.

I can parse the complete FIB.
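As a rough illustration of what parsing the start of the FIB involves, the sketch below reads three FibBase fields from the first bytes of the 'WordDocument' stream. The helper name `read_fib_base` and the synthetic header are hypothetical and not from the original post; the offsets assume the FibBase layout in [MS-DOC] (wIdent at offset 0, nFib at offset 2, and the flag word containing fWhichTblStm at offset 0x0A).

```python
import struct

def read_fib_base(word_document: bytes) -> dict:
    """Read a few FibBase fields from the start of the WordDocument stream.

    Hypothetical helper; field offsets assume the FibBase layout in [MS-DOC].
    """
    # wIdent and nFib are two little-endian uint16 values at offsets 0 and 2.
    w_ident, n_fib = struct.unpack_from("<HH", word_document, 0)
    # The 16-bit flag word sits at offset 0x0A; bit 9 is fWhichTblStm.
    (flags,) = struct.unpack_from("<H", word_document, 0x0A)
    return {
        "wIdent": w_ident,
        "nFib": n_fib,
        "table_stream": "1Table" if flags & 0x0200 else "0Table",
    }

# Synthetic 32-byte FibBase for illustration only (not taken from a real file).
header = bytearray(32)
struct.pack_into("<HH", header, 0, 0xA5EC, 0x00C1)
struct.pack_into("<H", header, 0x0A, 0x0200)

info = read_fib_base(bytes(header))
print(hex(info["wIdent"]), info["table_stream"])
```

With a real file, `word_document` would be the raw bytes of the 'WordDocument' stream, and the `table_stream` value indicates which Table stream the FIB offsets such as fcClx refer to.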

However, I am unable to understand the content of the 'WordDocument' stream after the FIB. It contains some of the document text, but also a strange pattern of letters (see the attached screenshot of a file opened in a hex editor). Could you explain what this is?

In the documentation, it is mentioned that this stream has no pre-determined format other than the FIB being present. But then, something must explain what type of data is stored in this stream after the FIB.

Kindly point me to any documentation that explains in detail the contents of the 'WordDocument' stream other than the FIB.

Thanks

[Attached screenshot: the file opened in a hex editor]

KristianSmith-MSFT 446 Reputation points Microsoft Employee Moderator

2024-06-02T20:56:37.3533333+00:00

Hi Parth,

Thank you for your request. One of our team members will review this and follow up with you soon.

Regards,
Kristian S
Support Escalation Engineer
Microsoft Open Specifications

Accepted answer

Answer 1

Tom Jebo 2,336 Reputation points Microsoft Employee Moderator

Hi @Parth Gupta,

Besides the FIB, the WordDocument stream is primarily document content, i.e. the actual text, images, shapes, etc. that are rendered in the document. The other structures are largely in the Table streams. For this reason, we don't specifically list out all the contents of the WordDocument stream.

To get a more detailed idea of what is in the WordDocument stream (aside from my generalization above), you should 1) read the entire section 1.3 Overview and 2) follow one or more of the algorithms listed in section 2.4 Document Content. Then I believe you will start to understand the purpose of the various streams better.

Best regards,
Tom Jebo
Microsoft Open Specifications Support

Parth Gupta 180 Reputation points

2024-06-06T08:52:42.65+00:00

Hi @Tom Jebo,

I am desperately trying to follow the "Retrieving Text" algorithm given in section 2.4.1 of [MS-DOC]. Kindly see below:

Using a hex viewer, I know that my text starts at offset 0x800 in the 'WordDocument' stream. Now let's follow the algorithm of section 2.4.1.

My FibRgFcLcb97.fcClx = 0x10E57 and FibRgFcLcb97.lcbClx = 0x15 (21 in decimal).

Now, my Clx in the '1Table' stream at that offset and length is as follows:

b'\x02\x10\x00\x00\x00\x00\x00\x00\x00\xb1\x04\x00\x00\x40\x03\x00\x10\x00\x40\x00\x00'

Since the first byte is \x02, RgPrc is empty, so the rest is the Pcdt.

For the Pcdt structure, clxt is \x02 and lcb is b'\x10\x00\x00\x00', which as a little-endian 4-byte integer is 16.

so PlcPcd is as follows:

b'\x00\x00\x00\x00\xb1\x04\x00\x00\x40\x03\x00\x10\x00\x40\x00\x00'

This means I will have two elements in my aCP array and one element in my aPcd array.

aCP array is as follows:

aCP = [b'\x00\x00\x00\x00', b'\xb1\x04\x00\x00']

Converted to little-endian 4-byte integers, this is:

aCP = [0, 1201]

Now aPcd is as follows:

aPcd = [b'\x40\x03\x00\x10\x00\x40\x00\x00']
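The manual split above can be sketched in Python as follows. This is a minimal illustration, not an official implementation: the function name `parse_clx` is hypothetical, and the layout it assumes (optional Prc entries with clxt 0x01 and a 2-byte cbGrpprl, followed by a Pcdt with clxt 0x02, a 4-byte lcb, and a PlcPcd holding n+1 CPs and n 8-byte Pcds) follows the description of Clx in [MS-DOC] as used in this thread.

```python
import struct

def parse_clx(clx: bytes):
    """Split a Clx into its aCP and aPcd arrays (hypothetical helper)."""
    pos = 0
    # Skip any Prc entries: clxt 0x01, int16 cbGrpprl, then cbGrpprl bytes.
    while clx[pos] == 0x01:
        (cb_grpprl,) = struct.unpack_from("<h", clx, pos + 1)
        pos += 3 + cb_grpprl
    if clx[pos] != 0x02:
        raise ValueError("expected Pcdt (clxt 0x02)")
    # lcb is the size in bytes of the PlcPcd that follows.
    (lcb,) = struct.unpack_from("<I", clx, pos + 1)
    plc = clx[pos + 5 : pos + 5 + lcb]
    # A PlcPcd with n Pcds holds (n + 1) 4-byte CPs and n 8-byte Pcds,
    # so lcb = 4 * (n + 1) + 8 * n.
    n = (lcb - 4) // 12
    a_cp = list(struct.unpack_from(f"<{n + 1}I", plc, 0))
    pcd_base = 4 * (n + 1)
    a_pcd = [plc[pcd_base + 8 * i : pcd_base + 8 * (i + 1)] for i in range(n)]
    return a_cp, a_pcd

# The 21-byte Clx quoted in this thread.
clx = bytes.fromhex("02" "10000000" "00000000" "b1040000" "4003001000400000")
a_cp, a_pcd = parse_clx(clx)
print(a_cp)            # [0, 1201]
print(a_pcd[0].hex())  # 4003001000400000
```

Deriving n from lcb rather than hard-coding one piece means the same function should also handle documents whose piece table has several entries.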

Now, for the first element of aPcd,

Pcd.fc = b'\x00\x10\x00\x40'
# In binary, we will write it as follows:
# 0000 0000 0001 0000 0000 0000 0100 0000
# first 30 bits are:
# 0000 0000 0001 0000 0000 0000 0100 00
# if we made it little endian
# b'\x40\x00\x10\x00'
# which in binary is:
# 0100 0000 0000 0000 0001 0000 0000 0000
# first 30 bits are:
# 0100 0000 0000 0000 0001 0000 0000 00

This is an FcCompressed structure.

Now correct me if I am wrong:

Pcd.fc.fc is given by the "first" 30 bits of these 32 bits.

Also, fCompressed and r1 are the "last" 2 bits respectively (both 0 here in either case).

I have tried this in many different ways, but it never results in the offset 0x800, which we know from before.

(I suspect that it has something to do with reversing the order of the bits, because then it works. Since reversing the order of bits is not mentioned in the documentation, I must ask here to clarify.)

(Note that the sum of my ccpFtn, ccpHdd, ccpAtn, ccpEdn, ccpTxbx, ccpHdrTxbx, ccpText + 1 = 1201)

I apologise for the inconvenience.

Thanks.

Tom Jebo 2,336 Reputation points Microsoft Employee Moderator

2024-06-07T22:14:40.3+00:00

Hi @Parth Gupta ,

It would be easier for me to follow along if I have the same document you're working on. Can you share it with me?

Edit:

And after reviewing the latter part of your explanation, I think maybe the problem is that your view of the diagram for fc is wrong.

Notice that the bits are labeled from left to right, 0 to 31. The most significant bit is actually on the right, not the left. So you do have to reverse the bits for them to make sense as we would normally think about them. Does this make sense to you?
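In code, no explicit bit reversal is needed: one way to apply the point above is to read the four FcCompressed bytes as a little-endian unsigned 32-bit integer and treat diagram bit 0 as the least significant bit. The sketch below does this for the Pcd quoted earlier in the thread. The names `decode_pcd` and `cp_to_offset` are hypothetical, and the layout it assumes (2 flag bytes, 4-byte FcCompressed, 2-byte Prm; fc in the low 30 bits, fCompressed in bit 30; a compressed piece starting at fc / 2 with one byte per character) follows [MS-DOC] section 2.4.1 as discussed here.

```python
import struct

def decode_pcd(pcd: bytes):
    """Return (start_offset, bytes_per_char, f_compressed) for one Pcd."""
    # Pcd layout: 2 bytes of flags, 4-byte FcCompressed, 2-byte Prm.
    (raw_fc,) = struct.unpack_from("<I", pcd, 2)
    fc = raw_fc & 0x3FFFFFFF                  # low 30 bits: fc
    f_compressed = bool((raw_fc >> 30) & 1)   # bit 30: fCompressed
    # Compressed text is 8-bit and starts at fc / 2; otherwise 16-bit at fc.
    start = fc // 2 if f_compressed else fc
    bytes_per_char = 1 if f_compressed else 2
    return start, bytes_per_char, f_compressed

def cp_to_offset(cp: int, piece_first_cp: int, pcd: bytes) -> int:
    """Map a character position inside a piece to a WordDocument offset."""
    start, bytes_per_char, _ = decode_pcd(pcd)
    return start + (cp - piece_first_cp) * bytes_per_char

# The single Pcd from this thread; its piece covers CPs 0 to 1200.
pcd = bytes.fromhex("4003001000400000")
start, bytes_per_char, f_compressed = decode_pcd(pcd)
print(hex(start), bytes_per_char, f_compressed)  # 0x800 1 True
print(hex(cp_to_offset(10, 0, pcd)))             # 0x80a
```

For these bytes the raw value is 0x40001000, so fCompressed comes out as 1 and fc as 0x1000, which gives the expected text offset 0x800.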

Tom

Parth Gupta 180 Reputation points

2024-06-08T06:26:47.1066667+00:00

Thank you so much.

I was stuck on this trivial thing for a long time.

Thanks a lot, and I apologise for the inconvenience. It is working now.
