Conversion of irregular tables doc->docx.

Vlado 20 Reputation points
2023-12-07T17:59:09.8766667+00:00

Hello,
I am trying to convert a binary doc file that contains a table into the docx (OOXML) format. I have a problem with irregular tables. I have 3 questions:

  1. When parsing the table properties, I can read the cell widths (from sprmTDefTable.rgTc80[i].wWidth) for rows 1, 2 and 4,
    but there is no information for row 3 (marked yellow). Internally, it looks like the paragraphs of row 3 follow the paragraphs of row 4 without any table properties.
  2. I can't find the information needed for gridSpan; for example, cell "B" has
    <w:gridSpan w:val="4"/> (calculating it can be difficult for large tables).
  3. I don't understand the different values in tblGrid and tcPr/tcW, for example for the first column:
    <w:tblGrid>
    <w:gridCol w:w="811"/>
    ...
    and in column properties width:
    <w:tc>
    <w:tcPr>
    <w:tcW w:w="828" w:type="dxa"/> <- I have this value

TS_4

Thank you,

Vlado


Accepted answer
  1. Tom Jebo 1,906 Reputation points Microsoft Employee
    2023-12-21T01:40:03.2666667+00:00

    Hi @Vlado,

    I parsed your sample document (TS_4.doc) by hand and found that the 3rd row does have table properties, with the cell definitions supplied by sprmTInsert among other properties.

    <WARNING: The below offsets may not match yours as I do some rearranging of sectors to make editing the hex bytes in the file easier>

    Here is the text run in hex (i.e. ccpText): [image: hex dump of the text run]

    The end-of-row TTP mark (0x07) is at offset 0x4223. Note that this is after the para marker (0x07) for "E3" and the para marker (0x07) for the merged cell under "D".

    If you go through the algorithm in MS-DOC 2.4.6.1 "Direct Paragraph Formatting", you'll arrive at a PapxInFkp (eventually) for this mark:

    PapxInFkp @ 0x4C00 + 2*BxPap.bOffset = 0x4C00 + 2*0xEB = 0x4DD6:

    00 04 00 00 46 66 92 02 00 00

    Inside this is a grpprl @ 0x4DDA (after the cb, cb' and GrpPrlAndIstd.istd, which is 0x0000):

    46 66 92 02 00 00

    This (0x6646) is a sprmPHugePapx.
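The offset arithmetic and the PapxInFkp layout above can be reproduced with a short Python sketch. The byte string is the one quoted above. The helper `parse_papx_in_fkp` and the constant names are hypothetical, and the cb/cb' handling reflects the PapxInFkp description in MS-DOC as understood here, so it should be checked against the spec before being relied on.

```python
import struct

FKP_OFFSET = 0x4C00   # file offset of the FKP, taken from the answer above
B_OFFSET = 0xEB       # BxPap.bOffset, taken from the answer above

# bOffset counts 2-byte units from the start of the FKP.
papx_offset = FKP_OFFSET + 2 * B_OFFSET

# Bytes quoted above for the PapxInFkp.
papx_in_fkp = bytes.fromhex("00 04 00 00 46 66 92 02 00 00")

def parse_papx_in_fkp(buf):
    """Split a PapxInFkp into (istd, grpprl). Hypothetical helper."""
    cb = buf[0]
    if cb == 0:
        # cb == 0: the next byte is cb', and the body is 2 * cb' bytes long.
        body = buf[2:2 + 2 * buf[1]]
    else:
        # cb != 0: the body is 2 * cb - 1 bytes long.
        body = buf[1:1 + 2 * cb - 1]
    # The body starts with a 2-byte istd, followed by the grpprl.
    istd = struct.unpack_from("<H", body, 0)[0]
    return istd, body[2:]

istd, grpprl = parse_papx_in_fkp(papx_in_fkp)

# The single Prl here is a 2-byte sprm plus a 4-byte operand (little-endian).
sprm, operand = struct.unpack_from("<HI", grpprl, 0)
print(hex(papx_offset), hex(istd), hex(sprm), hex(operand))
```

The operand decoded at the end is the offset into the Data stream that the sprmPHugePapx points to.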

    Breaking this open, we are led to a PrcData with GrpPrl:

    PrcData @ 0x292+Data Stream = 0xE92:

    PrcData.cbGrpprl = 0x1FB

    PrcData addresses: 0xE92 -> 0x108F

    And the first Prl in GrpPrl is:

    6B 64 95 01 00 00

    sprmPTableProps(0x646B)

    Ok, so PrcData @ 0x195 into Data Stream:

    Range for PrcData: 0xD95 -> 0xE91
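As a rough illustration of how the two PrcData locations above chain together, here is a small Python sketch. `DATA_STREAM_BASE` is an assumption derived from the numbers in this answer (0xE92 - 0x292) and only applies to the rearranged copy of the file; `prc_data_range` is a hypothetical helper that assumes a PrcData is a 2-byte cbGrpprl followed by the GrpPrl.

```python
import struct

# Assumption: file offset at which the Data stream starts in the
# rearranged file used in this answer (0x292 in the stream == 0xE92 in the file).
DATA_STREAM_BASE = 0xE92 - 0x292

def prc_data_range(stream_offset, cb_grpprl):
    """Return (start, end) file offsets of a PrcData; end is exclusive."""
    start = DATA_STREAM_BASE + stream_offset
    end = start + 2 + cb_grpprl  # 2-byte cbGrpprl field, then the GrpPrl
    return start, end

# First PrcData: reached through the sprmPHugePapx operand 0x292.
first_start, first_end = prc_data_range(0x292, 0x1FB)

# First Prl inside it, as quoted above: sprm (2 bytes) + operand (4 bytes).
sprm, operand = struct.unpack("<HI", bytes.fromhex("6B 64 95 01 00 00"))

# The sprmPTableProps operand is again an offset into the Data stream.
second_start = DATA_STREAM_BASE + operand
print(hex(first_start), hex(first_end), hex(sprm), hex(second_start))
```

In a real parser the stream base would come from the compound file's directory and sector chain rather than from a hard-coded constant.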

    [image: hex dump of the PrcData]

    At this point, after offset 0x0D95 in the above hex dump (which holds the length of the PrcData), you can visually scan the Prls and see the following (I won't list all of them):

    Offset  Sprm (Id)
    ======  =====================
    0xD97   sprmPFInTable(0x2416)
    0xD9A   sprmPFTtp(0x2417)
    0xD9D   sprmPItap(0x6649)
    0xDA7   sprmTInsert(0x7621)

    And so on. This should be where the 3rd row's cell definitions are found.
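Instead of scanning the Prls by eye, the walk can be sketched in Python. This is a minimal sketch, assuming the Sprm layout from MS-DOC in which the top three bits (spra) select the operand size. The sample bytes are synthetic (the operand values are invented and are not the bytes from TS_4.doc), and `walk_grpprl` is a hypothetical helper. The two variable-length special cases that the spec singles out are deliberately left unhandled.

```python
import struct

# Operand size in bytes per spra value; spra 6 is variable-length.
OPERAND_SIZE = {0: 1, 1: 1, 2: 2, 3: 4, 4: 2, 5: 2, 7: 3}
SPRM_T_DEF_TABLE = 0xD608   # uses a 2-byte length, not handled in this sketch
SPRM_P_CHG_TABS = 0xC615    # has its own length rule, not handled in this sketch

NAMES = {0x2416: "sprmPFInTable", 0x2417: "sprmPFTtp",
         0x6649: "sprmPItap", 0x7621: "sprmTInsert"}

def walk_grpprl(buf, base=0):
    """Yield (file_offset, sprm, operand_bytes) for each Prl in buf."""
    pos = 0
    while pos + 2 <= len(buf):
        sprm = struct.unpack_from("<H", buf, pos)[0]
        spra = sprm >> 13
        if spra == 6:
            if sprm in (SPRM_T_DEF_TABLE, SPRM_P_CHG_TABS):
                raise NotImplementedError("special-case operand length")
            size = 1 + buf[pos + 2]  # length byte plus the bytes it counts
        else:
            size = OPERAND_SIZE[spra]
        yield base + pos, sprm, buf[pos + 2:pos + 2 + size]
        pos += 2 + size

# Synthetic GrpPrl: three paragraph sprms with made-up operand values.
sample = bytes.fromhex("16 24 01  17 24 01  49 66 01 00 00 00")
prls = list(walk_grpprl(sample, base=0xD97))
for offset, sprm, operand in prls:
    print(hex(offset), NAMES.get(sprm, hex(sprm)), operand.hex())
```

With `base=0xD97` the three Prls land on the same offsets as the first three rows of the table above, which is a quick sanity check on the operand sizes.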

    Best regards,
    Tom Jebo
    Microsoft Open Specifications Support


2 additional answers

  1. Tom Jebo 1,906 Reputation points Microsoft Employee
    2023-12-12T22:28:19+00:00

    Hi @Vlado ,

    To explain this, I'll use ISO 29500-1:2016 (download can be found [here]). I'll address the questions about gridSpan and gridCol first and then post again about the binary format question shortly.

    Question 2: The gridSpan element is described in 17.4.17 gridSpan (Grid Columns Spanned by Current Table Cell). In short, gridSpan tells the application how many logical grid columns (gridCol) a cell spans in its row, counted from the position it reaches by following its preceding sibling tc elements in order. Therefore, cell B starts after A and spans 4 logical grid columns.
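One common way to derive gridSpan when converting is to collect every cell boundary position from all rows into one shared grid and then count how many grid columns each cell covers. The Python sketch below illustrates that idea only. It is not taken from either specification, the function name `build_grid` is hypothetical, and the boundary values are invented (they are not the ones from TS_4.doc). It also assumes every row starts and ends at the same x position, so gridBefore/gridAfter are not considered.

```python
from bisect import bisect_left

def build_grid(rows):
    """rows: per-row lists of cell boundary x positions, left to right.

    Returns (grid_col_widths, spans_per_row).
    """
    # The shared grid is the sorted union of all boundaries in the table.
    edges = sorted({x for row in rows for x in row})
    grid_cols = [right - left for left, right in zip(edges, edges[1:])]
    spans = []
    for row in rows:
        # Index of each boundary within the shared grid.
        idx = [bisect_left(edges, x) for x in row]
        # A cell spans as many grid columns as lie between its two edges.
        spans.append([right - left for left, right in zip(idx, idx[1:])])
    return grid_cols, spans

# Invented example: row 1 has cells A and B, row 2 has five narrow cells.
rows = [
    [0, 800, 4000],
    [0, 800, 1600, 2400, 3200, 4000],
]
grid_cols, spans = build_grid(rows)
print(grid_cols)
print(spans)
```

In this made-up table the second cell of the first row covers four grid columns, which is the same situation as cell B in the question.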

    Question 3: The tcW=828 for cell A is explained in 17.4.71 tcW (Preferred Table Cell Width):

    This element specifies the preferred width for this table cell. This preferred width is used as part of the table layout algorithm specified by the tblLayout element (§17.4.52; §17.4.53) - full description of the algorithm in the ST_TblLayout simple type (§17.18.87).

    All widths in a table are considered preferred because:

    • The table shall satisfy the shared columns as specified by the tblGrid element (§17.4.48)
    • Two or more widths can have conflicting values for the width of the same grid column
    • The table layout algorithm (§17.18.87) can require a preference to be overridden

    This value is specified in the units applied via its type attribute. Any width value of type pct for this element shall be calculated relative to the overall width of the table.

    Therefore, the 828 value is only a preference and is not used here; the cell is fitted to the 811 gridCol width of its logical column.

    Best regards,
    Tom Jebo
    Microsoft Open Specifications Support


  2. Vlado 20 Reputation points
    2024-01-09T12:59:41.7466667+00:00

    Thank you very much Tom!
    Your answer with detailed description cleared it up for me.

    The problem will be in my code, where I probably don't have all the data loaded correctly. I have to analyze it.

    (sorry for late reply. I was on a long vacation)
    Best regards,
    Vlado