Understanding the PowerPoint MS-PPT Binary File Format

Summary: Learn about the MS-PPT binary file format that is used in previously released versions of Microsoft PowerPoint products, including the basic structures and key concepts for interacting with PowerPoint programmatically.

Applies to: Office 2007 | Office 2010 | PowerPoint 2010 | VBA

Published:   February 2011

Provided by:   Microsoft Corporation

Contents

  • Overview of the MS-PPT File Format

  • Key Components of the MS-PPT File Format

  • Extracting Content from a PowerPoint File

    • Retrieving Text from PowerPoint Files

    • Retrieving Slides from PowerPoint Files

    • Undo History in the MS-PPT File Format

  • Conclusion

  • Additional Resources

This article explains the structures and some procedures for working with MS-PPT files. It is part of a series of articles that introduce the binary file formats used by Microsoft Office products. These articles are designed to be used in conjunction with the Microsoft Office File Format Documents available on MSDN.

Overview of the MS-PPT File Format

The MS-PPT binary file format is used by Microsoft Office PowerPoint 2003, Microsoft PowerPoint 2002, Microsoft PowerPoint 2000, and Microsoft PowerPoint 97. It starts with a Current User stream, then a PowerPoint Document stream, and a Pictures stream, plus some optional streams for summary information, custom XML data, and digital signatures. Most of the actual content resides in Slide Containers inside the PowerPoint Document stream. The Current User stream contains the Current User Atom, which holds user, version, and system data. The Pictures stream contains embedded images that are referenced from the slide containers they appear in. The smallest unit of data in this format is a Record.

File data is stored sequentially by user edits: each edit stores the persist objects that it added or changed, together with a directory that locates them. To re-create the current version of the file, you therefore start from the last user edit and consult earlier edits only for objects that the last edit did not change. Likewise, you can get previous versions of the file by going to previous user edits in the stream.

Note

The recommended way to perform most programming tasks in Microsoft PowerPoint is to use the PowerPoint Primary Interop Assemblies. These are a set of .NET classes that provide a complete object model for working with Microsoft PowerPoint. This article series deals only with advanced scenarios, such as where Microsoft PowerPoint is not installed.

Key Components of the MS-PPT File Format

Following are the main structures of interest in a .ppt file. All of the structures reside in the PowerPoint Document stream unless otherwise specified.

  • CurrentUserAtom

    An atom record that specifies information about the last user to modify the file and where the most recent user edit is located. This is the only record in the Current User stream.

  • UserEditAtom record

    The UserEditAtom record includes, among other things, pointers to the PersistDirectoryAtom record for the current edit and to the previous UserEditAtom record.

  • PersistDirectoryAtom record

    The PersistDirectoryAtom record specifies a persist object directory, which is a table of persist object identifiers and stream offsets to where you can find persist objects. Each user edit stores a persist object directory that identifies where you can find any new and modified persist objects.

  • PersistDirectoryEntry structure

    PersistDirectoryEntry structures are the elements of a PersistDirectoryAtom record and are indexed sequentially, for example persistDirEntry[0], persistDirEntry[1]. Each one specifies a compressed table of sequential persist object identifiers and stream offsets to the associated persist objects.

    The first 20 bits give the ID of a persist object. The next 12 bits give the number of persist offset entries in the current PersistDirectoryEntry structure. The rest of the PersistDirectoryEntry structure consists of offset entries, each of which uses 4 bytes. If there is more than one PersistOffsetEntry structure, each successive entry applies to the persist object whose ID is one larger than that of the previous entry.

  • DocumentContainer structure

    This is a container record that specifies information about the document. It includes lists of slides, notes, sounds, graphical elements, and other content. All the outline text for the current edit is stored in the Slide List of the DocumentContainer structure. Text inside shapes is stored in Shapes records.

  • MainMasterContainer structure

    This is a container record that specifies a main master slide. The main master slide defines the formatting and some content, such as template graphics, for a presentation slide.

  • SlideContainer structure

    This is the container record for a slide in the presentation. It includes transition settings, header and footer information, pointers and formatting for graphic elements, and a SlideAtom structure that specifies the placeholder shapes used in the slide layout.

  • SlideListWithTextContainer structure

    A SlideListWithTextContainer structure specifies a list of SlideListWithTextSubContainerOrAtom records, each of which contains references to presentation slides and records for text that is contained within them.

  • TextCharsAtom structure and TextBytesAtom structure

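    These are the two kinds of SlideListWithTextSubContainerOrAtom records that contain text characters. Each contains an 8-byte record header followed by the text itself: a TextCharsAtom structure stores 2-byte Unicode (UTF-16) characters, and a TextBytesAtom structure stores 1 byte per character, which is the low byte of a Unicode character whose high byte is zero.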

  • DrawingGroupContainer structure

    This is the part of the DocumentContainer structure that specifies images, WordArt, and other graphical content.

  • RecordHeader structure

    A RecordHeader structure is an 8-byte structure at the beginning of each container record and atom record in the file. It contains four fields: recVer, recInstance, recType, and recLen. The last two are the most interesting. The recType field specifies the type of the current record, and the recLen field gives the length in bytes of the data that follows the header.
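As a minimal sketch, the 8-byte RecordHeader structure can be decoded as follows. The field widths (4 bits for recVer, 12 bits for recInstance, 16 bits for recType, and 32 bits for recLen, all little-endian) are taken from the MS-PPT specification, not from this article. The names RecordHeader and parse_record_header are illustrative only.

```python
import struct
from typing import NamedTuple


class RecordHeader(NamedTuple):
    rec_ver: int       # low 4 bits of the first 16-bit word
    rec_instance: int  # high 12 bits of the first 16-bit word
    rec_type: int      # 16-bit record type
    rec_len: int       # 32-bit length of the data after the header


def parse_record_header(data: bytes, offset: int = 0) -> RecordHeader:
    """Decode the 8-byte header that starts at `offset`."""
    ver_instance, rec_type, rec_len = struct.unpack_from("<HHI", data, offset)
    return RecordHeader(ver_instance & 0x000F, ver_instance >> 4, rec_type, rec_len)


# A hand-built header: recVer 0, recInstance 0, recType 0x0FA0, recLen 4.
sample = struct.pack("<HHI", 0x0000, 0x0FA0, 4)
header = parse_record_header(sample)
```

The data of a record starts 8 bytes after its header, and the next record at the same level starts recLen bytes after that.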

Extracting Content from a PowerPoint File

Extracting content from a PowerPoint document that uses the MS-PPT file format depends on the kind of content that you want to retrieve and in what condition. You can indiscriminately grab all the text from a .ppt file by using one or two pages of code, but that does not maintain the formatting, graphics, transitions, or even slide boundaries. You could extract all of the clip art, for example, with a similar technique, although that is beyond the scope of this article. To get the content slide by slide, you have to identify the current edits by reading PersistDirectoryEntry records.

Retrieving Text from PowerPoint Files

To retrieve plain text from a PowerPoint presentation

  1. Open the PowerPoint Document Stream.

  2. For each record:

    1. Read the record header.

    2. If the recType field is TextCharsAtom (0x0FA0) or TextBytesAtom (0x0FA8), read the rest of the record as text.
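The procedure above might be sketched as follows. The sketch assumes that the PowerPoint Document stream has already been read into a bytes object; obtaining it requires a compound file reader, which is not shown. It also assumes, based on the MS-PPT specification and not on this article, that a recVer value of 0xF marks a container record whose data consists of further records. The function name extract_text is illustrative.

```python
import struct

RT_TEXT_CHARS_ATOM = 0x0FA0  # 2-byte (UTF-16) characters
RT_TEXT_BYTES_ATOM = 0x0FA8  # 1 byte per character


def extract_text(stream: bytes) -> list[str]:
    """Collect the text of every TextCharsAtom and TextBytesAtom record."""
    texts = []

    def walk(start: int, end: int) -> None:
        pos = start
        while pos + 8 <= end:
            ver_instance, rec_type, rec_len = struct.unpack_from("<HHI", stream, pos)
            body = pos + 8
            body_end = min(body + rec_len, end)
            if rec_type == RT_TEXT_CHARS_ATOM:
                texts.append(stream[body:body_end].decode("utf-16-le", errors="replace"))
            elif rec_type == RT_TEXT_BYTES_ATOM:
                # Each byte is the low byte of a character whose high byte is 0.
                texts.append(stream[body:body_end].decode("latin-1"))
            elif ver_instance & 0x000F == 0x000F:
                # Assumed container record: descend into its child records.
                walk(body, body_end)
            pos = body + rec_len

    walk(0, len(stream))
    return texts
```

Because this scan ignores user edits, it returns text from outdated records as well as current ones, in stream order.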

Retrieving Slides from PowerPoint Files

In theory, you could reconstruct a slide deck with the same approach as retrieving plain text, for example, by checking record headers for the recType values of DocumentContainer, SlideContainer, and NotesContainer records. However, this returns an indiscriminate collection of active and outdated slides from every state of edit, with no way to differentiate them. If you instead start by building a persist object directory, you have pointers to all of the current content and none of the outdated content.

To retrieve slides and their content from PowerPoint files

  1. Create a persist object directory.

    1. Create a data structure, such as a dictionary, to hold two columns of linked data. The first column is for Persist object IDs, the second is for the offsets where they are located in the stream.

    2. Read the CurrentUserAtom structure from the Current User stream. Bytes 16–19 of the CurrentUserAtom structure specify the offsetToCurrentEdit field.

    3. Open the PowerPoint Document Stream, and read the UserEditAtom structure from the offset specified by the CurrentUserAtom.offsetToCurrentEdit field. Bytes 16–19 specify the offsetLastEdit field, and bytes 20–23 specify the offsetPersistDirectory field.

    4. Read the PersistDirectoryAtom structure from the offset supplied by the UserEditAtom.offsetPersistDirectory field.

    5. Populate the data structure you created by using the ID values and offsets of each PersistDirectoryEntry structure. In cases in which a PersistDirectoryEntry structure has more than one offset, assign each successive offset an ID value one larger than the previous ID in the table.

    6. Go to the offset in the PowerPoint Document Stream specified by the UserEditAtom.offsetLastEdit field and read the UserEditAtom structure that starts there.

    7. Repeat the previous three steps until you run out of user edits (an offsetLastEdit value of 0 means that there is no earlier edit). Ignore any persist directory entries whose ID values already exist in the table, because these point to older copies of objects that a later edit has overwritten.

  2. Go through the persist object directory, checking the record headers at each specified offset, and reading each record of type = RT_Document. This is the DocumentContainer structure.

  3. Within the DocumentContainer structure, find the slide list, which is a record with rh.recType = RT_SlideListWithText. This is a SlideListWithTextContainer record, which contains an array of SlideListWithTextSubContainerOrAtom entries.

  4. Read the slide list. Each SlidePersistAtom structure (rh.recType = RT_SlidePersistAtom) comes before the records that contain the content for that slide. Any record of rh.recType = RT_TextCharsAtom or rh.recType = RT_TextBytesAtom contains text.

  5. For each SlidePersistAtom structure in the slide list, create a slide with the corresponding text content.
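Step 1 of the procedure above, building the persist object directory, might look like the following sketch. It assumes that the Current User stream and the PowerPoint Document stream are already available as bytes objects. The record type value 0x1772 for the PersistDirectoryAtom record and the use of 0 in offsetLastEdit to mean "no earlier edit" come from the MS-PPT specification, not from this article. The function name build_persist_directory is illustrative.

```python
import struct

RT_PERSIST_DIRECTORY_ATOM = 0x1772


def build_persist_directory(current_user: bytes, document: bytes) -> dict[int, int]:
    """Map each persist object ID to the stream offset of its newest copy."""
    directory = {}
    # Bytes 16-19 of the CurrentUserAtom hold offsetToCurrentEdit.
    (edit_offset,) = struct.unpack_from("<I", current_user, 16)
    seen = set()  # guards against a corrupt chain that loops back on itself
    while edit_offset and edit_offset not in seen:
        seen.add(edit_offset)
        # Bytes 16-19 and 20-23 of the UserEditAtom.
        last_edit, dir_offset = struct.unpack_from("<II", document, edit_offset + 16)
        _, rec_type, rec_len = struct.unpack_from("<HHI", document, dir_offset)
        if rec_type != RT_PERSIST_DIRECTORY_ATOM:
            raise ValueError("offsetPersistDirectory does not point at a PersistDirectoryAtom")
        pos = dir_offset + 8
        end = pos + rec_len
        while pos < end:
            (packed,) = struct.unpack_from("<I", document, pos)
            first_id = packed & 0xFFFFF  # first 20 bits: starting persist ID
            count = packed >> 20         # next 12 bits: number of offsets
            pos += 4
            for i in range(count):
                (offset,) = struct.unpack_from("<I", document, pos)
                # Newer edits are visited first, so keep the first offset seen.
                directory.setdefault(first_id + i, offset)
                pos += 4
        edit_offset = last_edit
    return directory
```

Walking from the newest edit to the oldest and keeping only the first offset recorded for each ID is one way to ignore the conflicting entries described in step 1.7.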

You can use the same approach for other slide content, such as notes, headers and footers, and formatting information.
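Steps 2 through 5 of the slide procedure might be sketched as follows, under the assumption that the PowerPoint Document stream is available as a bytes object and that the offsets from the persist object directory are passed in. The record type values 0x03E8 (RT_Document), 0x0FF0 (RT_SlideListWithText), and 0x03F3 (RT_SlidePersistAtom) come from the MS-PPT specification. The helper names children and slide_texts are illustrative, and the sketch does not distinguish between multiple slide lists if the DocumentContainer structure holds more than one.

```python
import struct

RT_DOCUMENT = 0x03E8
RT_SLIDE_PERSIST_ATOM = 0x03F3
RT_TEXT_CHARS_ATOM = 0x0FA0
RT_TEXT_BYTES_ATOM = 0x0FA8
RT_SLIDE_LIST_WITH_TEXT = 0x0FF0


def children(stream, start, end):
    """Yield (recType, body_start, body_end) for each record in [start, end)."""
    pos = start
    while pos + 8 <= end:
        _, rec_type, rec_len = struct.unpack_from("<HHI", stream, pos)
        yield rec_type, pos + 8, min(pos + 8 + rec_len, end)
        pos += 8 + rec_len


def slide_texts(document, persist_offsets):
    """Return one list of strings per SlidePersistAtom found in a slide list."""
    slides = []
    for offset in persist_offsets:
        _, rec_type, rec_len = struct.unpack_from("<HHI", document, offset)
        if rec_type != RT_DOCUMENT:
            continue  # only the DocumentContainer holds the slide list
        for kind, start, end in children(document, offset + 8, offset + 8 + rec_len):
            if kind != RT_SLIDE_LIST_WITH_TEXT:
                continue
            for atom, body, body_end in children(document, start, end):
                if atom == RT_SLIDE_PERSIST_ATOM:
                    slides.append([])  # a new slide starts here
                elif slides and atom == RT_TEXT_CHARS_ATOM:
                    slides[-1].append(document[body:body_end].decode("utf-16-le", errors="replace"))
                elif slides and atom == RT_TEXT_BYTES_ATOM:
                    slides[-1].append(document[body:body_end].decode("latin-1"))
    return slides
```

Passing the values of the persist object directory as persist_offsets restricts the result to the current DocumentContainer structure, so text from outdated edits is left out.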

Undo History in the MS-PPT File Format

Because of the sequential nature of the MS-PPT binary file format, you can easily restore previous versions of PowerPoint files.

To retrieve previous versions of a PowerPoint file

  1. Load a copy of the PowerPoint file into memory.

  2. From the in-memory copy, read the CurrentUserAtom record from the Current User stream.

  3. In the PowerPoint Document Stream, read the UserEditAtom structure from the offset specified by the CurrentUserAtom.offsetToCurrentEdit field.

  4. Read the previous UserEditAtom structure from the offset supplied by the UserEditAtom.offsetLastEdit field.

  5. Delete everything after the previous UserEditAtom structure.

  6. Update the value of the CurrentUserAtom.offsetToCurrentEdit field so that it points to the previous UserEditAtom structure.

  7. Save the file under a modified file name. To reach an earlier version, continue backtracking until you reach the UserEditAtom structure for the version that you want.
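Steps 2 through 6 above might be sketched as follows, operating on in-memory copies of the two streams. Writing the modified streams back into a file requires a compound file writer, which is not shown. The function name roll_back_one_edit is illustrative, and treating an offsetLastEdit value of 0 as "no earlier edit" follows the MS-PPT specification, not this article.

```python
import struct


def roll_back_one_edit(current_user: bytes, document: bytes):
    """Return (current_user, document) as they stood before the latest edit.

    Returns None when there is no earlier user edit to go back to.
    """
    # Bytes 16-19 of the CurrentUserAtom hold offsetToCurrentEdit.
    (current_edit,) = struct.unpack_from("<I", current_user, 16)
    # Bytes 16-19 of the current UserEditAtom hold offsetLastEdit.
    (previous_edit,) = struct.unpack_from("<I", document, current_edit + 16)
    if previous_edit == 0:
        return None
    # The previous UserEditAtom spans its 8-byte header plus recLen bytes.
    (rec_len,) = struct.unpack_from("<I", document, previous_edit + 4)
    truncated = document[: previous_edit + 8 + rec_len]
    # Point the CurrentUserAtom at the previous UserEditAtom.
    patched = bytearray(current_user)
    struct.pack_into("<I", patched, 16, previous_edit)
    return bytes(patched), truncated
```

Calling the function repeatedly on its own result steps back one user edit at a time until it returns None.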

Conclusion

This is just a sampling of the MS-PPT format. With the tools provided in this article, simple data recovery should be within your reach. With additional exploration, you can start to recover media, formatting information and other metadata, and eventually work up to save operations.

Additional Resources

For more information, see the following resources: