Python fails to correctly identify all Italic fonts in docx files

Suzana Eree 811 Reputation points
2023-09-24T13:32:24.8433333+00:00

Hello, I use Python to detect all italic text in a docx file. The problem: it doesn't find all of the italic text, only some of it.

For example, bebe.docx contains the following 4 lines, all in italic, Times New Roman (13):

Bebe.

Bebe este acasa maine.

- Bebe.

- The Little Mermaid sighed and looked sadly at her fish tail.

This is the simplest Python code to check:

from docx import Document

for para in Document('bebe.docx').paragraphs:
    for run in para.runs:
        print(f"Text: {run.text}, Bold: {run.bold}, Italic: {run.italic}")

Here is what Python finds:

Text: Bebe, Bold: None, Italic: None

Text: ., Bold: None, Italic: None

Text: Bebe este , Bold: None, Italic: False

Text: acasa, Bold: None, Italic: None

Text: maine., Bold: None, Italic: False

Text: - Bebe, Bold: None, Italic: None

Text: ., Bold: None, Italic: None

Text: - The Little Mermaid sighed and looked sadly at her fish tail., Bold: None, Italic: False


Accepted answer
  1. Emi Zhang-MSFT 25,071 Reputation points Microsoft Vendor
    2023-09-25T06:15:26.5833333+00:00

    Hi,

    I tested and didn't find your problem:

    [Screenshots of the test output]

    I suggest checking whether the python-docx package is installed in your local Python environment.
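
One way to run that check from Python itself is sketched below, using the standard library's `importlib.metadata` (Python 3.8 or later). `package_version` is a hypothetical helper name. Note that the distribution is named `python-docx` even though it is imported as `docx`.

```python
from importlib import metadata


def package_version(name):
    """Return the installed version of a distribution, or None if it is missing."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def describe(name="python-docx"):
    """Build a short human-readable status line for the given distribution."""
    version = package_version(name)
    if version is None:
        return f"{name} is not installed in this environment"
    return f"{name} {version} is installed"
```

Running `print(describe())` in the same interpreter that runs the script shows whether the package is available there.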


3 additional answers

  1. Suzana Eree 811 Reputation points
    2023-09-25T07:13:26.4133333+00:00

    I found what the problem was, but I have not found the most important part, a solution. The problem was with the styles in the docx file. I had copied and pasted the content from a web page into the docx file. Then I selected all the content, changed the font to Times New Roman, 12, the alignment to Justify, and the shading to White, and saved.

    The problem was that the text was copied together with all of its bold and italic formatting. If I had selected the paragraphs separately and set them to italic myself, it would have worked. Instead, the words that were already italic stayed visibly italic in the docx, but Python does not see them.

    For Python to see bold, italic, and underline correctly, I first follow these steps: "Clear Formatting" -> "Create a style" -> "Save style"

    I tried changing the style to several others, and Python still did not see the italic font, so the most important step was "Clear Formatting". Then, step by step, I set each paragraph to bold, normal, or italic again. I tested again and it works fine. Any of the styles works now; I switched between them for the test. I tested with this other code:

    from docx import Document
    import re
    
    def print_sentence_style(paragraph):
        sentence = ""
        for run in paragraph.runs:
            sentence += run.text
            is_bold = 'True' if run.bold else 'False'
            is_italic = 'True' if run.italic else 'False'
            is_normal = 'True' if not run.bold and not run.italic else 'False'
            style = f"Bold: {is_bold}, Italic: {is_italic}, Normal: {is_normal}"
            print(f"Run text: '{run.text}', Style: {style}")
    
        # Split the text into sentences for later display
        sentences = re.split(r'(?<=[.!?])\s+', sentence.strip())
        for sent in sentences:
            print(f"Sentence: '{sent}'")
    
    document = Document('bebe.docx')
    for paragraph in document.paragraphs:
        print(" ")
        print_sentence_style(paragraph)
    
    
    

    So the question is: if I copy text from a web page and it already has bold, italic, and normal formatting, how can I format all of the text at once, without using "Clear Formatting" and then reapplying each font style separately?

    I think that when you copy and paste text from a web page into the Word file, the HTML classes that wrap each paragraph are copied too. This can be seen under Styles in the docx: when I click on a paragraph, I can see the class from the HTML file.

    This seems to cause a conflict of some kind. I think Python does not see the italic font in the docx because the CSS for the HTML class does not specify that it is italic.

    For example, in file.html I have this CSS:

    .text_obisnuit_1{font:13px arial,garamond,sans-serif;color:#333;}
    .text_obisnuit_2{font:13px arial,garamond,sans-serif;font-weight:bold;color:#333;}

    The first class is normal text. The second class is bold text.

    Probably, if a class had been declared with italic, Python would have recognized it in Word. Right now the italic font is visible in the docx file but is not recognized by Python, and no italic style appears in the Word styles either, even though I see the paragraph as italic.

    So, in an HTML page, if a paragraph with the text_obisnuit_1 class contains <em> and </em>, and text_obisnuit_1 does not declare font-style: italic; in the CSS, then the text appears italic in the Word .docx but Python will not recognize it. Word will not show it in the styles either, even though it is visible in the docx.

    In short, the <em> and </em> or <i> and </i> tags in the HTML files must be taken into account, even if the classes do not include the italic style.
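
To illustrate taking the tags into account regardless of the CSS classes, here is a minimal sketch that reads the italic text straight from the source HTML with the standard library's `html.parser`. `ItalicCollector` and `italic_text` are hypothetical names, and the sketch only looks at `<em>` and `<i>` tags, not at inline `style` attributes or stylesheets.

```python
from html.parser import HTMLParser


class ItalicCollector(HTMLParser):
    """Collect the text found inside <em> or <i> tags, whatever the CSS class."""

    def __init__(self):
        super().__init__()
        self.depth = 0           # how many italic tags are currently open
        self.italic_chunks = []  # text pieces seen while depth > 0

    def handle_starttag(self, tag, attrs):
        if tag in ("em", "i"):
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in ("em", "i") and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        # Keep only non-blank text that sits inside an italic tag.
        if self.depth and data.strip():
            self.italic_chunks.append(data)


def italic_text(html):
    """Return the list of italic text pieces in an HTML string."""
    parser = ItalicCollector()
    parser.feed(html)
    parser.close()
    return parser.italic_chunks
```

For example, `italic_text('<p class="text_obisnuit_1"><em>Bebe.</em></p>')` picks up the emphasized text even though the class itself declares no italic style.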

    I don't know how to fix this in a document with many pages. I have to run the Python code, but the italic text will not be seen, and it takes time to use "Clear Formatting" and then restyle each paragraph in Word.

    You can test with this page, which has some italic paragraphs. Copy the content into a docx file and compare how the italic text appears in Word and in Python.

    https://neculaifantanaru.com/en/delight-my-gaze-with-something-that-reflects-the-harmony-of-nature-II.html
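
When comparing what Word shows with what Python reports, it can help to look at what is actually stored in the file. A .docx is a zip archive whose main part is `word/document.xml`, and the sketch below lists, for every run, the explicit `w:i` and `w:iCs` (complex-script italic) toggles and any character style reference. It uses only the standard library. `describe_runs` and `describe_docx` are hypothetical helper names, and the sketch reads direct run formatting only, not the style definitions in `word/styles.xml`.

```python
import zipfile
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}


def _toggle(rpr, tag):
    """Return True/False for an explicit toggle element, None when it is absent."""
    if rpr is None:
        return None
    el = rpr.find(f"w:{tag}", NS)
    if el is None:
        return None
    # A toggle element without w:val counts as "on".
    return el.get(f"{{{W}}}val", "true") not in ("0", "false", "off")


def describe_runs(document_xml):
    """List text, italic toggles and character style for each run in document.xml."""
    root = ET.fromstring(document_xml)
    rows = []
    for run in root.iter(f"{{{W}}}r"):
        rpr = run.find("w:rPr", NS)
        style = rpr.find("w:rStyle", NS) if rpr is not None else None
        rows.append({
            "text": "".join(t.text or "" for t in run.findall("w:t", NS)),
            "i": _toggle(rpr, "i"),
            "iCs": _toggle(rpr, "iCs"),
            "rStyle": style.get(f"{{{W}}}val") if style is not None else None,
        })
    return rows


def describe_docx(path):
    """Open a .docx archive and describe the runs of its main document part."""
    with zipfile.ZipFile(path) as zf:
        return describe_runs(zf.read("word/document.xml"))
```

A run whose `i` value is `None` but whose `rStyle` is set gets its appearance from that character style, which is the case where `run.italic` reports `None`.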


  2. Suzana Eree 811 Reputation points
    2023-09-29T16:14:47.0933333+00:00

    I found a solution that works great. It detects the font style of each run in the docx paragraphs and converts them to HTML:

    import docx
    
    def run_get_style(run) -> str:
        # Note: a run that is both bold and italic is reported as "bold" here
        if run.bold:
            return "bold"
        elif run.italic:
            return "italic"
        else:
            return "normal"
    
    def wrap(style: str, text: str) -> str:
        # Wrap a merged chunk of text in the tag that matches its style
        if style == "bold":
            return f"<b>{text}</b>"
        if style == "italic":
            return f"<em>{text}</em>"
        return text
    
    def detect_fonts(document) -> None:
        with open("bebe.html", "w", encoding="utf-8") as f:
            for paragraph in document.paragraphs:
                runs = paragraph.runs
                if not runs:
                    continue
                current_style = None
                current_text = ""
    
                # Open the paragraph here rather than patching "<p>" in afterwards with a regex
                f.write("<p>")
                for run in runs:
                    run_style = run_get_style(run)
                    if run_style == current_style:
                        # Same style as the previous run: keep accumulating
                        current_text += run.text
                    else:
                        if current_style:
                            f.write(wrap(current_style, current_text))
                        current_style = run_style
                        current_text = run.text
    
                # Flush the last chunk and close the paragraph
                if current_style:
                    f.write(wrap(current_style, current_text))
                f.write("</p>\n")
    
    def main():
        document = docx.Document("bebe.docx")
        detect_fonts(document)
    
    if __name__ == "__main__":
        main()
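
The core of the loop above is merging neighbouring runs that share a style, since Word often splits one visually uniform phrase into several runs. As a compact sketch of that single step, the same grouping can be expressed with `itertools.groupby`. `merge_adjacent` is a hypothetical helper name, and it works on plain `(style, text)` pairs rather than on python-docx run objects.

```python
from itertools import groupby


def merge_adjacent(pairs):
    """Merge consecutive (style, text) pairs that share the same style.

    Only neighbours are merged; the same style appearing again later
    starts a new chunk, so the original text order is preserved.
    """
    merged = []
    for style, group in groupby(pairs, key=lambda pair: pair[0]):
        merged.append((style, "".join(text for _, text in group)))
    return merged
```

With python-docx, the input pairs could be built as `[(run_get_style(run), run.text) for run in paragraph.runs]`.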
    
    

  3. Suzana Eree 811 Reputation points
    2023-10-01T10:24:04.82+00:00

    Or this code, which also handles text that is both bold and italic:

    import docx
    
    def run_get_style(run) -> str:
        if run.bold and run.italic:
            return "bold-italic"
        elif run.bold:
            return "bold"
        elif run.italic:
            return "italic"
        else:
            return "normal"
    
    def convert_docx_to_html_style(para):
        result = ""
        if para.runs:
            html_para = '<p>'
            current_style = None
            current_text = ""
    
            for run in para.runs:
                run_style = run_get_style(run)
                if run_style == current_style:
                    current_text += run.text
                else:
                    if current_style:
                        if "bold" in current_style:
                            if "italic" in current_style:
                                html_para += '<b><em>' + current_text + '</em></b>'
                            else:
                                html_para += '<b>' + current_text + '</b>'
                        elif "italic" in current_style:
                            html_para += '<em>' + current_text + '</em>'
                        else:
                            html_para += current_text
                    current_style = run_style
                    current_text = run.text
    
            if current_style:
                if "bold" in current_style:
                    if "italic" in current_style:
                        html_para += '<b><em>' + current_text + '</em></b>'
                    else:
                        html_para += '<b>' + current_text + '</b>'
                elif "italic" in current_style:
                    html_para += '<em>' + current_text + '</em>'
                else:
                    html_para += current_text
    
            html_para += '</p>\n'
            result += html_para
        return result
    
    
    # Read the DOCX document
    document = docx.Document("bebe.docx")
    
    # Open the HTML file for writing
    with open("bebe.html", "w", encoding="utf-8") as html_file:
        # Write the start of the HTML file
        html_file.write("<html>\n<head>\n<title>Document</title>\n</head>\n<body>\n")
    
        # Go through each paragraph and convert it to HTML
        for paragraph in document.paragraphs:
            converted_paragraph = convert_docx_to_html_style(paragraph)
            # Write the converted paragraph to the HTML file
            html_file.write(converted_paragraph)
    
        # Write the end of the HTML file
        html_file.write("</body>\n</html>")
    
    print("The content of bebe.docx has been saved to bebe.html.")
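
One caveat with both conversion scripts: the run text is written into the HTML as-is, so characters such as `<` or `&` in the document would end up as markup. A minimal sketch of escaping the text before wrapping it is shown below, using the standard library's `html.escape`. `to_html_fragment` is a hypothetical helper name, not something from the code above.

```python
from html import escape


def to_html_fragment(text, bold=False, italic=False):
    """Escape the text, then wrap it in <em> and/or <b> as requested."""
    fragment = escape(text)
    if italic:
        fragment = f"<em>{fragment}</em>"
    if bold:
        fragment = f"<b>{fragment}</b>"
    return fragment
```

Nesting `<em>` inside `<b>` matches the `<b><em>...</em></b>` order used in the script above.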
    
    
