I found what the problem was, but I still cannot find the most important part of the solution. The problem was with the styles in the docx file. I had copied and pasted the content from a web page into the docx file. Then I selected all the content and changed the font to Times New Roman, size 12, alignment Justify, shading White.
Then I saved the file.
The problem was that all the text had been copied together with its bold and italic formatting. If I had selected the paragraphs separately and set them to italic myself, it would have worked. Instead, the words in italics remained visible in the docx, but Python does not see them.
For Python to see the bold, italic, and underline formatting correctly, I first followed these steps: "Clear Formatting" -> "Create a style" -> "Save style".
I had tried changing the style to another one, and Python still did not see the italic font, so the most important step was "Clear Formatting". After that, I went through the paragraphs one by one and set each to bold, normal, or italic. I tested again and it works fine. Any of the styles works now as well; I switched between them to check. I tested with this other code:
from docx import Document
import re


def print_sentence_style(paragraph):
    sentence = ""
    for run in paragraph.runs:
        sentence += run.text
        # run.bold / run.italic are None when the value is inherited
        # from a style, so None is reported as 'False' here.
        is_bold = 'True' if run.bold else 'False'
        is_italic = 'True' if run.italic else 'False'
        is_normal = 'True' if not run.bold and not run.italic else 'False'
        style = f"Bold: {is_bold}, Italic: {is_italic}, Normal: {is_normal}"
        print(f"Run text: '{run.text}', Style: {style}")

    # Split the paragraph text into sentences for later display
    sentences = re.split(r'(?<=[.!?])\s+', sentence.strip())
    for sent in sentences:
        print(f"Sentence: '{sent}'")


document = Document('bebe.docx')
for paragraph in document.paragraphs:
    print(" ")
    print_sentence_style(paragraph)
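One thing that may explain the missing italics: in python-docx, run.bold and run.italic are tri-state. They are True or False only when the formatting is set directly on the run, and None when it is inherited from a style. The sketch below is only an assumption about how the inherited value could be looked up, by walking the run's character style and then the paragraph style. The helper names resolve_from_style, effective_format and report are made up for illustration, and the sketch ignores document defaults and Word's toggle rules, so it may not match what Word displays in every case.

```python
def resolve_from_style(style, attr):
    """Walk a style and its base_style chain and return the first
    explicit True/False found for attr ('bold' or 'italic'), else None."""
    while style is not None:
        value = getattr(style.font, attr)
        if value is not None:
            return value
        style = style.base_style
    return None


def effective_format(run, paragraph, attr):
    """Best-effort guess at whether a run is bold or italic.

    Hypothetical helper: direct run formatting wins, then the run's
    character style, then the paragraph style. Falls back to False.
    """
    direct = getattr(run, attr)  # True, False, or None (inherited)
    if direct is not None:
        return direct
    for style in (run.style, paragraph.style):
        value = resolve_from_style(style, attr)
        if value is not None:
            return value
    return False


def report(path):
    """Print the resolved bold/italic state of every run in a docx file."""
    # Imported here so the helpers above have no dependency on python-docx.
    from docx import Document

    document = Document(path)
    for paragraph in document.paragraphs:
        for run in paragraph.runs:
            bold = effective_format(run, paragraph, "bold")
            italic = effective_format(run, paragraph, "italic")
            print(f"Run text: {run.text!r}, Bold: {bold}, Italic: {italic}")
```

Calling report('bebe.docx') would print one line per run. If a pasted italic word shows Italic: True here while run.italic alone is None, the italic is coming from a style rather than from direct formatting.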
So the question is: if I copy text from a web page and it already contains bold, italic, and normal text, how can I format all of the text at once, without using "Clear Formatting" and then reapplying the formatting to each part separately?
I think that when you copy and paste text from a web page into the Word file, the HTML classes that wrap each paragraph are copied as well. This can be seen under Styles in the docx: when I click on a paragraph, I can see the class from the HTML file.
This seems to create a conflict of some kind. I think Python does not see the italic font in the docx because the CSS for the HTML class does not declare it as italic.
For example, in file.html I have this CSS:
.text_obisnuit_1{font:13px arial,garamond,sans-serif;color:#333;}
.text_obisnuit_2{font:13px arial,garamond,sans-serif;font-weight:bold;color:#333;}
The first class is normal text. The second class is bold text.
Probably, if a class had been declared with italic, Python would have recognized the class in Word. Right now, the italic font is visible in the docx file, but Python does not recognize it, and the italic class does not even appear in the Word styles, even though I see the paragraph as italic.
So, in an HTML page, if a paragraph with the text_obisnuit_1 class contains <em> and </em>, and text_obisnuit_1 does not declare font-style: italic; in its CSS, then Word will display the italic font in the .docx, but Python will not recognize it. Even Word will not show it in the styles, even though it is visible in the docx.
In short, you must take the <em> and </em> or <i> and </i> tags in the HTML files into consideration, even if the classes do not include the italic style.
I don't know how to fix this in a document with many pages. I have to run Python code on it, but the italic text will not be detected, and it takes a lot of time to use "Clear Formatting" and then change the style of each paragraph in Word.
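One possible way around the copy/paste step, offered only as a sketch, is to read the <em>, <i>, <b> and <strong> tags straight from the HTML and record the formatting of each text fragment explicitly. The class name InlineStyleParser and its bold_classes parameter are made up for illustration, and treating text_obisnuit_2 as a bold class is an assumption based on the CSS above. The sketch uses only the standard library html.parser module.

```python
from html.parser import HTMLParser


class InlineStyleParser(HTMLParser):
    """Collect (text, bold, italic) tuples from an HTML fragment.

    Hypothetical helper: bold/italic come from the inline tags, plus
    any CSS class names the caller declares to be bold.
    """

    BOLD_TAGS = {"b", "strong"}
    ITALIC_TAGS = {"i", "em"}

    def __init__(self, bold_classes=()):
        super().__init__()
        self.bold_classes = set(bold_classes)
        self.bold_depth = 0
        self.italic_depth = 0
        self.stack = []  # (tag, made_bold, made_italic) for each open tag
        self.runs = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        bold = tag in self.BOLD_TAGS or bool(self.bold_classes.intersection(classes))
        italic = tag in self.ITALIC_TAGS
        self.stack.append((tag, bold, italic))
        self.bold_depth += bold
        self.italic_depth += italic

    def handle_endtag(self, tag):
        # Ignore stray closing tags that were never opened.
        if not any(open_tag == tag for open_tag, _, _ in self.stack):
            return
        # Pop back to the matching open tag, undoing its formatting.
        while self.stack:
            open_tag, bold, italic = self.stack.pop()
            self.bold_depth -= bold
            self.italic_depth -= italic
            if open_tag == tag:
                break

    def handle_data(self, data):
        if data.strip():  # skip whitespace-only fragments
            self.runs.append((data, self.bold_depth > 0, self.italic_depth > 0))


parser = InlineStyleParser(bold_classes={"text_obisnuit_2"})
parser.feed('<p class="text_obisnuit_1">Plain <em>slanted</em> end</p>')
for text, bold, italic in parser.runs:
    print(f"{text!r}: bold={bold}, italic={italic}")
```

Each tuple could then be written into a new document with python-docx, using paragraph.add_run(text) and setting run.bold and run.italic directly, so that the formatting is stored on the run itself and not in a pasted style. This does not track paragraph boundaries, so it is a starting point and not a complete converter.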
You can test with this page, which has some italic paragraphs. Copy its content into a docx file and compare how the italic text appears in Word and in Python.
https://neculaifantanaru.com/en/delight-my-gaze-with-something-that-reflects-the-harmony-of-nature-II.html