Folks,
We are running into the following issue. (I am not a native Arabic speaker and cannot read or write Arabic, but I can work with the Unicode code points.) The steps to reproduce are as follows:
- Create a new MS Word Document.
- Copy and paste the following string (as an example): "الأحكام والشروط"
- Set the font to standard Arial, size 10.
- Save the file as PDF using Save As, which is standard functionality in MS Word.
- Open the PDF File and try selecting the text.
The following are the issues:
- In MS Word, all the characters are correct Unicode. The code points (hex) for the above string are:
| Unicode (hex) | Description |
|---|---|
| 0627 | LETTER ALEF |
| 0644 | LETTER LAM |
| 0623 | LETTER ALEF WITH HAMZA ABOVE |
| 062D | LETTER HAH |
| 0643 | LETTER KAF |
| 0627 | LETTER ALEF |
| 0645 | LETTER MEEM |
| 0020 | SPACE |
| 0648 | LETTER WAW |
| 0627 | LETTER ALEF |
| 0644 | LETTER LAM |
| 0634 | LETTER SHEEN |
| 0631 | LETTER REH |
| 0648 | LETTER WAW |
| 0637 | LETTER TAH |
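For anyone who wants to reproduce the table above without reading Arabic, here is a minimal Python sketch that lists the code point and official Unicode name of each character. The function name `describe` and the variable `sample` are just illustrative names, not part of any tool mentioned in this post; the string is written with escapes so it survives copy and paste.

```python
import unicodedata

def describe(text):
    """Return (hex code point, Unicode name) for every character in text."""
    return [(f"{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN")) for ch in text]

# The example string from the steps above, written as escapes.
sample = (
    "\u0627\u0644\u0623\u062D\u0643\u0627\u0645"   # first word
    " "
    "\u0648\u0627\u0644\u0634\u0631\u0648\u0637"   # second word
)

for code, name in describe(sample):
    print(code, name)
```

The same function can be run on the text pasted from the PDF to get the second table.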
- When we open the PDF file created by MS Word, press CTRL+A (select all text), copy, and inspect the copied text, the code points (hex) are as follows:
| Unicode (hex) | Description | Remarks |
|---|---|---|
| 0627 | ARABIC LETTER ALEF | |
| 0644 | ARABIC LETTER LAM | |
| 0623 | ARABIC LETTER ALEF WITH HAMZA ABOVE | |
| 062D | ARABIC LETTER HAH | |
| 0643 | ARABIC LETTER KAF | |
| 0627 | ARABIC LETTER ALEF | |
| 0645 | ARABIC LETTER MEEM | |
| 0020 | SPACE | |
| 0648 | ARABIC LETTER WAW | |
| 0627 | ARABIC LETTER ALEF | |
| 0627 | ARABIC LETTER ALEF | Original code point was 0644 |
| 0634 | ARABIC LETTER SHEEN | |
| 0631 | ARABIC LETTER REH | |
| 0648 | ARABIC LETTER WAW | |
| 0627 | ARABIC LETTER ALEF | Original code point was 0637 |
As you can see, when the MS Word document is saved as PDF, certain characters are silently replaced, which amounts to a loss of data in the text of the PDF file.
Visually, the PDF file looks the same: every character appears to be rendered correctly.
We have also tried Adobe Acrobat Professional to convert the Arabic document to PDF, and the issue remains the same.
On a 20-page document, when we compare the original MS Word characters with the corresponding text extracted from the PDF via copy and paste, about 17% of the characters are replaced. We cannot identify any pattern in the replacements.
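To make the comparison concrete, here is a rough Python sketch of a position-by-position check, run on the short example string rather than on the 20-page document. It assumes both strings have the same length and the same character order, which holds for the example but may not hold for every PDF extraction. The names `diff_code_points`, `original` and `extracted` are illustrative only.

```python
def diff_code_points(original, extracted):
    """Compare two strings position by position.

    Returns a list of (index, original hex, extracted hex) for every
    mismatch, plus the fraction of positions in `original` that differ.
    Assumes equal length and order; zip() stops at the shorter string.
    """
    mismatches = [
        (i, f"{ord(a):04X}", f"{ord(b):04X}")
        for i, (a, b) in enumerate(zip(original, extracted))
        if a != b
    ]
    rate = len(mismatches) / len(original) if original else 0.0
    return mismatches, rate

# Code points as typed in MS Word (first table).
original = ("\u0627\u0644\u0623\u062D\u0643\u0627\u0645 "
            "\u0648\u0627\u0644\u0634\u0631\u0648\u0637")
# Code points as copied from the PDF (second table).
extracted = ("\u0627\u0644\u0623\u062D\u0643\u0627\u0645 "
             "\u0648\u0627\u0627\u0634\u0631\u0648\u0627")

mismatches, rate = diff_code_points(original, extracted)
for index, was, now in mismatches:
    print(f"position {index}: {was} became {now}")
print(f"replaced: {rate:.1%}")
```

For the example string this reports two replaced characters out of fifteen: LAM (0644) and TAH (0637) both come back as ALEF (0627).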
Request support for the above.