PDF File Transformation

Rafał Jagniewski 1 Reputation point
2021-05-04T19:31:57.637+00:00

I am trying to reproduce the structure of the attached .pdf in Power BI by cleaning it in Power Query; there is no other way to access the data. How would you tackle it? I believe it covers the whole range of difficulties: offset columns (data scattered across many adjacent columns), text cleaning (some fields merged with relevant content), and immaterial information to remove (such as the last page or the top portion of the first page). I would like to generalize the solution as much as possible, since I receive similar .pdf files periodically.
When I compare how things look through the connector with how they look in the PDF, I notice quite a few values that get concatenated with other fields while the rest of the rows do not, and some fields have missing values because their content was concatenated into others. This by itself would prevent me from ever reaching the desired result. Here is a screenshot of one of those values.

93721-pqscreenshot.png

I would like to attach the original file, but I cannot upload a .PDF: the upload fails with the message "No such upload".

Best

Rafal

@Ehren (MSFT) - shared additional info in reply to comment


1 answer

  1. Daniel Perelman 6 Reputation points Microsoft Employee
    2021-05-26T20:25:26.77+00:00

    Thank you for your bug report. A fix for that extra text appearing out of nowhere will be included in Implementation 1.3 of Pdf.Tables, which will be released and become the default in the July 2021 release of Power BI. (Sorry, that's how long the release cycle is.)

    The technical explanation is that, as you may have noticed, you get that same text when copying and pasting out of the PDF in Adobe Reader, because the text really is in the PDF; it just isn't visible because the background of the table is drawn on top of it. I added some logic to ignore text that gets drawn over when extracting tables.
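
To make the idea in the paragraph above concrete, here is a minimal Python sketch of dropping text that a later-drawn filled rectangle covers. It is not the Pdf.Tables implementation: the `Box`, `TextRun`, and `FilledRect` types, the `visible_text` function, the `order` field (position in the page's drawing sequence), and the sample values are all hypothetical and exist only for illustration. It assumes text and fills have already been extracted with bounding boxes, and it treats text as hidden only when a fill drawn after it covers it completely.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned rectangle in page coordinates (hypothetical type)."""
    x0: float
    y0: float
    x1: float
    y1: float

    def contains(self, other: "Box") -> bool:
        # True when `other` lies entirely inside this box.
        return (self.x0 <= other.x0 and self.y0 <= other.y0
                and self.x1 >= other.x1 and self.y1 >= other.y1)


@dataclass(frozen=True)
class TextRun:
    text: str
    box: Box
    order: int  # assumed: index in the page's drawing sequence


@dataclass(frozen=True)
class FilledRect:
    box: Box
    order: int  # assumed: index in the page's drawing sequence


def visible_text(runs, fills):
    """Keep only the runs that no later-drawn fill fully covers."""
    return [
        run for run in runs
        if not any(f.order > run.order and f.box.contains(run.box)
                   for f in fills)
    ]


# Made-up sample page: one run is painted over by the table background,
# the other is drawn on top of that background.
runs = [
    TextRun("hidden note", Box(10, 10, 60, 20), order=0),
    TextRun("Invoice 42", Box(10, 10, 60, 20), order=2),
]
fills = [FilledRect(Box(0, 0, 100, 30), order=1)]

print([r.text for r in visible_text(runs, fills)])
```

A real extractor would also need to handle partial overlap, transparency, and clipping, which this sketch deliberately leaves out.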

    I understand that even with that bug fix, cleaning the data in that PDF is complicated, and it would be great if I could get Power BI to help more with that, but I think I got it to the point where it's not doing anything obviously wrong on importing the text as a table.