Hello everyone!
I am trying to scrape an archive of Bulgarian meteorological data. Each page shows measurements from all meteorological stations at a given date and time (in theory, every three hours, 365/366 days a year). A sample page can be seen here.
Even though it’s in Bulgarian, you can grasp the main idea: a date-and-time selector, and once the date and time are submitted, the PHP script outputs a table containing station names and IDs, followed by temperature, humidity, cloud, snowfall, wind and other data.
Finding the URL parameter structure and generating a parametrized function to crawl the site is not a problem in either Excel or Power BI. I started in Excel and created a simple table with columns for Year, Month, Day, Hour and the corresponding URL. Power BI does a better job of extracting the pictogram file names from the CSS (their meanings are listed in a nomenclature, which I have also extracted, so each pictogram can be linked to its description in the data model). I therefore read the URL list into Power BI and created a function (built as a query by example) that fetches data from the parameter URL, so that invoking it produces a somewhat cleaned-up list of measurements for that date and time, and the main query collects the data from every URL in the list.
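For concreteness, a stripped-down version of that fetch function in M looks roughly like this (a sketch only: the real query-by-example function contains more cleanup steps, and the table index may differ on the actual site):

(url as text) as table =>
let
    // download the page and let M parse the HTML tables out of it
    page = Web.Page(Web.Contents(url)),
    // the measurements are presumably the first table on the page; adjust the index if not
    data = page{0}[Data]
in
    data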
My problem is that on some days and hours there is no data. When such a page is requested, it does not contain the uniform table but merely displays a message saying “data missing for DD-MM-YYYY HH UTC”, like here. When I run the query and Power BI hits such a page, the query just hangs and eventually times out. Running the same query against a subset of URLs known to contain valid data fetches the data in a form suitable for further cleanup.
So my question is: how do I write code (a function, or a modification of the query) that implements the following logic (in pseudocode, since I am not sure whether this ought to be done in M or DAX):
FOR EACH URL IN {list of URLs from my Excel table}
    IF (retrieved page contains no data)
    THEN fetch next URL
    ELSE RUN fnFetchPageData
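In M, I imagine the check looking something like this (a sketch only: the workbook path, the table name URLList, its URL column, the “data missing” marker text and the timeout value are all assumptions to be adjusted against the real site):

let
    Urls = Excel.Workbook(File.Contents("C:\Data\URLs.xlsx"), true){[Name = "URLList"]}[Data],
    AddData = Table.AddColumn(Urls, "Measurements", each
        let
            // fetch the raw HTML once, with a timeout so a bad page cannot hang the refresh
            raw = Text.FromBinary(Web.Contents([URL], [Timeout = #duration(0, 0, 1, 0)])),
            // marker text is a placeholder; copy the exact message from a real empty page
            hasData = not Text.Contains(raw, "data missing")
        in
            if hasData then fnFetchPageData([URL]) else null),
    Valid = Table.SelectRows(AddData, each [Measurements] <> null)
in
    Valid

As written this fetches each page twice (once for the check, once inside fnFetchPageData); if that proves too slow, fnFetchPageData could presumably be changed to accept the already-downloaded HTML, since Web.Page also takes HTML text directly.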
This processing could be done either in Excel, to clean the initial URL list, or as an additional check in the Power BI query that invokes the single-page fetch function. Suggestions on which is better are welcome.
Ideally, I would also like to make the scraping method future-proof for continuous scraping. As of today the URL list contains 60 897 rows, but several new valid URLs will be added every day. Ideas on where to keep a dynamic list - in Excel or in the Power BI data model - are also welcome, so that each subsequent scrape adds the new data. This involves filtering out URLs that are not yet valid - a different case from valid URLs with missing data.
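One idea I am considering for the dynamic list: generate the timestamps directly in M instead of maintaining them in Excel, so that every refresh extends the list up to the current time and future slots never appear. A sketch, where the archive start date and the fnBuildUrl helper (assembling the site URL from the date parts, like my Excel formula does) are assumptions:

let
    Start = #datetime(2004, 1, 1, 0, 0, 0),                 // assumed first date in the archive; adjust
    Now = DateTimeZone.RemoveZone(DateTimeZone.UtcNow()),   // pages are stamped in UTC
    // number of complete 3-hour slots between the start and now
    SlotCount = Number.RoundDown(Duration.TotalHours(Now - Start) / 3) + 1,
    Slots = List.DateTimes(Start, SlotCount, #duration(0, 3, 0, 0)),
    AsTable = Table.FromList(Slots, Splitter.SplitByNothing(), {"Timestamp"}),
    Typed = Table.TransformColumnTypes(AsTable, {{"Timestamp", type datetime}}),
    // fnBuildUrl is a hypothetical helper, not an existing function
    WithUrl = Table.AddColumn(Typed, "URL", each fnBuildUrl(
        Date.Year([Timestamp]), Date.Month([Timestamp]), Date.Day([Timestamp]), Time.Hour([Timestamp])))
in
    WithUrl

If something like this works, the not-yet-valid URLs would stop being a problem, since timestamps in the future are simply never generated, and each refresh would pick up only the newly added slots.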