How do I skip pages with no data when web scrapping using Power Query in Excel or Power BI?

NN 1 Reputation point
2020-07-16T09:48:48.577+00:00

Hallo everyone!

I am trying to web scrap an archive with Bulgarian metheorological data. Each page opens measurements over all metheorological stations on a given date and time (in theory, every three hours, 365/366 days a year). A sample page can be seen here.
Even though it’s in Bulgarian, you can grasp the main idea - a date and time selector and after the date and time is submitted, the PHP script outputs a table containing station names and ID’s followed by temperature, humidiry, cloud, snowfall, wind and other data.

Finding the URL parameter structure and generating a parametrized function to crawl the site is not a problem both in Excel and Power BI; I started in Excel and created a simple table containing columns for Year, Month, Day, Hour and the corresponding URL. Power BI does better in extracting the pictogram file names from the CSS (for their meaning, there is a nomenclature which I also have extracted so each pictogram can be linked to its description in a data model). Therefore I then went on reading the URL list into Power BI and creating a function that contains a query by example to fetch data from the parameter URL so that invoking the function would create a somewhat cleaned up list of measurements for that date and time, and the query would get all data from every URL in the list.

My problem is that on some days and hours, there’s no data. When such page is selected, it doesn’t contain the uniform table but merely displays a message saying “data missing for DD-MM-YYYY HH UTC”, like here. When I run the query and Power BI hits such page, the query just hangs and eventuallt times out. Running the same query with a subset of some URLs known to contain valid data fetches data in a form suitable for further cleanup.

So my question is: How do I write code (a function or modification of the query) to implement the following logic (in pseudocode since I'm not sure if coding ought to be done in M or DAX):

FOR EACH URL IN {list of URLS from my Excel table} 
     IF (retrieved page contains no data) 
     THEN fetch next URL 
     ELSE RUN fnFetchPageData

This processing may be done both in Excel in order to clean the initial URL list, or as an additional check in the Power BI query invoking the function that fetches data from a single page of measurements. Suggsteions are welcome on which is best.

Ideally, I would also like to make the scraping method future-proof for continuous scraping. As of today the URL list contains 60 897 rows but it will be complemented daily by several new valid URLs. Ideas on where to make a dynamic list - whether in Excel or in the PowerBI data model - are also welcome so as each subsequent scrapping would add the new data. This involves some filtering of URLs that are not yet valid - a different use case from valid URLs with missing data.

Community Center | Not monitored
0 comments No comments
{count} votes

3 answers

Sort by: Most helpful
  1. Anonymous
    2020-07-16T13:03:44.303+00:00

    Power Bi is not currently supported here on QnA. They're actively answering questions in dedicated forums here.
    https://community.powerbi.com/
    https://social.technet.microsoft.com/Forums/en-US/home?category=powerbi

    --please don't forget to Accept as answer if the reply is helpful--


    Regards, Dave Patrick ....
    Microsoft Certified Professional
    Microsoft MVP [Windows Server] Datacenter Management

    Disclaimer: This posting is provided "AS IS" with no warranties or guarantees, and confers no rights.

    0 comments No comments

  2. NN 1 Reputation point
    2020-07-17T12:43:32.51+00:00

    Thanks, I'll post there as well. What about doing the required URL purging task purely in Excel?


  3. NN 1 Reputation point
    2020-07-17T13:08:48.81+00:00

    HI @Anonymous ,

    I would be very grateful if you could post my message at the forum you suggested. For some odd reason I don't seem to be able to log in there with my credentials from here but if trying to register there, the system tells me there's already an account.

    Many thanks in advance and have a nice day :-)


Your answer

Answers can be marked as Accepted Answers by the question author, which helps users to know the answer solved the author's problem.