Hello everyone!
I am trying to scrape an archive of Bulgarian meteorological data. Each page shows measurements from all meteorological stations at a given date and time (in theory, every three hours, 365/366 days a year). A sample page can be seen here.
Even though it’s in Bulgarian, you can grasp the main idea: a date-and-time selector, and once the date and time are submitted, the PHP script outputs a table containing station names and IDs, followed by temperature, humidity, cloud, snowfall, wind and other data.
Finding the URL parameter structure and generating a parametrized function to crawl the site is not a problem in either Excel or Power BI. I started in Excel and created a simple table with columns for Year, Month, Day, Hour and the corresponding URL. Power BI does a better job of extracting the pictogram file names from the CSS (their meanings are listed in a nomenclature, which I have also extracted, so each pictogram can be linked to its description in the data model). I therefore read the URL list into Power BI and created a function (built as a query by example) that fetches data from the parameter URL, so that invoking it produces a somewhat cleaned-up list of measurements for that date and time, and the main query collects the data from every URL in the list.
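For concreteness, a stripped-down version of that fetch function in M looks roughly like this (a sketch only: the real query-by-example function contains more cleanup steps, and the table index may differ on the actual site):

(url as text) as table =>
let
    // download the page and let M parse the HTML tables out of it
    page = Web.Page(Web.Contents(url)),
    // the measurements are presumably the first table on the page; adjust the index if not
    data = page{0}[Data]
in
    data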
My problem is that on some days and hours there is no data. When such a page is requested, it does not contain the uniform table but merely displays a message saying “data missing for DD-MM-YYYY HH UTC”, like here. When I run the query and Power BI hits such a page, the query just hangs and eventually times out. Running the same query against a subset of URLs known to contain valid data fetches the data in a form suitable for further cleanup.
So my question is: how do I write code (a function, or a modification of the query) that implements the following logic (in pseudocode, since I am not sure whether this ought to be done in M or DAX):
FOR EACH URL IN {list of URLs from my Excel table}
    IF (retrieved page contains no data)
    THEN fetch next URL
    ELSE RUN fnFetchPageData
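In M, I imagine the check looking something like this (a sketch only: the workbook path, the table name URLList, its URL column, the “data missing” marker text and the timeout value are all assumptions to be adjusted against the real site):

let
    Urls = Excel.Workbook(File.Contents("C:\Data\URLs.xlsx"), true){[Name = "URLList"]}[Data],
    AddData = Table.AddColumn(Urls, "Measurements", each
        let
            // fetch the raw HTML once, with a timeout so a bad page cannot hang the refresh
            raw = Text.FromBinary(Web.Contents([URL], [Timeout = #duration(0, 0, 1, 0)])),
            // marker text is a placeholder; copy the exact message from a real empty page
            hasData = not Text.Contains(raw, "data missing")
        in
            if hasData then fnFetchPageData([URL]) else null),
    Valid = Table.SelectRows(AddData, each [Measurements] <> null)
in
    Valid

As written this fetches each page twice (once for the check, once inside fnFetchPageData); if that proves too slow, fnFetchPageData could presumably be changed to accept the already-downloaded HTML, since Web.Page also takes HTML text directly.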
This processing could be done either in Excel, to clean the initial URL list, or as an additional check in the Power BI query that invokes the single-page fetch function. Suggestions on which is better are welcome.
Ideally, I would also like to make the scraping method future-proof for continuous scraping. As of today the URL list contains 60 897 rows, but several new valid URLs will be added every day. Ideas on where to keep a dynamic list - in Excel or in the Power BI data model - are also welcome, so that each subsequent scrape adds the new data. This involves filtering out URLs that are not yet valid - a different case from valid URLs with missing data.
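One idea I am considering for the dynamic list: generate the timestamps directly in M instead of maintaining them in Excel, so that every refresh extends the list up to the current time and future slots never appear. A sketch, where the archive start date and the fnBuildUrl helper (assembling the site URL from the date parts, like my Excel formula does) are assumptions:

let
    Start = #datetime(2004, 1, 1, 0, 0, 0),                 // assumed first date in the archive; adjust
    Now = DateTimeZone.RemoveZone(DateTimeZone.UtcNow()),   // pages are stamped in UTC
    // number of complete 3-hour slots between the start and now
    SlotCount = Number.RoundDown(Duration.TotalHours(Now - Start) / 3) + 1,
    Slots = List.DateTimes(Start, SlotCount, #duration(0, 3, 0, 0)),
    AsTable = Table.FromList(Slots, Splitter.SplitByNothing(), {"Timestamp"}),
    Typed = Table.TransformColumnTypes(AsTable, {{"Timestamp", type datetime}}),
    // fnBuildUrl is a hypothetical helper, not an existing function
    WithUrl = Table.AddColumn(Typed, "URL", each fnBuildUrl(
        Date.Year([Timestamp]), Date.Month([Timestamp]), Date.Day([Timestamp]), Time.Hour([Timestamp])))
in
    WithUrl

If something like this works, the not-yet-valid URLs would stop being a problem, since timestamps in the future are simply never generated, and each refresh would pick up only the newly added slots.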