Parallelize URL scraping in Function Apps (Python)

Question

Hi,

I am facing issues/errors when running a timer-triggered function app. The function triggers the code, which runs just fine until I add the parallelization. The code is shown below:

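import multiprocessing
import pandas as pd

# web_scraper_function, url_list_to_scrape and AZURE_CONN are defined elsewhere
df_list = []
with multiprocessing.Pool(processes=4) as p:
    for result in p.imap(web_scraper_function, url_list_to_scrape):
        price = result[1]
        item_id = result[2]
        df_list.append([price, item_id])
# Write everything to the database only after all URLs have been scraped
df_final = pd.DataFrame(df_list)
df_final.to_sql('table1', AZURE_CONN, schema='one', if_exists='append', index=False)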

The issues/errors I face are:
(1) After scraping 6000/7000 URLs, I get:
(a) Timeout value of 00:05:00 exceeded by function 'Functions.TimerTest123456' (Id: 'xxxx'). Initiating cancellation.
(b) Executed '{functionName}' ({status}, Id={invocationId}, Duration={executionDuration}ms)
(c) Executed 'Functions.TimerTest123456' (Failed, Id=xxxx, Duration=300142ms)
(2) It never gets to send df_final to our database (hosted in Azure).

Would anyone be able to help me make this code work so that the DataFrame is written to the database? Or, alternatively (though not preferable), to change the way I am approaching the parallelization so that it works?

Accepted Answer

Hi @Micaela ,

Thanks for posting this query in the Q&A forum.

Instead of handling parallelization in your code, you can use the built-in ability to run the function on more threads. Setting the app setting PYTHON_THREADPOOL_THREAD_COUNT to a value between 2 and 32 lets you achieve parallelization.

https://learn.microsoft.com/en-us/azure/azure-functions/functions-app-settings#python_threadpool_thread_count
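
If you would rather keep some parallelism in your own code, the following is only a minimal sketch of one possible approach, not a tested fix for your app. It assumes the scraping is I/O-bound, so threads from the standard library's concurrent.futures are used in place of multiprocessing.Pool, and it hands results back in batches so that each batch can be written to the database before the 00:05:00 timeout shown in your log is reached. The helper name scrape_in_batches and the values for batch_size, max_workers and budget_seconds are assumptions chosen for illustration.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def scrape_in_batches(scrape, urls, batch_size=500, max_workers=8, budget_seconds=240):
    """Yield lists of scrape(url) results, one list per batch of URLs.

    No new batch is started once budget_seconds have elapsed, so the caller
    can persist finished batches before the host cancels the invocation.
    All default values here are illustrative assumptions.
    """
    deadline = time.monotonic() + budget_seconds
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for start in range(0, len(urls), batch_size):
            if time.monotonic() >= deadline:
                break  # out of time: leave the remaining URLs for a later run
            batch = urls[start:start + batch_size]
            # pool.map preserves input order and re-raises any scraper exception
            yield list(pool.map(scrape, batch))


# Hypothetical usage with the names from the question:
# for batch in scrape_in_batches(web_scraper_function, url_list_to_scrape):
#     rows = [[r[1], r[2]] for r in batch]
#     pd.DataFrame(rows).to_sql('table1', AZURE_CONN, schema='one',
#                               if_exists='append', index=False)
```

Note that the time check only happens between batches, so a batch that is already running can still overrun the budget. Keeping batch_size small relative to how quickly your URLs are scraped reduces that risk.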

Hope this helps!

