Parallelize URL scraping in Function Apps (Python)

Micaela 21 Reputation points


I am facing errors when running a timer-triggered function app. The function triggers the code, which runs fine until I add the parallelization. The code is shown below:

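import multiprocessing
import pandas as pd

# web_scraper_function, url_list_to_scrape and AZURE_CONN are defined elsewhere
df_list = []
with multiprocessing.Pool(processes=4) as p:
    # imap yields results in the same order as the input URLs
    for result in p.imap(web_scraper_function, url_list_to_scrape):
        price = result[1]
        item_id = result[2]
        df_list.append([price, item_id])

# Build one DataFrame from all rows and append it to the database table
df_final = pd.DataFrame(df_list)
df_final.to_sql('table1', AZURE_CONN, schema='one', if_exists='append', index=False)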

The errors I get are:
(1) After scraping 6000/7000 URLs, I get:
(a) Timeout value of 00:05:00 exceeded by function 'Functions.TimerTest123456' (Id: 'xxxx'). Initiating cancellation.
(b) Executed '{functionName}' ({status}, Id={invocationId}, Duration={executionDuration}ms)
(c) Executed 'Functions.TimerTest123456' (Failed, Id=xxxx, Duration=300142ms)
(2) The function never gets to write df_final to our database (hosted in Azure).

Would anyone be able to help me make this code work so that the DataFrame is written to the database? Or, alternatively (though not preferable), to change the way I am approaching the parallelization so that it works?

Azure Functions
An Azure service that provides an event-driven serverless compute platform.

Accepted answer
  1. MughundhanRaveendran-MSFT 12,226 Reputation points

    Hi @Micaela ,

    Thanks for posting this query in Q&A forum.

    Instead of parallelizing in the code, you can use the built-in ability to run the function on more threads. Setting the app setting PYTHON_THREADPOOL_THREAD_COUNT to a value between 2 and 32 achieves this parallelization.
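
As far as I understand, that app setting controls how many threads the Python worker uses to run function invocations, so it is set on the Function App rather than in code. For comparison, a thread-based pattern inside a single invocation might look like the minimal sketch below. It assumes the scraping is I/O-bound; the `web_scraper_function` body here is a hypothetical stand-in that does no network access, and `scrape_all` and `max_workers=4` are illustrative names and values, not part of the original code.

```python
from concurrent.futures import ThreadPoolExecutor


def web_scraper_function(url):
    """Hypothetical stand-in for the real scraper; returns (url, price, item_id)."""
    # A real implementation would fetch and parse the page here.
    item_id = url.rsplit("/", 1)[-1]
    return (url, 9.99, item_id)


def scrape_all(urls, max_workers=4):
    """Scrape URLs concurrently with threads and collect [price, item_id] rows."""
    rows = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map returns results in input order, like Pool.imap.
        for result in pool.map(web_scraper_function, urls):
            rows.append([result[1], result[2]])
    return rows


if __name__ == "__main__":
    sample_urls = ["https://example.com/a", "https://example.com/b"]
    print(scrape_all(sample_urls))
```

The returned rows can be passed to `pd.DataFrame` and `to_sql` exactly as in the question. Threads avoid spawning extra processes, but they do not by themselves lift the five-minute timeout shown in the logs.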

    Hope this helps!

