Azure Function that uses scrapy returns 404 with correct URL
Hello,
I have the following function, which uses Scrapy to crawl data from a specific site:
import azure.functions as func
import os
import datetime
import json
import logging
import subprocess
import sys

sys.path.append(os.path.abspath(os.path.join(os.path.dirname(__file__), 'X_crawler')))

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
from X_crawler.spiders.X import X

app = func.FunctionApp()

@app.route(route="crawl_X", auth_level=func.AuthLevel.ANONYMOUS)
def crawl_X(req: func.HttpRequest) -> func.HttpResponse:
    logging.info("Scrapy Azure Function triggered.")
    # Point Scrapy at the project settings module
    os.environ.setdefault('SCRAPY_SETTINGS_MODULE', 'X_crawler.settings')
    # Load settings from scrapy.cfg
    print(get_project_settings())
    process = CrawlerProcess(get_project_settings())
    process.crawl(X)
    process.start()  # Blocking call
    return func.HttpResponse('Crawling completed successfully', status_code=200)
The problem is that when I invoke the URL with this code deployed, I get a 404, and I also get no information from Log Stream, as if the request never happened. When I run the function locally using "func start", however, it works as expected.
When I comment out the last three imports (plus the code that depends on them), keeping only the logging line and the return statement, the function runs successfully. I therefore think this has to do with those imports. Why does that happen, why am I not getting any log information, and how can I fix it while keeping the crawler's code in a separate module, if possible?
PS: Note that I get the same error just by importing scrapy.crawler and scrapy.utils.
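For reference, the project layout is roughly the following (simplified, inferred from the imports above):

function_app.py
host.json
requirements.txt
X_crawler/
    __init__.py
    settings.py
    spiders/
        __init__.py
        X.py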
Azure Functions
-
Loknathsatyasaivarma Mahali • 2,740 Reputation points • Microsoft External Staff • Moderator
2025-03-18T17:44:19.37+00:00 Hello Lykos, Manos,
I understand you're encountering a 404 error with no logs when deploying a Scrapy crawler inside an Azure Function: the function works locally but fails in Azure, apparently as soon as Scrapy is imported.
- First, ensure all required packages (such as scrapy and azure-functions) are listed in requirements.txt. Run pip freeze > requirements.txt locally to generate this file.
- Add logging.basicConfig(level=logging.INFO) to ensure logs are captured. Also check Log Stream and enable Application Insights for more detailed logs.
- Lastly, verify that the function deployed correctly and check whether any errors are reported in Azure's Log Stream.
Note: Scrapy can be slow, so increase the timeout in host.json. Please try this and let us know.
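For example, a minimal host.json with the timeout raised to 10 minutes (the maximum on the Consumption plan) would look like this:

{
  "version": "2.0",
  "functionTimeout": "00:10:00"
}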
If you notice any errors during or after this process, please share them here so we can investigate the issue further.
-
Loknathsatyasaivarma Mahali • 2,740 Reputation points • Microsoft External Staff • Moderator
2025-03-19T17:56:05.63+00:00 Hello Lykos, Manos,
Just checking in to see if the information above was helpful. If you have any further updates on this issue, please feel free to post them here.
-
Loknathsatyasaivarma Mahali • 2,740 Reputation points • Microsoft External Staff • Moderator
2025-03-20T20:28:36.92+00:00 Hello Lykos, Manos,
Just checking in to see if the provided information helps you better understand and resolve your concern. If you have further questions, please feel free to drop them here, and we will assist you accordingly.
-
Lykos, Manos • 0 Reputation points
2025-03-27T12:28:16.63+00:00 First of all, I'm sorry for the delayed answer. Checking my deployments, I saw that they used something like a "cached" deployment (I don't have the exact message at hand). I therefore created a new function and also deployed refactored code. This, however, led to another error. Checking Application Insights, I see this exception:
Result: Failure
Exception: ReactorNotRestartable
Stack:
  File "/azure-functions-host/workers/python/3.10/LINUX/X64/azure_functions_worker/dispatcher.py", line 671, in _handle__invocation_request
    call_result = await self._loop.run_in_executor(
  File "/usr/local/lib/python3.10/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/azure-functions-host/workers/python/3.10/LINUX/X64/azure_functions_worker/dispatcher.py", line 1001, in _run_sync_func
    return ExtensionManager.get_sync_invocation_wrapper(context,
  File "/azure-functions-host/workers/python/3.10/LINUX/X64/azure_functions_worker/extension.py", line 211, in _raw_invocation_wrapper
    result = function(**args)
  File "/home/site/wwwroot/function_app.py", line 34, in crawl_mantinades
    process.start()  # Blocking call
  File "/home/site/wwwroot/.python_packages/lib/site-packages/scrapy/crawler.py", line 496, in start
    reactor.run(installSignalHandlers=install_signal_handlers)  # blocking call
  File "/home/site/wwwroot/.python_packages/lib/site-packages/twisted/internet/base.py", line 695, in run
    self.startRunning(installSignalHandlers=installSignalHandlers)
  File "/home/site/wwwroot/.python_packages/lib/site-packages/twisted/internet/base.py", line 926, in startRunning
    raise error.ReactorNotRestartable()
Checking the Log Stream, I get the following, which does not show any error:
2025-03-27T12:27:22Z [Verbose] Request successfully matched the route with name 'crawl_mantinades' and template 'api/crawl_mantinades'
2025-03-27T12:27:22Z [Information] Executing 'Functions.crawl_mantinades' (Reason='This function was programmatically called via the host APIs.', Id=c6ef5b30-5aed-4d0c-b004-22cd608a4d15)
2025-03-27T12:27:22Z [Information] Received FunctionInvocationRequest, request ID: 99155e54-2203-4ff0-8ccd-532a08d67a79, function ID: 01024e69-5e2a-508b-af21-1aadd6b3cf1d, function name: crawl_mantinades, invocation ID: c6ef5b30-5aed-4d0c-b004-22cd608a4d15, function type: sync, sync threadpool max workers: 5
2025-03-27T12:27:22Z [Information] Python HTTP trigger function processed a request.
2025-03-27T12:27:22Z [Information] Scrapy 2.12.0 started (bot: mantinades_crawler)
2025-03-27T12:27:22Z [Information] Versions: lxml 5.3.1.0, libxml2 2.12.9, cssselect 1.3.0, parsel 1.10.0, w3lib 2.3.1, Twisted 24.11.0, Python 3.10.16, pyOpenSSL 25.0.0 (OpenSSL 3.4.1 11 Feb 2025), cryptography 44.0.2, Platform Linux-5.10.102.2-microsoft-standard-x86_64-with-glibc2.31
2025-03-27T12:27:22Z [Information] Enabled addons: []
2025-03-27T12:27:22Z [Verbose] Using reactor: twisted.internet.epollreactor.EPollReactor
2025-03-27T12:27:22Z [Information] Telnet Password: dd6adb8c39119294
2025-03-27T12:27:22Z [Information] Enabled extensions: ['scrapy.extensions.corestats.CoreStats', 'scrapy.extensions.telnet.TelnetConsole', 'scrapy.extensions.memusage.MemoryUsage', 'scrapy.extensions.logstats.LogStats']
2025-03-27T12:27:22Z [Information] Overridden settings: {'BOT_NAME': 'mantinades_crawler', 'USER_AGENT': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36'}
2025-03-27T12:27:22Z [Information] Enabled downloader middlewares: ['scrapy.downloadermiddlewares.offsite.OffsiteMiddleware', 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware', 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware', 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware', 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware', 'scrapy.downloadermiddlewares.retry.RetryMiddleware', 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware', 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware', 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware', 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware', 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware', 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2025-03-27T12:27:22Z [Information] Enabled spider middlewares: ['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware', 'scrapy.spidermiddlewares.referer.RefererMiddleware', 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware', 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2025-03-27T12:27:22Z [Information] Enabled item pipelines: ['pipelines.RemoveDuplicatesPipeline', 'pipelines.MantinadesCrawlerPipeline']
2025-03-27T12:27:22Z [Information] Spider opened
2025-03-27T12:27:22Z [Information] ManagedIdentityCredential will use App Service managed identity
2025-03-27T12:27:22Z [Verbose] Obtaining token via managed identity on Azure App Service
2025-03-27T12:27:22Z [Information] Request URL: 'http://localhost:8081/msi/token?api-version=REDACTED&resource=REDACTED', method: 'GET'
2025-03-27T12:27:22Z [Verbose] http://localhost:8081 "GET /msi/token?api-version=2019-08-01&resource=https://storage.azure.com HTTP/1.1" 200 None
2025-03-27T12:27:22Z [Information] AppServiceCredential.get_token_info succeeded
2025-03-27T12:27:22Z [Information] ManagedIdentityCredential.get_token_info succeeded
2025-03-27T12:27:22Z [Information] Request URL: 'https://mantinadescrawleraccount.blob.core.windows.net/?comp=REDACTED&prefix=REDACTED&include=REDACTED', method: 'GET'
2025-03-27T12:27:22Z [Verbose] https://mantinadescrawleraccount.blob.core.windows.net:443 "GET /?comp=list&prefix=data&include= HTTP/1.1" 200 None
2025-03-27T12:27:22Z [Information] Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2025-03-27T12:27:22Z [Information] Telnet console listening on 127.0.0.1:6025
2025-03-27T12:27:22Z [Error] Executed 'Functions.crawl_mantinades' (Failed, Id=c6ef5b30-5aed-4d0c-b004-22cd608a4d15, Duration=105ms)
-
Loknathsatyasaivarma Mahali • 2,740 Reputation points • Microsoft External Staff • Moderator
2025-03-27T20:06:43.4933333+00:00 Hello Lykos, Manos,
Thanks for sharing the details.
The Scrapy crawler's inability to crawl pages in your Azure Function is likely due to missing dependencies, spider configuration errors, or environment-specific limitations of Azure Functions. To resolve this:
- Ensure a complete requirements.txt file, generated from a local virtual environment, is deployed with your function, so that all necessary Scrapy dependencies are included.
- Carefully verify your spider's start_urls and allowed_domains settings, and add logging statements within the spider's code to track execution (a sketch follows below).
- Temporarily disable any custom pipelines to isolate potential pipeline-related errors.
- Test network connectivity by attempting to crawl a simple website.
- Ensure the Scrapy CrawlerProcess is instantiated and started only once per function invocation, and for long-running crawls consider migrating to Azure Durable Functions or an App Service plan for increased resources.
- Implement comprehensive logging using Application Insights to capture detailed error information.
- If timeouts are suspected, increase the functionTimeout value within your host.json file.
- Verify successful deployment via the Azure portal logs.
Please try the above and let us know if you have any further concerns; we will guide you accordingly.
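For illustration, spider-level logging along the lines suggested above might look like this (the domain and start URL are placeholders, not your actual project values):

import scrapy

class X(scrapy.Spider):
    # Placeholder values; substitute your real domain and start page.
    name = "X"
    allowed_domains = ["example.com"]
    start_urls = ["https://example.com/"]

    def parse(self, response):
        # Logging each response makes progress visible in Log Stream and
        # Application Insights, so a silent crawl can be told apart from a stuck one.
        self.logger.info("Parsed %s (status %s)", response.url, response.status)
        yield {"url": response.url}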
-
Loknathsatyasaivarma Mahali • 2,740 Reputation points • Microsoft External Staff • Moderator
2025-03-28T21:26:43.3766667+00:00 Hello Lykos, Manos,
Just checking in to see if the above provided information was helpful and if you have any further concerns, please feel free to drop them here.
-
Loknathsatyasaivarma Mahali • 2,740 Reputation points • Microsoft External Staff • Moderator
2025-03-31T19:12:47.1266667+00:00 Hello Lykos, Manos,
We wanted to follow up on the issue you encountered. Please let us know if it's resolved or if you need further assistance.
-
Lykos, Manos • 0 Reputation points
2025-04-01T14:52:03.9033333+00:00 Hello @Loknathsatyasaivarma Mahali ,
I found the following solution: I used the subprocess library to run Scrapy from the command line. But this surfaced two problems:
- It showed me that the dependencies were not there, even though I have the correct requirements.txt; I fixed that by running the pip command via subprocess as well (see the sketch below).
- When the crawler runs, even though I use a custom user agent, I get a 403 error, which I'm still trying to fix. Do you know if this happens due to some networking restriction on Azure's side?
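Roughly, the workaround looks like this (simplified; the spider name and the working directory, which I assume contains scrapy.cfg, are adapted for this post):

import logging
import subprocess
import sys

import azure.functions as func

app = func.FunctionApp()

@app.route(route="crawl_X", auth_level=func.AuthLevel.ANONYMOUS)
def crawl_X(req: func.HttpRequest) -> func.HttpResponse:
    # Workaround 1: install the dependencies into the running worker, since
    # they were missing from the deployed environment despite requirements.txt.
    subprocess.run(
        [sys.executable, "-m", "pip", "install", "-r", "requirements.txt"],
        check=True,
    )
    # Workaround 2: run the spider via the Scrapy CLI in a child process; each
    # invocation then gets a fresh Twisted reactor, so ReactorNotRestartable
    # cannot occur in the worker process.
    result = subprocess.run(
        [sys.executable, "-m", "scrapy", "crawl", "X"],
        cwd="X_crawler",  # assumption: the directory containing scrapy.cfg
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        logging.error("Scrapy failed: %s", result.stderr)
        return func.HttpResponse(result.stderr, status_code=500)
    return func.HttpResponse("Crawling completed successfully", status_code=200)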
-
Lykos, Manos • 0 Reputation points
2025-04-01T17:44:36.7233333+00:00 If this cannot be overcome, I think I have two options:
- Use BeautifulSoup, where I will probably have the same problem.
- Deploy my Scrapy code to a service like Zyte or Scrapyd and have my function just call its API to get the results.
-
Dasari Kamali • 425 Reputation points • Microsoft External Staff • Moderator
2025-04-02T04:38:18.2866667+00:00 Hi Lykos, Manos,
Your issue is caused by Scrapy's use of Twisted's reactor, which Azure Functions does not allow to be restarted within the same process. Instead of process.start(), use process.start(stop_after_crawl=False), or consider running Scrapy as a separate process using subprocess.Popen(['scrapy', 'crawl', 'X']).
-
Dasari Kamali • 425 Reputation points • Microsoft External Staff • Moderator
2025-04-03T11:45:35.0966667+00:00 Hi Lykos, Manos,
Just following up to see if you had a chance to review my previous message.
-
Dasari Kamali • 425 Reputation points • Microsoft External Staff • Moderator
2025-04-04T11:46:16.26+00:00 Hi Lykos, Manos,
Following up to check if the issue is resolved or if you need any further assistance. Let me know how it's going!
-
Lykos, Manos • 0 Reputation points
2025-04-05T15:37:18.9133333+00:00 The problem now is that I get a 403 with a USER_AGENT that works locally, probably due to something networking-related.
-
Dasari Kamali • 425 Reputation points • Microsoft External Staff • Moderator
2025-04-07T04:33:26.26+00:00 Hi Lykos, Manos,
If possible, could you share your GitHub repository?