Azure Function that uses scrapy returns 404 with correct URL
Hello,
I have the following function, which uses Scrapy to crawl data from a specific site:
import azure.functions as func
import os
import datetime
import json
import logging
import subprocess
import sys

sys.path.append(os.path.abspath(os.path.join(os.path.dirname(__file__), 'X_crawler')))

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
from X_crawler.spiders.X import X

app = func.FunctionApp()

@app.route(route="crawl_X", auth_level=func.AuthLevel.ANONYMOUS)
def crawl_X(req: func.HttpRequest) -> func.HttpResponse:
    logging.info("Scrapy Azure Function triggered.")
    # Point Scrapy at the project settings module
    os.environ.setdefault('SCRAPY_SETTINGS_MODULE', 'X_crawler.settings')
    # Load settings from scrapy.cfg
    print(get_project_settings())
    process = CrawlerProcess(get_project_settings())
    process.crawl(X)
    process.start()  # Blocking call
    return func.HttpResponse('Crawling completed successfully', status_code=200)
The problem is that when I invoke the URL with this code deployed, I get a 404, and I also get no information from Log Stream, as if the request never happened. When I run the function locally using "func start", however, it works as expected.
When I comment out the last three imports (plus the code that depends on them), keeping only the logging line and the return statement, the function runs successfully. I therefore think this has to do with those imports. Why does that happen, why am I not getting any log information, and how can I fix it while keeping the crawler's code in a separate module, if possible?
PS: Note that I get the same error just by importing scrapy.crawler and scrapy.utils.
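For reference, the project layout is roughly the following (simplified, inferred from the imports above):

function_app.py
host.json
requirements.txt
X_crawler/
    __init__.py
    settings.py
    spiders/
        __init__.py
        X.py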
Azure Functions
-
Loknathsatyasaivarma Mahali • 2,740 Reputation points • Microsoft External Staff • Moderator
2025-03-18T17:44:19.37+00:00 Hello Lykos, Manos,
I understand you're encountering a 404 error with no logs when deploying a Scrapy crawler inside an Azure Function: the function works locally but fails in Azure, apparently as soon as Scrapy is imported.
- First, ensure all required packages (such as scrapy and azure-functions) are listed in requirements.txt. Run pip freeze > requirements.txt locally to generate this file.
- Add logging.basicConfig(level=logging.INFO) to ensure logs are captured. Also check Log Stream and enable Application Insights for more detailed logs.
- Lastly, verify that the function deployed correctly and check whether any errors are reported in Azure's Log Stream.
Note: Scrapy can be slow, so increase the timeout in host.json. Please try this and let us know.
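For example, a minimal host.json with the timeout raised to 10 minutes (the maximum on the Consumption plan) would look like this:

{
  "version": "2.0",
  "functionTimeout": "00:10:00"
}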
If you notice any errors during or after this process, please share them here so we can investigate the issue further.
-
Loknathsatyasaivarma Mahali • 2,740 Reputation points • Microsoft External Staff • Moderator
2025-03-19T17:56:05.63+00:00 Hello Lykos, Manos,
Just checking in to see if the information above was helpful. If you have any further updates on this issue, please feel free to post them here.
-
Loknathsatyasaivarma Mahali • 2,740 Reputation points • Microsoft External Staff • Moderator
2025-03-20T20:28:36.92+00:00 Hello Lykos, Manos,
Just checking in to see if the provided information helps you better understand and resolve your concern. If you have further questions, please feel free to drop them here, and we will assist you accordingly.
-
Lykos, Manos • 0 Reputation points
2025-03-27T12:28:16.63+00:00 First of all, I'm sorry for the delayed answer. Checking my deployments, I saw that they used something like a "cached" deployment (I don't have the exact message at hand). I therefore created a new function and also deployed refactored code. This, however, led to another error. Checking Application Insights, I see this exception:
Result: Failure
Exception: ReactorNotRestartable
Stack:
  File "/azure-functions-host/workers/python/3.10/LINUX/X64/azure_functions_worker/dispatcher.py", line 671, in _handle__invocation_request
    call_result = await self._loop.run_in_executor(
  File "/usr/local/lib/python3.10/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/azure-functions-host/workers/python/3.10/LINUX/X64/azure_functions_worker/dispatcher.py", line 1001, in _run_sync_func
    return ExtensionManager.get_sync_invocation_wrapper(context,
  File "/azure-functions-host/workers/python/3.10/LINUX/X64/azure_functions_worker/extension.py", line 211, in _raw_invocation_wrapper
    result = function(**args)
  File "/home/site/wwwroot/function_app.py", line 34, in crawl_mantinades
    process.start()  # Blocking call
  File "/home/site/wwwroot/.python_packages/lib/site-packages/scrapy/crawler.py", line 496, in start
    reactor.run(installSignalHandlers=install_signal_handlers)  # blocking call
  File "/home/site/wwwroot/.python_packages/lib/site-packages/twisted/internet/base.py", line 695, in run
    self.startRunning(installSignalHandlers=installSignalHandlers)
  File "/home/site/wwwroot/.python_packages/lib/site-packages/twisted/internet/base.py", line 926, in startRunning
    raise error.ReactorNotRestartable()
Checking the Log Stream, I get the following, which does not show any error:
2025-03-27T12:27:22Z [Verbose] Request successfully matched the route with name 'crawl_mantinades' and template 'api/crawl_mantinades'
2025-03-27T12:27:22Z [Information] Executing 'Functions.crawl_mantinades' (Reason='This function was programmatically called via the host APIs.', Id=c6ef5b30-5aed-4d0c-b004-22cd608a4d15)
2025-03-27T12:27:22Z [Information] Received FunctionInvocationRequest, request ID: 99155e54-2203-4ff0-8ccd-532a08d67a79, function ID: 01024e69-5e2a-508b-af21-1aadd6b3cf1d, function name: crawl_mantinades, invocation ID: c6ef5b30-5aed-4d0c-b004-22cd608a4d15, function type: sync, sync threadpool max workers: 5
2025-03-27T12:27:22Z [Information] Python HTTP trigger function processed a request.
2025-03-27T12:27:22Z [Information] Scrapy 2.12.0 started (bot: mantinades_crawler)
2025-03-27T12:27:22Z [Information] Versions: lxml 5.3.1.0, libxml2 2.12.9, cssselect 1.3.0, parsel 1.10.0, w3lib 2.3.1, Twisted 24.11.0, Python 3.10.16, pyOpenSSL 25.0.0 (OpenSSL 3.4.1 11 Feb 2025), cryptography 44.0.2, Platform Linux-5.10.102.2-microsoft-standard-x86_64-with-glibc2.31
2025-03-27T12:27:22Z [Information] Enabled addons: []
2025-03-27T12:27:22Z [Verbose] Using reactor: twisted.internet.epollreactor.EPollReactor
2025-03-27T12:27:22Z [Information] Telnet Password: dd6adb8c39119294
2025-03-27T12:27:22Z [Information] Enabled extensions: ['scrapy.extensions.corestats.CoreStats', 'scrapy.extensions.telnet.TelnetConsole', 'scrapy.extensions.memusage.MemoryUsage', 'scrapy.extensions.logstats.LogStats']
2025-03-27T12:27:22Z [Information] Overridden settings: {'BOT_NAME': 'mantinades_crawler', 'USER_AGENT': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36'}
2025-03-27T12:27:22Z [Information] Enabled downloader middlewares: ['scrapy.downloadermiddlewares.offsite.OffsiteMiddleware', 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware', 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware', 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware', 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware', 'scrapy.downloadermiddlewares.retry.RetryMiddleware', 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware', 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware', 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware', 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware', 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware', 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2025-03-27T12:27:22Z [Information] Enabled spider middlewares: ['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware', 'scrapy.spidermiddlewares.referer.RefererMiddleware', 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware', 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2025-03-27T12:27:22Z [Information] Enabled item pipelines: ['pipelines.RemoveDuplicatesPipeline', 'pipelines.MantinadesCrawlerPipeline']
2025-03-27T12:27:22Z [Information] Spider opened
2025-03-27T12:27:22Z [Information] ManagedIdentityCredential will use App Service managed identity
2025-03-27T12:27:22Z [Verbose] Obtaining token via managed identity on Azure App Service
2025-03-27T12:27:22Z [Information] Request URL: 'http://localhost:8081/msi/token?api-version=REDACTED&resource=REDACTED', method: 'GET'
2025-03-27T12:27:22Z [Verbose] http://localhost:8081 "GET /msi/token?api-version=2019-08-01&resource=https://storage.azure.com HTTP/1.1" 200 None
2025-03-27T12:27:22Z [Information] AppServiceCredential.get_token_info succeeded
2025-03-27T12:27:22Z [Information] ManagedIdentityCredential.get_token_info succeeded
2025-03-27T12:27:22Z [Information] Request URL: 'https://mantinadescrawleraccount.blob.core.windows.net/?comp=REDACTED&prefix=REDACTED&include=REDACTED', method: 'GET'
2025-03-27T12:27:22Z [Verbose] https://mantinadescrawleraccount.blob.core.windows.net:443 "GET /?comp=list&prefix=data&include= HTTP/1.1" 200 None
2025-03-27T12:27:22Z [Information] Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2025-03-27T12:27:22Z [Information] Telnet console listening on 127.0.0.1:6025
2025-03-27T12:27:22Z [Error] Executed 'Functions.crawl_mantinades' (Failed, Id=c6ef5b30-5aed-4d0c-b004-22cd608a4d15, Duration=105ms)
-
Loknathsatyasaivarma Mahali • 2,740 Reputation points • Microsoft External Staff • Moderator
2025-03-27T20:06:43.4933333+00:00 Hello Lykos, Manos,
Thanks for sharing the details.
The Scrapy crawler's inability to crawl pages in your Azure Function is likely due to missing dependencies, spider configuration errors, or environment-specific limitations of Azure Functions. To resolve this:
- Ensure a complete requirements.txt file, generated from a local virtual environment, is deployed with your function, so that all necessary Scrapy dependencies are included.
- Carefully verify your spider's start_urls and allowed_domains settings, and add logging statements within the spider's code to track execution (a sketch follows below).
- Temporarily disable any custom pipelines to isolate potential pipeline-related errors.
- Test network connectivity by attempting to crawl a simple website.
- Ensure the Scrapy CrawlerProcess is instantiated and started only once per function invocation, and for long-running crawls consider migrating to Azure Durable Functions or an App Service plan for increased resources.
- Implement comprehensive logging using Application Insights to capture detailed error information.
- If timeouts are suspected, increase the functionTimeout value within your host.json file.
- Verify successful deployment via the Azure portal logs.
Please try the above and let us know if you have any further concerns; we will guide you accordingly.
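For illustration, spider-level logging along the lines suggested above might look like this (the domain and start URL are placeholders, not your actual project values):

import scrapy

class X(scrapy.Spider):
    # Placeholder values; substitute your real domain and start page.
    name = "X"
    allowed_domains = ["example.com"]
    start_urls = ["https://example.com/"]

    def parse(self, response):
        # Logging each response makes progress visible in Log Stream and
        # Application Insights, so a silent crawl can be told apart from a stuck one.
        self.logger.info("Parsed %s (status %s)", response.url, response.status)
        yield {"url": response.url}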
-
Loknathsatyasaivarma Mahali • 2,740 Reputation points • Microsoft External Staff • Moderator
2025-03-28T21:26:43.3766667+00:00 Hello Lykos, Manos,
Just checking in to see if the above provided information was helpful and if you have any further concerns, please feel free to drop them here.
-
Loknathsatyasaivarma Mahali • 2,740 Reputation points • Microsoft External Staff • Moderator
2025-03-31T19:12:47.1266667+00:00 Hello Lykos, Manos,
We wanted to follow up on the issue you encountered. Please let us know if it's resolved or if you need further assistance.
-
Lykos, Manos • 0 Reputation points
2025-04-01T14:52:03.9033333+00:00 Hello @Loknathsatyasaivarma Mahali ,
I found the following solution: I used the subprocess library to run Scrapy from the command line. But this surfaced two problems:
- It showed me that the dependencies were not there, even though I have the correct requirements.txt; I fixed that by running the pip command via subprocess as well (see the sketch below).
- When the crawler runs, even though I use a custom user agent, I get a 403 error, which I'm still trying to fix. Do you know if this happens due to some networking restriction on Azure's side?
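Roughly, the workaround looks like this (simplified; the spider name and the working directory, which I assume contains scrapy.cfg, are adapted for this post):

import logging
import subprocess
import sys

import azure.functions as func

app = func.FunctionApp()

@app.route(route="crawl_X", auth_level=func.AuthLevel.ANONYMOUS)
def crawl_X(req: func.HttpRequest) -> func.HttpResponse:
    # Workaround 1: install the dependencies into the running worker, since
    # they were missing from the deployed environment despite requirements.txt.
    subprocess.run(
        [sys.executable, "-m", "pip", "install", "-r", "requirements.txt"],
        check=True,
    )
    # Workaround 2: run the spider via the Scrapy CLI in a child process; each
    # invocation then gets a fresh Twisted reactor, so ReactorNotRestartable
    # cannot occur in the worker process.
    result = subprocess.run(
        [sys.executable, "-m", "scrapy", "crawl", "X"],
        cwd="X_crawler",  # assumption: the directory containing scrapy.cfg
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        logging.error("Scrapy failed: %s", result.stderr)
        return func.HttpResponse(result.stderr, status_code=500)
    return func.HttpResponse("Crawling completed successfully", status_code=200)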
-
Lykos, Manos • 0 Reputation points
2025-04-01T17:44:36.7233333+00:00 If this cannot be overcome, I think I have two options:
- Use BeautifulSoup, where I will probably have the same problem.
- Deploy my Scrapy code to a service like Zyte or Scrapyd and have my function just call its API to get the results.
-
Dasari Kamali • 425 Reputation points • Microsoft External Staff • Moderator
2025-04-02T04:38:18.2866667+00:00 Hi Lykos, Manos,
Your issue is caused by Scrapy's use of Twisted's reactor, which Azure Functions does not allow to be restarted within the same process. Instead of process.start(), use process.start(stop_after_crawl=False), or consider running Scrapy as a separate process using subprocess.Popen(['scrapy', 'crawl', 'X']).
-
Dasari Kamali • 425 Reputation points • Microsoft External Staff • Moderator
2025-04-03T11:45:35.0966667+00:00 Hi Lykos, Manos,
Just following up to see if you had a chance to review my previous message.
-
Dasari Kamali • 425 Reputation points • Microsoft External Staff • Moderator
2025-04-04T11:46:16.26+00:00 Hi Lykos, Manos,
Following up to check if the issue is resolved or if you need any further assistance. Let me know how it's going!
-
Lykos, Manos • 0 Reputation points
2025-04-05T15:37:18.9133333+00:00 The problem now is that I get a 403 with a USER_AGENT that works locally, probably due to something networking-related.
-
Dasari Kamali • 425 Reputation points • Microsoft External Staff • Moderator
2025-04-07T04:33:26.26+00:00 Hi Lykos, Manos,
If possible, could you share your GitHub repository?