Web scraper azure setup

Question

Monteiro Del Prete 20 Reputation points

I developed a web scraper that runs over several days. I tried Azure Container Apps as the execution environment, and it ran for 10 days with a very strange pattern.

[image: CPU usage percentage over time]

As you can see above, the CPU usage percentage suddenly started going up and down. During this time the scraper's responses were sporadic, as if execution kept pausing and resuming. Then, without any completion message, I got this CPU timeline:

[image: CPU timeline]

I was expecting a database insertion at the end of the scraping process, but the only message I received is that it is waiting for new requests (the scraper is API-based and starts with a specific API call).

Container configuration:

4 CPU cores
8Gi memory size

[image]

Loknathsatyasaivarma Mahali 2,665 Reputation points Microsoft External Staff Moderator

2025-03-22T00:38:50.3066667+00:00

Hello Monteiro Del Prete,
Your web scraper, running in a containerized environment, is experiencing fluctuating CPU usage and sporadic responses, likely due to resource constraints, inefficient code, or issues with external services. Potential causes include hitting CPU or memory limits within the container, garbage collection pauses (if using languages like Python or Java), network latency, database connection problems, or concurrency issues like deadlocks. To address this, monitor CPU, memory, and network usage with tools like Prometheus and Grafana.

Review your scraper’s code for inefficiencies, implement error handling and backoff strategies for API rate limits, and optimize database inserts. Ensure proper container configuration, including resource allocation and health checks, to prevent unexpected restarts. By systematically addressing these factors, you can identify and resolve performance issues, leading to more stable and efficient scraper behavior.
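
As an illustration of the backoff suggestion above, here is a minimal sketch of retrying with exponential backoff and jitter. The names `retry_with_backoff` and `fetch`, and the default limits, are assumptions made for this example and are not taken from the original scraper.

```python
import random
import time


def retry_with_backoff(fetch, max_attempts=5, base_delay=1.0, max_delay=60.0,
                       sleep=time.sleep):
    """Call fetch() until it succeeds, waiting longer after each failure.

    fetch is a hypothetical zero-argument callable (for example, one page
    request). Real code should catch only the errors that are worth retrying.
    """
    for attempt in range(max_attempts):
        try:
            return fetch()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            # Upper bound doubles each attempt and is capped at max_delay;
            # the actual wait is a random value below that bound (jitter).
            delay = random.uniform(0, min(max_delay, base_delay * 2 ** attempt))
            sleep(delay)
```

The `sleep` parameter is only there so the waiting can be replaced in tests; in normal use the default `time.sleep` applies.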
Loknathsatyasaivarma Mahali 2,665 Reputation points Microsoft External Staff Moderator

2025-03-24T18:34:53.6333333+00:00

Hello Monteiro Del Prete,

Just checking in to see if the information above was helpful. If you have any further concerns on this issue, please feel free to drop them here.
Monteiro Del Prete 20 Reputation points

2025-03-25T07:50:35.33+00:00

@Loknathsatyasaivarma Mahali Thank you for the information. Actually, I'm running the scraper as a Container Apps job. In the meantime I have checked for code inefficiencies and reviewed the database insertion. I'd like to give you updates in the next few days.
Monteiro Del Prete 20 Reputation points

2025-03-26T07:46:07.0666667+00:00

@Loknathsatyasaivarma Mahali It is experiencing fluctuating CPU usage again after about two days of execution.

The best I can do is to use the tool you named. Below is the network traffic in bytes.

Furthermore, shouldn't the CPU be saturated if the cause were code inefficiencies? The graph shows the opposite. I honestly think the problem could be garbage collector pauses.
Arko 4,150 Reputation points Microsoft External Staff Moderator

2025-03-31T07:48:51.28+00:00

Hope the above suggestion was helpful.
Monteiro Del Prete 20 Reputation points

2025-03-31T08:01:03.61+00:00

It was really helpful, thank you Arko. The script has now been running since March 27 and does not seem to be experiencing CPU fluctuation. What I did was restart the Selenium driver once in a while. Now I'm trying to optimize it by looking at GC profiling/logs.
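
For readers who want to try the same workaround, a periodic driver restart could be sketched as below. This is not the code used in this thread: `RecyclingDriver`, `factory`, and the page and age limits are hypothetical names and values, and the only Selenium call assumed is the driver's `quit()` method.

```python
import time


class RecyclingDriver:
    """Recreate a browser driver after a page budget or an age limit."""

    def __init__(self, factory, max_pages=200, max_age_s=3600.0,
                 clock=time.monotonic):
        # factory is a hypothetical callable that builds a fresh driver,
        # e.g. lambda: webdriver.Chrome(options=options)
        self._factory = factory
        self._max_pages = max_pages
        self._max_age_s = max_age_s
        self._clock = clock
        self._driver = None
        self._pages = 0
        self._born = 0.0

    def get(self):
        """Return a live driver, replacing it first if it is past its budget."""
        expired = self._driver is not None and (
            self._pages >= self._max_pages
            or self._clock() - self._born >= self._max_age_s
        )
        if expired:
            self.close()
        if self._driver is None:
            self._driver = self._factory()
            self._pages = 0
            self._born = self._clock()
        self._pages += 1
        return self._driver

    def close(self):
        """Quit the current driver, if any, and forget it."""
        if self._driver is not None:
            try:
                self._driver.quit()  # ends the browser and the driver process
            finally:
                self._driver = None
```

The scraping loop would call `get()` once per page instead of holding on to a single driver for the whole run, and `close()` when it finishes.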
Arko 4,150 Reputation points Microsoft External Staff Moderator

2025-04-01T09:50:02.8133333+00:00

Good to know, Monteiro Del Prete. Since my suggestion was helpful, I am converting it to an answer. Please accept the solution. Thank you.

Accepted answer

Answer 1

It seems like you are dealing with GC-related pauses or memory fragmentation that increase over time (a "slow leak" scenario). This is especially common in long-running Python, Java, or Node.js apps doing heavy in-memory operations. If memory usage spikes or objects are held in memory unnecessarily (e.g., large lists/dicts), GC eventually struggles to clean up.
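
If the scraper is written in Python, one way to check for this kind of slow growth is to compare two `tracemalloc` snapshots taken some time apart. The sketch below is only an illustration under that assumption; `top_growth` is a hypothetical helper name, and the list of byte strings stands in for real scraped data.

```python
import tracemalloc


def top_growth(before, after, limit=5):
    """Return the source lines whose allocated memory grew between snapshots."""
    stats = after.compare_to(before, "lineno")
    return [s for s in stats if s.size_diff > 0][:limit]


if __name__ == "__main__":
    tracemalloc.start()
    before = tracemalloc.take_snapshot()
    held = [bytes(1024) for _ in range(10_000)]  # stand-in for retained data
    after = tracemalloc.take_snapshot()
    for stat in top_growth(before, after):
        print(stat)  # file, line, and size difference for each growing site
```

In a long-running job the two snapshots would be taken hours apart, so that lines that keep growing stand out from normal short-lived allocations.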

I would recommend enabling GC profiling/logs:

If using Python: use gc.set_debug(gc.DEBUG_STATS). Note that Python writes this output to stderr, not stdout.
If using .NET: enable GC ETW events or use Diagnostic Tools.
This can confirm whether GC activity lines up with the drops in CPU/network.

Another workaround: since this happens roughly every 2 days, you can schedule a job restart every 48 hours as a temporary measure, via an automation rule or a CRON-triggered stop/start. Also add liveness probes if they are not already configured, so the system can restart the job if it becomes unresponsive.
