It looks like you are dealing with GC-related pauses or memory fragmentation that get worse over time (a "slow leak" scenario). This is especially common in long-running Python, Java, or Node.js apps doing heavy in-memory operations. If memory usage spikes, or objects such as large lists/dicts stay referenced longer than needed, the heap keeps growing and each collection has more live objects to scan, so GC pauses can get longer and more frequent.
I would recommend enabling GC profiling/logging:
- If using Python: call gc.set_debug(gc.DEBUG_STATS) at startup. The statistics are written to stderr, so make sure stderr is captured in the job logs.
- If using .NET: enable GC ETW events or use Diagnostic Tools.
- This can confirm whether GC activity lines up with the drop in CPU/network.
As another temporary workaround, since this happens every 2 days, you can schedule a job restart every 48 hours via an automation rule or a cron-triggered stop/start. Also add liveness probes if they are not already configured, so the system can restart the job if it becomes unresponsive.
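How you wire up a liveness probe depends on your platform, which I don't know, so treat the following as a generic sketch rather than the way to do it. The idea is a heartbeat file: the job touches it from its main loop, and whatever runs the probe checks that the file is recent. HEARTBEAT_PATH, MAX_AGE_SECONDS, beat and is_alive are all hypothetical names and values.

```python
import time
from pathlib import Path

HEARTBEAT_PATH = Path("/tmp/job.heartbeat")  # hypothetical location
MAX_AGE_SECONDS = 120                        # assumed staleness limit


def beat(path=HEARTBEAT_PATH):
    """Call from the job's main loop to record that it is still making progress."""
    path.touch()  # creates the file, or refreshes its modification time


def is_alive(path=HEARTBEAT_PATH, max_age=MAX_AGE_SECONDS, now=None):
    """Return True if the heartbeat was refreshed within max_age seconds."""
    now = time.time() if now is None else now
    try:
        return now - path.stat().st_mtime <= max_age
    except FileNotFoundError:
        # No heartbeat has ever been written, so treat the job as not alive.
        return False
```

The probe side would call is_alive() (or just check the file's age from a shell command) and trigger a restart when it returns False. Keep beat() in the loop that does the real work, not in a separate thread, otherwise a stalled job can still look healthy.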