Hi Rishav Arora,
It seems like you're encountering intermittent ConnectionResetError (104, 'Connection reset by peer') errors when using the Azure Search SDK (azure-search-documents==11.5.2) while iterating over paged results in a Python app deployed on AKS. This is a known class of issue: the connection between the client and Azure Search can be interrupted by network instability, idle timeouts, or socket reuse patterns in containerized environments like AKS.
- Use RetryPolicy from azure-core to automatically retry transient errors like connection resets:
from azure.core.credentials import AzureKeyCredential
from azure.core.pipeline.policies import RetryPolicy
from azure.search.documents import SearchClient

# Retry transient failures: connection resets, read timeouts, retriable status codes
retry_policy = RetryPolicy(
    retry_total=5,
    retry_connect=2,
    retry_read=2,
    retry_status=2,
    retry_backoff_factor=0.8,
    retry_backoff_max=30,
)

# Pass the policy to the client; it replaces the pipeline's default retry policy.
# (Passing it to the transport has no effect, since retries run in the pipeline.)
search_client = SearchClient(
    endpoint=SEARCH_ENDPOINT,
    index_name=INDEX_NAME,
    credential=AzureKeyCredential(API_KEY),
    retry_policy=retry_policy,
)
- Socket timeouts help prevent hanging connections. Configure them on RequestsTransport (values are in seconds):
transport = RequestsTransport(connection_timeout=10, read_timeout=30)
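For completeness, here's a minimal sketch wiring the timeout-configured transport into the client alongside the retry policy from above (SEARCH_ENDPOINT, INDEX_NAME, and API_KEY are placeholders for your own values):
from azure.core.credentials import AzureKeyCredential
from azure.core.pipeline.transport import RequestsTransport
from azure.search.documents import SearchClient

# Fail fast on dead connections instead of hanging indefinitely
transport = RequestsTransport(connection_timeout=10, read_timeout=30)

search_client = SearchClient(
    endpoint=SEARCH_ENDPOINT,
    index_name=INDEX_NAME,
    credential=AzureKeyCredential(API_KEY),
    transport=transport,
    retry_policy=retry_policy,  # the RetryPolicy defined earlier
)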
- Avoid long-lived singleton clients in AKS. In container environments like AKS, TCP connections may become stale when pods restart. You can:
  - Re-create the SearchClient per request, or
  - Use a TTL-based cache with periodic re-instantiation (e.g., every 5–10 minutes), as sketched below
This avoids reuse of broken connections.
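A minimal TTL-cache sketch, assuming a 5-minute TTL and a hypothetical make_search_client() factory that builds the client as shown in the snippets above:
import time

_CLIENT_TTL_SECONDS = 300  # re-create the client every 5 minutes
_cached_client = None
_client_created_at = 0.0

def get_search_client():
    """Return a cached SearchClient, re-creating it after the TTL expires."""
    global _cached_client, _client_created_at
    now = time.monotonic()
    if _cached_client is None or now - _client_created_at > _CLIENT_TTL_SECONDS:
        _cached_client = make_search_client()  # hypothetical factory for the client
        _client_created_at = now
    return _cached_client
If multiple worker threads share the client, guard the refresh with a lock so only one thread re-creates it.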
- Handle errors during iteration. Wrap your paging loop to catch failures and retry:
from azure.core.exceptions import ServiceResponseError
import time

for attempt in range(3):
    try:
        # Re-issue the search on each attempt: a paged iterator cannot be
        # resumed once it has failed mid-stream ("query" is your search text)
        results = search_client.search(search_text=query)
        for result in results:
            process(result)
        break  # completed without errors
    except ServiceResponseError as e:
        log.error(f"Search iteration failed (attempt {attempt + 1}): {e}")
        time.sleep(2 ** attempt)  # exponential backoff: 1s, 2s, 4s
This captures failures during streaming and retries cleanly. Note that re-issuing the search replays results already processed on the failed attempt, so make process() idempotent or track progress if that matters for your workload.
- If your workload opens many outbound connections (especially through a Standard Load Balancer), you might be hitting SNAT port exhaustion: https://github.com/jometzg/diagnosing-aks-port-exhaustion
- Use Azure NAT Gateway, which supports more than 1 million SNAT ports and is resilient to exhaustion: https://learn.microsoft.com/en-us/azure/aks/nat-gateway
- Alternatively, adjust the Standard Load Balancer outbound rules: set a 4–5 minute TCP idle timeout and tune the allocated ports per instance.
The simplest fix is to deploy a NAT Gateway on your AKS subnet.
https://learn.microsoft.com/en-us/troubleshoot/azure/azure-kubernetes/connectivity/snat-port-exhaustion?tabs=for-a-linux-pod
Hope this helps. If you have any further concerns or queries, please feel free to reach out to us.