
File SDK 1.18.124 hangs indefinitely

Tamir Aviv 0 Reputation points
2026-05-28T14:27:01.9333333+00:00

Issue

File SDK 1.18.124 on Linux: default HttpDelegate hangs indefinitely during PolicyEngine fetch (dataservice.protection.outlook.com/PsorWebService) — request submitted, response never read

Versions

MIP SDK version: mip_sdk_file_ubuntu2404_1.18.124 (also reproduces with mip_sdk_file_rhel9_1.18.124)

Environment: Ubuntu 24.04 container; reproduces on both native amd64 (Intel Xeon, Ubuntu 20.04 host) and amd64-under-qemu (Apple Silicon). Same code worked correctly on 1.16.126.

Description

Symptom

FileEngine::AddEngineAsync future never resolves when MipConfiguration is created with isOfflineOnly=false. Reproduces 100% on first call after MipContext::Create / FileProfile::LoadAsync.

Last log lines before the wedge


http_director_impl.cpp:150  "Sending HTTP request: ID: {…}, Type: GET,
  Url: https://dataservice.protection.outlook.com/PsorWebService/v1/ClientSyncFile/MipPolicies,
  Headers['Authorization'] = 'UOID:…;Tenant:…;Audience:https://syncservice.o365syncservice.com/;Roles:UnifiedPolicy.Tenant.Read;…'"

Nothing is ever logged after this line. No HttpResponse, no error, no timeout. Our custom AuthDelegate returned a valid token in 710 ms (also logged by MIP at auth_request_transformer.cpp:180).

Smoking gun — kernel state while wedged

While hung, in the container:


$ cat /proc/<pid>/net/tcp        # decoded IPs
local     remote                              state
…:39476   52.102.115.192:443 (dataservice…)   CLOSE_WAIT

CLOSE_WAIT means the server sent FIN and the kernel delivered it, but the SDK never closed the socket or read the response. This is application-side, not network.
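For anyone repeating this check: /proc/<pid>/net/tcp stores addresses and states as hex, so the table above had to be decoded by hand. A minimal Python sketch of that decoding follows. It assumes IPv4 and a little-endian host (x86-64 or arm64); the function names are made up for illustration, and the local address in the sample row is invented because the original is elided.

```python
import socket
import struct

# Kernel TCP state codes as they appear in the "st" column of /proc/net/tcp.
TCP_STATES = {
    "01": "ESTABLISHED", "02": "SYN_SENT", "03": "SYN_RECV",
    "04": "FIN_WAIT1", "05": "FIN_WAIT2", "06": "TIME_WAIT",
    "07": "CLOSE", "08": "CLOSE_WAIT", "09": "LAST_ACK",
    "0A": "LISTEN", "0B": "CLOSING",
}


def decode_endpoint(hex_endpoint):
    """Decode an IPv4 'ADDR:PORT' hex pair (assumes a little-endian host)."""
    hex_addr, hex_port = hex_endpoint.split(":")
    addr = socket.inet_ntoa(struct.pack("<I", int(hex_addr, 16)))
    return addr, int(hex_port, 16)


def parse_proc_net_tcp(text):
    """Return (local, remote, state) tuples for each socket row in `text`."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        # Socket rows start with an index such as "0:"; the header does not.
        if len(fields) < 4 or not fields[0].endswith(":"):
            continue
        state = TCP_STATES.get(fields[3].upper(), fields[3])
        rows.append((decode_endpoint(fields[1]), decode_endpoint(fields[2]), state))
    return rows


def close_wait_sockets(text):
    """Keep only sockets the peer has closed but the application has not."""
    return [row for row in parse_proc_net_tcp(text) if row[2] == "CLOSE_WAIT"]


# Illustrative input; the local address 172.17.0.2 is a made-up value.
SAMPLE = """\
  sl  local_address rem_address   st tx_queue rx_queue tr tm->when retrnsmt   uid  timeout inode
   0: 020011AC:9A34 C0736634:01BB 08 00000000:00000000 00:00000000 00000000  1000        0 12345
   1: 0100007F:1F90 00000000:0000 0A 00000000:00000000 00:00000000 00000000     0        0 12346
"""

for local, remote, state in close_wait_sockets(SAMPLE):
    print(local, "->", remote, state)
```

To use it against a live process, pass the contents of /proc/<pid>/net/tcp to close_wait_sockets instead of SAMPLE.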

Confirmation — no MIP thread is polling sockets


$ for t in /proc/<pid>/task/*; do
    echo "$(basename $t) $(cat $t/comm) $(cat $t/wchan)"
  done
…
26 OneDS Task Disp   futex_wait_queue
28 Policy Profile    futex_wait_queue
29 Policy Profile    futex_wait_queue
…   (60+ MIP threads, all in futex_wait_queue)
18 cdm_function_ap   do_epoll_wait     ← Go runtime netpoller, not MIP

Every MIP thread is sleeping on a futex. The single epoll_wait thread is the host application's Go netpoller, not MIP's. No MIP thread is waiting on the socket events that the kernel has ready. The HTTP director's internal libcurl multi-handle / event loop has wedged.
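The same thread survey can be scripted so that it is easy to repeat on another host. The sketch below is only an illustration: the function names are hypothetical, and the rule that a polling thread has "poll" or "select" in its wait channel is an assumption, since kernel function names vary between versions.

```python
from collections import Counter
from pathlib import Path


def read_thread_states(pid):
    """Return (tid, comm, wchan) for each thread of `pid`, read from /proc."""
    threads = []
    for task in Path(f"/proc/{pid}/task").iterdir():
        try:
            comm = (task / "comm").read_text().strip()
            wchan = (task / "wchan").read_text().strip()
        except OSError:
            continue  # thread exited, or the file is not readable
        threads.append((int(task.name), comm, wchan))
    return sorted(threads)


def summarize(threads):
    """Count threads per wait channel and list those that look like pollers."""
    by_wchan = Counter(wchan for _, _, wchan in threads)
    # Assumption: a socket-polling thread sleeps in a kernel function whose
    # name mentions poll or select (for example do_epoll_wait).
    pollers = [(tid, comm) for tid, comm, wchan in threads
               if "poll" in wchan or "select" in wchan]
    return by_wchan, pollers


# Rows copied from the thread listing above.
SAMPLE = [
    (18, "cdm_function_ap", "do_epoll_wait"),
    (26, "OneDS Task Disp", "futex_wait_queue"),
    (28, "Policy Profile", "futex_wait_queue"),
    (29, "Policy Profile", "futex_wait_queue"),
]

by_wchan, pollers = summarize(SAMPLE)
print(dict(by_wchan))
print(pollers)
```

On a live system, summarize(read_thread_states(pid)) gives the same view. If the only poller belongs to the host application, no SDK thread is waiting on the socket.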

Network is fine

From inside the same container, same libcurl.so.4, same /etc/ssl/certs:


$ curl -v --max-time 30 https://dataservice.protection.outlook.com/PsorWebService/v1/ClientSyncFile/MipPolicies
< HTTP/2 401   (in 10 ms; 401 is expected, no token)
< policysync-auth-error: No valid OAuth token or Client Certificate were found.

Endpoint is reachable, cert chain validates, response is fast.

Workaround that confirms the bug is in MIP's HTTP layer

Implementing a custom mip::HttpDelegate using curl_easy_perform (synchronous, one easy handle per request — same libcurl.so.4 MIP uses internally) and registering it via MipConfiguration::SetHttpDelegate(...) resolves the hang completely:

  • AddEngineAsync future resolves
  • HTTP requests complete in expected timeframes (~100–700 ms)
  • Both success (200) and failure (400/401/500) responses surface cleanly to MIP and propagate up as expected NetworkError / NoAuthTokenError

This pinpoints the bug to MIP's default HTTP delegate (the libcurl multi-handle / event-loop wrapper), not to libcurl, OpenSSL, the network, or the Authorization header format.

Azure Information Protection

An Azure service that is used to control and help secure email, documents, and sensitive data that are shared outside the company.


2 answers

  1. Tamir Aviv 0 Reputation points
    2026-06-07T18:00:36.78+00:00

    Update — root cause found; this is not a MIP SDK regression.

    After deeper investigation, the hang was caused by a libcurl symbol collision in our own process, not by the MIP File SDK.

    Another dependency in our binary — TensorFlow's libtensorflow_framework.so — statically bundles its own copy of libcurl and exports the curl symbols (curl_easy_init, curl_easy_perform, curl_global_init, the full curl_multi_* set, etc.) unversioned into the global dynamic-symbol namespace.
    Because that library is loaded before the system libcurl.so.4, the MIP SDK's versioned references (curl_easy_init@CURL_OPENSSL_4, etc.) bound to TensorFlow's bundled curl instead of the system libcurl. Verified with LD_DEBUG=bindings:

    binding file /lib/libmip_core.so to .../libtensorflow_framework.so.2: symbol `curl_easy_init' [CURL_OPENSSL_4]
    binding file /lib/libmip_core.so to .../libtensorflow_framework.so.2: symbol `curl_easy_perform' [CURL_OPENSSL_4]

    So the SDK's default HTTP delegate was unknowingly running on a foreign curl build — which is what produced the CLOSE_WAIT socket with no MIP thread polling it.
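LD_DEBUG=bindings output is very large, so it helps to filter it mechanically. The following is a rough Python sketch, not part of the SDK, that flags every curl_* symbol bound to something other than libcurl. The function name and the regular expression are my own; the pattern is meant to accept both the abbreviated lines quoted above and the fuller glibc form with "[0]" markers and "normal symbol", but treat that as an assumption and check it against your own output.

```python
import os
import re

# Matches e.g.:
#   binding file A to B: symbol `curl_easy_init' [VER]
#   binding file A [0] to B [0]: normal symbol `curl_easy_init' [VER]
BINDING_RE = re.compile(
    r"binding file (?P<src>\S+)(?: \[\d+\])? to (?P<dst>\S+)(?: \[\d+\])?: "
    r"(?:\w+ )?symbol `(?P<sym>[^']+)'"
)


def foreign_curl_bindings(ld_debug_output, expected="libcurl.so"):
    """Return (importer, provider, symbol) for curl_* symbols that were
    bound to a library whose file name does not contain `expected`."""
    hits = []
    for line in ld_debug_output.splitlines():
        match = BINDING_RE.search(line)
        if match is None or not match.group("sym").startswith("curl_"):
            continue
        provider = match.group("dst")
        if expected not in os.path.basename(provider):
            hits.append((match.group("src"), provider, match.group("sym")))
    return hits


# First line is copied from the output above; the second is an illustrative
# example of a healthy binding in the fuller glibc format.
SAMPLE = (
    "binding file /lib/libmip_core.so to .../libtensorflow_framework.so.2: "
    "symbol `curl_easy_init' [CURL_OPENSSL_4]\n"
    "binding file /lib/libmip_core.so [0] to /usr/lib/x86_64-linux-gnu/libcurl.so.4 [0]: "
    "normal symbol `curl_easy_cleanup' [CURL_OPENSSL_4]\n"
)

for importer, provider, symbol in foreign_curl_bindings(SAMPLE):
    print(f"{importer}: {symbol} resolved to {provider}")
```

The input can be captured by running the application with LD_DEBUG=bindings and LD_DEBUG_OUTPUT pointing at a file prefix, then feeding the resulting file to foreign_curl_bindings. An empty result suggests the curl symbols resolved to the system libcurl.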

    Confirmation that the SDK is not at fault: keeping MIP 1.18.124 with its default HttpDelegate and only forcing the curl symbols to resolve to the system libcurl (LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libcurl.so.4) fully resolves the hang — AddEngineAsync completes and decryption works, with TensorFlow still loaded. The earlier 1.16-vs-1.18 correlation was coincidental with another build change introducing the bundled-curl dependency.

    One optional hardening suggestion: it may be worth a documentation note that the SDK requires its libcurl symbols to resolve to a compatible system libcurl, and that other libraries which statically bundle libcurl with default-visibility symbols can interpose on it and wedge the HTTP layer. It is a subtle footgun that is hard to diagnose.


  2. Jerald Felix 13,500 Reputation points Volunteer Moderator
    2026-05-29T05:31:41.1066667+00:00

    Hello Tamir Aviv

    Greetings! Thanks for raising this question in the Q&A forum.

    This is an exceptionally well-investigated bug report; you've done a thorough job isolating the issue, right down to the kernel TCP state and thread-level analysis. Let me summarise what's happening and guide you on the best path forward.

    Based on your findings, the root cause is clear: MIP SDK 1.18.124's default HTTP delegate (the internal libcurl multi-handle / event-loop wrapper) is failing to poll the socket for responses after the HTTP request is submitted. The CLOSE_WAIT TCP state confirms the server sent its response and closed the connection, but no MIP thread is reading it: all 60+ MIP threads are sleeping on a futex, and the epoll thread belongs to your Go runtime, not MIP. The network, libcurl, OpenSSL, and your auth token are all confirmed healthy. This is a regression in MIP's internal HTTP layer introduced between version 1.16.126 (working) and 1.18.124 (broken).

    Here's the recommended course of action:

    Step 1: Keep Your Custom HttpDelegate as the Immediate Workaround

    Your custom mip::HttpDelegate using curl_easy_perform is the correct and fully supported workaround for now. Since it resolves the hang completely and surfaces errors properly, continue using it in production while this SDK bug is being fixed. This is not a hack: registering a custom HttpDelegate via MipConfiguration::SetHttpDelegate(...) is an officially supported pattern in the MIP SDK.

    Step 2: Report This as a Bug to Microsoft via the MIP SDK GitHub Repository

    This is clearly a product bug and needs to be reported so the MIP SDK team can fix it in an upcoming release. Please open a bug report at the official Microsoft Information Protection SDK GitHub repository: https://github.com/MicrosoftDocs/Azure-RMSDocs/issues

    In your bug report, include all the excellent diagnostic details you've already gathered — the SDK version, OS, the CLOSE_WAIT TCP state evidence, the futex thread dump, the confirmation that 1.16.126 works, and that the custom delegate resolves the issue. This level of detail will help the SDK team reproduce and fix it quickly.

    Step 3: Open a Microsoft Support Ticket to Escalate the Fix

    In parallel, open a support ticket at https://aka.ms/azuresupport under Azure Information Protection / MIP SDK. Reference this Q&A thread and your GitHub issue. Escalating through support in addition to GitHub increases visibility with the product team and may result in a faster fix or a hotfix build being prioritised. Ask specifically whether a fixed build or a newer patch version is available or on the roadmap for the libcurl multi-handle event loop regression in 1.18.124.

    Step 4: Test Against 1.17.x if Available

    If your deployment constraints allow, try the SDK version 1.17.x (if available for Ubuntu 24.04) to confirm the regression was introduced in 1.18.x versus 1.17.x. This binary bisect information will be very useful for the SDK team to identify exactly which commit introduced the event-loop bug.

    Step 5: Check for a Newer Patch Version

    Check the MIP SDK NuGet/package feed and the release notes for any patch version newer than 1.18.124 that may have already addressed this: https://learn.microsoft.com/en-us/information-protection/develop/version-release-history

    If a newer build is available, test it with the default HttpDelegate to see if the regression has been fixed upstream.

    Again, your diagnosis is excellent and your workaround is solid. The custom HttpDelegate approach is the right short-term fix, and getting this reported through both GitHub and Microsoft Support is the right way to get a permanent fix in the SDK itself.

    If this answer helps, kindly accept it, which will help others who have similar questions.

    Best Regards,

    Jerald Felix.

