Web crawling an Azure web application - service / crawl account - Azure Active Directory single sign-on?

nicholas dipiazza 26 Reputation points

We have a set of Azure web applications on a private Azure cloud. These web apps are just a collection of dynamic web pages.

So we want to run a web crawler and crawl that content.

-- Clarification -- We are not crawling SharePoint. We are crawling Azure web application sites.

But how do we authenticate? When we go to the page, it prompts for Microsoft single sign-on. Username/password methods such as NTLM or form-based auth (using HTTP or Selenium) are not available. By default, we only allow single sign-on through the Azure Active Directory cloud login.

We know that application registrations, a service account, and maybe OAuth might be involved in this, but we are having a hard time finding the specifics of what exactly to do here.

What is the method of obtaining Federated Auth/SPOIDCRL cookies for crawling Azure web sites?

Should we use an SDK? Or is it something we can set up in curl, Postman, etc.?


3 answers

  1. 2022-09-05T06:38:09.033+00:00

    Hello @nicholas dipiazza, and thanks for reaching out. To crawl Azure AD protected web apps, the crawler does not need to worry about the specific protocol used (OIDC, OAuth, SAML, etc.), since web apps usually abstract it away. It does, however, need to interact with the Azure AD login UI, pass credentials, and react to additional prompts such as MFA. Among other things, this requires core browser capabilities such as cookie management, client-side storage, and JavaScript rendering (the latter two for JavaScript-enabled web apps).

    Let us know if you need additional assistance. If the answer was helpful, please accept it and complete the quality survey so that others can find a solution.

  2. nicholas dipiazza 26 Reputation points

    @Alfredo Revilla - Senior Freelance SWE, SWA, IAM Hi, sorry it took me so long to get the information together.

    I had to send this as an answer, as it is the only way I could get the forum to let me post the reply.

    We are able to obtain the bearer token:

    curl --location --request POST 'https://login.microsoftonline.com/xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx/oauth2/v2.0/token' \
    --header 'Content-Type: application/x-www-form-urlencoded' \
    --header 'Cookie: fpc=xxxxxxxxxxxxxxxxxxxx; stsservicecookie=estsfd; x-ms-gateway-slice=estsfd' \
    --data-urlencode 'client_secret=xxxxxxxxxxxxxxxxxxxxxxxxxx' \
    --data-urlencode 'scope=api://xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx/.default' \
    --data-urlencode 'client_id=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx' \
    --data-urlencode 'grant_type=client_credentials'

    Then, when attempting to access the site using our bearer token, we get a 401.71 error (error code 2147500037).
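    For reference, the request that sends the token to the site can be sketched roughly as follows. This is a minimal sketch, not the exact command used: `fetch_with_token` is a hypothetical helper name, the URL and `ACCESS_TOKEN` are placeholders, and it assumes the site expects the token in a standard `Authorization: Bearer` header.

```shell
# Hypothetical helper: request a page, sending the access token as a
# bearer token. Prints the response body followed by the HTTP status code.
fetch_with_token() {
  url=$1
  token=$2
  curl --silent --show-error --location \
    --header "Authorization: Bearer ${token}" \
    --write-out '\n%{http_code}\n' \
    "$url"
}

# Example call (placeholder values, not real ones):
# fetch_with_token 'https://xxxxxxxxxx.azurewebsites.net/' "$ACCESS_TOKEN"
```

    Printing the status code on its own line makes it easier to tell a 401 apart from a redirect to the login page when comparing environments.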

    The application log shows:

     HTTP Error 401.71 - Unauthorized  
     You do not have permission to view this directory or page.  
     Most likely causes:  
     The authenticated user does not have access to a resource needed to process the request.  
     Things you can try:  
     Create a tracing rule to track failed requests for this HTTP status code. For more information about creating a tracing rule for failed requests, click here.  
     Detailed Error Information:  
     Module       EasyAuthModule_32bit  
     Notification       AuthenticateRequest  
     Handler       ExtensionlessUrlHandler-Integrated-4.0  
     Error Code       0x80004005  
     Requested URL       http://xxxxxxxxxx:80/  
     Physical Path       D:\home\site\wwwroot  
     Logon Method       Not yet determined  
     Logon User       Not yet determined  
     More Information:  
     This is the generic Access Denied error returned by IIS. Typically, there is a substatus code associated with this error that describes why the server denied the request. Check the IIS Log file to determine whether a substatus code is associated with this failure.  
     View more information »  

    Then the IIS log shows this:

     2022-09-21 16:21:19 XXXXXXXXXXXXXXXXX GET / X-ARR-LOG-ID=2017f9cd-64d3-4305-924d-029d37c53390 80 - ::1 AlwaysOn ARRAffinity=270fc76c7a748acb7bb3a328ed3b3e85783de79ee41831feff7c3c2118b4802a - XXXXXXXXXXXXXXXXX.azurewebsites.net 401 71 2147500037 705 693 13  

    So it looks like the bearer token is letting me in, but then I get some sort of failure due to a lack of permissions.

    When I set this same thing up on my test site, it works and I can access the page.
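    Since the same setup works on the test site, one thing that could be compared between the two environments is the content of each access token. The sketch below decodes the claims (the middle segment) of a JWT without verifying its signature. `decode_jwt_payload` is a hypothetical helper name, it assumes a `base64` command that accepts `-d`, and whether a claim such as `aud` or `roles` actually explains the 401.71 is an assumption to check, not a confirmed cause.

```shell
# Hypothetical helper: print the JSON claims of a JWT.
# This only decodes the token; it does NOT validate the signature.
decode_jwt_payload() {
  # Take the second dot-separated segment and map base64url to base64.
  payload=$(printf '%s' "$1" | cut -d '.' -f 2 | tr '_-' '/+')
  # Restore the padding that base64url encoding strips.
  case $(( ${#payload} % 4 )) in
    2) payload="${payload}==" ;;
    3) payload="${payload}=" ;;
  esac
  printf '%s' "$payload" | base64 -d
}

# Example call (placeholder token variable):
# decode_jwt_payload "$ACCESS_TOKEN"
```

    Running this against the token from the test site and the token from the enterprise site would show whether the two tokens differ in anything other than tenant and application IDs.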

    The Azure web application folder permissions are probably the culprit here, but I don't really know where to look to grant this access.

    So my enterprise Azure web app team needs to update something, but we don't know what.

    Should we open a support ticket to get assistance with that?

  3. nicholas dipiazza 26 Reputation points

    Note: I had to write my comment as an answer because the "Comment" functionality appears to be broken. I cannot reply to your comment.
