Scraping internal website pages for content source in AOAI

Jai-6363 60 Reputation points

I am implementing the Enterprise ChatGPT accelerator and am looking to add functionality to scrape internal website pages for content to be used as a source. With many web scraping options available, what would be the best or ideal option that fits well with AOAI?

Azure AI services
A group of Azure services, SDKs, and APIs designed to make apps more intelligent, engaging, and discoverable.

Accepted answer
  1. Ramr-msft 14,521 Reputation points

    Jai-6363 Thanks for the question. If you are using Cognitive Search, you can try using a Cognitive Search indexer to crawl that website and add its content to the index.

    Here is the documentation on crawling with an indexer.
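    If the pages are scraped outside of an indexer, the extracted text still has to be added to the index. The following is a minimal sketch, not the accelerator's own code: it only builds the JSON batch in the shape the search service's "index documents" REST operation accepts (a `value` array whose items carry an `@search.action`), and does not send it. The field names `id`, `url`, and `content`, and the helper names, are assumptions; they would have to match your own index schema. Because document keys accept only a limited character set, the page URL is Base64 encoded to form the key.

```python
import base64
import json


def make_key(url: str) -> str:
    """Encode a page URL as URL-safe Base64 so it can serve as a document key."""
    return base64.urlsafe_b64encode(url.encode("utf-8")).decode("ascii")


def build_upload_batch(pages: dict[str, str]) -> dict:
    """Build an indexing batch from a mapping of page URL -> extracted text.

    Field names ("id", "url", "content") are hypothetical and must match
    the fields defined in your own search index.
    """
    return {
        "value": [
            {
                # mergeOrUpload updates an existing document or creates a new one.
                "@search.action": "mergeOrUpload",
                "id": make_key(url),
                "url": url,
                "content": text,
            }
            for url, text in pages.items()
        ]
    }


if __name__ == "__main__":
    batch = build_upload_batch(
        {"https://intranet.example/hr/leave-policy": "Employees accrue leave monthly."}
    )
    # The resulting JSON would be the body of the POST request to the index.
    print(json.dumps(batch, indent=2))
```

    Using `mergeOrUpload` here means that re-scraping a page overwrites its earlier copy in the index instead of failing or duplicating it.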

1 additional answer

  1. Sina Salam 1,356 Reputation points

    Hi @Jai-6363

    Welcome to Microsoft Q&A and thank you for posting your questions here.

    To restate the question: you would like to know the best or ideal option for web scraping that fits well with Azure OpenAI.

    Here is a list of web scraping frameworks and tools that you might find useful when working with Azure and, possibly, an OpenAI integration:

    1. Beautiful Soup
    2. Scrapy
    3. Selenium
    4. Requests
    5. Puppeteer
    6. MechanicalSoup
    7. Apache Nutch
    8. Octoparse
    9. ParseHub
    10. Pyppeteer
    11. Apify

    When deciding which tool to use, consider factors such as your level of expertise in the programming language, the complexity of the website, the extent of scraping required, and any integration you wish to establish with Azure services or OpenAI. It is also crucial to adhere to the website's terms of use and any legal obligations when scraping.
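    Whichever tool is chosen, the core step is the same: turn a page's HTML into plain text and split it into pieces small enough to use as source content. Below is a rough sketch of that step. To stay dependency-free it uses Python's standard-library `html.parser` as a stand-in for a parser such as Beautiful Soup, and it leaves out fetching the page (which could be done with, for example, Requests). The names `TextExtractor`, `extract_text`, and `chunk_text`, and the 200-word chunk size, are illustrative assumptions.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text from HTML, skipping <script> and <style> contents."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self._parts: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style elements.
        if self._skip_depth == 0 and data.strip():
            self._parts.append(data.strip())

    def text(self) -> str:
        return " ".join(self._parts)


def extract_text(html: str) -> str:
    """Return the visible text of an HTML document as a single string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


def chunk_text(text: str, max_words: int = 200) -> list[str]:
    """Split text into consecutive chunks of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


if __name__ == "__main__":
    sample = "<html><body><h1>Leave policy</h1><p>Employees accrue leave monthly.</p></body></html>"
    for chunk in chunk_text(extract_text(sample), max_words=4):
        print(chunk)
```

    Pages that render their content with JavaScript would not work with this approach; that is the case where the browser-driving tools in the list (Selenium, Puppeteer, Pyppeteer) become relevant.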

    I hope this is helpful! Do not hesitate to let me know if you have any other questions.

    Please remember to "Accept Answer" if the answer helped, so that others in the community facing similar issues can easily find the solution.

    Best Regards,

    Sina Salam
