Scraping website content into .md format for OpenAI training

Sariga Rahul 146 Reputation points
2023-07-17T09:50:15.92+00:00

What are some effective ways to scrape website data for training GPT models on Azure's OpenAI Bring Your Own Data API? Are there any recommended tools available? The desired output format is .md.

Azure OpenAI Service
An Azure service that provides access to OpenAI’s GPT-3 models with enterprise capabilities.

1 answer

  1. YutongTie-MSFT 53,971 Reputation points Moderator
    2023-07-17T20:49:23.4833333+00:00

    Hello @Sariga Rahul

    Thanks for reaching out. Many online tools can convert website content such as HTML or XML to .md; personally, I just use those because they are very convenient.

    Beyond that, several popular tools can help you scrape website data and get it into Markdown for use with Azure OpenAI's Bring Your Own Data API:

    Beautiful Soup: Beautiful Soup is a Python library for extracting data from HTML and XML files. It provides a simple API for navigating and searching the parse tree, so you can pull text, links, and other data out of a page. It does not download pages or emit Markdown itself: fetch the HTML with an HTTP client such as requests, then write the extracted content out as Markdown.
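
    A minimal sketch of that approach, assuming the bs4 package is installed. The function name html_to_markdown, the tag-to-prefix table, and the sample HTML are illustrative assumptions; only headings, paragraphs, list items, and links are handled, and nested block elements are not.

```python
from bs4 import BeautifulSoup

# Tags handled by this sketch, mapped to the Markdown prefix they receive.
PREFIXES = {"h1": "# ", "h2": "## ", "h3": "### ", "p": "", "li": "- "}


def html_to_markdown(html: str) -> str:
    """Convert headings, paragraphs, list items, and links in `html` to Markdown."""
    soup = BeautifulSoup(html, "html.parser")

    # Drop elements whose text should never reach the output.
    for tag in soup(["script", "style"]):
        tag.decompose()

    # Rewrite links in place as [text](href) so get_text() keeps them.
    for a in soup.find_all("a", href=True):
        a.replace_with(f"[{a.get_text(strip=True)}]({a['href']})")

    blocks = []
    for tag in soup.find_all(list(PREFIXES)):
        # Collapse runs of whitespace left over from the HTML layout.
        text = " ".join(tag.get_text().split())
        if text:
            blocks.append(PREFIXES[tag.name] + text)
    return "\n\n".join(blocks) + "\n"


# Hypothetical page content standing in for a downloaded document.
sample = """
<html><head><title>Docs</title><style>p {color: red}</style></head>
<body>
  <h1>Getting started</h1>
  <p>Read the <a href="https://example.com/guide">guide</a> first.</p>
  <ul><li>Install</li><li>Configure</li></ul>
</body></html>
"""
markdown_text = html_to_markdown(sample)
print(markdown_text)
```

    For a real page, you would pass in the HTML you downloaded and write the returned string to a .md file.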

    Scrapy: Scrapy is a Python framework for web scraping and crawling. You define the data to extract with XPath or CSS selectors, and item pipelines process and store the results. Scrapy has no built-in Markdown output, so the HTML-to-Markdown step would go in your spider or in a pipeline that writes .md files.

    Pandoc: Pandoc is a command-line tool that can convert between various document formats, including HTML, Markdown, and LaTeX. You can use Pandoc to convert the HTML content scraped from websites into Markdown format.
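
    If you drive Pandoc from a script, the call could look like the sketch below. It assumes the pandoc executable is on your PATH and that GitHub-Flavored Markdown (the gfm writer) is an acceptable output dialect. The helper names and file paths are made up for illustration.

```python
import shutil
import subprocess


def build_pandoc_command(html_path: str, md_path: str) -> list[str]:
    """Return the pandoc invocation that converts an HTML file to Markdown."""
    return [
        "pandoc",
        "-f", "html",      # input format
        "-t", "gfm",       # output format: GitHub-Flavored Markdown
        "--wrap=none",     # do not hard-wrap paragraphs
        "-o", md_path,     # output file
        html_path,
    ]


def convert(html_path: str, md_path: str) -> None:
    """Run pandoc, failing early with a clear message if it is not installed."""
    if shutil.which("pandoc") is None:
        raise RuntimeError("pandoc is not installed or not on PATH")
    subprocess.run(build_pandoc_command(html_path, md_path), check=True)


# Show the command that would be run for a hypothetical scraped page.
command = build_pandoc_command("page.html", "page.md")
print(" ".join(command))
```

    Calling convert("page.html", "page.md") would then write the Markdown file next to the scraped HTML.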

    Python Markdown: Python Markdown is a Python library that converts Markdown text into HTML. It works in the opposite direction from what you need, so it will not turn scraped HTML into Markdown; it is mainly useful for rendering your generated .md files back to HTML to check that they look right.

    I hope this helps.

    Regards,

    Yutong


    2 people found this answer helpful.