Using OpenAI or Other Microsoft Resource to Filter/Categorize Unstructured Data

Education is the Key 85 Reputation points
2024-05-16T23:22:33.9533333+00:00

Our Team is completing a special project that requires the customized categorization and filtering of a large amount of unstructured data (about 300,000 unduplicated PDFs) for an LLM-driven, AI application.

Is there a specific Microsoft resource available that can handle that task? Microsoft Purview has been recommended, but it doesn't seem like an ideal solution.

Thanks!

Azure Machine Learning
Azure OpenAI Service
Azure Startups

Accepted answer
  1. AshokPeddakotla-MSFT 35,091 Reputation points
    2024-05-17T04:07:29.58+00:00

    Education is the Key, greetings and welcome to the Microsoft Q&A forum!

    Based on your requirements, the following Azure resources are worth exploring before you decide on a solution.

    Azure AI Search is a cloud search service that provides a rich search experience for custom applications. It can extract text and metadata from unstructured data sources, including PDFs, and then index the content so it can be searched and filtered. For the categorization step itself, see custom text classification, a feature of Azure AI Language.
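
Once the PDFs are indexed with a category field, filtering at query time comes down to passing an OData filter expression alongside the search text. The following is a minimal sketch of building such a request body for the Search Documents REST call. The field name `category` and the sample values are hypothetical, and nothing is sent over the network here.

```python
import json


def odata_string(value: str) -> str:
    """Quote a string literal for an OData filter (single quotes are doubled)."""
    return "'" + value.replace("'", "''") + "'"


def build_search_request(query: str, category: str, top: int = 10) -> dict:
    """Build a request body that searches text and filters on a category field.

    Assumes the index defines a filterable string field named `category`
    (a hypothetical name; use whatever your index schema defines).
    """
    return {
        "search": query,
        "filter": f"category eq {odata_string(category)}",
        "top": top,
    }


body = build_search_request("termination clause", "contracts")
print(json.dumps(body, indent=2))
```

A body like this would be POSTed to the service's `/indexes/{index}/docs/search` route with an `api-version` query parameter and an `api-key` header; the service endpoint and index name depend on your deployment.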

    Azure Machine Learning is a cloud-based service that provides tools to build, train, and deploy machine learning models. You could use it to build a custom model that categorizes and filters the unstructured data. Please see Azure Machine Learning Overview for more details.
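
Before committing to a training pipeline in Azure Machine Learning, it may help to prototype the categorization logic locally on a small labeled sample. The following is a minimal sketch of a multinomial naive Bayes text classifier in plain Python. It is a generic technique, not an Azure Machine Learning API, and the sample texts and labels are made up for illustration.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


class NaiveBayes:
    """Multinomial naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self) -> None:
        self.word_counts = defaultdict(Counter)  # label -> token counts
        self.doc_counts = Counter()              # label -> number of documents
        self.vocab = set()

    def fit(self, samples) -> None:
        for text, label in samples:
            tokens = tokenize(text)
            self.word_counts[label].update(tokens)
            self.doc_counts[label] += 1
            self.vocab.update(tokens)

    def predict(self, text: str) -> str:
        tokens = tokenize(text)
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label in self.doc_counts:
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(self.doc_counts[label] / total_docs)
            total_words = sum(self.word_counts[label].values())
            for tok in tokens:
                count = self.word_counts[label][tok]
                score += math.log((count + 1) / (total_words + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical labeled sample; real training data would come from your PDFs.
model = NaiveBayes()
model.fit([
    ("invoice total amount due payment", "finance"),
    ("payment received invoice number", "finance"),
    ("employee onboarding policy handbook", "hr"),
    ("vacation leave policy for employee", "hr"),
])
print(model.predict("overdue invoice payment"))
```

A prototype like this is mainly useful for checking that the categories are separable from the extracted text at all; a production model would need far more labeled data and proper evaluation.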

    Regarding Azure OpenAI: Azure OpenAI On Your Data supports the following file types:

    • .txt
    • .md
    • .html
    • .docx
    • .pptx
    • .pdf
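
With a large corpus, it can be worth checking up front which files fall outside this list. The following is a minimal sketch that splits file paths by extension using the file types listed above; the sample file names are hypothetical.

```python
from pathlib import Path

# File types listed above as supported by Azure OpenAI On Your Data.
SUPPORTED_EXTENSIONS = {".txt", ".md", ".html", ".docx", ".pptx", ".pdf"}


def split_by_support(paths):
    """Return (supported, unsupported) lists of Path objects."""
    supported, unsupported = [], []
    for path in map(Path, paths):
        # Compare case-insensitively so "REPORT.PDF" counts as a PDF.
        if path.suffix.lower() in SUPPORTED_EXTENSIONS:
            supported.append(path)
        else:
            unsupported.append(path)
    return supported, unsupported


ok, needs_conversion = split_by_support(
    ["report.PDF", "notes.md", "scan.tiff", "archive.tar.gz"]
)
print([p.name for p in ok])
print([p.name for p in needs_conversion])
```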

    There's an upload limit, and there are some caveats about document structure and how it might affect the quality of responses from the model:

    • If you're converting data from an unsupported format into a supported format, optimize the quality of the model response by ensuring the conversion:
      • Doesn't lead to significant data loss.
      • Doesn't add unexpected noise to your data.
    • If your files have special formatting, such as tables and columns, or bullet points, prepare your data with the data preparation script available on GitHub.
    • For documents and datasets with long text, you should use the available data preparation script. The script chunks data so that the model's responses are more accurate. This script also supports scanned PDF files and images.
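
To illustrate what chunking means in practice, here is a minimal sketch that splits text into overlapping word windows. It is not the data preparation script from GitHub, whose internals are not described here; the chunk size and overlap values are arbitrary assumptions.

```python
def chunk_words(text: str, chunk_size: int = 200, overlap: int = 20) -> list[str]:
    """Split text into chunks of at most chunk_size words.

    Consecutive chunks share `overlap` words so that a sentence cut at a
    boundary still appears whole in one of the two neighboring chunks.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")

    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once the current window has reached the end of the text.
        if start + chunk_size >= len(words):
            break
    return chunks


sample = " ".join(str(i) for i in range(10))
print(chunk_words(sample, chunk_size=4, overlap=1))
```

Splitting on words keeps the sketch simple; a real pipeline would more likely chunk by model tokens or by document structure such as headings and paragraphs.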

    Let me know if this helps or if you have any other questions.

