Data preprocessing tips and tricks before indexing

Dezső Kántor
2024-09-03T15:09:11.5133333+00:00

I am developing an AI-based QA chatbot that answers questions using my own data. My question is fairly broad. I am working with several data formats, including HTML files, PDFs, and Markdown files. What is the best way to preprocess this data before indexing and vectorizing it, so that the input to Azure AI Search is as clean as possible? One approach that looked promising was to use an OpenAI model to convert the files into JSON objects, which I then used as input for the index. What are the best practices for data preprocessing? Additionally, what is the most effective input format for indexing in Azure AI Search?
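To make the "files to JSON objects" step concrete, here is a minimal sketch for the HTML case only. It is not the OpenAI-based conversion mentioned above; it swaps in a plain, deterministic pipeline (strip markup, split into overlapping word windows, emit one JSON object per chunk) using only the Python standard library. The field names `id`, `source`, and `content`, the helper names, and the chunk sizes are all assumptions for illustration and would have to match whatever index schema is actually defined.

```python
import json
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self._parts = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._parts.append(data)

    def text(self):
        # Collapse every run of whitespace into a single space.
        return " ".join(" ".join(self._parts).split())


def html_to_text(html):
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


def chunk_words(text, max_words=200, overlap=20):
    """Split text into windows of at most max_words words that share `overlap` words."""
    if overlap >= max_words:
        raise ValueError("overlap must be smaller than max_words")
    words = text.split()
    chunks = []
    step = max_words - overlap
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # the last window already reached the end of the text
    return chunks


def to_documents(source_name, html, max_words=200, overlap=20):
    """Build one dict per chunk; the field names are hypothetical, not a required schema."""
    # Assumption: document keys should avoid characters such as '.', so anything
    # outside letters, digits, '_', '-' and '=' is replaced with '_'.
    safe_name = re.sub(r"[^A-Za-z0-9_\-=]", "_", source_name)
    text = html_to_text(html)
    return [
        {"id": f"{safe_name}-{i}", "source": source_name, "content": chunk}
        for i, chunk in enumerate(chunk_words(text, max_words, overlap))
    ]


if __name__ == "__main__":
    sample = "<html><body><h1>Title</h1><p>Hello &amp; welcome</p></body></html>"
    print(json.dumps(to_documents("faq.html", sample), indent=2))
```

PDF and Markdown files would need their own text-extraction step before chunking; the chunking and JSON-building parts could stay the same. Splitting on word counts is the simplest option, and splitting on headings or paragraphs instead may preserve more context per chunk.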

Azure AI Search
An Azure search service with built-in artificial intelligence capabilities that enrich information to help identify and explore relevant content at scale.
Azure OpenAI Service
An Azure service that provides access to OpenAI’s GPT-3 models with enterprise capabilities.