Data preprocessing tips and tricks before indexing
I am building an AI-based QA chatbot that answers questions from my own data. My question is fairly broad. The data comes in several formats, including HTML files, PDFs, and Markdown files. What is the best way to preprocess this data before indexing and vectorizing it, so that the AI search index gets the best possible input? One approach that looked promising was using OpenAI to convert the files into JSON objects, which I then used as input for the index. Are there established best practices for this kind of preprocessing, and which input format works best for AI search indexing?
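
For illustration, a simple non-LLM baseline for this kind of preprocessing might look like the sketch below. It is not the OpenAI-based conversion mentioned above, and it only covers HTML input: it strips markup, splits the text into overlapping word chunks, and emits JSON records. The helper names (`html_to_text`, `chunk_words`, `to_records`), the record fields (`id`, `source`, `content`), and the chunk sizes are all assumptions made for the example, not a schema required by any particular search service.

```python
import json
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style blocks.
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def html_to_text(html):
    """Return the visible text of an HTML string as one line."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


def chunk_words(text, size=200, overlap=20):
    """Split text into chunks of `size` words, each sharing `overlap` words with the previous one."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("size must be positive and 0 <= overlap < size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


def to_records(source, text, size=200, overlap=20):
    """Build one JSON-serializable record per chunk (hypothetical field names)."""
    return [
        {"id": f"{source}-{i}", "source": source, "content": chunk}
        for i, chunk in enumerate(chunk_words(text, size, overlap))
    ]


if __name__ == "__main__":
    sample = "<html><body><h1>Title</h1><p>Hello world</p></body></html>"
    records = to_records("sample.html", html_to_text(sample), size=2, overlap=1)
    print(json.dumps(records, indent=2))
```

PDF and Markdown files would need their own text extraction step before chunking; the chunking and record-building parts would stay the same.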