@Dezső Kántor, Azure OpenAI On Your Data supports various file types, including HTML, PDF, and Markdown files. If you're converting data from an unsupported format into a supported one, make sure the conversion doesn't cause significant data loss or add unexpected noise to your data. Additionally, if your files have special formatting, such as tables, columns, or bullet points, you should prepare your data with the data preparation script available on GitHub.
For documents and datasets with long text, you should use the available data preparation script. The script chunks data so that the model's responses are more accurate. This script also supports scanned PDF files and images.
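To make the idea of chunking concrete, here is a minimal sketch in Python. It is not the data preparation script from GitHub, only an illustration of the general technique; the function name `chunk_words` and its parameters `max_words` and `overlap` are hypothetical, and the official script may well chunk by tokens or by document structure instead of by words.

```python
def chunk_words(text: str, max_words: int = 200, overlap: int = 20) -> list[str]:
    """Split text into chunks of at most max_words words.

    Neighboring chunks share `overlap` words so that a sentence cut at a
    chunk boundary still appears with some context in the next chunk.
    (Hypothetical helper for illustration, not part of any Azure SDK.)
    """
    if max_words <= 0:
        raise ValueError("max_words must be positive")
    if not 0 <= overlap < max_words:
        raise ValueError("overlap must be in the range [0, max_words)")

    words = text.split()
    step = max_words - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # the last window already reached the end of the text
    return chunks
```

For example, `chunk_words("0 1 2 3 4 5 6 7 8 9", max_words=4, overlap=1)` returns `["0 1 2 3", "3 4 5 6", "6 7 8 9"]`. The overlap is a design choice: it costs some duplicated text in the index but reduces the chance that an answer is split across two chunks.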
The most effective input format for Azure AI Search indexing depends on the specific requirements of your chatbot and the type of data you're working with. That said, JSON is a commonly used format for input data when indexing.
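As a rough illustration of what JSON input could look like, the sketch below turns text chunks into JSON objects with a few metadata fields. The field names (`id`, `title`, `content`, `filepath`, `chunk_no`) and the helper `to_index_document` are assumptions made for this example; your documents must use whatever fields your own index schema defines.

```python
import hashlib
import json


def to_index_document(chunk: str, title: str, source_path: str, chunk_no: int) -> dict:
    """Build one JSON-serializable document for a chunk of text.

    Field names are hypothetical and must be adapted to your index schema.
    """
    # Derive a stable id from the source path and chunk number, so that
    # re-running the conversion produces the same ids for the same chunks.
    doc_id = hashlib.sha1(f"{source_path}#{chunk_no}".encode("utf-8")).hexdigest()
    return {
        "id": doc_id,
        "title": title,
        "content": chunk,
        "filepath": source_path,
        "chunk_no": chunk_no,
    }


chunks = ["First chunk of the document.", "Second chunk of the document."]
docs = [
    to_index_document(c, "Example title", "docs/example.md", i)
    for i, c in enumerate(chunks)
]

# ensure_ascii=False keeps non-ASCII characters readable in the output file.
print(json.dumps(docs, ensure_ascii=False, indent=2))
```

Keeping the metadata next to the content in the same object makes it easier to filter or display sources later, provided the corresponding fields exist in the index.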
Further tips:
- Data Cleaning and Normalization
  - Remove Noise: Eliminate unnecessary characters, HTML tags, and other non-informative elements.
  - Normalize Text: Convert text to a consistent format (e.g., lowercasing, removing punctuation).
- Data Transformation
  - Convert Formats: Ensure all data is in a supported format (such as HTML, PDF, or Markdown). Use reliable tools to convert unsupported formats without losing data integrity.
  - Structure Data: Maintain the structure of tables, columns, and bullet points. This helps preserve the context and relationships within the data.
- Chunking Long Texts
  - Use Data Preparation Scripts: Chunk long documents into smaller, manageable pieces with a script. This improves the accuracy of the model’s responses.
  - Handle Special Formats: Ensure that special formats like scanned PDFs and images are processed correctly. Use OCR (Optical Character Recognition) to convert scanned documents into text.
- Metadata Enrichment
  - Add Metadata: Enrich your documents with metadata such as titles, authors, dates, and keywords. This helps improve search relevance and indexing efficiency.
- Validation and Testing
  - Validate Data: Ensure that the converted data is accurate and complete. Check for any data loss or corruption during the conversion process.
  - Test Indexing: Perform test indexing to identify any issues early and adjust preprocessing steps accordingly.
- Optimal Input Format
  - JSON Objects: Converting files into JSON objects is a good practice. JSON is flexible and can easily represent complex data structures, making it suitable for indexing.
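The cleaning and normalization tips above can be sketched with the Python standard library alone. This is one possible approach under simple assumptions, not a recommended or official pipeline; the names `_TextExtractor` and `clean_html` are hypothetical. It strips HTML tags, drops `<script>` and `<style>` contents, normalizes Unicode, and collapses whitespace.

```python
import re
import unicodedata
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects visible text and skips <script>/<style> contents."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0  # > 0 while inside <script> or <style>
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)


def clean_html(raw: str) -> str:
    """Return the visible text of an HTML fragment as one normalized line."""
    parser = _TextExtractor()
    parser.feed(raw)
    parser.close()
    # Joining with a space keeps text from adjacent tags apart; note that it
    # also inserts a space where an inline tag splits a single word.
    text = " ".join(parser.parts)
    # NFKC folds compatibility characters (e.g. non-breaking spaces) into
    # their plain equivalents before whitespace is collapsed.
    text = unicodedata.normalize("NFKC", text)
    return re.sub(r"\s+", " ", text).strip()
```

For instance, `clean_html("<p>Hello&nbsp;<b>world</b></p><script>var x = 1;</script>")` returns `"Hello world"`. Because this flattens everything to a single line, it discards table and list structure, so it fits plain prose better than the specially formatted files mentioned earlier. Comparing the length of the cleaned text against the source is a cheap first check for unexpected data loss.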