Azure AI Search document preprocessing

Question

Dezső Kántor 0

To ensure optimal input for indexing and vectorizing, it's important to prepare your data properly. Azure OpenAI On Your Data supports various file types, including HTML, PDF, and Markdown files. If you're converting data from an unsupported format into a supported format, you should ensure that the conversion doesn't lead to significant data loss or add unexpected noise to your data. Additionally, if your files have special formatting, such as tables and columns, or bullet points, you should prepare your data with the data preparation script available on GitHub. For documents and datasets with long text, you should use the available data preparation script. The script chunks data so that the model's responses are more accurate. This script also supports scanned PDF files and images. Thank you in advance!

Dezső Kántor 0 Reputation points

2024-09-03T15:56:33.6966667+00:00

Sorry, I can't update my original question, but here it is:

I am in the process of developing an AI-based QA chatbot that will provide answers using my own data. My question is fairly broad. I am working with various data formats, including HTML files, PDFs, and Markdown files. What is the best way to preprocess this data before indexing and vectorizing to ensure optimal input for indexing? I found a promising approach by using OpenAI to convert the files into JSON objects, which I then used as input for the index. What are the best practices for data preprocessing? Additionally, what is the most effective input format for AI search indexing?
Dezső Kántor 0 Reputation points

2024-09-03T15:59:23.6633333+00:00

Sorry, I can't update my original question due to this error: *You are not authorized to make this response. If you believe this to be in error, please refresh the page and try again.* My original question is:
I am in the process of developing an AI-based QA chatbot that will provide answers using my own data. My question is fairly broad. I am working with various data formats, including HTML files, PDFs, and Markdown files. What is the best way to preprocess this data before indexing and vectorizing to ensure optimal input for indexing? I found a promising approach by using OpenAI to convert the files into JSON objects, which I then used as input for the index. What are the best practices for data preprocessing? Additionally, what is the most effective input format for AI search indexing?
Dezső Kántor 0 Reputation points

2024-09-03T16:07:51.3766667+00:00

I am in the process of developing an AI-based QA chatbot that will provide answers using my own data. My question is fairly broad. I am working with various data formats, including HTML files, PDFs, and Markdown files. What is the best way to preprocess this data before indexing and vectorizing to ensure optimal input for indexing? I found a promising approach by using OpenAI to convert the files into JSON objects, which I then used as input for the index. What are the best practices for data preprocessing? Additionally, what is the most effective input format for AI search indexing?
Dezső Kántor 0 Reputation points

2024-09-03T17:31:32.71+00:00

Sorry, but I can't edit my original question.

I am in the process of developing an AI-based QA chatbot that will provide answers using my own data. My question is fairly broad. I am working with various data formats, including HTML files, PDFs, and Markdown files. What is the best way to preprocess this data before indexing and vectorizing to ensure optimal input for indexing? I found a promising approach by using OpenAI to convert the files into JSON objects, which I then used as input for the index. What are the best practices for data preprocessing? Additionally, what is the most effective input format for AI search indexing?
Dezső Kántor 0 Reputation points

2024-09-03T17:39:34.95+00:00

I am in the process of developing an AI-based QA chatbot that will provide answers using my own data. My question is fairly broad. I am working with various data formats, including HTML files, PDFs, and Markdown files. What is the best way to preprocess this data before indexing and vectorizing to ensure optimal input for indexing? I found a promising approach by using OpenAI to convert the files into JSON objects, which I then used as input for the index. What are the best practices for data preprocessing? Additionally, what is the most effective input format for AI search indexing?

1 answer

Answer 1

@Dezső Kántor Azure OpenAI On Your Data supports various file types, including HTML, PDF, and Markdown files. If you're converting data from an unsupported format into a supported format, you should ensure that the conversion doesn't lead to significant data loss or add unexpected noise to your data. Additionally, if your files have special formatting, such as tables and columns, or bullet points, you should prepare your data with the data preparation script available on GitHub.

For documents and datasets with long text, you should use the available data preparation script. The script chunks data so that the model's responses are more accurate. This script also supports scanned PDF files and images.

Regarding the most effective input format for AI search indexing, it depends on the specific requirements of your chatbot and the type of data you're working with. However, JSON is a commonly used format for input data in AI search indexing.

Further tips:

Data Cleaning and Normalization
1. Remove Noise: Eliminate unnecessary characters, HTML tags, and other non-informative elements.
2. Normalize Text: Convert text to a consistent format (e.g., lowercasing, removing punctuation).
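As a rough sketch of the noise-removal step for HTML input, the snippet below strips tags with Python's standard-library html.parser and collapses whitespace. The names `_TextExtractor` and `html_to_text` are illustrative assumptions and are not part of any Azure tooling; real-world pages may need a more robust parser.

```python
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects visible text and skips <script>/<style> content."""

    _SKIP = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)  # decode entities like &amp;
        self._parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self._SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self._SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if self._skip_depth == 0:
            self._parts.append(data)

    def text(self):
        # Joining with a space keeps block-level text apart; it can also
        # split a word that is interrupted by an inline tag.
        return " ".join(self._parts)


def html_to_text(html: str) -> str:
    """Return the visible text of an HTML string with whitespace collapsed."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return re.sub(r"\s+", " ", parser.text()).strip()
```

For example, `html_to_text("<h1>Title</h1><p>Hello &amp; welcome</p>")` yields plain text with the tags removed and the entity decoded.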
Data Transformation
1. Convert Formats: Ensure all data is in a supported format (for example HTML, PDF, or Markdown). Use reliable tools to convert unsupported formats without losing data integrity.
2. Structure Data: Maintain the structure of tables, columns, and bullet points. This helps in preserving the context and relationships within the data.
Chunking Long Texts
1. Use Data Preparation Scripts: Utilize scripts to chunk long documents into smaller, manageable pieces. This improves the accuracy of the model’s responses.
2. Handle Special Formats: Ensure that special formats like scanned PDFs and images are processed correctly. Use OCR (Optical Character Recognition) for scanned documents to convert them into text.
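The chunking idea can be illustrated with a generic sliding window over words. This is only a sketch under stated assumptions: it is not the logic of the data preparation script mentioned above, and the function name `chunk_words` and the defaults for `max_words` and `overlap` are hypothetical values you would tune for your embedding model.

```python
def chunk_words(text: str, max_words: int = 200, overlap: int = 20) -> list[str]:
    """Split text into chunks of at most max_words words.

    Consecutive chunks share `overlap` words so that a sentence cut at a
    chunk boundary still appears with some context in the next chunk.
    """
    if max_words <= 0:
        raise ValueError("max_words must be positive")
    if not 0 <= overlap < max_words:
        raise ValueError("overlap must be >= 0 and smaller than max_words")

    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        # Stop once the current window has reached the end of the text.
        if start + max_words >= len(words):
            break
    return chunks
```

Splitting on words is the simplest option; splitting on headings, paragraphs, or sentences usually preserves more context but needs format-specific handling.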
Metadata Enrichment
1. Add Metadata: Enrich your documents with metadata such as titles, authors, dates, and keywords. This helps in improving search relevance and indexing efficiency.
Validation and Testing
1. Validate Data: Ensure that the converted data is accurate and complete. Check for any data loss or corruption during the conversion process.
2. Test Indexing: Perform test indexing to identify any issues early and adjust preprocessing steps accordingly.
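One simple way to approach the validation step is to compare the converted chunks against the source text before indexing. The sketch below flags empty chunks and a drop in vocabulary coverage; the function name `validate_chunks` and the `min_coverage` threshold are assumptions for illustration, and a vocabulary check is only a coarse signal of data loss.

```python
def validate_chunks(
    source_text: str, chunks: list[str], min_coverage: float = 0.99
) -> list[str]:
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []

    # An empty chunk usually points at a conversion or splitting bug.
    for i, chunk in enumerate(chunks):
        if not chunk.strip():
            problems.append(f"chunk {i} is empty")

    # Compare the set of distinct words before and after preprocessing.
    source_words = set(source_text.lower().split())
    chunk_words = set(" ".join(chunks).lower().split())
    if source_words:
        coverage = len(source_words & chunk_words) / len(source_words)
        if coverage < min_coverage:
            problems.append(
                f"only {coverage:.0%} of source vocabulary survived"
            )
    return problems
```

Running a check like this on a small sample of documents before a full indexing run makes it easier to adjust the preprocessing steps early.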
Optimal Input Format
1. JSON Objects: Converting files into JSON objects is a good practice. JSON is flexible and can easily represent complex data structures, making it suitable for indexing.
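To make the JSON and metadata points concrete, here is a minimal sketch that turns text chunks into JSON-serializable documents with a few metadata fields. The field names (`id`, `title`, `source`, `chunk_index`, `content`) and the function `build_documents` are hypothetical; they would need to match the fields defined in your own index schema. A hex digest is used for the key on the assumption that document keys should contain only URL-safe characters.

```python
import hashlib
import json


def build_documents(source_name: str, title: str, chunks: list[str]) -> list[dict]:
    """Wrap each chunk in a dict carrying a stable key and basic metadata."""
    docs = []
    for i, chunk in enumerate(chunks):
        # Deterministic key: re-running the pipeline on the same file
        # produces the same ids, so documents are overwritten, not duplicated.
        key = hashlib.sha1(f"{source_name}:{i}".encode("utf-8")).hexdigest()
        docs.append(
            {
                "id": key,
                "title": title,
                "source": source_name,
                "chunk_index": i,
                "content": chunk,
            }
        )
    return docs


if __name__ == "__main__":
    docs = build_documents("guide.md", "User Guide", ["First chunk.", "Second chunk."])
    print(json.dumps(docs, indent=2, ensure_ascii=False))
```

Keeping the source file name and chunk index on every document also makes it possible to cite the original file in the chatbot's answers.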

Dezső Kántor 0 Reputation points

2024-10-04T08:16:12.6633333+00:00

For documents and datasets with long text, you should use the available data preparation script. The script chunks data so that the model's responses are more accurate. This script also supports scanned PDF files and images.

Thank you for your answer. Where can I find this data prep script?
