Preparation

The first phase of Retrieval-Augmented Generation (RAG) development and experimentation is the preparation phase. During this phase, you first define the business domain for your solution. After you define the domain, you begin the parallel process of performing document analysis, gathering documents, and gathering sample questions that are pertinent to the domain. You do these steps in parallel because they're interrelated: the document analysis helps you determine what test documents and test queries you should gather, the questions must be answerable by content in the documents, and the documents must answer relevant questions.

This article is part of a series. Read the introduction.

Determine solution domain

The first step in this process is to clearly define the business requirements for the solution or the use case. These requirements help determine what kind of questions the solution intends to address and what source data or documents help address those questions. In later stages, the solution domain helps inform your embedding model strategy.

Document analysis

The goal of document analysis is to gather enough information about your document corpus to help you understand:

  • The different classifications of documents - For example, do you have product specifications, quarterly reports, car insurance contracts, health insurance contracts, etc.
  • The different document types - For example, do you have PDFs, Markdown files, HTML files, DOCX files, etc.
  • The security constraints - For example, whether the documents are publicly accessible or whether they require authentication and authorization to access
  • The structure of the documents - For example, the length of documents, topic breaks, and whether they have contextually relevant images or tabular data

The following sections discuss how this information helps inform your loading and chunking strategies.

Classification of documents

You need to understand the different classifications of documents to help you determine the number of test documents you require. This part of the analysis should tell you not only the high-level classifications such as insurance or finance, but also subclassifications, such as health insurance vs. car insurance documents. You also want to understand if the subclassifications have different structures or content.

The goal is to understand all of the different document variants you have. This understanding helps you determine the number and breakdown of test documents you require. You don't want to overrepresent or underrepresent a specific document classification in your experimentation.

Document types

Understanding the different file formats in your corpus helps you determine the number and breakdown of test documents. For example, if you have PDF and Office Open XML document types for quarterly reports, you need test documents for each document type. Understanding your document types also helps you understand your technical requirements for loading and chunking your documents, such as specific libraries suited for processing those file formats.
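
For example, the following sketch shows one way you might dispatch each file format to a suitable processing library. The pypdf, python-docx, and BeautifulSoup libraries are common choices rather than requirements, and the extension-to-library mapping is an assumption you adapt to your own corpus.

    from pathlib import Path

    from bs4 import BeautifulSoup  # pip install beautifulsoup4
    from docx import Document      # pip install python-docx
    from pypdf import PdfReader    # pip install pypdf

    def extract_text(path: Path) -> str:
        """Dispatch to a format-specific library based on file extension."""
        suffix = path.suffix.lower()
        if suffix == ".pdf":
            reader = PdfReader(path)
            return "\n".join(page.extract_text() or "" for page in reader.pages)
        if suffix == ".docx":
            return "\n".join(p.text for p in Document(str(path)).paragraphs)
        if suffix in (".html", ".htm"):
            soup = BeautifulSoup(path.read_text(encoding="utf-8"), "html.parser")
            return soup.get_text(separator="\n")
        if suffix in (".md", ".txt"):
            return path.read_text(encoding="utf-8")
        raise ValueError(f"Unsupported document type: {suffix}")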

Security constraints

Understanding security constraints is crucial for determining your loading and chunking strategies. For example, you need to identify whether some or all of your documents require authentication, authorization, or network visibility. If the documents are within a secure perimeter, ensure your code can access them, or implement a process to securely replicate the documents to an accessible location for your processing code.

Be aware that documents sometimes reference multimedia, such as images or audio, that's important to the context of the document. That media might be subject to the same access controls as the document itself. If the media requires authentication or network line of sight, you again need to either make sure your code can access it or implement an upstream process that has access and can replicate the content.

If your workload requires that different users have access only to distinct documents or document segments, make sure you understand how you're going to retain those access permissions in your chunking solution.
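
As a minimal sketch, assuming you record each source document's access control list as chunk metadata, permissions can be retained like the following. The field names and group IDs are hypothetical; the key point is that each chunk carries enough information to filter search results to what the requesting user is allowed to see.

    # A hypothetical chunk record that carries the source document's access
    # permissions. Storing the allowed group IDs with each chunk lets you
    # filter search results to what the requesting user can see.
    chunk = {
        "id": "doc42-chunk-007",
        "content": "To close a savings account, submit form ...",
        "source_document": "banking-operations.pdf",
        "allowed_groups": ["banking-ops", "customer-support"],  # from the source ACL
    }

    # At query time, return only chunks whose allowed_groups intersect the
    # requesting user's group memberships (security trimming).
    def user_can_see(chunk: dict, user_groups: set[str]) -> bool:
        return bool(user_groups.intersection(chunk["allowed_groups"]))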

Document structure

You need to understand the structure of the document, including how it's laid out and the types of content in the document. Understanding the structure and content of your documents helps you make the following determinations:

  • Whether the document requires preprocessing to clean up noise, extract media, reformat, or annotate items to ignore
  • What in the document you want to ignore or exclude
  • What in the document you want to capture
  • How you want to chunk the document
  • How you want to handle images, tables, charts, and other embedded media

The following are some categorized questions you can use to help you make some of these determinations.

Questions about common items you can consider ignoring

Some structural elements might not add meaning to the document and can safely be ignored when chunking. In some situations, these elements can add valuable context and improve the relevancy of queries against your index, but not always. The following questions about common document features help you evaluate whether the elements add relevancy or should be ignored.

  • Does the document contain a table of contents?
  • Are there headers and footers?
  • Are there copyrights or disclaimers?
  • Are there footnotes or endnotes?
  • Are there watermarks?
  • Are there annotations or comments?
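
For elements you decide to ignore, a preprocessing pass can strip them before chunking. The following is a minimal sketch; the regular expressions are hypothetical examples, and the real patterns depend on the boilerplate in your corpus.

    import re

    # Hypothetical noise patterns; real patterns depend on your corpus.
    NOISE_PATTERNS = [
        re.compile(r"^Page \d+ of \d+$"),                    # page footers
        re.compile(r"^Confidential - Internal Use Only$"),   # watermark text
        re.compile(r"^©.*All rights reserved\.$"),           # copyright lines
    ]

    def strip_noise(text: str) -> str:
        """Drop lines that match known boilerplate before chunking."""
        kept = [
            line for line in text.splitlines()
            if not any(p.match(line.strip()) for p in NOISE_PATTERNS)
        ]
        return "\n".join(kept)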

Questions that help inform preprocessing and chunking strategy

The following questions about the structure of the document give you insight into whether you need to preprocess the document to make it easier to process, and they help inform your chunking strategy.

  • Are there multi-column data or multi-column paragraphs? You don't want to parse multi-column content as though it were a single column.
  • How is the document structured? For example, HTML files sometimes use tables for their layout that need to be differentiated from embedded tabular data.
  • How many paragraphs are there? How long are the paragraphs? Are the paragraphs roughly equal length?
  • What languages, language variants, or dialects are in the documents?
  • Does the document contain Unicode characters?
  • How are numbers formatted? Do they use commas or periods as decimal separators? Are they consistent?
  • What in the document is uniform and what isn't uniform?
  • Is there a header structure where semantic meaning can be extracted?
  • Are there bullets or meaningful indentations?
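
For example, if your documents have a meaningful header structure, you might chunk along that structure so each chunk retains its section's semantic context. The following is a minimal sketch for Markdown; it assumes headers are significant topic breaks, which your document analysis needs to confirm.

    import re

    def chunk_by_headers(markdown: str) -> list[dict]:
        """Split a Markdown document at headings, keeping each heading
        with its section text so the chunk retains semantic context."""
        chunks, current_title, current_lines = [], "preamble", []
        for line in markdown.splitlines():
            match = re.match(r"^(#{1,6})\s+(.*)", line)
            if match:
                if current_lines:
                    chunks.append({"title": current_title,
                                   "text": "\n".join(current_lines).strip()})
                current_title, current_lines = match.group(2), []
            else:
                current_lines.append(line)
        if current_lines:
            chunks.append({"title": current_title,
                           "text": "\n".join(current_lines).strip()})
        return chunks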

Questions about images

Understanding the images in your document helps you determine your image processing strategy. You need to understand information like what kinds of images you have, whether they have sufficient resolution to process, and whether the images contain all the required information. The following questions help you understand your image processing requirements.

  • Does the document contain images?
  • What resolution are the images?
  • Is there text embedded in the images?
  • Are there abstract images that don't add value? For example, icons may not add any semantic value. Adding a description for images may actually be detrimental, as the icon visual generally has little to do with the document content.
  • What is the relationship between the image and surrounding text? Determine whether the images have stand-alone content or whether there's context around the image you should use when passing it to a large language model to get the textual representation. Captions are an example of surrounding text that may have valuable context not included in the image.
  • Is there rich textual representation of the images, such as accessibility descriptions?
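
If you decide to generate textual representations of images, you can pass the image along with its surrounding text to a multimodal model. The following sketch uses the Azure OpenAI chat completions API; the endpoint, key, API version, and deployment name are placeholders that you replace with your own.

    import base64

    from openai import AzureOpenAI  # pip install openai

    client = AzureOpenAI(
        azure_endpoint="https://<your-resource>.openai.azure.com",
        api_key="<your-key>",
        api_version="2024-06-01",
    )

    def describe_image(image_path: str, caption: str) -> str:
        """Ask a multimodal model for a textual representation of an image,
        passing the surrounding caption as extra context."""
        with open(image_path, "rb") as f:
            image_b64 = base64.b64encode(f.read()).decode()
        response = client.chat.completions.create(
            model="<your-gpt-4o-deployment>",  # placeholder deployment name
            messages=[{
                "role": "user",
                "content": [
                    {"type": "text",
                     "text": f"Describe this image for a search index. Caption: {caption}"},
                    {"type": "image_url",
                     "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
                ],
            }],
        )
        return response.choices[0].message.content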

Questions about tables, charts, and other rich content

Understanding what information is encapsulated in tables, charts, and other media helps you determine what you want to process and how you want to process it. The following questions help you understand your processing requirements for tables, charts, and other media.

  • Does the document have charts with numbers?
  • Does the document contain tables?
    • Are the tables complex (for example, nested tables) or simple?
    • Are there captions for the tables?
    • What are the lengths of the tables? Long tables may require repeating headers in chunks, as shown in the sketch after this list.
  • Are there other types of embedded media like videos or audio?
  • Are there any mathematical equations/scientific notations in the document?
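
For long tables, one approach is to split the rows into groups and repeat the header row in every chunk so each chunk remains interpretable on its own. The following is a minimal sketch that assumes the table is already extracted as a list of rows.

    def chunk_table(rows: list[list[str]], rows_per_chunk: int = 20) -> list[str]:
        """Split a long table into chunks, repeating the header row in each
        chunk so every chunk stays interpretable on its own."""
        header, body = rows[0], rows[1:]
        chunks = []
        for i in range(0, len(body), rows_per_chunk):
            group = [header] + body[i:i + rows_per_chunk]
            chunks.append("\n".join(" | ".join(cells) for cells in group))
        return chunks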

Gather representative test documents

In this step, you gather documents that best represent the documents that you'll use in your production solution. The documents must address the defined use case and be able to answer the questions gathered in the parallel question-gathering step.

Considerations

Consider these areas when evaluating potential representative test documents:

  • Pertinence - The documents must meet the business requirements of the conversational application. For example, if you're building a chat bot tasked with helping customers perform banking operations, the documents should match that requirement, such as documents showing how to open or close a bank account. The documents must be able to address the test questions that are being gathered in the parallel step. If the documents don't have information relevant to the questions, the solution can't produce a valid response.
  • Representative - The documents should be representative of the different types of documents that your solution uses. For example, a car insurance document is different from a health insurance or life insurance document. Suppose the use case requires the solution to support all three types, but you have only two car insurance documents. Your solution would perform poorly for both health and life insurance. You should have at least two documents for each variation.
  • Physical document quality - The documents need to be in a usable shape. Scanned images, for example, might not allow you to extract usable information.
  • Document content quality - The documents must have high content quality. There shouldn't be misspellings or grammatical errors. Large language models don't perform well if you provide them with poor quality content.

The success factor in this step is being qualitatively confident that you have a good representation of test documents for your particular domain.
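
A simple check like the following sketch can flag underrepresented variants. The variant labels and file names are hypothetical, and the two-document minimum reflects the guidance in the next section.

    from collections import Counter

    # Hypothetical inventory of gathered test documents, labeled by variant.
    test_documents = [
        {"name": "car-policy-a.pdf", "variant": "car-insurance"},
        {"name": "car-policy-b.pdf", "variant": "car-insurance"},
        {"name": "health-plan-a.docx", "variant": "health-insurance"},
    ]

    counts = Counter(doc["variant"] for doc in test_documents)
    required_variants = {"car-insurance", "health-insurance", "life-insurance"}

    for variant in sorted(required_variants):
        if counts[variant] < 2:
            print(f"Underrepresented variant: {variant} ({counts[variant]} documents)")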

Test document guidance

  • Prefer real documents over synthetic. Real documents must go through a cleaning process to remove personally identifiable information (PII), as shown in the sketch after this list.
  • To ensure you're handling all kinds of scenarios, including predicted future scenarios, consider selectively augmenting your documents with synthetic data.
    • If you must use synthetic data, do your best to make it as close to real data as possible.
  • Make sure that the documents can address the questions that are being gathered.
  • You should have at least two documents for each document variant.
  • You can use large language models or other tools to help evaluate the document quality.
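
The following sketch shows one way to clean PII from real documents by using the Azure AI Language PII detection API. The endpoint and key are placeholders, and a person should still review the redacted text before use.

    from azure.ai.textanalytics import TextAnalyticsClient  # pip install azure-ai-textanalytics
    from azure.core.credentials import AzureKeyCredential

    client = TextAnalyticsClient(
        endpoint="https://<your-language-resource>.cognitiveservices.azure.com",
        credential=AzureKeyCredential("<your-key>"),
    )

    def redact_pii(texts: list[str]) -> list[str]:
        """Return copies of the input texts with detected PII redacted."""
        results = client.recognize_pii_entities(texts)
        return [doc.redacted_text for doc in results if not doc.is_error]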

Gather test queries

In this step, you gather the test queries that you use to evaluate your chunks, your search solution, and your prompt engineering. You do this step in lockstep with gathering the representative documents, because you're not only gathering the queries, you're also gathering how the representative documents address the queries. Having the sample queries, combined with the parts of the sample documents that address those queries, allows you to evaluate every stage of the RAG solution as you experiment with different strategies and approaches.

Gather test query output

The output of this phase includes content from both the gather test queries step and the gather representative test documents step. The output is a collection that contains the following data:

  • Query - The question, representing a legitimate user's potential prompt.
  • Context - A collection of all the actual text in the documents that addresses the query. For each piece of context, include the page and the actual text.
  • Answer - A valid response to the query. The response might be content taken directly from the documents, or it might be rephrased from one or more pieces of context.
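
The following is an illustrative record from such a collection. All of the values are made up; the shape is what matters: one query, the pieces of document text that address it, and a valid answer.

    # An illustrative record in the test-query collection; values are made up.
    test_record = {
        "query": "What is the deductible for windshield replacement?",
        "context": [
            {
                "document": "car-policy-a.pdf",
                "page": 12,
                "text": "Windshield replacement is subject to a $250 deductible ...",
            }
        ],
        "answer": "Windshield replacement carries a $250 deductible.",
    }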

Creating synthetic queries

It's often challenging for the subject matter experts (SMEs) for a particular domain to put together a comprehensive list of questions for the use case. One solution to this challenge is to generate synthetic questions from the representative test documents that were gathered. The following is a real-world approach for generating synthetic questions from representative documents:

  1. Chunk the documents - Break down the documents into chunks. This chunking step doesn't use the chunking strategy for your overall solution. It's a one-off step that you use to generate synthetic queries. You can do the chunking manually if the number of documents is reasonable.

  2. Generate queries per chunk - For each chunk, generate queries either manually or by using a large language model. When you use a large language model, a good starting point is to generate two queries per chunk. You can also use the large language model to create the answer. The following example shows a prompt that generates questions and answers for a chunk.

    Please read the following CONTEXT and generate two question and answer json objects in an array based on the CONTEXT provided. The questions should require deep reading comprehension, logical inference, deduction, and connecting ideas across the text. Avoid simplistic retrieval or pattern matching questions. Instead, focus on questions that test the ability to reason about the text in complex ways, draw subtle conclusions, and combine multiple pieces of information to arrive at an answer. Ensure that the questions are relevant, specific, and cover the key points of the CONTEXT.  Provide concise answers to each question, directly quoting the text from provided context. Provide the array output in strict JSON format as shown in output format. Ensure that the generated JSON is 100 percent structurally correct, with proper nesting, comma placement, and quotation marks. There should not be any comma after last element in the array.
    
    Output format:
    [
      {
        "question": "Question 1",
        "answer": "Answer 1"
      },
      {
        "question": "Question 2",
        "answer": "Answer 2"
      }
    ]
    
    CONTEXT:
    
  3. Verify output - Verify that the questions are pertinent to the use case and that the answers address the questions. An SME should perform this verification.
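
The following sketch ties these steps together by sending the step 2 prompt plus each chunk to a model through the Azure OpenAI SDK. The endpoint, key, and deployment name are placeholders, and the code assumes the model returns the strict JSON array that the prompt demands.

    import json

    from openai import AzureOpenAI  # pip install openai

    client = AzureOpenAI(
        azure_endpoint="https://<your-resource>.openai.azure.com",
        api_key="<your-key>",
        api_version="2024-06-01",
    )

    PROMPT = "Please read the following CONTEXT and generate two question and answer json objects ..."  # the full prompt from step 2

    def generate_questions(chunks: list[str]) -> list[dict]:
        """Generate two question/answer pairs per chunk with the prompt above."""
        qa_pairs = []
        for chunk in chunks:
            response = client.chat.completions.create(
                model="<your-deployment>",  # placeholder deployment name
                messages=[{"role": "user", "content": f"{PROMPT}\n\nCONTEXT:\n{chunk}"}],
            )
            # Assumes the model honored the prompt and returned strict JSON.
            qa_pairs.extend(json.loads(response.choices[0].message.content))
        return qa_pairs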

Unaddressed queries

It's important to gather queries that the documents don't address, along with queries that they do address. When you test your solution, particularly when you test the large language model, you need to determine how the solution should respond to queries for which it doesn't have sufficient context. Approaches to responding to queries you can't address include:

  • Responding that you don't know
  • Responding that you don't know and providing a link where the user might find more information
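
A system-prompt fragment like the following can implement these approaches. The wording and the support URL placeholder are illustrative, not prescribed.

    # An illustrative system-prompt fragment for handling queries that the
    # retrieved context can't answer; the exact wording is an assumption.
    SYSTEM_PROMPT = (
        "Answer only from the provided context. If the context doesn't "
        "contain the information needed to answer, reply: \"I don't know. "
        "You might find more information at <support URL>.\""
    )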

Gather test queries for embedded media

As with text, you should gather a diverse set of questions that involve using the embedded media to generate highly relevant answers. If you have images with graphs, tables, or screenshots, make sure you have questions that cover all the use cases. If you determined in the images portion of the document analysis that the text before or after an image is required to answer some questions, make sure those questions are in your test queries.

Gather test queries guidance

  • Determine whether there's a system that contains real customer questions that you can use. For example, if you're building a chat bot to answer customer questions, you might be able to use customer questions from your help desk, FAQs, or ticketing system.
  • The customer or SME for the use case should act as a quality gate to determine whether the gathered documents, the associated test queries, and the answers to the queries from the documents are comprehensive, representative, and correct.
  • Review the body of questions and answers periodically to ensure that they continue to accurately reflect the source documents.
