Process custom data for AI applications

In this quickstart, you learn how to create a data ingestion pipeline to process and prepare custom data for AI applications. The app uses the Microsoft.Extensions.DataIngestion library to read documents, enrich content with AI, chunk text semantically, and store embeddings in a vector database for semantic search.

Data ingestion is essential for retrieval-augmented generation (RAG) scenarios where you need to process large amounts of unstructured data and make it searchable for AI applications.

Prerequisites

Create the app

Complete the following steps to create a .NET console app.

  1. In an empty directory on your computer, use the dotnet new command to create a new console app:

    dotnet new console -o ProcessDataAI
    
  2. Change directory into the app folder:

    cd ProcessDataAI
    
  3. Install the required packages:

    dotnet add package Azure.AI.OpenAI
    dotnet add package Microsoft.Extensions.AI.OpenAI --prerelease
    dotnet add package Microsoft.Extensions.Configuration
    dotnet add package Microsoft.Extensions.Configuration.UserSecrets
    dotnet add package Microsoft.Extensions.DataIngestion --prerelease
    dotnet add package Microsoft.Extensions.DataIngestion.Markdig --prerelease
    dotnet add package Microsoft.Extensions.Logging.Console
    dotnet add package Microsoft.ML.Tokenizers.Data.O200kBase
    dotnet add package Microsoft.SemanticKernel.Connectors.SqliteVec --prerelease
    

Create the AI service

  1. To provision an Azure OpenAI service and model, complete the steps in the Create and deploy an Azure OpenAI Service resource article. For this quickstart, you need to provision two models: gpt-5 and text-embedding-3-small.

  2. From a terminal or command prompt, navigate to the root of your project directory.

  3. Run the following commands to configure your Azure OpenAI endpoint and API key for the sample app:

    dotnet user-secrets init
    dotnet user-secrets set AZURE_OPENAI_ENDPOINT <your-Azure-OpenAI-endpoint>
    dotnet user-secrets set AZURE_OPENAI_API_KEY <your-Azure-OpenAI-API-key>
    

Open the app in an editor

Open the app in Visual Studio Code (or your editor of choice).

    code .

Create the sample data

  1. Copy the sample.md file to a folder named data in your project directory.
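  2. Configure the project to copy this file to the output directory. If you're using Visual Studio, right-click the file in Solution Explorer, select Properties, and then set Copy to Output Directory to Copy if newer. In other editors, add a None item for data\sample.md to the project file and set its CopyToOutputDirectory metadata to PreserveNewest.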

Add the app code

The data ingestion pipeline consists of several components that work together to process documents:

  • Document reader: Reads Markdown files from a directory.
  • Document processor: Enriches images with AI-generated alternative text.
  • Chunker: Splits documents into semantic chunks using embeddings.
  • Chunk processor: Generates AI summaries for each chunk.
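  • Vector store writer: Stores chunks with embeddings in a SQLite database.

The snippets in the following steps don't show using directives. Add the ones your editor suggests for the types each snippet references.

  1. In the Program.cs file, delete any existing code and add the following code to configure the document reader: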

    // Configure document reader.
    IngestionDocumentReader reader = new MarkdownReader();
    

    The MarkdownReader class reads Markdown documents and converts them into a unified format that works well with large language models.

  2. Add code to configure logging for the pipeline:

    using ILoggerFactory loggerFactory =
        LoggerFactory.Create(builder => builder.AddSimpleConsole());
    
  3. Add code to configure the AI client for enrichment and chat:

    // Configure IChatClient to use Azure OpenAI.
    IConfigurationRoot config = new ConfigurationBuilder()
        .AddUserSecrets<Program>()
        .Build();
    
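    string endpoint = config["AZURE_OPENAI_ENDPOINT"]
        ?? throw new InvalidOperationException("AZURE_OPENAI_ENDPOINT is not set.");
    string apiKey = config["AZURE_OPENAI_API_KEY"]
        ?? throw new InvalidOperationException("AZURE_OPENAI_API_KEY is not set.");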
    string chatModel = "gpt-5";
    string embeddingModel = "text-embedding-3-small";
    
    AzureOpenAIClient azureClient = new(
        new Uri(endpoint),
        new AzureKeyCredential(apiKey));
    
    IChatClient chatClient =
        azureClient.GetChatClient(chatModel).AsIChatClient();
    
  4. Add code to configure the document processor that enriches images with AI-generated descriptions:

    // Configure document processor.
    EnricherOptions enricherOptions = new(chatClient)
    {
        // Enricher failures should not fail the whole ingestion pipeline,
        // as they are best-effort enhancements.
        // This logger factory can create loggers to log such failures.
        LoggerFactory = loggerFactory
    };
    
    IngestionDocumentProcessor imageAlternativeTextEnricher =
        new ImageAlternativeTextEnricher(enricherOptions);
    

    The ImageAlternativeTextEnricher uses large language models to generate descriptive alternative text for images within documents. The generated text makes the images more accessible and gives them textual content that semantic search can match.

  5. Add code to configure the embedding generator for creating vector representations:

    // Configure embedding generator.
    IEmbeddingGenerator<string, Embedding<float>> embeddingGenerator =
        azureClient.GetEmbeddingClient(embeddingModel).AsIEmbeddingGenerator();
    

    Embeddings are numerical representations of the semantic meaning of text, which enables vector similarity search.
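
    As a rough illustration of what vector similarity means, the following standalone sketch compares small hand-written vectors with cosine similarity. The three-dimensional vectors and the CosineSimilarity helper are hypothetical stand-ins and not part of the quickstart app. Real text-embedding-3-small embeddings have 1536 dimensions, and the vector store performs the comparison for you.

    ```csharp
    using System;

    // Hypothetical helper: cosine similarity between two equal-length vectors.
    static double CosineSimilarity(float[] a, float[] b)
    {
        if (a.Length != b.Length)
            throw new ArgumentException("Vectors must have the same length.");

        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.Length; i++)
        {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.Sqrt(normA) * Math.Sqrt(normB));
    }

    // Toy "embeddings" (made up for illustration, not model output).
    float[] cat = { 0.9f, 0.1f, 0.0f };
    float[] kitten = { 0.8f, 0.2f, 0.1f };
    float[] invoice = { 0.0f, 0.1f, 0.9f };

    Console.WriteLine($"cat vs kitten:  {CosineSimilarity(cat, kitten):F3}");
    Console.WriteLine($"cat vs invoice: {CosineSimilarity(cat, invoice):F3}");
    ```

    The first score is close to 1 because the two vectors point in nearly the same direction, while the second is close to 0.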

  6. Add code to configure the chunker that splits documents into semantic chunks:

    // Configure chunker to split text into semantic chunks.
    IngestionChunkerOptions chunkerOptions = new(TiktokenTokenizer.CreateForModel(chatModel))
    {
        MaxTokensPerChunk = 2000,
        OverlapTokens = 0
    };
    
    IngestionChunker<string> chunker =
        new SemanticSimilarityChunker(embeddingGenerator, chunkerOptions);
    

    The SemanticSimilarityChunker splits documents by analyzing the semantic similarity between sentences, so related content stays together. The resulting chunks preserve meaning and context better than simple character-based or token-based chunking.
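
    The general idea can be sketched in a few lines. The following standalone example is not the library's implementation: it's a simplified illustration that starts a new chunk whenever two adjacent sentences fall below a similarity threshold. The Cosine and SplitBySimilarity helpers, the 0.8 threshold, and the two-dimensional vectors are all hypothetical, and the sketch ignores token limits such as MaxTokensPerChunk.

    ```csharp
    using System;
    using System.Collections.Generic;

    // Hypothetical helper: cosine similarity between two equal-length vectors.
    static double Cosine(float[] a, float[] b)
    {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.Length; i++)
        {
            dot += a[i] * b[i];
            na += a[i] * a[i];
            nb += b[i] * b[i];
        }
        return dot / (Math.Sqrt(na) * Math.Sqrt(nb));
    }

    // Hypothetical helper: start a new chunk whenever adjacent sentences
    // are less similar than the threshold.
    static List<List<string>> SplitBySimilarity(
        IReadOnlyList<(string Text, float[] Vector)> items, double threshold)
    {
        var result = new List<List<string>>();
        for (int i = 0; i < items.Count; i++)
        {
            bool startNew = i == 0
                || Cosine(items[i - 1].Vector, items[i].Vector) < threshold;
            if (startNew)
                result.Add(new List<string>());
            result[^1].Add(items[i].Text);
        }
        return result;
    }

    // Toy sentence vectors (made up; a real pipeline calls an embedding model).
    var sentences = new List<(string Text, float[] Vector)>
    {
        ("Cats purr when content.", new[] { 0.9f, 0.1f }),
        ("Kittens purr too.", new[] { 0.8f, 0.2f }),
        ("Invoices are due in 30 days.", new[] { 0.1f, 0.9f }),
    };

    List<List<string>> chunks = SplitBySimilarity(sentences, threshold: 0.8);
    foreach (List<string> chunk in chunks)
    {
        Console.WriteLine(string.Join(" ", chunk));
    }
    ```

    With these toy vectors, the two cat sentences land in one chunk and the invoice sentence starts a second chunk.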

  7. Add code to configure the chunk processor that generates summaries:

    // Configure chunk processor to generate summaries for each chunk.
    IngestionChunkProcessor<string> summaryEnricher = new SummaryEnricher(enricherOptions);
    

    The SummaryEnricher automatically generates concise summaries for each chunk, which can improve retrieval accuracy by providing a high-level overview of the content.

  8. Add code to configure the SQLite vector store for storing embeddings:

    // Configure SQLite Vector Store.
    using SqliteVectorStore vectorStore = new(
        "Data Source=vectors.db;Pooling=false",
        new()
        {
            EmbeddingGenerator = embeddingGenerator
        });
    
    // The writer requires the embedding dimension count to be specified.
    using VectorStoreWriter<string> writer = new(
        vectorStore,
        dimensionCount: 1536,
        new VectorStoreWriterOptions { CollectionName = "data" });
    

    The vector store keeps each chunk alongside its embedding, which enables semantic search. The dimension count of 1536 matches the default output size of the text-embedding-3-small model.

  9. Add code to compose all the components into a complete pipeline:

    // Compose the data ingestion pipeline.
    using IngestionPipeline<string> pipeline =
        new(reader, chunker, writer, loggerFactory: loggerFactory)
    {
        DocumentProcessors = { imageAlternativeTextEnricher },
        ChunkProcessors = { summaryEnricher }
    };
    

    The IngestionPipeline<T> combines all the components into a cohesive workflow that processes documents from start to finish.

  10. Add code to process documents from a directory:

    await foreach (IngestionResult result in pipeline.ProcessAsync(
        new DirectoryInfo("./data"),
        searchPattern: "*.md"))
    {
        Console.WriteLine($"Completed processing '{result.DocumentId}'. " +
            $"Succeeded: '{result.Succeeded}'.");
    }
    

    The pipeline processes all Markdown files in the ./data directory and reports the status of each document.

  11. Add code to enable interactive search of the processed documents:

    // Search the vector store collection and display results.
    VectorStoreCollection<object, Dictionary<string, object?>> collection =
        writer.VectorStoreCollection;
    
    while (true)
    {
        Console.Write("Enter your question (or 'exit' to quit): ");
        string? searchValue = Console.ReadLine();
        if (string.IsNullOrEmpty(searchValue) || searchValue == "exit")
        {
            break;
        }
    
        Console.WriteLine("Searching...\n");
        await foreach (VectorSearchResult<Dictionary<string, object?>> result in
            collection.SearchAsync(searchValue, top: 3))
        {
            Console.WriteLine($"Score: {result.Score}\n\tContent: {result.Record["content"]}");
        }
    }
    

    The search functionality converts user queries into embeddings and finds the most semantically similar chunks in the vector store.
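
    Conceptually, this retrieval step ranks the stored vectors against the query vector and keeps the best matches. The following standalone sketch shows that ranking with an in-memory list. It isn't how the SQLite vector store is implemented, and the Dot and Search helpers and the two-dimensional unit vectors are hypothetical.

    ```csharp
    using System;
    using System.Collections.Generic;
    using System.Linq;

    // Hypothetical helper: dot product, which equals cosine similarity
    // when both vectors are unit length.
    static double Dot(float[] a, float[] b)
    {
        double sum = 0;
        for (int i = 0; i < a.Length; i++)
            sum += a[i] * b[i];
        return sum;
    }

    // Hypothetical helper: return the `top` records most similar to the query.
    static List<(string Content, double Score)> Search(
        IEnumerable<(string Content, float[] Vector)> records, float[] query, int top)
    {
        return records
            .Select(r => (r.Content, Score: Dot(r.Vector, query)))
            .OrderByDescending(r => r.Score)
            .Take(top)
            .ToList();
    }

    // Toy unit-length vectors (made up; a real app gets these from the embedding model).
    var store = new List<(string Content, float[] Vector)>
    {
        ("Data ingestion reads and chunks documents.", new[] { 1.0f, 0.0f }),
        ("Embeddings capture semantic meaning.", new[] { 0.6f, 0.8f }),
        ("SQLite stores the vectors on disk.", new[] { 0.0f, 1.0f }),
    };
    float[] queryVector = { 0.8f, 0.6f }; // Stands in for the embedded question.

    foreach ((string content, double score) in Search(store, queryVector, top: 2))
    {
        Console.WriteLine($"Score: {score:F2}\tContent: {content}");
    }
    ```

    A brute-force scan like this is fine for a handful of records. A vector database exists so that the same ranking stays fast as the collection grows.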

Run the app

  1. Use the dotnet run command to run the app:

    dotnet run
    

    The app processes all Markdown files in the ./data directory and displays the processing status for each document. Once processing is complete, you can enter natural language questions to search the processed content.

  2. Enter a question at the prompt to search the data:

    Enter your question (or 'exit' to quit): What is data ingestion?
    

    The app returns the most relevant chunks from your documents along with their similarity scores.

  3. Type exit to quit the application.

Clean up resources

If you no longer need them, delete the Azure OpenAI resource and model deployment.

  1. In the Azure portal, navigate to the Azure OpenAI resource.
  2. Select the Azure OpenAI resource, and then select Delete.

Next steps