In this quickstart, you learn how to create a data ingestion pipeline to process and prepare custom data for AI applications. The app uses the Microsoft.Extensions.DataIngestion library to read documents, enrich content with AI, chunk text semantically, and store embeddings in a vector database for semantic search.
Data ingestion is essential for retrieval-augmented generation (RAG) scenarios where you need to process large amounts of unstructured data and make it searchable for AI applications.
Prerequisites
- .NET 8.0 SDK or higher - Install the .NET 8 SDK.
- An Azure subscription - Create one for free.
- Azure Developer CLI (optional) - Install or update the Azure Developer CLI.
Create the app
Complete the following steps to create a .NET console app.
1. In an empty directory on your computer, use the `dotnet new` command to create a new console app:

   ```dotnetcli
   dotnet new console -o ProcessDataAI
   ```

1. Change directory into the app folder:

   ```dotnetcli
   cd ProcessDataAI
   ```

1. Install the required packages:

   ```dotnetcli
   dotnet add package Azure.AI.OpenAI
   dotnet add package Microsoft.Extensions.AI.OpenAI --prerelease
   dotnet add package Microsoft.Extensions.Configuration
   dotnet add package Microsoft.Extensions.Configuration.UserSecrets
   dotnet add package Microsoft.Extensions.DataIngestion --prerelease
   dotnet add package Microsoft.Extensions.DataIngestion.Markdig --prerelease
   dotnet add package Microsoft.Extensions.Logging.Console
   dotnet add package Microsoft.ML.Tokenizers.Data.O200kBase
   dotnet add package Microsoft.SemanticKernel.Connectors.SqliteVec --prerelease
   ```
Create the AI service
To provision an Azure OpenAI service and model, complete the steps in the Create and deploy an Azure OpenAI Service resource article. For this quickstart, you need to provision two models: `gpt-5` and `text-embedding-3-small`.

1. From a terminal or command prompt, navigate to the root of your project directory.

1. Run the following commands to configure your Azure OpenAI endpoint and API key for the sample app:

   ```dotnetcli
   dotnet user-secrets init
   dotnet user-secrets set AZURE_OPENAI_ENDPOINT <your-Azure-OpenAI-endpoint>
   dotnet user-secrets set AZURE_OPENAI_API_KEY <your-Azure-OpenAI-API-key>
   ```
Open the app in an editor
Open the app in Visual Studio Code (or your editor of choice).

```bash
code .
```
Create the sample data
1. Copy the sample.md file to a folder named `data` in your project directory.

1. Configure the project to copy this file to the output directory. If you're using Visual Studio, right-click the file in Solution Explorer, select Properties, and then set Copy to Output Directory to Copy if newer.
Add the app code
The data ingestion pipeline consists of several components that work together to process documents:
- Document reader: Reads Markdown files from a directory.
- Document processor: Enriches images with AI-generated alternative text.
- Chunker: Splits documents into semantic chunks using embeddings.
- Chunk processor: Generates AI summaries for each chunk.
- Vector store writer: Stores chunks with embeddings in a SQLite database.
In the `Program.cs` file, delete any existing code and add the following code to configure the document reader:

```csharp
// Configure document reader.
IngestionDocumentReader reader = new MarkdownReader();
```

The `MarkdownReader` class reads Markdown documents and converts them into a unified format that works well with large language models.
Add code to configure logging for the pipeline:
```csharp
using ILoggerFactory loggerFactory = LoggerFactory.Create(
    builder => builder.AddSimpleConsole());
```

Add code to configure the AI client for enrichment and chat:
```csharp
// Configure IChatClient to use Azure OpenAI.
IConfigurationRoot config = new ConfigurationBuilder()
    .AddUserSecrets<Program>()
    .Build();
string endpoint = config["AZURE_OPENAI_ENDPOINT"];
string apiKey = config["AZURE_OPENAI_API_KEY"];

string chatModel = "gpt-5";
string embeddingModel = "text-embedding-3-small";

AzureOpenAIClient azureClient = new(
    new Uri(endpoint),
    new AzureKeyCredential(apiKey));
IChatClient chatClient = azureClient.GetChatClient(chatModel).AsIChatClient();
```

Add code to configure the document processor that enriches images with AI-generated descriptions:
```csharp
// Configure document processor.
EnricherOptions enricherOptions = new(chatClient)
{
    // Enricher failures should not fail the whole ingestion pipeline,
    // as they are best-effort enhancements.
    // This logger factory can create loggers to log such failures.
    LoggerFactory = loggerFactory
};
IngestionDocumentProcessor imageAlternativeTextEnricher =
    new ImageAlternativeTextEnricher(enricherOptions);
```

The `ImageAlternativeTextEnricher` uses large language models to generate descriptive alternative text for images within documents. That text makes the images more accessible and improves their semantic meaning.
Add code to configure the embedding generator for creating vector representations:
```csharp
// Configure embedding generator.
IEmbeddingGenerator<string, Embedding<float>> embeddingGenerator =
    azureClient.GetEmbeddingClient(embeddingModel).AsIEmbeddingGenerator();
```

Embeddings are numerical representations of the semantic meaning of text that enable vector similarity search.
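Vector similarity search works because semantically related texts map to nearby vectors. As a standalone illustration of the underlying math (not part of the quickstart code), cosine similarity between two embedding vectors can be computed like this:

```csharp
// Illustration only: cosine similarity is the metric that typically
// underlies vector search over embeddings. A value near 1 means the
// two texts are semantically similar; a value near 0 means unrelated.
static double CosineSimilarity(float[] a, float[] b)
{
    double dot = 0, normA = 0, normB = 0;
    for (int i = 0; i < a.Length; i++)
    {
        dot += a[i] * b[i];
        normA += a[i] * a[i];
        normB += b[i] * b[i];
    }
    return dot / (Math.Sqrt(normA) * Math.Sqrt(normB));
}

// Identical directions score 1; orthogonal directions score 0.
Console.WriteLine(CosineSimilarity(new float[] { 1, 0 }, new float[] { 1, 0 })); // 1
Console.WriteLine(CosineSimilarity(new float[] { 1, 0 }, new float[] { 0, 1 })); // 0
```

The vector store performs this comparison for you at query time; you never call it directly in this quickstart.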
Add code to configure the chunker that splits documents into semantic chunks:
```csharp
// Configure chunker to split text into semantic chunks.
IngestionChunkerOptions chunkerOptions = new(
    TiktokenTokenizer.CreateForModel(chatModel))
{
    MaxTokensPerChunk = 2000,
    OverlapTokens = 0
};
IngestionChunker<string> chunker =
    new SemanticSimilarityChunker(embeddingGenerator, chunkerOptions);
```

The `SemanticSimilarityChunker` splits documents by analyzing the semantic similarity between sentences, which keeps related content together. This process produces chunks that preserve meaning and context better than simple character-based or token-based chunking.
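Conceptually, semantic chunking places a boundary wherever the similarity between adjacent pieces of text drops. The following standalone sketch illustrates the idea (it's not the library's actual implementation): start a new chunk whenever the cosine similarity between neighboring sentence embeddings falls below a threshold.

```csharp
// Sketch of the idea behind semantic chunking (not the library's
// actual implementation): group consecutive sentences, and start a
// new chunk when adjacent embeddings are no longer similar enough.
static List<List<int>> ChunkBySimilarity(float[][] embeddings, double threshold)
{
    var chunks = new List<List<int>> { new() { 0 } };
    for (int i = 1; i < embeddings.Length; i++)
    {
        if (Cosine(embeddings[i - 1], embeddings[i]) < threshold)
        {
            chunks.Add(new List<int>()); // similarity dropped: new chunk
        }
        chunks[^1].Add(i); // sentence i joins the current chunk
    }
    return chunks;
}

static double Cosine(float[] a, float[] b)
{
    double dot = 0, na = 0, nb = 0;
    for (int i = 0; i < a.Length; i++)
    {
        dot += a[i] * b[i];
        na += a[i] * a[i];
        nb += b[i] * b[i];
    }
    return dot / (Math.Sqrt(na) * Math.Sqrt(nb));
}

// Toy embeddings: sentences 0-1 point one way, sentences 2-3 another,
// so the sketch produces two chunks: [0, 1] and [2, 3].
float[][] toy =
{
    new float[] { 1, 0 }, new float[] { 0.9f, 0.1f },
    new float[] { 0, 1 }, new float[] { 0.1f, 0.9f },
};
Console.WriteLine(ChunkBySimilarity(toy, threshold: 0.5).Count); // 2
```

The real chunker also accounts for token limits and overlap (the `MaxTokensPerChunk` and `OverlapTokens` options shown above), but the boundary-detection principle is the same.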
Add code to configure the chunk processor that generates summaries:
```csharp
// Configure chunk processor to generate summaries for each chunk.
IngestionChunkProcessor<string> summaryEnricher = new SummaryEnricher(enricherOptions);
```

The `SummaryEnricher` automatically generates concise summaries for each chunk, which can improve retrieval accuracy by providing a high-level overview of the content.
Add code to configure the SQLite vector store for storing embeddings:
```csharp
// Configure SQLite Vector Store.
using SqliteVectorStore vectorStore = new(
    "Data Source=vectors.db;Pooling=false",
    new() { EmbeddingGenerator = embeddingGenerator });

// The writer requires the embedding dimension count to be specified.
using VectorStoreWriter<string> writer = new(
    vectorStore,
    dimensionCount: 1536,
    new VectorStoreWriterOptions { CollectionName = "data" });
```

The vector store stores chunks along with their embeddings, enabling fast semantic search. The `dimensionCount` value of 1536 matches the number of dimensions in the vectors that the `text-embedding-3-small` model produces.
Add code to compose all the components into a complete pipeline:
```csharp
// Compose data ingestion pipeline.
using IngestionPipeline<string> pipeline = new(
    reader, chunker, writer, loggerFactory: loggerFactory)
{
    DocumentProcessors = { imageAlternativeTextEnricher },
    ChunkProcessors = { summaryEnricher }
};
```

The `IngestionPipeline<T>` combines all the components into a cohesive workflow that processes documents from start to finish.
Add code to process documents from a directory:
```csharp
await foreach (IngestionResult result in pipeline.ProcessAsync(
    new DirectoryInfo("./data"), searchPattern: "*.md"))
{
    Console.WriteLine($"Completed processing '{result.DocumentId}'. " +
        $"Succeeded: '{result.Succeeded}'.");
}
```

The pipeline processes all Markdown files in the `./data` directory and reports the status of each document.

Add code to enable interactive search of the processed documents:
```csharp
// Search the vector store collection and display results.
VectorStoreCollection<object, Dictionary<string, object?>> collection =
    writer.VectorStoreCollection;

while (true)
{
    Console.Write("Enter your question (or 'exit' to quit): ");
    string? searchValue = Console.ReadLine();
    if (string.IsNullOrEmpty(searchValue) || searchValue == "exit")
    {
        break;
    }

    Console.WriteLine("Searching...\n");
    await foreach (VectorSearchResult<Dictionary<string, object?>> result in
        collection.SearchAsync(searchValue, top: 3))
    {
        Console.WriteLine(
            $"Score: {result.Score}\n\tContent: {result.Record["content"]}");
    }
}
```

The search functionality converts user queries into embeddings and finds the most semantically similar chunks in the vector store.
Run the app
Use the `dotnet run` command to run the app:

```dotnetcli
dotnet run
```

The app processes all Markdown files in the `./data` directory and displays the processing status for each document. Once processing is complete, you can enter natural language questions to search the processed content:

```output
Enter your question (or 'exit' to quit): What is data ingestion?
```

The app returns the most relevant chunks from your documents along with their similarity scores.
Type `exit` to quit the application.
Clean up resources
If you no longer need them, delete the Azure OpenAI resource and model deployment.
- In the Azure portal, navigate to the Azure OpenAI resource.
- Select the Azure OpenAI resource, and then select Delete.