
Data ingestion

Data ingestion is the process of collecting, reading, and preparing data from different sources such as files, databases, APIs, or cloud services so it can be used in downstream applications. In practice, this process follows the Extract-Transform-Load (ETL) workflow:

  • Extract data from its original source, whether that is a PDF, Word document, audio file, or web API.
  • Transform the data by cleaning, chunking, enriching, or converting formats.
  • Load the data into a destination like a database, vector store, or AI model for retrieval and analysis.

For AI and machine learning scenarios, especially retrieval-augmented generation (RAG), data ingestion is not just about converting data from one format to another. It is about making data usable for intelligent applications. This means representing documents in a way that preserves their structure and meaning, splitting them into manageable chunks, enriching them with metadata or embeddings, and storing them so they can be retrieved quickly and accurately.

Why data ingestion matters for AI applications

Imagine you're building a RAG-powered chatbot to help employees find information across your company's vast collection of documents. These documents might include PDFs, Word files, PowerPoint presentations, and web pages scattered across different systems.

Your chatbot needs to understand and search through thousands of documents to provide accurate, contextual answers. But raw documents aren't suitable for AI systems. You need to transform them into a format that preserves meaning while making them searchable and retrievable.

This is where data ingestion becomes critical. You need to extract text from different file formats, break large documents into smaller chunks that fit within AI model limits, enrich the content with metadata, generate embeddings for semantic search, and store everything in a way that enables fast retrieval. Each step requires careful consideration of how to preserve the original meaning and context.

Data ingestion building blocks

The Microsoft.Extensions.DataIngestion library is built around several key components that work together to create a complete data processing pipeline. This section explores each component and how they fit together.

Documents and document readers

At the foundation of the library is the IngestionDocument type, which provides a unified way to represent any file format without losing important information. IngestionDocument is Markdown-centric because large language models work best with Markdown formatting.

The IngestionDocumentReader abstraction handles loading documents from various sources, whether local files or streams. A few readers are available today, and more (including LlamaParse and Azure Document Intelligence) will be added in the future.

This design means you can work with documents from different sources using the same consistent API, making your code more maintainable and flexible.
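
For illustration, the following sketch reads a single Markdown file into the unified document model. The concrete MarkdownReader type and the ReadAsync overload shown here are assumptions; check the package for the readers and overloads available in your version.

// Hypothetical reader choice for illustration; any IngestionDocumentReader is used the same way.
IngestionDocumentReader reader = new MarkdownReader();

// Read a local file into the Markdown-centric IngestionDocument model.
IngestionDocument document = await reader.ReadAsync("employee-handbook.md");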

Document processing

Document processors apply transformations at the document level to enhance and prepare content. The library provides the ImageAlternativeTextEnricher class as a built-in processor that uses large language models to generate descriptive alternative text for images within documents.
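
The following sketch shows how such a processor might be wired up. The constructor signature is an assumption for illustration, and chatClient is assumed to be an IChatClient (from Microsoft.Extensions.AI) configured elsewhere in your application.

// chatClient is an IChatClient configured elsewhere; the constructor signature is an
// assumption for illustration. The enricher asks the model to describe each image and
// stores the result as alternative text in the document.
var imageAlternativeTextEnricher = new ImageAlternativeTextEnricher(chatClient);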

Chunks and chunking strategies

Once you have a document loaded, you typically need to break it down into smaller pieces called chunks. Chunks represent subsections of a document that can be efficiently processed, stored, and retrieved by AI systems. This chunking process is essential for retrieval-augmented generation scenarios where you need to find the most relevant pieces of information quickly.

The library provides several chunking strategies to fit different use cases:

  • Header-based chunking to split on headers.
  • Section-based chunking to split on sections (for example, pages).
  • Semantic-aware chunking to preserve complete thoughts.

These chunking strategies build on the Microsoft.ML.Tokenizers library to intelligently split text into appropriately sized pieces that work well with large language models. The right chunking strategy depends on your document types and how you plan to retrieve information.

// Tokenizer used to measure chunk sizes in model tokens.
Tokenizer tokenizer = TiktokenTokenizer.CreateForModel("gpt-5");

IngestionChunkerOptions options = new(tokenizer)
{
    MaxTokensPerChunk = 2000,
    OverlapTokens = 0
};

// Split on Markdown headers, keeping each chunk within the configured token budget.
IngestionChunker<string> chunker = new HeaderChunker(options);
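
Outside of a pipeline, you can invoke the chunker directly against a document you read earlier. The ProcessAsync call and the Content property below are assumptions for illustration; when you use IngestionPipeline<T> (shown later in this article), the chunker is invoked for you.

// Assumed chunker invocation and chunk shape, for illustration only.
await foreach (IngestionChunk<string> chunk in chunker.ProcessAsync(document))
{
    Console.WriteLine($"Chunk size: {tokenizer.CountTokens(chunk.Content)} tokens.");
}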

Chunk processing and enrichment

After documents are split into chunks, you can apply processors to enhance and enrich the content. Chunk processors work on individual pieces and can perform:

  • Content enrichment including automatic summaries (SummaryEnricher), sentiment analysis (SentimentEnricher), and keyword extraction (KeywordEnricher).
  • Classification for automated content categorization based on predefined categories (ClassificationEnricher).

These processors use Microsoft.Extensions.AI.Abstractions to leverage large language models for intelligent content transformation, making your chunks more useful for downstream AI applications.
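
For example, a summary enricher can be created from the same chat client. The constructor shown is an assumption for illustration (chatClient is again an IChatClient configured elsewhere); the resulting processor is what the pipeline example later in this article plugs into ChunkProcessors.

// Assumed constructor for illustration; the enricher uses the chat client to generate a
// short summary for each chunk and attach it to the chunk's metadata.
var summaryEnricher = new SummaryEnricher(chatClient);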

Document writer and storage

The IngestionChunkWriter<T> abstraction stores processed chunks in a data store for later retrieval. Built on Microsoft.Extensions.AI and Microsoft.Extensions.VectorData, the library provides the VectorStoreWriter<T> class, which can store chunks in any vector store that Microsoft.Extensions.VectorData supports.

Vector stores include popular options like Qdrant, SQL Server, Azure Cosmos DB, MongoDB, and Elasticsearch. For more information about providers, see Out-of-the-box Vector Store providers. (Despite the inclusion of "SemanticKernel" in the package names, these providers have nothing to do with Semantic Kernel and are usable anywhere in .NET, including Agent Framework.)

The writer can also automatically generate embeddings for your chunks using Microsoft.Extensions.AI, readying them for semantic search and retrieval scenarios.

OpenAIClient openAIClient = new(
    new ApiKeyCredential(Environment.GetEnvironmentVariable("GITHUB_TOKEN")!),
    new OpenAIClientOptions { Endpoint = new Uri("https://models.github.ai/inference") });

IEmbeddingGenerator<string, Embedding<float>> embeddingGenerator =
    openAIClient.GetEmbeddingClient("text-embedding-3-small").AsIEmbeddingGenerator();

using SqliteVectorStore vectorStore = new(
    "Data Source=vectors.db;Pooling=false",
    new()
    {
        EmbeddingGenerator = embeddingGenerator
    });

// The writer requires the embedding dimension count to be specified.
// For OpenAI's `text-embedding-3-small`, the dimension count is 1536.
using VectorStoreWriter<string> writer = new(vectorStore, dimensionCount: 1536);

Document processing pipeline

The IngestionPipeline<T> API allows you to chain together the various data ingestion components into a complete workflow. You can combine:

  • Readers to load documents from various sources.
  • Processors to transform and enrich document content.
  • Chunkers to break documents into manageable pieces.
  • Writers to store the final results in your chosen data store.

This pipeline approach reduces boilerplate code and makes it easy to build, test, and maintain complex data ingestion workflows.

using IngestionPipeline<string> pipeline = new(reader, chunker, writer, loggerFactory: loggerFactory)
{
    DocumentProcessors = { imageAlternativeTextEnricher },
    ChunkProcessors = { summaryEnricher }
};

await foreach (var result in pipeline.ProcessAsync(new DirectoryInfo("."), searchPattern: "*.md"))
{
    Console.WriteLine($"Completed processing '{result.DocumentId}'. Succeeded: '{result.Succeeded}'.");
}

A single document ingestion failure shouldn't fail the whole pipeline. That's why IngestionPipeline<T>.ProcessAsync implements partial success by returning IAsyncEnumerable<IngestionResult>. The caller is responsible for handling any failures (for example, by retrying failed documents or stopping on first error).
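
For example, a caller might collect the failed results and retry them later. This sketch relies only on the ProcessAsync call and the Succeeded member shown above.

// Collect failed results for a later retry pass.
List<IngestionResult> failures = new();

await foreach (IngestionResult result in pipeline.ProcessAsync(new DirectoryInfo("."), searchPattern: "*.md"))
{
    if (!result.Succeeded)
    {
        failures.Add(result);
    }
}

// Retry, log, or surface the failures in whatever way fits your application.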