Introducing Data Ingestion Building Blocks in .NET for AI Applications
Luis Quintanilla and Adam Sitnik announce a preview of modular .NET building blocks for data ingestion, detailing how developers can build scalable, flexible AI pipelines for intelligent applications.
Authors: Luis Quintanilla, Adam Sitnik
The .NET team has announced a preview release of open and modular data ingestion building blocks designed to empower developers in the .NET ecosystem to create scalable AI and ML data pipelines, especially for Retrieval-Augmented Generation (RAG) scenarios.
Why Context Engineering Matters for AI
AI models have a knowledge cutoff and do not access personal or organizational data by default. To enable high-quality answers in AI apps, developers need robust context engineering: providing AI models with relevant, processed data at the right time.
Data Ingestion Challenges for .NET Developers
- Efficiently ingest, transform, and retrieve data for AI and context-aware experiences
- Move and manipulate data in ETL workflows (Extract, Transform, Load) for quality, structure, and usefulness
- Make data usable for AI/ML: chunking, enriching, embedding, chunk management
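To make the chunking challenge concrete, here is a naive token-budget splitter. `ChunkByTokens` is a hypothetical helper written for illustration, not a library API; it assumes the `TiktokenTokenizer` type from the Microsoft.ML.Tokenizers package (with the Cl100kBase data package). The library's own chunkers, shown later, handle this properly.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using Microsoft.ML.Tokenizers;

// Naive token-based chunking sketch: encode the text, slice the token ids
// into fixed-size windows, decode each slice back to a string.
Tokenizer tokenizer = TiktokenTokenizer.CreateForEncoding("cl100k_base");

string text = "Long document text that needs to be split before embedding...";
foreach (string chunk in ChunkByTokens(tokenizer, text, maxTokens: 8))
{
    Console.WriteLine(chunk);
}

static IEnumerable<string> ChunkByTokens(Tokenizer tokenizer, string text, int maxTokens)
{
    IReadOnlyList<int> ids = tokenizer.EncodeToIds(text);
    for (int i = 0; i < ids.Count; i += maxTokens)
    {
        // Hypothetical helper: a real chunker would also respect sentence
        // and section boundaries rather than cutting mid-word.
        yield return tokenizer.Decode(ids.Skip(i).Take(maxTokens));
    }
}
```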
Preview: Data Ingestion Building Blocks for .NET
The new data ingestion library provides:
- Unified document representation: Support for diverse file types (PDF, Word, images, etc.) with a consistent structure for large language models
- Flexible ingestion: Readers for both cloud and local data sources
- AI-powered enrichment: Automatic enhancements (summaries, sentiment analysis, keyword extraction, classification)
- Customizable chunking: Token-based, section-based, semantic-aware chunkers
- Persistent storage: Vector databases and document stores with embedding support
- Composable pipelines: Chain together readers, processors, chunkers, and writers via the IngestionPipeline API
- Scalability: Components built for large-scale, enterprise-grade data processing and integration
- Extensibility: Designed for plugin capability, letting developers add logic, connectors, and integrations as their needs evolve
Key Library Dependencies
- Microsoft.ML.Tokenizers: Tokenizer support for document chunking
- Microsoft.Extensions.AI: AI enrichments like summarization, sentiment, embedding
- Microsoft.Extensions.VectorData: Unified interface for vector store backends (Qdrant, SQL Server, CosmosDB, MongoDB, ElasticSearch, SQLite, etc.)
Building a Data Ingestion Pipeline: Step-by-Step Example
1. Set Up a Project & Install Packages
ni DataIngestion.cs    # PowerShell
touch DataIngestion.cs # Bash
# Add the key packages as #:package directives at the top of DataIngestion.cs
# (the .NET 10 file-based app style; in a project-based app, add them with
# dotnet add package instead)
#:package Microsoft.Extensions.DataIngestion@10.0.1-preview.1.25571.5
#:package Microsoft.Extensions.DataIngestion.Markdig@10.0.1-preview.1.25571.5
#:package Microsoft.Extensions.AI.OpenAI@10.0.1-preview.1.25571.5
#:package Microsoft.Extensions.Logging.Console@10.0.0
#:package Microsoft.ML.Tokenizers.Data.Cl100kBase@2.0.0
#:package Microsoft.SemanticKernel.Connectors.SqliteVec@1.67.1-preview
2. Document Reading
IngestionDocumentReader reader = new MarkdownReader();
3. Document Processing/Enrichment
Use AI-based enrichment (e.g., add alt text for images):
ILoggerFactory loggerFactory = LoggerFactory.Create(builder => builder.AddSimpleConsole());
OpenAIClient openAIClient = new(...);
IChatClient chatClient = openAIClient.GetChatClient("gpt-4.1").AsIChatClient();
EnricherOptions enricherOptions = new(chatClient) { LoggerFactory = loggerFactory };
IngestionDocumentProcessor imageAlternativeTextEnricher = new ImageAlternativeTextEnricher(enricherOptions);
4. Chunking Documents
IEmbeddingGenerator<string, Embedding<float>> embeddingGenerator = openAIClient.GetEmbeddingClient("text-embedding-3-small").AsIEmbeddingGenerator();
Tokenizer tokenizer = TiktokenTokenizer.CreateForEncoding("cl100k_base"); // from Microsoft.ML.Tokenizers
IngestionChunker<string> chunker = new SemanticSimilarityChunker(embeddingGenerator, new IngestionChunkerOptions(tokenizer) { MaxTokensPerChunk = 2000 });
5. Chunk Processing/Enrichment
Generate summaries for each chunk:
IngestionChunkProcessor<string> summaryEnricher = new SummaryEnricher(enricherOptions);
6. Store Processed Chunks
SqliteVectorStore vectorStore = new(...);
VectorStoreWriter<string> writer = new(vectorStore, dimensionCount: 1536);
7. Compose Pipeline & Run Processing
IngestionPipeline<string> pipeline = new(reader, chunker, writer, loggerFactory: loggerFactory)
{
DocumentProcessors = { imageAlternativeTextEnricher },
ChunkProcessors = { summaryEnricher }
};
await foreach (var result in pipeline.ProcessAsync(new DirectoryInfo("."), searchPattern: "*.md")) {
Console.WriteLine($"Completed processing '{result.DocumentId}'. Succeeded: '{result.Succeeded}'.");
}
8. Vector Search on Stored Chunks
var collection = writer.VectorStoreCollection;
while (true) {
Console.Write("Enter your question (or 'exit' to quit): ");
string? searchValue = Console.ReadLine();
if (string.IsNullOrEmpty(searchValue) || searchValue == "exit") break;
Console.WriteLine("Searching...\n");
await foreach (var result in collection.SearchAsync(searchValue, top: 3)) {
Console.WriteLine($"Score: {result.Score}\n\tContent: {result.Record["content"]}");
}
}
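To close the RAG loop, the retrieved chunks can be handed to the chat model as grounding context. This is a minimal sketch that reuses the `chatClient`, `collection`, and `searchValue` variables from the steps above; the system-prompt wording is an assumption for illustration, not from the post.

```csharp
// Sketch: ground the chat model on the top search results (the classic
// RAG pattern this post targets). Reuses chatClient and collection from
// the earlier steps.
var context = new System.Text.StringBuilder();
await foreach (var result in collection.SearchAsync(searchValue, top: 3))
{
    context.AppendLine(result.Record["content"]?.ToString());
}

var response = await chatClient.GetResponseAsync(
[
    // Illustrative prompt wording; tune it for your own application.
    new ChatMessage(ChatRole.System, $"Answer using only this context:\n{context}"),
    new ChatMessage(ChatRole.User, searchValue),
]);
Console.WriteLine(response.Text);
```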
End-to-End Sample: AI Chat Web Template
Try the official AI Web Chat Template for a guided experience integrating MarkItDown for parsing, Qdrant for chunk storage, and distributed orchestration with .NET Aspire. Observability is handled via OpenTelemetry tracing.
Observability with Aspire
Enable distributed tracing for data ingestion workflows:
builder.Services.AddOpenTelemetry()
.WithTracing(tracing => {
tracing.AddSource("Experimental.Microsoft.Extensions.AI");
tracing.AddSource("Experimental.Microsoft.Extensions.DataIngestion");
});
Get Started
- Install the AI Web Chat Template: Quickstart
- Sample code: DataIngestion repo
- Extend abstractions: NuGet Data Ingestion Abstractions
Developers and ecosystem authors can extend, integrate, and compose these building blocks for custom data scenarios in AI and ML.
This post appeared first on “Microsoft .NET Blog”. Read the entire article here