Add data ingestion quickstart for processing custom data #50558
Co-authored-by: gewarren <24882762+gewarren@users.noreply.github.com>
Pull request overview
This pull request adds a new quickstart tutorial for the Microsoft.Extensions.DataIngestion library, demonstrating how to build an ETL pipeline for RAG scenarios. The quickstart shows users how to read Markdown documents, enrich them with AI, chunk them semantically, and store them in a vector database for semantic search. Additionally, the PR cleans up other quickstart files by removing model names from the user secrets configuration and hardcoding them as inline string values instead.
Key Changes
- New quickstart documentation showing end-to-end data ingestion pipeline for AI applications
- Sample code demonstrating pipeline composition with readers, enrichers, chunkers, and vector storage
- Code cleanup across existing quickstarts (text-to-image, structured-output) to simplify configuration
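To make the pipeline shape concrete, here is a minimal, self-contained sketch of the read, chunk, enrich, and store stages. It is an illustration under stated assumptions, not the code from this PR: `DocChunk`, `Pipeline`, and `InMemoryStore` are hypothetical names, none of them belong to the Microsoft.Extensions.DataIngestion API, enrichment is plain keyword extraction rather than an AI call, chunking is heading-based rather than semantic, and search ranks by keyword overlap rather than vector similarity. Enrichment is applied per chunk here to keep the sketch short.

```csharp
// Illustrative sketch only: all types here are hypothetical stand-ins,
// NOT the Microsoft.Extensions.DataIngestion API.
using System;
using System.Collections.Generic;
using System.Linq;

string markdown = string.Join("\n",
    "# Setup",
    "Install the SDK and create a project.",
    "# Usage",
    "Run the app to ingest documents.");

// Compose the stages: read -> chunk -> enrich -> store.
var store = new InMemoryStore();
foreach (DocChunk chunk in Pipeline.SplitByHeading(Pipeline.ReadLines(markdown)))
{
    store.Add(Pipeline.AddKeywords(chunk));
}

foreach (DocChunk hit in store.Search("ingest documents", top: 1))
{
    Console.WriteLine($"{hit.Heading}: {hit.Text}");
    // prints "Usage: Run the app to ingest documents."
}

// A chunk of a document plus the metadata added by the enrichment step.
record DocChunk(string Heading, string Text, IReadOnlyList<string> Keywords);

static class Pipeline
{
    // Read: turn raw Markdown into trimmed, non-empty lines.
    public static IEnumerable<string> ReadLines(string markdown) =>
        markdown.Split('\n').Select(l => l.Trim()).Where(l => l.Length > 0);

    // Chunk: start a new chunk at every "# " heading
    // (a crude stand-in for semantic chunking).
    public static IEnumerable<DocChunk> SplitByHeading(IEnumerable<string> lines)
    {
        string heading = "(no heading)";
        var body = new List<string>();
        foreach (string line in lines)
        {
            if (line.StartsWith("# ", StringComparison.Ordinal))
            {
                if (body.Count > 0)
                    yield return new DocChunk(heading, string.Join(" ", body), Array.Empty<string>());
                heading = line.Substring(2);
                body.Clear();
            }
            else
            {
                body.Add(line);
            }
        }
        if (body.Count > 0)
            yield return new DocChunk(heading, string.Join(" ", body), Array.Empty<string>());
    }

    // Enrich: attach keywords (a stand-in for AI-generated metadata).
    public static DocChunk AddKeywords(DocChunk chunk) =>
        chunk with { Keywords = Tokenize(chunk.Text).Distinct().ToList() };

    // Lowercase the text and keep words longer than three characters.
    public static IEnumerable<string> Tokenize(string text) =>
        text.ToLowerInvariant()
            .Split(new[] { ' ', '.', ',' }, StringSplitOptions.RemoveEmptyEntries)
            .Where(w => w.Length > 3);
}

// Store: keeps chunks in memory and ranks them by keyword overlap
// (a stand-in for similarity search in a vector database).
sealed class InMemoryStore
{
    private readonly List<DocChunk> _chunks = new();

    public void Add(DocChunk chunk) => _chunks.Add(chunk);

    public IEnumerable<DocChunk> Search(string query, int top)
    {
        HashSet<string> terms = Pipeline.Tokenize(query).ToHashSet();
        return _chunks
            .OrderByDescending(c => c.Keywords.Count(k => terms.Contains(k)))
            .Take(top);
    }
}
```

Each stage is a plain function from one representation to the next, so a stage can be swapped (for example, a different chunker) without touching the others. That is the composition idea the quickstart demonstrates with real readers, enrichers, chunkers, and a vector store.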
Reviewed changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| docs/ai/toc.yml | Adds new quickstart entry under "Chat with your data (RAG)" section |
| docs/ai/quickstarts/process-data.md | New quickstart documentation for data ingestion pipeline |
| docs/ai/quickstarts/snippets/process-data/Program.cs | Complete C# example implementing data ingestion with Azure OpenAI |
| docs/ai/quickstarts/snippets/process-data/ProcessData.csproj | Project file with required NuGet packages |
| docs/ai/quickstarts/snippets/process-data/data/sample.md | Sample Markdown document for testing the pipeline |
| docs/ai/quickstarts/text-to-image.md | Removed unnecessary model name from user secrets configuration |
| docs/ai/quickstarts/structured-output.md | Removed unnecessary model name from user secrets configuration |
| docs/ai/quickstarts/snippets/text-to-image/azure-openai/Program.cs | Hardcoded model name instead of reading from user secrets |
| docs/ai/quickstarts/snippets/structured-output/Program.cs | Hardcoded model name instead of reading from user secrets |
| docs/ai/how-to/snippets/access-data/ArgumentsExample.cs | Hardcoded model name instead of reading from user secrets |
Summary
Adds quickstart documentation for the Microsoft.Extensions.DataIngestion library, demonstrating a complete ETL pipeline for RAG scenarios.
Contributes to #50534
Changes
Documentation
docs/ai/quickstarts/process-data.md
Code Snippets
Based on sample from https://github.com/luisquintanilla/DataIngestion and blog announcement at https://devblogs.microsoft.com/dotnet/introducing-data-ingestion-building-blocks-preview/
Warning
Firewall rules blocked me from connecting to one or more addresses.
I tried to connect to the following addresses, but was blocked by firewall rules:
devblogs.microsoft.com
- /usr/bin/curl curl -s REDACTED (dns block)
- /usr/bin/wget wget -q -O /tmp/blog.html REDACTED (dns block)
If you need me to access, download, or install something from one of these locations, you can either: