Kheiss/centralization by kheiss-uwzoo · Pull Request #1416 · NVIDIA/nv-ingest

kheiss-uwzoo · 2026-02-20T21:28:08Z

Summary

Refactors the NV-Ingest docs so each technical fact has one canonical place, removes duplication, and adds clear patterns and guardrails for future doc updates.

Changes

Single source of truth
Product naming: The note “NeMo Retriever extraction is also known as NVIDIA Ingest and nv-ingest” lives only in overview.md. All other pages use a one-line pointer to the overview.
Environment variables: environment-config.md is the canonical env reference. nimclient.md and scaling-modes.md reference it; scaling-related vars are documented in scaling-modes and linked from environment-config.
Broken links: Replaced references to environment-variables.md with environment-config.md in chunking and content-metadata. Fixed the metadata schema link in user-defined-functions.
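A stale reference like the one fixed above can be caught mechanically. The following is a minimal sketch and is not part of this PR: it scans Markdown files under a docs directory for any line that still mentions a retired filename. The helper name `find_stale_links` and the directory layout are assumptions made for illustration.

```python
from pathlib import Path


def find_stale_links(docs_root: str, retired_name: str) -> list[tuple[str, int]]:
    """Return (relative path, line number) for each Markdown line mentioning retired_name."""
    root = Path(docs_root)
    hits = []
    for md_file in sorted(root.rglob("*.md")):
        text = md_file.read_text(encoding="utf-8")
        for line_no, line in enumerate(text.splitlines(), start=1):
            # Plain substring match: catches links and bare mentions alike.
            if retired_name in line:
                hits.append((md_file.relative_to(root).as_posix(), line_no))
    return hits
```

Calling `find_stale_links("docs", "environment-variables.md")` should return an empty list once every link has been re-pointed; the `docs` path here is hypothetical.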
Centralization of technical facts
Support Matrix is the canonical reference for pipeline NIMs and hardware; other pages (e.g. Python API, chunking) link to it.
Chunking is the canonical place for tokenizer and split defaults; benchmarking and quickstart-library-mode reference it.
dense_dim: Canonical guidance (2048 for llama-3.2 embedder, 1024 for e5-v5) is in data-store.md; quickstart-guide and quickstart-library-mode link there instead of repeating it.
Known issues: troubleshoot.md has a short “Known issues and release-specific caveats” section that points to Release Notes and Support Matrix; Release Notes added to Related Topics.
Canonical tokenizer section
Added a single canonical subsection, “Token-based splitting and tokenizers”, in chunking.md covering: default tokenizer behavior (pre-downloaded Llama vs e5, runtime fallback), optional Llama tokenizer and Hugging Face gating, and env vars DOWNLOAD_LLAMA_TOKENIZER and HF_ACCESS_TOKEN (build vs runtime).
Aligned wording with current code/container (e.g. split_text.py, post_build_triggers.py, docker-compose defaults).
All other tokenizer mentions (environment-config, nv-ingest-python-api, releasenotes, support-matrix) now give a short note and link to this subsection.
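The "short note and link" rule can be spot-checked with a small script. This is a sketch under stated assumptions, not tooling shipped with the PR: it lists pages that mention a term but never reference the canonical page. The helper name `pages_missing_pointer` is hypothetical, and the check only looks for the canonical filename, not for a specific section anchor.

```python
from pathlib import Path


def pages_missing_pointer(docs_root: str, term: str, canonical_page: str) -> list[str]:
    """List pages that mention term but never reference canonical_page."""
    root = Path(docs_root)
    missing = []
    for md_file in sorted(root.rglob("*.md")):
        if md_file.name == canonical_page:
            continue  # the canonical page does not need to link to itself
        text = md_file.read_text(encoding="utf-8")
        # Case-insensitive term match; the filename match is exact.
        if term.lower() in text.lower() and canonical_page not in text:
            missing.append(md_file.relative_to(root).as_posix())
    return missing
```

For the tokenizer case, `pages_missing_pointer("docs", "tokenizer", "chunking.md")` would be expected to return an empty list after this change; the `docs` path is again an assumption.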
Documentation maintenance guidance
Canonical reference table: Maps each topic (product naming, env vars, scaling, tokenizer, NIMs, metadata, data-store, known issues) to its canonical page/section.
Pattern for any architecture/code change: (1) Identify the concept, (2) Choose canonical home, (3) Update canonical description first, (4) De-duplicate and re-point (search and link). Tokenizer is documented as an example.
Guardrails: Do not add a second long description of the same concept without moving it into the canonical section or updating that section and linking back; avoid two competing “main” descriptions.
Style and linking: Use consistent “For full details / most up-to-date … see [canonical section]” phrasing; use stable relative links; if a canonical section is renamed or moved, update all inbound links in the same commit.
When tokenizer behavior changes: Update the tokenizer canonical subsection first, then search for tokenizer-related terms and align/link all hits.
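The canonical reference table and the consistent pointer phrasing lend themselves to a small lookup. The sketch below is illustrative only: the mapping includes just the topic and page pairings named in this description, the exact sentence template is an assumption modeled on the “For full details … see [canonical section]” pattern, and the relative links assume all pages share one directory.

```python
# Topic -> canonical page, limited to pairings stated in this PR description.
CANONICAL = {
    "product naming": "overview.md",
    "environment variables": "environment-config.md",
    "scaling": "scaling-modes.md",
    "tokenizer": "chunking.md",
    "dense_dim": "data-store.md",
}


def pointer(topic: str) -> str:
    """Build a one-line cross-reference for use on non-canonical pages."""
    page = CANONICAL[topic]  # raises KeyError for topics without a canonical home
    # Hypothetical wording; adjust to the phrasing the docs actually standardize on.
    return f"For full details on {topic}, see [{page}]({page})."
```

Keeping the mapping in one place mirrors the guardrail itself: if a canonical page is renamed, only the table entry changes and every generated pointer follows.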

Result

One canonical place per concept; other pages summarize and link.
Future changes (tokenizer, env vars, models, schema, etc.) follow the same flow: update canonical section → search → de-duplicate and re-point.
Consistent cross-reference phrasing and stable links to reduce drift and duplicate “main” descriptions.

kheiss-uwzoo added 2 commits February 19, 2026 10:36

Update PDF blueprint architecture diagram

cd3c368

Single source of truth

f721116

kheiss-uwzoo added the doc label (Improvements or additions to documentation) Feb 20, 2026

kheiss-uwzoo added 2 commits February 20, 2026 13:29

Update audio.md

2457e94

Update chunking.md

7bf09aa

kheiss-uwzoo wants to merge 4 commits into NVIDIA:main from kheiss-uwzoo:kheiss/centralization
