Draft
Summary
Refactors the NV-Ingest docs so each technical fact has one canonical place, removes duplication, and adds clear patterns and guardrails for future doc updates.
Changes
Single source of truth
Product naming: The note “NeMo Retriever extraction is also known as NVIDIA Ingest and nv-ingest” lives only in overview.md. All other pages use a one-line pointer to the overview.
Environment variables: environment-config.md is the canonical environment variable reference, and nimclient.md and scaling-modes.md link to it. The one exception is scaling-related variables, which are documented in scaling-modes.md and linked from environment-config.md.
Broken links: Replaced links to environment-variables.md with environment-config.md in the chunking and content-metadata pages. Fixed the metadata schema link in user-defined-functions.
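A link fix like the one above can be checked mechanically. The following is a minimal sketch, not part of this PR: it assumes the docs are Markdown files under a single root directory and that cross-page links are inline relative links to `.md` files. The function name `find_broken_links` is hypothetical.

```python
import re
from pathlib import Path

# Matches inline Markdown links such as [text](target.md#anchor).
LINK_RE = re.compile(r"\[[^\]]*\]\(([^)\s]+)\)")


def find_broken_links(docs_root):
    """Return (source page, link target) pairs whose relative .md target is missing."""
    root = Path(docs_root)
    broken = []
    for page in sorted(root.rglob("*.md")):
        text = page.read_text(encoding="utf-8")
        for target in LINK_RE.findall(text):
            # Skip external links and same-page anchors.
            if "://" in target or target.startswith(("#", "mailto:")):
                continue
            # Drop any #anchor and check only the file part.
            path_part = target.split("#", 1)[0]
            if not path_part.endswith(".md"):
                continue
            if not (page.parent / path_part).exists():
                broken.append((page.relative_to(root).as_posix(), target))
    return broken
```

Running it against the docs root before and after a rename should show stale targets such as environment-variables.md disappearing from the result. It checks only that the target file exists, not that the anchor inside it does.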
Centralization of technical facts
Support Matrix is the canonical reference for pipeline NIMs and hardware; other pages (e.g. Python API, chunking) link to it.
Chunking is the canonical place for tokenizer and split defaults; benchmarking and quickstart-library-mode reference it.
dense_dim: Canonical guidance (1024 for llama-3.2 embedder, 2048 for e5-v5) is in data-store.md; quickstart-guide and quickstart-library-mode link there instead of repeating it.
Known issues: troubleshoot.md has a short “Known issues and release-specific caveats” section that points to Release Notes and Support Matrix; Release Notes added to Related Topics.
Canonical tokenizer section
Added a single canonical subsection, “Token-based splitting and tokenizers”, in chunking.md. It covers default tokenizer behavior (pre-downloaded Llama vs. e5, runtime fallback), the optional Llama tokenizer and Hugging Face gating, and the env vars DOWNLOAD_LLAMA_TOKENIZER and HF_ACCESS_TOKEN (build vs. runtime).
Aligned wording with current code/container (e.g. split_text.py, post_build_triggers.py, docker-compose defaults).
All other tokenizer mentions (environment-config, nv-ingest-python-api, releasenotes, support-matrix) now give a short note and link to this subsection.
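To illustrate how the two env vars relate, here is a hedged pre-flight check. It is a sketch under stated assumptions, not the logic in split_text.py or post_build_triggers.py: the helper name `tokenizer_env_problems` is hypothetical, and the set of values treated as "true" is an assumption that may differ from how the container parses DOWNLOAD_LLAMA_TOKENIZER.

```python
import os


def tokenizer_env_problems(env=None):
    """Return human-readable problems with the tokenizer-related env vars."""
    env = os.environ if env is None else env
    # Assumption: these strings count as "enabled"; the real parsing may differ.
    wants_llama = env.get("DOWNLOAD_LLAMA_TOKENIZER", "").strip().lower() in {"1", "true", "yes"}
    has_token = bool(env.get("HF_ACCESS_TOKEN", "").strip())
    problems = []
    if wants_llama and not has_token:
        # The Llama tokenizer is gated on Hugging Face, so a token is needed to fetch it.
        problems.append(
            "DOWNLOAD_LLAMA_TOKENIZER is enabled but HF_ACCESS_TOKEN is empty; "
            "the gated download would likely fail."
        )
    return problems
```

Calling `tokenizer_env_problems()` with no argument inspects the current process environment; passing a dict makes the check easy to exercise in isolation.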
Documentation maintenance guidance
Canonical reference table: Maps each topic (product naming, env vars, scaling, tokenizer, NIMs, metadata, data-store, known issues) to its canonical page/section.
Pattern for any architecture/code change: (1) Identify the concept, (2) Choose canonical home, (3) Update canonical description first, (4) De-duplicate and re-point (search and link). Tokenizer is documented as an example.
Guardrails: Do not add a second long description of the same concept without moving it into the canonical section or updating that section and linking back; avoid two competing “main” descriptions.
Style and linking: Use consistent “For full details / most up-to-date … see [canonical section]” phrasing; use stable relative links; if a canonical section is renamed or moved, update all inbound links in the same commit.
When tokenizer behavior changes: Update the tokenizer canonical subsection first, then search for tokenizer-related terms and align/link all hits.
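The "search and link" step of the pattern above could be partly automated. The sketch below is illustrative only: the `CANONICAL` map is a hypothetical two-entry excerpt of the canonical reference table, `find_unlinked_mentions` is not an existing tool, and "links to the canonical page" is approximated by checking whether the canonical filename appears anywhere in the page text.

```python
from pathlib import Path

# Hypothetical excerpt of the canonical reference table:
# search term -> canonical page (path relative to the docs root).
CANONICAL = {
    "tokenizer": "chunking.md",
    "dense_dim": "data-store.md",
}


def find_unlinked_mentions(docs_root, canonical=CANONICAL):
    """List (page, term) pairs where a non-canonical page mentions a term
    but never references that term's canonical page."""
    root = Path(docs_root)
    hits = []
    for page in sorted(root.rglob("*.md")):
        rel = page.relative_to(root).as_posix()
        text = page.read_text(encoding="utf-8")
        lowered = text.lower()
        for term, home in canonical.items():
            if rel == home:
                continue  # the canonical page itself is allowed to describe the term
            # Crude link check: the canonical filename must appear somewhere on the page.
            if term.lower() in lowered and home not in text:
                hits.append((rel, term))
    return hits
```

Each hit is a candidate for de-duplication: either trim the local description and add a link to the canonical section, or confirm that the mention is incidental. A substring search will produce some false positives, so the output is a review list rather than a pass/fail gate.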
Result
One canonical place per concept; other pages summarize and link.
Future changes (tokenizer, env vars, models, schema, etc.) follow the same flow: update canonical section → search → de-duplicate and re-point.
Consistent cross-reference phrasing and stable links to reduce drift and duplicate “main” descriptions.