
Kheiss/centralization #1416

Draft

kheiss-uwzoo wants to merge 4 commits into NVIDIA:main from kheiss-uwzoo:kheiss/centralization

Conversation

@kheiss-uwzoo
Collaborator

Summary

Refactors the NV-Ingest docs so each technical fact has one canonical place, removes duplication, and adds clear patterns and guardrails for future doc updates.

Changes

  1. Single source of truth
    - Product naming: The note “NeMo Retriever extraction is also known as NVIDIA Ingest and nv-ingest” lives only in overview.md. All other pages use a one-line pointer to the overview.
    - Environment variables: environment-config.md is the canonical env-var reference. nimclient.md and scaling-modes.md link to it; scaling-related vars are documented in scaling-modes and linked from environment-config.
    - Broken links: Replaced environment-variables.md with environment-config.md in chunking and content-metadata, and fixed the metadata schema link in user-defined-functions. (A link-check sketch appears after this list.)

  2. Centralization of technical facts
    - Support Matrix is the canonical reference for pipeline NIMs and hardware; other pages (e.g. Python API, chunking) link to it.
    - Chunking is the canonical place for tokenizer and split defaults; benchmarking and quickstart-library-mode reference it.
    - dense_dim: Canonical guidance (2048 for the llama-3.2 embedder, 1024 for e5-v5) is in data-store.md; quickstart-guide and quickstart-library-mode link there instead of repeating it. (A schema sketch appears after this list.)
    - Known issues: troubleshoot.md has a short “Known issues and release-specific caveats” section that points to the Release Notes and Support Matrix; Release Notes was added to Related Topics.

  3. Canonical tokenizer section
    - Added a single canonical subsection, “Token-based splitting and tokenizers”, in chunking.md covering: default tokenizer behavior (pre-downloaded Llama vs. e5, runtime fallback), the optional Llama tokenizer and Hugging Face gating, and the env vars DOWNLOAD_LLAMA_TOKENIZER and HF_ACCESS_TOKEN (build vs. runtime). (A tokenizer-fallback sketch appears after this list.)
    - Aligned wording with the current code/container (e.g. split_text.py, post_build_triggers.py, docker-compose defaults).
    - All other tokenizer mentions (environment-config, nv-ingest-python-api, releasenotes, support-matrix) now give a short note and link to this subsection.

  4. Documentation maintenance guidance
    - Canonical reference table: Maps each topic (product naming, env vars, scaling, tokenizer, NIMs, metadata, data-store, known issues) to its canonical page/section.
    - Pattern for any architecture/code change: (1) identify the concept, (2) choose its canonical home, (3) update the canonical description first, (4) de-duplicate and re-point (search and link; a search-helper sketch appears after this list). The tokenizer is documented as a worked example.
    - Guardrails: Do not add a second long description of a concept; either move the new material into the canonical section, or update that section and link back. Avoid two competing “main” descriptions.
    - Style and linking: Use consistent “For full details / the most up-to-date … see [canonical section]” phrasing; use stable relative links; if a canonical section is renamed or moved, update all inbound links in the same commit.
    - When tokenizer behavior changes: Update the canonical tokenizer subsection first, then search for tokenizer-related terms and align/link all hits.
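
Because several of the fixes above are stale relative links, a small checker of the following kind catches them before merge. This is an illustrative helper, not part of the PR; the `docs` root path and the markdown-link regex are assumptions.

```python
# Hypothetical helper (not shipped with this PR): flag relative links in the
# docs tree that do not resolve, so a rename like environment-variables.md ->
# environment-config.md surfaces every stale inbound link in one pass.
import pathlib
import re

# Captures the path part of [text](path) or [text](path#anchor).
LINK_RE = re.compile(r"\[[^\]]*\]\(([^)#\s]+)(?:#[^)]*)?\)")

def find_broken_relative_links(docs_root: str = "docs") -> list[tuple[str, str]]:
    broken = []
    for md in pathlib.Path(docs_root).rglob("*.md"):
        for match in LINK_RE.finditer(md.read_text(encoding="utf-8")):
            target = match.group(1)
            if "://" in target or target.startswith("mailto:"):
                continue  # only relative links are checked
            if not (md.parent / target).exists():
                broken.append((str(md), target))
    return broken

if __name__ == "__main__":
    for source, target in find_broken_relative_links():
        print(f"{source}: broken link -> {target}")
```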
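To make the dense_dim point in item 2 concrete, here is a minimal sketch of a Milvus collection schema that derives the vector dimension from the embedder in use. The model-to-dimension mapping mirrors the guidance now centralized in data-store.md, but the field names and the `EMBEDDER_DENSE_DIM` mapping are illustrative assumptions, not NV-Ingest code.

```python
# Sketch only: choose the Milvus vector-field dimension to match the embedder.
from pymilvus import CollectionSchema, DataType, FieldSchema

EMBEDDER_DENSE_DIM = {
    "nvidia/llama-3.2-nv-embedqa-1b-v2": 2048,  # llama-3.2 embedder
    "nvidia/nv-embedqa-e5-v5": 1024,            # e5-v5 embedder
}

def build_schema(model_name: str) -> CollectionSchema:
    """Build a collection schema whose dense_dim matches the embedding model."""
    dense_dim = EMBEDDER_DENSE_DIM[model_name]
    fields = [
        FieldSchema(name="pk", dtype=DataType.INT64, is_primary=True, auto_id=True),
        FieldSchema(name="text", dtype=DataType.VARCHAR, max_length=65535),
        FieldSchema(name="vector", dtype=DataType.FLOAT_VECTOR, dim=dense_dim),
    ]
    return CollectionSchema(fields, description=f"dense_dim={dense_dim} for {model_name}")
```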
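The tokenizer-fallback behavior that the new canonical subsection describes can be sketched roughly as follows. This is a hedged illustration of the documented behavior, not the actual split_text.py implementation; the Hugging Face model IDs are assumptions, while DOWNLOAD_LLAMA_TOKENIZER and HF_ACCESS_TOKEN are the documented env vars (set at build vs. runtime, respectively).

```python
# Hedged sketch of the default-tokenizer selection, NOT the real split_text.py.
import os
from transformers import AutoTokenizer

def load_default_tokenizer():
    """Prefer the pre-downloaded Llama tokenizer when it was baked in at build
    time; otherwise fall back to the ungated e5 tokenizer at runtime."""
    if os.environ.get("DOWNLOAD_LLAMA_TOKENIZER", "").lower() == "true":
        # Llama tokenizers are gated on Hugging Face, so a token is required.
        return AutoTokenizer.from_pretrained(
            "meta-llama/Llama-3.2-1B",  # assumed model ID
            token=os.environ.get("HF_ACCESS_TOKEN"),
        )
    # Runtime fallback: an ungated tokenizer that requires no HF token.
    return AutoTokenizer.from_pretrained("intfloat/e5-large-v2")  # assumed model ID
```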
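Finally, the “search” step of the pattern in item 4 can be as simple as the sketch below: list every doc that mentions a concept so each hit can be shortened and re-pointed at the canonical section. The script name and docs root are hypothetical.

```python
# find_mentions.py (hypothetical): list every doc line mentioning a term.
import pathlib
import sys

def mentions(term: str, docs_root: str = "docs"):
    """Yield (file, line number, line) for each doc line containing the term."""
    for md in pathlib.Path(docs_root).rglob("*.md"):
        for lineno, line in enumerate(md.read_text(encoding="utf-8").splitlines(), 1):
            if term.lower() in line.lower():
                yield md, lineno, line.strip()

if __name__ == "__main__":
    # e.g. python find_mentions.py tokenizer
    for path, lineno, line in mentions(sys.argv[1]):
        print(f"{path}:{lineno}: {line}")
```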

Result

  • One canonical place per concept; other pages summarize and link.

  • Future changes (tokenizer, env vars, models, schema, etc.) follow the same flow: update canonical section → search → de-duplicate and re-point.

  • Consistent cross-reference phrasing and stable links to reduce drift and duplicate “main” descriptions.

@kheiss-uwzoo added the doc label (Improvements or additions to documentation) on Feb 20, 2026