[BUG]: Clean up dependencies in api/pyproject.toml #1392

@charlesbluca

Version

main

Which installation method(s) does this occur on?

Source

Describe the bug.

Summary

An audit of api/ found two problems in api/pyproject.toml: core dependencies that are imported in api/src but not declared, and declared dependencies that production code does not use (one is test-only and should move to an optional test extra).


1. Add missing core dependencies

These packages are imported in api/src but not listed in api/pyproject.toml. They should be added under [project] dependencies.

| Package | PyPI name | Notes |
| --- | --- | --- |
| numpy | numpy | Used in yolox, ocr, pdfium, transforms, table_and_chart, etc. |
| pypdfium2 | pypdfium2 | Used in pdf util, PDF engines, metadata aggregators, pptx_helper. |
| requests | requests | Used in rest client, nim client, helpers, tika engine. |
| OpenCV | opencv-python | Imported as cv2 in transforms and model_interface/helpers. |
| Pillow | Pillow | Imported as PIL in transforms, aggregators, image_helpers, cached. |
| gRPC | grpcio | Imported as grpc in parakeet model interface. |
| scikit-learn | scikit-learn | Imported as sklearn; used in table_and_chart.py for sklearn.cluster.DBSCAN. |
| redis | redis | Used in util/service_clients/redis/redis_client.py. |
| python-docx | python-docx | Imported as docx in docx extractor (internal/extract/docx/.../docxreader.py). |
| python-pptx | python-pptx | Imported as pptx in pptx helper (internal/extract/pptx/engines/pptx_helper.py). |
| minio | minio | Used in internal/store/embed_text_upload.py for Minio client. |
| pymilvus | pymilvus | Used in internal/store/embed_text_upload.py for Collection, connections, bulk writer. |
| aiohttp | aiohttp | Used in internal/extract/pdf/engines/llama.py for async HTTP. |
| scipy | scipy | Used in internal/primitives/nim/model_interface/parakeet.py (scipy.io.wavfile). |
| nvidia-riva-client | nvidia-riva-client | Imported as riva.client in parakeet model interface. |
| unstructured-client | unstructured-client | Used in internal/extract/pdf/engines/unstructured_io.py. |
| tqdm | tqdm | Used in util/dataloader/dataloader.py. |
| python-dateutil | python-dateutil | Imported as dateutil in util/converters/datetools.py. |
| fastparquet | fastparquet | Used in util/converters/dftools.py. |
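
For anyone who wants to re-run the audit, here is a minimal sketch of how the list above could be reproduced. It is not the script used for this issue. It walks a source tree with the standard-library ast module, collects top-level import names, and maps them to PyPI names using the import-name/PyPI-name pairs from the notes above. The function names, the first_party parameter, and the unstructured_client import name are assumptions for illustration.

```python
import ast
import sys
from pathlib import Path

# Import name -> PyPI name, only for packages above where the two differ.
# "unstructured_client" is an assumption; the list above does not state it.
IMPORT_TO_DIST = {
    "cv2": "opencv-python",
    "PIL": "Pillow",
    "grpc": "grpcio",
    "sklearn": "scikit-learn",
    "docx": "python-docx",
    "pptx": "python-pptx",
    "riva": "nvidia-riva-client",
    "dateutil": "python-dateutil",
    "unstructured_client": "unstructured-client",
}


def top_level_imports(source):
    """Return the top-level module names imported by Python source text."""
    names = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.level == 0 and node.module:
            # level == 0 skips relative imports such as "from . import x".
            names.add(node.module.split(".")[0])
    return names


def to_distributions(import_names, first_party=frozenset()):
    """Map import names to PyPI names, dropping stdlib and first-party modules."""
    # sys.stdlib_module_names exists on Python 3.10+; older versions skip the filter.
    stdlib = getattr(sys, "stdlib_module_names", frozenset())
    return {
        IMPORT_TO_DIST.get(name, name)
        for name in import_names
        if name not in stdlib and name not in first_party
    }


def audit(src_root, declared, first_party=frozenset()):
    """Return PyPI names imported under src_root but missing from declared.

    Simplification: names are compared exactly, without PEP 503 normalization.
    """
    imported = set()
    for path in Path(src_root).rglob("*.py"):
        imported |= top_level_imports(path.read_text(encoding="utf-8"))
    return to_distributions(imported, first_party) - set(declared)
```

A call such as audit("api/src", declared_names, first_party={...}) would then list candidates for the table. The result still needs manual review, since imports inside optional code paths (for example cudf) show up as well.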

Optional: Add openai if the LLM summarizer UDF (api/src/udfs/llm_summarizer_udf.py) is part of the shipped package.

GPU / optional: cudf is used in util/converters/dftools.py; consider adding as an optional extra (e.g. gpu or cudf) rather than a core dependency.
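
If cudf does become an optional extra, the import in dftools.py would need a guard so that an install without the extra still imports cleanly. A minimal sketch of one common pattern follows. The helper names optional_import and require_cudf are hypothetical and do not exist in api/src, and the extra name in the error message is only the example name suggested above.

```python
import importlib


def optional_import(name):
    """Return the named module if it is installed, otherwise None."""
    try:
        return importlib.import_module(name)
    except ImportError:
        return None


# Resolved once at import time; None when the optional extra is not installed.
cudf = optional_import("cudf")


def require_cudf():
    """Return the cudf module, or raise a clear error naming the missing extra."""
    if cudf is None:
        raise ImportError(
            "cudf is not installed; install the optional GPU extra to use this feature"
        )
    return cudf
```

Code paths that need cudf would call require_cudf() at the point of use, so the failure appears only when the GPU feature is actually requested.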


2. Move test-only dependencies out of core

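These are currently in dependencies but are not used by the code in api/src: moviepy is only used by tests, and pydantic-settings is not used at all. Remove both from core; moviepy moves into [project.optional-dependencies] (e.g. a test extra).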

| Package | Action |
| --- | --- |
| moviepy | Remove from core dependencies; add to optional-dependencies (e.g. test). Only used in api_tests/util/dataloader/ (dataloader_test_tools, test_dataloader_video). |
| pydantic-settings | Remove from core dependencies (not used in api src or api_tests). Add to an optional extra later if needed. |

Acceptance criteria

  • All 19 core packages above are listed in api/pyproject.toml under dependencies (and optionally openai if applicable; consider cudf as an optional extra).
  • moviepy and pydantic-settings are removed from core dependencies.
  • An optional-dependencies group (e.g. test) exists and includes moviepy (and optionally pytest/ray if desired for test runs).
  • Install with no extras works for production code; install with the test extra works for running the full test suite.
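
The first three criteria can be checked mechanically. The sketch below is one possible check and is not part of the repository. It takes the parsed [project] table as a plain dict and returns a list of problems. The names CORE, MUST_NOT_BE_CORE, and check_project are hypothetical, and the default extra name "test" is the example name used above. Requirement names are compared after PEP 503 normalization, so Pillow and pillow match.

```python
import re

# The 19 core packages listed in section 1, by PyPI name.
CORE = [
    "numpy", "pypdfium2", "requests", "opencv-python", "Pillow", "grpcio",
    "scikit-learn", "redis", "python-docx", "python-pptx", "minio", "pymilvus",
    "aiohttp", "scipy", "nvidia-riva-client", "unstructured-client", "tqdm",
    "python-dateutil", "fastparquet",
]
MUST_NOT_BE_CORE = ["moviepy", "pydantic-settings"]


def normalize(name):
    """Normalize a distribution name as described in PEP 503."""
    return re.sub(r"[-_.]+", "-", name).lower()


def requirement_name(requirement):
    """Extract the normalized name from a requirement string like 'numpy>=1.26'."""
    match = re.match(r"\s*([A-Za-z0-9][A-Za-z0-9._-]*)", requirement)
    if match is None:
        raise ValueError(f"cannot parse requirement: {requirement!r}")
    return normalize(match.group(1))


def check_project(project, test_extra="test"):
    """Return a list of acceptance-criteria violations for a [project] table."""
    core = {requirement_name(r) for r in project.get("dependencies", [])}
    extras = project.get("optional-dependencies", {})
    test = {requirement_name(r) for r in extras.get(test_extra, [])}

    problems = []
    for name in CORE:
        if normalize(name) not in core:
            problems.append(f"missing core dependency: {name}")
    for name in MUST_NOT_BE_CORE:
        if normalize(name) in core:
            problems.append(f"still a core dependency: {name}")
    if "moviepy" not in test:
        problems.append(f"moviepy is not in the {test_extra!r} extra")
    return problems
```

On Python 3.11+, the input dict can be obtained by opening api/pyproject.toml in binary mode and passing the file to tomllib.load, then taking the "project" key. The last criterion (installs with and without the extra both work) still needs a real install and test run.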
