
fix: make bitsandbytes optional for CPU-only systems #101

Open
ricyoung wants to merge 1 commit into p-e-w:master from ricyoung:feat/cpu-only-mode

Conversation


ricyoung (Contributor) commented on Jan 8, 2026

Summary

Makes bitsandbytes an optional dependency so Heretic can run on CPU-only systems where bitsandbytes (which requires CUDA) is not available.

Changes

  • Made bitsandbytes import conditional with graceful fallback (HAS_BITSANDBYTES flag); see the sketch after this list
  • Added platform condition in pyproject.toml to exclude bitsandbytes on Intel Macs (where it won't work)
  • Added helpful error message when 4-bit quantization is requested but bitsandbytes is unavailable
  • Added documentation comments in config.default.toml about CPU mode and quantization requirements
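
As a hedged sketch of the first and third changes (the HAS_BITSANDBYTES flag matches the PR; the helper name and message text are illustrative, not the PR's actual code):

    try:
        import bitsandbytes  # noqa: F401  (needs a supported backend)
        HAS_BITSANDBYTES = True
    except ImportError:
        HAS_BITSANDBYTES = False

    def check_quantization(quantization: str) -> None:
        # Hypothetical helper: fail with a helpful message instead of a
        # bare ImportError deep inside model loading.
        if quantization == "bnb_4bit" and not HAS_BITSANDBYTES:
            raise RuntimeError(
                "4-bit quantization requires bitsandbytes, which is not "
                'available on this system. Set quantization = "none" instead.'
            )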

Usage

For CPU-only systems, users can set device_map = "cpu" in their config (this was already supported by Accelerate).
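
For example (setting names as used elsewhere in this PR; values illustrative):

    # CPU-only mode: offload the whole model to CPU RAM via Accelerate.
    device_map = "cpu"
    # 4-bit quantization needs bitsandbytes; disable it when unavailable.
    quantization = "none"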

Test plan

  • Verified bitsandbytes import is skipped gracefully on systems without CUDA
  • Verified clear error message when quantization requested without bitsandbytes
  • Verified existing GPU workflow still works unchanged

Closes #12

gemini-code-assist (bot) commented

Summary of Changes

Hello @ricyoung, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed.

This pull request significantly enhances the application's flexibility and compatibility by introducing explicit control over the compute device. Users can now specify whether to run on CPU, CUDA, or MPS, making the application more adaptable to various hardware configurations, including CPU-only systems. The changes also include robust handling of dependencies like bitsandbytes and optimized dtype selection for CPU, ensuring a smoother experience across different environments.

Highlights

  • Explicit Device Selection: Introduced a new --device CLI option allowing users to explicitly choose the compute device for inference, with options for auto (default), cpu, cuda, and mps.
  • Conditional bitsandbytes Import: The bitsandbytes library is now imported conditionally, ensuring graceful fallback and preventing errors on systems where it's not available or compatible (e.g., Intel Macs).
  • CPU-Optimized Behavior: When running in CPU-only mode, 4-bit quantization is automatically disabled (as bitsandbytes requires CUDA), and float32 dtype is prioritized for better compatibility.
  • Device Availability Validation: The system now validates the availability of explicitly requested devices (CUDA, MPS) and provides informative error messages if the chosen device is not found (see the sketch below).
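
A hedged sketch of the device resolution and validation described in the last two highlights (resolve_device is a hypothetical name; the torch availability checks are the standard ones):

    import torch

    def resolve_device(device: str) -> str:
        # Hypothetical helper illustrating the checks described above.
        if device == "auto":
            if torch.cuda.is_available():
                return "cuda"
            if torch.backends.mps.is_available():
                return "mps"
            return "cpu"
        if device == "cuda" and not torch.cuda.is_available():
            raise RuntimeError("CUDA requested, but no CUDA device was found")
        if device == "mps" and not torch.backends.mps.is_available():
            raise RuntimeError("MPS requested, but MPS is not available")
        return device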



gemini-code-assist (bot) left a comment

Code Review

This pull request introduces a valuable --device flag, enhancing flexibility by allowing explicit device selection and enabling CPU-only execution. The changes are well-structured, with conditional dependency management for bitsandbytes and clear configuration options. The logic for handling different devices, including graceful fallbacks and user warnings, is robust. I have one minor suggestion to simplify a piece of logic for improved readability, but overall, this is an excellent contribution.

Comment on lines 92 to 95:

    if "float32" in dtypes_to_try:
        dtypes_to_try = ["float32"] + [d for d in dtypes_to_try if d != "float32"]
    else:
        dtypes_to_try = ["float32"] + list(dtypes_to_try)


Severity: medium

This logic for prioritizing float32 on CPU can be simplified. The if/else block is functionally correct, but both branches reduce to a single list comprehension: filtering out "float32" and then prepending it yields the same result whether or not it was already in the list.

Suggested change:

    dtypes_to_try = ["float32"] + [d for d in dtypes_to_try if d != "float32"]

p-e-w (Owner) commented on Jan 8, 2026

> This PR adds a --device CLI option that allows users to explicitly select the compute device, enabling Heretic to run on systems without a GPU.

Isn't this what already happens if you install the CPU version of PyTorch?

ricyoung (Contributor, Author) commented on Jan 8, 2026

Good question! You're right that CPU-only PyTorch works, but this PR addresses a few additional issues:

  1. bitsandbytes dependency - Even with CPU PyTorch installed, bitsandbytes will fail to import on systems without CUDA (it's a hard dependency in pyproject.toml). This PR makes the import conditional and adds a platform marker to skip it entirely on Intel Macs, where it won't even install (marker sketched below).

  2. Explicit device control - Sometimes users want to force CPU even when a GPU is available (e.g., for testing, debugging, or when GPU memory is limited). --device cpu provides that option.

  3. Better UX for edge cases - When running on a system without GPU support:

    • Shows a clear warning: "Running in CPU-only mode. This will be slow."
    • Automatically disables quantization with an explanation
    • Prefers float32 dtype for maximum compatibility

The main motivation was the original issue #12 where bitsandbytes caused installation/import failures on Intel Macs. Rather than just patching that specific case, this provides a more general solution.
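
For reference, the platform marker amounts to a PEP 508 environment marker in pyproject.toml, roughly like this (the exact marker text in the PR may differ):

    [project]
    dependencies = [
        # Skip bitsandbytes on Intel Macs, where it cannot be installed;
        # keep it everywhere else (PEP 508 environment marker).
        'bitsandbytes; sys_platform != "darwin" or platform_machine == "arm64"',
    ]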

That said, if you think the simpler approach (just making bitsandbytes optional) is preferable, I'm happy to slim down the PR to just that change.

    dtypes_to_try = settings.dtypes
    if settings.device == DeviceType.CPU:
        # Put float32 first for CPU, as it's most compatible
        dtypes_to_try = ["float32"] + [d for d in dtypes_to_try if d != "float32"]
p-e-w (Owner) commented:

I don't like this. There is a setting for dtypes, and the program should use it as specified. Not magically do something else in a certain configuration. If the user wants something else, they can configure that.

Note that many newer CPUs have native support for BF16, so this often does the wrong thing anyway.

ricyoung (Contributor, Author) replied:

You're right - some CPUs do have support for BF16:

Intel Xeon: support started with 3rd Generation Intel Xeon Scalable processors (Cooper Lake) and continues in later generations, using the AVX-512 BF16 extensions.

AMD Zen: modern AMD processors based on the Zen architecture, including certain AMD EPYC (Zen 4) and Ryzen AI series parts, incorporate BF16 support.

Good point.
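
A quick, hedged sketch of how one might detect native BF16 support on Linux x86 (the avx512_bf16/amx_bf16 flag names are as exposed in /proc/cpuinfo; the helper itself is hypothetical and Linux-only):

    def cpu_supports_bf16() -> bool:
        # Linux-only sketch: scan /proc/cpuinfo for BF16-capable ISA flags.
        # avx512_bf16 covers Cooper Lake+ Xeons and Zen 4; amx_bf16 covers
        # newer Xeons with AMX.
        try:
            with open("/proc/cpuinfo") as f:
                flags = f.read()
        except OSError:
            return False
        return "avx512_bf16" in flags or "amx_bf16" in flags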

# "cpu" - Force CPU-only mode (slow but works anywhere)
# "cuda" - Force NVIDIA CUDA GPU
# "mps" - Force Apple Metal Performance Shaders (Apple Silicon)
device = "auto"
p-e-w (Owner) commented on Jan 9, 2026:

What exactly does this setting offer that the device_map setting doesn't?

You can set device_map to cpu already, and it will offload the entire model to CPU RAM. This is explicitly documented in Accelerate.

- Make bitsandbytes import conditional with graceful fallback
- Exclude bitsandbytes dependency on Intel Macs where it can't work
- Provide clear error message when quantization requested without bitsandbytes
- Add documentation for CPU-only usage (device_map = "cpu")

This allows Heretic to run on systems without CUDA support by using
device_map = "cpu" and quantization = "none".

Closes p-e-w#12
@ricyoung changed the title from "feat: add --device flag for CPU-only and explicit device selection" to "fix: make bitsandbytes optional for CPU-only systems" on Jan 12, 2026

# Quantization method to use when loading the model.
# Options: "none" (no quantization), "bnb_4bit" (4-bit quantization using bitsandbytes).
# Note: 4-bit quantization requires bitsandbytes, which needs CUDA support.
p-e-w (Owner) commented:

That contradicts the bnb docs: https://huggingface.co/docs/transformers/en/quantization/bitsandbytes

> bitsandbytes is supported on NVIDIA GPUs for CUDA versions 11.8 - 13.0, Intel XPU, Intel Gaudi (HPU), and CPU.

There are several mentions on that page of CPU-only usage.

A contributor replied:

Note that ROCm-via-HIP is also (partially) supported: https://github.com/bitsandbytes-foundation/bitsandbytes#legend

But yeah, it seems CPUs should be supported as well, except for "8-bit Optimizers" (whatever that means).
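
For reference, per the docs quoted above, pinning a bnb-quantized load to CPU would look roughly like this (a sketch; it assumes a bitsandbytes build with CPU backend support, and the model id is a placeholder):

    from transformers import AutoModelForCausalLM, BitsAndBytesConfig

    # Sketch: 4-bit bnb loading pinned to CPU, per the docs quoted above.
    model = AutoModelForCausalLM.from_pretrained(
        "org/model",  # placeholder
        quantization_config=BitsAndBytesConfig(load_in_4bit=True),
        device_map="cpu",
    )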
