feat: add support for plain text files as prompt datasets#103

Open
ricyoung wants to merge 4 commits into p-e-w:master from ricyoung:feat/txt-dataset-support

Conversation

@ricyoung (Contributor) commented Jan 8, 2026

Summary

  • Adds support for loading prompts from plain text files (one prompt per line)
  • Provides a simpler alternative to creating HuggingFace datasets for custom prompts
  • Slice notation works (e.g., split = "[:100]" for first 100 lines)

Changes

  • Modified load_prompts() in utils.py to detect and parse plain text files
  • Added _get_split_slice() helper function for robust split parsing (reuses ReadInstruction)
  • Made split and column fields optional in DatasetSpecification (ignored for text files)
  • Added documentation in config.default.toml explaining the dataset options
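In rough terms, the text-file path can be sketched as follows (a hypothetical load_text_prompts helper for illustration only; the actual logic lives inside load_prompts() in utils.py and may differ):

```python
# Hypothetical sketch of the text-file branch described above; the real
# implementation is part of load_prompts() in utils.py.
def load_text_prompts(path: str) -> list[str]:
    # One prompt per line (UTF-8). Blank and whitespace-only lines are
    # skipped, and surrounding whitespace is stripped from each prompt.
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]
```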

Example usage

[bad_prompts]
dataset = "/path/to/my_prompts"
split = "[:100]"  # Optional: use first 100 lines

Test plan

  • Tested loading full text file
  • Tested slice notation ([:3], [1:3], [:2])
  • Verified prefix/suffix options work correctly
  • Verified existing HF dataset loading still works (fallback path)
  • Verified proper error messages when split/column missing for HF datasets

Closes #98

Adds the ability to load prompts from .txt files (one prompt per line)
as an alternative to Hugging Face datasets. This provides a simpler
way for users to create custom prompt datasets without needing to
format them as HF datasets.

Features:
- Detects .txt files by extension and reads them line by line
- Supports slice notation (e.g., split = "[:100]" for first 100 lines)
- Prefix and suffix options work as expected
- Falls back to existing HF dataset loading for non-.txt paths

Closes p-e-w#98
@gemini-code-assist (bot) commented:

Summary of Changes

Hello @ricyoung, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a significant enhancement by allowing prompt datasets to be loaded from simple plain text files. This provides users with a more straightforward and flexible method for defining and utilizing custom prompts, complementing the existing HuggingFace dataset integration. The change simplifies the process for quick experimentation and custom prompt management.

Highlights

  • Plain Text File Support: The system now supports loading prompt datasets directly from plain .txt files, where each line represents a single prompt.
  • Simplified Prompt Management: This feature offers a simpler and more direct alternative for users to manage custom prompts, bypassing the need to create HuggingFace datasets.
  • Slice Notation Compatibility: Existing slice notation (e.g., [:100]) can still be used with .txt files to specify a subset of prompts.




@gemini-code-assist (bot) left a comment


Code Review

This pull request adds a convenient feature to load prompt datasets from plain text files. The implementation is straightforward, but the manual parsing of the split string is fragile and can lead to crashes with invalid input. I've suggested a more robust approach using the datasets library's own parsing capabilities, which also adds support for more complex split specifications for free, making it consistent with how HuggingFace datasets are handled.

Comment on lines 157 to 167
# Apply split specification (e.g., "train[:100]" becomes "[:100]")
# Extract slice notation if present
if "[" in split_str and "]" in split_str:
    slice_part = split_str[split_str.index("[") : split_str.index("]") + 1]
    # Parse slice notation like "[:100]" or "[50:150]"
    slice_content = slice_part[1:-1]  # Remove brackets
    if ":" in slice_content:
        parts = slice_content.split(":")
        start = int(parts[0]) if parts[0] else None
        end = int(parts[1]) if parts[1] else None
        prompts = prompts[start:end]


critical

The current implementation for parsing the split string is brittle and can lead to unexpected behavior or crashes.

For example:

  • An invalid slice like [1a:10] will cause a ValueError and crash the program because of int('1a').
  • It only supports slice notation with a colon (:), but not single-index selection like [100] or percentage-based slicing like [:10%], which are supported by the datasets library.

It's better to reuse datasets.ReadInstruction for this, which is more robust, feature-complete, and handles various edge cases and invalid inputs gracefully. This also makes the behavior consistent with how HuggingFace datasets are handled elsewhere in the code.

Suggested change

-# Apply split specification (e.g., "train[:100]" becomes "[:100]")
-# Extract slice notation if present
-if "[" in split_str and "]" in split_str:
-    slice_part = split_str[split_str.index("[") : split_str.index("]") + 1]
-    # Parse slice notation like "[:100]" or "[50:150]"
-    slice_content = slice_part[1:-1]  # Remove brackets
-    if ":" in slice_content:
-        parts = slice_content.split(":")
-        start = int(parts[0]) if parts[0] else None
-        end = int(parts[1]) if parts[1] else None
-        prompts = prompts[start:end]
+# Apply split specification using the robust parsing from the `datasets` library.
+# We treat the list of prompts as a single split, so the name is arbitrary.
+try:
+    instruction = ReadInstruction.from_spec(split_str)
+    name2len = {"_": len(prompts)}
+    abs_instruction = instruction.to_absolute(name2len)[0]
+    prompts = prompts[abs_instruction.from_ : abs_instruction.to]
+except (ValueError, IndexError):
+    # Fall back to using all prompts if split_str is invalid.
+    # A warning could be logged here for better user feedback.
+    pass

ricyoung and others added 2 commits January 8, 2026 07:32
Replaces manual string parsing with datasets.ReadInstruction for:
- Better error handling (invalid slices don't crash)
- Support for percentage-based slicing (e.g., [:40%])
- Consistency with HuggingFace dataset handling elsewhere in code

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
@ricyoung (Contributor, Author) commented Jan 8, 2026

Applied @gemini-code-assist's suggestion to use ReadInstruction from the datasets library instead of manual string parsing. This provides better error handling and supports percentage-based slicing (e.g., [:40%]).
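For illustration, the slicing semantics gained here can be approximated in plain Python (a hypothetical split_to_range helper; the PR itself delegates to datasets.ReadInstruction via from_spec/to_absolute, which additionally handles named splits and configurable rounding):

```python
import re

# Hypothetical approximation of the slice semantics provided by
# datasets.ReadInstruction; the real library is more feature-complete.
def split_to_range(spec: str, total: int) -> tuple[int, int]:
    m = re.fullmatch(r"\[(\d+%|\d*):(\d+%|\d*)\]", spec)
    if m is None:
        raise ValueError(f"Invalid split specification: {spec!r}")

    def bound(token: str, default: int) -> int:
        if not token:
            return default
        if token.endswith("%"):
            # Percentage of the dataset (floor division here; the library
            # supports finer rounding control).
            return total * int(token[:-1]) // 100
        return int(token)

    return bound(m.group(1), 0), bound(m.group(2), total)

prompts = [f"prompt {i}" for i in range(250)]
start, end = split_to_range("[:40%]", len(prompts))
assert prompts[start:end] == prompts[:100]  # 40% of 250 lines
```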

@gemini-code-assist (bot) commented:

That's excellent to hear, @ricyoung! I'm glad the suggestion to use ReadInstruction was helpful. This indeed provides more robust error handling and expands the flexibility for slice notation, including percentage-based slicing. Thanks for incorporating it!

)
# Parse the split instructions.
# Support for plain text files (one prompt per line)
if path.endswith(".txt") and os.path.isfile(path):
p-e-w (Owner):

Suggested change

-if path.endswith(".txt") and os.path.isfile(path):
+if os.path.isfile(path):

I don't think we should require a specific extension, as plain text files are often stored without extensions on Unix systems.

# But also don't use cached data, as the dataset may have changed on disk.
download_mode=DownloadMode.FORCE_REDOWNLOAD,
)
prompts = prompts[abs_instruction.from_ : abs_instruction.to]
p-e-w (Owner):

This duplicates code below, and should be moved to a common helper function.

# - One prompt per line (UTF-8 encoding)
# - Empty lines and whitespace-only lines are ignored
# - Leading/trailing whitespace is stripped from each prompt
# - The "column" field is ignored for text files
p-e-w (Owner):

It shouldn't be required then, and neither should the split field.

- Remove .txt extension requirement for text files (any file is now supported)
- Extract ReadInstruction parsing into _get_split_slice() helper function
- Make 'split' and 'column' fields optional in DatasetSpecification
- Add validation errors for HuggingFace datasets missing required fields
- Update documentation to clarify text file requirements
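A minimal sketch of what the optional-field change might look like (hypothetical field and function names; the real definitions live in heretic/config.py and may use a different config framework):

```python
# Hypothetical sketch of optional split/column fields; real definitions
# are in heretic/config.py.
from dataclasses import dataclass
from typing import Optional

@dataclass
class DatasetSpecification:
    dataset: str                  # Path to a plain text file, or an HF dataset name.
    split: Optional[str] = None   # Required for HF datasets; optional slice for text files.
    column: Optional[str] = None  # Required for HF datasets; ignored for text files.

def validate_for_hf(spec: DatasetSpecification) -> None:
    # Plain text files need neither field; HuggingFace datasets need both.
    if spec.split is None:
        raise ValueError(f"Dataset {spec.dataset!r} requires a 'split' field")
    if spec.column is None:
        raise ValueError(f"Dataset {spec.dataset!r} requires a 'column' field")
```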
@p-e-w (Owner) commented Jan 13, 2026

/gemini review


@gemini-code-assist (bot) left a comment


Code Review

This pull request introduces a valuable feature for loading prompts from plain text files, offering a simpler alternative to HuggingFace datasets. The implementation is clean and well-executed, with logical changes in heretic/utils.py for file parsing and slicing, and necessary modifications to DatasetSpecification in heretic/config.py. The extraction of slicing logic into the _get_split_slice helper function is a nice improvement for code clarity and reuse. The documentation updates in config.default.toml are also clear and helpful. My review includes a couple of minor style suggestions to align with the repository's coding conventions.

dataset = load_from_disk(path)
assert not isinstance(dataset, DatasetDict), (
"Loading dataset dicts is not supported"
# Support for plain text files (one prompt per line)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

This comment is missing a period at the end. According to the repository's style guide, comments should end with a period.

Suggested change

-# Support for plain text files (one prompt per line)
+# Support for plain text files (one prompt per line).

Reference: Rule 4 of the repository style guide states that comments should start with a capital letter and end with a period.

    pass

else:
    # Load from HuggingFace datasets (local directory or Hub)


medium

This comment is missing a period at the end. The repository's style guide requires comments to end with a period.

Suggested change

-# Load from HuggingFace datasets (local directory or Hub)
+# Load from HuggingFace datasets (local directory or Hub).

Reference: Rule 4 of the repository style guide states that comments should start with a capital letter and end with a period.

    start, end = _get_split_slice(split_str, len(prompts), "_")
    prompts = prompts[start:end]
except (ValueError, IndexError):
    # If split_str doesn't contain valid slice notation, use all prompts.
p-e-w (Owner):

No, it should throw an error. If the user put something in there they expect it to be used. If it can't be, we need to tell them.

p-e-w (Owner):

Simply remove that try/except block. That's what we do for the directory case as well.
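The requested behavior can be illustrated with a minimal sketch (a hypothetical parse_slice standing in for _get_split_slice, which lives in heretic/utils.py): with no try/except around the call, an invalid spec surfaces as an error instead of being silently ignored.

```python
import re

# Hypothetical stand-in for _get_split_slice: parse "[start:end]" into
# absolute indices, raising ValueError on anything it cannot parse.
def parse_slice(split_str: str, total: int) -> tuple[int, int]:
    m = re.fullmatch(r"\[(\d*):(\d*)\]", split_str)
    if m is None:
        raise ValueError(f"Invalid split specification: {split_str!r}")
    start = int(m.group(1)) if m.group(1) else 0
    end = int(m.group(2)) if m.group(2) else total
    return start, end

prompts = ["a", "b", "c", "d"]
# Valid spec: applied as usual.
start, end = parse_slice("[1:3]", len(prompts))
assert prompts[start:end] == ["b", "c"]
# Invalid spec: with no surrounding try/except, the error reaches the
# user instead of being swallowed.
try:
    parse_slice("[oops]", len(prompts))
except ValueError as e:
    print(e)  # Invalid split specification: '[oops]'
```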

@p-e-w (Owner) commented Jan 13, 2026

This looks good apart from the error thing.

dissociativity added a commit to dissociativity/heretic that referenced this pull request Jan 19, 2026
Feat/txt dataset support
Read the comments on this, add suggestions to my fork:
p-e-w#103

Development

Successfully merging this pull request may close these issues.

add option to custom datasets via txt file
