feat: add support for plain text files as prompt datasets #103
ricyoung wants to merge 4 commits into p-e-w:master
Conversation
Adds the ability to load prompts from .txt files (one prompt per line) as an alternative to Hugging Face datasets. This provides a simpler way for users to create custom prompt datasets without needing to format them as HF datasets.

Features:
- Detects .txt files by extension and reads them line by line
- Supports slice notation (e.g., `split = "[:100]"` for the first 100 lines)
- Prefix and suffix options work as expected
- Falls back to existing HF dataset loading for non-.txt paths

Closes p-e-w#98
Summary of Changes

This pull request introduces a significant enhancement by allowing prompt datasets to be loaded from simple plain text files. This provides users with a more straightforward and flexible method for defining and utilizing custom prompts, complementing the existing HuggingFace dataset integration. The change simplifies the process for quick experimentation and custom prompt management.
Code Review
This pull request adds a convenient feature to load prompt datasets from plain text files. The implementation is straightforward, but the manual parsing of the split string is fragile and can lead to crashes with invalid input. I've suggested a more robust approach using the datasets library's own parsing capabilities, which also adds support for more complex split specifications for free, making it consistent with how HuggingFace datasets are handled.
src/heretic/utils.py
Outdated
```python
# Apply split specification (e.g., "train[:100]" becomes "[:100]")
# Extract slice notation if present
if "[" in split_str and "]" in split_str:
    slice_part = split_str[split_str.index("[") : split_str.index("]") + 1]
    # Parse slice notation like "[:100]" or "[50:150]"
    slice_content = slice_part[1:-1]  # Remove brackets
    if ":" in slice_content:
        parts = slice_content.split(":")
        start = int(parts[0]) if parts[0] else None
        end = int(parts[1]) if parts[1] else None
        prompts = prompts[start:end]
```
The current implementation for parsing the split string is brittle and can lead to unexpected behavior or crashes.

For example:
- An invalid slice like `[1a:10]` will cause a `ValueError` and crash the program because of `int('1a')`.
- It only supports slice notation with a colon (`:`), but not single-index selection like `[100]` or percentage-based slicing like `[:10%]`, which are supported by the `datasets` library.

It's better to reuse `datasets.ReadInstruction` for this, which is more robust, feature-complete, and handles various edge cases and invalid inputs gracefully. This also makes the behavior consistent with how HuggingFace datasets are handled elsewhere in the code.
Suggested change:

```diff
-# Apply split specification (e.g., "train[:100]" becomes "[:100]")
-# Extract slice notation if present
-if "[" in split_str and "]" in split_str:
-    slice_part = split_str[split_str.index("[") : split_str.index("]") + 1]
-    # Parse slice notation like "[:100]" or "[50:150]"
-    slice_content = slice_part[1:-1]  # Remove brackets
-    if ":" in slice_content:
-        parts = slice_content.split(":")
-        start = int(parts[0]) if parts[0] else None
-        end = int(parts[1]) if parts[1] else None
-        prompts = prompts[start:end]
+# Apply split specification using the robust parsing from the `datasets` library.
+# We treat the list of prompts as a single split, so the name is arbitrary.
+try:
+    instruction = ReadInstruction.from_spec(split_str)
+    name2len = {"_": len(prompts)}
+    abs_instruction = instruction.to_absolute(name2len)[0]
+    prompts = prompts[abs_instruction.from_ : abs_instruction.to]
+except (ValueError, IndexError):
+    # Fall back to using all prompts if split_str is invalid.
+    # A warning could be logged here for better user feedback.
+    pass
```
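For context, `ReadInstruction.from_spec` and `to_absolute` are existing `datasets` APIs; here is a minimal standalone sketch of how a spec resolves to absolute bounds (the split name and length are illustrative):

```python
from datasets import ReadInstruction

# Parse an HF-style split spec, including percentage-based slicing.
instruction = ReadInstruction.from_spec("train[:10%]")

# Resolve it against a split of known length; to_absolute() returns a list
# of absolute instructions carrying integer from_/to bounds.
abs_instruction = instruction.to_absolute({"train": 200})[0]
print(abs_instruction.from_, abs_instruction.to)  # Expected: 0 20
```

An unknown split name or malformed spec raises a `ValueError` rather than silently misbehaving, which is what the `except` clause in the suggestion catches.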
Replaces manual string parsing with datasets.ReadInstruction for:
- Better error handling (invalid slices don't crash)
- Support for percentage-based slicing (e.g., [:40%])
- Consistency with HuggingFace dataset handling elsewhere in code

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Applied @gemini-code-assist's suggestion to use `ReadInstruction`.
That's excellent to hear, @ricyoung! I'm glad the suggestion to use `ReadInstruction` was helpful.
src/heretic/utils.py
Outdated
```python
)
# Parse the split instructions.
# Support for plain text files (one prompt per line)
if path.endswith(".txt") and os.path.isfile(path):
```
Suggested change:

```diff
-if path.endswith(".txt") and os.path.isfile(path):
+if os.path.isfile(path):
```
I don't think we should require a specific extension, as plain text files are often stored without extensions on Unix systems.
src/heretic/utils.py
Outdated
```python
    # But also don't use cached data, as the dataset may have changed on disk.
    download_mode=DownloadMode.FORCE_REDOWNLOAD,
)
prompts = prompts[abs_instruction.from_ : abs_instruction.to]
```
This duplicates code below, and should be moved to a common helper function.
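A later commit does extract this into a `_get_split_slice()` helper; a minimal sketch of what it might look like, built from the `ReadInstruction` snippet above (the exact body is an assumption, not shown in the PR):

```python
from datasets import ReadInstruction

def _get_split_slice(split_str: str, length: int, name: str) -> tuple[int, int]:
    # Hypothetical body: resolve split_str against a single pseudo-split of
    # the given length. How the split name inside split_str is normalized
    # against `name` is an assumption.
    instruction = ReadInstruction.from_spec(split_str)
    abs_instruction = instruction.to_absolute({name: length})[0]
    return abs_instruction.from_, abs_instruction.to
```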
config.default.toml
Outdated
```toml
# - One prompt per line (UTF-8 encoding)
# - Empty lines and whitespace-only lines are ignored
# - Leading/trailing whitespace is stripped from each prompt
# - The "column" field is ignored for text files
```
It shouldn't be required then, and neither should the split field.
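For reference, the loading behavior those config comments describe amounts to roughly this sketch (the function name `load_text_prompts` is hypothetical, not taken from the PR):

```python
def load_text_prompts(path: str) -> list[str]:
    # One prompt per line, read as UTF-8; blank and whitespace-only lines
    # are dropped, and surrounding whitespace is stripped from each prompt.
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]
```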
- Remove .txt extension requirement for text files (any file is now supported)
- Extract ReadInstruction parsing into _get_split_slice() helper function
- Make 'split' and 'column' fields optional in DatasetSpecification
- Add validation errors for HuggingFace datasets missing required fields
- Update documentation to clarify text file requirements
/gemini review
Code Review
This pull request introduces a valuable feature for loading prompts from plain text files, offering a simpler alternative to HuggingFace datasets. The implementation is clean and well-executed, with logical changes in heretic/utils.py for file parsing and slicing, and necessary modifications to DatasetSpecification in heretic/config.py. The extraction of slicing logic into the _get_split_slice helper function is a nice improvement for code clarity and reuse. The documentation updates in config.default.toml are also clear and helpful. My review includes a couple of minor style suggestions to align with the repository's coding conventions.
```python
dataset = load_from_disk(path)
assert not isinstance(dataset, DatasetDict), (
    "Loading dataset dicts is not supported"
# Support for plain text files (one prompt per line)
```
This comment is missing a period at the end. According to the repository's style guide, comments should end with a period.
Suggested change:

```diff
-# Support for plain text files (one prompt per line)
+# Support for plain text files (one prompt per line).
```
References
- Rule 4 of the repository style guide states that comments should start with a capital letter and end with a period. (link)
```python
    pass

else:
    # Load from HuggingFace datasets (local directory or Hub)
```
This comment is missing a period at the end. The repository's style guide requires comments to end with a period.
Suggested change:

```diff
-# Load from HuggingFace datasets (local directory or Hub)
+# Load from HuggingFace datasets (local directory or Hub).
```
References
- Rule 4 of the repository style guide states that comments should start with a capital letter and end with a period. (link)
```python
    start, end = _get_split_slice(split_str, len(prompts), "_")
    prompts = prompts[start:end]
except (ValueError, IndexError):
    # If split_str doesn't contain valid slice notation, use all prompts.
```
No, it should throw an error. If the user put something in there they expect it to be used. If it can't be, we need to tell them.
Simply remove that try/except block. That's what we do for the directory case as well.
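In other words, the call site would presumably shrink to something like this sketch (assuming errors raised by the helper propagate to the user):

```python
# No try/except: an invalid split spec now raises immediately, so the
# user is told their spec could not be applied instead of it being ignored.
start, end = _get_split_slice(split_str, len(prompts), "_")
prompts = prompts[start:end]
```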
This looks good apart from the error thing.
Feat/txt dataset support
Read the comments on this, add suggestions to my fork: p-e-w#103
Summary
- Load prompts from plain text files (one prompt per line) as an alternative to Hugging Face datasets
- Supports slice notation (e.g., `split = "[:100]"` for the first 100 lines)

Changes
- Updated `load_prompts()` in `utils.py` to detect and parse plain text files
- Added `_get_split_slice()` helper function for robust split parsing (reuses `ReadInstruction`)
- Made `split` and `column` fields optional in `DatasetSpecification` (ignored for text files)
- Added comments to `config.default.toml` explaining the dataset options

Example usage
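A hypothetical configuration sketch (the section and key names are illustrative and not confirmed against heretic's actual `config.default.toml` schema):

```toml
# Hypothetical: point a prompt dataset at a plain text file instead of an
# HF dataset. "split" and "column" may be omitted for text files.
[harmful_prompts]
dataset = "my_prompts.txt"  # One prompt per line, UTF-8.
split = "[:100]"            # Optional: use only the first 100 lines.
```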
Test plan
- Tested slice notation (`[:3]`, `[1:3]`, `[:2]`)

Closes #98