feat: add support for plain text files as prompt datasets #103
ricyoung wants to merge 4 commits into p-e-w:master
Conversation
Adds the ability to load prompts from .txt files (one prompt per line) as an alternative to Hugging Face datasets. This provides a simpler way for users to create custom prompt datasets without needing to format them as HF datasets.

Features:
- Detects .txt files by extension and reads them line by line
- Supports slice notation (e.g., `split = "[:100]"` for the first 100 lines)
- Prefix and suffix options work as expected
- Falls back to existing HF dataset loading for non-.txt paths

Closes p-e-w#98
Summary of Changes

This pull request introduces a significant enhancement by allowing prompt datasets to be loaded from simple plain text files. This provides users with a more straightforward and flexible method for defining and utilizing custom prompts, complementing the existing HuggingFace dataset integration. The change simplifies the process for quick experimentation and custom prompt management.
Code Review
This pull request adds a convenient feature to load prompt datasets from plain text files. The implementation is straightforward, but the manual parsing of the split string is fragile and can lead to crashes with invalid input. I've suggested a more robust approach using the datasets library's own parsing capabilities, which also adds support for more complex split specifications for free, making it consistent with how HuggingFace datasets are handled.
src/heretic/utils.py
Outdated
```python
# Apply split specification (e.g., "train[:100]" becomes "[:100]")
# Extract slice notation if present
if "[" in split_str and "]" in split_str:
    slice_part = split_str[split_str.index("[") : split_str.index("]") + 1]
    # Parse slice notation like "[:100]" or "[50:150]"
    slice_content = slice_part[1:-1]  # Remove brackets
    if ":" in slice_content:
        parts = slice_content.split(":")
        start = int(parts[0]) if parts[0] else None
        end = int(parts[1]) if parts[1] else None
        prompts = prompts[start:end]
```
The current implementation for parsing the split string is brittle and can lead to unexpected behavior or crashes.

For example:
- An invalid slice like `[1a:10]` will cause a `ValueError` and crash the program because of `int('1a')`.
- It only supports slice notation with a colon (`:`), but not single-index selection like `[100]` or percentage-based slicing like `[:10%]`, which are supported by the `datasets` library.

It's better to reuse `datasets.ReadInstruction` for this, which is more robust, feature-complete, and handles various edge cases and invalid inputs gracefully. This also makes the behavior consistent with how HuggingFace datasets are handled elsewhere in the code.
Suggested change:

```diff
-# Apply split specification (e.g., "train[:100]" becomes "[:100]")
-# Extract slice notation if present
-if "[" in split_str and "]" in split_str:
-    slice_part = split_str[split_str.index("[") : split_str.index("]") + 1]
-    # Parse slice notation like "[:100]" or "[50:150]"
-    slice_content = slice_part[1:-1]  # Remove brackets
-    if ":" in slice_content:
-        parts = slice_content.split(":")
-        start = int(parts[0]) if parts[0] else None
-        end = int(parts[1]) if parts[1] else None
-        prompts = prompts[start:end]
+# Apply split specification using the robust parsing from the `datasets` library.
+# We treat the list of prompts as a single split, so the name is arbitrary.
+try:
+    instruction = ReadInstruction.from_spec(split_str)
+    name2len = {"_": len(prompts)}
+    abs_instruction = instruction.to_absolute(name2len)[0]
+    prompts = prompts[abs_instruction.from_ : abs_instruction.to]
+except (ValueError, IndexError):
+    # Fall back to using all prompts if split_str is invalid.
+    # A warning could be logged here for better user feedback.
+    pass
```
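For context, `ReadInstruction.from_spec` and `to_absolute` are existing `datasets` APIs; here is a minimal standalone sketch of how a spec resolves to absolute bounds (the split name and length are illustrative):

```python
from datasets import ReadInstruction

# Parse an HF-style split spec, including percentage-based slicing.
instruction = ReadInstruction.from_spec("train[:10%]")

# Resolve it against a split of known length; to_absolute() returns a list
# of absolute instructions carrying integer from_/to bounds.
abs_instruction = instruction.to_absolute({"train": 200})[0]
print(abs_instruction.from_, abs_instruction.to)  # Expected: 0 20
```

An unknown split name or malformed spec raises a `ValueError` rather than silently misbehaving, which is what the `except` clause in the suggestion catches.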
Replaces manual string parsing with datasets.ReadInstruction for:
- Better error handling (invalid slices don't crash)
- Support for percentage-based slicing (e.g., [:40%])
- Consistency with HuggingFace dataset handling elsewhere in code

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Applied @gemini-code-assist's suggestion to use `ReadInstruction`.
That's excellent to hear, @ricyoung! I'm glad the suggestion to use `ReadInstruction` was helpful.
src/heretic/utils.py
Outdated
```python
)
# Parse the split instructions.
# Support for plain text files (one prompt per line)
if path.endswith(".txt") and os.path.isfile(path):
```
Suggested change:

```diff
-if path.endswith(".txt") and os.path.isfile(path):
+if os.path.isfile(path):
```
I don't think we should require a specific extension, as plain text files are often stored without extensions on Unix systems.
src/heretic/utils.py
Outdated
```python
    # But also don't use cached data, as the dataset may have changed on disk.
    download_mode=DownloadMode.FORCE_REDOWNLOAD,
)
prompts = prompts[abs_instruction.from_ : abs_instruction.to]
```
This duplicates code below, and should be moved to a common helper function.
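A later commit does extract this into a `_get_split_slice()` helper; a minimal sketch of what it might look like, built from the `ReadInstruction` snippet above (the exact body is an assumption, not shown in the PR):

```python
from datasets import ReadInstruction

def _get_split_slice(split_str: str, length: int, name: str) -> tuple[int, int]:
    # Hypothetical body: resolve split_str against a single pseudo-split of
    # the given length. How the split name inside split_str is normalized
    # against `name` is an assumption.
    instruction = ReadInstruction.from_spec(split_str)
    abs_instruction = instruction.to_absolute({name: length})[0]
    return abs_instruction.from_, abs_instruction.to
```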
config.default.toml
Outdated
```toml
# - One prompt per line (UTF-8 encoding)
# - Empty lines and whitespace-only lines are ignored
# - Leading/trailing whitespace is stripped from each prompt
# - The "column" field is ignored for text files
```
It shouldn't be required then, and neither should the split field.
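For reference, the loading behavior those config comments describe amounts to roughly this sketch (the function name `load_text_prompts` is hypothetical, not taken from the PR):

```python
def load_text_prompts(path: str) -> list[str]:
    # One prompt per line, read as UTF-8; blank and whitespace-only lines
    # are dropped, and surrounding whitespace is stripped from each prompt.
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]
```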
- Remove .txt extension requirement for text files (any file is now supported)
- Extract ReadInstruction parsing into _get_split_slice() helper function
- Make 'split' and 'column' fields optional in DatasetSpecification
- Add validation errors for HuggingFace datasets missing required fields
- Update documentation to clarify text file requirements
/gemini review
Code Review
This pull request introduces a valuable feature for loading prompts from plain text files, offering a simpler alternative to HuggingFace datasets. The implementation is clean and well-executed, with logical changes in heretic/utils.py for file parsing and slicing, and necessary modifications to DatasetSpecification in heretic/config.py. The extraction of slicing logic into the _get_split_slice helper function is a nice improvement for code clarity and reuse. The documentation updates in config.default.toml are also clear and helpful. My review includes a couple of minor style suggestions to align with the repository's coding conventions.
```python
dataset = load_from_disk(path)
assert not isinstance(dataset, DatasetDict), (
    "Loading dataset dicts is not supported"
# Support for plain text files (one prompt per line)
```
This comment is missing a period at the end. According to the repository's style guide, comments should end with a period.
Suggested change:

```diff
-# Support for plain text files (one prompt per line)
+# Support for plain text files (one prompt per line).
```
References
- Rule 4 of the repository style guide states that comments should start with a capital letter and end with a period. (link)
```python
    pass

else:
    # Load from HuggingFace datasets (local directory or Hub)
```
This comment is missing a period at the end. The repository's style guide requires comments to end with a period.
Suggested change:

```diff
-# Load from HuggingFace datasets (local directory or Hub)
+# Load from HuggingFace datasets (local directory or Hub).
```
References
- Rule 4 of the repository style guide states that comments should start with a capital letter and end with a period. (link)
```python
    start, end = _get_split_slice(split_str, len(prompts), "_")
    prompts = prompts[start:end]
except (ValueError, IndexError):
    # If split_str doesn't contain valid slice notation, use all prompts.
```
No, it should throw an error. If the user put something in there they expect it to be used. If it can't be, we need to tell them.
Simply remove that try/except block. That's what we do for the directory case as well.
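In other words, the call site would presumably shrink to something like this sketch (assuming errors raised by the helper propagate to the user):

```python
# No try/except: an invalid split spec now raises immediately, so the
# user is told their spec could not be applied instead of it being ignored.
start, end = _get_split_slice(split_str, len(prompts), "_")
prompts = prompts[start:end]
```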
This looks good apart from the error thing.
Feat/txt dataset support
Read the comments on this, add suggestions to my fork: p-e-w#103
Summary
- Load prompts from plain text files (one prompt per line) as an alternative to Hugging Face datasets
- Supports slice notation (e.g., `split = "[:100]"` for the first 100 lines)

Changes
- Updated `load_prompts()` in `utils.py` to detect and parse plain text files
- Added `_get_split_slice()` helper function for robust split parsing (reuses `ReadInstruction`)
- Made `split` and `column` fields optional in `DatasetSpecification` (ignored for text files)
- Added comments to `config.default.toml` explaining the dataset options

Example usage
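A hypothetical configuration sketch (the section and key names are illustrative and not confirmed against heretic's actual `config.default.toml` schema):

```toml
# Hypothetical: point a prompt dataset at a plain text file instead of an
# HF dataset. "split" and "column" may be omitted for text files.
[harmful_prompts]
dataset = "my_prompts.txt"  # One prompt per line, UTF-8.
split = "[:100]"            # Optional: use only the first 100 lines.
```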
Test plan
- Tested slice notation (`[:3]`, `[1:3]`, `[:2]`)

Closes #98