Grep + Raptor: Transform messy, unstructured text into clean, grep-friendly data for agentic search workflows.
Claude Code has proven that agentic search (ripgrep + filesystem traversal + iterative investigation) is powerful enough for complex code navigation tasks. But what about textual data like documents, transcripts, posts, articles, notes, and reports?
Greptor brings that same workflow to text: it ingests and indexes unstructured text into a format that agents can search easily with simple tools like ripgrep.
RAG worked around small context windows by chunking documents and retrieving "relevant" fragments. That approach has recurring pain points:
- Chunking breaks structure: Tables, section hierarchies, and cross-references get lost.
- Embeddings are fuzzy: They struggle with exact terms, numbers, and identifiers.
- Complexity overhead: Hybrid search + rerankers add latency, cost, and moving parts.
- Error cascade: If retrieval misses the right chunk, the answer can't be correct.
Agentic search flips the approach: with larger context windows and better tool use, agents can search, open files, follow references, and refine queries — more like a human analyst.
Greptor's job is to clean, chunk, and add structure to your documents, making them easily searchable with text tools like ripgrep. No complex indices, no retrievers, no vector databases. Just minimal initial processing + maximal grep-ability.
npm install greptor
# or
bun add greptor

Create a Greptor instance with your base path, topic, and model config.
import { createGreptor } from 'greptor';
// Create Greptor instance
const greptor = await createGreptor({
basePath: './projects/investing/content',
topic: 'Investing, stock market, financial, and macroeconomics',
tagSchema: YOUR_TAG_SCHEMA, // Required. See "Tag Schemas" below.
model: {
provider: "@ai-sdk/openai",
model: "gpt-5-mini",
},
});
// Start background processing workers
await greptor.start();

- basePath: Base directory where data will be stored.
- topic: Helps Greptor understand your data better and generate a relevant tag schema.
- tagSchema: Required. Define your tag fields (or generate them with greptor generate tags).
- model: A config object with provider, model, and optional options for the Vercel AI SDK.
Greptor will automatically create and manage the following structure in your basePath:
- raw/ - immediate raw content writes
- processed/ - enriched/processed content from background workers
Greptor uses an LLM (via the Vercel AI SDK) to process content. You'll need to:
1. Choose a provider from the AI SDK ecosystem:
   - @ai-sdk/openai - OpenAI (GPT-4, GPT-4o, etc.)
   - @ai-sdk/anthropic - Anthropic (Claude)
   - @ai-sdk/groq - Groq (fast inference)
   - @ai-sdk/openai-compatible - OpenAI-compatible endpoints (NVIDIA NIM, OpenRouter, etc.)
   - And many more...
2. Get an API key from your provider and set it as an environment variable:

   export OPENAI_API_KEY="sk-..."  # or add to ~/.bashrc, ~/.zshrc, etc.
3. Provide it in the model config when creating Greptor.
const greptor = await createGreptor({
  basePath: './projects/investing/content',
  topic: 'Investing, stock market, financial, and macroeconomics',
  tagSchema: YOUR_TAG_SCHEMA,
  model: {
    provider: "@ai-sdk/openai-compatible",
    model: "z-ai/glm4.7",
    name: "nvidia",
    options: {
      baseURL: "https://integrate.api.nvidia.com/v1",
      apiKey: process.env.NVIDIA_API_KEY,
    },
  },
});
await greptor.start();
await greptor.eat({
id: 'QwwVJfvfqN8',
source: 'youtube',
publisher: '@JosephCarlsonShow',
format: 'text',
label: 'Top Five AI Stocks I\'m Buying Now',
content: '{fetch and populate video transcript here}',
creationDate: new Date('2025-11-15'),
tags: {
// Optional custom tags specific to the source or document
channelTitle: 'Joseph Carlson',
channelSubscribers: 496000
},
});
await greptor.eat({
id: 'tesla_reports_418227_deliveries_for_the_fourth',
source: 'reddit',
publisher: 'investing', // For Reddit, publisher is the subreddit name
format: 'text',
label: 'Tesla reports 418,227 deliveries for the fourth quarter, down 16%',
content: '{fetch and populate Reddit post with comments here}',
creationDate: new Date('2025-12-03'),
tags: {
// Optional custom tags
upvotes: 1400
},
});

Greptor writes your input to a raw Markdown file immediately. After you call await greptor.start(), background workers run enrichment (LLM cleaning + chunking + tagging) and write a processed Markdown file. You can grep the raw files right away, and the processed files will appear shortly after.
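Because processed files mirror the raw tree, an ingestion script can wait for a document's enriched counterpart before querying it. A minimal sketch, assuming the raw/ to processed/ prefix swap implied by the directory layout (processedPathFor and waitForProcessed are our own helper names, not part of the Greptor API):

```typescript
import { existsSync } from "node:fs";

// Map a raw file path to its processed counterpart by swapping the
// raw/ prefix for processed/ (assumption based on the mirrored layout).
function processedPathFor(rawPath: string): string {
  return rawPath.replace(/(^|\/)raw\//, "$1processed/");
}

// Poll until the background workers have written the processed file.
async function waitForProcessed(rawPath: string, timeoutMs = 60_000): Promise<string> {
  const target = processedPathFor(rawPath);
  const deadline = Date.now() + timeoutMs;
  while (Date.now() < deadline) {
    if (existsSync(target)) return target;
    await new Promise((resolve) => setTimeout(resolve, 1_000));
  }
  throw new Error(`Processed file not found within ${timeoutMs}ms: ${target}`);
}
```

In practice you can also simply grep raw/ immediately and let processed/ fill in over time, since both trees are plain Markdown.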
Navigate to your workspace directory and run:
greptor generate skills

The CLI will prompt you to pick an agent type (Claude Code, Codex, or OpenCode).
Then it writes the appropriate skill file for your chosen agent.
The skill is customized for the sources you provide and includes search tips based on the tag schema. You can always customize it manually further for better results.
By this point, you should have the following structure in your basePath:
./projects/investing/content/
  .claude/
    skills/
      search-youtube-reddit/
        SKILL.md
  raw/
    youtube/
      JosephCarlsonShow/
        2025-11/
          2025-11-15-Top-Five-AI-Stocks-Im-Buying-Now.md
    reddit/
      investing/
        2025-12/
          2025-12-03-Tesla-reports-418227-deliveries-for-the-fourth-quarter-down-16.md
  processed/
    youtube/
      JosephCarlsonShow/
        2025-11/
          2025-11-15-Top-Five-AI-Stocks-Im-Buying-Now.md
    reddit/
      investing/
        2025-12/
          2025-12-03-Tesla-reports-418227-deliveries-for-the-fourth-quarter-down-16.md
If you chose Codex or OpenCode, the skill file will be written to:

- .codex/skills/search-*.md (Codex)
- .opencode/skills/search-*.md (OpenCode)
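The file layout follows a source/publisher/month/date-slug convention. It can be sketched as a function; note that the slug rules below are an assumption inferred from the example filenames, and rawFilePath is our own illustrative name, not a Greptor export:

```typescript
// Derive the raw file path for an eat() input, following the pattern
// raw/<source>/<publisher>/<YYYY-MM>/<YYYY-MM-DD>-<slugified-label>.md
function rawFilePath(input: {
  source: string;
  publisher: string;
  label: string;
  creationDate: Date;
}): string {
  const day = input.creationDate.toISOString().slice(0, 10); // YYYY-MM-DD
  const month = day.slice(0, 7); // YYYY-MM
  const slug = input.label
    .replace(/['",.%]/g, "")        // drop punctuation entirely (assumed rule)
    .replace(/[^A-Za-z0-9]+/g, "-") // remaining separators become dashes
    .replace(/^-+|-+$/g, "");       // trim leading/trailing dashes
  const publisher = input.publisher.replace(/^@/, ""); // "@Channel" -> "Channel"
  return `raw/${input.source}/${publisher}/${month}/${day}-${slug}.md`;
}
```

Running this on the Reddit example from earlier reproduces the filename shown in the tree above.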
Now run your chosen agent in this folder and ask questions about your data or perform research tasks!
For better results:
- Connect MCP servers like Yahoo Finance or other relevant financial/stock market MCP servers for up-to-date information.
- Add personal financial information, such as your portfolio holdings, watchlists, and risk profile.
- Create custom skills, slash commands, or subagents for researching specific tickers, sectors, topics, or managing your portfolio.
Now you have a personal investment research assistant with access to your portfolio, sentiment data (YouTube, Reddit), news, and market data! You don't have to manually watch dozens of YouTube channels or spend hours scrolling Reddit and other sources.
eat() writes the input to a raw Markdown file with YAML frontmatter. You can grep it right away.
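A raw file might look like this (illustrative only; the exact frontmatter keys are an assumption based on the eat() input fields):

```markdown
---
id: "QwwVJfvfqN8"
source: "youtube"
publisher: "@JosephCarlsonShow"
label: "Top Five AI Stocks I'm Buying Now"
creationDate: 2025-11-15
channelTitle: "Joseph Carlson"
channelSubscribers: 496000
---

{raw transcript text, exactly as passed to eat()}
```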
Workers pick up new documents and run a one-time pipeline:
- LLM clean + chunk + tag (single prompt): Remove boilerplate, split into semantic chunks, and inline grep-friendly per-chunk tags.
Here's an example of a processed file:
---
title: "NVIDIA Q4 2024 Earnings: AI Boom Continues"
source: "youtube"
publisher: "Wall Street Millennial"
date: 2025-11-15
ticker: "NVDA"
videoId: "dQw4w9WgXcQ"
url: "https://youtube.com/watch?v=dQw4w9WgXcQ"
---
## 01 Revenue Growth Analysis
topics=earnings,revenue,data_center
sentiment=positive
tickers=NVDA
NVIDIA reported Q4 revenue of $35.1 billion, beating estimates...
## 02 AI Chip Demand Outlook
topics=ai,competition,market_share
sentiment=bullish
tickers=NVDA,AMD,INTC
timeframe=next_quarter
The demand for AI accelerators continues to outpace supply...

Your "index" is the YAML frontmatter (document-level) plus the per-chunk tag lines. Agents can search it deterministically.
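Because tag lines are plain key=value text, they can also be parsed programmatically when grep alone isn't enough. A minimal sketch (the line format is taken from the example above; parseChunkTags is our own helper name):

```typescript
// Parse the per-chunk tag lines that follow a "## NN Title" heading.
// Tag lines have the shape "key=value1,value2", one per line.
function parseChunkTags(chunk: string): Record<string, string[]> {
  const tags: Record<string, string[]> = {};
  for (const line of chunk.split("\n")) {
    const match = line.match(/^([a-z_]+)=(.+)$/);
    if (match) tags[match[1]] = match[2].split(",");
  }
  return tags;
}

const chunk = [
  "topics=ai,competition,market_share",
  "sentiment=bullish",
  "tickers=NVDA,AMD,INTC",
].join("\n");
// parseChunkTags(chunk).tickers → ["NVDA", "AMD", "INTC"]
```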
Basic search examples:
# Simple tag search with context
rg -n -C 6 "ticker=NVDA" content/processed/
# Search for any value in a tag field
rg -n -C 6 "sentiment=" content/processed/
# Case-insensitive full-text search
rg -i -n -C 3 "artificial intelligence" content/processed/
# Search within a specific source
rg -n -C 6 "sector=technology" content/processed/youtube/

Date-filtered searches:
# Content from December 2025
rg -n -C 6 "ticker=TSLA" content/processed/ --glob "**/2025-12/*.md"
# Q4 2025 content
rg -n -C 6 "sentiment=bullish" content/processed/ --glob "**/2025-1[0-2]/*.md"
# Specific month and source
rg -n -C 6 "asset_type=etf" content/processed/reddit/ --glob "**/2025-11/*.md"

Combined tag filters:
# Match chunks with two specific tags (using file list)
rg -l "sector=technology" content/processed/ | xargs rg -n -C 6 "sentiment=bullish"
# Pipeline filter for complex queries
rg -n -C 6 "ticker=AAPL" content/processed/ | rg "recommendation=.*buy"
# Three-way filter: tech stocks with bullish sentiment and buy recommendation
rg -l "sector=technology" content/processed/ | xargs rg -l "sentiment=bullish" | xargs rg -n -C 6 "recommendation=buy"
# Find AI narrative discussions with specific tickers
rg -n -C 6 "narrative=.*ai" content/processed/ | rg "ticker=NVDA|ticker=.*,NVDA"

Discovery and exploration:
# List all unique tickers mentioned
rg -o "ticker=[^\n]+" content/processed/ | cut -d= -f2 | tr ',' '\n' | sort -u
# Count occurrences of each sentiment
rg -o "sentiment=[^\n]+" content/processed/ | cut -d= -f2 | sort | uniq -c | sort -rn
# Top 20 most discussed companies
rg -o "company=[^\n]+" content/processed/ | cut -d= -f2 | tr ',' '\n' | sort | uniq -c | sort -rn | head -20
# Find all files discussing dividend investing
rg -l "investment_style=dividend" content/processed/
# See what narratives exist in the data
rg -o "narrative=[^\n]+" content/processed/ | cut -d= -f2 | tr ',' '\n' | sort -u

Analysis patterns:
# Sentiment distribution for a specific ticker
rg -n -C 6 "ticker=TSLA" content/processed/ | rg -o "sentiment=[^\n]+" | cut -d= -f2 | sort | uniq -c
# Most discussed sectors
rg -o "sector=[^\n]+" content/processed/ | cut -d= -f2 | tr ',' '\n' | sort | uniq -c | sort -rn
# Track narrative evolution over time
for month in 2025-{10..12}; do
echo "=== $month ==="
rg -o "narrative=[^\n]+" content/processed/ --glob "**/$month/*.md" | cut -d= -f2 | tr ',' '\n' | sort | uniq -c | sort -rn | head -5
done
# Compare sentiment across sources for a stock
for source in youtube reddit; do
echo "=== $source ==="
rg -n -C 6 "ticker=AAPL" content/processed/$source/ | rg -o "sentiment=[^\n]+" | cut -d= -f2 | tr ',' '\n' | sort | uniq -c
done
# Find all strong buy recommendations by sector
for sector in technology healthcare financials; do
echo "=== $sector ==="
rg -l "sector=$sector" content/processed/ | xargs rg -n -C 3 "recommendation=strong_buy" | head -5
done

Advanced multi-criteria searches:
# Large-cap tech stocks with bullish sentiment
rg -l "market_cap=large_cap" content/processed/ | xargs rg -l "sector=technology" | xargs rg -n -C 6 "sentiment=bullish"
# Growth investing discussions about mega-cap stocks
rg -n -C 6 "investment_style=growth" content/processed/ | rg "market_cap=mega_cap"
# ETF recommendations from specific time period
rg -n -C 6 "asset_type=etf" content/processed/ --glob "**/2025-12/*.md" | rg "recommendation=buy|recommendation=strong_buy"
# Bearish sentiment on specific narrative
rg -n -C 6 "narrative=ev_transition" content/processed/ | rg "sentiment=bearish"

You can override the default processing prompt for specific sources to tailor how content is processed:
const greptor = await createGreptor({
basePath: './projects/investing/content',
topic: 'Investing, stock market, financial, and macroeconomics',
tagSchema: YOUR_TAG_SCHEMA,
model: {
provider: "@ai-sdk/openai",
model: "gpt-5-mini",
},
customProcessingPrompts: {
// Custom prompt for Twitter/X content
'twitter': `
# INSTRUCTIONS
Process this Twitter/X content for investment research. Focus on:
- Investment signals, predictions, or analysis
- Key metrics and numbers mentioned
- Influencer sentiment and conviction level
# CONTENT TO PROCESS:
{CONTENT}
`,
// Custom prompt for SEC filings
'sec_filing': `
# INSTRUCTIONS
Process this SEC filing with extreme precision:
- Preserve all financial figures, dates, and legal language exactly
- Extract key financial metrics and risk factors
- Maintain formal, factual tone throughout
# CONTENT TO PROCESS:
{CONTENT}
`,
// Custom prompt for earnings transcripts
'earnings': `
# INSTRUCTIONS
Process this earnings call transcript:
- Extract forward-looking statements and guidance
- Preserve exact numbers, percentages, and ranges
- Capture management sentiment and key Q&A points
# CONTENT TO PROCESS:
{CONTENT}
`,
},
});
await greptor.start();

Usage notes:
- Use {CONTENT} as a placeholder where the raw content will be inserted
- Each custom prompt should include the placeholder exactly once
- If no custom prompt is defined for a source, Greptor falls back to the default processing prompt
- Custom prompts are matched against the document's source field (e.g., youtube, reddit, twitter)
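The selection rule described above amounts to a lookup with a fallback plus a single placeholder substitution. A minimal sketch of that logic (buildPrompt and the default template text are our own illustrations, not Greptor internals):

```typescript
// Hypothetical default template; the real one ships with Greptor.
const DEFAULT_PROMPT =
  "# INSTRUCTIONS\nClean, chunk, and tag this content.\n# CONTENT TO PROCESS:\n{CONTENT}";

// Pick the custom prompt registered for the document's source, or fall
// back to the default, then substitute {CONTENT} exactly once.
function buildPrompt(
  source: string,
  content: string,
  customPrompts: Record<string, string> = {},
): string {
  const template = customPrompts[source] ?? DEFAULT_PROMPT;
  return template.replace("{CONTENT}", content);
}
```

Note that String.replace with a string pattern substitutes only the first occurrence, which matches the "exactly once" rule above.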
Greptor provides optional hooks to monitor document processing. These are useful for logging, metrics, progress tracking, or building custom UIs.
const greptor = await createGreptor({
basePath: './projects/investing/content',
topic: 'Investing, stock market, financial, and macroeconomics',
tagSchema: YOUR_TAG_SCHEMA,
model: {
provider: "@ai-sdk/openai",
model: "gpt-5-mini",
},
hooks: {
onDocumentProcessingStarted: ({ source, publisher, label, documentsCount }) => {
const count = documentsCount[source] || { fetched: 0, processed: 0 };
console.log(`Processing: ${source}/${publisher}/${label} (${count.fetched} fetched, ${count.processed} processed)`);
},
onDocumentProcessingCompleted: (event) => {
if (event.success) {
const { source, publisher, label, documentsCount, elapsedMs, totalTokens } = event;
const count = documentsCount[source] || { fetched: 0, processed: 0 };
console.log(`✓ Completed: ${source}/${publisher}/${label} (${elapsedMs}ms, ${totalTokens} tokens, ${count.processed}/${count.fetched} processed)`);
} else {
const { source, publisher, label, error } = event;
console.error(`✗ Failed: ${source}/${publisher}/${label} - ${error}`);
}
},
},
});
await greptor.start();

| Hook | When Called | Event Data |
|---|---|---|
| onDocumentProcessingStarted | Before processing each document | source, publisher?, label, documentsCount: SourceCounts |
| onDocumentProcessingCompleted | After processing succeeds or fails | Union type. Success: success: true, source, publisher?, label, documentsCount, elapsedMs, inputTokens, outputTokens, totalTokens. Failure: success: false, error: string, source, publisher?, label |
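The completion hook is enough to build simple run metrics. A minimal sketch, with the event type trimmed to a subset of the fields listed above (makeMetricsHook is our own helper name):

```typescript
// Trimmed event shape; the real event carries more fields.
type CompletedEvent =
  | { success: true; source: string; label: string; elapsedMs: number; totalTokens: number }
  | { success: false; source: string; label: string; error: string };

// Returns a stats object plus a hook function that accumulates into it.
function makeMetricsHook() {
  const stats = { processed: 0, failed: 0, totalTokens: 0, totalMs: 0 };
  return {
    stats,
    onDocumentProcessingCompleted(event: CompletedEvent) {
      if (event.success) {
        stats.processed += 1;
        stats.totalTokens += event.totalTokens;
        stats.totalMs += event.elapsedMs;
      } else {
        stats.failed += 1;
      }
    },
  };
}
```

Pass the returned onDocumentProcessingCompleted into the hooks option when creating Greptor, then read stats after a batch to see throughput and token spend.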
Greptor requires a tag schema. For best results, provide a custom tag schema (or generate one with greptor generate tags).
Here's a comprehensive example for investment research:
const greptor = await createGreptor({
basePath: './projects/investing/content',
topic: 'Investing, stock market, financial, and macroeconomics',
model: {
provider: "@ai-sdk/openai",
model: "gpt-5-mini",
},
tagSchema: [
{
name: 'company',
type: 'string[]',
description: 'Canonical company names in snake_case (e.g. apple, tesla, microsoft)',
},
{
name: 'ticker',
type: 'string[]',
description: 'Canonical stock tickers, UPPERCASE only (e.g. AAPL, TSLA, MSFT, SPY)',
},
{
name: 'sector',
type: 'enum[]',
description: 'GICS sector classification for stocks/companies discussed',
enumValues: [
'technology', 'healthcare', 'financials', 'consumer_discretionary',
'consumer_staples', 'energy', 'utilities', 'industrials',
'materials', 'real_estate', 'communication_services',
'etf', 'index', 'commodity', 'bond', 'mixed'
],
},
{
name: 'industry',
type: 'string[]',
description: 'Specific industry/sub-sector in snake_case (e.g. semiconductors, biotech, banking)',
},
{
name: 'market_cap',
type: 'enum[]',
description: 'Market capitalization category of the company',
enumValues: ['mega_cap', 'large_cap', 'mid_cap', 'small_cap', 'micro_cap'],
},
{
name: 'investment_style',
type: 'enum[]',
description: 'Investment approach or style discussed',
enumValues: [
'value', 'growth', 'dividend', 'momentum', 'index',
'passive', 'active', 'day_trading', 'swing_trading', 'long_term_hold'
],
},
{
name: 'asset_type',
type: 'enum[]',
description: 'Type of financial instrument discussed',
enumValues: [
'stock', 'etf', 'mutual_fund', 'option', 'bond',
'reit', 'commodity', 'crypto', 'cash'
],
},
{
name: 'narrative',
type: 'string[]',
description: 'Investment or market narratives in snake_case (e.g. ai_boom, ev_transition, rate_cuts)',
},
{
name: 'sentiment',
type: 'enum[]',
description: 'Directional stance on the stock/market',
enumValues: ['bullish', 'bearish', 'neutral', 'mixed', 'cautious'],
},
{
name: 'recommendation',
type: 'enum[]',
description: 'Analyst or influencer recommendation type',
enumValues: ['strong_buy', 'buy', 'hold', 'sell', 'strong_sell'],
},
],
});
await greptor.start();

MIT © Sergii Vashchyshchuk