Conversation

royisme (Owner) commented Nov 5, 2025

Comprehensive implementation of automatic memory extraction features using LLM analysis.

Major Features:

  • Extract memories from conversations using LLM
  • Analyze git commits for decisions and experiences
  • Mine code comments (TODO, FIXME, NOTE, DECISION markers)
  • Auto-suggest memories from knowledge base queries
  • Batch extract from entire repository

Components Added/Modified:

  1. services/memory_extractor.py (full implementation)

    • Conversation analysis with confidence scoring
    • Git commit classification and extraction
    • Code comment parsing and categorization
    • Query-based memory suggestions
    • Batch repository scanning
  2. api/memory_routes.py (5 new endpoints)

    • POST /api/v1/memory/extract/conversation
    • POST /api/v1/memory/extract/commit
    • POST /api/v1/memory/extract/comments
    • POST /api/v1/memory/suggest
    • POST /api/v1/memory/extract/batch
  3. MCP Tools (5 new tools)

    • extract_from_conversation
    • extract_from_git_commit
    • extract_from_code_comments
    • suggest_memory_from_query
    • batch_extract_from_repository
  4. Documentation (CLAUDE.md)

    • Updated tool count from 25 to 30
    • Added v0.7 feature documentation
    • API and MCP usage examples
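
As a hedged illustration of how one of the new endpoints might be called, the sketch below builds a request for POST /api/v1/memory/extract/conversation. The payload field names (`conversation`, `project_id`, `auto_save`) and the base URL are assumptions for illustration, not the actual request model defined in api/memory_routes.py.

```python
# Hypothetical client-side sketch for POST /api/v1/memory/extract/conversation.
# Field names and the base URL are illustrative assumptions.
import json

BASE_URL = "http://localhost:8000"  # assumed local deployment

payload = {
    "conversation": [
        {"role": "user", "content": "Why did we pick Neo4j over Postgres?"},
        {"role": "assistant", "content": "Graph traversal of call chains is the core query pattern."},
    ],
    "project_id": "codebase-rag",
    "auto_save": True,
}

url = f"{BASE_URL}/api/v1/memory/extract/conversation"
body = json.dumps(payload)

if __name__ == "__main__":
    # Requires a running server; uncomment to actually send:
    # import requests
    # print(requests.post(url, data=body, headers={"Content-Type": "application/json"}).json())
    pass
```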

Technical Details:

  • Uses LlamaIndex Settings.llm for LLM access
  • JSON response parsing from LLM outputs
  • Confidence threshold for auto-saving (default 0.7)
  • Support for conventional commit types
  • AST-based comment extraction for Python
  • Pattern matching for other languages
  • Git subprocess integration for commit history
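
The "pattern matching for other languages" path above can be sketched roughly as follows; the marker regex and function name are illustrative assumptions, not the actual memory_extractor implementation.

```python
# Hedged sketch: scan source lines for TODO/FIXME/NOTE/DECISION markers in
# //, #, or /* comments. Names are illustrative, not the real API.
import re

MARKER_RE = re.compile(r"(?://|#|/\*)\s*(TODO|FIXME|NOTE|DECISION)[:\s](.*)", re.IGNORECASE)

def mine_comment_markers(source: str) -> list[dict]:
    """Return one record per marker comment found in `source`."""
    found = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        m = MARKER_RE.search(line)
        if m:
            found.append({
                "marker": m.group(1).upper(),
                "text": m.group(2).strip().rstrip("*/").strip(),
                "line": lineno,
            })
    return found

sample = """\
int main() {
    // TODO: handle empty input
    /* DECISION: retry with backoff, not fail-fast */
    return 0;
}
"""
print(mine_comment_markers(sample))  # two records: TODO (line 2), DECISION (line 3)
```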

Auto-save Logic:

  • Memories with confidence >= threshold automatically saved
  • Lower confidence memories returned as suggestions
  • Metadata tracking for extraction source
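
A minimal sketch of this auto-save decision, assuming a simple list-of-dicts candidate shape (the names are illustrative, not the service's actual API):

```python
# Candidates at or above the confidence threshold are saved; the rest are
# returned as suggestions, with the extraction source tracked in metadata.
DEFAULT_CONFIDENCE_THRESHOLD = 0.7  # matches the documented default

def partition_memories(candidates, threshold=DEFAULT_CONFIDENCE_THRESHOLD):
    """Split extracted memories into (auto_saved, suggestions)."""
    saved, suggestions = [], []
    for mem in candidates:
        mem = {**mem, "metadata": {"extraction_source": mem.get("source", "unknown")}}
        if mem.get("confidence", 0.0) >= threshold:
            saved.append(mem)
        else:
            suggestions.append(mem)
    return saved, suggestions

candidates = [
    {"title": "Use Neo4j", "confidence": 0.9, "source": "conversation"},
    {"title": "Maybe cache embeddings", "confidence": 0.4, "source": "commit"},
]
saved, suggestions = partition_memories(candidates)
print(len(saved), len(suggestions))  # 1 1
```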

Total Changes: 30 MCP tools, 12 memory tools (7 manual + 5 automatic)

@royisme royisme requested a review from Copilot November 5, 2025 16:20

Copilot AI (Contributor) left a comment

Pull Request Overview

This PR implements comprehensive automatic memory extraction features (v0.7) that leverage LLM analysis to intelligently extract and save project knowledge from various sources. The implementation moves from placeholder skeleton code to full production functionality across conversation analysis, git commit mining, code comment extraction, knowledge query analysis, and batch repository processing.

Key Changes:

  • Implemented 5 new automatic extraction methods using LLM-powered analysis with confidence scoring and auto-save capabilities
  • Added 5 new API endpoints and 5 new MCP tools for automatic memory extraction
  • Enhanced documentation with complete v0.7 feature descriptions and usage examples

Reviewed Changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 7 comments.

| File | Description |
|------|-------------|
| services/memory_extractor.py | Complete implementation of 5 extraction methods with LLM integration, confidence thresholds, git subprocess handling, and AST-based comment parsing |
| api/memory_routes.py | Added 5 new REST endpoints for automatic memory extraction with request validation models |
| mcp_tools/tool_definitions.py | Added 5 new MCP tool definitions for extraction features, updated tool count from 25 to 30 |
| mcp_tools/memory_handlers.py | Implemented 5 handler functions for new extraction tools |
| mcp_tools/__init__.py | Exported new extraction handlers |
| mcp_server.py | Imported memory_extractor service and new handlers |
| CLAUDE.md | Updated documentation with v0.7 feature details, usage examples, and tool count changes |


Comment on lines 639 to 645:

````python
response_text = re.search(r"```json\s*(.*?)\s*```", response_text, re.DOTALL)
if response_text:
    response_text = response_text.group(1)
elif "```" in response_text:
    response_text = re.search(r"```\s*(.*?)\s*```", response_text, re.DOTALL)
    if response_text:
        response_text = response_text.group(1)
````

Copilot AI Nov 5, 2025

Variable reassignment issue: response_text is reassigned from a string to a Match object or None. If re.search() returns None, the subsequent response_text.group(1) will fail. The search result should be stored in a separate variable to avoid this type confusion and potential AttributeError.

Suggested change:

````diff
-response_text = re.search(r"```json\s*(.*?)\s*```", response_text, re.DOTALL)
-if response_text:
-    response_text = response_text.group(1)
-elif "```" in response_text:
-    response_text = re.search(r"```\s*(.*?)\s*```", response_text, re.DOTALL)
-    if response_text:
-        response_text = response_text.group(1)
+match = re.search(r"```json\s*(.*?)\s*```", response_text, re.DOTALL)
+if match:
+    response_text = match.group(1)
+elif "```" in response_text:
+    match = re.search(r"```\s*(.*?)\s*```", response_text, re.DOTALL)
+    if match:
+        response_text = match.group(1)
````

```python
title=mem_data["title"],
content=mem_data["content"],
reason=mem_data.get("reason"),
tags=mem_data.get("tags", ["code-comment"]) + [Path(file_path).suffix[1:]],
```

Copilot AI Nov 5, 2025

Potential empty-string tag if the file has no extension: Path(file_path).suffix returns an empty string for files without extensions, and slicing it with [1:] still yields an empty string, so the tag list gains an empty-string entry. Consider checking that suffix is non-empty before adding it as a tag.

Suggested change:

```diff
-tags=mem_data.get("tags", ["code-comment"]) + [Path(file_path).suffix[1:]],
+tags=mem_data.get("tags", ["code-comment"]) + ([Path(file_path).suffix[1:]] if Path(file_path).suffix else []),
```

```python
logger.info(f"Analyzing last {max_commits} git commits...")
commits = self._get_recent_commits(repo_path, max_commits)

for commit in commits[:20]:  # Focus on most recent 20 for efficiency
```

Copilot AI Nov 5, 2025

Magic number 20 should be extracted as a named constant or parameter. The method accepts max_commits parameter but then hard-codes a different limit of 20 for processing. This creates confusion about the actual behavior and makes it difficult to adjust the processing limit.

Suggested change:

```diff
-for commit in commits[:20]:  # Focus on most recent 20 for efficiency
+for commit in commits[:max_commits]:  # Focus on most recent max_commits for efficiency
```

```python
source_files.extend(repo_path_obj.rglob(pattern))

# Sample files to avoid overload
sampled_files = list(source_files)[:30]
```

Copilot AI Nov 5, 2025

Magic number 30 should be extracted as a named constant or configurable parameter. This hard-coded limit for file sampling is not exposed to the caller and may need adjustment based on repository size or performance requirements.

```python
combined = []
for mem_type, items in grouped.items():
    sorted_items = sorted(items, key=lambda x: x.get("importance", 0), reverse=True)
    combined.extend(sorted_items[:3])
```

Copilot AI Nov 5, 2025

Magic number 3 should be extracted as a named constant. This limit for top items per type is not configurable and the choice of 3 is not explained in comments or documentation.

```python
# Extract first few paragraphs as project overview
lines = content.split('\n')
description = []
for line in lines[1:20]:  # Skip first line (usually title)
```

Copilot AI Nov 5, 2025

Magic number 20 should be extracted as a named constant. This limit for README line processing is hard-coded and not configurable.

```python
"should_save": True,
"suggested_memory": suggested_memory,
"query": query,
"answer_excerpt": answer[:200]
```

Copilot AI Nov 5, 2025

Magic number 200 should be extracted as a named constant. Multiple string truncation limits (200, 500) are used throughout the file without explanation of why these specific values were chosen.

royisme (Owner, Author) commented Nov 5, 2025

@copilot open a new pull request to apply changes based on the comments in this thread and fix this: FAILED tests/test_mcp_integration.py::TestToolDefinitions::test_get_tool_definitions_count - AssertionError: Should have exactly 25 tools

Copilot AI (Contributor) commented Nov 5, 2025

@royisme I've opened a new pull request, #15, to work on those changes. Once the pull request is ready, I'll request review from you.

Copilot AI and others added 3 commits November 5, 2025 19:10
…ble reassignment, update test count

Co-authored-by: royisme <350731+royisme@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Address code review feedback: extract magic numbers and fix variable reassignment bug
@royisme royisme merged commit dbad4d6 into main Nov 5, 2025
3 checks passed
@royisme royisme deleted the claude/auto-extract-memories-011CUq28jWKZ2GA714gTGsj2 branch November 5, 2025 19:24
royisme pushed a commit that referenced this pull request Nov 6, 2025
…n system

BREAKING CHANGE: Default docker-compose.yml now points to minimal mode

## Docker Infrastructure

### Multi-Mode Deployment
- **Minimal**: Code Graph only (No LLM required) - 500MB image
- **Standard**: Code Graph + Memory (Embedding required) - 600MB image
- **Full**: All features (LLM + Embedding) - 800MB image

### Files Added
- docker/Dockerfile.{base,minimal,standard,full}
- docker/docker-compose.{minimal,standard,full}.yml
- docker/.env.template/.env.{minimal,standard,full}
- docker-compose.yml (default, points to minimal)

### Automation
- Makefile with convenience commands (docker-minimal, docker-standard, docker-full)
- scripts/docker-deploy.sh - Interactive deployment wizard
- GitHub Actions for automated Docker builds (royisme/codebase-rag)
- Multi-arch support (AMD64, ARM64)

## Documentation System

### MkDocs Material
- Configured for docs.vantagecraft.dev
- English-first documentation
- Dark mode support
- Search, code highlighting, Mermaid diagrams

### Documentation Pages
- index.md - Homepage with feature comparison table
- getting-started/quickstart.md - 5-minute quick start guide
- deployment/overview.md - Comprehensive mode comparison
- deployment/production.md - Production deployment (K8s, Docker Swarm, Nginx)

### CI/CD
- .github/workflows/docs-deploy.yml - Auto-deploy to GitHub Pages
- .github/workflows/docker-build.yml - Auto-build Docker images
- docs/CNAME - Domain configuration

## Features by Mode

| Feature | Minimal | Standard | Full |
|---------|---------|----------|------|
| Code Graph | ✅ | ✅ | ✅ |
| Memory Store | ❌ | ✅ | ✅ |
| Auto Extraction | ❌ | ❌ | ✅ |
| Knowledge RAG | ❌ | ❌ | ✅ |
| LLM Required | ❌ | ❌ | ✅ |
| Embedding Required | ❌ | ✅ | ✅ |

## Quick Start

```bash
# Minimal deployment (no LLM needed)
make docker-minimal

# Standard deployment (embedding needed)
make docker-standard

# Full deployment (LLM + embedding needed)
make docker-full
```

## Next Steps

Code changes required for dynamic mode switching:
- config.py: Add DeploymentMode enum and validation
- start_mcp.py: Add --mode argument parsing
- mcp_server.py: Dynamic tool registration based on mode
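
As a hedged sketch of what the proposed config.py change might look like (the enum values follow the mode names above; the validation helper and its parameters are assumptions, not the actual implementation):

```python
# Illustrative DeploymentMode enum with per-mode backend requirements,
# mirroring the feature table: minimal needs nothing, standard needs
# embeddings, full needs LLM + embeddings.
from enum import Enum

class DeploymentMode(str, Enum):
    MINIMAL = "minimal"    # Code Graph only
    STANDARD = "standard"  # + Memory Store
    FULL = "full"          # + Auto Extraction, Knowledge RAG

REQUIREMENTS = {
    DeploymentMode.MINIMAL: {"llm": False, "embedding": False},
    DeploymentMode.STANDARD: {"llm": False, "embedding": True},
    DeploymentMode.FULL: {"llm": True, "embedding": True},
}

def validate_mode(mode: DeploymentMode, has_llm: bool, has_embedding: bool) -> None:
    """Raise ValueError if the configured backends can't support `mode`."""
    req = REQUIREMENTS[mode]
    if req["llm"] and not has_llm:
        raise ValueError(f"{mode.value} mode requires an LLM backend")
    if req["embedding"] and not has_embedding:
        raise ValueError(f"{mode.value} mode requires an embedding backend")

validate_mode(DeploymentMode.MINIMAL, has_llm=False, has_embedding=False)  # ok
```

A `--mode` flag in start_mcp.py could then parse its argument with `DeploymentMode(value)` and call this validator before registering tools.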

See DOCKER_IMPLEMENTATION_SUMMARY.md for details.

## Documentation

Will be available at: https://docs.vantagecraft.dev

Related: #14, #15