Implement automatic memory extraction features #14
Comprehensive implementation of automatic memory extraction features using LLM analysis.

Major Features:
- Extract memories from conversations using LLM
- Analyze git commits for decisions and experiences
- Mine code comments (TODO, FIXME, NOTE, DECISION markers)
- Auto-suggest memories from knowledge base queries
- Batch extract from entire repository

Components Added/Modified:
1. services/memory_extractor.py (full implementation)
   - Conversation analysis with confidence scoring
   - Git commit classification and extraction
   - Code comment parsing and categorization
   - Query-based memory suggestions
   - Batch repository scanning
2. api/memory_routes.py (5 new endpoints)
   - POST /api/v1/memory/extract/conversation
   - POST /api/v1/memory/extract/commit
   - POST /api/v1/memory/extract/comments
   - POST /api/v1/memory/suggest
   - POST /api/v1/memory/extract/batch
3. MCP Tools (5 new tools)
   - extract_from_conversation
   - extract_from_git_commit
   - extract_from_code_comments
   - suggest_memory_from_query
   - batch_extract_from_repository
4. Documentation (CLAUDE.md)
   - Updated tool count from 25 to 30
   - Added v0.7 feature documentation
   - API and MCP usage examples

Technical Details:
- Uses LlamaIndex Settings.llm for LLM access
- JSON response parsing from LLM outputs
- Confidence threshold for auto-saving (default 0.7)
- Support for conventional commit types
- AST-based comment extraction for Python
- Pattern matching for other languages
- Git subprocess integration for commit history

Auto-save Logic:
- Memories with confidence >= threshold automatically saved
- Lower confidence memories returned as suggestions
- Metadata tracking for extraction source

Total Changes: 30 MCP tools, 12 memory tools (7 manual + 5 automatic)
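The auto-save rule described above can be sketched roughly as follows. This is a minimal illustration, not the code in services/memory_extractor.py: the function name `partition_by_confidence`, the `confidence` key, and the choice to treat a missing score as 0.0 are all assumptions made for the example. Only the `>=` comparison and the 0.7 default come from the description.

```python
from typing import Any

# Default taken from the PR description ("default 0.7").
DEFAULT_CONFIDENCE_THRESHOLD = 0.7


def partition_by_confidence(
    candidates: list[dict[str, Any]],
    threshold: float = DEFAULT_CONFIDENCE_THRESHOLD,
) -> tuple[list[dict[str, Any]], list[dict[str, Any]]]:
    """Split extracted candidates into (auto_save, suggestions).

    Hypothetical helper: candidates at or above the threshold would be
    saved automatically, the rest returned to the caller as suggestions.
    """
    auto_save: list[dict[str, Any]] = []
    suggestions: list[dict[str, Any]] = []
    for candidate in candidates:
        # Assumption: a candidate without a score is treated as 0.0,
        # so it is never auto-saved.
        if candidate.get("confidence", 0.0) >= threshold:
            auto_save.append(candidate)
        else:
            suggestions.append(candidate)
    return auto_save, suggestions
```

Keeping the split in one small function makes the boundary case (a score exactly equal to the threshold is saved) easy to test in isolation.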
Pull Request Overview
This PR implements comprehensive automatic memory extraction features (v0.7) that leverage LLM analysis to intelligently extract and save project knowledge from various sources. The implementation moves from placeholder skeleton code to full production functionality across conversation analysis, git commit mining, code comment extraction, knowledge query analysis, and batch repository processing.
Key Changes:
- Implemented 5 new automatic extraction methods using LLM-powered analysis with confidence scoring and auto-save capabilities
- Added 5 new API endpoints and 5 new MCP tools for automatic memory extraction
- Enhanced documentation with complete v0.7 feature descriptions and usage examples
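For the git commit mining mentioned in the overview, one plausible first step is to read the Conventional Commits prefix from the subject line before any LLM call. The sketch below is an assumption about how that could look: `classify_commit` and the `COMMIT_TYPE_TO_MEMORY_TYPE` mapping are hypothetical names, and the mapping values are illustrative rather than taken from the PR.

```python
import re
from typing import Optional

# Conventional Commits subject: type, optional (scope), optional "!", then ": description".
_SUBJECT_RE = re.compile(
    r"^(?P<type>[a-z]+)(?:\((?P<scope>[^)]+)\))?(?P<breaking>!)?: (?P<description>.+)$"
)

# Hypothetical mapping from commit type to a memory category (illustrative only).
COMMIT_TYPE_TO_MEMORY_TYPE = {
    "feat": "decision",
    "refactor": "decision",
    "fix": "experience",
    "perf": "experience",
}


def classify_commit(subject: str) -> Optional[dict]:
    """Parse a commit subject; return None when it is not a conventional commit."""
    match = _SUBJECT_RE.match(subject.strip())
    if match is None:
        return None
    commit_type = match.group("type")
    return {
        "type": commit_type,
        "scope": match.group("scope"),
        "breaking": match.group("breaking") is not None,
        "description": match.group("description"),
        # Types without a mapping fall back to a generic category (assumption).
        "memory_type": COMMIT_TYPE_TO_MEMORY_TYPE.get(commit_type, "note"),
    }
```

Commits that do not follow the convention return None, so a caller could skip them or hand them to the LLM unclassified.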
Reviewed Changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| services/memory_extractor.py | Complete implementation of 5 extraction methods with LLM integration, confidence thresholds, git subprocess handling, and AST-based comment parsing |
| api/memory_routes.py | Added 5 new REST endpoints for automatic memory extraction with request validation models |
| mcp_tools/tool_definitions.py | Added 5 new MCP tool definitions for extraction features, updated tool count from 25 to 30 |
| mcp_tools/memory_handlers.py | Implemented 5 handler functions for new extraction tools |
| mcp_tools/__init__.py | Exported new extraction handlers |
| mcp_server.py | Imported memory_extractor service and new handlers |
| CLAUDE.md | Updated documentation with v0.7 feature details, usage examples, and tool count changes |
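To illustrate the comment mining listed for services/memory_extractor.py, here is a small sketch that collects TODO, FIXME, NOTE, and DECISION comments from Python source. It deliberately uses the standard-library `tokenize` module rather than `ast`, because `ast.parse` discards comments. How the PR itself combines AST parsing with comment extraction is not shown in this thread, so treat the function name and the returned dictionary shape as assumptions.

```python
import io
import re
import tokenize

# Markers named in the PR description.
MARKER_RE = re.compile(r"#\s*(TODO|FIXME|NOTE|DECISION)\b[:\s]*(.*)")


def extract_marked_comments(source: str) -> list[dict]:
    """Return marker comments found in Python source text.

    Only real comment tokens are inspected, so marker words inside
    string literals are ignored.
    """
    results = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type != tokenize.COMMENT:
            continue
        match = MARKER_RE.match(tok.string)
        if match:
            results.append(
                {
                    "marker": match.group(1),
                    "text": match.group(2).strip(),
                    "line": tok.start[0],  # 1-based line number
                }
            )
    return results
```

Note that `tokenize` can raise `tokenize.TokenError` or `SyntaxError` on malformed source, so a caller scanning arbitrary repositories would probably want to catch those and fall back to plain pattern matching.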
services/memory_extractor.py
Outdated
| response_text = re.search(r"```json\s*(.*?)\s*```", response_text, re.DOTALL) |
| if response_text: |
| response_text = response_text.group(1) |
| elif "```" in response_text: |
| response_text = re.search(r"```\s*(.*?)\s*```", response_text, re.DOTALL) |
| if response_text: |
| response_text = response_text.group(1) |
Copilot
AI
Nov 5, 2025
Variable reassignment issue: response_text is reassigned from a string to a Match object or None. If re.search() returns None, the subsequent response_text.group(1) will fail. The search result should be stored in a separate variable to avoid this type confusion and potential AttributeError.
Suggested change:
| - response_text = re.search(r"```json\s*(.*?)\s*```", response_text, re.DOTALL) |
| - if response_text: |
| - response_text = response_text.group(1) |
| - elif "```" in response_text: |
| - response_text = re.search(r"```\s*(.*?)\s*```", response_text, re.DOTALL) |
| - if response_text: |
| - response_text = response_text.group(1) |
| + match = re.search(r"```json\s*(.*?)\s*```", response_text, re.DOTALL) |
| + if match: |
| + response_text = match.group(1) |
| + elif "```" in response_text: |
| + match = re.search(r"```\s*(.*?)\s*```", response_text, re.DOTALL) |
| + if match: |
| + response_text = match.group(1) |
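The same idea can be folded into one helper so the response text is never overwritten with a match object. The following is a sketch under assumptions: `parse_llm_json` is a hypothetical name, and it merges the two patterns from the snippet into a single regex with an optional `json` language tag, which is a simplification rather than the code that was merged.

```python
import json
import re

# Three backticks, built programmatically so the example is easy to embed in Markdown.
_FENCE = "`" * 3
# Matches a fenced block with or without a "json" language tag.
_FENCE_RE = re.compile(_FENCE + r"(?:json)?\s*(.*?)\s*" + _FENCE, re.DOTALL)


def parse_llm_json(response_text: str):
    """Parse JSON from an LLM reply that may wrap it in a Markdown code fence.

    The regex result lives in its own variable, so response_text always
    remains a string. json.JSONDecodeError propagates to the caller.
    """
    match = _FENCE_RE.search(response_text)
    payload = match.group(1) if match else response_text.strip()
    return json.loads(payload)
```

When no fence is present the whole reply is parsed as JSON, so a caller should still be prepared to handle `json.JSONDecodeError` for free-form replies.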
services/memory_extractor.py
Outdated
| title=mem_data["title"], |
| content=mem_data["content"], |
| reason=mem_data.get("reason"), |
| tags=mem_data.get("tags", ["code-comment"]) + [Path(file_path).suffix[1:]], |
Copilot
AI
Nov 5, 2025
An empty tag is added when the file has no extension. Path(file_path).suffix returns an empty string for such files, and slicing it with [1:] also yields an empty string (no IndexError is raised), so the tag list ends up containing an empty-string tag. Consider checking that the suffix is non-empty before adding it as a tag.
Suggested change:
| - tags=mem_data.get("tags", ["code-comment"]) + [Path(file_path).suffix[1:]], |
| + tags=mem_data.get("tags", ["code-comment"]) + ([Path(file_path).suffix[1:]] if Path(file_path).suffix else []), |
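A slightly more readable variant of the suggestion is to build the tag list in a small helper. This is only a sketch: `build_comment_tags` is a hypothetical name, and unlike the inline expression it also falls back to the default when an empty tag list is passed and skips an extension tag that is already present.

```python
from pathlib import Path
from typing import Optional


def build_comment_tags(file_path: str, tags: Optional[list[str]] = None) -> list[str]:
    """Return the tags for a code-comment memory, plus the file extension if any."""
    # Copy so the caller's list is never mutated.
    result = list(tags) if tags else ["code-comment"]
    # Path.suffix is "" for files such as "Makefile", so no empty tag is produced.
    extension = Path(file_path).suffix.lstrip(".")
    if extension and extension not in result:
        result.append(extension)
    return result
```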
services/memory_extractor.py
Outdated
| logger.info(f"Analyzing last {max_commits} git commits...") |
| commits = self._get_recent_commits(repo_path, max_commits) |
| for commit in commits[:20]: # Focus on most recent 20 for efficiency |
Copilot
AI
Nov 5, 2025
Magic number 20 should be extracted as a named constant or parameter. The method accepts max_commits parameter but then hard-codes a different limit of 20 for processing. This creates confusion about the actual behavior and makes it difficult to adjust the processing limit.
Suggested change:
| - for commit in commits[:20]: # Focus on most recent 20 for efficiency |
| + for commit in commits[:max_commits]: # Focus on most recent max_commits for efficiency |
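One way to keep a single source of truth for the limit is to apply `max_commits` once, when the history is read, so no second slice is needed later. The sketch below assumes the commits come from `git log` via `subprocess`. The function names and the returned dictionary shape are illustrative, not the PR's `_get_recent_commits`.

```python
import subprocess

# ASCII unit separator: unlikely to appear in a commit subject.
FIELD_SEP = "\x1f"


def parse_git_log(output: str) -> list[dict]:
    """Parse lines of the form '<hash><FIELD_SEP><subject>' into dictionaries."""
    commits = []
    for line in output.split("\n"):
        if not line.strip():
            continue
        sha, _, subject = line.partition(FIELD_SEP)
        commits.append({"hash": sha, "subject": subject})
    return commits


def get_recent_commits(repo_path: str, max_commits: int) -> list[dict]:
    """Return at most max_commits commits, newest first.

    The limit is passed to git itself, so callers do not need to
    truncate the result again.
    """
    result = subprocess.run(
        ["git", "log", "-n", str(max_commits), "--pretty=format:%H%x1f%s"],
        cwd=repo_path,
        capture_output=True,
        text=True,
        check=True,  # raises CalledProcessError if git fails
    )
    return parse_git_log(result.stdout)
```

Separating parsing from the subprocess call keeps the parser testable without a real repository.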
services/memory_extractor.py
Outdated
| source_files.extend(repo_path_obj.rglob(pattern)) |
| # Sample files to avoid overload |
| sampled_files = list(source_files)[:30] |
Copilot
AI
Nov 5, 2025
Magic number 30 should be extracted as a named constant or configurable parameter. This hard-coded limit for file sampling is not exposed to the caller and may need adjustment based on repository size or performance requirements.
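A sketch of what exposing that limit could look like is below. `sample_source_files`, `DEFAULT_MAX_SAMPLED_FILES`, and the default patterns are hypothetical names and values chosen for illustration. Only the number 30 comes from the reviewed code.

```python
from itertools import islice
from pathlib import Path

# Named default replacing the hard-coded 30 (value from the reviewed snippet).
DEFAULT_MAX_SAMPLED_FILES = 30
# Illustrative patterns; the PR's actual pattern list is not shown in this thread.
DEFAULT_PATTERNS = ("*.py", "*.js", "*.ts")


def sample_source_files(
    repo_path: str,
    patterns: tuple[str, ...] = DEFAULT_PATTERNS,
    max_files: int = DEFAULT_MAX_SAMPLED_FILES,
) -> list[Path]:
    """Return up to max_files files under repo_path matching the patterns."""
    repo = Path(repo_path)
    candidates = (
        path
        for pattern in patterns
        for path in repo.rglob(pattern)
        if path.is_file()
    )
    # islice stops the directory walk early instead of listing everything first.
    return sorted(islice(candidates, max_files))
```

Because `rglob` yields files in filesystem order, which files fall inside the limit is not deterministic. Sorting here only orders the files that were already selected.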
services/memory_extractor.py
Outdated
| combined = [] |
| for mem_type, items in grouped.items(): |
| sorted_items = sorted(items, key=lambda x: x.get("importance", 0), reverse=True) |
| combined.extend(sorted_items[:3]) |
Copilot
AI
Nov 5, 2025
Magic number 3 should be extracted as a named constant. This limit for top items per type is not configurable and the choice of 3 is not explained in comments or documentation.
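A possible shape for that change is sketched below, with the limit as a named default and a parameter. The grouping key `"type"`, the fallback category, and the function name are assumptions. The sorting by `importance` mirrors the reviewed snippet.

```python
from collections import defaultdict

# Named default replacing the hard-coded 3 (value from the reviewed snippet).
DEFAULT_TOP_ITEMS_PER_TYPE = 3


def top_items_per_type(
    items: list[dict], limit: int = DEFAULT_TOP_ITEMS_PER_TYPE
) -> list[dict]:
    """Keep the `limit` most important items of each memory type."""
    grouped: dict[str, list[dict]] = defaultdict(list)
    for item in items:
        # Assumption: each item carries its category under "type".
        grouped[item.get("type", "note")].append(item)

    combined: list[dict] = []
    for group in grouped.values():
        ranked = sorted(group, key=lambda x: x.get("importance", 0), reverse=True)
        combined.extend(ranked[:limit])
    return combined
```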
services/memory_extractor.py
Outdated
| # Extract first few paragraphs as project overview |
| lines = content.split('\n') |
| description = [] |
| for line in lines[1:20]: # Skip first line (usually title) |
Copilot
AI
Nov 5, 2025
Magic number 20 should be extracted as a named constant. This limit for README line processing is hard-coded and not configurable.
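As an illustration, the README scan could take its window as a parameter with a named default. The loop body of the original is not visible in the review, so the stopping rule below (stop at the first heading after some text has been collected) is an assumption, as are the names.

```python
# Named default replacing the hard-coded 20 (value from the reviewed snippet).
DEFAULT_README_SCAN_LINES = 20


def extract_readme_overview(
    content: str, max_lines: int = DEFAULT_README_SCAN_LINES
) -> str:
    """Build a short project overview from the top of a README.

    Skips the first line (usually the title) and reads up to max_lines.
    """
    description: list[str] = []
    for line in content.split("\n")[1:max_lines]:
        stripped = line.strip()
        if stripped.startswith("#"):
            # Assumption: the overview ends at the next heading.
            if description:
                break
            continue
        if stripped:
            description.append(stripped)
    return " ".join(description)
```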
services/memory_extractor.py
Outdated
| "should_save": True, |
| "suggested_memory": suggested_memory, |
| "query": query, |
| "answer_excerpt": answer[:200] |
Copilot
AI
Nov 5, 2025
Magic number 200 should be extracted as a named constant. Multiple string truncation limits (200, 500) are used throughout the file without explanation of why these specific values were chosen.
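One way to centralize those limits is a pair of named constants and a single truncation helper. The constant names and the choice to append an ellipsis are assumptions. The reviewed code slices with `answer[:200]` and adds no marker.

```python
# Named limits replacing the bare 200 and 500 mentioned in the review.
# What the 500 limit applies to is not shown in this thread.
ANSWER_EXCERPT_MAX_CHARS = 200
LONG_TEXT_MAX_CHARS = 500


def truncate(text: str, max_chars: int, ellipsis: str = "...") -> str:
    """Shorten text to at most max_chars characters, marking any cut."""
    if max_chars < len(ellipsis):
        raise ValueError("max_chars must be at least the ellipsis length")
    if len(text) <= max_chars:
        return text
    # Reserve room for the ellipsis so the result never exceeds max_chars.
    return text[: max_chars - len(ellipsis)].rstrip() + ellipsis
```

A call such as `truncate(answer, ANSWER_EXCERPT_MAX_CHARS)` would then replace the inline slice, and the reason for each limit can be documented next to its constant.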
@copilot open a new pull request to apply changes based on the comments in this thread, and fix this failing test: FAILED tests/test_mcp_integration.py::TestToolDefinitions::test_get_tool_definitions_count - AssertionError: Should have exactly 25 tools
…ble reassignment, update test count Co-authored-by: royisme <350731+royisme@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Address code review feedback: extract magic numbers and fix variable reassignment bug
…n system
BREAKING CHANGE: Default docker-compose.yml now points to minimal mode
## Docker Infrastructure
### Multi-Mode Deployment
- **Minimal**: Code Graph only (No LLM required) - 500MB image
- **Standard**: Code Graph + Memory (Embedding required) - 600MB image
- **Full**: All features (LLM + Embedding) - 800MB image
### Files Added
- docker/Dockerfile.{base,minimal,standard,full}
- docker/docker-compose.{minimal,standard,full}.yml
- docker/.env.template/.env.{minimal,standard,full}
- docker-compose.yml (default, points to minimal)
### Automation
- Makefile with convenience commands (docker-minimal, docker-standard, docker-full)
- scripts/docker-deploy.sh - Interactive deployment wizard
- GitHub Actions for automated Docker builds (royisme/codebase-rag)
- Multi-arch support (AMD64, ARM64)
## Documentation System
### MkDocs Material
- Configured for docs.vantagecraft.dev
- English-first documentation
- Dark mode support
- Search, code highlighting, Mermaid diagrams
### Documentation Pages
- index.md - Homepage with feature comparison table
- getting-started/quickstart.md - 5-minute quick start guide
- deployment/overview.md - Comprehensive mode comparison
- deployment/production.md - Production deployment (K8s, Docker Swarm, Nginx)
### CI/CD
- .github/workflows/docs-deploy.yml - Auto-deploy to GitHub Pages
- .github/workflows/docker-build.yml - Auto-build Docker images
- docs/CNAME - Domain configuration
## Features by Mode
| Feature | Minimal | Standard | Full |
|---------|---------|----------|------|
| Code Graph | ✅ | ✅ | ✅ |
| Memory Store | ❌ | ✅ | ✅ |
| Auto Extraction | ❌ | ❌ | ✅ |
| Knowledge RAG | ❌ | ❌ | ✅ |
| LLM Required | ❌ | ❌ | ✅ |
| Embedding Required | ❌ | ✅ | ✅ |
## Quick Start
```bash
# Minimal deployment (no LLM needed)
make docker-minimal
# Standard deployment (embedding needed)
make docker-standard
# Full deployment (LLM + embedding needed)
make docker-full
```
## Next Steps
Code changes required for dynamic mode switching:
- config.py: Add DeploymentMode enum and validation
- start_mcp.py: Add --mode argument parsing
- mcp_server.py: Dynamic tool registration based on mode
See DOCKER_IMPLEMENTATION_SUMMARY.md for details.
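The planned changes might look roughly like the sketch below, which combines the `DeploymentMode` enum, `--mode` argument parsing, and a feature lookup derived from the "Features by Mode" table. It is a hedged illustration only: the feature identifiers, `FEATURES_BY_MODE`, `parse_mode`, and `is_enabled` are hypothetical names, and defaulting the CLI to minimal mode is an assumption based on the default docker-compose.yml.

```python
import argparse
from enum import Enum
from typing import Optional, Sequence


class DeploymentMode(str, Enum):
    MINIMAL = "minimal"
    STANDARD = "standard"
    FULL = "full"


# Feature sets transcribed from the "Features by Mode" table (identifiers are illustrative).
FEATURES_BY_MODE = {
    DeploymentMode.MINIMAL: frozenset({"code_graph"}),
    DeploymentMode.STANDARD: frozenset({"code_graph", "memory_store"}),
    DeploymentMode.FULL: frozenset(
        {"code_graph", "memory_store", "auto_extraction", "knowledge_rag"}
    ),
}


def parse_mode(argv: Optional[Sequence[str]] = None) -> DeploymentMode:
    """Read --mode from the command line; invalid values are rejected by argparse."""
    parser = argparse.ArgumentParser(description="Start the MCP server")
    parser.add_argument(
        "--mode",
        choices=[mode.value for mode in DeploymentMode],
        default=DeploymentMode.MINIMAL.value,  # assumption: mirror the Docker default
    )
    args = parser.parse_args(argv)
    return DeploymentMode(args.mode)


def is_enabled(feature: str, mode: DeploymentMode) -> bool:
    """Return True when a feature should be registered in the given mode."""
    return feature in FEATURES_BY_MODE[mode]
```

Dynamic tool registration in mcp_server.py could then consult `is_enabled` before registering each tool group, for example skipping the five extraction tools unless the mode is full.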
## Documentation
Will be available at: https://docs.vantagecraft.dev
Related: #14, #15