Conversation

royisme (Owner) commented Nov 5, 2025

Comprehensive implementation of automatic memory extraction features using LLM analysis.

Major Features:

  • Extract memories from conversations using LLM
  • Analyze git commits for decisions and experiences
  • Mine code comments (TODO, FIXME, NOTE, DECISION markers)
  • Auto-suggest memories from knowledge base queries
  • Batch extract from entire repository

Components Added/Modified:

  1. services/memory_extractor.py (full implementation)

    • Conversation analysis with confidence scoring
    • Git commit classification and extraction
    • Code comment parsing and categorization
    • Query-based memory suggestions
    • Batch repository scanning
  2. api/memory_routes.py (5 new endpoints)

    • POST /api/v1/memory/extract/conversation
    • POST /api/v1/memory/extract/commit
    • POST /api/v1/memory/extract/comments
    • POST /api/v1/memory/suggest
    • POST /api/v1/memory/extract/batch
  3. MCP Tools (5 new tools)

    • extract_from_conversation
    • extract_from_git_commit
    • extract_from_code_comments
    • suggest_memory_from_query
    • batch_extract_from_repository
  4. Documentation (CLAUDE.md)

    • Updated tool count from 25 to 30
    • Added v0.7 feature documentation
    • API and MCP usage examples
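
As a hedged illustration of how one of the new endpoints might be called, the sketch below builds a request for POST /api/v1/memory/extract/conversation. The payload field names (`conversation`, `project_id`, `auto_save`) and the base URL are assumptions for illustration, not the actual request model defined in api/memory_routes.py.

```python
# Hypothetical client-side sketch for POST /api/v1/memory/extract/conversation.
# Field names and the base URL are illustrative assumptions.
import json

BASE_URL = "http://localhost:8000"  # assumed local deployment

payload = {
    "conversation": [
        {"role": "user", "content": "Why did we pick Neo4j over Postgres?"},
        {"role": "assistant", "content": "Graph traversal of call chains is the core query pattern."},
    ],
    "project_id": "codebase-rag",
    "auto_save": True,
}

url = f"{BASE_URL}/api/v1/memory/extract/conversation"
body = json.dumps(payload)

if __name__ == "__main__":
    # Requires a running server; uncomment to actually send:
    # import requests
    # print(requests.post(url, data=body, headers={"Content-Type": "application/json"}).json())
    pass
```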

Technical Details:

  • Uses LlamaIndex Settings.llm for LLM access
  • JSON response parsing from LLM outputs
  • Confidence threshold for auto-saving (default 0.7)
  • Support for conventional commit types
  • AST-based comment extraction for Python
  • Pattern matching for other languages
  • Git subprocess integration for commit history
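
The "pattern matching for other languages" path above can be sketched roughly as follows; the marker regex and function name are illustrative assumptions, not the actual memory_extractor implementation.

```python
# Hedged sketch: scan source lines for TODO/FIXME/NOTE/DECISION markers in
# //, #, or /* comments. Names are illustrative, not the real API.
import re

MARKER_RE = re.compile(r"(?://|#|/\*)\s*(TODO|FIXME|NOTE|DECISION)[:\s](.*)", re.IGNORECASE)

def mine_comment_markers(source: str) -> list[dict]:
    """Return one record per marker comment found in `source`."""
    found = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        m = MARKER_RE.search(line)
        if m:
            found.append({
                "marker": m.group(1).upper(),
                "text": m.group(2).strip().rstrip("*/").strip(),
                "line": lineno,
            })
    return found

sample = """\
int main() {
    // TODO: handle empty input
    /* DECISION: retry with backoff, not fail-fast */
    return 0;
}
"""
print(mine_comment_markers(sample))  # two records: TODO (line 2), DECISION (line 3)
```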

Auto-save Logic:

  • Memories with confidence >= threshold automatically saved
  • Lower confidence memories returned as suggestions
  • Metadata tracking for extraction source
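
A minimal sketch of this auto-save decision, assuming a simple list-of-dicts candidate shape (the names are illustrative, not the service's actual API):

```python
# Candidates at or above the confidence threshold are saved; the rest are
# returned as suggestions, with the extraction source tracked in metadata.
DEFAULT_CONFIDENCE_THRESHOLD = 0.7  # matches the documented default

def partition_memories(candidates, threshold=DEFAULT_CONFIDENCE_THRESHOLD):
    """Split extracted memories into (auto_saved, suggestions)."""
    saved, suggestions = [], []
    for mem in candidates:
        mem = {**mem, "metadata": {"extraction_source": mem.get("source", "unknown")}}
        if mem.get("confidence", 0.0) >= threshold:
            saved.append(mem)
        else:
            suggestions.append(mem)
    return saved, suggestions

candidates = [
    {"title": "Use Neo4j", "confidence": 0.9, "source": "conversation"},
    {"title": "Maybe cache embeddings", "confidence": 0.4, "source": "commit"},
]
saved, suggestions = partition_memories(candidates)
print(len(saved), len(suggestions))  # 1 1
```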

Total Changes: 30 MCP tools, 12 memory tools (7 manual + 5 automatic)

@royisme royisme requested a review from Copilot November 5, 2025 16:20

Copilot AI (Contributor) left a comment

Pull Request Overview

This PR implements comprehensive automatic memory extraction features (v0.7) that leverage LLM analysis to intelligently extract and save project knowledge from various sources. The implementation moves from placeholder skeleton code to full production functionality across conversation analysis, git commit mining, code comment extraction, knowledge query analysis, and batch repository processing.

Key Changes:

  • Implemented 5 new automatic extraction methods using LLM-powered analysis with confidence scoring and auto-save capabilities
  • Added 5 new API endpoints and 5 new MCP tools for automatic memory extraction
  • Enhanced documentation with complete v0.7 feature descriptions and usage examples

Reviewed Changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 7 comments.

| File | Description |
|------|-------------|
| services/memory_extractor.py | Complete implementation of 5 extraction methods with LLM integration, confidence thresholds, git subprocess handling, and AST-based comment parsing |
| api/memory_routes.py | Added 5 new REST endpoints for automatic memory extraction with request validation models |
| mcp_tools/tool_definitions.py | Added 5 new MCP tool definitions for extraction features, updated tool count from 25 to 30 |
| mcp_tools/memory_handlers.py | Implemented 5 handler functions for new extraction tools |
| mcp_tools/__init__.py | Exported new extraction handlers |
| mcp_server.py | Imported memory_extractor service and new handlers |
| CLAUDE.md | Updated documentation with v0.7 feature details, usage examples, and tool count changes |


Comment on lines 639 to 645:

````python
response_text = re.search(r"```json\s*(.*?)\s*```", response_text, re.DOTALL)
if response_text:
    response_text = response_text.group(1)
elif "```" in response_text:
    response_text = re.search(r"```\s*(.*?)\s*```", response_text, re.DOTALL)
    if response_text:
        response_text = response_text.group(1)
````

Copilot AI Nov 5, 2025

Variable reassignment issue: response_text is reassigned from a string to a Match object or None. If re.search() returns None, the subsequent response_text.group(1) will fail. The search result should be stored in a separate variable to avoid this type confusion and potential AttributeError.

Suggested change:

````diff
-response_text = re.search(r"```json\s*(.*?)\s*```", response_text, re.DOTALL)
-if response_text:
-    response_text = response_text.group(1)
-elif "```" in response_text:
-    response_text = re.search(r"```\s*(.*?)\s*```", response_text, re.DOTALL)
-    if response_text:
-        response_text = response_text.group(1)
+match = re.search(r"```json\s*(.*?)\s*```", response_text, re.DOTALL)
+if match:
+    response_text = match.group(1)
+elif "```" in response_text:
+    match = re.search(r"```\s*(.*?)\s*```", response_text, re.DOTALL)
+    if match:
+        response_text = match.group(1)
````

```python
title=mem_data["title"],
content=mem_data["content"],
reason=mem_data.get("reason"),
tags=mem_data.get("tags", ["code-comment"]) + [Path(file_path).suffix[1:]],
```

Copilot AI Nov 5, 2025

Potential empty-string tag if the file has no extension: Path(file_path).suffix returns an empty string for files without extensions, and slicing it with [1:] still yields an empty string, so the tag list gains an empty-string entry. Consider checking that suffix is non-empty before adding it as a tag.

Suggested change:

```diff
-tags=mem_data.get("tags", ["code-comment"]) + [Path(file_path).suffix[1:]],
+tags=mem_data.get("tags", ["code-comment"]) + ([Path(file_path).suffix[1:]] if Path(file_path).suffix else []),
```

```python
logger.info(f"Analyzing last {max_commits} git commits...")
commits = self._get_recent_commits(repo_path, max_commits)

for commit in commits[:20]:  # Focus on most recent 20 for efficiency
```

Copilot AI Nov 5, 2025

Magic number 20 should be extracted as a named constant or parameter. The method accepts max_commits parameter but then hard-codes a different limit of 20 for processing. This creates confusion about the actual behavior and makes it difficult to adjust the processing limit.

Suggested change:

```diff
-for commit in commits[:20]:  # Focus on most recent 20 for efficiency
+for commit in commits[:max_commits]:  # Focus on most recent max_commits for efficiency
```

```python
source_files.extend(repo_path_obj.rglob(pattern))

# Sample files to avoid overload
sampled_files = list(source_files)[:30]
```

Copilot AI Nov 5, 2025

Magic number 30 should be extracted as a named constant or configurable parameter. This hard-coded limit for file sampling is not exposed to the caller and may need adjustment based on repository size or performance requirements.

```python
combined = []
for mem_type, items in grouped.items():
    sorted_items = sorted(items, key=lambda x: x.get("importance", 0), reverse=True)
    combined.extend(sorted_items[:3])
```

Copilot AI Nov 5, 2025

Magic number 3 should be extracted as a named constant. This limit for top items per type is not configurable and the choice of 3 is not explained in comments or documentation.

```python
# Extract first few paragraphs as project overview
lines = content.split('\n')
description = []
for line in lines[1:20]:  # Skip first line (usually title)
```

Copilot AI Nov 5, 2025

Magic number 20 should be extracted as a named constant. This limit for README line processing is hard-coded and not configurable.

```python
"should_save": True,
"suggested_memory": suggested_memory,
"query": query,
"answer_excerpt": answer[:200]
```

Copilot AI Nov 5, 2025

Magic number 200 should be extracted as a named constant. Multiple string truncation limits (200, 500) are used throughout the file without explanation of why these specific values were chosen.

royisme (Owner, Author) commented Nov 5, 2025

@copilot open a new pull request to apply changes based on the comments in this thread and fix this: FAILED tests/test_mcp_integration.py::TestToolDefinitions::test_get_tool_definitions_count - AssertionError: Should have exactly 25 tools

Copilot AI (Contributor) commented Nov 5, 2025

@royisme I've opened a new pull request, #15, to work on those changes. Once the pull request is ready, I'll request review from you.

Copilot AI and others added 3 commits November 5, 2025 19:10
…ble reassignment, update test count

Co-authored-by: royisme <350731+royisme@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Address code review feedback: extract magic numbers and fix variable reassignment bug
@royisme royisme merged commit dbad4d6 into main Nov 5, 2025
3 checks passed
@royisme royisme deleted the claude/auto-extract-memories-011CUq28jWKZ2GA714gTGsj2 branch November 5, 2025 19:24
royisme pushed a commit that referenced this pull request Nov 6, 2025
…n system

BREAKING CHANGE: Default docker-compose.yml now points to minimal mode

## Docker Infrastructure

### Multi-Mode Deployment
- **Minimal**: Code Graph only (No LLM required) - 500MB image
- **Standard**: Code Graph + Memory (Embedding required) - 600MB image
- **Full**: All features (LLM + Embedding) - 800MB image

### Files Added
- docker/Dockerfile.{base,minimal,standard,full}
- docker/docker-compose.{minimal,standard,full}.yml
- docker/.env.template/.env.{minimal,standard,full}
- docker-compose.yml (default, points to minimal)

### Automation
- Makefile with convenience commands (docker-minimal, docker-standard, docker-full)
- scripts/docker-deploy.sh - Interactive deployment wizard
- GitHub Actions for automated Docker builds (royisme/codebase-rag)
- Multi-arch support (AMD64, ARM64)

## Documentation System

### MkDocs Material
- Configured for docs.vantagecraft.dev
- English-first documentation
- Dark mode support
- Search, code highlighting, Mermaid diagrams

### Documentation Pages
- index.md - Homepage with feature comparison table
- getting-started/quickstart.md - 5-minute quick start guide
- deployment/overview.md - Comprehensive mode comparison
- deployment/production.md - Production deployment (K8s, Docker Swarm, Nginx)

### CI/CD
- .github/workflows/docs-deploy.yml - Auto-deploy to GitHub Pages
- .github/workflows/docker-build.yml - Auto-build Docker images
- docs/CNAME - Domain configuration

## Features by Mode

| Feature | Minimal | Standard | Full |
|---------|---------|----------|------|
| Code Graph | ✅ | ✅ | ✅ |
| Memory Store | ❌ | ✅ | ✅ |
| Auto Extraction | ❌ | ❌ | ✅ |
| Knowledge RAG | ❌ | ❌ | ✅ |
| LLM Required | ❌ | ❌ | ✅ |
| Embedding Required | ❌ | ✅ | ✅ |

## Quick Start

```bash
# Minimal deployment (no LLM needed)
make docker-minimal

# Standard deployment (embedding needed)
make docker-standard

# Full deployment (LLM + embedding needed)
make docker-full
```

## Next Steps

Code changes required for dynamic mode switching:
- config.py: Add DeploymentMode enum and validation
- start_mcp.py: Add --mode argument parsing
- mcp_server.py: Dynamic tool registration based on mode
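
As a hedged sketch of what the proposed config.py change might look like (the enum values follow the mode names above; the validation helper and its parameters are assumptions, not the actual implementation):

```python
# Illustrative DeploymentMode enum with per-mode backend requirements,
# mirroring the feature table: minimal needs nothing, standard needs
# embeddings, full needs LLM + embeddings.
from enum import Enum

class DeploymentMode(str, Enum):
    MINIMAL = "minimal"    # Code Graph only
    STANDARD = "standard"  # + Memory Store
    FULL = "full"          # + Auto Extraction, Knowledge RAG

REQUIREMENTS = {
    DeploymentMode.MINIMAL: {"llm": False, "embedding": False},
    DeploymentMode.STANDARD: {"llm": False, "embedding": True},
    DeploymentMode.FULL: {"llm": True, "embedding": True},
}

def validate_mode(mode: DeploymentMode, has_llm: bool, has_embedding: bool) -> None:
    """Raise ValueError if the configured backends can't support `mode`."""
    req = REQUIREMENTS[mode]
    if req["llm"] and not has_llm:
        raise ValueError(f"{mode.value} mode requires an LLM backend")
    if req["embedding"] and not has_embedding:
        raise ValueError(f"{mode.value} mode requires an embedding backend")

validate_mode(DeploymentMode.MINIMAL, has_llm=False, has_embedding=False)  # ok
```

A `--mode` flag in start_mcp.py could then parse its argument with `DeploymentMode(value)` and call this validator before registering tools.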

See DOCKER_IMPLEMENTATION_SUMMARY.md for details.

## Documentation

Will be available at: https://docs.vantagecraft.dev

Related: #14, #15