Add AI bot classification for event enrichment #131
jaredmixpanel wants to merge 3 commits into master from
Part of AI bot classification feature for Ruby SDK.
Add requires to main entry point and rack dev dependency. Part of AI bot classification feature for Ruby SDK.
Codecov Report ❌ Patch coverage is
Additional details and impacted files

| Coverage Diff | master | #131 | +/- |
|---|---|---|---|
| Coverage | 96.29% | 96.38% | +0.09% |
| Files | 12 | 15 | +3 |
| Lines | 567 | 637 | +70 |
| Hits | 546 | 614 | +68 |
| Misses | 21 | 23 | +2 |
Pull request overview
This PR adds AI bot classification functionality to mixpanel-ruby, enabling automatic detection and tracking enrichment for 12 major AI crawler bots. The implementation provides both Rack middleware integration for automatic classification and a module mixin for manual classification via user-agent properties.
Changes:
- Introduces a classifier module that identifies 12 AI bots (GPTBot, ChatGPT-User, OAI-SearchBot, ClaudeBot, Claude-User, Google-Extended, PerplexityBot, Bytespider, CCBot, Applebot-Extended, Meta-ExternalAgent, cohere-ai) with pattern matching and enrichment metadata
- Adds Rack middleware for automatic request classification with thread-local storage and IP extraction
- Provides an optional mixin module for Tracker that automatically enriches events with AI bot classification properties
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| lib/mixpanel-ruby/ai_bot_classifier.rb | Core classification engine with bot database and pattern matching logic |
| lib/mixpanel-ruby/ai_bot_properties.rb | Module mixin that enriches track() calls with classification properties |
| lib/mixpanel-ruby/middleware/ai_bot_classifier.rb | Rack middleware for automatic request classification and thread-local storage |
| lib/mixpanel-ruby.rb | Adds requires for the three new modules |
| mixpanel-ruby.gemspec | Adds rack as development dependency for middleware testing |
| spec/mixpanel-ruby/ai_bot_classifier_spec.rb | Comprehensive tests for classifier covering all 12 bots plus negative cases and edge cases |
| spec/mixpanel-ruby/ai_bot_properties_spec.rb | Tests for event enrichment from both user-agent properties and thread-local classification |
| spec/mixpanel-ruby/middleware/ai_bot_classifier_spec.rb | Tests for middleware behavior including cleanup, passthrough, and IP extraction |
Summary
Adds AI bot classification with Rack middleware that automatically detects AI crawler requests and enriches tracked events with classification properties.
What it does
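- Enriches tracked events with `$is_ai_bot`, `$ai_bot_name`, `$ai_bot_provider`, and `$ai_bot_category` properties
- Matches user agents with case-insensitive patterns (e.g. `/GPTBot\//i`)
AI Bots Detected
GPTBot, ChatGPT-User, OAI-SearchBot, ClaudeBot, Claude-User, Google-Extended, PerplexityBot, Bytespider, CCBot, Applebot-Extended, Meta-ExternalAgent, cohere-ai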
Implementation Details
Architecture
- Module mixin (`AiBotProperties`) for property injection + Rack middleware (`Middleware::AiBotClassifier`) for web apps
- `tracker.extend(Mixpanel::AiBotProperties)` overrides `track()` to inject classification via `super`
- The middleware stores the classification in `env['mixpanel.bot_classification']` (Rack convention) AND `Thread.current[:mixpanel_bot_classification]` (for Tracker access), cleaned up in an `ensure` block
- The classification hash uses symbol keys (`:is_ai_bot`, `:bot_name`, `:provider`, `:category`); these are converted to `$`-prefixed string keys (`$is_ai_bot`, `$ai_bot_name`, etc.) on event properties via `classification_to_properties`
Public API
- `Mixpanel::AiBotClassifier.classify(user_agent)` -- Classifies a single user-agent string. Returns `{ is_ai_bot: true/false, bot_name:, provider:, category: }`.
- `Mixpanel::AiBotClassifier.bot_database` -- Returns a copy of the built-in bot database (array of hashes with `:name`, `:provider`, `:category`, `:description`).
- `Mixpanel::AiBotClassifier.create_classifier(additional_bots: [])` -- Returns a `Proc` (lambda) classifier that checks `additional_bots` before the built-in database.
- `Mixpanel::AiBotProperties` -- Module mixin. Extend a `Tracker` instance to override `track(distinct_id, event, properties = {}, ip = nil)` with automatic AI bot enrichment.
- `Mixpanel::Middleware::AiBotClassifier.new(app, options = {})` -- Rack middleware. Accepts an `:additional_bots` option. Also extracts the client IP from `HTTP_X_FORWARDED_FOR` or `REMOTE_ADDR` and stores it in the classification hash.
Notable Design Decisions
- `AiBotProperties#track` first checks for a `$user_agent` property in the event hash (direct classification), then falls back to `Thread.current[:mixpanel_bot_classification]` from the middleware. This allows both explicit and automatic usage.
- `create_classifier` returns a Proc: The middleware stores a callable (either `Mixpanel::AiBotClassifier.method(:classify)` or the custom Proc) to avoid re-creating the combined bot list on every request.
- Cleanup in `ensure`: The middleware always clears `Thread.current[:mixpanel_bot_classification]` after the request completes, preventing classification from leaking across requests in threaded servers (Puma, etc.).
- The middleware extracts the client IP (`HTTP_X_FORWARDED_FOR` first entry or `REMOTE_ADDR`) and includes it alongside the classification, making the full request context available downstream.
Usage Examples
Automatic Event Enrichment
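A minimal sketch of the mixin pattern described under Architecture, assuming a stand-in `SketchTracker` class in place of `Mixpanel::Tracker` and a one-pattern classifier. `SketchBotProperties`, `SketchTracker`, and the `'OpenAI'`/`'indexing'` values are illustrative names, not the PR's code; only `/GPTBot\//i`, the `$`-prefixed keys, and the thread-local key come from the description above.

```ruby
# Hypothetical stand-in for Mixpanel::AiBotProperties (assumption, not the PR's code).
module SketchBotProperties
  GPTBOT = /GPTBot\//i # the one pattern quoted in the PR description

  # Overrides track(): classify from $user_agent if present, else use the
  # thread-local value a middleware may have stored, then delegate via super.
  def track(distinct_id, event, properties = {}, ip = nil)
    classification =
      if properties.key?('$user_agent')
        SketchBotProperties.classify(properties['$user_agent'])
      else
        Thread.current[:mixpanel_bot_classification]
      end
    extra = classification ? SketchBotProperties.to_properties(classification) : {}
    # Caller-supplied properties are merged last, so they are never overwritten.
    super(distinct_id, event, extra.merge(properties), ip)
  end

  # Placeholder classifier; provider and category values are illustrative.
  def self.classify(user_agent)
    if GPTBOT.match?(user_agent.to_s)
      { is_ai_bot: true, bot_name: 'GPTBot', provider: 'OpenAI', category: 'indexing' }
    else
      { is_ai_bot: false }
    end
  end

  # Converts symbol keys to $-prefixed string keys, dropping absent (nil) fields.
  def self.to_properties(classification)
    {
      '$is_ai_bot' => classification[:is_ai_bot],
      '$ai_bot_name' => classification[:bot_name],
      '$ai_bot_provider' => classification[:provider],
      '$ai_bot_category' => classification[:category]
    }.compact
  end
end

# Stand-in tracker that records calls instead of sending them anywhere.
class SketchTracker
  attr_reader :sent

  def initialize
    @sent = []
  end

  def track(distinct_id, event, properties = {}, ip = nil)
    @sent << [distinct_id, event, properties, ip]
    true
  end
end

tracker = SketchTracker.new
tracker.extend(SketchBotProperties)
tracker.track('user-1', 'Page View', { '$user_agent' => 'GPTBot/1.0' })
```

With the real gem, the equivalent call would presumably be `tracker.extend(Mixpanel::AiBotProperties)` on a `Mixpanel::Tracker` instance, as listed under Public API.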
Rack Middleware
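A self-contained sketch of the middleware behaviour described above: classify the request, store the result in both `env` and `Thread.current`, and clear the thread-local in `ensure`. It needs no `rack` gem, since a Rack app is any object responding to `call(env)`. `SketchBotMiddleware`, the positional `classifier` argument, and the `:ip` key are assumptions made for illustration; the PR's middleware takes `options = {}` with `:additional_bots` instead.

```ruby
# Hypothetical stand-in for Mixpanel::Middleware::AiBotClassifier (not the PR's code).
class SketchBotMiddleware
  ENV_KEY    = 'mixpanel.bot_classification'
  THREAD_KEY = :mixpanel_bot_classification

  # classifier: any callable taking a user-agent string and returning a hash.
  def initialize(app, classifier)
    @app = app
    @classifier = classifier
  end

  def call(env)
    # The :ip key name is an assumption; the PR only says the IP is stored
    # in the classification hash.
    classification = @classifier.call(env['HTTP_USER_AGENT']).merge(ip: client_ip(env))
    env[ENV_KEY] = classification
    Thread.current[THREAD_KEY] = classification
    @app.call(env)
  ensure
    # Runs even if the app raises, so nothing leaks into the next request
    # handled by this thread.
    Thread.current[THREAD_KEY] = nil
  end

  private

  # First X-Forwarded-For entry if present, otherwise REMOTE_ADDR.
  def client_ip(env)
    first = env['HTTP_X_FORWARDED_FOR'].to_s.split(',').first
    first ? first.strip : env['REMOTE_ADDR']
  end
end

# Usage with a placeholder classifier and a trivial downstream app.
classifier = ->(ua) { { is_ai_bot: ua.to_s.match?(/GPTBot\//i) } }
seen = nil
app = lambda do |_env|
  seen = Thread.current[:mixpanel_bot_classification] # visible during the request
  [200, { 'content-type' => 'text/plain' }, ['ok']]
end

env = {
  'HTTP_USER_AGENT' => 'GPTBot/1.0',
  'HTTP_X_FORWARDED_FOR' => '203.0.113.7, 10.0.0.1',
  'REMOTE_ADDR' => '10.0.0.1'
}
status, _headers, _body = SketchBotMiddleware.new(app, classifier).call(env)
```

In a Rack or Rails app the real middleware would presumably be mounted with `use Mixpanel::Middleware::AiBotClassifier`; that line is inferred from the class name in Public API rather than taken from the PR.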
Standalone Classification
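A minimal sketch of what standalone classification could look like, assuming a two-entry database. `SketchClassifier`, the `:pattern` key, the `ClaudeBot` pattern, and the provider/category values are illustrative assumptions; only `/GPTBot\//i` and the shape of the returned hash come from the description above.

```ruby
# Hypothetical stand-in for Mixpanel::AiBotClassifier (not the PR's code).
module SketchClassifier
  # Illustrative database; the real one covers 12 bots.
  BOTS = [
    { pattern: /GPTBot\//i,    name: 'GPTBot',    provider: 'OpenAI',    category: 'indexing' },
    { pattern: /ClaudeBot\//i, name: 'ClaudeBot', provider: 'Anthropic', category: 'indexing' }
  ].freeze

  # Returns a hash with symbol keys; nil and empty input are treated as "not a bot".
  def self.classify(user_agent)
    bot = BOTS.find { |entry| entry[:pattern].match?(user_agent.to_s) }
    if bot
      { is_ai_bot: true, bot_name: bot[:name], provider: bot[:provider], category: bot[:category] }
    else
      { is_ai_bot: false }
    end
  end
end

hit  = SketchClassifier.classify('Mozilla/5.0 (compatible; GPTBot/1.0)')
miss = SketchClassifier.classify('Mozilla/5.0 (X11; Linux x86_64) Chrome/120.0 Safari/537.36')
```

Against the real module the call would be `Mixpanel::AiBotClassifier.classify(user_agent)`, per the Public API list.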
Custom Bot Patterns
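A sketch of how a `create_classifier(additional_bots: [])` factory could return a lambda that checks custom entries before the built-in ones, building the combined list once. `SketchCustomClassifier`, `ExampleBot`, and the `:pattern` key on each entry are assumptions for illustration; the PR description does not show the entry format for `additional_bots`.

```ruby
# Hypothetical stand-in for Mixpanel::AiBotClassifier.create_classifier (not the PR's code).
module SketchCustomClassifier
  BUILT_IN = [
    { pattern: /GPTBot\//i, name: 'GPTBot', provider: 'OpenAI', category: 'indexing' }
  ].freeze

  # Builds the combined list once; additional bots come first, so they win
  # when a user agent matches both a custom and a built-in pattern.
  def self.create_classifier(additional_bots: [])
    combined = (additional_bots + BUILT_IN).freeze
    lambda do |user_agent|
      bot = combined.find { |entry| entry[:pattern].match?(user_agent.to_s) }
      if bot
        { is_ai_bot: true, bot_name: bot[:name], provider: bot[:provider], category: bot[:category] }
      else
        { is_ai_bot: false }
      end
    end
  end
end

custom_bots = [
  # Invented bot, for illustration only.
  { pattern: /ExampleBot\//i, name: 'ExampleBot', provider: 'Example Corp', category: 'indexing' },
  # Shadows the built-in GPTBot entry to show precedence.
  { pattern: /GPTBot\//i, name: 'GPTBot (custom)', provider: 'OpenAI', category: 'indexing' }
]

classify = SketchCustomClassifier.create_classifier(additional_bots: custom_bots)
custom   = classify.call('ExampleBot/2.0')
shadowed = classify.call('GPTBot/1.0')
```

Returning a callable means the combined list is built once per classifier rather than on every call, which matches the design decision noted above.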
Files Added
- `lib/mixpanel-ruby/ai_bot_classifier.rb` -- Core classification module with bot database and matching logic
- `lib/mixpanel-ruby/ai_bot_properties.rb` -- Tracker mixin that overrides `track()` with bot property enrichment
- `lib/mixpanel-ruby/middleware/ai_bot_classifier.rb` -- Rack middleware for automatic request classification
- `spec/mixpanel-ruby/ai_bot_classifier_spec.rb`
- `spec/mixpanel-ruby/ai_bot_properties_spec.rb`
- `spec/mixpanel-ruby/middleware/ai_bot_classifier_spec.rb`
Files Modified
- `lib/mixpanel-ruby.rb` -- Added `require` statements for the three new modules
- `mixpanel-ruby.gemspec` -- Added `rack` as a development dependency
Test Plan
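- Non-AI user agents return `{ is_ai_bot: false }` (Chrome, Googlebot, curl, etc.)
- `nil` inputs handled gracefully (return `{ is_ai_bot: false }`)
- Case-insensitive matching (`gptbot/1.0` matches)
- Custom bots (`additional_bots`) checked before built-in database
- `AiBotProperties#track` enrichment (existing props not overwritten)
- Middleware stores the classification in `env['mixpanel.bot_classification']` and `Thread.current`
- Middleware extracts the IP from `X-Forwarded-For` (first entry) and falls back to `REMOTE_ADDR`
- Run the spec suite (`bundle exec rake spec`)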