Add AI bot classification for event enrichment #131

Open
jaredmixpanel wants to merge 3 commits into master from feature/ai-bot-classification

@jaredmixpanel jaredmixpanel commented Feb 19, 2026

Summary

Adds AI bot classification to the SDK, along with Rack middleware that automatically detects AI crawler requests so that tracked events can be enriched with classification properties.

What it does

  • Classifies user-agent strings against a database of 12 known AI bots
  • Enriches events with $is_ai_bot, $ai_bot_name, $ai_bot_provider, and $ai_bot_category properties
  • Supports custom bot patterns that take priority over built-in patterns
  • Case-insensitive matching via regex patterns (e.g. /GPTBot\//i)
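
A minimal sketch of how case-insensitive, regex-based matching of this kind could work. This is not the PR's code: `SKETCH_BOTS` and `sketch_classify` are hypothetical names, only two of the twelve entries are shown, and the ClaudeBot pattern is assumed by analogy with the `/GPTBot\//i` example above.

```ruby
# Illustrative stand-in for the bot database (hypothetical, two entries only).
SKETCH_BOTS = [
  { pattern: /GPTBot\//i,    name: 'GPTBot',    provider: 'OpenAI',    category: 'indexing' },
  { pattern: /ClaudeBot\//i, name: 'ClaudeBot', provider: 'Anthropic', category: 'indexing' } # assumed pattern
].freeze

# Returns the first entry whose pattern matches, using the symbol-key shape
# described in this PR. nil and empty input are treated as "not a bot".
def sketch_classify(user_agent, bots = SKETCH_BOTS)
  return { is_ai_bot: false } if user_agent.nil? || user_agent.empty?

  bot = bots.find { |entry| entry[:pattern].match?(user_agent) }
  return { is_ai_bot: false } unless bot

  { is_ai_bot: true, bot_name: bot[:name], provider: bot[:provider], category: bot[:category] }
end

lowercase_hit = sketch_classify('gptbot/1.0')            # matches despite the lowercase name
browser_miss  = sketch_classify('Mozilla/5.0 Chrome/120') # no pattern matches
```

Because the `i` flag is on each pattern, callers do not need to normalize the user-agent string before classifying it.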

AI Bots Detected

Bot                  Provider       Category
GPTBot               OpenAI         indexing
ChatGPT-User         OpenAI         retrieval
OAI-SearchBot        OpenAI         indexing
ClaudeBot            Anthropic      indexing
Claude-User          Anthropic      retrieval
Google-Extended      Google         indexing
PerplexityBot        Perplexity     retrieval
Bytespider           ByteDance      indexing
CCBot                Common Crawl   indexing
Applebot-Extended    Apple          indexing
Meta-ExternalAgent   Meta           indexing
cohere-ai            Cohere         indexing

Implementation Details

Architecture

  • Dual approach: module mixin (AiBotProperties) for property injection + Rack middleware (Middleware::AiBotClassifier) for web apps
  • tracker.extend(Mixpanel::AiBotProperties) overrides track() to inject classification via super
  • Rack middleware stores classification in both env['mixpanel.bot_classification'] (Rack convention) and Thread.current[:mixpanel_bot_classification] (for Tracker access); the thread-local is cleared in an ensure block
  • Internal classification uses symbol keys (:is_ai_bot, :bot_name, :provider, :category); converted to $-prefixed string keys ($is_ai_bot, $ai_bot_name, etc.) on event properties via classification_to_properties
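
The mixin approach above can be sketched roughly as follows. Everything here is a hypothetical stand-in: `SketchTracker` replaces Mixpanel::Tracker, `SketchBotProperties` replaces the PR's AiBotProperties, and `KEY_MAP` is one assumed way to perform the symbol-to-`$`-prefixed key conversion that the PR attributes to classification_to_properties.

```ruby
# Hypothetical stand-in for Mixpanel::Tracker: records what track receives.
class SketchTracker
  attr_reader :sent

  def initialize
    @sent = []
  end

  def track(distinct_id, event, properties = {}, ip = nil)
    @sent << [distinct_id, event, properties, ip]
    true
  end
end

# Hypothetical mixin: extending an instance places this track ahead of the
# class's own method, so super reaches the original implementation.
module SketchBotProperties
  KEY_MAP = {
    is_ai_bot: '$is_ai_bot',
    bot_name:  '$ai_bot_name',
    provider:  '$ai_bot_provider',
    category:  '$ai_bot_category'
  }.freeze

  def track(distinct_id, event, properties = {}, ip = nil)
    classification = Thread.current[:mixpanel_bot_classification]
    return super unless classification

    extra = KEY_MAP.each_with_object({}) do |(sym, key), out|
      out[key] = classification[sym] if classification.key?(sym)
    end
    # Caller-supplied properties win on key collisions.
    super(distinct_id, event, extra.merge(properties), ip)
  end
end

tracker = SketchTracker.new
tracker.extend(SketchBotProperties)

Thread.current[:mixpanel_bot_classification] =
  { is_ai_bot: true, bot_name: 'GPTBot', provider: 'OpenAI', category: 'indexing' }
tracker.track('user-123', 'Page View', { 'path' => '/docs' })
Thread.current[:mixpanel_bot_classification] = nil

enriched = tracker.sent.last[2]
```

Extending a single instance keeps the enrichment opt-in: trackers that are not extended behave exactly as before.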

Public API

  • Mixpanel::AiBotClassifier.classify(user_agent) -- Classifies a single user-agent string. Returns { is_ai_bot: true/false, bot_name:, provider:, category: }.
  • Mixpanel::AiBotClassifier.bot_database -- Returns a copy of the built-in bot database (array of hashes with :name, :provider, :category, :description).
  • Mixpanel::AiBotClassifier.create_classifier(additional_bots: []) -- Returns a Proc (lambda) classifier that checks additional_bots before the built-in database.
  • Mixpanel::AiBotProperties -- Module mixin. Extend a Tracker instance to override track(distinct_id, event, properties = {}, ip = nil) with automatic AI bot enrichment.
  • Mixpanel::Middleware::AiBotClassifier.new(app, options = {}) -- Rack middleware. Accepts :additional_bots option. Also extracts client IP from HTTP_X_FORWARDED_FOR or REMOTE_ADDR and stores it in the classification hash.

Notable Design Decisions

  1. Two classification sources with priority: AiBotProperties#track first checks for a $user_agent property in the event hash (direct classification), then falls back to Thread.current[:mixpanel_bot_classification] from middleware. This allows both explicit and automatic usage.
  2. create_classifier returns a Proc: The middleware stores a callable (either Mixpanel::AiBotClassifier.method(:classify) or the custom Proc) to avoid re-creating the combined bot list on every request.
  3. Thread-local cleanup in ensure: The middleware always clears Thread.current[:mixpanel_bot_classification] after the request completes, preventing classification leaking across requests in threaded servers (Puma, etc.).
  4. IP extraction in middleware: The middleware extracts client IP (HTTP_X_FORWARDED_FOR first entry or REMOTE_ADDR) and includes it alongside the classification, making the full request context available downstream.
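
Decisions 2 to 4 can be illustrated with a minimal Rack-style middleware. This is a sketch under assumptions, not the PR's implementation: `SketchBotMiddleware` is a hypothetical class, the classifier is passed in as any callable, and storing the IP under an `:ip` key is an assumption, since the PR only says the IP is stored in the classification hash. It needs no Rack gem because a Rack middleware is just an object that responds to `call(env)`.

```ruby
class SketchBotMiddleware
  def initialize(app, classifier)
    @app = app
    @classifier = classifier # any callable: a Method object or a lambda
  end

  def call(env)
    classification = @classifier.call(env['HTTP_USER_AGENT']).merge(ip: client_ip(env))
    env['mixpanel.bot_classification'] = classification
    Thread.current[:mixpanel_bot_classification] = classification
    @app.call(env)
  ensure
    # Runs on success and on exceptions, so a pooled server thread never
    # carries one request's classification into the next.
    Thread.current[:mixpanel_bot_classification] = nil
  end

  private

  # First X-Forwarded-For entry if present, otherwise REMOTE_ADDR.
  def client_ip(env)
    first = env['HTTP_X_FORWARDED_FOR'].to_s.split(',').first.to_s.strip
    first.empty? ? env['REMOTE_ADDR'] : first
  end
end

# Tiny stand-in classifier and downstream app for demonstration.
sketch_classifier = lambda do |ua|
  ua.to_s.match?(/GPTBot\//i) ? { is_ai_bot: true, bot_name: 'GPTBot' } : { is_ai_bot: false }
end
seen = nil
app = lambda do |_env|
  seen = Thread.current[:mixpanel_bot_classification]
  [200, {}, ['ok']]
end

stack = SketchBotMiddleware.new(app, sketch_classifier)
status, = stack.call(
  'HTTP_USER_AGENT' => 'GPTBot/1.0',
  'HTTP_X_FORWARDED_FOR' => '203.0.113.7, 10.0.0.1',
  'REMOTE_ADDR' => '10.0.0.1'
)
```

While the request is in flight the downstream app sees the classification; once `call` returns or raises, the thread-local is nil again.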

Usage Examples

Automatic Event Enrichment

tracker = Mixpanel::Tracker.new('YOUR_TOKEN')
tracker.extend(Mixpanel::AiBotProperties)

# Classification happens automatically from the $user_agent property
tracker.track('user-123', 'Page View', {
  '$user_agent' => 'Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko) ChatGPT-User/1.0'
})
# => event properties include: $is_ai_bot=true, $ai_bot_name="ChatGPT-User",
#    $ai_bot_provider="OpenAI", $ai_bot_category="retrieval"
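
The lookup order behind this example (an explicit $user_agent property first, then the middleware's thread-local) might be sketched as below. `sketch_resolve_classification` is a hypothetical helper, not a method from the PR, and the classifier argument stands in for whatever callable the mixin actually uses.

```ruby
# Hypothetical helper showing the two-source priority:
# 1. an explicit '$user_agent' event property, classified directly;
# 2. otherwise whatever the middleware left in the thread-local (may be nil).
def sketch_resolve_classification(properties, classifier)
  user_agent = properties['$user_agent']
  return classifier.call(user_agent) if user_agent

  Thread.current[:mixpanel_bot_classification]
end

# Stand-in classifier for demonstration only.
demo_classifier = lambda do |ua|
  ua.match?(/ChatGPT-User\//i) ? { is_ai_bot: true, bot_name: 'ChatGPT-User' } : { is_ai_bot: false }
end

Thread.current[:mixpanel_bot_classification] = { is_ai_bot: true, bot_name: 'GPTBot' }
explicit = sketch_resolve_classification({ '$user_agent' => 'ChatGPT-User/1.0' }, demo_classifier)
fallback = sketch_resolve_classification({ 'path' => '/docs' }, demo_classifier)
Thread.current[:mixpanel_bot_classification] = nil
```

Here `explicit` comes from the event property even though the thread-local holds a different bot, while `fallback` picks up the thread-local value.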

Rack Middleware

# config.ru
require 'mixpanel-ruby'

use Mixpanel::Middleware::AiBotClassifier
run MyApp

# In Rails: config/application.rb
config.middleware.use Mixpanel::Middleware::AiBotClassifier

# Classification is available in env and Thread.current:
# env['mixpanel.bot_classification']
# Thread.current[:mixpanel_bot_classification]

Standalone Classification

result = Mixpanel::AiBotClassifier.classify('ClaudeBot/1.0')
# => { is_ai_bot: true, bot_name: "ClaudeBot", provider: "Anthropic", category: "indexing" }

result = Mixpanel::AiBotClassifier.classify('Mozilla/5.0 Chrome/120')
# => { is_ai_bot: false }

# Inspect the built-in database
Mixpanel::AiBotClassifier.bot_database
# => [{ name: "GPTBot", provider: "OpenAI", category: "indexing", description: "..." }, ...]

Custom Bot Patterns

# Standalone: create_classifier returns a Proc
classifier = Mixpanel::AiBotClassifier.create_classifier(
  additional_bots: [
    { pattern: /MyInternalBot/i, name: 'MyInternalBot', provider: 'Internal', category: 'testing' }
  ]
)
classifier.call('MyInternalBot/1.0')
# => { is_ai_bot: true, bot_name: "MyInternalBot", provider: "Internal", category: "testing" }

# Middleware: pass additional_bots via options hash
use Mixpanel::Middleware::AiBotClassifier, additional_bots: [
  { pattern: /MyInternalBot/i, name: 'MyInternalBot', provider: 'Internal', category: 'testing' }
]
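
One way a create_classifier-style factory could give custom patterns priority is to prepend them to the built-in list once and close over the result. The sketch below makes that assumption; `SKETCH_BUILT_IN` and `sketch_create_classifier` are hypothetical names, and the single built-in entry is illustrative.

```ruby
# Illustrative built-in list (hypothetical, one entry only).
SKETCH_BUILT_IN = [
  { pattern: /GPTBot\//i, name: 'GPTBot', provider: 'OpenAI', category: 'indexing' }
].freeze

def sketch_create_classifier(additional_bots: [])
  # Combined once here and captured by the lambda, so the list is not
  # rebuilt on every call. Custom entries come first, so they win.
  combined = (additional_bots + SKETCH_BUILT_IN).freeze

  lambda do |user_agent|
    bot = combined.find { |entry| entry[:pattern].match?(user_agent.to_s) }
    if bot
      { is_ai_bot: true, bot_name: bot[:name], provider: bot[:provider], category: bot[:category] }
    else
      { is_ai_bot: false }
    end
  end
end

default_classifier = sketch_create_classifier
custom_classifier = sketch_create_classifier(
  additional_bots: [
    { pattern: /GPTBot/i, name: 'InternalGPTProxy', provider: 'Internal', category: 'testing' }
  ]
)

default_result = default_classifier.call('GPTBot/1.0')
custom_result  = custom_classifier.call('GPTBot/1.0')
```

With the same user-agent, `default_result` names GPTBot while `custom_result` names InternalGPTProxy, because the custom pattern is checked first.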

Files Added

  • lib/mixpanel-ruby/ai_bot_classifier.rb -- Core classification module with bot database and matching logic
  • lib/mixpanel-ruby/ai_bot_properties.rb -- Tracker mixin that overrides track() with bot property enrichment
  • lib/mixpanel-ruby/middleware/ai_bot_classifier.rb -- Rack middleware for automatic request classification
  • spec/mixpanel-ruby/ai_bot_classifier_spec.rb
  • spec/mixpanel-ruby/ai_bot_properties_spec.rb
  • spec/mixpanel-ruby/middleware/ai_bot_classifier_spec.rb

Files Modified

  • lib/mixpanel-ruby.rb -- Added require statements for the three new modules
  • mixpanel-ruby.gemspec -- Added rack as a development dependency (used by the middleware specs)

Test Plan

  • All 12 AI bot user-agents correctly classified with correct name, provider, and category
  • Non-AI-bot user-agents return { is_ai_bot: false } (Chrome, Googlebot, curl, etc.)
  • Empty string and nil inputs handled gracefully (return { is_ai_bot: false })
  • Case-insensitive matching works (e.g. gptbot/1.0 matches)
  • Custom bot patterns (via additional_bots) checked before built-in database
  • Event properties preserved through AiBotProperties#track enrichment (existing props not overwritten)
  • Middleware stores classification in both env['mixpanel.bot_classification'] and Thread.current
  • Thread-local is cleaned up after request (no leaking across requests)
  • IP extraction handles X-Forwarded-For (first entry) and falls back to REMOTE_ADDR
  • No regressions in existing test suite (bundle exec rake spec)

Part of AI bot classification feature for Ruby SDK.
Add requires to main entry point and rack dev dependency.
codecov bot commented Feb 19, 2026

Codecov Report

❌ Patch coverage is 97.14286% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 96.38%. Comparing base (a3020d2) to head (e710fed).

Files with missing lines                            Patch %   Lines
lib/mixpanel-ruby/ai_bot_classifier.rb              95.45%    1 Missing ⚠️
lib/mixpanel-ruby/middleware/ai_bot_classifier.rb   96.00%    1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #131      +/-   ##
==========================================
+ Coverage   96.29%   96.38%   +0.09%     
==========================================
  Files          12       15       +3     
  Lines         567      637      +70     
==========================================
+ Hits          546      614      +68     
- Misses         21       23       +2     

Copilot AI left a comment

Pull request overview

This PR adds AI bot classification functionality to mixpanel-ruby, enabling automatic detection and tracking enrichment for 12 major AI crawler bots. The implementation provides both Rack middleware integration for automatic classification and a module mixin for manual classification via user-agent properties.

Changes:

  • Introduces a classifier module that identifies 12 AI bots (GPTBot, ChatGPT-User, OAI-SearchBot, ClaudeBot, Claude-User, Google-Extended, PerplexityBot, Bytespider, CCBot, Applebot-Extended, Meta-ExternalAgent, cohere-ai) with pattern matching and enrichment metadata
  • Adds Rack middleware for automatic request classification with thread-local storage and IP extraction
  • Provides an optional mixin module for Tracker that automatically enriches events with AI bot classification properties

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated no comments.

File                                                      Description
lib/mixpanel-ruby/ai_bot_classifier.rb                    Core classification engine with bot database and pattern matching logic
lib/mixpanel-ruby/ai_bot_properties.rb                    Module mixin that enriches track() calls with classification properties
lib/mixpanel-ruby/middleware/ai_bot_classifier.rb         Rack middleware for automatic request classification and thread-local storage
lib/mixpanel-ruby.rb                                      Adds requires for the three new modules
mixpanel-ruby.gemspec                                     Adds rack as development dependency for middleware testing
spec/mixpanel-ruby/ai_bot_classifier_spec.rb              Comprehensive tests for classifier covering all 12 bots plus negative cases and edge cases
spec/mixpanel-ruby/ai_bot_properties_spec.rb              Tests for event enrichment from both user-agent properties and thread-local classification
spec/mixpanel-ruby/middleware/ai_bot_classifier_spec.rb   Tests for middleware behavior including cleanup, passthrough, and IP extraction
