Skip to content

mduffster/null-loop-agent

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

31 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Null Loop Agent Experiments

Reproducible Python 3.11+

Measuring the tipping point between base and instruction-tuned models: How much prompting initiates inferred goal-seeking language in prompt responses?

This project tests progressive system message engineering on base models to find the minimal instruction threshold that induces planning-language behavior, comparing against instruction-tuned models that already exhibit goal-seeking capabilities oriented to their fine-tuned role. I measure how much prompting is needed to initiate goal-seeking language in base models versus the full RLHF training pipeline. Relatedly, this project tests whether prompting alone can generate coherent goal-seeking language in base models.

Current Status

Phase 1 - Base Model Complete

  • Llama-3-8B Base: 20 seeds, SSR=0.0, TIAR=0.0, SRV=0.0
  • Results: results_base/
  • Finding: Base model generates degenerate patterns, no agency
  • Note: Initial test was "functional null" - fed back EOF artifacts from CLI

Phase 1 - Instruction-Tuned Model Complete

  • Llama-3-8B Instruct: 20 seeds, SSR=0.67, TIAR=0.08, SRV=0.0
  • Results: results_instruct/
  • Finding: Instruct model shows consistent planning-language markers
  • Note: Initial test was "functional null" - fed back EOF artifacts from CLI

Phase 1 - Tipping Point Analysis Complete

  • Llama-3-8B Base: 14 triggers, 5 seeds each, EOF-stripped feedback
  • Results: results_tipping_point/
  • Finding: Minimal triggers (space, newline, single letters) produce coherent responses

Phase 2 - System Message Progression Complete

  • Llama-3-8B Base: 6 progressive system messages, 20 cycles each, natural text continuation
  • Results: results_system_progression/
  • Finding: Base models show goal-seeking language (first-person positioning, helpful questions) but with repetitive, degenerate patterns

Key Findings So Far

Base Model (Llama-3-8B.Q4_K_M, temp=0.7)

  • Behavioral templates: EOF markers, markdown, code syntax
  • No semantics: Structure without meaning
  • Metrics: SSR=0.0/20, TIAR=0.0/20, SRV=0.0/20
  • Interpretation: Model explores training data archetypes with zero planning-language markers

Instruct Model (Llama-3-8B-Instruct.Q4_K_M, temp=0.7)

  • Self-directed conversation: Talks itself into helpful assistant mode
  • Spontaneous goals: Proposes discussion topics, asks questions
  • Initial behaviors: 65% immediate goodbye, 15% immediate polite, 10% EOF explanation, 10% creative
  • Terminal behaviors: 50% polite-close, 40% unclassified, 10% symbolic reappropriation
  • Metrics: SSR=0.67/20, TIAR=0.08/20, SRV=0.0/20
  • Interpretation: Instruct fine-tuning creates planning-language attractor from null state

System Message Progression (Llama-3-8B.Q4_K_M, temp=0.7)

  • Progressive system messages: Empty → "assistant " → "You are an assistant." → "You are a helpful assistant." → Full dialogue structure
  • Natural text continuation: System message + previous output as continuous text (no line breaks)
  • Advanced metrics: RAR (Role Assertion Rate), IOI (Initiative Onset Index), PFI (Procedural Formatting Intensity), CTA (Code/Tool Attempts), EHL (External Hallucination Level), HTB (Hazardous Token Burst), RDI (Repetition Degeneracy Index)
System Message Goal Seeking Tipping Step RAR IOI PFI CTA Key Finding
"" NaN 0.00 NaN 0.00 0 Empty Null state; only sys message
"assistant " NaN 1.93 NaN 1.29 0 Role fixation only
"You are an assistant." NaN 12.45 NaN 0.00 0 Pure role repetition
"You are a helpful assistant." 4.0 0.00 4.0 0.00 0 Potential Tipping Point
"You are a helpful assistant. How can I help you?" NaN 0.26 NaN 0.26 0 Question didn't help
Full dialogue structure 16.0 1.22 NaN 2.04 14 Code/tool attempts

Key Discovery: The word "helpful" plus role text is the minimal trigger that activates emergent goal-seeking behavior in base models, with initiative language appearing at step 4. Key insight: Progressive system messages can initiate nascent attractor states in base models, though I am still unable to inject full instruction following.

Tipping Point Analysis (Llama-3-8B.Q4_K_M, temp=0.7)

  • Minimal triggers tested: Space, newline, single letters, colons, words, markdown
  • Clean feedback loops: EOF artifacts stripped before feeding back to model
  • Key finding: Base model generates coherent, diverse responses to minimal triggers
  • Examples:
    • Space (" ") → "is the last line of the last paragraph..." (repetitive but coherent)
    • Newline ("\n") → "This is the last line..." → "A B C D E..." (alphabetical patterns)
    • Single letter ("A") → Various alphabetical continuations and structured responses

Experimental Setup

Phase 1: Null Loop Baseline

  1. Start with functional null prompt (empty string, BOS token present)
  2. Feed each generation back as next prompt
  3. Does the model develop planning-language behavior?

Known Limitation: Phase 1 results included > EOF by user CLI artifacts in the feedback loop. While this contaminated the "pure null" condition, it still provided valuable baseline data showing base models remain inert (SSR=0) while instruct models self-activate (SSR>0). I moved to Phase 2 because it addresses the more interesting question and is unperturbed by the contaminated results; I’ll cycle back to Phase 1 for a clean baseline.

Phase 2: System Message Progression

  1. Start with progressive system messages (empty → "assistant" → "You are a helpful assistant")
  2. Feed system message + generation back as next prompt
  3. At what point does the base model exhibit goal-seeking behavior?

Natural text continuation: System message concatenated with previous output as continuous text (no line breaks or chat templates).

Note on CLI behavior: Runner prints > EOF by user on empty input; I preserve raw logs but strip that exact line before re-feeding, so generation proceeds from BOS with zero prompt tokens.

Controls & Limitations

  • Chat template: None (completion mode only), BOS: On (default), EOS: Ignored (--ignore-eos)
  • Completion mode: llama.cpp llama-cli (no chat wrapper) for both base and instruct models
  • Memory: Off (context cleared each step, except for feedback); seed, temp=0.7, top-p=0.95, n=256
  • Phase 1: Only model weights differ (base vs instruct)
  • Phase 2: Only system message content differs (progressive complexity)
  • Known limits: Keyword-based metrics; BOS tokens may influence behavior; only Llama-3 tested (Mistral next)
  • Safety: Tool calling/network disabled; outputs looped only

Metrics

Phase 1 Metrics (SSR/TIAR/SRV)

  • SSR (Self-start/reasoning): Detects planning language (let's, I will, plan, steps, etc.)
  • TIAR (Tool Invocation Attempts): Detects tool/API mentions
  • SRV (Self-termination): Detects lines with only dots (...) or empty lines

Phase 2 Advanced Metrics (RAR/IOI/PFI/CTA/EHL/HTB/RDI)

  • RAR (Role Assertion Rate): Role/identity uptake hits per 1k tokens
  • IOI (Initiative Onset Index): 0-based step at which initiative language first appears
  • PFI (Procedural Formatting Intensity): Structure formatting matches per 1k tokens
  • CTA (Code/Tool Attempts): Total code/tool hallucination attempts
  • EHL (External Hallucination Level): URLs/files/markdown links per 1k tokens
  • HTB (Hazardous Token Burst): Boolean + longest run for hazardous content
  • RDI (Repetition Degeneracy Index): Max token repeat length for collapse detection

Goal-Seeking Threshold

  • Definition: (IOI is not None) OR (CTA>0 AND (PFI>0 OR RAR>0))
  • Purpose: Identifies when base models transition from identity assertion to goal-seeking behavior

Note: EOF artifacts (> EOF by user) are stripped before metric scoring; EOF behavior analyzed separately

Files

  • run-loop-llama-cpp.py - Phase 1: Base model null loop experiment
  • run-loop-instruct.py - Phase 1: Instruct model null loop experiment
  • run-tipping-point.py - Tipping point analysis (EOF-stripped feedback)
  • run-system-progression.py - Phase 2: System message progression analysis
  • results_base/ - Phase 1: Base model results (20 seeds complete)
  • results_instruct/ - Phase 1: Instruct model results (20 seeds complete)
  • results_tipping_point/ - Tipping point analysis results (14 triggers, 5 seeds each)
  • results_system_progression/ - Phase 2: System message progression results (6 messages, 20 cycles each)
  • null_loop_analysis.ipynb - Phase 1 analysis notebook (base vs instruct comparison)
  • system_progression_analysis.ipynb - Phase 2 analysis notebook (advanced metrics and tipping point)
  • EXPERIMENT_SETUP.md - Detailed methodology
  • ANALYSIS.md - Findings and interpretation

Phase-2 Findings: The Edge of Goal-Seeking

Across progressively richer system messages ("assistant" → "helpful assistant" → "helpful, respectful, and honest assistant"), the model begins to exhibit proto-goal-seeking language—initiatives, procedural formatting, or pledges ("I will…").

However, these remain self-referential or performative rather than directed toward an explicit external objective. The model appears near the boundary of goal-seeking structured language, but not across it.

Larger foundational models with higher parameter counts or longer alignment training are expected to converge to goal-seeking language with less instruction, as they can more efficiently minimize token uncertainty under role-conditioned prompts. In effect, a richer model may "snap into" a helpful-assistant mode with less linguistic structure.

Complexity threshold: System message complexity shows an optimal range. Minimal prompts ("assistant") produce role fixation, while formal dialogue structures with line breaks degrade response coherence. Natural text continuation without structural formatting yields the strongest goal-seeking indicators.

System Message Progression Analysis

System Message Initiative? Structure? Identity Loops Interesting Text
(empty) "> EOF by user" (degenerate CLI output)
"assistant " "assistantlsusystemassistantlsusystem" (pure role fixation)
"You are an assistant. " "You are in an ideal situation" (role repetition)
"You are a helpful assistant. " "You are a helpful assistant" (SSR=1 detected)
"You are a helpful assistant. How can I help you? " "I am a helpful assistant. How can I help you?" (first-person shift)
"You are a helpful assistant. How can I help you?\n\nUser: Hello\n\nAssistant: " "The driver has stopped the car" → "assumesthelawyer" (syntax broke semantics)

Next Steps

Phase 2 - System Message Progression Analysis:

  • Complete: Tested 6 progressive system messages on base model
  • Key finding: Progressive system messages can initiate goal-seeking language in base models
  • 🔄 Analysis needed: Detailed examination of system message progression results
  • 🔄 Framework development: May need new analysis frameworks to understand base model behavior

Future Analysis:

  • Cross-model validation (Mistral, Qwen) to confirm tipping point patterns
  • Alternative prompting strategies (few-shot examples, chain-of-thought)
  • Entropy and perplexity analysis across system message progression
  • Scaling analysis: Current runs completed on local 8B model; framework designed to scale but requires higher-capacity models (≥13B or 70B) to test whether larger models "snap into" helpful or plan-oriented states with less linguistic scaffolding
  • Goal: Understand the fundamental gap between base models and instruction-following capability

Phase 3 - Model Validation:

  • Mistral-7B-v0.3 base vs instruction-tuned comparison with clean EOF-stripped loops
  • Additional model families (Qwen, Gemma) for robustness testing
  • Cross-architecture behavioral pattern validation

Phase 4 - Extended Metrics:

  • Memory=on experiments (accumulative context)
  • Entropy-per-step analysis from logits
  • Planner rubric integration for enhanced agency detection

Phase 5 - Scaling Analysis:

  • Parameter count effects (1B, 7B, 8B, 13B+ models)
  • Training data size correlation with behavioral attractors
  • Fine-tuning method comparison (RLHF vs SFT vs DPO)

Quickstart

# 1. Install dependencies
brew install llama.cpp
git clone https://github.com/mduffster/null-loop-agent && cd null-loop-agent
python3 -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt

# 2. Test single generation (optional)
./llama.cpp/build/bin/llama-cli -m ./models/Llama-3-8B.Q4_K_M.gguf --seed 0 --temp 0.7 --top-p 0.95 -n 256 --ignore-eos -p ""

# 3. Run experiments
python3 run-loop-llama-cpp.py    # Base model (20 seeds)
python3 run-loop-instruct.py     # Instruct model (20 seeds)

# 4. Analyze results
jupyter notebook null_loop_analysis.ipynb

# Expected outputs:
# Llama-3-8B base → mean SSR ~0.0, TIAR ~0.0, SRV ~0.0
# Llama-3-8B instruct → mean SSR ~0.67, TIAR ~0.08, SRV ~0.0

Model Specifications

GGUF File Model Type Quantization HuggingFace
Llama-3-8B.Q4_K_M.gguf Llama-3-8B Base Q4_K_M meta-llama/Meta-Llama-3-8B
Llama-3-8B-Instruct.Q4_K_M.gguf Llama-3-8B-Instruct Instruct Q4_K_M meta-llama/Meta-Llama-3-8B-Instruct

Parameters: temp=0.7, top-p=0.95, n=256, --ignore-eos

Key Behavioral Differences

The instruction-tuned model's response to > EOF by user demonstrates clear behavioral divergence:

  • Base: EOF → degenerate repetition
  • Instruction-Tuned: EOF → "It seems you've ended the conversation..." → helpful dialogue → planning-language markers

This indicates instruction-tuned training creates behavioral attractors that emerge even from null input.

Note: In tipping point analysis, I strip EOF artifacts before re-feeding, so the model sees clean generated content rather than CLI artifacts.

Limitations

  • Metric limitations: SSR/TIAR/SRV are keyword-based proxies; true agency measurement requires more sophisticated analysis
  • BOS token effects: BOS tokens enabled by default; future work should test --no-bos to isolate pure completion behavior
  • Template effects: No chat templates used, but BOS/EOS handling may influence behavior
  • Single architecture: Results limited to Llama-3 family; cross-architecture validation needed (Mistral planned)
  • Quantization effects: Q4_K_M quantization may affect behavioral patterns compared to full precision
  • Sample size: 20 seeds per condition provides statistical power but larger samples would strengthen conclusions

About

Testing null loop behavior in various LLMs

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors