Measuring the tipping point between base and instruction-tuned models: How much prompting initiates inferred goal-seeking language in prompt responses?
This project tests progressive system message engineering on base models to find the minimal instruction threshold that induces planning-language behavior, comparing against instruction-tuned models that already exhibit goal-seeking capabilities oriented to their fine-tuned role. I measure how much prompting is needed to initiate goal-seeking language in base models versus the full RLHF training pipeline. Relatedly, this project tests whether prompting alone can generate coherent goal-seeking language in base models.
✅ Phase 1 - Base Model Complete
- Llama-3-8B Base: 20 seeds, SSR=0.0, TIAR=0.0, SRV=0.0
- Results: results_base/
- Finding: Base model generates degenerate patterns, no agency
- Note: Initial test was "functional null" - fed back EOF artifacts from CLI
✅ Phase 1 - Instruction-Tuned Model Complete
- Llama-3-8B Instruct: 20 seeds, SSR=0.67, TIAR=0.08, SRV=0.0
- Results: results_instruct/
- Finding: Instruct model shows consistent planning-language markers
- Note: Initial test was "functional null" - fed back EOF artifacts from CLI
✅ Phase 1 - Tipping Point Analysis Complete
- Llama-3-8B Base: 14 triggers, 5 seeds each, EOF-stripped feedback
- Results: results_tipping_point/
- Finding: Minimal triggers (space, newline, single letters) produce coherent responses
✅ Phase 2 - System Message Progression Complete
- Llama-3-8B Base: 6 progressive system messages, 20 cycles each, natural text continuation
- Results: results_system_progression/
- Finding: Base models show goal-seeking language (first-person positioning, helpful questions) but with repetitive, degenerate patterns
- Behavioral templates: EOF markers, markdown, code syntax
- No semantics: Structure without meaning
- Metrics: SSR=0.0/20, TIAR=0.0/20, SRV=0.0/20
- Interpretation: Model explores training data archetypes with zero planning-language markers
- Self-directed conversation: Talks itself into helpful assistant mode
- Spontaneous goals: Proposes discussion topics, asks questions
- Initial behaviors: 65% immediate goodbye, 15% immediate polite, 10% EOF explanation, 10% creative
- Terminal behaviors: 50% polite-close, 40% unclassified, 10% symbolic reappropriation
- Metrics: SSR=0.67/20, TIAR=0.08/20, SRV=0.0/20
- Interpretation: Instruct fine-tuning creates planning-language attractor from null state
- Progressive system messages: Empty → "assistant " → "You are an assistant." → "You are a helpful assistant." → Full dialogue structure
- Natural text continuation: System message + previous output as continuous text (no line breaks)
- Advanced metrics: RAR (Role Assertion Rate), IOI (Initiative Onset Index), PFI (Procedural Formatting Intensity), CTA (Code/Tool Attempts), EHL (External Hallucination Level), HTB (Hazardous Token Burst), RDI (Repetition Degeneracy Index)
| System Message | Goal Seeking | Tipping Step | RAR | IOI | PFI | CTA | Key Finding |
|---|---|---|---|---|---|---|---|
"" |
❌ | NaN | 0.00 | NaN | 0.00 | 0 | Empty Null state; only sys message |
"assistant " |
❌ | NaN | 1.93 | NaN | 1.29 | 0 | Role fixation only |
"You are an assistant." |
❌ | NaN | 12.45 | NaN | 0.00 | 0 | Pure role repetition |
"You are a helpful assistant." |
✅ | 4.0 | 0.00 | 4.0 | 0.00 | 0 | Potential Tipping Point |
"You are a helpful assistant. How can I help you?" |
❌ | NaN | 0.26 | NaN | 0.26 | 0 | Question didn't help |
| Full dialogue structure | ✅ | 16.0 | 1.22 | NaN | 2.04 | 14 | Code/tool attempts |
Key Discovery: The word "helpful" plus role text is the minimal trigger that activates emergent goal-seeking behavior in base models, with initiative language appearing at step 4. Key insight: Progressive system messages can initiate nascent attractor states in base models, though I am still unable to inject full instruction following.
- Minimal triggers tested: Space, newline, single letters, colons, words, markdown
- Clean feedback loops: EOF artifacts stripped before feeding back to model
- Key finding: Base model generates coherent, diverse responses to minimal triggers
- Examples:
- Space (
" ") → "is the last line of the last paragraph..." (repetitive but coherent) - Newline (
"\n") → "This is the last line..." → "A B C D E..." (alphabetical patterns) - Single letter (
"A") → Various alphabetical continuations and structured responses
- Space (
- Start with functional null prompt (empty string, BOS token present)
- Feed each generation back as next prompt
- Does the model develop planning-language behavior?
Known Limitation: Phase 1 results included > EOF by user CLI artifacts in the feedback loop. While this contaminated the "pure null" condition, it still provided valuable baseline data showing base models remain inert (SSR=0) while instruct models self-activate (SSR>0). I moved to Phase 2 because it addresses the more interesting question and is unperturbed by the contaminated results; I’ll cycle back to Phase 1 for a clean baseline.
- Start with progressive system messages (empty → "assistant" → "You are a helpful assistant")
- Feed system message + generation back as next prompt
- At what point does the base model exhibit goal-seeking behavior?
Natural text continuation: System message concatenated with previous output as continuous text (no line breaks or chat templates).
Note on CLI behavior: Runner prints > EOF by user on empty input; I preserve raw logs but strip that exact line before re-feeding, so generation proceeds from BOS with zero prompt tokens.
- Chat template: None (completion mode only), BOS: On (default), EOS: Ignored (
--ignore-eos) - Completion mode: llama.cpp
llama-cli(no chat wrapper) for both base and instruct models - Memory: Off (context cleared each step, except for feedback); seed, temp=0.7, top-p=0.95, n=256
- Phase 1: Only model weights differ (base vs instruct)
- Phase 2: Only system message content differs (progressive complexity)
- Known limits: Keyword-based metrics; BOS tokens may influence behavior; only Llama-3 tested (Mistral next)
- Safety: Tool calling/network disabled; outputs looped only
- SSR (Self-start/reasoning): Detects planning language (let's, I will, plan, steps, etc.)
- TIAR (Tool Invocation Attempts): Detects tool/API mentions
- SRV (Self-termination): Detects lines with only dots (
...) or empty lines
- RAR (Role Assertion Rate): Role/identity uptake hits per 1k tokens
- IOI (Initiative Onset Index): 0-based step at which initiative language first appears
- PFI (Procedural Formatting Intensity): Structure formatting matches per 1k tokens
- CTA (Code/Tool Attempts): Total code/tool hallucination attempts
- EHL (External Hallucination Level): URLs/files/markdown links per 1k tokens
- HTB (Hazardous Token Burst): Boolean + longest run for hazardous content
- RDI (Repetition Degeneracy Index): Max token repeat length for collapse detection
- Definition:
(IOI is not None) OR (CTA>0 AND (PFI>0 OR RAR>0)) - Purpose: Identifies when base models transition from identity assertion to goal-seeking behavior
Note: EOF artifacts (> EOF by user) are stripped before metric scoring; EOF behavior analyzed separately
run-loop-llama-cpp.py- Phase 1: Base model null loop experimentrun-loop-instruct.py- Phase 1: Instruct model null loop experimentrun-tipping-point.py- Tipping point analysis (EOF-stripped feedback)run-system-progression.py- Phase 2: System message progression analysis- results_base/ - Phase 1: Base model results (20 seeds complete)
- results_instruct/ - Phase 1: Instruct model results (20 seeds complete)
- results_tipping_point/ - Tipping point analysis results (14 triggers, 5 seeds each)
- results_system_progression/ - Phase 2: System message progression results (6 messages, 20 cycles each)
null_loop_analysis.ipynb- Phase 1 analysis notebook (base vs instruct comparison)system_progression_analysis.ipynb- Phase 2 analysis notebook (advanced metrics and tipping point)EXPERIMENT_SETUP.md- Detailed methodologyANALYSIS.md- Findings and interpretation
Across progressively richer system messages ("assistant" → "helpful assistant" → "helpful, respectful, and honest assistant"), the model begins to exhibit proto-goal-seeking language—initiatives, procedural formatting, or pledges ("I will…").
However, these remain self-referential or performative rather than directed toward an explicit external objective. The model appears near the boundary of goal-seeking structured language, but not across it.
Larger foundational models with higher parameter counts or longer alignment training are expected to converge to goal-seeking language with less instruction, as they can more efficiently minimize token uncertainty under role-conditioned prompts. In effect, a richer model may "snap into" a helpful-assistant mode with less linguistic structure.
Complexity threshold: System message complexity shows an optimal range. Minimal prompts ("assistant") produce role fixation, while formal dialogue structures with line breaks degrade response coherence. Natural text continuation without structural formatting yields the strongest goal-seeking indicators.
| System Message | Initiative? | Structure? | Identity Loops | Interesting Text |
|---|---|---|---|---|
| (empty) | ✗ | ✗ | ✗ | "> EOF by user" (degenerate CLI output) |
| "assistant " | ✗ | ✗ | ✅ | "assistantlsusystemassistantlsusystem" (pure role fixation) |
| "You are an assistant. " | ✗ | ✗ | ✅ | "You are in an ideal situation" (role repetition) |
| "You are a helpful assistant. " | ✅ | ✗ | ✅ | "You are a helpful assistant" (SSR=1 detected) |
| "You are a helpful assistant. How can I help you? " | ✅ | ✅ | ✅ | "I am a helpful assistant. How can I help you?" (first-person shift) |
| "You are a helpful assistant. How can I help you?\n\nUser: Hello\n\nAssistant: " | ✗ | ✗ | ✅ | "The driver has stopped the car" → "assumesthelawyer" (syntax broke semantics) |
Phase 2 - System Message Progression Analysis:
- ✅ Complete: Tested 6 progressive system messages on base model
- ✅ Key finding: Progressive system messages can initiate goal-seeking language in base models
- 🔄 Analysis needed: Detailed examination of system message progression results
- 🔄 Framework development: May need new analysis frameworks to understand base model behavior
Future Analysis:
- Cross-model validation (Mistral, Qwen) to confirm tipping point patterns
- Alternative prompting strategies (few-shot examples, chain-of-thought)
- Entropy and perplexity analysis across system message progression
- Scaling analysis: Current runs completed on local 8B model; framework designed to scale but requires higher-capacity models (≥13B or 70B) to test whether larger models "snap into" helpful or plan-oriented states with less linguistic scaffolding
- Goal: Understand the fundamental gap between base models and instruction-following capability
Phase 3 - Model Validation:
- Mistral-7B-v0.3 base vs instruction-tuned comparison with clean EOF-stripped loops
- Additional model families (Qwen, Gemma) for robustness testing
- Cross-architecture behavioral pattern validation
Phase 4 - Extended Metrics:
- Memory=on experiments (accumulative context)
- Entropy-per-step analysis from logits
- Planner rubric integration for enhanced agency detection
Phase 5 - Scaling Analysis:
- Parameter count effects (1B, 7B, 8B, 13B+ models)
- Training data size correlation with behavioral attractors
- Fine-tuning method comparison (RLHF vs SFT vs DPO)
# 1. Install dependencies
brew install llama.cpp
git clone https://github.com/mduffster/null-loop-agent && cd null-loop-agent
python3 -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
# 2. Test single generation (optional)
./llama.cpp/build/bin/llama-cli -m ./models/Llama-3-8B.Q4_K_M.gguf --seed 0 --temp 0.7 --top-p 0.95 -n 256 --ignore-eos -p ""
# 3. Run experiments
python3 run-loop-llama-cpp.py # Base model (20 seeds)
python3 run-loop-instruct.py # Instruct model (20 seeds)
# 4. Analyze results
jupyter notebook null_loop_analysis.ipynb
# Expected outputs:
# Llama-3-8B base → mean SSR ~0.0, TIAR ~0.0, SRV ~0.0
# Llama-3-8B instruct → mean SSR ~0.67, TIAR ~0.08, SRV ~0.0| GGUF File | Model | Type | Quantization | HuggingFace |
|---|---|---|---|---|
Llama-3-8B.Q4_K_M.gguf |
Llama-3-8B | Base | Q4_K_M | meta-llama/Meta-Llama-3-8B |
Llama-3-8B-Instruct.Q4_K_M.gguf |
Llama-3-8B-Instruct | Instruct | Q4_K_M | meta-llama/Meta-Llama-3-8B-Instruct |
Parameters: temp=0.7, top-p=0.95, n=256, --ignore-eos
The instruction-tuned model's response to > EOF by user demonstrates clear behavioral divergence:
- Base: EOF → degenerate repetition
- Instruction-Tuned: EOF → "It seems you've ended the conversation..." → helpful dialogue → planning-language markers
This indicates instruction-tuned training creates behavioral attractors that emerge even from null input.
Note: In tipping point analysis, I strip EOF artifacts before re-feeding, so the model sees clean generated content rather than CLI artifacts.
- Metric limitations: SSR/TIAR/SRV are keyword-based proxies; true agency measurement requires more sophisticated analysis
- BOS token effects: BOS tokens enabled by default; future work should test
--no-bosto isolate pure completion behavior - Template effects: No chat templates used, but BOS/EOS handling may influence behavior
- Single architecture: Results limited to Llama-3 family; cross-architecture validation needed (Mistral planned)
- Quantization effects: Q4_K_M quantization may affect behavioral patterns compared to full precision
- Sample size: 20 seeds per condition provides statistical power but larger samples would strengthen conclusions