# Semantic Context Tokens
Read Paper

A hybrid tokenization framework that combines coarse semantic context tokens with fine-grained sub-word tokens to improve narrative cohesion and creativity in Large Language Models (LLMs). This research builds upon Meta's Large Concept Models (LCMs) by exploring a quantized, hierarchical generation paradigm ("big picture to small") mimicking human cognitive processes. This was a collaborative group research project conducted as part of my MSc programme at University College London.

Overview · Methodology · Training Configuration · Results · Key Findings · Citation · License

## Overview

With pre-training data becoming increasingly limited, novel model architectures present a promising avenue to further improve language model performance. Large Concept Models (LCMs) have demonstrated competitive results against traditional LLMs by operating on coarse higher-level semantic representations. Our research expands upon LCMs by introducing a hybrid token approach that combines coarse semantic tokens with fine-grained tokens for autoregressive text generation.

This framework draws inspiration from how humans typically think—starting with a high-level idea before filling in the details. By embedding and clustering sentences to create discrete semantic context tokens, then conditioning language models on these tokens, we encourage a "big picture to small" generation paradigm that benefits creative domains like story generation.

Key Contributions:

  • Coarse-to-Fine Generation: High-level semantic clusters serve as "plans" that condition the generation of detailed text
  • Hierarchical Representation: Inspired by audio generation (e.g., AudioLM) where multi-level tokenization improves global consistency
  • Resource Efficient: Benchmarked using the TinyStories dataset, demonstrating that structural supervision can compensate for limited model scale (1M–33M parameters)
  • Statistically Significant Improvements: 33M parameter models with prepended tokens significantly outperform vanilla baselines (p = 0.029)

## Methodology

### Story Segmentation

Stories are partitioned into chunks using the SAT (Segment Any Text) model, which provides punctuation-agnostic sentence splitting. Chunks are processed as either:

  • 1-sentence segments — Fine-grained semantic units
  • 2-sentence segments — Broader narrative context (best performing)
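
A minimal sketch of this segmentation step, assuming the wtpsplit package and its SaT interface; the "sat-3l-sm" checkpoint name and the chunking helper are illustrative choices, not the project's exact code:

```python
# Sketch: punctuation-agnostic sentence splitting with Segment Any Text (SaT),
# then grouping sentences into 1- or 2-sentence chunks.
from wtpsplit import SaT

sat = SaT("sat-3l-sm")  # lightweight SaT checkpoint (assumed variant)

def segment_story(story: str, sentences_per_chunk: int = 2) -> list[str]:
    """Split a story into chunks of N sentences."""
    sentences = [s.strip() for s in sat.split(story) if s.strip()]
    return [
        " ".join(sentences[i : i + sentences_per_chunk])
        for i in range(0, len(sentences), sentences_per_chunk)
    ]

chunks = segment_story("Once upon a time, there was a very loyal dog. His name was Spot.")
```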

### Data Preprocessing

To ensure semantic clustering focuses on narrative function rather than specific characters:

  • Named Entity Masking: SpaCy NER identifies proper names and replaces them with <unk> tokens
  • Pronoun Substitution: Regex-based masking of gendered pronouns ('she', 'he', 'her', etc.)
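
A minimal sketch of these two masking steps, assuming spaCy's small English pipeline; the pronoun list and the use of `<unk>` for pronouns (the text only specifies `<unk>` for named entities) are assumptions:

```python
# Sketch: mask person names (spaCy NER) and gendered pronouns (regex) so that
# clustering reflects narrative function rather than specific characters.
import re
import spacy

nlp = spacy.load("en_core_web_sm")
PRONOUN_RE = re.compile(r"\b(she|he|her|him|his|hers)\b", flags=re.IGNORECASE)

def mask_segment(text: str) -> str:
    doc = nlp(text)
    # Replace person entities right-to-left so character offsets stay valid.
    for ent in sorted(doc.ents, key=lambda e: e.start_char, reverse=True):
        if ent.label_ == "PERSON":
            text = text[: ent.start_char] + "<unk>" + text[ent.end_char :]
    # Mask gendered pronouns (replacement token assumed to match NER masking).
    return PRONOUN_RE.sub("<unk>", text)

print(mask_segment("His name was Spot. He loved her very much."))
```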

### Embedding & Clustering

Each preprocessed segment is:

  1. Embedded using the Stella model (stella-en-1.5B-v5) into a 1536-dimensional vector space
  2. Clustered via K-Means into k ∈ {5, 7, 15, 32} semantic groups
  3. Mapped to discrete learned tokens (e.g., <cluster_3>)
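
A minimal sketch of this pipeline, assuming the sentence-transformers and scikit-learn packages; the Hugging Face model id below is a guess at the Stella release used:

```python
# Sketch: embed masked segments with a Stella encoder, cluster with K-Means,
# and map each segment to a discrete cluster token.
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

encoder = SentenceTransformer("dunzhang/stella_en_1.5B_v5", trust_remote_code=True)

def cluster_segments(segments: list[str], k: int = 5) -> list[str]:
    embeddings = encoder.encode(segments)                 # (n_segments, dim)
    labels = KMeans(n_clusters=k, random_state=0).fit_predict(embeddings)
    return [f"<cluster_{label}>" for label in labels]     # e.g. "<cluster_3>"
```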

### Token Placement Strategies

| Strategy | Description | Best For |
| --- | --- | --- |
| Prepending | Token appears before the segment, requiring the model to "plan" before generating | Larger models (33M) |
| Postpending | Token appears after the segment as a "pseudo-summary" | Smaller models (1M) |

Example — Prepending (2-sentence segments):

```text
<cluster_3> Once upon a time, there was a very loyal dog. His name was Spot.
<cluster_0> One day, he saw a very shiny rock. He picked it up and started to polish it.
<cluster_4> But then, something peculiar happened. The rock suddenly started to grow larger.
<cluster_2> They had become friends and they were going to stay connected forever.
```
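
A rough sketch of how a training sequence might be assembled under the two placement strategies (the simple whitespace joining is an assumption):

```python
# Sketch: interleave cluster tokens with segments, either before (prepend)
# or after (postpend) each segment.
def build_training_text(segments: list[str], cluster_tokens: list[str],
                        strategy: str = "prepend") -> str:
    parts = []
    for seg, tok in zip(segments, cluster_tokens):
        parts.append(f"{tok} {seg}" if strategy == "prepend" else f"{seg} {tok}")
    return " ".join(parts)
```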

### Emergent Cluster Semantics

Our best-performing configuration (2-sentence, 5 clusters) produced interpretable narrative groupings:

| Cluster | Narrative Function | Example |
| --- | --- | --- |
| 3 | Introduction | "Once upon a time, there was a wise man..." |
| 0 | Plot Setting | "One day, he saw a very shiny rock..." |
| 4 | Twist | "But then, something peculiar happened..." |
| 2 | Resolution | "They had become friends and they were going to stay connected forever." |
| 1 | Ending | "His mom laughed and gave him a hug. The End!" |

## Training Configuration

| Parameter | Value |
| --- | --- |
| Optimizer | AdamW |
| Learning Rate | 5e-6 |
| Batch Size | 4 (with 2-step gradient accumulation) |
| Early Stopping | Patience of 3 validation steps |
| Hardware | NVIDIA T4 GPUs (Google Colab) & RTX 3090 |
| Model Variants | 32 context-token configs + 2 vanilla baselines |
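
A minimal sketch of this fine-tuning setup with the Hugging Face Trainer; the base checkpoint, output path, and evaluation interval are assumptions, and the datasets are expected to be pre-tokenized with labels:

```python
# Sketch: fine-tune a TinyStories-scale causal LM with AdamW, lr 5e-6,
# batch size 4, 2-step gradient accumulation, and early stopping (patience 3).
from transformers import (AutoModelForCausalLM, EarlyStoppingCallback,
                          Trainer, TrainingArguments)

def fine_tune(train_ds, val_ds, base_model: str = "roneneldan/TinyStories-33M"):
    model = AutoModelForCausalLM.from_pretrained(base_model)
    args = TrainingArguments(
        output_dir="ckpt-context-tokens",
        optim="adamw_torch",               # AdamW
        learning_rate=5e-6,
        per_device_train_batch_size=4,
        gradient_accumulation_steps=2,     # effective batch size of 8
        eval_strategy="steps",             # periodic validation for early stopping
        eval_steps=500,
        save_steps=500,
        load_best_model_at_end=True,
        metric_for_best_model="eval_loss",
    )
    trainer = Trainer(
        model=model,
        args=args,
        train_dataset=train_ds,
        eval_dataset=val_ds,
        callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
    )
    trainer.train()
    return trainer
```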

## Results

### Quantitative Evaluation

Models were evaluated using GPT-4o-mini as an LLM judge, rating continuations on grammar, creativity, and consistency (1–10 scale) across 20 evaluation prompts with 10 completions each.

| Model | Grammar | Creativity | Consistency | Average ± SE |
| --- | --- | --- | --- | --- |
| 33M-prepend-sentences_2-clusters_5 | 7.68 | 6.62 | 8.05 | 7.45 ± 0.080 |
| 33M-vanilla-fine-tune | 7.54 | 6.47 | 7.79 | 7.27 ± 0.081 |
| 33M TinyStories (original) | 7.55 | 6.62 | 8.03 | 7.40 ± 0.071 |
| 1M-postpend-sentences_1-clusters_32 | 4.74 | 5.64 | 4.13 | 4.84 ± 0.045 |
| 1M-vanilla-fine-tune | 4.88 | 5.55 | 4.04 | 4.82 ± 0.045 |
| 1M TinyStories (original) | 5.20 | 5.78 | 4.31 | 5.10 ± 0.042 |
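
A minimal sketch of such an LLM-judge call, assuming the OpenAI Python client; the judging prompt and JSON parsing below are illustrative, not the project's verbatim rubric:

```python
# Sketch: score one continuation with GPT-4o-mini on a 1-10 scale per criterion.
import json
from openai import OpenAI

client = OpenAI()

def judge(prompt: str, completion: str) -> dict:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": (
                "Rate the story continuation on grammar, creativity, and "
                "consistency with the prompt, each from 1 to 10. Reply as JSON "
                'like {"grammar": 8, "creativity": 7, "consistency": 9}.\n\n'
                f"Prompt: {prompt}\n\nContinuation: {completion}"
            ),
        }],
        response_format={"type": "json_object"},  # force JSON output
    )
    return json.loads(response.choices[0].message.content)
```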

### Statistical Significance

One-tailed difference-of-means t-tests at the 5% significance level:

| Comparison | p-value | Significant? |
| --- | --- | --- |
| 33M cluster vs. 33M vanilla | 0.029 | Yes |
| 33M cluster vs. 33M original | 0.691 | No |
| 1M cluster vs. 1M vanilla | 0.424 | No |
| 1M cluster vs. 1M original | 2.68 × 10⁻⁵ | Yes (in favour of the original) |
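
For reference, a minimal sketch of one such test with SciPy; the score arrays would be the per-completion average ratings, and the names are illustrative:

```python
# Sketch: one-tailed difference-of-means t-test comparing two models' scores.
from scipy.stats import ttest_ind

def compare_models(cluster_scores, vanilla_scores, alpha: float = 0.05):
    """Test whether the context-token model scores higher on average."""
    t_stat, p_value = ttest_ind(cluster_scores, vanilla_scores, alternative="greater")
    return p_value, p_value < alpha
```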

### Qualitative Comparison

Prompt: "Once upon a time there was a small brown mouse who lived in a big tree. The mouse wanted to be better friends with the birds that flew around the tree, but they were always too afraid to let him get close. One day, the mouse spoke to the birds bravely, 'Please don't be..."

| Model | Completion Excerpt | Scores (G = Grammar, C = Creativity, Co = Consistency) |
| --- | --- | --- |
| 33M + Context Tokens | "...scared. Can someone trust me?" `<cluster_2>` The mouse asked, "Can I trust you?" `<cluster_2>` The birds looked up, and sure enough, the mouse was indeed friendly. "Sure you can trust us," they said. `<cluster_1>` The mouse was so excited..." | G:8 C:7 Co:9 |
| 33M Vanilla | "...scared, my friends, I want to make sure you are okay and always look after me." The birds looked down at the mouse with fear and sadness. Finally the birds' fear filled them with courage..." | G:8 C:7 Co:9 |

## Key Findings

### 1. Model Capacity Matters

The benefits of semantic context tokens are contingent on model capacity:

  • 33M models: Significant improvements over vanilla baselines
  • 1M models: No significant performance gains — limited capacity cannot leverage the additional structure

### 2. Prepending vs. Postpending

  • Prepending works better for larger models (33M) — they can handle the harder "planning" task
  • Postpending works better for smaller models (1M) — simpler "summarization" task within capacity limits

### 3. Structure Over Semantics

The best configuration (2-sentence, 5 clusters) suggests gains may stem from reinforcing narrative structure rather than capturing deeper semantic content. Context tokens may function as structural markers that help models organize narrative progression.

### 4. Data Efficiency

Fine-tuning on just 1% of the original TinyStories corpus (5,000 stories) with context tokens outperformed vanilla fine-tuning, demonstrating the approach's data efficiency.

## Citation

If you use this code or methodology in your research, please cite:

```bibtex
@misc{semantic-context-tokens,
  author = {Group Sem-antics},
  title = {Semantic Context Tokens},
  year = {2025},
  url = {https://github.com/johnamit/semantic-context-tokens}
}
```

Large Concept Models:

```bibtex
@article{barrault2024large,
  title={Large concept models: Language modeling in a sentence representation space},
  author={Barrault, Lo{\"\i}c and Duquenne, Paul-Ambroise and Elbayad, Maha and Kozhevnikov, Artyom and Alastruey, Belen and Andrews, Pierre and Coria, Mariano and Couairon, Guillaume and Costa-juss{\`a}, Marta R and Dale, David and others},
  journal={arXiv preprint arXiv:2412.08821},
  year={2024}
}
```

TinyStories:

```bibtex
@article{eldan2023tinystories,
  title={TinyStories: How small can language models be and still speak coherent English?},
  author={Eldan, Ronen and Li, Yuanzhi},
  journal={arXiv preprint arXiv:2305.07759},
  year={2023}
}
```

AudioLM:

```bibtex
@article{borsos2023audiolm,
  title={AudioLM: a language modeling approach to audio generation},
  author={Borsos, Zal{\'a}n and Marinier, Rapha{\"e}l and Vincent, Damien and Kharitonov, Eugene and Pietquin, Olivier and Sharifi, Matt and Roblek, Dominik and Teboul, Olivier and Grangier, David and Tagliasacchi, Marco and others},
  journal={IEEE/ACM transactions on audio, speech, and language processing},
  volume={31},
  pages={2523--2533},
  year={2023},
  publisher={IEEE}
}
```

Segment Any Text (SAT):

```bibtex
@article{frohmann2024segment,
  title={Segment any text: A universal approach for robust, efficient and adaptable sentence segmentation},
  author={Frohmann, Markus and Sterner, Igor and Vuli{\'c}, Ivan and Minixhofer, Benjamin and Schedl, Markus},
  journal={arXiv preprint arXiv:2406.16678},
  year={2024}
}
```

## License

This project is released under the MIT License.
