A compilation of resources for keeping up with the latest trends in NLP.
Note: This resource list is a work in progress. More papers and topics will be added regularly. Contributions and suggestions are welcome!
- BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
- GPT-1: Improving Language Understanding by Generative Pre-Training
- GPT-2: Language Models are Unsupervised Multitask Learners
- T5: Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
- XLNet: Generalized Autoregressive Pretraining for Language Understanding
- RoBERTa: A Robustly Optimized BERT Pretraining Approach
- ALBERT: A Lite BERT for Self-supervised Learning of Language Representations
- Longformer: The Long-Document Transformer
- RoFormer: Enhanced Transformer with Rotary Position Embedding
- Language Models are Few-Shot Learners - GPT-3 paper
- Attention Is All You Need
- Memory Is All You Need
- Byte-pair Encoding
- The Illustrated Transformer - Blog
- Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer - MoE paper for LMs
- Fast Transformer Decoding: One Write-Head is All You Need - Multi-Query Attention (MQA) Paper
- GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints - Grouped Query Attention Paper
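The MQA and GQA papers above differ only in how many key/value heads the query heads share. Below is a minimal NumPy sketch of that sharing pattern, with illustrative names and no masking or projection layers; it is not taken from any of the papers' reference code.

```python
import numpy as np

def grouped_query_attention(q, k, v):
    """q: (num_q_heads, seq, d); k, v: (num_kv_heads, seq, d), where num_q_heads is a
    multiple of num_kv_heads. num_kv_heads == num_q_heads recovers multi-head attention
    (MHA); num_kv_heads == 1 recovers multi-query attention (MQA)."""
    num_q_heads, _, d = q.shape
    num_kv_heads = k.shape[0]
    group_size = num_q_heads // num_kv_heads
    out = np.empty_like(q)
    for h in range(num_q_heads):
        kv = h // group_size                          # query heads in a group share one K/V head
        scores = q[h] @ k[kv].T / np.sqrt(d)          # scaled dot-product scores, (seq, seq)
        scores -= scores.max(axis=-1, keepdims=True)  # numerically stable softmax
        weights = np.exp(scores)
        weights /= weights.sum(axis=-1, keepdims=True)
        out[h] = weights @ v[kv]
    return out

# 8 query heads sharing 2 K/V heads (GQA); causal masking is omitted for brevity.
q, k, v = np.random.randn(8, 16, 64), np.random.randn(2, 16, 64), np.random.randn(2, 16, 64)
print(grouped_query_attention(q, k, v).shape)  # (8, 16, 64)
```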
- Basics of RL - OpenAI
- Reinforcement Learning with Human Feedback: Learning Dynamic Choices via Pessimism
- Training Language Models to Follow Instructions with Human Feedback - InstructGPT paper
- Deep Reinforcement Learning from Human Preferences
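Both the InstructGPT paper and "Deep Reinforcement Learning from Human Preferences" fit a reward model on pairwise comparisons with a Bradley-Terry-style logistic loss. A minimal sketch, assuming precomputed scalar scores per completion; the function and variable names are illustrative, not from any reference implementation.

```python
import torch
import torch.nn.functional as F

def reward_model_loss(chosen_scores, rejected_scores):
    """chosen_scores / rejected_scores: scalar rewards r(x, y) for the preferred and
    dispreferred completion of each comparison, shape (batch,)."""
    # Maximize the margin between preferred and dispreferred completions.
    return -F.logsigmoid(chosen_scores - rejected_scores).mean()

loss = reward_model_loss(torch.tensor([1.2, 0.3]), torch.tensor([0.4, 0.9]))
```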
DPO (Direct Preference Optimization):
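DPO ("Direct Preference Optimization: Your Language Model is Secretly a Reward Model") applies the same logistic comparison loss directly to log-probability ratios between the policy and a frozen reference model, skipping the explicit reward model and RL loop. A minimal sketch on precomputed per-sequence log-probabilities; the names and the beta value are illustrative.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Each argument is a tensor of per-sequence log-probabilities, shape (batch,)."""
    # Implicit rewards are scaled log-ratios between the policy and the frozen reference.
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    # Logistic loss on the reward margin, as in the pairwise reward-model objective.
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()

loss = dpo_loss(torch.tensor([-12.3]), torch.tensor([-15.1]),
                torch.tensor([-13.0]), torch.tensor([-14.2]))
```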
PPO:
- Proximal Policy Optimization Algorithms
- PPO Docs - OpenAI
- Understanding PPO from First Principles - Blog
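The core of the PPO paper is the clipped surrogate objective, which limits how far the updated policy can move from the behavior policy in a single step. A minimal sketch with illustrative names; the value-function and entropy terms are omitted.

```python
import torch

def ppo_clip_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    """logp_new / logp_old: log-probs of the taken actions under the current and
    behavior policies; advantages: estimated advantages (e.g. from GAE)."""
    ratio = torch.exp(logp_new - logp_old)                          # importance-sampling ratio
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    # Take the pessimistic (minimum) surrogate and negate it to get a loss to minimize.
    return -torch.min(unclipped, clipped).mean()

loss = ppo_clip_loss(torch.tensor([-1.1, -0.7]), torch.tensor([-1.0, -0.9]),
                     torch.tensor([0.5, -0.2]))
```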
GRPO (Group Relative Policy Optimization):
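GRPO (introduced in the DeepSeekMath paper) drops the learned value function: for each prompt, a group of completions is sampled and each completion's advantage is its reward standardized within that group, which then feeds a PPO-style clipped objective like the one above. A minimal sketch of the group-relative advantages, with illustrative names.

```python
import torch

def group_relative_advantages(rewards, eps=1e-6):
    """rewards: (num_prompts, group_size) scalar rewards, one per sampled completion."""
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    # Standardize each completion's reward against its own group.
    return (rewards - mean) / (std + eps)

adv = group_relative_advantages(torch.tensor([[1.0, 0.0, 0.5, 1.0]]))
```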
- Basic Mechanistic Interpretability Essay
- Toy Neural Nets with Low-Dimensional Inputs
- Mechanistic Interpretability for AI Safety Review
- A Mathematical Framework for Transformer Circuits
- Circuit Tracing: Revealing Computational Graphs in Language Models
- Scaling Laws for Neural Language Models (see the worked example after this list)
- Scaling Laws for Autoregressive Generative Modeling
- Scaling Laws of Synthetic Data for Language Models
- Scaling Laws for Transfer
- Unified Scaling Laws for Routed Language Models - Scaling laws for MoEs
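As a concrete example of what these papers fit, "Scaling Laws for Neural Language Models" models loss as a power law in non-embedding parameter count, L(N) = (N_c / N)^alpha_N. The constants below are the approximate fits reported in that paper and are used here purely for illustration.

```python
# Approximate values from Kaplan et al. (2020); treat as illustrative.
alpha_N = 0.076
N_c = 8.8e13  # non-embedding parameters

def loss_from_params(n_params):
    """Predicted cross-entropy loss (nats/token) at convergence, in the data-unlimited regime."""
    return (N_c / n_params) ** alpha_N

for n in (1e8, 1e9, 1e10, 1e11):
    print(f"N = {n:.0e} params -> predicted loss ≈ {loss_from_params(n):.2f}")
```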
- Mixed Precision Training (see the sketch at the end of this list)
- Matrix Multiplication - Nvidia Blog
- Understanding GPU Performance - Nvidia Blog
- How to Train Really Large Models on Many GPUs? - Blog
- Efficiently Scaling Transformer Inference
- DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving
- SARATHI: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills
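The key trick in the "Mixed Precision Training" paper is loss scaling, which keeps small FP16 gradients from underflowing to zero. A minimal PyTorch AMP sketch of the idea; it assumes a CUDA GPU, the model and data are dummies, and AMP is used here as a convenient stand-in for the paper's hand-rolled recipe.

```python
import torch
from torch import nn

model = nn.Linear(1024, 1024).cuda()          # dummy model; requires a CUDA GPU
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler()          # dynamic loss scaling to avoid FP16 underflow

for step in range(10):
    x = torch.randn(32, 1024, device="cuda")
    target = torch.randn(32, 1024, device="cuda")
    optimizer.zero_grad(set_to_none=True)
    with torch.cuda.amp.autocast():           # forward pass runs in reduced precision where safe
        loss = nn.functional.mse_loss(model(x), target)
    scaler.scale(loss).backward()             # scale the loss, backprop scaled gradients
    scaler.step(optimizer)                    # unscale gradients, skip the step on inf/nan
    scaler.update()                           # adjust the loss scale for the next step
```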