Ultra Sparse Memory Network


Ultra Sparse Memory Network (UltraMem) is a sparse model similar to MoE, but with significantly lower memory access, so inference is fast. However, UltraMem has only matched the performance of 2-expert MoE models, falling significantly short of state-of-the-art 8-expert configurations. We present UltraMemV2, a redesigned memory-layer architecture that closes this performance gap.
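
For intuition, here is a minimal sketch of the product-key style sparse memory layer that UltraMem builds on: each token scores two small key tables, combines them into a virtual grid of slots, and reads only the top-k values, so memory access stays sparse. All class, parameter, and shape choices below are hypothetical defaults for illustration, not the implementation in olmo/ultramem_layer_v2.py.

# Minimal sketch of a product-key sparse memory layer (hypothetical names,
# not the actual UltraMem implementation).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMemorySketch(nn.Module):
    def __init__(self, d_model=768, knum=360, kdim=192, vdim=192, knn=32):
        super().__init__()
        self.query = nn.Linear(d_model, 2 * kdim)      # split into row/col queries
        self.row_keys = nn.Parameter(torch.randn(knum, kdim) * 0.02)
        self.col_keys = nn.Parameter(torch.randn(knum, kdim) * 0.02)
        self.values = nn.Embedding(knum * knum, vdim)  # knum^2 virtual value slots
        self.out = nn.Linear(vdim, d_model)
        self.knum, self.knn = knum, knn

    def forward(self, x):                              # x: (tokens, d_model)
        q_row, q_col = self.query(x).chunk(2, dim=-1)
        s_row = q_row @ self.row_keys.t()              # (tokens, knum)
        s_col = q_col @ self.col_keys.t()              # (tokens, knum)
        # Keep top-k per axis and score only the small candidate grid
        # instead of all knum^2 slots.
        top_r, idx_r = s_row.topk(self.knn, dim=-1)
        top_c, idx_c = s_col.topk(self.knn, dim=-1)
        grid = top_r.unsqueeze(-1) + top_c.unsqueeze(-2)          # (tokens, knn, knn)
        flat_idx = idx_r.unsqueeze(-1) * self.knum + idx_c.unsqueeze(-2)
        score, pick = grid.flatten(1).topk(self.knn, dim=-1)      # final top-k slots
        slot = flat_idx.flatten(1).gather(1, pick)                # (tokens, knn)
        w = F.softmax(score, dim=-1).unsqueeze(-1)                # (tokens, knn, 1)
        v = self.values(slot)                                     # (tokens, knn, vdim)
        return self.out((w * v).sum(dim=1))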

Figure: Overall structure of the 3 sparse layers.

Figure: Flow of Tucker Decomposed Query-Key Retrieval (TDQKR).

Figure: Comprehensive performance comparison across different open-sourced models and benchmarks.
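
As a rough illustration of the TDQKR idea named in the figure above (an assumption-laden sketch, not the repository's code): instead of simply adding row and column key scores as plain product-key retrieval does, the two score sets are combined here through a small learnable Tucker core, so the grid score becomes a low-rank bilinear interaction.

# Hedged sketch of Tucker-decomposed query-key scoring: the grid score is
# assumed to take the form S[i, j] = s_row[:, i]^T @ core @ s_col[:, j]
# with a tiny rank-r core. Names and shapes are illustrative only.
import torch

tokens, knum, rank = 4, 360, 2
s_row = torch.randn(tokens, rank, knum)   # per-rank row key scores
s_col = torch.randn(tokens, rank, knum)   # per-rank column key scores
core = torch.randn(rank, rank)            # learnable Tucker core

# Full grid for clarity; a real implementation only scores top-k candidates.
grid = torch.einsum('trn,rs,tsm->tnm', s_row, core, s_col)  # (tokens, knum, knum)
print(grid.shape)  # torch.Size([4, 360, 360])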

Installation

Please follow the installation instructions of OLMoE: https://github.com/allenai/OLMoE

Training

Key files

olmo/
├── models.py                    # main model
├── memory_plus_layer.py         # Memory+ layer (by Meta)
├── ultramem_layer.py            # UltraMem layer
└── ultramem_layer_v2.py         # UltraMemV2 layer

Dataset

We use the same dataset as OLMoE; see the OLMoE repository for download instructions. Then convert the downloaded data into the format our model expects with the following command:

dolma tokens \
--documents ${PATH_TO_DOWNLOADED_DATA} \
--destination ${PATH_WHERE_TO_SAVE_TOKENIZED_DATA} \
--tokenizer.name_or_path 'allenai/gpt-neox-olmo-dolma-v1_5' \
--max_size '2_147_483_648' \
--seed 0 \
--tokenizer.eos_token_id 50279 \
--tokenizer.pad_token_id 1 \
--processes ${NUMBER_OF_CPU_CORES_TO_USE}
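
To sanity-check the tokenized output, something like the snippet below should work, assuming OLMo-style memmap shards of uint16 token IDs; the file name is hypothetical, so point it at a file dolma tokens actually wrote.

# Sketch: inspect one tokenized shard (assumes OLMo's flat uint16 memmap format).
import numpy as np

tokens = np.memmap("tokenized/part-0-00000.npy", dtype=np.uint16, mode="r")
print(tokens.shape, tokens[:16])   # total token count and the first few IDs
print(int(tokens.max()))           # should stay below the tokenizer vocab size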

Training

You can run the training script below, which trains a 227M/1.2B UltraMemV2 model. Note that we only implement DDP with a fused kernel; it is easy to extend to Megatron with some additional synchronization and communication. A rough sizing of the memory-related flags is sketched after the command.

sh launch.sh ${CONFIG_PATH} \
--save_folder=${SAVE_DIR} \
--run_name=${run_name} \
--save_overwrite=true \
--mount_common_hdfs=true \
--fsdp.sharding_strategy=NO_SHARD \
--canceled_check_interval=9999999 \
--global_indices_file=${CODE_DIR}/global_indices.npy \
--load_path=${CUR_CKPT_PATH} \
--model.init_std=0.02282177322938192 \
--model.init_fn="full_megatron" \
--model.d_model=768 \
--model.n_layers=20 \
--model.n_heads=12 \
--model.n_kv_heads=12 \
--model.weight_tying=true \
--max_duration=5e11T \
--scheduler.t_warmup=1e10 \
--scheduler.t_max=5e11 \
--device_train_microbatch_size=3 \
--global_train_batch_size=768 \
--save_interval=1000 \
--eval_interval=1000 \
--save_num_checkpoints_to_keep=-1 \
--console_log_interval=10 \
\
--model.block_type='sequential' \
--model.mlp_hidden_size=4992 \
\
--optimizer.mem_value_lr_times=4.0 \
--optimizer.mem_value_lr_max_steps_rate=1.0 \
--model.mem_insert_way='full' \
--model.mem_knum=360 \
--model.mem_kdim=192 \
--model.mem_vdim=192 \
--model.mem_pre_vdim=192 \
--model.mem_knn=32 \
--model.mem_head=1 \
--model.mem_share_ratio=0.5 \
--model.mem_type='ultramem_v2' \
--model.mem_value_expand_time=1 \
\
--distributed_strategy=ddp \
--model.init_device='cuda' \
--optimizer.metrics_log_interval=50 \
--model.mem_log_interval=50 \
--save_interval_unsharded=1000
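
To get a feel for the memory-related flags, here is a back-of-the-envelope sizing that assumes, as in product-key memories, that the value table has roughly mem_knum^2 slots and each token reads mem_knn values of dimension mem_vdim; the exact bookkeeping in ultramem_layer_v2.py may differ.

# Rough sizing of one memory table from the flags above (assumption: the
# number of value slots scales as mem_knum ** 2, as in product-key memories).
mem_knum, mem_vdim, mem_knn = 360, 192, 32

slots = mem_knum ** 2                # ~129,600 candidate value slots
value_params = slots * mem_vdim      # ~24.9M parameters in one value table
read_per_token = mem_knn * mem_vdim  # 32 * 192 = 6,144 values fetched per token

print(f"{slots:,} slots, {value_params / 1e6:.1f}M value params, "
      f"{read_per_token:,} values read per token")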

You can also lift the code into your own project; it should be easy to run :)

Citing

If you find this work helpful or use it in your research, please consider citing our paper:

@article{huang2025ultra,
  title={UltraMemV2: Memory Networks Scaling to 120B Parameters with Superior Long-Context Learning},
  author={Huang, Zihao and Bao, Yu and Min, Qiyang and Chen, Siyan and Guo, Ran and Huang, Hongzhi and Zhu, Defa and Zeng, Yutao and Wu, Banggu and Zhou, Xun and Qiao, Siyuan},
  journal={arXiv preprint},
  year={2025}
}
@article{huang2024ultra,
  title={Ultra-Sparse Memory Network},
  author={Huang, Zihao and Min, Qiyang and Huang, Hongzhi and Zhu, Defa and Zeng, Yutao and Guo, Ran and Zhou, Xun},
  journal={ICLR 2025},
  year={2024}
}

Acknowledgement

Our open-sourced work is mainly based on OLMoE; thanks for their work.
