Yan Fang1,2,* · Mengcheng Lan2,3,* · Zilong Huang2,† · Weixian Lei2 · Yunqing Zhao2 · Yujie Zhong2 · Yingchen Yu2 · Qi She2 · Yao Zhao1 · Yunchao Wei1,†
Beijing Jiaotong University1 & Bytedance2 & Nanyang Technological University3
TL;DR: GenLIP lets ViT speak. We show that a strong MLLM vision encoder can be pretrained with just one Transformer and one autoregressive language-modeling objective: no contrastive loss, no dual-tower architecture, and no extra text decoder. Despite its simplicity, GenLIP scales effectively and performs well as a vision encoder in MLLMs, with particularly strong gains on document and OCR tasks.
- 2026-05-03: Code released. ✔
```bash
# Clone the repository
git clone https://github.com/YanFangCS/GenLIP
cd GenLIP

# Install dependencies
pip install -r requirements.txt
```

We use several caption datasets in our pretraining process:
- stage1 and stage2:
  - [Infinity-MM](https://huggingface.co/datasets/BAAI/Infinity-MM/tree/main/stage1)
  - [BLIP3o](https://huggingface.co/datasets/BLIP3o/BLIP3o-Pretrain-Long-Caption)
- optional for stage2:
  - [CapRL](https://huggingface.co/datasets/internlm/CapRL-2M)
  - [PLM-Image-Auto](https://huggingface.co/datasets/facebook/PLM-Image-Auto) (only the caption parts)
All of these datasets need to be downloaded and converted into a suitable format before pretraining.
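If you want to script the download step, here is a minimal sketch using the `huggingface_hub` client. The repo IDs come from the list above; the local `data/<name>` directory layout and the `stage1/*` filter for Infinity-MM are assumptions of ours, not something GenLIP prescribes, and the conversion to the training format is still up to you.

```python
# download_datasets.py - a minimal sketch for fetching the caption datasets.
# Assumptions: data is stored under data/<repo-name>, and only the stage1
# split of Infinity-MM is needed; adjust to your own setup.
from huggingface_hub import snapshot_download

DATASETS = {
    "BAAI/Infinity-MM": "stage1/*",                       # stage1 split only
    "BLIP3o/BLIP3o-Pretrain-Long-Caption": None,
    "internlm/CapRL-2M": None,                            # optional, stage2 only
    "facebook/PLM-Image-Auto": None,                      # optional, caption parts only
}

for repo_id, patterns in DATASETS.items():
    snapshot_download(
        repo_id=repo_id,
        repo_type="dataset",
        local_dir=f"data/{repo_id.split('/')[-1]}",
        allow_patterns=patterns,                          # None downloads the whole repo
    )
```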
We provide three model configs in configs/model_configs/genlip:
- genlip_l16_224.json
- genlip_so16_224.json
- genlip_g16_224.json
These are paired with training configurations in configs/pretrain/genlip:
- stage1/train_genlip_*_recap.yaml
- stage2/train_genlip_*_navit.yaml
Remember to update the paths in the above config files to point to your local datasets before starting training.
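If you prefer to patch the configs programmatically, a small sketch follows. The yaml key names used here (`data.train_path`, `train.output_dir`) are placeholders of ours and not necessarily the keys the GenLIP/VeOmni configs actually use, so check the files first.

```python
# update_config_paths.py - a hedged sketch for pointing a training config
# at local data; the key names below are illustrative only.
import yaml

CONFIG = "configs/pretrain/genlip/stage1/train_genlip_so16_224_recap.yaml"

with open(CONFIG) as f:
    cfg = yaml.safe_load(f)

# Hypothetical keys: replace with whatever the config actually defines.
cfg["data"]["train_path"] = "/path/to/your/caption/data"
cfg["train"]["output_dir"] = "/path/to/checkpoints"

with open(CONFIG, "w") as f:
    yaml.safe_dump(cfg, f, sort_keys=False)
```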
A training script is provided in jobs/train.sh. You can start training with:
```bash
bash train.sh <main_func> <model_config>

# an example:
bash train.sh tasks/train_genlip_stage1.py configs/pretrain/genlip/stage1/train_genlip_so16_224_recap.yaml
```

where `<main_func>` is the main training script to be executed and `<model_config>` is the training configuration to be used.
All you need to do is set the paths and appropriate hyperparameters in the config files, then launch the script and wait for training to finish.
The models are available at https://huggingface.co/YanFang/GenLIP .
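As a quick sanity check of the released weights, something like the following may work. Whether the checkpoint loads directly through `transformers.AutoModel` is an assumption on our part; depending on how it is packaged, you may instead need to instantiate the model from the GenLIP code in this repo and load the state dict manually.

```python
# load_genlip.py - a hedged sketch; direct AutoModel loading is assumed,
# not confirmed by this README.
import torch
from transformers import AutoModel

model = AutoModel.from_pretrained(
    "YanFang/GenLIP",
    trust_remote_code=True,        # the repo may ship custom modeling code
    torch_dtype=torch.bfloat16,
)
model.eval()

# Report the parameter count as a basic check that the weights loaded.
n_params = sum(p.numel() for p in model.parameters())
print(f"loaded GenLIP vision encoder with {n_params / 1e6:.0f}M parameters")
```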
Our codebase is built upon:
- [VeOmni](https://github.com/ByteDance-Seed/VeOmni), a simple and high-performance multi-modal model training framework developed by the ByteDance Seed team.
This project is licensed under Apache License 2.0. See the LICENSE file for details.
If you find this project helpful, please give us a star ⭐ and cite our paper:
```bibtex
@article{fang2026letvitspeakgenerative,
  title   = {Let ViT Speak: Generative Language-Image Pre-training},
  author  = {Yan Fang and Mengcheng Lan and Zilong Huang and Weixian Lei and Yunqing Zhao and Yujie Zhong and Yingchen Yu and Qi She and Yao Zhao and Yunchao Wei},
  journal = {arXiv preprint arXiv:2605.00809},
  year    = {2026}
}
```