Yan Fang1,2,* · Mengcheng Lan2,3,* · Zilong Huang2,† · Weixian Lei2 · Yunqing Zhao2 · Yujie Zhong2 · Yingchen Yu2 · Qi She2 · Yao Zhao1 · Yunchao Wei1,†
Beijing Jiaotong University1 & Bytedance2 & Nanyang Technological University3
TL;DR: GenLIP lets ViT speak. We show that a strong MLLM vision encoder can be pretrained with just one Transformer and one autoregressive language-modeling objective: no contrastive loss, no dual-tower architecture, and no extra text decoder. Despite its simplicity, GenLIP scales effectively and performs well as a vision encoder in MLLMs, with particularly strong gains on document and OCR tasks.
- 2026-05-03: Code released. ✔
```bash
# Clone the repository
git clone https://github.com/YanFangCS/GenLIP
cd GenLIP

# Install dependencies
pip install -r requirements.txt
```

We use several caption datasets in our pretraining process:
- stage1 and stage2:
  - [Infinity-MM](https://huggingface.co/datasets/BAAI/Infinity-MM/tree/main/stage1)
  - [BLIP3o](https://huggingface.co/datasets/BLIP3o/BLIP3o-Pretrain-Long-Caption)
- optional for stage2:
  - [CapRL](https://huggingface.co/datasets/internlm/CapRL-2M)
  - [PLM-Image-Auto](https://huggingface.co/datasets/facebook/PLM-Image-Auto) (only the caption parts)
All of these datasets need to be downloaded and converted into a suitable format before pretraining.
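If you want to script the download step, here is a minimal sketch using the `huggingface_hub` client. The repo IDs come from the list above; the local `data/<name>` directory layout and the `stage1/*` filter for Infinity-MM are assumptions of ours, not something GenLIP prescribes, and the conversion to the training format is still up to you.

```python
# download_datasets.py - a minimal sketch for fetching the caption datasets.
# Assumptions: data is stored under data/<repo-name>, and only the stage1
# split of Infinity-MM is needed; adjust to your own setup.
from huggingface_hub import snapshot_download

DATASETS = {
    "BAAI/Infinity-MM": "stage1/*",                       # stage1 split only
    "BLIP3o/BLIP3o-Pretrain-Long-Caption": None,
    "internlm/CapRL-2M": None,                            # optional, stage2 only
    "facebook/PLM-Image-Auto": None,                      # optional, caption parts only
}

for repo_id, patterns in DATASETS.items():
    snapshot_download(
        repo_id=repo_id,
        repo_type="dataset",
        local_dir=f"data/{repo_id.split('/')[-1]}",
        allow_patterns=patterns,                          # None downloads the whole repo
    )
```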
We provide three model configs in configs/model_configs/genlip:
- genlip_l16_224.json
- genlip_so16_224.json
- genlip_g16_224.json
These are paired with training configurations in configs/pretrain/genlip:
- stage1/train_genlip_*_recap.yaml
- stage2/train_genlip_*_navit.yaml
Remember to update the paths in the above config files to point to your local datasets before starting training.
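If you prefer to patch the configs programmatically, a small sketch follows. The yaml key names used here (`data.train_path`, `train.output_dir`) are placeholders of ours and not necessarily the keys the GenLIP/VeOmni configs actually use, so check the files first.

```python
# update_config_paths.py - a hedged sketch for pointing a training config
# at local data; the key names below are illustrative only.
import yaml

CONFIG = "configs/pretrain/genlip/stage1/train_genlip_so16_224_recap.yaml"

with open(CONFIG) as f:
    cfg = yaml.safe_load(f)

# Hypothetical keys: replace with whatever the config actually defines.
cfg["data"]["train_path"] = "/path/to/your/caption/data"
cfg["train"]["output_dir"] = "/path/to/checkpoints"

with open(CONFIG, "w") as f:
    yaml.safe_dump(cfg, f, sort_keys=False)
```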
A training script is provided in jobs/train.sh. You can start training with:
```bash
bash train.sh <main_func> <model_config>

# an example:
bash train.sh tasks/train_genlip_stage1.py configs/pretrain/genlip/stage1/train_genlip_so16_224_recap.yaml
```

where `<main_func>` is the main training script to be executed and `<model_config>` is the training configuration to be used.
All you need to do is set the paths and appropriate hyperparameters in the config files, then launch the script and wait for training to finish.
The models are available at https://huggingface.co/YanFang/GenLIP .
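As a quick sanity check of the released weights, something like the following may work. Whether the checkpoint loads directly through `transformers.AutoModel` is an assumption on our part; depending on how it is packaged, you may instead need to instantiate the model from the GenLIP code in this repo and load the state dict manually.

```python
# load_genlip.py - a hedged sketch; direct AutoModel loading is assumed,
# not confirmed by this README.
import torch
from transformers import AutoModel

model = AutoModel.from_pretrained(
    "YanFang/GenLIP",
    trust_remote_code=True,        # the repo may ship custom modeling code
    torch_dtype=torch.bfloat16,
)
model.eval()

# Report the parameter count as a basic check that the weights loaded.
n_params = sum(p.numel() for p in model.parameters())
print(f"loaded GenLIP vision encoder with {n_params / 1e6:.0f}M parameters")
```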
Our codebase is built upon:
- [VeOmni](https://github.com/ByteDance-Seed/VeOmni), a simple and high-performance multi-modal model training framework developed by the ByteDance Seed team.
This project is licensed under Apache License 2.0. See the LICENSE file for details.
If you find this project helpful, please give us a star ⭐ and cite our paper:
```bibtex
@article{fang2026letvitspeakgenerative,
  title   = {Let ViT Speak: Generative Language-Image Pre-training},
  author  = {Yan Fang and Mengcheng Lan and Zilong Huang and Weixian Lei and Yunqing Zhao and Yujie Zhong and Yingchen Yu and Qi She and Yao Zhao and Yunchao Wei},
  journal = {arXiv preprint arXiv:2605.00809},
  year    = {2026}
}
```