Let ViT Speak: Generative Language-Image Pre-training

Yan Fang1,2,* · Mengcheng Lan2,3,* · Zilong Huang2,† · Weixian Lei2 · Yunqing Zhao2 · Yujie Zhong2 · Yingchen Yu2 · Qi She2 · Yao Zhao1 · Yunchao Wei1,†

1 Beijing Jiaotong University & 2 ByteDance & 3 Nanyang Technological University

Home Page · Paper (arXiv) · Model (HuggingFace)

TL;DR: GenLIP lets ViT speak. We show that a strong MLLM vision encoder can be pretrained with just one Transformer and one autoregressive language modeling objective: no contrastive loss, no dual-tower architecture, and no extra text decoder. Despite its simplicity, GenLIP scales effectively and performs well as a vision encoder in MLLMs, with particularly strong gains on Doc & OCR tasks.

(Teaser figure)

News

  • 2026-05-03: Code released. [✔]

Installation

# Clone the repository
git clone https://github.com/YanFangCS/GenLIP
cd GenLIP

# Install dependencies
pip install -r requirements.txt
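
If you prefer to keep dependencies isolated, you can set up a standard Python virtual environment before running the pip install above (a minimal sketch; Python 3 assumed):

# Optional: create and activate a virtual environment first
python -m venv .venv
source .venv/bin/activate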

Datasets

We use several caption datasets for pretraining:

  1. stage1
  2. stage2

Optional for stage 2:

All of these datasets need to be downloaded and converted into suitable formats before pretraining.
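
As a purely hypothetical sketch (the actual directory names and file formats are not specified here; match whatever paths your config files expect):

# Hypothetical local layout for the two pretraining stages
mkdir -p data/stage1 data/stage2
# place the downloaded and reformatted caption data under these directories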

Configuration

We provide three model configs in configs/model_configs/genlip:

  • genlip_l16_224.json
  • genlip_so16_224.json
  • genlip_g16_224.json

Matching training configurations are provided in configs/pretrain/genlip:

  • stage1/train_genlip_*_recap.yaml
  • stage2/train_genlip_*_navit.yaml

Remember to update the paths in the above config files to point to your local datasets before starting training.
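
For example, a one-line in-place edit with sed (the placeholder path below is hypothetical; check the actual field values in each YAML file):

# Hypothetical: point a stage-1 config at your local dataset path
sed -i 's|/path/to/stage1/data|/data/genlip/stage1|g' \
  configs/pretrain/genlip/stage1/train_genlip_so16_224_recap.yaml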

Training

A training script is provided in jobs/train.sh. You can start training with:

bash train.sh <main_func> <model_config>

# an example:
bash train.sh tasks/train_genlip_stage1.py configs/pretrain/genlip/stage1/train_genlip_so16_224_recap.yaml

where <main_func> is the training entry script to execute and <model_config> is the training configuration YAML to use.

Once the paths and hyperparameters are set in the config files, launch the job and wait for training to finish.

Model Checkpoints

The models are available at https://huggingface.co/YanFang/GenLIP.
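
One way to fetch them locally, assuming the huggingface_hub CLI is installed (pip install -U huggingface_hub):

# Download the model repository into a local checkpoints directory
huggingface-cli download YanFang/GenLIP --local-dir checkpoints/genlip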

Acknowledgments

Our codebase is built upon:

📄 License

This project is licensed under Apache License 2.0. See the LICENSE file for details.

Citation

If you find this project helpful, please give us a star ⭐ and cite our paper:

@article{fang2026letvitspeakgenerative,
  title={Let ViT Speak: Generative Language-Image Pre-training}, 
  author={Yan Fang and Mengcheng Lan and Zilong Huang and Weixian Lei and Yunqing Zhao and Yujie Zhong and Yingchen Yu and Qi She and Yao Zhao and Yunchao Wei},
  journal={arXiv preprint arXiv:2605.00809},
  year={2026}
}
