Ming Gui*1,2 · Johannes Schusterbauer*1,2 · Timy Phan1,2
Felix Krause1,2 · Josh Susskind3 · Miguel A. Bautista3 · Björn Ommer1,2
1CompVis Group @ LMU Munich
2MCML
3Apple
* equal contribution
ICLR 2026

We introduce the Representation Tokenizer (RepTok🦎), a generative framework that encodes each image into a single continuous latent token derived from self-supervised vision transformers. By jointly fine-tuning the semantic [cls] token with a generative decoder, RepTok achieves faithful reconstructions while preserving the smooth, meaningful structure of the SSL space. This compact one-token formulation enables highly efficient latent-space generative modeling, delivering competitive results even under severely constrained training budgets.
Our approach builds on a pre-trained SSL encoder that is lightly fine-tuned and trained jointly with a generative decoder. We train the decoder with a standard flow matching objective, complemented by a cosine-similarity loss that regularizes the latent representation to remain close to its original smooth and semantically structured space, which is well-suited for generation. Without auxiliary perceptual or adversarial losses, the resulting model is able to faithfully decode the single-token latent representation into the pixel space.
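As a rough sketch of the objective described above (not the repository's actual training code), a linear-path flow matching loss can be combined with a cosine-similarity regularizer that keeps the fine-tuned token close to its original SSL embedding. All function names and the weight `lam` are illustrative assumptions:

```python
import numpy as np

def cosine_similarity(a, b):
    # Cosine similarity between two flat latent vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def flow_matching_loss(v_pred, x0, x1):
    # Linear-path flow matching: along x_t = (1 - t) * x0 + t * x1,
    # the regression target is the constant velocity x1 - x0.
    target = x1 - x0
    return float(np.mean((v_pred - target) ** 2))

def combined_loss(v_pred, x0, x1, z, z_ssl, lam=0.1):
    # Hypothetical total loss: flow matching term plus a cosine-similarity
    # regularizer keeping the fine-tuned token z close to the original
    # SSL embedding z_ssl. The weight `lam` is an assumed value.
    reg = 1.0 - cosine_similarity(z, z_ssl)
    return flow_matching_loss(v_pred, x0, x1) + lam * reg
```

When the latent token stays aligned with its SSL embedding, the regularizer vanishes and only the flow matching term drives training.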
This design enables highly efficient image synthesis training, allowing us to use simple, attention-free architectures such as MLP-Mixers for accelerated ImageNet training. Furthermore, we show that the framework naturally extends to text-to-image (T2I) synthesis: by incorporating cross-attention to integrate textual conditioning, our model achieves competitive zero-shot performance on the COCO benchmark under an extremely constrained training budget.
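For intuition, a minimal attention-free MLP-Mixer block, the kind of architecture mentioned above, alternates an MLP across the token axis with an MLP across the channel axis. This is a generic sketch of the standard Mixer block under simplified assumptions (ReLU instead of GELU, no learned norm parameters), not our exact model:

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize each token's channel vector (no learned scale/shift here).
    mu = x.mean(axis=-1, keepdims=True)
    sd = x.std(axis=-1, keepdims=True)
    return (x - mu) / (sd + eps)

def mlp(x, w1, w2):
    # Two-layer MLP with ReLU (the original Mixer uses GELU).
    return np.maximum(x @ w1, 0.0) @ w2

def mixer_block(x, wt1, wt2, wc1, wc2):
    # x: (tokens, channels). Token mixing transposes so the MLP acts
    # across tokens; channel mixing acts across channels. No attention.
    x = x + mlp(layer_norm(x).T, wt1, wt2).T  # token-mixing MLP
    x = x + mlp(layer_norm(x), wc1, wc2)      # channel-mixing MLP
    return x
```

Because both mixing steps are plain matrix multiplications, the block avoids the quadratic cost of self-attention, which is what makes such architectures attractive for accelerated training.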
Our approach consistently achieves a substantially lower computational footprint while maintaining competitive performance on ImageNet.
This also extends to a general T2I setting: RepTok reaches SD 1.5 quality at a fraction of the training cost while delivering better generative performance than other efficiency-focused methods.
Our approach augments the pre-trained SSL representations with the additional information necessary to encode images faithfully as a single continuous token, enabling both high-fidelity image reconstruction and synthesis.
We observe smooth transitions not only in semantic content but also in spatial configuration. This indicates that our method successfully integrates low-level spatial information while preserving the properties of the pretrained encoder's latent space, facilitating generation directly within the learned representation.
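Such smooth transitions are typically visualized by interpolating between two encoded tokens and decoding each intermediate point. A common choice for latent interpolation is spherical linear interpolation (slerp), sketched here as an illustration rather than the repository's actual code:

```python
import numpy as np

def slerp(z0, z1, t):
    # Spherical linear interpolation between latent tokens z0 and z1.
    # t = 0 returns z0, t = 1 returns z1; intermediate t values follow
    # the great-circle arc between the two directions.
    cos_omega = np.dot(z0, z1) / (np.linalg.norm(z0) * np.linalg.norm(z1))
    omega = np.arccos(np.clip(cos_omega, -1.0, 1.0))
    if np.isclose(omega, 0.0):
        # Nearly parallel vectors: fall back to linear interpolation.
        return (1.0 - t) * z0 + t * z1
    s = np.sin(omega)
    return np.sin((1.0 - t) * omega) / s * z0 + np.sin(t * omega) / s * z1
```

Decoding a sequence of `slerp(z0, z1, t)` values for t swept from 0 to 1 produces the interpolation sequences between two images.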
Using our approach, we trained a general T2I model that synthesizes coherent and aesthetically pleasing images on a minimal compute budget.
First, clone this GitHub repo:
git clone git@github.com:CompVis/RepTok.git
cd RepTok
Download the official checkpoint by running the following commands:
mkdir -p checkpoints
wget -P checkpoints/ https://ommer-lab.com/files/reptok/reptok-xl-600k-ImageNet.ckpt
Set up the environment using the following commands:
conda create -n reptok python=3.10
conda activate reptok
pip install -r requirements.txt
For generation and reconstruction examples, please refer to scripts/generation.ipynb and scripts/reconstruction.ipynb.
If you use our work in your research, please use the following BibTeX entry:
@inproceedings{gui2026reptok,
title={Adapting Self-Supervised Representations as a Latent Space for Efficient Generation},
author={Ming Gui and Johannes Schusterbauer and Timy Phan and Felix Krause and Josh Susskind and Miguel Angel Bautista and Björn Ommer},
booktitle={The Fourteenth International Conference on Learning Representations (ICLR)},
year={2026},
url={https://openreview.net/forum?id=0b6a2SE23v}
}