Ming Gui*1,2 · Johannes Schusterbauer*1,2 · Timy Phan1,2
Felix Krause1,2 · Josh Susskind3 · Miguel A. Bautista3 · Björn Ommer1,2
1CompVis Group @ LMU Munich
2MCML
3Apple
* equal contribution
ICLR 2026

We introduce the Representation Tokenizer (RepTok🦎), a generative framework that encodes each image into a single continuous latent token derived from self-supervised vision transformers. By jointly fine-tuning the semantic [cls] token with a generative decoder, RepTok achieves faithful reconstructions while preserving the smooth, meaningful structure of the SSL space. This compact one-token formulation enables highly efficient latent-space generative modeling, delivering competitive results even under severely constrained training budgets.
Our approach builds on a pre-trained SSL encoder that is lightly fine-tuned and trained jointly with a generative decoder. We train the decoder with a standard flow matching objective, complemented by a cosine-similarity loss that regularizes the latent representation to remain close to its original smooth and semantically structured space, which is well-suited for generation. Without auxiliary perceptual or adversarial losses, the resulting model is able to faithfully decode the single-token latent representation into the pixel space.
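As a rough sketch of the objective described above (not the repository's actual training code), a linear-path flow matching loss can be combined with a cosine-similarity regularizer that keeps the fine-tuned token close to its original SSL embedding. All function names and the weight `lam` are illustrative assumptions:

```python
import numpy as np

def cosine_similarity(a, b):
    # Cosine similarity between two flat latent vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def flow_matching_loss(v_pred, x0, x1):
    # Linear-path flow matching: along x_t = (1 - t) * x0 + t * x1,
    # the regression target is the constant velocity x1 - x0.
    target = x1 - x0
    return float(np.mean((v_pred - target) ** 2))

def combined_loss(v_pred, x0, x1, z, z_ssl, lam=0.1):
    # Hypothetical total loss: flow matching term plus a cosine-similarity
    # regularizer keeping the fine-tuned token z close to the original
    # SSL embedding z_ssl. The weight `lam` is an assumed value.
    reg = 1.0 - cosine_similarity(z, z_ssl)
    return flow_matching_loss(v_pred, x0, x1) + lam * reg
```

When the latent token stays aligned with its SSL embedding, the regularizer vanishes and only the flow matching term drives training.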
This design enables highly efficient image synthesis training, allowing us to use simple, attention-free architectures such as MLP-Mixers for accelerated ImageNet training. Furthermore, we show that the framework naturally extends to text-to-image (T2I) synthesis: by incorporating cross-attention to integrate textual conditioning, our model achieves competitive zero-shot performance on the COCO benchmark under an extremely constrained training budget.
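For intuition, a minimal attention-free MLP-Mixer block, the kind of architecture mentioned above, alternates an MLP across the token axis with an MLP across the channel axis. This is a generic sketch of the standard Mixer block under simplified assumptions (ReLU instead of GELU, no learned norm parameters), not our exact model:

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize each token's channel vector (no learned scale/shift here).
    mu = x.mean(axis=-1, keepdims=True)
    sd = x.std(axis=-1, keepdims=True)
    return (x - mu) / (sd + eps)

def mlp(x, w1, w2):
    # Two-layer MLP with ReLU (the original Mixer uses GELU).
    return np.maximum(x @ w1, 0.0) @ w2

def mixer_block(x, wt1, wt2, wc1, wc2):
    # x: (tokens, channels). Token mixing transposes so the MLP acts
    # across tokens; channel mixing acts across channels. No attention.
    x = x + mlp(layer_norm(x).T, wt1, wt2).T  # token-mixing MLP
    x = x + mlp(layer_norm(x), wc1, wc2)      # channel-mixing MLP
    return x
```

Because both mixing steps are plain matrix multiplications, the block avoids the quadratic cost of self-attention, which is what makes such architectures attractive for accelerated training.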
Our approach consistently achieves a substantially lower computational footprint while maintaining competitive performance on ImageNet.
This also extends to a general T2I setting: RepTok reaches SD 1.5 quality at a fraction of the training cost while delivering better generative performance than other efficiency-focused methods.
Our approach augments the pre-trained SSL representations with the additional information necessary to encode images faithfully as a single continuous token, enabling both high-fidelity image reconstruction and synthesis.
We observe smooth transitions not only in semantic content but also in spatial configuration. This indicates that our method successfully integrates low-level spatial information while preserving the properties of the pretrained encoder's latent space, facilitating generation directly within the learned representation.
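Such smooth transitions are typically visualized by interpolating between two encoded tokens and decoding each intermediate point. A common choice for latent interpolation is spherical linear interpolation (slerp), sketched here as an illustration rather than the repository's actual code:

```python
import numpy as np

def slerp(z0, z1, t):
    # Spherical linear interpolation between latent tokens z0 and z1.
    # t = 0 returns z0, t = 1 returns z1; intermediate t values follow
    # the great-circle arc between the two directions.
    cos_omega = np.dot(z0, z1) / (np.linalg.norm(z0) * np.linalg.norm(z1))
    omega = np.arccos(np.clip(cos_omega, -1.0, 1.0))
    if np.isclose(omega, 0.0):
        # Nearly parallel vectors: fall back to linear interpolation.
        return (1.0 - t) * z0 + t * z1
    s = np.sin(omega)
    return np.sin((1.0 - t) * omega) / s * z0 + np.sin(t * omega) / s * z1
```

Decoding a sequence of `slerp(z0, z1, t)` values for t swept from 0 to 1 produces the interpolation sequences between two images.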
Using our approach, we trained a general T2I model that synthesizes coherent and aesthetically pleasing images on a minimal compute budget.
First, clone this GitHub repo:
git clone git@github.com:CompVis/RepTok.git
cd RepTok
Download the official checkpoint by running the following commands:
mkdir -p checkpoints
wget -P checkpoints/ https://ommer-lab.com/files/reptok/reptok-xl-600k-ImageNet.ckpt
Set up the environment using the following commands:
conda create -n reptok python=3.10
conda activate reptok
pip install -r requirements.txt
For generation and reconstruction examples, please refer to scripts/generation.ipynb and scripts/reconstruction.ipynb.
If you use our work in your research, please use the following BibTeX entry:
@inproceedings{gui2026reptok,
title={Adapting Self-Supervised Representations as a Latent Space for Efficient Generation},
author={Ming Gui and Johannes Schusterbauer and Timy Phan and Felix Krause and Josh Susskind and Miguel Angel Bautista and Björn Ommer},
booktitle={The Fourteenth International Conference on Learning Representations (ICLR)},
year={2026},
url={https://openreview.net/forum?id=0b6a2SE23v}
}