Vision Bridge Transformer at Scale

1 National University of Singapore
2 The Hong Kong Polytechnic University · 3 Shanghai Jiao Tong University
Results of Vision Bridge Transformer on vision translation tasks

Abstract

Figure: ViBT vs Conditional DiT. Conditional DiT: flow matching-based conditional translation. ViBT: Brownian bridge trajectory showing data-to-data translation.

We introduce Vision Bridge Transformer (ViBT), a large-scale instantiation of Brownian Bridge Models designed for conditional generation. Unlike traditional diffusion models that transform noise into data, Bridge Models directly model the trajectory between inputs and outputs, creating an efficient data-to-data translation paradigm. By scaling these models to 20B and 1.3B parameters, we demonstrate their effectiveness for image and video translation tasks. To support this scale, we adopt a Transformer architecture and propose a variance-stabilized velocity-matching objective for robust training. Together, these advances highlight the power of scaling Bridge Models for instruction-based image editing and complex video translation.
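The data-to-data idea in the abstract can be illustrated with a textbook Brownian bridge. The sketch below is not the authors' implementation: it uses the standard bridge pinned at two endpoints, with NumPy arrays standing in for source and target latents. The function names `sample_bridge_state` and `velocity_target`, the noise scale `sigma`, and the convention that x0 is the input and x1 is the output are all assumptions made for illustration. The variance-stabilized objective mentioned above is not specified on this page and is not reproduced here.

```python
import numpy as np

def sample_bridge_state(x0, x1, t, rng, sigma=1.0):
    """Draw x_t from a Brownian bridge pinned at x0 (t=0) and x1 (t=1).

    Textbook marginal: mean (1-t)*x0 + t*x1, std sigma*sqrt(t*(1-t)).
    `sigma` is an assumed noise scale, not a value taken from the paper.
    """
    eps = rng.standard_normal(x0.shape)
    mean = (1.0 - t) * x0 + t * x1
    std = sigma * np.sqrt(t * (1.0 - t))
    return mean + std * eps

def velocity_target(xt, x1, t):
    """Standard bridge drift: the velocity that reaches x1 in the remaining time 1 - t."""
    return (x1 - xt) / (1.0 - t)

rng = np.random.default_rng(0)
x0 = rng.standard_normal((4, 8))  # stand-in for a source (input) latent
x1 = rng.standard_normal((4, 8))  # stand-in for a target (output) latent

for t in (0.0, 0.5, 0.9):
    xt = sample_bridge_state(x0, x1, t, rng)
    v = velocity_target(xt, x1, t)
    # The noise part of the target scales like sqrt(t / (1 - t)).
    print(f"t={t:.1f}  mean |v| = {np.abs(v).mean():.3f}")
```

At t = 0 the sampled state equals x0 and the target reduces to x1 - x0; no noise prior is involved at either endpoint. The noise term in this target grows as t approaches 1, which is one plausible reason a plain velocity-matching loss could be unstable without some form of rescaling.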

Translation Progress

Bridge formulation equation

Image Examples

Make it a Japanese anime style, cel shading.

x0 · xt (t = 0.00) · x1

Video Examples

Make it a Japanese anime style, cel shading.

Faster Speed

Removing conditional tokens lets our bridge approach run up to 4× faster than baselines that rely on conditional tokens.
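A rough way to see where a factor of up to 4 could come from is to count query-key pairs in full self-attention. The sketch below assumes a conditional baseline that concatenates as many condition tokens as target tokens, so the sequence doubles in length; this setup, the token count of 4096, and the function names are illustrative assumptions, not details taken from the paper.

```python
def attention_pair_count(num_tokens: int) -> int:
    """Query-key pairs evaluated by full self-attention over num_tokens tokens."""
    return num_tokens * num_tokens

def attention_cost_ratio(target_tokens: int, condition_tokens: int) -> float:
    """Attention cost with condition tokens divided by the cost without them."""
    with_condition = attention_pair_count(target_tokens + condition_tokens)
    without_condition = attention_pair_count(target_tokens)
    return with_condition / without_condition

# Hypothetical sizes: equal numbers of target and condition tokens.
print(attention_cost_ratio(4096, 4096))  # 4.0
```

This counts attention pairs only. Layers whose cost is linear in sequence length would shrink by about 2× under the same assumption, so an end-to-end speedup between 2× and 4× would be consistent with the "up to 4×" figure.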

Latency comparison: bridge vs conditional token baselines

Applications

Image Stylization

Source Images

Original images 000189825, 000038819, 000018060, 000278574

Stylized

Style 2 versions of 000189825, 000038819, 000018060, 000278574


Image Editing

Source Images

Source images animal_000047206, animal_000140400, architecture_000004499, daily object_000027344

Edited

Edited variant 0 of animal_000047206, animal_000140400, architecture_000004499, daily object_000027344


Video Stylization

Source Videos

Stylized


Video Frame Interpolation

Source Videos · 15 FPS

Interpolated · 60 FPS
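Going from 15 FPS to 60 FPS is a 4× increase in frame rate. The small sketch below works out what that implies per pair of adjacent source frames, assuming the source frames are kept and new frames are placed at evenly spaced times between them. That assumption and the helper name `interpolation_plan` are illustrative; the page does not describe how the model arranges frames.

```python
from fractions import Fraction

def interpolation_plan(src_fps: int, dst_fps: int):
    """Return (rate factor, new frames per source gap, time offsets in seconds)."""
    if dst_fps % src_fps != 0:
        raise ValueError("dst_fps must be an integer multiple of src_fps")
    factor = dst_fps // src_fps
    new_between = factor - 1
    # Offsets of the new frames, measured from the earlier source frame.
    offsets = [Fraction(k, dst_fps) for k in range(1, factor)]
    return factor, new_between, offsets

factor, new_between, offsets = interpolation_plan(15, 60)
print(factor, new_between, [str(o) for o in offsets])
```

Under these assumptions, each gap between consecutive 15 FPS frames receives three new frames.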

BibTeX


@article{tan2025vision,
  title={Vision Bridge Transformer at Scale},
  author={Tan, Zhenxiong and Wang, Zeqing and Yang, Xingyi and Liu, Songhua and Wang, Xinchao},
  journal={arXiv preprint arXiv:2511.23199},
  year={2025}
}