Vision Bridge Transformer at Scale

Zhenxiong Tan¹, Zeqing Wang¹, Xingyi Yang²,¹, Songhua Liu³,¹, Xinchao Wang¹

¹ National University of Singapore
² The Hong Kong Polytechnic University · ³ Shanghai Jiao Tong University

Abstract

Figure: **Conditional DiT**: Flow matching-based conditional translation diagram.

Figure: **ViBT**: Brownian bridge trajectory showing data-to-data translation.

ViBT vs Conditional DiT

We introduce Vision Bridge Transformer (ViBT), a large-scale instantiation of Brownian Bridge Models designed for conditional generation. Unlike traditional diffusion models that transform noise into data, Bridge Models directly model the trajectory between inputs and outputs, creating an efficient data-to-data translation paradigm. By scaling these models to 20B and 1.3B parameters, we demonstrate their effectiveness for image and video translation tasks. To support this scale, we adopt a Transformer architecture and propose a variance-stabilized velocity-matching objective for robust training. Together, these advances highlight the power of scaling Bridge Models for instruction-based image editing and complex video translation.
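To make the data-to-data idea concrete, here is a minimal NumPy sketch of a textbook Brownian bridge between a source `x0` and a target `x1`. It is not the ViBT training code: the function names, the unit noise scale `sigma`, and the use of the standard bridge drift as a velocity target are assumptions for illustration. The sketch also shows why some form of variance stabilization is plausible: the noise component of this naive velocity target has standard deviation `sqrt(t / (1 - t))`, which grows without bound as `t` approaches 1.

```python
import numpy as np


def bridge_sample(x0, x1, t, rng, sigma=1.0):
    """Sample x_t from a standard Brownian bridge pinned at x0 (t=0) and x1 (t=1).

    The marginal is Gaussian with mean (1 - t) * x0 + t * x1
    and variance sigma^2 * t * (1 - t).
    """
    eps = rng.standard_normal(x0.shape)
    mean = (1.0 - t) * x0 + t * x1
    return mean + sigma * np.sqrt(t * (1.0 - t)) * eps


def velocity_target(xt, x1, t):
    """Drift of the standard bridge toward x1 (hypothetical regression target).

    Only valid for t < 1; the denominator vanishes at t = 1.
    """
    return (x1 - xt) / (1.0 - t)


def target_noise_std(t, sigma=1.0):
    """Std of the noise term inside velocity_target for a bridge sample at time t.

    Substituting the bridge sample gives
    (x1 - x0) - sigma * sqrt(t / (1 - t)) * eps.
    """
    return sigma * np.sqrt(t / (1.0 - t))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x0 = np.zeros((2, 3))  # stand-in for a source latent
    x1 = np.ones((2, 3))   # stand-in for a target latent
    xt = bridge_sample(x0, x1, 0.5, rng)
    print("velocity target shape:", velocity_target(xt, x1, 0.5).shape)
    for t in (0.1, 0.5, 0.9, 0.99):
        print(f"t={t:.2f}  noise std of target = {target_noise_std(t):.3f}")
```

Unlike a noise-to-data model, both endpoints here are data, so the source image or video enters the process as the starting state rather than as a separate conditioning input.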

Translation Progress

Image Examples

Make it a Japanese anime style, cel shading.

x₀ → x_t (t = 0.00) → x₁

Video Examples

Make it a Japanese anime style, cel shading.

Faster Speed

Removing conditional tokens lets our bridge approach run up to 4× faster than traditional methods.

Latency comparison: bridge vs conditional token baselines
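A back-of-envelope sketch of where a speedup of this size could come from. It assumes (this is not stated on the page) that the conditional baseline concatenates as many condition tokens as target tokens, and that self-attention dominates the cost. The function names and the cost model are illustrative only, not measured latencies.

```python
def attention_cost(num_tokens, dim):
    """Rough multiply-add count for one self-attention layer.

    Computing QK^T and the attention-weighted sum over V each cost
    about num_tokens^2 * dim multiply-adds.
    """
    return 2 * num_tokens ** 2 * dim


def conditional_over_bridge(target_tokens, cond_tokens, dim=64):
    """Ratio of attention cost with extra condition tokens to the cost without them."""
    with_cond = attention_cost(target_tokens + cond_tokens, dim)
    without_cond = attention_cost(target_tokens, dim)
    return with_cond / without_cond


if __name__ == "__main__":
    # Equal numbers of condition and target tokens double the sequence length.
    print(conditional_over_bridge(target_tokens=1024, cond_tokens=1024))
    # Fewer condition tokens give a smaller ratio.
    print(conditional_over_bridge(target_tokens=1024, cond_tokens=256))
```

Under this model, doubling the sequence length quadruples the attention cost, which is consistent with an "up to 4×" figure. Layers whose cost is linear in the token count would only double, so the end-to-end gain would fall somewhere below that bound.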

Applications

Image Stylization

Source Images

Stylized

Image Editing

Source Images

Edited

Video Stylization

Source Videos

Stylized

Video Frame Interpolation

Source Videos · 15 FPS

Interpolated · 60 FPS

BibTeX


@article{tan2025vision,
  title={Vision Bridge Transformer at Scale},
  author={Tan, Zhenxiong and Wang, Zeqing and Yang, Xingyi and Liu, Songhua and Wang, Xinchao},
  journal={arXiv preprint arXiv:2511.23199},
  year={2025}
}
