王通 | Tong Wang

I am a senior computer vision researcher at MT Lab, Meitu Inc. (HKEX: 01357). I received my M.S. in Computer Science from USC Viterbi and B.S. in Mathematics-Computer Science from UC San Diego, with a minor in Speculative Design.

My research revolves around multimodal deep learning and representation learning — how to learn and align representations across modalities for robust perception and controllable generation. This theme connects my work across several domains: in scene text editing, I design glyph-aware representations within diffusion models to achieve high-fidelity text replacement; in video editing, I extract hierarchical vision-language features to guide temporally consistent manipulation; and in audio-visual speech recognition, I fuse visual lip-movement and acoustic representations to improve recognition under noisy conditions.

news

Apr 30, 2026	🎉 Two papers accepted at ICML 2026: MiVE on reference-guided video editing and Self-Prompting DiT on open-vocabulary scene text editing!
Feb 27, 2025	🎉 Paper GlyphMastero accepted at CVPR 2025 — a glyph encoder for high-fidelity scene text editing.

publications

ICML

MiVE: Multiscale Vision-Language Features for Reference-Guided Video Editing

Tong Wang, Meng Zou, Chengjing Wu, Xiaochao Qu, Luoqi Liu, Xiaolin Hu, and Ting Liu

In International Conference on Machine Learning, 2026

Reference-guided video editing takes a source video, a text instruction, and a reference image as input. We observe that VLM layers encode complementary information hierarchically: early layers capture spatial details, while deeper layers encode global semantics. MiVE repurposes VLMs as multiscale feature extractors, integrating hierarchical features from Qwen3-VL into a unified self-attention Diffusion Transformer and eliminating the modality mismatch inherent in cross-attention designs. MiVE achieves state-of-the-art performance and ranks highest in human preference.

ICML

Self-Prompting Diffusion Transformer for Open-Vocabulary Scene Text Editing via In-Context Learning

Hongxi Li, Tong Wang, Chengjing Wu, Tianbao Liu, Jiangtao Yao, Xiaochao Qu, Xinxiao Wu, Luoqi Liu, and Ting Liu

In International Conference on Machine Learning, 2026

We propose a method that constructs style and glyph prompts directly from the original image without introducing additional encoders. Training follows a two-stage strategy: the diffusion transformer is first trained on large-scale self-supervised data and then refined on a small set of paired images. By leveraging the in-context learning capability of the Multi-Modal Diffusion Transformer (MM-DiT), the method achieves open-vocabulary, style-consistent text editing with state-of-the-art performance across various languages.

CVPR

GlyphMastero: A Glyph Encoder for High-Fidelity Scene Text Editing

Tong Wang, Ting Liu, Xiaochao Qu, Chengjing Wu, Luoqi Liu, and Xiaolin Hu

In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025

GlyphMastero is a specialized glyph encoder for scene text editing that addresses the character structure modeling bottleneck in diffusion models. It achieves state-of-the-art generation quality, with an 18.02% improvement in sentence accuracy and a 53.28% reduction in text-region FID.

ICPR

DenseGAP: Graph-Structured Dense Correspondence Learning with Anchor Points

Zhengfei Kuang, Jiaman Li, Mingming He, Tong Wang, and Yajie Zhao

In 26th International Conference on Pattern Recognition (ICPR Oral), 2022

DenseGAP is a graph-structured dense correspondence learning method that improves the robustness of cross-view matching.

patents

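Date	Title	Patent No.	Inventor rank
2022-05-13	Synthesized speech evaluation method, apparatus, device, and storage medium	`CN114493232B`	First inventor
2024-01-26	Novel-view image generation method, apparatus, device, and readable storage medium	`CN117456031A`	First inventor
2022-07-29	Speech data acquisition method, apparatus, electronic device, and storage medium	`CN114822494A`	First inventor
2025-07-18	Method, apparatus, readable storage medium, and program product for processing text in images	`CN120339462A`	Second inventor
2023-03-21	Voice cloning model generation method, apparatus, and electronic device	`CN115831088A`	Second inventor
2022-10-25	Speech synthesis method and apparatus, storage medium, and electronic apparatus	`CN115240631A`	Third inventor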