thomas@wang:~/$


Yubin Wang   (王雨滨)

Email: yubinwang628@gmail.com   |   wangyubin2018@tongji.edu.cn

I am currently a research intern at Baidu Inc., working closely with Zhikang Zou and Xiaoqing Ye. I received my bachelor's degree in Data Science and Big Data from Tongji University in 2022 and am now pursuing my master's degree at Tongji University, advised by Prof. Cairong Zhao. I also collaborate closely with Xinyang Jiang from Microsoft Research Asia. My research interests lie in computer vision and multi-modal learning, with a particular focus on prompt learning, text-based person re-identification, video temporal grounding, and 3D detection.

Google Scholar   |   GitHub

profile photo
News

2024.02: One paper about prompt learning accepted by TIP
2024.01: Joined Baidu Inc. in Shanghai as a research intern, focusing on 3D vision
2023.12: One paper about prompt learning accepted by AAAI 2024
2022.09: Became a graduate student at Tongji University
2022.07: One paper about text-based person re-id accepted by PRCV 2022, Oral
2021.07: Joined VILL Lab, advised by Prof. Cairong Zhao
2021.05: My first paper about opinion summarization accepted by IJCRS 2021

Publications (*equal contribution; only papers on which I am first or co-first author are included)
Learning Domain Invariant Prompt for Vision-Language Models
Cairong Zhao*, Yubin Wang*, Xinyang Jiang, Yifei Shen, Kaitao Song, Dongsheng Li, Duoqian Miao
IEEE Transactions on Image Processing, TIP (CCF-A, SCI)
[PDF] [Code] [BibTeX]
Abstract
Prompt learning stands out as one of the most efficient approaches for adapting powerful vision-language foundational models like CLIP to downstream datasets by tuning learnable prompt vectors with very few samples. However, despite its success in achieving remarkable performance on in-domain data, prompt learning still faces the significant challenge of effectively generalizing to novel classes and domains. Some existing methods address this concern by dynamically generating distinct prompts for different domains. Yet, they overlook the inherent potential of prompts to generalize across unseen domains. To address these limitations, our study introduces an innovative prompt learning paradigm, called MetaPrompt, aiming to directly learn domain-invariant prompts in few-shot scenarios. To facilitate learning prompts for image and text inputs independently, we present a dual-modality prompt tuning network comprising two pairs of coupled encoders. Our study centers on an alternate episodic training algorithm to enrich the generalization capacity of the learned prompts. In contrast to traditional episodic training algorithms, our approach incorporates both in-domain updates and domain-split updates in a batch-wise manner. For in-domain updates, we introduce a novel asymmetric contrastive learning paradigm, where representations from the pre-trained encoder serve as supervision to regularize prompts from the prompted encoder. To enhance performance on out-of-domain distributions, we propose a domain-split optimization on visual prompts for cross-domain tasks or textual prompts for cross-class tasks during domain-split updates. Extensive experiments across 11 datasets for base-to-new generalization and 4 datasets for domain generalization exhibit favorable performance. Compared with the state-of-the-art method, MetaPrompt achieves an absolute gain of 1.02% on the overall harmonic mean in base-to-new generalization and consistently demonstrates superiority over all benchmarks in domain generalization.
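For intuition, here is a minimal PyTorch sketch of the asymmetric contrastive idea described above, where features from the frozen pre-trained encoder act as supervision for features from the prompted encoder. The function name, temperature, and batch setup are illustrative assumptions, not the paper's released implementation.

# Hypothetical sketch: asymmetric contrastive loss where frozen-encoder
# features supervise prompted-encoder features (gradients flow only through
# the prompted branch).
import torch
import torch.nn.functional as F

def asymmetric_contrastive_loss(prompted_feats, frozen_feats, temperature=0.07):
    """Pull each prompted feature toward the frozen feature of the same sample
    and away from frozen features of other samples in the batch."""
    prompted = F.normalize(prompted_feats, dim=-1)
    frozen = F.normalize(frozen_feats.detach(), dim=-1)   # supervision only, no grad
    logits = prompted @ frozen.t() / temperature          # (B, B) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    return F.cross_entropy(logits, targets)

# toy usage
loss = asymmetric_contrastive_loss(torch.randn(8, 512, requires_grad=True),
                                   torch.randn(8, 512))
loss.backward()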

Learning Hierarchical Prompt with Structured Linguistic Knowledge for Vision-Language Models
Yubin Wang, Xinyang Jiang, De Cheng, Dongsheng Li, Cairong Zhao
The 38th Annual AAAI Conference on Artificial Intelligence, AAAI 2024 (CCF-A)
[PDF] [Code] [BibTeX]
Abstract
Prompt learning has become a prevalent strategy for adapting vision-language foundation models to downstream tasks. As large language models (LLMs) have emerged, recent studies have explored the use of category-related descriptions as input to enhance prompt effectiveness. Nevertheless, conventional descriptions fall short of structured information that effectively represents the interconnections among entities or attributes linked to a particular category. To address this limitation and prioritize harnessing structured knowledge, this paper advocates for leveraging LLMs to build a graph for each description to model the entities and attributes describing the category, as well as their correlations. Preexisting prompt tuning methods exhibit inadequacies in managing this structured knowledge. Consequently, we propose a novel approach called Hierarchical Prompt Tuning (HPT), which enables simultaneous modeling of both structured and conventional linguistic knowledge. Specifically, we introduce a relationship-guided attention module to capture pair-wise associations among entities and attributes for low-level prompt learning. In addition, by incorporating high-level and global-level prompts modeling overall semantics, the proposed hierarchical structure forges cross-level interlinks and empowers the model to handle more complex and long-term relationships. Extensive experiments demonstrate that our HPT shows strong effectiveness and generalizes much better than existing SOTA methods.
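As an illustration of the relationship-guided attention described above, the following hypothetical PyTorch sketch biases attention among entity/attribute tokens with an adjacency matrix taken from an LLM-built description graph. The class name, shapes, and the learnable bias scale are assumptions for demonstration, not the HPT code.

# Hypothetical sketch: scaled dot-product attention over entity/attribute
# tokens with an additive bias derived from a relation (adjacency) matrix.
import torch
import torch.nn as nn

class RelationshipGuidedAttention(nn.Module):
    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.rel_scale = nn.Parameter(torch.tensor(1.0))   # learnable bias strength

    def forward(self, tokens, relation):
        # tokens:   (B, N, dim) entity/attribute token embeddings
        # relation: (B, N, N)   1 where a pair is connected in the graph, else 0
        bias = self.rel_scale * relation                    # additive attention bias
        # nn.MultiheadAttention expects a per-head mask of shape (B*num_heads, N, N)
        bias = bias.repeat_interleave(self.attn.num_heads, dim=0)
        out, _ = self.attn(tokens, tokens, tokens, attn_mask=bias)
        return out

# toy usage
module = RelationshipGuidedAttention(dim=64)
tokens = torch.randn(2, 10, 64)
relation = (torch.rand(2, 10, 10) > 0.5).float()
print(module(tokens, relation).shape)   # torch.Size([2, 10, 64])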

Part-Based Multi-Scale Attention Network for Text-Based Person Search
Yubin Wang, Ding Qi, Cairong Zhao
Chinese Conference on Pattern Recognition and Computer Vision, PRCV 2022 (CCF-C, Oral)
[PDF] [BibTeX]
Abstract
Text-based person search aims to retrieve the target person in an image gallery based on textual descriptions. Solving such a fine-grained cross-modal retrieval problem is very challenging due to differences between modalities. Moreover, the inter-class variance of both person images and descriptions is small, and more semantic information is needed to assist in aligning visual and textual representations at different scales. In this paper, we propose a Part-based Multi-Scale Attention Network (PMAN) capable of extracting visual semantic features from different scales and matching them with textual features. We initially extract visual and textual features using ResNet and BERT, respectively. Multi-scale visual semantics is then acquired based on local feature maps of different scales. Our proposed method learns representations for both modalities simultaneously, based mainly on a Bottleneck Transformer with a self-attention mechanism. A multi-scale cross-modal matching strategy is introduced to narrow the gap between modalities from multiple scales. Extensive experimental results show that our method outperforms the state-of-the-art methods on the CUHK-PEDES dataset.
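To make the multi-scale matching idea concrete, here is an illustrative PyTorch sketch that slices a visual feature map into horizontal stripes at several scales, pools each stripe, and scores every scale against a text embedding. The stripe counts, pooling choice, and scoring rule are assumptions, not the PMAN implementation.

# Illustrative sketch: part-based multi-scale visual features matched to a
# text embedding by cosine similarity.
import torch
import torch.nn.functional as F

def multi_scale_part_features(feat_map, scales=(1, 2, 4)):
    """feat_map: (B, C, H, W) visual feature map -> {scale: (B, scale, C)}."""
    parts = {}
    for k in scales:
        pooled = F.adaptive_avg_pool2d(feat_map, (k, 1))   # k horizontal stripes
        parts[k] = pooled.squeeze(-1).transpose(1, 2)      # (B, k, C)
    return parts

def multi_scale_matching(feat_map, text_emb, scales=(1, 2, 4)):
    """Average over scales of the best stripe-to-text cosine similarity."""
    text = F.normalize(text_emb, dim=-1)                   # (B, C)
    scores = []
    for stripes in multi_scale_part_features(feat_map, scales).values():
        stripes = F.normalize(stripes, dim=-1)             # (B, k, C)
        sim = torch.einsum('bkc,bc->bk', stripes, text)    # (B, k)
        scores.append(sim.max(dim=1).values)               # best part per scale
    return torch.stack(scores).mean(dim=0)                 # (B,)

# toy usage
print(multi_scale_matching(torch.randn(4, 2048, 24, 8), torch.randn(4, 2048)).shape)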

An Opinion Summarization-Evaluation System Based on Pre-trained Models
Han Jiang*, Yubin Wang*, Songhao Lv, Zhihua Wei
International Joint Conference on Rough Sets, IJCRS 2021
[PDF] [BibTeX]
Abstract
As social media are used more and more frequently, the task of extracting the mainstream opinions from the discussions arising on these platforms, i.e., opinion summarization, has drawn considerable attention. This paper proposes an opinion summarization-evaluation system containing a pipeline and an evaluation module for the task. In our algorithm, the state-of-the-art pre-trained model BERT is fine-tuned for subjectivity analysis, and advanced pre-trained models are combined with traditional data mining algorithms to extract the mainstream opinions. For evaluation, a set of hierarchical metrics is also proposed. Experimental results show that our algorithm produces concise and representative opinions. An ablation study is also conducted to show that each part of the pipeline contributes significantly.
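As a rough sketch of such a two-stage pipeline (not the paper's exact setup), the snippet below embeds posts with a pre-trained sentence encoder, clusters the embeddings with k-means, and returns the post nearest each centroid as a mainstream opinion; the encoder name is a placeholder, and the fine-tuned BERT subjectivity filter from the paper is only marked as a stub.

# Rough sketch: pre-trained sentence embeddings + a classic data-mining
# algorithm (k-means) to surface mainstream opinions.
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

def mainstream_opinions(posts, n_opinions=3):
    # Stage 1 (stub): the paper fine-tunes BERT to keep only subjective posts;
    # for illustration we keep every post.
    subjective = posts

    # Stage 2: embed, cluster, and pick the post nearest each cluster centroid.
    encoder = SentenceTransformer("all-MiniLM-L6-v2")      # placeholder model
    emb = encoder.encode(subjective, convert_to_numpy=True)
    km = KMeans(n_clusters=n_opinions, n_init=10, random_state=0).fit(emb)
    picks = []
    for c in range(n_opinions):
        idx = np.where(km.labels_ == c)[0]
        best = idx[np.argmin(np.linalg.norm(emb[idx] - km.cluster_centers_[c], axis=1))]
        picks.append(subjective[best])
    return picks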