Yubin Wang (王雨滨)
Email: yubinwang628@gmail.com | wangyubin2018@tongji.edu.cn
I am currently a research intern at Baidu Inc., working closely with Zhikang Zou and Xiaoqing Ye.
I received my bachelor's degree in Data Science and Big Data from Tongji University in 2022,
and am now pursuing my master's degree at Tongji University, advised by Prof. Cairong Zhao.
I also collaborate closely with Xinyang Jiang from Microsoft Research Asia.
My research interests lie in computer vision and multi-modal learning, with a particular focus on prompt learning,
text-based person re-identification, video temporal grounding, and 3D detection.
Google Scholar | GitHub
News
2024.02: One paper on prompt learning accepted by TIP
2024.01: Joined Baidu Inc. in Shanghai as a research intern, focusing on 3D vision
2023.12: One paper on prompt learning accepted by AAAI 2024
2022.09: Became a graduate student at Tongji University
2022.07: One paper on text-based person re-id accepted by PRCV 2022 (Oral)
2021.07: Joined the VILL Lab, advised by Prof. Cairong Zhao
2021.05: My first paper, on opinion summarization, accepted by IJCRS 2021
Publications (*equal contribution; only first-author or co-first-author papers are listed)
Learning Domain Invariant Prompt for Vision-Language Models
Cairong Zhao*, Yubin Wang*, Xinyang Jiang, Yifei Shen, Kaitao Song, Dongsheng Li, Duoqian Miao
IEEE Transactions on Image Processing, TIP (CCF-A, SCI)
[PDF]
[Code]
[BibTeX]
Abstract
Prompt learning stands out as one of the most efficient approaches for adapting powerful vision-language foundation models like CLIP to downstream datasets, tuning learnable prompt vectors with very few samples. However, despite its success in achieving remarkable performance on in-domain data, prompt learning still faces the significant challenge of generalizing effectively to novel classes and domains. Some existing methods address this concern by dynamically generating distinct prompts for different domains, yet they overlook the inherent potential of prompts to generalize across unseen domains. To address these limitations, our study introduces an innovative prompt learning paradigm, called MetaPrompt, that aims to directly learn domain-invariant prompts in few-shot scenarios. To facilitate learning prompts for image and text inputs independently, we present a dual-modality prompt tuning network comprising two pairs of coupled encoders. Our study centers on an alternate episodic training algorithm that enriches the generalization capacity of the learned prompts. In contrast to traditional episodic training algorithms, our approach incorporates both in-domain updates and domain-split updates in a batch-wise manner. For in-domain updates, we introduce a novel asymmetric contrastive learning paradigm in which representations from the pre-trained encoder supervise and regularize prompts from the prompted encoder. To enhance performance on out-of-domain distributions, we propose a domain-split optimization on visual prompts for cross-domain tasks, or on textual prompts for cross-class tasks, during domain-split updates. Extensive experiments across 11 datasets for base-to-new generalization and 4 datasets for domain generalization exhibit favorable performance. Compared with the state-of-the-art method, MetaPrompt achieves an absolute gain of 1.02% on the overall harmonic mean in base-to-new generalization and consistently demonstrates superiority over all benchmarks in domain generalization.
Learning Hierarchical Prompt with Structured Linguistic Knowledge for Vision-Language Models
Yubin Wang, Xinyang Jiang, De Cheng, Dongsheng Li, Cairong Zhao
The 38th Annual AAAI Conference on Artificial Intelligence, AAAI 2024 (CCF-A)
[PDF]
[Code]
[BibTeX]
Abstract
Prompt learning has become a prevalent strategy for adapting vision-language foundation models to downstream tasks. As large language models (LLMs) have emerged, recent studies have explored the use of category-related descriptions as input to enhance prompt effectiveness. Nevertheless, conventional descriptions lack structured information that effectively represents the interconnections among the entities or attributes linked to a particular category. To address this limitation and prioritize harnessing structured knowledge, this paper advocates leveraging LLMs to build a graph for each description to model the entities and attributes describing the category, as well as their correlations. Pre-existing prompt tuning methods exhibit inadequacies in managing this structured knowledge. Consequently, we propose a novel approach called Hierarchical Prompt Tuning (HPT), which enables simultaneous modeling of both structured and conventional linguistic knowledge. Specifically, we introduce a relationship-guided attention module to capture pair-wise associations among entities and attributes for low-level prompt learning. In addition, by incorporating high-level and global-level prompts that model overall semantics, the proposed hierarchical structure forges cross-level interlinks and empowers the model to handle more complex and long-term relationships. Extensive experiments demonstrate that our HPT shows strong effectiveness and generalizes much better than existing SOTA methods.
Part-Based Multi-Scale Attention Network for Text-Based Person Search
Yubin Wang, Ding Qi, Cairong Zhao
Chinese Conference on Pattern Recognition and Computer Vision, PRCV 2022 (CCF-C, Oral)
[PDF]
[BibTeX]
Abstract
Text-based person search aims to retrieve the target person in an image gallery based on textual descriptions. Solving such a fine-grained cross-modal retrieval problem is very challenging due to differences between modalities. Moreover, the inter-class variance of both person images and descriptions is small, so more semantic information is needed to help align visual and textual representations at different scales. In this paper, we propose a Part-based Multi-Scale Attention Network (PMAN) capable of extracting visual semantic features at different scales and matching them with textual features. We initially extract visual and textual features using ResNet and BERT, respectively. Multi-scale visual semantics are then acquired from local feature maps of different scales. Our proposed method learns representations for both modalities simultaneously, based mainly on a Bottleneck Transformer with a self-attention mechanism. A multi-scale cross-modal matching strategy is introduced to narrow the gap between modalities at multiple scales. Extensive experimental results show that our method outperforms state-of-the-art methods on the CUHK-PEDES dataset.
An Opinion Summarization-Evaluation System Based on Pre-trained Models
Han Jiang*, Yubin Wang*, Songhao Lv, Zhihua Wei
Rough Sets: International Joint Conference, IJCRS 2021
[PDF]
[BibTeX]
Abstract
As social media are used ever more frequently, the task of extracting the mainstream opinions from the discussions arising on these platforms, i.e., opinion summarization, has drawn considerable attention. This paper proposes an opinion summarization-evaluation system containing a pipeline and an evaluation module for the task. In our algorithm, the state-of-the-art pre-trained model BERT is fine-tuned for subjectivity analysis, and advanced pre-trained models are combined with traditional data mining algorithms to extract the mainstream opinions. For evaluation, a set of hierarchical metrics is also proposed. Experimental results show that our algorithm produces concise and representative opinions. An ablation study further confirms that each part of the pipeline contributes significantly.