Yige Xu @ NTU Singapore

About Me

I am currently NOT available on the job market. Thank you for your interest.

Currently, I am a fourth-year PhD candidate at the College of Computing and Data Science (CCDS), Nanyang Technological University (NTU), Singapore, under the supervision of Prof. Chunyan Miao.

Before that, I obtained my master's degree from the School of Computer Science, Fudan University (FDU), in 2021, where I worked with Prof. Xipeng Qiu and Prof. Xuanjing Huang. While at Fudan, I was a member of the Fudan NLP Group and the fastNLP development team, and one of the main contributors to fastNLP [GitHub] [Gitee].

From 2014 to 2018, I completed my bachelor's at Taishan College, Shandong University (SDU), where I worked with Prof. Jun Ma.

Education Bio

2021 - present: PhD Student, College of Computing and Data Science (CCDS), Nanyang Technological University (NTU). Working with Prof. Chunyan Miao.
2018 - 2021: M.Sc. in Computer Science, Fudan University. Member of the Fudan NLP Group and the fastNLP development team; worked with Prof. Xipeng Qiu and Prof. Xuanjing Huang.
2014 - 2018: B.Eng. in Computer Science and Technology, Taishan College, Shandong University. Worked with Prof. Jun Ma. I completed the China Top-Notch Undergraduate Training Program at Taishan College, the honors college of Shandong University.

Research Interest

My research interests are centred on Machine Learning and Natural Language Processing (NLP), with a specific focus on Large Language Models (LLMs). I am dedicated to making knowledge transfer for LLMs more efficient through the following key avenues:

Efficient Optimization: I aim to develop advanced optimization techniques to improve the alignment and soft fine-tuning of LLMs. By refining these methods, I seek to enable LLMs to handle complex tasks with greater efficiency and accuracy.
Efficient Inference: I focus on streamlining inference processes to enable LLMs to deliver faster predictions while maintaining comparable accuracy, which is crucial for real-time applications.
Efficient Adaptation: I explore strategies for adapting LLMs to various downstream tasks with minimal supervision, employing few-shot learning techniques to address domain-specific challenges using limited annotated data.

Teaching

At NTU

SC1007/CE1107/CZ1107 Data Structures and Algorithms (Semester 2, AY2022-2023). Teaching Assistant
CZ3007 Compiler Techniques (Semester 1, AY2022-2023). Teaching Assistant

At FDU

DATA62004.01 Neural Network and Deep Learning (Spring 2020). Teaching Assistant
COMP130137.01 Pattern Recognition & Machine Learning (Spring 2020). Teaching Assistant
MANA130376.01 Big Data driven Business Analytics and Application (Spring 2019). Teaching Assistant

Awards

Outstanding Students of Master's Degrees at Fudan University, 2020
How to Fine-Tune BERT for Text Classification?, CCL 2019 Best Paper Award

Keynotes & Talks

An Introduction to Prompting Methods, NTU Singapore, 04/05/2022. [Slides]
Multi-perspective Optimization of Pre-trained Language Model, at NTU Student Lecture Series (SLS), Singapore, 24/03/2022. [Slides] [Video]
An Introduction of Transformer, NTU Singapore, 25/08/2021. [Slides]

Professional Services

Conference Reviewer / PC Members

ACL Rolling Review (since January 2022)
ACL (2021, 2023-2025)
EMNLP (2021-2025), Outstanding Reviewer at EMNLP 2024.
NAACL (2021, 2022, 2024, 2025)
EACL (2024)
NeurIPS (2025)
COLM (2024, 2025)
NLPCC (2024, 2025)

Journal Reviewer

Information Sciences
IEEE/ACM Transactions on Audio, Speech, and Language Processing

Publications

(*: Equal contribution)

[New!] SoftCoT++: Test-Time Scaling with Soft Chain-of-Thought Reasoning, (arXiv preprint), 2025. [BibTeX] [PDF] [Code]
Yige Xu*, Xu Guo*, Zhiwei Zeng, Chunyan Miao. [Abstract]

Abstract: Test-Time Scaling (TTS) refers to approaches that improve reasoning performance by allocating extra computation during inference, without altering the model's parameters. While existing TTS methods operate in a discrete token space by generating more intermediate steps, recent studies in Coconut and SoftCoT have demonstrated that thinking in the continuous latent space can further enhance the reasoning performance. Such latent thoughts encode informative thinking without the information loss associated with autoregressive token generation, sparking increased interest in continuous-space reasoning. Unlike discrete decoding, where repeated sampling enables exploring diverse reasoning paths, latent representations in continuous space are fixed for a given input, which limits diverse exploration, as all decoded paths originate from the same latent thought. To overcome this limitation, we introduce SoftCoT++ to extend SoftCoT to the Test-Time Scaling paradigm by enabling diverse exploration of thinking paths. Specifically, we perturb latent thoughts via multiple specialized initial tokens and apply contrastive learning to promote diversity among soft thought representations. Experiments across five reasoning benchmarks and two distinct LLM architectures demonstrate that SoftCoT++ significantly boosts SoftCoT and also outperforms SoftCoT with self-consistency scaling. Moreover, it shows strong compatibility with conventional scaling techniques such as self-consistency.

BibTeX:

@article{xu2025softcotpp,
  title={{SoftCoT++}: Test-Time Scaling with Soft Chain-of-Thought Reasoning},
  author={Xu, Yige and Guo, Xu and Zeng, Zhiwei and Miao, Chunyan},
  journal={arXiv preprint arXiv:2505.11484},
  year={2025}
}

[New!] SoftCoT: Soft Chain-of-Thought for Efficient Reasoning with LLMs, (ACL), 2025. [BibTeX] [PDF] [Code]
Yige Xu*, Xu Guo*, Zhiwei Zeng, Chunyan Miao. [Abstract]

Abstract: Chain-of-Thought (CoT) reasoning enables Large Language Models (LLMs) to solve complex reasoning tasks by generating intermediate reasoning steps. However, most existing approaches focus on hard token decoding, which constrains reasoning within the discrete vocabulary space and may not always be optimal. While recent efforts explore continuous-space reasoning, they often suffer from catastrophic forgetting, limiting their applicability to state-of-the-art LLMs that already perform well in zero-shot settings with a proper instruction. To address this challenge, we propose a novel approach for continuous-space reasoning that does not require modifying the underlying LLM. Specifically, we employ a lightweight assistant model to generate instance-specific soft thought tokens speculatively as the initial chain of thoughts, which are then mapped into the LLM's representation space via a projection module. Experimental results on five reasoning benchmarks demonstrate that our method enhances LLM reasoning performance through supervised, parameter-efficient fine-tuning.

BibTeX:

@inproceedings{xu2025softcot,
  title={{SoftCoT}: Soft Chain-of-Thought for Efficient Reasoning with LLMs},
  author={Xu, Yige and Guo, Xu and Zeng, Zhiwei and Miao, Chunyan},
  booktitle={Proceedings of {ACL}},
  year={2025}
}

RevMUX: Data Multiplexing with Reversible Adapters for Efficient LLM Batch Inference, (EMNLP), 2024. [BibTeX] [PDF] [Slides] [Code]
Yige Xu, Xu Guo, Zhiwei Zeng, Chunyan Miao. [Abstract]

Abstract: Large language models (LLMs) have brought a great breakthrough to the natural language processing (NLP) community, while leading the challenge of handling concurrent customer queries due to their high throughput demands. Data multiplexing addresses this by merging multiple inputs into a single composite input, allowing more efficient inference through a shared forward pass. However, as distinguishing individuals from a composite input is challenging, conventional methods typically require training the entire backbone, yet still suffer from performance degradation. In this paper, we introduce RevMUX, a parameter-efficient data multiplexing framework that incorporates a reversible design in the multiplexer, which can be reused by the demultiplexer to perform reverse operations and restore individual samples for classification. Extensive experiments on four datasets and three types of LLM backbones demonstrate the effectiveness of RevMUX for enhancing LLM inference efficiency while retaining a satisfactory classification performance.

BibTeX:

@inproceedings{xu-etal-2024-revmux,
    title = "{R}ev{MUX}: Data Multiplexing with Reversible Adapters for Efficient {LLM} Batch Inference",
    author = "Xu, Yige  and
      Guo, Xu  and
      Zeng, Zhiwei  and
      Miao, Chunyan",
    booktitle = "Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing",
    month = nov,
    year = "2024",
    address = "Miami, Florida, USA",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.emnlp-main.1232",
    pages = "22072--22087",
}

Efficient Cross-Task Prompt Tuning for Few-Shot Conversational Emotion Recognition, (Findings of EMNLP), 2023. [BibTeX] [PDF] [Slides]
Yige Xu, Zhiwei Zeng, Zhiqi Shen. [Abstract]

Abstract: Emotion Recognition in Conversation (ERC) has been widely studied due to its importance in developing emotion-aware empathetic machines. The rise of pre-trained language models (PLMs) has further pushed the limit of ERC performance. However, most recent works on ERC using PLMs are heavily data-driven, and requires fine-tuning the entire PLMs. To improve both sample and computational efficiency, we propose a derivative-free optimization method called Cross-Task Prompt Tuning (CTPT) for few-shot conversational emotion recognition. Unlike existing methods that learn independent knowledge from individual tasks, CTPT leverages sharable cross-task knowledge by exploiting external knowledge from other source tasks to improve learning performance under the few-shot setting. Moreover, CTPT only needs to optimize a vector under the low intrinsic dimensionality without gradient, which is highly parameter-efficient compared with existing approaches. Experiments on five different contextual conversation datasets demonstrate that our CTPT method has superior results on both few-shot scenarios and zero-shot transfers.

BibTeX:

@inproceedings{xu-etal-2023-efficient,
    title = "Efficient Cross-Task Prompt Tuning for Few-Shot Conversational Emotion Recognition",
    author = "Xu, Yige  and
      Zeng, Zhiwei  and
      Shen, Zhiqi",
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2023",
    month = dec,
    year = "2023",
    address = "Singapore",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.findings-emnlp.780",
    pages = "11654--11666"
}

Improving BERT Fine-Tuning via Self-Ensemble and Self-Distillation, Journal of Computer Science and Technology, (JCST), Vol. 38(4), pp. 853-866, July 2023. [BibTeX] [DOI] [PDF]
Yige Xu, Xipeng Qiu, Ligao Zhou, Xuanjing Huang. [Abstract]

Abstract: Fine-tuning pre-trained language models like BERT has become an effective way in NLP and yields state-of-the-art results on many downstream tasks. Recent studies on adapting BERT to new tasks mainly focus on modifying the model structure, re-designing the pre-train tasks, and leveraging external data and knowledge. The fine-tuning strategy itself has yet to be fully explored. In this paper, we improve the fine-tuning of BERT with two effective mechanisms: self-ensemble and self-distillation. Experiments on GLUE benchmark and Text Classification benchmark show that our proposed methods can significantly improve the adaption of BERT without any external data or knowledge. We conduct exhaustive experiments to investigate the efficiency of self-ensemble and self-distillation mechanisms, and our proposed methods achieve a new state-of-the-art result on the SNLI dataset.

BibTeX:

@article{xu2023jcst-self-distillation,
	title={Improving {BERT} Fine-Tuning via Self-Ensemble and Self-Distillation},
	author={Xu, Yige and Qiu, Xipeng and Zhou, Ligao and Huang, Xuanjing},
	journal={J. Comput. Sci. Technol.},
	volume={38},
	number={4},
	pages={853--866},
	year = {2023},
	doi = {10.1007/s11390-021-1119-0}
}

MedChemLens: An Interactive Visual Tool to Support Direction Selection in Interdisciplinary Experimental Research of Medicinal Chemistry, IEEE Transactions on Visualization and Computer Graphics, (In Proceedings of VIS 2022), 2023. [BibTeX] [PDF] [Slides]
Chuhan Shi, Fei Nie, Yicheng Hu, Yige Xu, Lei Chen, Xiaojuan Ma, Qiong Luo. [Abstract]

Abstract: Interdisciplinary experimental science (e.g., medicinal chemistry) refers to the disciplines that integrate knowledge from different scientific backgrounds and involve experiments in the research process. Deciding “in what direction to proceed” is critical for the success of the research in such disciplines, since the time, money, and resource costs of the subsequent research steps depend largely on this decision. However, such a direction identification task is challenging in that researchers need to integrate information from large-scale, heterogeneous materials from all associated disciplines and summarize the related publications of which the core contributions are often showcased in diverse formats. The task also requires researchers to estimate the feasibility and potential in future experiments in the selected directions. In this work, we selected medicinal chemistry as a case and presented an interactive visual tool, MedChemLens, to assist medicinal chemists in choosing their intended directions of research. This task is also known as drug target (i.e., disease-linked proteins) selection. Given a candidate target name, MedChemLens automatically extracts the molecular features of drug compounds from chemical papers and clinical trial records, organizes them based on the drug structures, and interactively visualizes factors concerning subsequent experiments. We evaluated MedChemLens through a within-subjects study (N=16). Compared with the control condition (i.e., unrestricted online search without using our tool), participants who only used MedChemLens reported faster search, better-informed selections, higher confidence in their selections, and lower cognitive load.

BibTeX:

@article{DBLP:journals/tvcg/ShiNHXCML23,
  author    = {Chuhan Shi and
               Fei Nie and
               Yicheng Hu and
               Yige Xu and
               Lei Chen and
               Xiaojuan Ma and
               Qiong Luo},
  title     = {{MedChemLens}: An Interactive Visual Tool to Support Direction Selection
               in Interdisciplinary Experimental Research of Medicinal Chemistry},
  journal   = {{IEEE} Trans. Vis. Comput. Graph.},
  volume    = {29},
  number    = {1},
  pages     = {63--73},
  year      = {2023},
  url       = {https://doi.org/10.1109/TVCG.2022.3209434},
  doi       = {10.1109/TVCG.2022.3209434},
}

Keyphrase Generation with Fine-Grained Evaluation-Guided Reinforcement Learning, (Findings of EMNLP), 2021. [BibTeX] [PDF] [Slides] [Video] [Code]
Yichao Luo*, Yige Xu*, Jiacheng Ye, Xipeng Qiu, Qi Zhang. [Abstract]

Abstract: Aiming to generate a set of keyphrases, Keyphrase Generation (KG) is a classical task for capturing the central idea from a given document. Based on Seq2Seq models, the previous reinforcement learning framework on KG tasks utilizes the evaluation metrics to further improve the well-trained neural models. However, these KG evaluation metrics such as F1@5 and F1@M are only aware of the exact correctness of predictions on phrase-level and ignore the semantic similarities between similar predictions and targets, which inhibits the model from learning deep linguistic patterns. In response to this problem, we propose a new fine-grained evaluation metric to improve the RL framework, which considers different granularities: token-level F1 score, edit distance, duplication, and prediction quantities. On the whole, the new framework includes two reward functions: the fine-grained evaluation score and the vanilla F1 score. This framework helps the model identifying some partial match phrases which can be further optimized as the exact match ones. Experiments on KG benchmarks show that our proposed training framework outperforms the previous RL training frameworks among all evaluation scores. In addition, our method can effectively ease the synonym problem and generate a higher quality prediction. The source code is available at this URL.

BibTeX:

@inproceedings{luo2021keyphrase,
    title = "Keyphrase Generation with Fine-Grained Evaluation-Guided Reinforcement Learning",
    author = "Luo, Yichao  and
      Xu, Yige  and
      Ye, Jiacheng  and
      Qiu, Xipeng  and
      Zhang, Qi",
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2021",
    month = nov,
    year = "2021",
    address = "Punta Cana, Dominican Republic",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.findings-emnlp.45",
    pages = "497--507",
}

Searching Effective Transformer for Seq2Seq Keyphrase Generation, CCF International Conference on Natural Language Processing and Chinese Computing, (NLPCC), 2021. [BibTeX] [DOI] [PDF]
Yige Xu*, Yichao Luo*, Yicheng Zou, Zhengyan Li, Qi Zhang, Xipeng Qiu, Xuanjing Huang [Abstract]

Abstract: Keyphrase Generation (KG) aims to generate a set of keyphrases to represent the topic information of a given document, which is a worthy task of Natural Language Processing (NLP). Recently, the Transformer structure with fully-connected self-attention blocks has been widely used in many NLP tasks due to its advantage of parallelism and global context modeling. However, in KG tasks, Transformer-based models can hardly beat the recurrent-based models. Our observations also confirm this phenomenon. Based on our observations, we state a hypothesis to explain why Transformer-based models perform poorly in KG tasks. In this paper, we conducted exhaustive experiments to confirm our hypothesis, and search for an effective Transformer model for keyphrase generation. Comprehensive experiments on multiple KG benchmarks showed that: (1) In KG tasks, uninformative content abounds in documents while salient information is diluted globally. (2) The vanilla Transformer equipped with a fully-connected self-attention mechanism may overlook the local context, leading to performance degradation. (3) We add constraints to the self-attention mechanism and introduce direction information to improve the vanilla Transformer model, which achieves state-of-the-art performance on KG benchmarks.

BibTeX:

@inproceedings{xu2021nlpcc-searching-keyphrase,
	title={Searching Effective Transformer for Seq2Seq Keyphrase Generation},
	author={Xu, Yige and Luo, Yichao and Zou, Yicheng and Li, Zhengyan and Zhang, Qi and Qiu, Xipeng and Huang, Xuanjing},
	booktitle={Natural Language Processing and Chinese Computing - 10th {CCF} International Conference, {NLPCC} 2021, Qingdao, China, October 13-17, 2021, Proceedings, Part {II}},
	series={Lecture Notes in Computer Science},
	volume={13029},
	pages={86--97},
	publisher={Springer},
	year={2021},
	url={https://doi.org/10.1007/978-3-030-88483-3\_7},
}

ONE2SET: Generating Diverse Keyphrases as a Set, (ACL), 2021. [BibTeX] [PDF] [Code]
Jiacheng Ye, Tao Gui, Yichao Luo, Yige Xu, Qi Zhang. [Abstract]

Abstract: Recently, the sequence-to-sequence models have made remarkable progress on the task of keyphrase generation (KG) by concatenating multiple keyphrases in a predefined order as a target sequence during training. However, the keyphrases are inherently an unordered set rather than an ordered sequence. Imposing a predefined order will give wrong bias during training, which can highly penalize shifts in the order between keyphrases. In this work, we introduce a new training paradigm ONE2SET without predefining an order to concatenate the keyphrases. To fit this paradigm, we propose a novel model that consists of a fixed set of learned control codes to generate keyphrases in parallel. To solve the problem that there is no correspondence between each prediction and target during training, we introduce a K-step target assignment mechanism via bipartite matching, which greatly increases the diversity and reduces the duplication ratio of generated keyphrases. The experimental results on multiple benchmarks demonstrate that our approach significantly outperforms the state-of-the-art methods.

BibTeX:

@inproceedings{ye2021one2set,
    title = "{O}ne2{S}et: {G}enerating Diverse Keyphrases as a Set",
    author = "Ye, Jiacheng  and
      Gui, Tao  and
      Luo, Yichao  and
      Xu, Yige  and
      Zhang, Qi",
    booktitle = "Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)",
    month = aug,
    year = "2021",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.acl-long.354",
    doi = "10.18653/v1/2021.acl-long.354",
    pages = "4598--4608",
}

Pre-trained Models for Natural Language Processing: A Survey, SCIENCE CHINA Technological Sciences, (Most Influential Paper of SCTS in 2021), 2020. [BibTeX] [DOI] [PDF]
Xipeng Qiu, Tianxiang Sun, Yige Xu, Yunfan Shao, Ning Dai, Xuanjing Huang. [Abstract]

Abstract: Recently, the emergence of pre-trained models (PTMs) has brought natural language processing (NLP) to a new era. In this survey, we provide a comprehensive review of PTMs for NLP. We first briefly introduce language representation learning and its research progress. Then we systematically categorize existing PTMs based on a taxonomy with four perspectives. Next, we describe how to adapt the knowledge of PTMs to the downstream tasks. Finally, we outline some potential directions of PTMs for future research. This survey is purposed to be a hands-on guide for understanding, using, and developing PTMs for various NLP tasks.

BibTeX:

@article{qiu2020:scts-ptms,
	author = {Xipeng Qiu and Tianxiang Sun and Yige Xu and Yunfan Shao and Ning Dai and Xuanjing Huang},
	title = {Pre-trained Models for Natural Language Processing: A Survey},
	journal = {SCIENCE CHINA Technological Sciences},
	publisher = {Science China Press},
	year = {2020},
	volume = {63},
	number = {10},
	pages = {1872--1897},
	doi = {10.1007/s11431-020-1647-3}
}

How to Fine-Tune BERT for Text Classification? China National Conference on Chinese Computational Linguistics, (CCL, Best Paper Award), 2019. [BibTeX] [DOI] [PDF] [Code]
Chi Sun, Xipeng Qiu, Yige Xu, Xuanjing Huang. [Abstract]

Abstract: Language model pre-training has proven to be useful in learning universal language representations. As a state-of-the-art language model pre-training model, BERT (Bidirectional Encoder Representations from Transformers) has achieved amazing results in many language understanding tasks. In this paper, we conduct exhaustive experiments to investigate different fine-tuning methods of BERT on text classification task and provide a general solution for BERT fine-tuning. Finally, the proposed solution obtains new state-of-the-art results on eight widely-studied text classification datasets.

BibTeX:

@inproceedings{sun2019fine,
  title={How to fine-tune {BERT} for text classification?},
  author={Sun, Chi and Qiu, Xipeng and Xu, Yige and Huang, Xuanjing},
  booktitle={China National Conference on Chinese Computational Linguistics},
  pages={194--206},
  year={2019},
  organization={Springer}
}

Yige Xu (许一格)
