2025

SafeChain: Safety of Language Models with Long Chain-of-Thought Reasoning Capabilities

Fengqing Jiang, Zhangchen Xu, Yuetai Li, Luyao Niu, Zhen Xiang, Bo Li, Bill Yuchen Lin, Radha Poovendran

Preprint

TL;DR

Emerging large reasoning models (LRMs), such as DeepSeek-R1 models, leverage long chain-of-thought (CoT) reasoning to generate structured intermediate steps, enhancing their reasoning capabilities. However, long CoT does not inherently guarantee safe outputs, potentially leading to harmful consequences such as the introduction of security vulnerabilities in code or the spread of misinformation. Current research on large language model (LLM) safety usually focuses on short-answer responses, overlooking the long CoT style outputs of LRMs. To bridge this gap, we conduct a systematic study of LRM safety. First, we investigate safety evaluators calibrated against human annotations. Using our newly developed metrics, we thoroughly assess the safety of 12 state-of-the-art LRMs on StrongReject and WildJailbreak datasets. Our results show that LRMs are not safe compared to their reasoning advance. Further, we perform a fine-grained analysis of the reasoning trace and final answer. We find that three decoding strategies-ZeroThink, LessThink, and MoreThink-can improve model safety without additional training. However, these strategies either use constrained reasoning traces or incur high inference costs. To better strengthen LRM safety, we introduce SafeChain, the first-of-its-kind safety training dataset in CoT style. We fine-tune two LRMs with SafeChain, showing that it not only enhances model safety but also preserves performance across 6 reasoning benchmarks.

Media Coverage: AI World

SafeChain: Safety of Language Models with Long Chain-of-Thought Reasoning Capabilities

Fengqing Jiang, Zhangchen Xu, Yuetai Li, Luyao Niu, Zhen Xiang, Bo Li, Bill Yuchen Lin, Radha Poovendran

Preprint

TL;DR

Emerging large reasoning models (LRMs), such as DeepSeek-R1 models, leverage long chain-of-thought (CoT) reasoning to generate structured intermediate steps, enhancing their reasoning capabilities. However, long CoT does not inherently guarantee safe outputs, potentially leading to harmful consequences such as the introduction of security vulnerabilities in code or the spread of misinformation. Current research on large language model (LLM) safety usually focuses on short-answer responses, overlooking the long CoT style outputs of LRMs. To bridge this gap, we conduct a systematic study of LRM safety. First, we investigate safety evaluators calibrated against human annotations. Using our newly developed metrics, we thoroughly assess the safety of 12 state-of-the-art LRMs on StrongReject and WildJailbreak datasets. Our results show that LRMs are not safe compared to their reasoning advance. Further, we perform a fine-grained analysis of the reasoning trace and final answer. We find that three decoding strategies-ZeroThink, LessThink, and MoreThink-can improve model safety without additional training. However, these strategies either use constrained reasoning traces or incur high inference costs. To better strengthen LRM safety, we introduce SafeChain, the first-of-its-kind safety training dataset in CoT style. We fine-tune two LRMs with SafeChain, showing that it not only enhances model safety but also preserves performance across 6 reasoning benchmarks.

Media Coverage: AI World

Small Models Struggle to Learn from Strong Reasoners

Yuetai Li, Xiang Yue, Zhangchen Xu, Fengqing Jiang, Luyao Niu, Bill Yuchen Lin, Bhaskar Ramasubramanian, Radha Poovendran

Preprint

TL;DR

This paper reveals that small models (≤3B parameters) struggle with long chain-of-thought reasoning and instead perform better with shorter, simpler reasoning chains that align with their learning capacity. To address this, the authors introduce Mix Distillation—a method that blends long and short reasoning examples—which significantly boosts small model performance compared to using either type alone.

Small Models Struggle to Learn from Strong Reasoners

Yuetai Li, Xiang Yue, Zhangchen Xu, Fengqing Jiang, Luyao Niu, Bill Yuchen Lin, Bhaskar Ramasubramanian, Radha Poovendran

Preprint

TL;DR

This paper reveals that small models (≤3B parameters) struggle with long chain-of-thought reasoning and instead perform better with shorter, simpler reasoning chains that align with their learning capacity. To address this, the authors introduce Mix Distillation—a method that blends long and short reasoning examples—which significantly boosts small model performance compared to using either type alone.

Magpie: Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing

Zhangchen Xu, Fengqing Jiang, Luyao Niu, Yuntian Deng, Radha Poovendran, Yejin Choi, Bill Yuchen Lin

ICLR 2025

TL;DR

This paper introduces Magpie, a self-synthesis method that extracts large-scale, high-quality instruction data directly from aligned LLMs by prompting them with partial templates. By generating 4 million instruction-response pairs and filtering them down to 300K high-quality instances, the approach enables fine-tuned models to perform comparably to or even outperform those trained on much larger proprietary datasets, advancing the democratization of AI alignment.

Media Coverage: 新智元

Magpie: Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing

Zhangchen Xu, Fengqing Jiang, Luyao Niu, Yuntian Deng, Radha Poovendran, Yejin Choi, Bill Yuchen Lin

ICLR 2025

TL;DR

This paper introduces Magpie, a self-synthesis method that extracts large-scale, high-quality instruction data directly from aligned LLMs by prompting them with partial templates. By generating 4 million instruction-response pairs and filtering them down to 300K high-quality instances, the approach enables fine-tuned models to perform comparably to or even outperform those trained on much larger proprietary datasets, advancing the democratization of AI alignment.

Media Coverage: 新智元

Stronger Models are NOT Stronger Teachers for Instruction Tuning

Zhangchen Xu, Fengqing Jiang, Luyao Niu, Bill Yuchen Lin, Radha Poovendran

NAACL 2025

TL;DR

This paper challenges the common assumption that larger models are inherently better teachers for instruction tuning, revealing a "Larger Models' Paradox" where stronger models don't necessarily yield better results when fine-tuning smaller models. To address this, the authors propose a novel metric—Compatibility-Adjusted Reward (CAR)—which more accurately predicts the effectiveness of response generators by considering compatibility between teacher and base models, outperforming existing metrics across various experiments.

Media Coverage: Huggingface Daily Paper

Stronger Models are NOT Stronger Teachers for Instruction Tuning

Zhangchen Xu, Fengqing Jiang, Luyao Niu, Bill Yuchen Lin, Radha Poovendran

NAACL 2025

TL;DR

This paper challenges the common assumption that larger models are inherently better teachers for instruction tuning, revealing a "Larger Models' Paradox" where stronger models don't necessarily yield better results when fine-tuning smaller models. To address this, the authors propose a novel metric—Compatibility-Adjusted Reward (CAR)—which more accurately predicts the effectiveness of response generators by considering compatibility between teacher and base models, outperforming existing metrics across various experiments.

Media Coverage: Huggingface Daily Paper

ChatBug: A Common Vulnerability of Aligned LLMs Induced by Chat Templates

Fengqing Jiang, Zhangchen Xu, Luyao Niu, Bill Yuchen Lin, Radha Poovendran

AAAI 2025 (AIA)

TL;DR

This paper reveals that using rigid chat templates for instruction tuning can inadvertently introduce a vulnerability, termed ChatBug, which attackers can exploit by deviating from the expected format to bypass safety alignments. The authors demonstrate that ChatBug can trigger unintended responses in multiple state-of-the-art LLMs and, although adversarial training can mitigate this risk, it significantly degrades performance, highlighting a trade-off between safety and helpfulness.

ChatBug: A Common Vulnerability of Aligned LLMs Induced by Chat Templates

Fengqing Jiang, Zhangchen Xu, Luyao Niu, Bill Yuchen Lin, Radha Poovendran

AAAI 2025 (AIA)

TL;DR

This paper reveals that using rigid chat templates for instruction tuning can inadvertently introduce a vulnerability, termed ChatBug, which attackers can exploit by deviating from the expected format to bypass safety alignments. The authors demonstrate that ChatBug can trigger unintended responses in multiple state-of-the-art LLMs and, although adversarial training can mitigate this risk, it significantly degrades performance, highlighting a trade-off between safety and helpfulness.

2024

CleanGen: Mitigating Backdoor Attacks for Generation Tasks in Large Language Models

Yuetai Li, Zhangchen Xu, Fengqing Jiang, Luyao Niu, Dinuka Sahabandu, Bhaskar Ramasubramanian, Radha Poovendran

EMNLP 2024

TL;DR

This paper introduces CleanGen, an inference-time defense for LLMs that mitigates backdoor attacks by identifying and replacing suspicious tokens—those with unusually high probabilities in compromised models—with tokens from a trusted, uncompromised LLM. Empirical evaluations demonstrate that CleanGen significantly reduces attack success rates across multiple backdoor attacks while preserving the quality and helpfulness of responses with minimal computational overhead.

CleanGen: Mitigating Backdoor Attacks for Generation Tasks in Large Language Models

Yuetai Li, Zhangchen Xu, Fengqing Jiang, Luyao Niu, Dinuka Sahabandu, Bhaskar Ramasubramanian, Radha Poovendran

EMNLP 2024

TL;DR

This paper introduces CleanGen, an inference-time defense for LLMs that mitigates backdoor attacks by identifying and replacing suspicious tokens—those with unusually high probabilities in compromised models—with tokens from a trusted, uncompromised LLM. Empirical evaluations demonstrate that CleanGen significantly reduces attack success rates across multiple backdoor attacks while preserving the quality and helpfulness of responses with minimal computational overhead.

ACE: A Model Poisoning Attack on Contribution Evaluation Methods in Federated Learning

Zhangchen Xu, Fengqing Jiang, Luyao Niu, Jinyuan Jia, Bo Li, Radha Poovendran

USENIX Security 2024

TL;DR

This paper introduces ACE, the first model poisoning attack targeting contribution evaluation in Federated Learning, allowing malicious clients to falsely boost their perceived contributions despite using low-quality data. Both theoretical and empirical results show that ACE deceives multiple state-of-the-art evaluation methods—while preserving global model accuracy—and that existing countermeasures are insufficient, highlighting the need for more robust defenses.

ACE: A Model Poisoning Attack on Contribution Evaluation Methods in Federated Learning

Zhangchen Xu, Fengqing Jiang, Luyao Niu, Jinyuan Jia, Bo Li, Radha Poovendran

USENIX Security 2024

TL;DR

This paper introduces ACE, the first model poisoning attack targeting contribution evaluation in Federated Learning, allowing malicious clients to falsely boost their perceived contributions despite using low-quality data. Both theoretical and empirical results show that ACE deceives multiple state-of-the-art evaluation methods—while preserving global model accuracy—and that existing countermeasures are insufficient, highlighting the need for more robust defenses.

SafeDecoding: Defending against Jailbreak Attacks via Safety-Aware Decoding

Zhangchen Xu, Fengqing Jiang, Luyao Niu, Jinyuan Jia, Bill Yuchen Lin, Radha Poovendran

ACL 2024

TL;DR

This paper introduces SafeDecoding, a safety-aware decoding strategy for LLMs that leverages token probability insights to reduce the risk of jailbreak attacks. Extensive experiments show that SafeDecoding effectively lowers the success rate and harmfulness of various attacks across multiple LLMs, while maintaining response helpfulness and outperforming existing defense methods.

SafeDecoding: Defending against Jailbreak Attacks via Safety-Aware Decoding

Zhangchen Xu, Fengqing Jiang, Luyao Niu, Jinyuan Jia, Bill Yuchen Lin, Radha Poovendran

ACL 2024

TL;DR

This paper introduces SafeDecoding, a safety-aware decoding strategy for LLMs that leverages token probability insights to reduce the risk of jailbreak attacks. Extensive experiments show that SafeDecoding effectively lowers the success rate and harmfulness of various attacks across multiple LLMs, while maintaining response helpfulness and outperforming existing defense methods.

ArtPrompt: ASCII Art-based Jailbreak Attacks against Aligned LLMs

Fengqing Jiang, Zhangchen Xu, Luyao Niu, Zhen Xiang, Bhaskar Ramasubramanian, Bo Li, Radha Poovendran

ACL 2024

TL;DR

This paper reveals that current LLM safety methods, which assume text is interpreted only semantically, can be bypassed using ASCII art. The authors introduce ArtPrompt, an ASCII art-based jailbreak attack, and the Vision-in-Text Challenge (ViTC) benchmark to demonstrate that leading LLMs struggle with non-semantic cues, enabling attackers to trigger undesired behaviors with black-box access.

Media Coverage: X | DeepLearning.AI | Ars Technica | ScienceNewsExplore | tom's Hardware

ArtPrompt: ASCII Art-based Jailbreak Attacks against Aligned LLMs

Fengqing Jiang, Zhangchen Xu, Luyao Niu, Zhen Xiang, Bhaskar Ramasubramanian, Bo Li, Radha Poovendran

ACL 2024

TL;DR

This paper reveals that current LLM safety methods, which assume text is interpreted only semantically, can be bypassed using ASCII art. The authors introduce ArtPrompt, an ASCII art-based jailbreak attack, and the Vision-in-Text Challenge (ViTC) benchmark to demonstrate that leading LLMs struggle with non-semantic cues, enabling attackers to trigger undesired behaviors with black-box access.

Media Coverage: X | DeepLearning.AI | Ars Technica | ScienceNewsExplore | tom's Hardware

Identifying and Mitigating Vulnerabilities in LLM-Integrated Applications

Fengqing Jiang, Zhangchen Xu, Luyao Niu, Boxin Wang, Jinyuan Jia, Bo Li, Radha Poovendran

AsiaCCS 2024 (Poster)

TL;DR

This paper examines new security vulnerabilities in LLM-integrated applications, where malicious insiders or external attackers can manipulate the query-response process to force LLMs (such as GPT-3.5 and GPT-4) into producing biased, toxic, or misleading outputs. To counter these risks, the authors propose a lightweight, threat-agnostic defense that enforces integrity, source identification, attack detectability, and utility preservation, demonstrating its effectiveness through empirical evaluation.

Identifying and Mitigating Vulnerabilities in LLM-Integrated Applications

Fengqing Jiang, Zhangchen Xu, Luyao Niu, Boxin Wang, Jinyuan Jia, Bo Li, Radha Poovendran

AsiaCCS 2024 (Poster)

TL;DR

This paper examines new security vulnerabilities in LLM-integrated applications, where malicious insiders or external attackers can manipulate the query-response process to force LLMs (such as GPT-3.5 and GPT-4) into producing biased, toxic, or misleading outputs. To counter these risks, the authors propose a lightweight, threat-agnostic defense that enforces integrity, source identification, attack detectability, and utility preservation, demonstrating its effectiveness through empirical evaluation.

Brave: Byzantine-Resilient and Privacy-Preserving Peer-to-Peer Federated Learning

Zhangchen Xu, Fengqing Jiang, Luyao Niu, Jinyuan Jia, Radha Poovendran

AsiaCCS 2024 (Poster)

TL;DR

This paper introduces Brave, a protocol for peer-to-peer federated learning that simultaneously preserves privacy against honest-but-curious adversaries and ensures Byzantine resilience. Brave guarantees that malicious participants cannot infer private data and that all benign participants converge to a global model with bounded deviation, achieving competitive accuracy even in adversarial settings.

Brave: Byzantine-Resilient and Privacy-Preserving Peer-to-Peer Federated Learning

Zhangchen Xu, Fengqing Jiang, Luyao Niu, Jinyuan Jia, Radha Poovendran

AsiaCCS 2024 (Poster)

TL;DR

This paper introduces Brave, a protocol for peer-to-peer federated learning that simultaneously preserves privacy against honest-but-curious adversaries and ensures Byzantine resilience. Brave guarantees that malicious participants cannot infer private data and that all benign participants converge to a global model with bounded deviation, achieving competitive accuracy even in adversarial settings.

BadChain: Backdoor Chain-of-Thought Prompting for Large Language Models

Zhen Xiang, Fengqing Jiang, Zidi Xiong, Bhaskar Ramasubramanian, Radha Poovendran, Bo Li

ICLR 2024

TL;DR

This paper introduces BadChain, a backdoor attack against large language models that exploits chain-of-thought prompting by inserting a malicious reasoning step without needing access to the training data or model parameters. When a backdoor trigger is present in a query, the modified demonstration examples lead the model to output unintended content, with high attack success rates observed especially on models with strong reasoning capabilities like GPT-4.

BadChain: Backdoor Chain-of-Thought Prompting for Large Language Models

Zhen Xiang, Fengqing Jiang, Zidi Xiong, Bhaskar Ramasubramanian, Radha Poovendran, Bo Li

ICLR 2024

TL;DR

This paper introduces BadChain, a backdoor attack against large language models that exploits chain-of-thought prompting by inserting a malicious reasoning step without needing access to the training data or model parameters. When a backdoor trigger is present in a query, the modified demonstration examples lead the model to output unintended content, with high attack success rates observed especially on models with strong reasoning capabilities like GPT-4.

2023

MDTD: A Multi-Domain Trojan Detector for Deep Neural Networks

Arezoo Rajabi, Surudhi Asokraj, Fengqing Jiang, Luyao Niu, Bhaskar Ramasubramanian, James Ritcey, Radha Poovendran

ACM CCS 2023

TL;DR

This paper introduces MDTD, a multi-domain Trojan detector that leverages adversarial learning to estimate an input’s distance from a decision boundary, thereby identifying backdoor-triggered samples across image, audio, and graph-based models. Extensive evaluations show that MDTD can effectively detect various types of Trojan triggers—even under adaptive attacks—while preserving high accuracy on benign inputs.

MDTD: A Multi-Domain Trojan Detector for Deep Neural Networks

Arezoo Rajabi, Surudhi Asokraj, Fengqing Jiang, Luyao Niu, Bhaskar Ramasubramanian, James Ritcey, Radha Poovendran

ACM CCS 2023

TL;DR

This paper introduces MDTD, a multi-domain Trojan detector that leverages adversarial learning to estimate an input’s distance from a decision boundary, thereby identifying backdoor-triggered samples across image, audio, and graph-based models. Extensive evaluations show that MDTD can effectively detect various types of Trojan triggers—even under adaptive attacks—while preserving high accuracy on benign inputs.

2021

A Chinese Multi-type Complex Questions Answering Dataset over Wikidata

Jianyun Zou, Min Yang, Lichao Zhang, Yechen Xu, Qifan Pan, Fengqing Jiang, Ran Qin, Shushu Wang, Yifan He, Songfang Huang, Zhou Zhao

Preprint

TL;DR

This paper introduces CLC-QuAD, the first large-scale dataset for complex Chinese KBQA using Wikidata, addressing the language and diversity limitations of existing resources. It also presents a text-to-SPARQL baseline model capable of handling various complex question types and evaluates current state-of-the-art KBQA models, highlighting challenges specific to Chinese.

A Chinese Multi-type Complex Questions Answering Dataset over Wikidata

Jianyun Zou, Min Yang, Lichao Zhang, Yechen Xu, Qifan Pan, Fengqing Jiang, Ran Qin, Shushu Wang, Yifan He, Songfang Huang, Zhou Zhao

Preprint

TL;DR

This paper introduces CLC-QuAD, the first large-scale dataset for complex Chinese KBQA using Wikidata, addressing the language and diversity limitations of existing resources. It also presents a text-to-SPARQL baseline model capable of handling various complex question types and evaluates current state-of-the-art KBQA models, highlighting challenges specific to Chinese.

Towards Refinement of Unbounded Parallelism in ASMs Using Concurrency and Reflection

Fengqing Jiang, Neng Xiong, Xinyu Lian, Senén González, Klaus-Dieter Schewe

8th International Conference on Rigorous State-Based Methods

TL;DR

This paper introduces a method to integrate the BSP bridging model with MapReduce processing by using a work-stealing approach, where idle processors autonomously select and execute tasks from a pool of open threads. It further generalizes this method by refining unboundedly parallel ASMs into concurrent, reflective BSP-ASMs, allowing individual agents to dynamically adapt their programs.

Towards Refinement of Unbounded Parallelism in ASMs Using Concurrency and Reflection

Fengqing Jiang, Neng Xiong, Xinyu Lian, Senén González, Klaus-Dieter Schewe

8th International Conference on Rigorous State-Based Methods

TL;DR

This paper introduces a method to integrate the BSP bridging model with MapReduce processing by using a work-stealing approach, where idle processors autonomously select and execute tasks from a pool of open threads. It further generalizes this method by refining unboundedly parallel ASMs into concurrent, reflective BSP-ASMs, allowing individual agents to dynamically adapt their programs.