Fengqing Jiang, Zhangchen Xu, Yuetai Li, Luyao Niu, Zhen Xiang, Bo Li, Bill Yuchen Lin, Radha Poovendran
Preprint
Emerging large reasoning models (LRMs), such as DeepSeek-R1 models, leverage long chain-of-thought (CoT) reasoning to generate structured intermediate steps, enhancing their reasoning capabilities. However, long CoT does not inherently guarantee safe outputs, potentially leading to harmful consequences such as the introduction of security vulnerabilities in code or the spread of misinformation. Current research on large language model (LLM) safety usually focuses on short-answer responses, overlooking the long CoT style outputs of LRMs. To bridge this gap, we conduct a systematic study of LRM safety. First, we investigate safety evaluators calibrated against human annotations. Using our newly developed metrics, we thoroughly assess the safety of 12 state-of-the-art LRMs on StrongReject and WildJailbreak datasets. Our results show that LRMs are not safe compared to their reasoning advance. Further, we perform a fine-grained analysis of the reasoning trace and final answer. We find that three decoding strategies-ZeroThink, LessThink, and MoreThink-can improve model safety without additional training. However, these strategies either use constrained reasoning traces or incur high inference costs. To better strengthen LRM safety, we introduce SafeChain, the first-of-its-kind safety training dataset in CoT style. We fine-tune two LRMs with SafeChain, showing that it not only enhances model safety but also preserves performance across 6 reasoning benchmarks.
Preprint | Project Website | Code | HuggingFace
Media Coverage: AI World
Fengqing Jiang, Zhangchen Xu, Yuetai Li, Luyao Niu, Zhen Xiang, Bo Li, Bill Yuchen Lin, Radha Poovendran
Preprint
Emerging large reasoning models (LRMs), such as DeepSeek-R1 models, leverage long chain-of-thought (CoT) reasoning to generate structured intermediate steps, enhancing their reasoning capabilities. However, long CoT does not inherently guarantee safe outputs, potentially leading to harmful consequences such as the introduction of security vulnerabilities in code or the spread of misinformation. Current research on large language model (LLM) safety usually focuses on short-answer responses, overlooking the long CoT style outputs of LRMs. To bridge this gap, we conduct a systematic study of LRM safety. First, we investigate safety evaluators calibrated against human annotations. Using our newly developed metrics, we thoroughly assess the safety of 12 state-of-the-art LRMs on StrongReject and WildJailbreak datasets. Our results show that LRMs are not safe compared to their reasoning advance. Further, we perform a fine-grained analysis of the reasoning trace and final answer. We find that three decoding strategies-ZeroThink, LessThink, and MoreThink-can improve model safety without additional training. However, these strategies either use constrained reasoning traces or incur high inference costs. To better strengthen LRM safety, we introduce SafeChain, the first-of-its-kind safety training dataset in CoT style. We fine-tune two LRMs with SafeChain, showing that it not only enhances model safety but also preserves performance across 6 reasoning benchmarks.
Preprint | Project Website | Code | HuggingFace
Media Coverage: AI World
Yuetai Li, Xiang Yue, Zhangchen Xu, Fengqing Jiang, Luyao Niu, Bill Yuchen Lin, Bhaskar Ramasubramanian, Radha Poovendran
Preprint
This paper reveals that small models (≤3B parameters) struggle with long chain-of-thought reasoning and instead perform better with shorter, simpler reasoning chains that align with their learning capacity. To address this, the authors introduce Mix Distillation—a method that blends long and short reasoning examples—which significantly boosts small model performance compared to using either type alone.
Yuetai Li, Xiang Yue, Zhangchen Xu, Fengqing Jiang, Luyao Niu, Bill Yuchen Lin, Bhaskar Ramasubramanian, Radha Poovendran
Preprint
This paper reveals that small models (≤3B parameters) struggle with long chain-of-thought reasoning and instead perform better with shorter, simpler reasoning chains that align with their learning capacity. To address this, the authors introduce Mix Distillation—a method that blends long and short reasoning examples—which significantly boosts small model performance compared to using either type alone.
Zhangchen Xu, Fengqing Jiang, Luyao Niu, Yuntian Deng, Radha Poovendran, Yejin Choi, Bill Yuchen Lin
ICLR 2025
This paper introduces Magpie, a self-synthesis method that extracts large-scale, high-quality instruction data directly from aligned LLMs by prompting them with partial templates. By generating 4 million instruction-response pairs and filtering them down to 300K high-quality instances, the approach enables fine-tuned models to perform comparably to or even outperform those trained on much larger proprietary datasets, advancing the democratization of AI alignment.
Preprint | Project Website | Code | HuggingFace Dataset | DEMO
Media Coverage: 新智元
Zhangchen Xu, Fengqing Jiang, Luyao Niu, Yuntian Deng, Radha Poovendran, Yejin Choi, Bill Yuchen Lin
ICLR 2025
This paper introduces Magpie, a self-synthesis method that extracts large-scale, high-quality instruction data directly from aligned LLMs by prompting them with partial templates. By generating 4 million instruction-response pairs and filtering them down to 300K high-quality instances, the approach enables fine-tuned models to perform comparably to or even outperform those trained on much larger proprietary datasets, advancing the democratization of AI alignment.
Preprint | Project Website | Code | HuggingFace Dataset | DEMO
Media Coverage: 新智元
Zhangchen Xu, Fengqing Jiang, Luyao Niu, Bill Yuchen Lin, Radha Poovendran
NAACL 2025
This paper challenges the common assumption that larger models are inherently better teachers for instruction tuning, revealing a "Larger Models' Paradox" where stronger models don't necessarily yield better results when fine-tuning smaller models. To address this, the authors propose a novel metric—Compatibility-Adjusted Reward (CAR)—which more accurately predicts the effectiveness of response generators by considering compatibility between teacher and base models, outperforming existing metrics across various experiments.
Media Coverage: Huggingface Daily Paper
Zhangchen Xu, Fengqing Jiang, Luyao Niu, Bill Yuchen Lin, Radha Poovendran
NAACL 2025
This paper challenges the common assumption that larger models are inherently better teachers for instruction tuning, revealing a "Larger Models' Paradox" where stronger models don't necessarily yield better results when fine-tuning smaller models. To address this, the authors propose a novel metric—Compatibility-Adjusted Reward (CAR)—which more accurately predicts the effectiveness of response generators by considering compatibility between teacher and base models, outperforming existing metrics across various experiments.
Media Coverage: Huggingface Daily Paper
Fengqing Jiang, Zhangchen Xu, Luyao Niu, Bill Yuchen Lin, Radha Poovendran
AAAI 2025 (AIA)
This paper reveals that using rigid chat templates for instruction tuning can inadvertently introduce a vulnerability, termed ChatBug, which attackers can exploit by deviating from the expected format to bypass safety alignments. The authors demonstrate that ChatBug can trigger unintended responses in multiple state-of-the-art LLMs and, although adversarial training can mitigate this risk, it significantly degrades performance, highlighting a trade-off between safety and helpfulness.
Fengqing Jiang, Zhangchen Xu, Luyao Niu, Bill Yuchen Lin, Radha Poovendran
AAAI 2025 (AIA)
This paper reveals that using rigid chat templates for instruction tuning can inadvertently introduce a vulnerability, termed ChatBug, which attackers can exploit by deviating from the expected format to bypass safety alignments. The authors demonstrate that ChatBug can trigger unintended responses in multiple state-of-the-art LLMs and, although adversarial training can mitigate this risk, it significantly degrades performance, highlighting a trade-off between safety and helpfulness.
Yuetai Li, Zhangchen Xu, Fengqing Jiang, Luyao Niu, Dinuka Sahabandu, Bhaskar Ramasubramanian, Radha Poovendran
EMNLP 2024
This paper introduces CleanGen, an inference-time defense for LLMs that mitigates backdoor attacks by identifying and replacing suspicious tokens—those with unusually high probabilities in compromised models—with tokens from a trusted, uncompromised LLM. Empirical evaluations demonstrate that CleanGen significantly reduces attack success rates across multiple backdoor attacks while preserving the quality and helpfulness of responses with minimal computational overhead.
Yuetai Li, Zhangchen Xu, Fengqing Jiang, Luyao Niu, Dinuka Sahabandu, Bhaskar Ramasubramanian, Radha Poovendran
EMNLP 2024
This paper introduces CleanGen, an inference-time defense for LLMs that mitigates backdoor attacks by identifying and replacing suspicious tokens—those with unusually high probabilities in compromised models—with tokens from a trusted, uncompromised LLM. Empirical evaluations demonstrate that CleanGen significantly reduces attack success rates across multiple backdoor attacks while preserving the quality and helpfulness of responses with minimal computational overhead.
Zhangchen Xu, Fengqing Jiang, Luyao Niu, Jinyuan Jia, Bo Li, Radha Poovendran
USENIX Security 2024
This paper introduces ACE, the first model poisoning attack targeting contribution evaluation in Federated Learning, allowing malicious clients to falsely boost their perceived contributions despite using low-quality data. Both theoretical and empirical results show that ACE deceives multiple state-of-the-art evaluation methods—while preserving global model accuracy—and that existing countermeasures are insufficient, highlighting the need for more robust defenses.
Zhangchen Xu, Fengqing Jiang, Luyao Niu, Jinyuan Jia, Bo Li, Radha Poovendran
USENIX Security 2024
This paper introduces ACE, the first model poisoning attack targeting contribution evaluation in Federated Learning, allowing malicious clients to falsely boost their perceived contributions despite using low-quality data. Both theoretical and empirical results show that ACE deceives multiple state-of-the-art evaluation methods—while preserving global model accuracy—and that existing countermeasures are insufficient, highlighting the need for more robust defenses.
Zhangchen Xu, Fengqing Jiang, Luyao Niu, Jinyuan Jia, Bill Yuchen Lin, Radha Poovendran
ACL 2024
This paper introduces SafeDecoding, a safety-aware decoding strategy for LLMs that leverages token probability insights to reduce the risk of jailbreak attacks. Extensive experiments show that SafeDecoding effectively lowers the success rate and harmfulness of various attacks across multiple LLMs, while maintaining response helpfulness and outperforming existing defense methods.
Zhangchen Xu, Fengqing Jiang, Luyao Niu, Jinyuan Jia, Bill Yuchen Lin, Radha Poovendran
ACL 2024
This paper introduces SafeDecoding, a safety-aware decoding strategy for LLMs that leverages token probability insights to reduce the risk of jailbreak attacks. Extensive experiments show that SafeDecoding effectively lowers the success rate and harmfulness of various attacks across multiple LLMs, while maintaining response helpfulness and outperforming existing defense methods.
Fengqing Jiang, Zhangchen Xu, Luyao Niu, Zhen Xiang, Bhaskar Ramasubramanian, Bo Li, Radha Poovendran
ACL 2024
This paper reveals that current LLM safety methods, which assume text is interpreted only semantically, can be bypassed using ASCII art. The authors introduce ArtPrompt, an ASCII art-based jailbreak attack, and the Vision-in-Text Challenge (ViTC) benchmark to demonstrate that leading LLMs struggle with non-semantic cues, enabling attackers to trigger undesired behaviors with black-box access.
Media Coverage: X | DeepLearning.AI | Ars Technica | ScienceNewsExplore | tom's Hardware
Fengqing Jiang, Zhangchen Xu, Luyao Niu, Zhen Xiang, Bhaskar Ramasubramanian, Bo Li, Radha Poovendran
ACL 2024
This paper reveals that current LLM safety methods, which assume text is interpreted only semantically, can be bypassed using ASCII art. The authors introduce ArtPrompt, an ASCII art-based jailbreak attack, and the Vision-in-Text Challenge (ViTC) benchmark to demonstrate that leading LLMs struggle with non-semantic cues, enabling attackers to trigger undesired behaviors with black-box access.
Media Coverage: X | DeepLearning.AI | Ars Technica | ScienceNewsExplore | tom's Hardware
Fengqing Jiang, Zhangchen Xu, Luyao Niu, Boxin Wang, Jinyuan Jia, Bo Li, Radha Poovendran
AsiaCCS 2024 (Poster)
This paper examines new security vulnerabilities in LLM-integrated applications, where malicious insiders or external attackers can manipulate the query-response process to force LLMs (such as GPT-3.5 and GPT-4) into producing biased, toxic, or misleading outputs. To counter these risks, the authors propose a lightweight, threat-agnostic defense that enforces integrity, source identification, attack detectability, and utility preservation, demonstrating its effectiveness through empirical evaluation.
Fengqing Jiang, Zhangchen Xu, Luyao Niu, Boxin Wang, Jinyuan Jia, Bo Li, Radha Poovendran
AsiaCCS 2024 (Poster)
This paper examines new security vulnerabilities in LLM-integrated applications, where malicious insiders or external attackers can manipulate the query-response process to force LLMs (such as GPT-3.5 and GPT-4) into producing biased, toxic, or misleading outputs. To counter these risks, the authors propose a lightweight, threat-agnostic defense that enforces integrity, source identification, attack detectability, and utility preservation, demonstrating its effectiveness through empirical evaluation.
Zhangchen Xu, Fengqing Jiang, Luyao Niu, Jinyuan Jia, Radha Poovendran
AsiaCCS 2024 (Poster)
This paper introduces Brave, a protocol for peer-to-peer federated learning that simultaneously preserves privacy against honest-but-curious adversaries and ensures Byzantine resilience. Brave guarantees that malicious participants cannot infer private data and that all benign participants converge to a global model with bounded deviation, achieving competitive accuracy even in adversarial settings.
Zhangchen Xu, Fengqing Jiang, Luyao Niu, Jinyuan Jia, Radha Poovendran
AsiaCCS 2024 (Poster)
This paper introduces Brave, a protocol for peer-to-peer federated learning that simultaneously preserves privacy against honest-but-curious adversaries and ensures Byzantine resilience. Brave guarantees that malicious participants cannot infer private data and that all benign participants converge to a global model with bounded deviation, achieving competitive accuracy even in adversarial settings.
Zhen Xiang, Fengqing Jiang, Zidi Xiong, Bhaskar Ramasubramanian, Radha Poovendran, Bo Li
ICLR 2024
This paper introduces BadChain, a backdoor attack against large language models that exploits chain-of-thought prompting by inserting a malicious reasoning step without needing access to the training data or model parameters. When a backdoor trigger is present in a query, the modified demonstration examples lead the model to output unintended content, with high attack success rates observed especially on models with strong reasoning capabilities like GPT-4.
Zhen Xiang, Fengqing Jiang, Zidi Xiong, Bhaskar Ramasubramanian, Radha Poovendran, Bo Li
ICLR 2024
This paper introduces BadChain, a backdoor attack against large language models that exploits chain-of-thought prompting by inserting a malicious reasoning step without needing access to the training data or model parameters. When a backdoor trigger is present in a query, the modified demonstration examples lead the model to output unintended content, with high attack success rates observed especially on models with strong reasoning capabilities like GPT-4.
Arezoo Rajabi, Surudhi Asokraj, Fengqing Jiang, Luyao Niu, Bhaskar Ramasubramanian, James Ritcey, Radha Poovendran
ACM CCS 2023
This paper introduces MDTD, a multi-domain Trojan detector that leverages adversarial learning to estimate an input’s distance from a decision boundary, thereby identifying backdoor-triggered samples across image, audio, and graph-based models. Extensive evaluations show that MDTD can effectively detect various types of Trojan triggers—even under adaptive attacks—while preserving high accuracy on benign inputs.
Arezoo Rajabi, Surudhi Asokraj, Fengqing Jiang, Luyao Niu, Bhaskar Ramasubramanian, James Ritcey, Radha Poovendran
ACM CCS 2023
This paper introduces MDTD, a multi-domain Trojan detector that leverages adversarial learning to estimate an input’s distance from a decision boundary, thereby identifying backdoor-triggered samples across image, audio, and graph-based models. Extensive evaluations show that MDTD can effectively detect various types of Trojan triggers—even under adaptive attacks—while preserving high accuracy on benign inputs.
Jianyun Zou, Min Yang, Lichao Zhang, Yechen Xu, Qifan Pan, Fengqing Jiang, Ran Qin, Shushu Wang, Yifan He, Songfang Huang, Zhou Zhao
Preprint
This paper introduces CLC-QuAD, the first large-scale dataset for complex Chinese KBQA using Wikidata, addressing the language and diversity limitations of existing resources. It also presents a text-to-SPARQL baseline model capable of handling various complex question types and evaluates current state-of-the-art KBQA models, highlighting challenges specific to Chinese.
Jianyun Zou, Min Yang, Lichao Zhang, Yechen Xu, Qifan Pan, Fengqing Jiang, Ran Qin, Shushu Wang, Yifan He, Songfang Huang, Zhou Zhao
Preprint
This paper introduces CLC-QuAD, the first large-scale dataset for complex Chinese KBQA using Wikidata, addressing the language and diversity limitations of existing resources. It also presents a text-to-SPARQL baseline model capable of handling various complex question types and evaluates current state-of-the-art KBQA models, highlighting challenges specific to Chinese.
Fengqing Jiang, Neng Xiong, Xinyu Lian, Senén González, Klaus-Dieter Schewe
8th International Conference on Rigorous State-Based Methods
This paper introduces a method to integrate the BSP bridging model with MapReduce processing by using a work-stealing approach, where idle processors autonomously select and execute tasks from a pool of open threads. It further generalizes this method by refining unboundedly parallel ASMs into concurrent, reflective BSP-ASMs, allowing individual agents to dynamically adapt their programs.
Fengqing Jiang, Neng Xiong, Xinyu Lian, Senén González, Klaus-Dieter Schewe
8th International Conference on Rigorous State-Based Methods
This paper introduces a method to integrate the BSP bridging model with MapReduce processing by using a work-stealing approach, where idle processors autonomously select and execute tasks from a pool of open threads. It further generalizes this method by refining unboundedly parallel ASMs into concurrent, reflective BSP-ASMs, allowing individual agents to dynamically adapt their programs.