2025

BadScientist: Can a Research Agent Write Convincing but Unsound Papers that Fool LLM Reviewers?

Fengqing Jiang†, Yichen Feng†, Yuetai Li, Luyao Niu, Basel Alomair, Radha Poovendran

Agents4Science 2025 (Oral)

🏆 Agents4Science 2025 Best Paper Award
🏆 Together AI $10000 Compute Credit

TL;DR

The convergence of LLM-powered research assistants and AI-based peer review systems creates a critical vulnerability: fully automated publication loops where AI-generated research is evaluated by AI reviewers without human oversight. We investigate this through BadScientist, a framework that evaluates whether fabrication-oriented paper generation agents can deceive multi-model LLM review systems. Our generator employs presentation-manipulation strategies requiring no real experiments. We develop a rigorous evaluation framework with formal error guarantees (concentration bounds and calibration analysis), calibrated on real data. Our results reveal systematic vulnerabilities: fabricated papers achieve acceptance rates up to 82.0%82.0\%. Critically, we identify concern-acceptance conflict---reviewers frequently flag integrity issues yet assign acceptance-level scores. Our mitigation strategies show only marginal improvements, with detection accuracy barely exceeding random chance. Despite provably sound aggregation mathematics, integrity checking systematically fails, exposing fundamental limitations in current AI-driven review systems and underscoring the urgent need for defense-in-depth safeguards in scientific publishing.

Media Coverage: AI World

BadScientist: Can a Research Agent Write Convincing but Unsound Papers that Fool LLM Reviewers?

Fengqing Jiang†, Yichen Feng†, Yuetai Li, Luyao Niu, Basel Alomair, Radha Poovendran

Agents4Science 2025 (Oral)

🏆 Agents4Science 2025 Best Paper Award
🏆 Together AI $10000 Compute Credit

TL;DR

The convergence of LLM-powered research assistants and AI-based peer review systems creates a critical vulnerability: fully automated publication loops where AI-generated research is evaluated by AI reviewers without human oversight. We investigate this through BadScientist, a framework that evaluates whether fabrication-oriented paper generation agents can deceive multi-model LLM review systems. Our generator employs presentation-manipulation strategies requiring no real experiments. We develop a rigorous evaluation framework with formal error guarantees (concentration bounds and calibration analysis), calibrated on real data. Our results reveal systematic vulnerabilities: fabricated papers achieve acceptance rates up to 82.0%82.0\%. Critically, we identify concern-acceptance conflict---reviewers frequently flag integrity issues yet assign acceptance-level scores. Our mitigation strategies show only marginal improvements, with detection accuracy barely exceeding random chance. Despite provably sound aggregation mathematics, integrity checking systematically fails, exposing fundamental limitations in current AI-driven review systems and underscoring the urgent need for defense-in-depth safeguards in scientific publishing.

Media Coverage: AI World

SOSBENCH: Benchmarking Safety Alignment on Scientific Knowledge

Fengqing Jiang, Fengbo Ma, Zhangchen Xu, Yuetai Li, Bhaskar Ramasubramanian, Luyao Niu, Bo Li, Xianyan Chen, Zhen Xiang, Radha Poovendran

Preprint

TL;DR

Large language models (LLMs) exhibit advancing capabilities in complex tasks, such as reasoning and graduate-level question answering, yet their resilience against misuse, particularly involving scientifically sophisticated risks, remains underexplored. Existing safety benchmarks typically focus either on instructions requiring minimal knowledge comprehension (e.g., ``tell me how to build a bomb") or utilize prompts that are relatively low-risk (e.g., multiple-choice or classification tasks about hazardous content). Consequently, they fail to adequately assess model safety when handling knowledge-intensive, hazardous scenarios.
To address this critical gap, we introduce SOSBench, a regulation-grounded, hazard-focused benchmark encompassing six high-risk scientific domains: chemistry, biology, medicine, pharmacology, physics, and psychology. The benchmark comprises 3,000 prompts derived from real-world regulations and laws, systematically expanded via an LLM-assisted evolutionary pipeline that introduces diverse, realistic misuse scenarios (e.g., detailed explosive synthesis instructions involving advanced chemical formulas). We evaluate frontier models within a unified evaluation framework using our SOSBench. Despite their alignment claims, advanced models consistently disclose policy-violating content across all domains, demonstrating alarmingly high rates of harmful responses (e.g., 79.1% for Deepseek-R1 and 47.3% for GPT-4.1). These results highlight significant safety alignment deficiencies and underscore urgent concerns regarding the responsible deployment of powerful LLMs.

SOSBENCH: Benchmarking Safety Alignment on Scientific Knowledge

Fengqing Jiang, Fengbo Ma, Zhangchen Xu, Yuetai Li, Bhaskar Ramasubramanian, Luyao Niu, Bo Li, Xianyan Chen, Zhen Xiang, Radha Poovendran

Preprint

TL;DR

Large language models (LLMs) exhibit advancing capabilities in complex tasks, such as reasoning and graduate-level question answering, yet their resilience against misuse, particularly involving scientifically sophisticated risks, remains underexplored. Existing safety benchmarks typically focus either on instructions requiring minimal knowledge comprehension (e.g., ``tell me how to build a bomb") or utilize prompts that are relatively low-risk (e.g., multiple-choice or classification tasks about hazardous content). Consequently, they fail to adequately assess model safety when handling knowledge-intensive, hazardous scenarios.
To address this critical gap, we introduce SOSBench, a regulation-grounded, hazard-focused benchmark encompassing six high-risk scientific domains: chemistry, biology, medicine, pharmacology, physics, and psychology. The benchmark comprises 3,000 prompts derived from real-world regulations and laws, systematically expanded via an LLM-assisted evolutionary pipeline that introduces diverse, realistic misuse scenarios (e.g., detailed explosive synthesis instructions involving advanced chemical formulas). We evaluate frontier models within a unified evaluation framework using our SOSBench. Despite their alignment claims, advanced models consistently disclose policy-violating content across all domains, demonstrating alarmingly high rates of harmful responses (e.g., 79.1% for Deepseek-R1 and 47.3% for GPT-4.1). These results highlight significant safety alignment deficiencies and underscore urgent concerns regarding the responsible deployment of powerful LLMs.

VisualSphinx: Large-Scale Synthetic Vision Logic Puzzles for RL

Yichen Feng, Zhangchen Xu, Fengqing Jiang, Yuetai Li, Bhaskar Ramasubramanian, Luyao Niu, Bill Yuchen Lin, Radha Poovendran

Preprint

TL;DR

Vision language models (VLMs) are expected to perform effective multimodal reasoning and make logically coherent decisions, which is critical to tasks such as diagram understanding and spatial problem solving. However, current VLM reasoning lacks large-scale and well-structured training datasets. To bridge this gap, we propose VisualSphinx, a first-of-its-kind large-scale synthetic visual logical reasoning training data. To tackle the challenge of image synthesis with grounding answers, we propose a rule-to-image synthesis pipeline, which extracts and expands puzzle rules from seed questions and generates the code of grounding synthesis image synthesis for puzzle sample assembly. Experiments demonstrate that VLM trained using GRPO on VisualSphinx benefit from logical coherence and readability of our dataset and exhibit improved performance on logical reasoning tasks. The enhanced reasoning capabilities developed from VisualSphinx also benefit other reasoning tasks such as algebraic reasoning, arithmetic reasoning and geometry reasoning.

VisualSphinx: Large-Scale Synthetic Vision Logic Puzzles for RL

Yichen Feng, Zhangchen Xu, Fengqing Jiang, Yuetai Li, Bhaskar Ramasubramanian, Luyao Niu, Bill Yuchen Lin, Radha Poovendran

Preprint

TL;DR

Vision language models (VLMs) are expected to perform effective multimodal reasoning and make logically coherent decisions, which is critical to tasks such as diagram understanding and spatial problem solving. However, current VLM reasoning lacks large-scale and well-structured training datasets. To bridge this gap, we propose VisualSphinx, a first-of-its-kind large-scale synthetic visual logical reasoning training data. To tackle the challenge of image synthesis with grounding answers, we propose a rule-to-image synthesis pipeline, which extracts and expands puzzle rules from seed questions and generates the code of grounding synthesis image synthesis for puzzle sample assembly. Experiments demonstrate that VLM trained using GRPO on VisualSphinx benefit from logical coherence and readability of our dataset and exhibit improved performance on logical reasoning tasks. The enhanced reasoning capabilities developed from VisualSphinx also benefit other reasoning tasks such as algebraic reasoning, arithmetic reasoning and geometry reasoning.

TinyV: Reducing False Negatives in Verification Improves RL for LLM Reasoning

Zhangchen Xu, Yuetai Li, Fengqing Jiang, Bhaskar Ramasubramanian, Luyao Niu, Bill Yuchen Lin, Radha Poovendran

Preprint

TL;DR

Reinforcement Learning (RL) has become a powerful tool for enhancing the reasoning abilities of large language models (LLMs) by optimizing their policies with reward signals. Yet, RL's success relies on the reliability of rewards, which are provided by verifiers. In this paper, we expose and analyze a widespread problem--false negatives--where verifiers wrongly reject correct model outputs. Our in-depth study of the Big-Math-RL-Verified dataset reveals that over 38% of model-generated responses suffer from false negatives, where the verifier fails to recognize correct answers. We show, both empirically and theoretically, that these false negatives severely impair RL training by depriving the model of informative gradient signals and slowing convergence. To mitigate this, we propose tinyV, a lightweight LLM-based verifier that augments existing rule-based methods, which dynamically identifies potential false negatives and recovers valid responses to produce more accurate reward estimates. Across multiple math-reasoning benchmarks, integrating TinyV boosts pass rates by up to 10% and accelerates convergence relative to the baseline. Our findings highlight the critical importance of addressing verifier false negatives and offer a practical approach to improve RL-based fine-tuning of LLMs.

TinyV: Reducing False Negatives in Verification Improves RL for LLM Reasoning

Zhangchen Xu, Yuetai Li, Fengqing Jiang, Bhaskar Ramasubramanian, Luyao Niu, Bill Yuchen Lin, Radha Poovendran

Preprint

TL;DR

Reinforcement Learning (RL) has become a powerful tool for enhancing the reasoning abilities of large language models (LLMs) by optimizing their policies with reward signals. Yet, RL's success relies on the reliability of rewards, which are provided by verifiers. In this paper, we expose and analyze a widespread problem--false negatives--where verifiers wrongly reject correct model outputs. Our in-depth study of the Big-Math-RL-Verified dataset reveals that over 38% of model-generated responses suffer from false negatives, where the verifier fails to recognize correct answers. We show, both empirically and theoretically, that these false negatives severely impair RL training by depriving the model of informative gradient signals and slowing convergence. To mitigate this, we propose tinyV, a lightweight LLM-based verifier that augments existing rule-based methods, which dynamically identifies potential false negatives and recovers valid responses to produce more accurate reward estimates. Across multiple math-reasoning benchmarks, integrating TinyV boosts pass rates by up to 10% and accelerates convergence relative to the baseline. Our findings highlight the critical importance of addressing verifier false negatives and offer a practical approach to improve RL-based fine-tuning of LLMs.

Temporal Sampling for Forgotten Reasoning in LLMs

Yuetai Li, Zhangchen Xu, Fengqing Jiang, Bhaskar Ramasubramanian, Luyao Niu, Bill Yuchen Lin, Xiang Yue, Radha Poovendran

Preprint

TL;DR

Fine-tuning large language models (LLMs) is intended to improve their reasoning capabilities, yet we uncover a counterintuitive effect: models often forget how to solve problems they previously answered correctly during training. We term this phenomenon temporal forgetting and show that it is widespread across model sizes, fine-tuning methods (both Reinforcement Learning and Supervised Fine-Tuning), and multiple reasoning benchmarks. To address this gap, we introduce Temporal Sampling, a simple decoding strategy that draws outputs from multiple checkpoints along the training trajectory. This approach recovers forgotten solutions without retraining or ensembling, and leads to substantial improvements in reasoning performance, gains from 4 to 19 points in Pass@k and consistent gains in Majority@k across several benchmarks. We further extend our method to LoRA-adapted models, demonstrating that storing only adapter weights across checkpoints achieves similar benefits with minimal storage cost. By leveraging the temporal diversity inherent in training, Temporal Sampling offers a practical, compute-efficient way to surface hidden reasoning ability and rethink how we evaluate LLMs.

Temporal Sampling for Forgotten Reasoning in LLMs

Yuetai Li, Zhangchen Xu, Fengqing Jiang, Bhaskar Ramasubramanian, Luyao Niu, Bill Yuchen Lin, Xiang Yue, Radha Poovendran

Preprint

TL;DR

Fine-tuning large language models (LLMs) is intended to improve their reasoning capabilities, yet we uncover a counterintuitive effect: models often forget how to solve problems they previously answered correctly during training. We term this phenomenon temporal forgetting and show that it is widespread across model sizes, fine-tuning methods (both Reinforcement Learning and Supervised Fine-Tuning), and multiple reasoning benchmarks. To address this gap, we introduce Temporal Sampling, a simple decoding strategy that draws outputs from multiple checkpoints along the training trajectory. This approach recovers forgotten solutions without retraining or ensembling, and leads to substantial improvements in reasoning performance, gains from 4 to 19 points in Pass@k and consistent gains in Majority@k across several benchmarks. We further extend our method to LoRA-adapted models, demonstrating that storing only adapter weights across checkpoints achieves similar benefits with minimal storage cost. By leveraging the temporal diversity inherent in training, Temporal Sampling offers a practical, compute-efficient way to surface hidden reasoning ability and rethink how we evaluate LLMs.

SafeChain: Safety of Language Models with Long Chain-of-Thought Reasoning Capabilities

Fengqing Jiang, Zhangchen Xu, Yuetai Li, Luyao Niu, Zhen Xiang, Bo Li, Bill Yuchen Lin, Radha Poovendran

ACL 2025 (Finding)

🏆 ICLR 2025 BiAlign Workshop Best Paper Honorable Mention

TL;DR

Emerging large reasoning models (LRMs), such as DeepSeek-R1 models, leverage long chain-of-thought (CoT) reasoning to generate structured intermediate steps, enhancing their reasoning capabilities. However, long CoT does not inherently guarantee safe outputs, potentially leading to harmful consequences such as the introduction of security vulnerabilities in code or the spread of misinformation. Current research on large language model (LLM) safety usually focuses on short-answer responses, overlooking the long CoT style outputs of LRMs. To bridge this gap, we conduct a systematic study of LRM safety. First, we investigate safety evaluators calibrated against human annotations. Using our newly developed metrics, we thoroughly assess the safety of 12 state-of-the-art LRMs on StrongReject and WildJailbreak datasets. Our results show that LRMs are not safe compared to their reasoning advance. Further, we perform a fine-grained analysis of the reasoning trace and final answer. We find that three decoding strategies-ZeroThink, LessThink, and MoreThink-can improve model safety without additional training. However, these strategies either use constrained reasoning traces or incur high inference costs. To better strengthen LRM safety, we introduce SafeChain, the first-of-its-kind safety training dataset in CoT style. We fine-tune two LRMs with SafeChain, showing that it not only enhances model safety but also preserves performance across 6 reasoning benchmarks.

Media Coverage: AI World

SafeChain: Safety of Language Models with Long Chain-of-Thought Reasoning Capabilities

Fengqing Jiang, Zhangchen Xu, Yuetai Li, Luyao Niu, Zhen Xiang, Bo Li, Bill Yuchen Lin, Radha Poovendran

ACL 2025 (Finding)

🏆 ICLR 2025 BiAlign Workshop Best Paper Honorable Mention

TL;DR

Emerging large reasoning models (LRMs), such as DeepSeek-R1 models, leverage long chain-of-thought (CoT) reasoning to generate structured intermediate steps, enhancing their reasoning capabilities. However, long CoT does not inherently guarantee safe outputs, potentially leading to harmful consequences such as the introduction of security vulnerabilities in code or the spread of misinformation. Current research on large language model (LLM) safety usually focuses on short-answer responses, overlooking the long CoT style outputs of LRMs. To bridge this gap, we conduct a systematic study of LRM safety. First, we investigate safety evaluators calibrated against human annotations. Using our newly developed metrics, we thoroughly assess the safety of 12 state-of-the-art LRMs on StrongReject and WildJailbreak datasets. Our results show that LRMs are not safe compared to their reasoning advance. Further, we perform a fine-grained analysis of the reasoning trace and final answer. We find that three decoding strategies-ZeroThink, LessThink, and MoreThink-can improve model safety without additional training. However, these strategies either use constrained reasoning traces or incur high inference costs. To better strengthen LRM safety, we introduce SafeChain, the first-of-its-kind safety training dataset in CoT style. We fine-tune two LRMs with SafeChain, showing that it not only enhances model safety but also preserves performance across 6 reasoning benchmarks.

Media Coverage: AI World

Small Models Struggle to Learn from Strong Reasoners

Yuetai Li, Xiang Yue, Zhangchen Xu, Fengqing Jiang, Luyao Niu, Bill Yuchen Lin, Bhaskar Ramasubramanian, Radha Poovendran

ACL 25 (Finding)

TL;DR

This paper reveals that small models (≤3B parameters) struggle with long chain-of-thought reasoning and instead perform better with shorter, simpler reasoning chains that align with their learning capacity. To address this, the authors introduce Mix Distillation—a method that blends long and short reasoning examples—which significantly boosts small model performance compared to using either type alone.

Small Models Struggle to Learn from Strong Reasoners

Yuetai Li, Xiang Yue, Zhangchen Xu, Fengqing Jiang, Luyao Niu, Bill Yuchen Lin, Bhaskar Ramasubramanian, Radha Poovendran

ACL 25 (Finding)

TL;DR

This paper reveals that small models (≤3B parameters) struggle with long chain-of-thought reasoning and instead perform better with shorter, simpler reasoning chains that align with their learning capacity. To address this, the authors introduce Mix Distillation—a method that blends long and short reasoning examples—which significantly boosts small model performance compared to using either type alone.

Magpie: Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing

Zhangchen Xu, Fengqing Jiang, Luyao Niu, Yuntian Deng, Radha Poovendran, Yejin Choi, Bill Yuchen Lin

ICLR 2025

TL;DR

This paper introduces Magpie, a self-synthesis method that extracts large-scale, high-quality instruction data directly from aligned LLMs by prompting them with partial templates. By generating 4 million instruction-response pairs and filtering them down to 300K high-quality instances, the approach enables fine-tuned models to perform comparably to or even outperform those trained on much larger proprietary datasets, advancing the democratization of AI alignment.

Media Coverage: 新智元

Magpie: Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing

Zhangchen Xu, Fengqing Jiang, Luyao Niu, Yuntian Deng, Radha Poovendran, Yejin Choi, Bill Yuchen Lin

ICLR 2025

TL;DR

This paper introduces Magpie, a self-synthesis method that extracts large-scale, high-quality instruction data directly from aligned LLMs by prompting them with partial templates. By generating 4 million instruction-response pairs and filtering them down to 300K high-quality instances, the approach enables fine-tuned models to perform comparably to or even outperform those trained on much larger proprietary datasets, advancing the democratization of AI alignment.

Media Coverage: 新智元

Stronger Models are NOT Stronger Teachers for Instruction Tuning

Zhangchen Xu, Fengqing Jiang, Luyao Niu, Bill Yuchen Lin, Radha Poovendran

NAACL 2025

TL;DR

This paper challenges the common assumption that larger models are inherently better teachers for instruction tuning, revealing a "Larger Models' Paradox" where stronger models don't necessarily yield better results when fine-tuning smaller models. To address this, the authors propose a novel metric—Compatibility-Adjusted Reward (CAR)—which more accurately predicts the effectiveness of response generators by considering compatibility between teacher and base models, outperforming existing metrics across various experiments.

Media Coverage: Huggingface Daily Paper

Stronger Models are NOT Stronger Teachers for Instruction Tuning

Zhangchen Xu, Fengqing Jiang, Luyao Niu, Bill Yuchen Lin, Radha Poovendran

NAACL 2025

TL;DR

This paper challenges the common assumption that larger models are inherently better teachers for instruction tuning, revealing a "Larger Models' Paradox" where stronger models don't necessarily yield better results when fine-tuning smaller models. To address this, the authors propose a novel metric—Compatibility-Adjusted Reward (CAR)—which more accurately predicts the effectiveness of response generators by considering compatibility between teacher and base models, outperforming existing metrics across various experiments.

Media Coverage: Huggingface Daily Paper

ChatBug: A Common Vulnerability of Aligned LLMs Induced by Chat Templates

Fengqing Jiang, Zhangchen Xu, Luyao Niu, Bill Yuchen Lin, Radha Poovendran

AAAI 2025 (AIA)

TL;DR

This paper reveals that using rigid chat templates for instruction tuning can inadvertently introduce a vulnerability, termed ChatBug, which attackers can exploit by deviating from the expected format to bypass safety alignments. The authors demonstrate that ChatBug can trigger unintended responses in multiple state-of-the-art LLMs and, although adversarial training can mitigate this risk, it significantly degrades performance, highlighting a trade-off between safety and helpfulness.

ChatBug: A Common Vulnerability of Aligned LLMs Induced by Chat Templates

Fengqing Jiang, Zhangchen Xu, Luyao Niu, Bill Yuchen Lin, Radha Poovendran

AAAI 2025 (AIA)

TL;DR

This paper reveals that using rigid chat templates for instruction tuning can inadvertently introduce a vulnerability, termed ChatBug, which attackers can exploit by deviating from the expected format to bypass safety alignments. The authors demonstrate that ChatBug can trigger unintended responses in multiple state-of-the-art LLMs and, although adversarial training can mitigate this risk, it significantly degrades performance, highlighting a trade-off between safety and helpfulness.

2024

CleanGen: Mitigating Backdoor Attacks for Generation Tasks in Large Language Models

Yuetai Li, Zhangchen Xu, Fengqing Jiang, Luyao Niu, Dinuka Sahabandu, Bhaskar Ramasubramanian, Radha Poovendran

EMNLP 2024

TL;DR

This paper introduces CleanGen, an inference-time defense for LLMs that mitigates backdoor attacks by identifying and replacing suspicious tokens—those with unusually high probabilities in compromised models—with tokens from a trusted, uncompromised LLM. Empirical evaluations demonstrate that CleanGen significantly reduces attack success rates across multiple backdoor attacks while preserving the quality and helpfulness of responses with minimal computational overhead.

CleanGen: Mitigating Backdoor Attacks for Generation Tasks in Large Language Models

Yuetai Li, Zhangchen Xu, Fengqing Jiang, Luyao Niu, Dinuka Sahabandu, Bhaskar Ramasubramanian, Radha Poovendran

EMNLP 2024

TL;DR

This paper introduces CleanGen, an inference-time defense for LLMs that mitigates backdoor attacks by identifying and replacing suspicious tokens—those with unusually high probabilities in compromised models—with tokens from a trusted, uncompromised LLM. Empirical evaluations demonstrate that CleanGen significantly reduces attack success rates across multiple backdoor attacks while preserving the quality and helpfulness of responses with minimal computational overhead.

ACE: A Model Poisoning Attack on Contribution Evaluation Methods in Federated Learning

Zhangchen Xu, Fengqing Jiang, Luyao Niu, Jinyuan Jia, Bo Li, Radha Poovendran

USENIX Security 2024

TL;DR

This paper introduces ACE, the first model poisoning attack targeting contribution evaluation in Federated Learning, allowing malicious clients to falsely boost their perceived contributions despite using low-quality data. Both theoretical and empirical results show that ACE deceives multiple state-of-the-art evaluation methods—while preserving global model accuracy—and that existing countermeasures are insufficient, highlighting the need for more robust defenses.

ACE: A Model Poisoning Attack on Contribution Evaluation Methods in Federated Learning

Zhangchen Xu, Fengqing Jiang, Luyao Niu, Jinyuan Jia, Bo Li, Radha Poovendran

USENIX Security 2024

TL;DR

This paper introduces ACE, the first model poisoning attack targeting contribution evaluation in Federated Learning, allowing malicious clients to falsely boost their perceived contributions despite using low-quality data. Both theoretical and empirical results show that ACE deceives multiple state-of-the-art evaluation methods—while preserving global model accuracy—and that existing countermeasures are insufficient, highlighting the need for more robust defenses.

SafeDecoding: Defending against Jailbreak Attacks via Safety-Aware Decoding

Zhangchen Xu, Fengqing Jiang, Luyao Niu, Jinyuan Jia, Bill Yuchen Lin, Radha Poovendran

ACL 2024

TL;DR

This paper introduces SafeDecoding, a safety-aware decoding strategy for LLMs that leverages token probability insights to reduce the risk of jailbreak attacks. Extensive experiments show that SafeDecoding effectively lowers the success rate and harmfulness of various attacks across multiple LLMs, while maintaining response helpfulness and outperforming existing defense methods.

SafeDecoding: Defending against Jailbreak Attacks via Safety-Aware Decoding

Zhangchen Xu, Fengqing Jiang, Luyao Niu, Jinyuan Jia, Bill Yuchen Lin, Radha Poovendran

ACL 2024

TL;DR

This paper introduces SafeDecoding, a safety-aware decoding strategy for LLMs that leverages token probability insights to reduce the risk of jailbreak attacks. Extensive experiments show that SafeDecoding effectively lowers the success rate and harmfulness of various attacks across multiple LLMs, while maintaining response helpfulness and outperforming existing defense methods.

ArtPrompt: ASCII Art-based Jailbreak Attacks against Aligned LLMs

Fengqing Jiang, Zhangchen Xu, Luyao Niu, Zhen Xiang, Bhaskar Ramasubramanian, Bo Li, Radha Poovendran

ACL 2024

TL;DR

This paper reveals that current LLM safety methods, which assume text is interpreted only semantically, can be bypassed using ASCII art. The authors introduce ArtPrompt, an ASCII art-based jailbreak attack, and the Vision-in-Text Challenge (ViTC) benchmark to demonstrate that leading LLMs struggle with non-semantic cues, enabling attackers to trigger undesired behaviors with black-box access.

Media Coverage: Le Monde | X | DeepLearning.AI | Ars Technica | ScienceNewsExplore | tom's Hardware

ArtPrompt: ASCII Art-based Jailbreak Attacks against Aligned LLMs

Fengqing Jiang, Zhangchen Xu, Luyao Niu, Zhen Xiang, Bhaskar Ramasubramanian, Bo Li, Radha Poovendran

ACL 2024

TL;DR

This paper reveals that current LLM safety methods, which assume text is interpreted only semantically, can be bypassed using ASCII art. The authors introduce ArtPrompt, an ASCII art-based jailbreak attack, and the Vision-in-Text Challenge (ViTC) benchmark to demonstrate that leading LLMs struggle with non-semantic cues, enabling attackers to trigger undesired behaviors with black-box access.

Media Coverage: Le Monde | X | DeepLearning.AI | Ars Technica | ScienceNewsExplore | tom's Hardware

Identifying and Mitigating Vulnerabilities in LLM-Integrated Applications

Fengqing Jiang, Zhangchen Xu, Luyao Niu, Boxin Wang, Jinyuan Jia, Bo Li, Radha Poovendran

AsiaCCS 2024 (Poster)

TL;DR

This paper examines new security vulnerabilities in LLM-integrated applications, where malicious insiders or external attackers can manipulate the query-response process to force LLMs (such as GPT-3.5 and GPT-4) into producing biased, toxic, or misleading outputs. To counter these risks, the authors propose a lightweight, threat-agnostic defense that enforces integrity, source identification, attack detectability, and utility preservation, demonstrating its effectiveness through empirical evaluation.

Identifying and Mitigating Vulnerabilities in LLM-Integrated Applications

Fengqing Jiang, Zhangchen Xu, Luyao Niu, Boxin Wang, Jinyuan Jia, Bo Li, Radha Poovendran

AsiaCCS 2024 (Poster)

TL;DR

This paper examines new security vulnerabilities in LLM-integrated applications, where malicious insiders or external attackers can manipulate the query-response process to force LLMs (such as GPT-3.5 and GPT-4) into producing biased, toxic, or misleading outputs. To counter these risks, the authors propose a lightweight, threat-agnostic defense that enforces integrity, source identification, attack detectability, and utility preservation, demonstrating its effectiveness through empirical evaluation.

Brave: Byzantine-Resilient and Privacy-Preserving Peer-to-Peer Federated Learning

Zhangchen Xu, Fengqing Jiang, Luyao Niu, Jinyuan Jia, Radha Poovendran

AsiaCCS 2024 (Poster)

TL;DR

This paper introduces Brave, a protocol for peer-to-peer federated learning that simultaneously preserves privacy against honest-but-curious adversaries and ensures Byzantine resilience. Brave guarantees that malicious participants cannot infer private data and that all benign participants converge to a global model with bounded deviation, achieving competitive accuracy even in adversarial settings.

Brave: Byzantine-Resilient and Privacy-Preserving Peer-to-Peer Federated Learning

Zhangchen Xu, Fengqing Jiang, Luyao Niu, Jinyuan Jia, Radha Poovendran

AsiaCCS 2024 (Poster)

TL;DR

This paper introduces Brave, a protocol for peer-to-peer federated learning that simultaneously preserves privacy against honest-but-curious adversaries and ensures Byzantine resilience. Brave guarantees that malicious participants cannot infer private data and that all benign participants converge to a global model with bounded deviation, achieving competitive accuracy even in adversarial settings.

BadChain: Backdoor Chain-of-Thought Prompting for Large Language Models

Zhen Xiang, Fengqing Jiang, Zidi Xiong, Bhaskar Ramasubramanian, Radha Poovendran, Bo Li

ICLR 2024

TL;DR

This paper introduces BadChain, a backdoor attack against large language models that exploits chain-of-thought prompting by inserting a malicious reasoning step without needing access to the training data or model parameters. When a backdoor trigger is present in a query, the modified demonstration examples lead the model to output unintended content, with high attack success rates observed especially on models with strong reasoning capabilities like GPT-4.

BadChain: Backdoor Chain-of-Thought Prompting for Large Language Models

Zhen Xiang, Fengqing Jiang, Zidi Xiong, Bhaskar Ramasubramanian, Radha Poovendran, Bo Li

ICLR 2024

TL;DR

This paper introduces BadChain, a backdoor attack against large language models that exploits chain-of-thought prompting by inserting a malicious reasoning step without needing access to the training data or model parameters. When a backdoor trigger is present in a query, the modified demonstration examples lead the model to output unintended content, with high attack success rates observed especially on models with strong reasoning capabilities like GPT-4.

2023

MDTD: A Multi-Domain Trojan Detector for Deep Neural Networks

Arezoo Rajabi, Surudhi Asokraj, Fengqing Jiang, Luyao Niu, Bhaskar Ramasubramanian, James Ritcey, Radha Poovendran

ACM CCS 2023

TL;DR

This paper introduces MDTD, a multi-domain Trojan detector that leverages adversarial learning to estimate an input’s distance from a decision boundary, thereby identifying backdoor-triggered samples across image, audio, and graph-based models. Extensive evaluations show that MDTD can effectively detect various types of Trojan triggers—even under adaptive attacks—while preserving high accuracy on benign inputs.

MDTD: A Multi-Domain Trojan Detector for Deep Neural Networks

Arezoo Rajabi, Surudhi Asokraj, Fengqing Jiang, Luyao Niu, Bhaskar Ramasubramanian, James Ritcey, Radha Poovendran

ACM CCS 2023

TL;DR

This paper introduces MDTD, a multi-domain Trojan detector that leverages adversarial learning to estimate an input’s distance from a decision boundary, thereby identifying backdoor-triggered samples across image, audio, and graph-based models. Extensive evaluations show that MDTD can effectively detect various types of Trojan triggers—even under adaptive attacks—while preserving high accuracy on benign inputs.

2021

A Chinese Multi-type Complex Questions Answering Dataset over Wikidata

Jianyun Zou, Min Yang, Lichao Zhang, Yechen Xu, Qifan Pan, Fengqing Jiang, Ran Qin, Shushu Wang, Yifan He, Songfang Huang, Zhou Zhao

Preprint

TL;DR

This paper introduces CLC-QuAD, the first large-scale dataset for complex Chinese KBQA using Wikidata, addressing the language and diversity limitations of existing resources. It also presents a text-to-SPARQL baseline model capable of handling various complex question types and evaluates current state-of-the-art KBQA models, highlighting challenges specific to Chinese.

A Chinese Multi-type Complex Questions Answering Dataset over Wikidata

Jianyun Zou, Min Yang, Lichao Zhang, Yechen Xu, Qifan Pan, Fengqing Jiang, Ran Qin, Shushu Wang, Yifan He, Songfang Huang, Zhou Zhao

Preprint

TL;DR

This paper introduces CLC-QuAD, the first large-scale dataset for complex Chinese KBQA using Wikidata, addressing the language and diversity limitations of existing resources. It also presents a text-to-SPARQL baseline model capable of handling various complex question types and evaluates current state-of-the-art KBQA models, highlighting challenges specific to Chinese.

Towards Refinement of Unbounded Parallelism in ASMs Using Concurrency and Reflection

Fengqing Jiang, Neng Xiong, Xinyu Lian, Senén González, Klaus-Dieter Schewe

8th International Conference on Rigorous State-Based Methods

TL;DR

This paper introduces a method to integrate the BSP bridging model with MapReduce processing by using a work-stealing approach, where idle processors autonomously select and execute tasks from a pool of open threads. It further generalizes this method by refining unboundedly parallel ASMs into concurrent, reflective BSP-ASMs, allowing individual agents to dynamically adapt their programs.

Towards Refinement of Unbounded Parallelism in ASMs Using Concurrency and Reflection

Fengqing Jiang, Neng Xiong, Xinyu Lian, Senén González, Klaus-Dieter Schewe

8th International Conference on Rigorous State-Based Methods

TL;DR

This paper introduces a method to integrate the BSP bridging model with MapReduce processing by using a work-stealing approach, where idle processors autonomously select and execute tasks from a pool of open threads. It further generalizes this method by refining unboundedly parallel ASMs into concurrent, reflective BSP-ASMs, allowing individual agents to dynamically adapt their programs.