Fengqing Jiang†, Yichen Feng†, Yuetai Li, Luyao Niu, Basel Alomair, Radha Poovendran
Agents4Science 2025 (Oral)
🏆 Agents4Science 2025 Best Paper Award
🏆 Together AI $10,000 Compute Credit
The convergence of LLM-powered research assistants and AI-based peer review systems creates a critical vulnerability: fully automated publication loops where AI-generated research is evaluated by AI reviewers without human oversight. We investigate this through BadScientist, a framework that evaluates whether fabrication-oriented paper generation agents can deceive multi-model LLM review systems. Our generator employs presentation-manipulation strategies requiring no real experiments. We develop a rigorous evaluation framework with formal error guarantees (concentration bounds and calibration analysis), calibrated on real data. Our results reveal systematic vulnerabilities: fabricated papers achieve acceptance rates up to . Critically, we identify concern-acceptance conflict---reviewers frequently flag integrity issues yet assign acceptance-level scores. Our mitigation strategies show only marginal improvements, with detection accuracy barely exceeding random chance. Despite provably sound aggregation mathematics, integrity checking systematically fails, exposing fundamental limitations in current AI-driven review systems and underscoring the urgent need for defense-in-depth safeguards in scientific publishing.
Media Coverage: AI World
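The abstract mentions formal error guarantees via concentration bounds without spelling them out here. As a rough, hedged illustration of how such a bound can be attached to an averaged review score, the sketch below applies Hoeffding's inequality to bounded per-reviewer scores; the scores, threshold, and confidence level are invented for the example and are not taken from the paper.

```python
import math

def hoeffding_halfwidth(n: int, lo: float, hi: float, delta: float) -> float:
    """Half-width t such that P(|mean - E[mean]| >= t) <= delta for n
    independent scores bounded in [lo, hi] (Hoeffding's inequality)."""
    return (hi - lo) * math.sqrt(math.log(2.0 / delta) / (2.0 * n))

# Hypothetical reviewer scores on a 1-10 scale (illustrative only).
scores = [6.0, 7.0, 6.5, 7.5]
mean = sum(scores) / len(scores)
t = hoeffding_halfwidth(len(scores), lo=1.0, hi=10.0, delta=0.05)

accept_threshold = 6.0  # illustrative acceptance cutoff
print(f"mean = {mean:.2f}, 95% Hoeffding interval = [{mean - t:.2f}, {mean + t:.2f}]")
# The aggregation can be statistically sound in this sense, yet nothing in it
# inspects integrity flags -- which is the concern-acceptance conflict noted above.
if mean - t > accept_threshold:
    print("accept with high confidence (under the bounded-score assumption)")
```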
Fengqing Jiang, Fengbo Ma, Zhangchen Xu, Yuetai Li, Bhaskar Ramasubramanian, Luyao Niu, Bo Li, Xianyan Chen, Zhen Xiang, Radha Poovendran
Preprint
Large language models (LLMs) exhibit advancing capabilities in complex tasks, such as reasoning and graduate-level question answering, yet their resilience against misuse, particularly involving scientifically sophisticated risks, remains underexplored. Existing safety benchmarks typically focus either on instructions requiring minimal knowledge comprehension (e.g., "tell me how to build a bomb") or utilize prompts that are relatively low-risk (e.g., multiple-choice or classification tasks about hazardous content). Consequently, they fail to adequately assess model safety when handling knowledge-intensive, hazardous scenarios.
To address this critical gap, we introduce SOSBench, a regulation-grounded, hazard-focused benchmark encompassing six high-risk scientific domains: chemistry, biology, medicine, pharmacology, physics, and psychology. The benchmark comprises 3,000 prompts derived from real-world regulations and laws, systematically expanded via an LLM-assisted evolutionary pipeline that introduces diverse, realistic misuse scenarios (e.g., detailed explosive synthesis instructions involving advanced chemical formulas). We evaluate frontier models on SOSBench within a unified evaluation framework. Despite their alignment claims, advanced models consistently disclose policy-violating content across all domains, demonstrating alarmingly high rates of harmful responses (e.g., 79.1% for DeepSeek-R1 and 47.3% for GPT-4.1). These results highlight significant safety alignment deficiencies and underscore urgent concerns regarding the responsible deployment of powerful LLMs.
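For context, a benchmark like this is typically consumed as a simple evaluate-and-judge loop over prompts grouped by domain. The sketch below is a minimal, hypothetical harness: load_sosbench_prompts, query_model, and judge_is_harmful are placeholder functions standing in for the released data format, the model under test, and a safety judge, not the actual SOSBench API.

```python
from collections import defaultdict

def load_sosbench_prompts():
    """Placeholder: yield (domain, prompt) pairs from the benchmark data."""
    yield "chemistry", "..."  # elided

def query_model(prompt: str) -> str:
    """Placeholder: call the model under evaluation."""
    return "..."

def judge_is_harmful(prompt: str, response: str) -> bool:
    """Placeholder: safety judge deciding whether the response violates policy."""
    return False

harmful, total = defaultdict(int), defaultdict(int)
for domain, prompt in load_sosbench_prompts():
    response = query_model(prompt)
    total[domain] += 1
    harmful[domain] += judge_is_harmful(prompt, response)

for domain in total:
    print(f"{domain}: harmful-response rate = {harmful[domain] / total[domain]:.1%}")
```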
Yichen Feng, Zhangchen Xu, Fengqing Jiang, Yuetai Li, Bhaskar Ramasubramanian, Luyao Niu, Bill Yuchen Lin, Radha Poovendran
Preprint
Vision language models (VLMs) are expected to perform effective multimodal reasoning and make logically coherent decisions, which is critical to tasks such as diagram understanding and spatial problem solving. However, current VLM reasoning lacks large-scale and well-structured training datasets. To bridge this gap, we propose VisualSphinx, a first-of-its-kind large-scale synthetic training dataset for visual logical reasoning. To tackle the challenge of synthesizing images with grounded answers, we propose a rule-to-image synthesis pipeline, which extracts and expands puzzle rules from seed questions and generates image-synthesis code for assembling puzzle samples with grounded answers. Experiments demonstrate that VLMs trained using GRPO on VisualSphinx benefit from the logical coherence and readability of our dataset and exhibit improved performance on logical reasoning tasks. The enhanced reasoning capabilities developed from VisualSphinx also benefit other reasoning tasks such as algebraic reasoning, arithmetic reasoning, and geometry reasoning.
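The rule-to-image pipeline can be pictured as three stages: expand puzzle rules from seeds, generate rendering code for each rule, then assemble samples with grounded answers. The sketch below is only a structural outline under that reading; expand_rule, rule_to_rendering_code, the seed format, and the answer selection are hypothetical placeholders, not the paper's implementation.

```python
from dataclasses import dataclass

@dataclass
class PuzzleSample:
    rule: str          # textual description of the visual-logic rule
    image_code: str    # generated code that renders the puzzle panels
    answer: str        # grounded answer implied by the rule

def expand_rule(seed_rule: str) -> list[str]:
    """Placeholder: LLM-assisted expansion of a seed rule into variants."""
    return [seed_rule]

def rule_to_rendering_code(rule: str) -> str:
    """Placeholder: LLM generates drawing code (e.g. matplotlib) for the rule."""
    return "# drawing code for: " + rule

def assemble(rule: str, image_code: str) -> PuzzleSample:
    """Placeholder: pick the answer panel implied by the rule."""
    return PuzzleSample(rule=rule, image_code=image_code, answer="C")

seed_rules = ["shapes gain one side per panel"]  # illustrative seed
dataset = [
    assemble(rule, rule_to_rendering_code(rule))
    for seed in seed_rules
    for rule in expand_rule(seed)
]
print(len(dataset), "synthetic puzzle samples")
```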
Zhangchen Xu, Yuetai Li, Fengqing Jiang, Bhaskar Ramasubramanian, Luyao Niu, Bill Yuchen Lin, Radha Poovendran
Preprint
Reinforcement Learning (RL) has become a powerful tool for enhancing the reasoning abilities of large language models (LLMs) by optimizing their policies with reward signals. Yet RL's success relies on the reliability of rewards, which are provided by verifiers. In this paper, we expose and analyze a widespread problem: false negatives, where verifiers wrongly reject correct model outputs. Our in-depth study of the Big-Math-RL-Verified dataset reveals that over 38% of model-generated responses suffer from false negatives, where the verifier fails to recognize correct answers. We show, both empirically and theoretically, that these false negatives severely impair RL training by depriving the model of informative gradient signals and slowing convergence. To mitigate this, we propose TinyV, a lightweight LLM-based verifier that augments existing rule-based methods by dynamically identifying potential false negatives and recovering valid responses to produce more accurate reward estimates. Across multiple math-reasoning benchmarks, integrating TinyV boosts pass rates by up to 10% and accelerates convergence relative to the baseline. Our findings highlight the critical importance of addressing verifier false negatives and offer a practical approach to improve RL-based fine-tuning of LLMs.
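The core idea of augmenting a rule-based verifier with an LLM fallback can be sketched compactly. In the hedged sketch below, the exact-match rule is a simplification of real rule-based checkers and llm_judges_equivalent is a stand-in for the lightweight verifier call; only suspected false negatives trigger the LLM.

```python
def rule_based_verify(prediction: str, reference: str) -> bool:
    """Simplified rule-based check (real verifiers normalize math expressions)."""
    return prediction.strip() == reference.strip()

def llm_judges_equivalent(prediction: str, reference: str) -> bool:
    """Placeholder: ask a small LLM whether the two answers are equivalent."""
    raise NotImplementedError

def reward(prediction: str, reference: str) -> float:
    """RL reward: trust the cheap rule first; fall back to the LLM only when
    the rule rejects the answer (i.e., a potential false negative)."""
    if rule_based_verify(prediction, reference):
        return 1.0
    return 1.0 if llm_judges_equivalent(prediction, reference) else 0.0
```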
Yuetai Li, Zhangchen Xu, Fengqing Jiang, Bhaskar Ramasubramanian, Luyao Niu, Bill Yuchen Lin, Xiang Yue, Radha Poovendran
Preprint
Fine-tuning large language models (LLMs) is intended to improve their reasoning capabilities, yet we uncover a counterintuitive effect: models often forget how to solve problems they previously answered correctly during training. We term this phenomenon temporal forgetting and show that it is widespread across model sizes, fine-tuning methods (both Reinforcement Learning and Supervised Fine-Tuning), and multiple reasoning benchmarks. To address this gap, we introduce Temporal Sampling, a simple decoding strategy that draws outputs from multiple checkpoints along the training trajectory. This approach recovers forgotten solutions without retraining or ensembling, and leads to substantial improvements in reasoning performance, with gains of 4 to 19 points in Pass@k and consistent gains in Majority@k across several benchmarks. We further extend our method to LoRA-adapted models, demonstrating that storing only adapter weights across checkpoints achieves similar benefits with minimal storage cost. By leveraging the temporal diversity inherent in training, Temporal Sampling offers a practical, compute-efficient way to surface hidden reasoning ability and to rethink how we evaluate LLMs.
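Temporal Sampling, as described above, amounts to spreading a fixed sample budget across several training checkpoints instead of drawing every sample from the final one. A minimal sketch of that scheduling, with generate(checkpoint, prompt) and the checkpoint names as placeholders rather than the paper's code:

```python
from collections import Counter
from itertools import cycle

def generate(checkpoint: str, prompt: str) -> str:
    """Placeholder: sample one answer from the model at this checkpoint."""
    return "..."

def temporal_sample(prompt: str, checkpoints: list[str], budget: int) -> list[str]:
    """Round-robin the sample budget over checkpoints along the training trajectory."""
    return [generate(ckpt, prompt) for ckpt, _ in zip(cycle(checkpoints), range(budget))]

def majority_vote(answers: list[str]) -> str:
    """Majority@k over the pooled samples."""
    return Counter(answers).most_common(1)[0][0]

# Usage: k = 8 samples drawn from the last three checkpoints instead of only the final one.
answers = temporal_sample("prompt", ["ckpt_100", "ckpt_200", "ckpt_300"], budget=8)
print(majority_vote(answers))
```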
Fengqing Jiang, Zhangchen Xu, Yuetai Li, Luyao Niu, Zhen Xiang, Bo Li, Bill Yuchen Lin, Radha Poovendran
ACL 2025 (Findings)
🏆 ICLR 2025 BiAlign Workshop Best Paper Honorable Mention
Emerging large reasoning models (LRMs), such as the DeepSeek-R1 models, leverage long chain-of-thought (CoT) reasoning to generate structured intermediate steps, enhancing their reasoning capabilities. However, long CoT does not inherently guarantee safe outputs, potentially leading to harmful consequences such as the introduction of security vulnerabilities in code or the spread of misinformation. Current research on large language model (LLM) safety usually focuses on short-answer responses, overlooking the long CoT style outputs of LRMs. To bridge this gap, we conduct a systematic study of LRM safety. First, we investigate safety evaluators calibrated against human annotations. Using our newly developed metrics, we thoroughly assess the safety of 12 state-of-the-art LRMs on the StrongReject and WildJailbreak datasets. Our results show that LRM safety does not keep pace with their advances in reasoning. Further, we perform a fine-grained analysis of the reasoning trace and final answer. We find that three decoding strategies (ZeroThink, LessThink, and MoreThink) can improve model safety without additional training. However, these strategies either use constrained reasoning traces or incur high inference costs. To better strengthen LRM safety, we introduce SafeChain, the first-of-its-kind safety training dataset in CoT style. We fine-tune two LRMs with SafeChain, showing that it not only enhances model safety but also preserves performance across 6 reasoning benchmarks.
Preprint | Project Website | Code | HuggingFace
Media Coverage: AI World
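Of the three decoding strategies named in the abstract, ZeroThink and LessThink can be pictured as prefilling the reasoning segment so the model emits an empty or minimal thought before answering. The sketch below is a hedged illustration assuming a DeepSeek-R1-style <think>...</think> output format; generate_from_prefix is a placeholder for prefix-constrained generation, and the LessThink sentence is illustrative.

```python
def generate_from_prefix(prompt: str, prefix: str) -> str:
    """Placeholder: continue generation after forcing `prefix` into the output."""
    return prefix + "..."

def zero_think(prompt: str) -> str:
    """Force an empty reasoning trace before the final answer."""
    return generate_from_prefix(prompt, "<think>\n\n</think>\n")

def less_think(prompt: str) -> str:
    """Force a short, fixed reasoning trace before the final answer."""
    return generate_from_prefix(
        prompt, "<think>\nOkay, the user asked this; I can answer directly.\n</think>\n"
    )
```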
Yuetai Li, Xiang Yue, Zhangchen Xu, Fengqing Jiang, Luyao Niu, Bill Yuchen Lin, Bhaskar Ramasubramanian, Radha Poovendran
ACL 2025 (Findings)
This paper reveals that small models (≤3B parameters) struggle with long chain-of-thought reasoning and instead perform better with shorter, simpler reasoning chains that align with their learning capacity. To address this, the authors introduce Mix Distillation—a method that blends long and short reasoning examples—which significantly boosts small model performance compared to using either type alone.
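The blending step itself is straightforward to picture: build the distillation set by mixing long-CoT and short-CoT examples at a chosen ratio. The ratio, data format, and pool sizes in the sketch below are illustrative assumptions, not the paper's settings.

```python
import random

def mix_distillation_set(long_cot: list[dict], short_cot: list[dict],
                         long_ratio: float = 0.2, size: int = 1000,
                         seed: int = 0) -> list[dict]:
    """Sample a training mixture with `long_ratio` of long-CoT examples.
    Assumes both pools are at least as large as the requested counts."""
    rng = random.Random(seed)
    n_long = int(size * long_ratio)
    mixed = rng.sample(long_cot, n_long) + rng.sample(short_cot, size - n_long)
    rng.shuffle(mixed)
    return mixed
```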
Zhangchen Xu, Fengqing Jiang, Luyao Niu, Yuntian Deng, Radha Poovendran, Yejin Choi, Bill Yuchen Lin
ICLR 2025
This paper introduces Magpie, a self-synthesis method that extracts large-scale, high-quality instruction data directly from aligned LLMs by prompting them with partial templates. By generating 4 million instruction-response pairs and filtering them down to 300K high-quality instances, the approach enables fine-tuned models to perform comparably to or even outperform those trained on much larger proprietary datasets, advancing the democratization of AI alignment.
Preprint | Project Website | Code | HuggingFace Dataset | DEMO
Media Coverage: 新智元
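The partial-template trick can be shown concretely: send only the pre-query part of the chat template so the aligned model auto-completes a plausible user instruction, then feed that instruction back to obtain the response. The Llama-3-style template strings and the raw complete() call below are assumptions for illustration, not Magpie's released code.

```python
def complete(text: str) -> str:
    """Placeholder: raw (non-chat) completion call to an aligned LLM."""
    return "..."

# Pre-query template: everything up to where the user's message would begin.
PRE_QUERY = "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n"

# Step 1: the model fills in a plausible user instruction on its own.
instruction = complete(PRE_QUERY).split("<|eot_id|>")[0].strip()

# Step 2: append the instruction and the assistant header to elicit the response.
response = complete(
    PRE_QUERY + instruction
    + "<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
)
pair = {"instruction": instruction, "response": response}
```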
Zhangchen Xu, Fengqing Jiang, Luyao Niu, Bill Yuchen Lin, Radha Poovendran
NAACL 2025
This paper challenges the common assumption that larger models are inherently better teachers for instruction tuning, revealing a "Larger Models' Paradox" where stronger models don't necessarily yield better results when fine-tuning smaller models. To address this, the authors propose a novel metric—Compatibility-Adjusted Reward (CAR)—which more accurately predicts the effectiveness of response generators by considering compatibility between teacher and base models, outperforming existing metrics across various experiments.
Media Coverage: Huggingface Daily Paper
Fengqing Jiang, Zhangchen Xu, Luyao Niu, Bill Yuchen Lin, Radha Poovendran
AAAI 2025 (AIA)
This paper reveals that using rigid chat templates for instruction tuning can inadvertently introduce a vulnerability, termed ChatBug, which attackers can exploit by deviating from the expected format to bypass safety alignments. The authors demonstrate that ChatBug can trigger unintended responses in multiple state-of-the-art LLMs and, although adversarial training can mitigate this risk, it significantly degrades performance, highlighting a trade-off between safety and helpfulness.
Yuetai Li, Zhangchen Xu, Fengqing Jiang, Luyao Niu, Dinuka Sahabandu, Bhaskar Ramasubramanian, Radha Poovendran
EMNLP 2024
This paper introduces CleanGen, an inference-time defense for LLMs that mitigates backdoor attacks by identifying and replacing suspicious tokens—those with unusually high probabilities in compromised models—with tokens from a trusted, uncompromised LLM. Empirical evaluations demonstrate that CleanGen significantly reduces attack success rates across multiple backdoor attacks while preserving the quality and helpfulness of responses with minimal computational overhead.
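The defense can be sketched as a per-token comparison between the potentially backdoored model and a trusted reference: when the suspect model puts markedly higher probability on its chosen token than the reference does, that token is treated as suspicious and the reference's token is emitted instead. The suspicion ratio and the next_token_probs interface below are placeholders for illustration, not the paper's exact procedure.

```python
def next_token_probs(model: str, prefix: str) -> dict[str, float]:
    """Placeholder: next-token distribution for the named model given the prefix."""
    raise NotImplementedError

def cleangen_step(prefix: str, suspicion_ratio: float = 4.0) -> str:
    """One decoding step: keep the suspect model's token unless it looks suspicious."""
    suspect = next_token_probs("suspect_model", prefix)
    token = max(suspect, key=suspect.get)
    trusted = next_token_probs("trusted_model", prefix)
    # Suspicious if the suspect model is far more confident in this token
    # than the trusted reference model is.
    if suspect[token] > suspicion_ratio * trusted.get(token, 1e-9):
        token = max(trusted, key=trusted.get)
    return token
```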
Zhangchen Xu, Fengqing Jiang, Luyao Niu, Jinyuan Jia, Bo Li, Radha Poovendran
USENIX Security 2024
This paper introduces ACE, the first model poisoning attack targeting contribution evaluation in Federated Learning, allowing malicious clients to falsely boost their perceived contributions despite using low-quality data. Both theoretical and empirical results show that ACE deceives multiple state-of-the-art evaluation methods—while preserving global model accuracy—and that existing countermeasures are insufficient, highlighting the need for more robust defenses.
Zhangchen Xu, Fengqing Jiang, Luyao Niu, Jinyuan Jia, Bill Yuchen Lin, Radha Poovendran
ACL 2024
This paper introduces SafeDecoding, a safety-aware decoding strategy for LLMs that leverages token probability insights to reduce the risk of jailbreak attacks. Extensive experiments show that SafeDecoding effectively lowers the success rate and harmfulness of various attacks across multiple LLMs, while maintaining response helpfulness and outperforming existing defense methods.
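"Token probability insights" is left vague in this summary. One hedged reading is that the original model's next-token distribution is combined with that of a safety-tuned expert so tokens the expert up-weights (e.g., refusal tokens) are amplified. The interpolation below illustrates that reading and is not claimed to be the paper's exact construction.

```python
def safe_next_token(base: dict[str, float], expert: dict[str, float],
                    alpha: float = 1.0) -> str:
    """Score each candidate token as base prob plus alpha times the gap between
    the safety expert and the base model, amplifying expert-favored tokens."""
    candidates = set(base) | set(expert)
    scores = {
        t: base.get(t, 0.0) + alpha * (expert.get(t, 0.0) - base.get(t, 0.0))
        for t in candidates
    }
    return max(scores, key=scores.get)
```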
Fengqing Jiang, Zhangchen Xu, Luyao Niu, Zhen Xiang, Bhaskar Ramasubramanian, Bo Li, Radha Poovendran
ACL 2024
This paper reveals that current LLM safety methods, which assume text is interpreted only semantically, can be bypassed using ASCII art. The authors introduce ArtPrompt, an ASCII art-based jailbreak attack, and the Vision-in-Text Challenge (ViTC) benchmark to demonstrate that leading LLMs struggle with non-semantic cues, enabling attackers to trigger undesired behaviors with black-box access.
Media Coverage: Le Monde | X | DeepLearning.AI | Ars Technica | ScienceNewsExplore | Tom's Hardware
Fengqing Jiang, Zhangchen Xu, Luyao Niu, Boxin Wang, Jinyuan Jia, Bo Li, Radha Poovendran
AsiaCCS 2024 (Poster)
This paper examines new security vulnerabilities in LLM-integrated applications, where malicious insiders or external attackers can manipulate the query-response process to force LLMs (such as GPT-3.5 and GPT-4) into producing biased, toxic, or misleading outputs. To counter these risks, the authors propose a lightweight, threat-agnostic defense that enforces integrity, source identification, attack detectability, and utility preservation, demonstrating its effectiveness through empirical evaluation.
Zhangchen Xu, Fengqing Jiang, Luyao Niu, Jinyuan Jia, Radha Poovendran
AsiaCCS 2024 (Poster)
This paper introduces Brave, a protocol for peer-to-peer federated learning that simultaneously preserves privacy against honest-but-curious adversaries and ensures Byzantine resilience. Brave guarantees that malicious participants cannot infer private data and that all benign participants converge to a global model with bounded deviation, achieving competitive accuracy even in adversarial settings.
Zhen Xiang, Fengqing Jiang, Zidi Xiong, Bhaskar Ramasubramanian, Radha Poovendran, Bo Li
ICLR 2024
This paper introduces BadChain, a backdoor attack against large language models that exploits chain-of-thought prompting by inserting a malicious reasoning step without needing access to the training data or model parameters. When a backdoor trigger is present in a query, the modified demonstration examples lead the model to output unintended content, with high attack success rates observed especially on models with strong reasoning capabilities like GPT-4.
Arezoo Rajabi, Surudhi Asokraj, Fengqing Jiang, Luyao Niu, Bhaskar Ramasubramanian, James Ritcey, Radha Poovendran
ACM CCS 2023
This paper introduces MDTD, a multi-domain Trojan detector that leverages adversarial learning to estimate an input’s distance from a decision boundary, thereby identifying backdoor-triggered samples across image, audio, and graph-based models. Extensive evaluations show that MDTD can effectively detect various types of Trojan triggers—even under adaptive attacks—while preserving high accuracy on benign inputs.
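The detection principle (backdoor-triggered inputs tend to sit unusually far from the decision boundary) can be sketched by estimating how large a perturbation is needed to flip the model's prediction and thresholding that estimate. The FGSM-style step search and the threshold below are simplifications for illustration, not the paper's exact estimator.

```python
import numpy as np

def predict(x: np.ndarray) -> int:
    """Placeholder: classifier prediction for input x."""
    raise NotImplementedError

def grad_sign(x: np.ndarray) -> np.ndarray:
    """Placeholder: sign of the loss gradient w.r.t. x (as in FGSM)."""
    raise NotImplementedError

def distance_to_boundary(x: np.ndarray, step: float = 0.01, max_steps: int = 100) -> float:
    """Estimate the distance to the decision boundary by growing an FGSM-style
    perturbation until the predicted label changes."""
    label = predict(x)
    direction = grad_sign(x)
    for k in range(1, max_steps + 1):
        if predict(x + k * step * direction) != label:
            return k * step
    return max_steps * step

def looks_triggered(x: np.ndarray, threshold: float = 0.5) -> bool:
    """Flag inputs that are unusually far from the boundary (illustrative threshold)."""
    return distance_to_boundary(x) > threshold
```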
Jianyun Zou, Min Yang, Lichao Zhang, Yechen Xu, Qifan Pan, Fengqing Jiang, Ran Qin, Shushu Wang, Yifan He, Songfang Huang, Zhou Zhao
Preprint
This paper introduces CLC-QuAD, the first large-scale dataset for complex Chinese KBQA using Wikidata, addressing the language and diversity limitations of existing resources. It also presents a text-to-SPARQL baseline model capable of handling various complex question types and evaluates current state-of-the-art KBQA models, highlighting challenges specific to Chinese.
Fengqing Jiang, Neng Xiong, Xinyu Lian, Senén González, Klaus-Dieter Schewe
8th International Conference on Rigorous State-Based Methods
This paper introduces a method to integrate the BSP bridging model with MapReduce processing by using a work-stealing approach, where idle processors autonomously select and execute tasks from a pool of open threads. It further generalizes this method by refining unboundedly parallel ASMs into concurrent, reflective BSP-ASMs, allowing individual agents to dynamically adapt their programs.
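The work-stealing idea in this setting is that idle processors pull open tasks from a shared pool rather than waiting on a static assignment. Below is a small, self-contained Python sketch of that scheduling pattern only (not of the BSP-ASM formalism); the task representation and worker count are illustrative.

```python
import queue
import threading

def run_work_stealing(tasks, num_workers: int = 4):
    """Idle workers repeatedly take ('steal') the next open task from a shared pool."""
    pool = queue.Queue()
    for t in tasks:
        pool.put(t)
    results, lock = [], threading.Lock()

    def worker():
        while True:
            try:
                task = pool.get_nowait()   # an idle worker selects an open task
            except queue.Empty:
                return                     # no open tasks left; this worker stops
            outcome = task()
            with lock:
                results.append(outcome)

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return results

# Usage: ten independent map-style tasks executed by four workers
# (result order depends on scheduling).
print(run_work_stealing([lambda i=i: i * i for i in range(10)]))
```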