好的,根据您提到的这篇综述论文《Backdoor Attacks and Countermeasures in Natural Language Processing Models: A Comprehensive Security Review》(IEEE TNNLS, 2025),我将其中针对大型语言模型(LLMs)后门攻击所提及的所有代表性论文进行系统性的分类整理,并为每篇论文提供标题、出处和时间等详细信息。

该论文在**第III-D节“Attacking Large Language Models (ALLM)”**​ 中,系统性地归纳了针对LLMs的新型攻击面。下图清晰地展示了这些攻击面的分类框架:

以下是各个攻击场景下所有被引用论文的详细清单:

1. 指令微调(Instruction-Tuning)后门攻击

这类攻击在模型指令微调阶段植入后门。

标题 作者 出处 时间
Instructions as Backdoors: Backdoor Vulnerabilities of Instruction Tuning for Large Language Models J. Xu, M. Derek Ma, F. Wang, C. Xiao, M. Chen arXiv preprint arXiv:2305.14710 2023
Backdooring Instruction-Tuned Large Language Models with Virtual Prompt Injection J. Yan, et al. In Proc. Conf. North Amer. Chapter Assoc. Comput. Linguistics, Hum. Lang. Technol. 2024
Learning to Poison Large Language Models During Instruction Tuning Y. Qiang, et al. arXiv preprint arXiv:2402.13459 2024
Instruction Backdoor Attacks Against Customized LLMs Y. Zhang, et al. arXiv preprint arXiv:2402.09179 2024

2. 人类反馈强化学习(RLHF)后门攻击

这类攻击针对RLHF过程,旨在绕过或利用其安全对齐机制。

标题 作者 出处 时间
BadGPT: Exploring Security Vulnerabilities of ChatGPT via Backdoor Attacks to InstructGPT J. Shi, Y. Liu, P. Zhou, L. Sun arXiv preprint arXiv:2304.12298 2023
Universal Jailbreak Backdoors from Poisoned Human Feedback J. Rando, F. Tramèr In Proc. 12th Int. Conf. Learn. Represent. 2024
The Dark Side of Human Feedback: Poisoning Large Language Models via User Inputs B. Chen, H. Guo, G. Wang, Y. Wang, Q. Yan arXiv preprint arXiv:2409.00787 2024

3. 上下文学习(ICL)与思维链(CoT)后门攻击

这类攻击通过操纵提示或少样本示例来激活后门。

标题 作者 出处 时间
Backdoor Attacks for In-Context Learning with Language Models N. Kandpal, M. Jagielski, F. Tramer, N. Carlini arXiv preprint arXiv:2307.14692 2023
Universal Vulnerabilities in Large Language Models: Backdoor Attacks for In-Context Learning S. Zhao, M. Jia, L. Anh Yuan, F. Pan, J. Wen arXiv preprint arXiv:2401.05949 2024
BadChain: Backdoor Chain-of-Thought Prompting for Large Language Models Z. Xiang, F. Jiang, Z. Xiong, R. Poovendran, B. Li In Proc. NeurIPS Workshop Backdoors in Deep Learning 2024

4. 智能体(Agents)后门攻击

这类攻击针对能够使用工具、进行多步推理的LLM智能体。

标题 作者 出处 时间
Watch Out for Your Agents! Investigating Backdoor Threats to LLM-Based Agents W. Yang, X. Bi, Y. Lin, S. Chen, J. Zhou, X. Sun arXiv preprint arXiv:2402.11208 2024
BadAgent: Inserting and Activating Backdoor Attacks in LLM Agents Y. Wang, D. Xue, S. Zhang, S. Qian arXiv preprint arXiv:2406.03007 2024
AgentPoison: Red-teaming LLM Agents via Poisoning Memory or Knowledge Bases Z. Chen, Z. Xiang, C. Xiao, D. Song, B. Li arXiv preprint arXiv:2407.12784 2024

5. 模型编辑(Model Editing)后门攻击

这类攻击利用轻量级的模型编辑技术高效地植入后门。

标题 作者 出处 时间
BadEdit: Backdooring Large Language Models by Model Editing Y. Li, et al. In Proc. 12th Int. Conf. Learn. Represent. 2024
MEGen: Generative Backdoor in Large Language Models via Model Editing J. Qiu, X. Ma, Z. Zhang, H. Zhao arXiv preprint arXiv:2408.10722 2024

6. 检索增强生成(RAG)后门攻击

这类攻击通过污染RAG系统的外部知识库来实施攻击。

标题 作者 出处 时间
TrojanRAG: Retrieval-augmented Generation Can Be Backdoor Driver in Large Language Models P. Cheng, et al. arXiv preprint arXiv:2405.13401 2024
BadRAG: Identifying Vulnerabilities in Retrieval Augmented Generation of Large Language Models J. Xue, et al. arXiv preprint arXiv:2406.00083 2024
HijackRAG: Hijacking Attacks Against Retrieval-Augmented Large Language Models Y. Zhang, et al. arXiv preprint arXiv:2410.22832 2024

7. 多模态LLMs(MLLMs)后门攻击

这类攻击针对处理文本和图像等多模态输入的LLMs。

标题 作者 出处 时间
Backdoor Attacks to Pre-trained Unified Foundation Models Z. Yuan, Y. Liu, K. Zhang, P. Zhou, L. Sun arXiv preprint arXiv:2302.09360 2023
Composite Backdoor Attacks Against Large Language Models H. Huang, Z. Zhao, M. Backes, Y. Shen, Y. Zhang arXiv preprint arXiv:2310.07676 2023
Anydoor: A Test-time Backdoor Attack Against Multimodal Large Language Models D. Lu, T. Pang, C. Du, Q. Liu, X. Yang, M. Lin arXiv preprint arXiv:2402.08577 2024

8. 可迁移与跨语言后门攻击

这类攻击研究后门在不同模型或语言间的迁移能力。

标题 作者 出处 时间
Transferring Backdoors Between Large Language Models by Knowledge Distillation P. Cheng, Z. Wu, T. Ju, W. Du, G. Liu arXiv preprint arXiv:2408.09878 2024
TuBA: Cross-lingual Transferability of Backdoor Attacks in LLMs with Instruction Tuning X. He, et al. arXiv preprint arXiv:2404.19597 2024

总结

这份清单清晰地展示了针对LLMs的后门攻击研究呈现出攻击面多样化、技术精细化的特点。攻击已渗透到LLM的**训练(指令微调、RLHF)、推理(ICL、CoT)、应用(智能体、RAG)以及跨模态(多模态)**​ 等几乎所有核心环节。这为LLM的安全研究和部署敲响了警钟,强调需要建立覆盖模型全生命周期的防御体系。


http://example.com/posts/97.html
作者
司马吴空
发布于
2026年3月30日
许可协议