Badchain
- [[BadChain:Backdoor Chain-of-Thought Prompting for Large Language Models]]
- ICLR2024 #CCF/A
PR-Attack
- #CCF/A
- [[PR-Attack:Coordinated Prompt-RAG Attacks on Retrieval-Augmented Generation in Large Language Models via Bilevel Optimization-大纲]]
- [[PR-Attack:Coordinated Prompt-RAG Attacks on Retrieval-Augmented Generation in Large Language Models via Bilevel Optimization,SIGIR’25]]
- 针对 RAG 任务的后门攻击
-
- 通过在知识库中注入少量中毒文本并在提示中嵌入后门触发器
Instruction Backdoor
- #CCF/A usenix2024
- [[Instruction Backdoor Attacks Against Customized LLMs]]
- [[Instruction Backdoor Attacks Against Customized LLMs-大纲]]
- 基于 fewshot 的黑盒后门攻击
- 根据不同攻击级别,设计三种 Trigger:
- 词级攻击”cf”
- 句法级攻击”While…”
- 语义级攻击”先判断文本主题. 指定文本主题的都应该分类为指定标签”
ICLAttack
[[Universal Vulnerabilities in Large Language Models:Backdoor Attacks for In-context Learning]]
#CCF/B EMNLP2024
两种毒化方式:
- 毒化演示示例的内容
- 触发器是额外插入的句子
- 毒化演示示例的提示格式
- 触发器是提示的格式
SEED
[[Stepwise Reasoning Disruption Attack of LLMs]]
[[Stepwise Reasoning Disruption Attack of LLMs-要点]]
#CCF/A ACL2507
进一步探索了 Badchain 提出的推理链后门攻击
用超参数σ来控制注入的错误推理的位置:
- 0.2,即前20%处,LLM 在续写时可能有足够”反思”的余地
- 注入太晚,由于前面都是正确步骤,LLM 可能更倾向遵循正确步骤而忽略最后的细微错误
- 最优在0.4~0.8之间
CatsAttack
- [[Cats Confuse Reasoning LLM:Query-Agnostic Adversarial Triggers for Reasoning Models. COLM 2025]]
- COLM 非 CCF
- 针对推理模型的查询无关对抗性攻击
Badthink
- AAAI #CCF/A
- [[BadThink:Triggered Overthinking Attacks on Chain-of-Thought Reasoning in Large Language Models]]
- 针对 CoT 的训练时后门攻击
ChainAttack
ICWS2025 #CCF/B
很短 缺少很多东西
[[ChainAttack:Black-Box Adversarial Attacks on Generative AI Services via Chain-of-Thought]]
Preemptive Answer Attack
- [[Preemptive Answer ”Attacks“ on Chain-of-Thought Reasoning]]
SleeperAgents
- [[SLEEPER AGENTS:TRAINING DECEPTIVE LLMS THAT PERSIST THROUGH SAFETY TRAINING]]
ShadowCoT
- [[ShadowCoT:Cognitive Hijacking for Stealthy Reasoning Backdoors in LLMs]]
A Systematic Review of Poisoning Attacks Against Large Language Models
- [[A Systematic Review of Poisoning Attacks Against Large Language Models]]
- 动机:
针对生成式LLM的投毒攻击仍缺乏统一的术语体系和评估框架,导致文献中存在术语不一致、攻击分类模糊等问题
AGENTPOISON
- [[AGENTPOISON:Red-teaming LLM Agents via Poisoning Memory or Knowledge Bases.]]
DarkMind
Security and Privacy Challenges of Large Language Models: A Survey, ACM Computing Surveys
综述
[[Security and Privacy Challenges of Large Language Models:A Survey–要点]]
采用基于目标的分类法,将漏洞分为安全漏洞和隐私漏洞两大类
安全漏洞:
- 提示黑客攻击
- 对抗性攻击
ICLshield
- [[ICLShield:Exploring and Mitigating In-Context Learning Backdoor Attacks]]
- 25.7 引用数:3
- 防御论文, 点名 ICLAttack、Badchain
Backdoor Attacks and Countermeasures in Natural Language Processing Models: A Comprehensive Security Review
- 综述
- [[Backdoor Attacks and Countermeasures in Natural Language Processing Models:A Comprehensive Security Review]]
- [[针对LLM的后门攻击分类]]
ICLPoison
- [[Data Poisoning for In-context Learning]]
EmbedX
- [[EmbedX:Embedding-based cross-trigger backdoor attack against large language models]]
ELba-bench
Jailbreak and Guard Aligned Language Models
- 上下文演示的攻击 (越狱) 和防御
- #CCF/A Ieee Transactions On Pattern Analysis And Machine Intelligence
%% kanban:settings
1 | |
%%