Refusal Falls off a Cliff: How Safety Alignment Fails in Reasoning?
Related articles
arXiv – cs.AI
•
Interpreting Multi-Attribute Confounding through Numerical Attributes in Large Language Models
arXiv – cs.AI
•
CircuitSeer: Mining High-Quality Data by Probing Mathematical Reasoning Circuits in LLMs
arXiv – cs.AI
•
Towards Flash Thinking via Decoupled Advantage Policy Optimization
arXiv – cs.AI
•
RADAR: Mechanistic Pathways for Detecting Data Contamination in LLM Tests
arXiv – cs.AI
•
Multimodal Function Vectors for Spatial Relations
arXiv – cs.AI
•
Retrieval-of-Thought: Efficient Reasoning via Reusing Thoughts