The Choice of Divergence: A Neglected Key to Mitigating Diversity Collapse in Reinforcement Learning with Verifiable Reward
Related Articles
arXiv – cs.AI • RLoop: Self-Improving RL Framework Boosts Generalization by 15%
arXiv – cs.LG • Shorter but not Worse: Frugal Reasoning via Easy Samples as Length Regularizers in Math RLVR
arXiv – cs.AI • Fine-tuning Large Language Models with Limited Data: A Survey and Practical Guide
arXiv – cs.LG • Shared Parameter Subspaces and Cross-Task Linearity in Emergently Misaligned Behavior
arXiv – cs.AI • Ariadne: A Controllable Framework for Probing and Extending VLM Reasoning Boundaries
arXiv – cs.LG • New RL Framework GIFT Unifies GRPO, DPO, and UNA for Better LLM Alignment