Enhancing LLM Efficiency: Targeted Pruning for Prefill-Decode Disaggregation in Inference
Related Articles
PyTorch – Blog • Hybrid Models as First-Class Citizens in vLLM
arXiv – cs.AI • DART: Difficulty-Adaptive Reasoning Truncation for Efficient Large Language Models
VentureBeat – AI • Nvidia researchers unlock 4-bit LLM training that matches 8-bit performance
arXiv – cs.AI • SelfJudge: Faster Speculative Decoding via Self-Supervised Judge Verification
arXiv – cs.LG • TokenFlow: Responsive LLM Text Streaming Serving under Request Burst via Preemptive Scheduling
arXiv – cs.AI • (P)rior(D)yna(F)low: A Priori Dynamic Workflow Construction via Multi-Agent Collaboration