MLSys 2026 Poster Session Summary Notes
Summary notes on the MLSys 2026 poster sessions on May 20–21 (~72 unique posters). Themes are organized by topic rather than session order.
1. Training Efficiency & Memory Management
BOOST: Bottleneck-Optimized Scalable Training Framework for Low-Rank LLMs (UC Santa Barbara & Argonne National Lab). Targets low-rank LLM training bottlenecks. System optimizations include Online RMSNorm, Linear Layer Grouping, Commutative-Law transformations, and Activation Checkpointing. Evaluated on BERT-base, RoBERTa-base, DistilBERT. Take-away: a fused/grouped scheduler exploits the math identities of low-rank decompositions to cut memory + compute waste.
ProTrain: Efficient LLM Training via Automatic Memory Management. Sections cover Motivation, Memory Optimization Techniques, Structured Memory Strategies, Automatic Memory Management, Memory-Aware Profiler, Evaluation, System Overview. Contribution: a memory-aware profiler that drives dynamic allocate/deallocate decisions during training; the structured strategies decide what to offload/recompute. Take-away: extends the offload/recompute design space with a runtime that auto-tunes per-job, rather than hard-coding ZeRO/Megatron policies.
FlexTrain: Scalable Hybrid-Parallel Training with Elastic Resource Utilization and Consistent Accuracy. A hybrid-parallel trainer that can elastically scale resources without breaking convergence/accuracy guarantees. Sells the combination of elasticity (scale up/down mid-training) and consistent accuracy; the latter is a known weak point of dynamic-resource jobs.
2. KV Cache Management
OPKV: High-Throughput Plugin-Driven Framework for Recallable Sparsity in Paged KV Cache Systems. Identifies three problems with current sparse-KV systems:
- Model–Cache Coupling: sparsity logic is tangled with attention & cache manager, hurting portability.
- Token–Page Mismatch: sparse algorithms select tokens but paged attention manages pages, causing huge IO amplification.
- Recall Overhead Explosion: recall cost grows with batch size; prefetch overhead saturates throughput.
Design: a plugin API (register / preprocess / select / fetch / recall) plus an Ordered-Page (OP) block structure with GPU Hot-Page Reuse, FIFO eviction, and compute/recall overlap. Reported throughput gains: +50%, +77%, +86%, up to +133%.
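The Token–Page Mismatch problem can be made concrete with a small calculation. The sketch below is illustrative only: the page size, the selected token indices, and the helper names `pages_touched` and `io_amplification` are assumptions for this example, not values or APIs from the poster.

```python
def pages_touched(token_ids, page_size):
    """Return the sorted page indices that hold the selected tokens."""
    return sorted({t // page_size for t in token_ids})


def io_amplification(token_ids, page_size):
    """Tokens fetched (whole pages) divided by tokens actually selected."""
    unique_tokens = set(token_ids)
    fetched = len(pages_touched(unique_tokens, page_size)) * page_size
    return fetched / len(unique_tokens)


# Hypothetical case: a sparse selector picks 4 tokens scattered over a
# 64-token context that is stored in pages of 16 tokens each.
selected = [3, 18, 40, 63]
pages = pages_touched(selected, page_size=16)          # every token lands on a different page
amplification = io_amplification(selected, page_size=16)
print(pages, amplification)  # [0, 1, 2, 3] 16.0
```

In this toy case the system moves 64 tokens of KV data to serve 4 selected tokens, which is the kind of amplification a page-aware selection or reuse scheme tries to avoid.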
Kitty: Accurate & Efficient 2-bit KV Cache Quantization with Dynamic Channel-wise Precision Boost. Targets the KV-cache memory bottleneck of long-context inference by quantizing the cache to 2 bits. Pitch: dynamically boost precision on the few channels that hurt accuracy most, keeping the rest at 2 bits.
FlexiCache: Leveraging Temporal Stability of Attention Heads for Efficient KV Cache Management. Key insight: attention heads vary in how temporally stable their top-K tokens are; FlexiCache exploits that to avoid recomputing the importance scores every step.
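One way to picture "temporal stability" is to measure how much a head's top-K token set changes between consecutive decode steps. The sketch below uses mean Jaccard overlap as the stability score; that metric, the function names, and the toy scores are assumptions for illustration and are not taken from the poster.

```python
def top_k(scores, k):
    """Indices of the k largest attention scores (ties broken by lower index)."""
    return set(sorted(range(len(scores)), key=lambda i: (-scores[i], i))[:k])


def stability(score_steps, k):
    """Mean Jaccard overlap of the top-k token sets at consecutive steps."""
    sets = [top_k(s, k) for s in score_steps]
    overlaps = [len(a & b) / len(a | b) for a, b in zip(sets, sets[1:])]
    return sum(overlaps) / len(overlaps)


# Hypothetical attention scores of two heads over 5 cached tokens, 3 decode steps.
stable_head = [
    [0.50, 0.30, 0.10, 0.05, 0.05],
    [0.45, 0.35, 0.10, 0.05, 0.05],
    [0.40, 0.40, 0.10, 0.05, 0.05],
]
unstable_head = [
    [0.50, 0.30, 0.10, 0.05, 0.05],
    [0.05, 0.05, 0.10, 0.30, 0.50],
    [0.50, 0.30, 0.10, 0.05, 0.05],
]

print(stability(stable_head, k=2))    # 1.0: the top-2 set never changes
print(stability(unstable_head, k=2))  # 0.0: the top-2 set flips every step
```

A head with a score near 1.0 could plausibly reuse its previous selection for several steps, while a head near 0.0 would need its importance scores refreshed every step.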
SkipKV: Selective Skipping of KV Generation and Storage for Efficient Inference with Large Reasoning Models. For long reasoning-model traces, many tokens contribute little to future attention. SkipKV identifies and skips both KV computation and storage for these tokens. Aimed at large reasoning models where the trace itself is the cost driver.
3. Long-Context Training & Context Parallelism
MTraining: Distributed Dynamic Sparse Attention for Efficient Ultra-Long Context Training (Microsoft). Three pillars: Dynamic Sparse Training Pattern, Balanced Sparse Ring Attention, Hierarchical Sparse Ring Attention. Focus on the training side (not just inference) of ultra-long context; explicitly designs for comp-comm overlap across nodes.
DistCA: Efficient Long-context LLM Training by Core Attention Disaggregation (CAD). Architecturally splits core attention off as a disaggregated stage so it can be scaled/scheduled independently of the rest of the transformer pipeline — the long-context analogue of prefill/decode disaggregation in serving.
FCP: Unleashing Scalable Context Parallelism for Foundation Models Pre-Training. Problem: modern datasets are long-tailed up to 512K tokens; existing CP schemes either over-shard short sequences (RingAttention) or imbalance attention work (ByteScale). Key idea: treat every sequence as fixed-size blocks, place them on any GPU via arbitrary P2P comm, balance via bin-packing.
Components:
- Block Distributor
- Communication Planner (congestion-free P2P solver using bipartite matching — Lemma 1: one matching = one congestion-free round; Lemma 2: max-degree Δ → Δ disjoint matchings; optimal rounds via Hopcroft-Karp in O(N^2.5))
- Transparent Reshuffler
Evaluated on Llama-3.
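The Communication Planner's first lemma (one matching = one congestion-free round) can be illustrated with a small greedy scheduler. This is a sketch under stated assumptions: the transfer list is made up, and the greedy peeling below is a stand-in for the poster's matching-based solver. It always produces valid rounds but, unlike an exact decomposition, is not guaranteed to hit the Δ-round bound on every input.

```python
from collections import Counter


def schedule_rounds(transfers):
    """Greedily split (src, dst) block transfers into congestion-free rounds.

    Within a round every GPU sends at most one block and receives at most
    one block, so each round is a matching in the sender/receiver graph.
    """
    remaining = list(transfers)
    rounds = []
    while remaining:
        busy_src, busy_dst = set(), set()
        this_round, deferred = [], []
        for src, dst in remaining:
            if src in busy_src or dst in busy_dst:
                deferred.append((src, dst))   # would congest; try next round
            else:
                this_round.append((src, dst))
                busy_src.add(src)
                busy_dst.add(dst)
        rounds.append(this_round)
        remaining = deferred
    return rounds


def max_degree(transfers):
    """Largest number of sends or receives on any one GPU (the Δ lower bound)."""
    sends = Counter(src for src, _ in transfers)
    recvs = Counter(dst for _, dst in transfers)
    return max(list(sends.values()) + list(recvs.values()))


# Hypothetical P2P plan between three GPUs.
transfers = [(0, 1), (0, 2), (1, 2), (2, 0), (1, 0)]
rounds = schedule_rounds(transfers)
for i, r in enumerate(rounds):
    print(f"round {i}: {r}")
print("rounds:", len(rounds), "max degree:", max_degree(transfers))
```

On this input the greedy schedule happens to use two rounds, matching the maximum degree of 2; on adversarial inputs it can use more, which is where an exact matching decomposition earns its keep.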
Locality-Aware Beam Scheduling for Efficient Test-Time Compute with a Consumer-grade GPU. Schedules beam-search–style test-time compute to exploit KV-cache locality on a single consumer GPU. Useful for “best-of-N / tree-of-thought” style decoding when you only have one device.
4. Quantization (Training, Inference, Communication)
MixLLM: LLM Quantization with Global Mixed-precision between Output-features and Highly-Efficient System Design (Microsoft). Recipe: 10% 8-bit + 90% 4-bit with fused-kernel execution → W4.4 quality matches FP16 while running 2.3×–3.1× faster than FP16 linear kernels. Frames the question as “high accuracy + low memory + high system efficiency” simultaneously.
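The "W4.4" label follows directly from the stated mix. The short check below computes the average weight bitwidth; it deliberately ignores per-group scales and zero-points, whose storage overhead the notes do not specify, and the helper name is an assumption for this example.

```python
from fractions import Fraction


def average_weight_bits(mix):
    """Average bits per weight for a {bitwidth: fraction_of_weights} mix."""
    if sum(mix.values()) != 1:
        raise ValueError("fractions must sum to 1")
    return sum(bits * share for bits, share in mix.items())


# 10% of output features at 8-bit, 90% at 4-bit (the recipe quoted above).
mix = {8: Fraction(1, 10), 4: Fraction(9, 10)}
avg_bits = average_weight_bits(mix)
print(float(avg_bits))  # 4.4
```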
Search Your Block Floating Point Scales! (ScaleSearchAttention). Motivation: modern BFP formats (NVFP4, MXFP4) outperform FP8 for ML numerics. ScaleSearch: a fine-grained search over neighbours of the max-scale, picking the scale that minimises MSE quantization error. ScaleSearchAttention: end-to-end NVFP4-native attention pipeline — all major matmuls use the NVFP4 tensor core, KV cache also NVFP4, with no de-quantization steps. Empirically matches baseline on language-model attention; larger blocks help.
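The scale-search idea can be sketched in a few lines: start from the conventional max-based scale, try a handful of neighbouring scales, and keep the one with the lowest MSE. The candidate factors, the toy block, and the function names below are assumptions for illustration; the sketch also keeps the scale in full precision, whereas real NVFP4 stores a low-precision per-block scale.

```python
# Magnitudes representable by a 4-bit E2M1 element (sign handled separately).
FP4_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)


def quantize(block, scale):
    """Round each value to the nearest FP4 grid point after dividing by scale."""
    out = []
    for x in block:
        mag = min(FP4_GRID, key=lambda g: abs(abs(x) / scale - g))
        out.append((mag if x >= 0 else -mag) * scale)
    return out


def mse(block, scale):
    """Mean squared quantization error of the block at the given scale."""
    q = quantize(block, scale)
    return sum((a - b) ** 2 for a, b in zip(block, q)) / len(block)


def scale_search(block, factors=(0.75, 0.875, 1.0, 1.125, 1.25)):
    """Pick the MSE-minimizing scale among neighbours of the max-based scale."""
    max_scale = max(abs(x) for x in block) / FP4_GRID[-1]
    if max_scale == 0:
        return 1.0  # all-zero block: any positive scale is exact
    candidates = [max_scale * f for f in factors]
    return min(candidates, key=lambda s: mse(block, s))


# Hypothetical 16-element block.
block = [0.1, -0.4, 0.9, 1.3, -2.2, 0.05, 3.1, -0.7,
         0.6, 1.9, -1.1, 0.3, 2.6, -0.2, 0.8, 5.0]
baseline_scale = max(abs(x) for x in block) / FP4_GRID[-1]
best_scale = scale_search(block)
print(mse(block, baseline_scale), mse(block, best_scale))
```

Because the max-based scale is itself one of the candidates, the searched scale can never do worse than the baseline on the block it was tuned for.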
FP8-Flow-MoE: A Casting-Free FP8 Recipe Without Double Quantization Error. Targets large-scale MoE training, which is dominated by communication + token reordering + expert GEMMs + memory pressure. Recipe: casting-free FP8 dataflow, scaling-aware FP8 transpose, fused FP8 operators, fine-grained recomputation.
CAGE: Curvature-Aware Gradient Estimation For Accurate Quantization-Aware Training. Replaces vanilla STE-style gradient estimators with a curvature-aware estimator for QAT, aimed at preserving accuracy at aggressive bitwidths.
Shannonic: Lossless Compression for Quantized ML Workloads. Adds a lossless compression layer on top of quantized weights/activations — orthogonal to the lossy quantization itself, recovering additional bandwidth/memory.
5. Attention Kernels
FlashAttention-4: Algorithm and Kernel Pipelining Co-Design for Asymmetric Hardware Scaling (Princeton, Meta, Colfax International, NVIDIA, Georgia Tech). Motivation: Blackwell shifts the attention bottleneck — tensor-core throughput scales faster than exp() throughput and shared-memory bandwidth, so softmax + SMEM become the constraint.
- Forward pass: ping-pong pipeline overlaps softmax with MMA; 2× G-lines / 2× D-lines per CTA; while one tile runs MMA, the other computes softmax; a dedicated correction warp keeps rescaling off the critical path.
- Backward pass: new softmax pipelining; addresses SMEM bandwidth; assigns load (warp 13), MMA (warp 12), compute (warps 0–11), dQ reduce (warps 0–3).
- Problem + fix: dQ MMA’s reduction axis is split between 2 CTAs → use DSMEM so each CTA exchanges half of dS.
Take-away: FlashAttention has become a hardware-coupled stack; FA-4 is a Blackwell-era rewrite, not just a tuning pass.
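For readers less familiar with what the "correction" step rescales, here is the standard online-softmax recurrence that FlashAttention-style kernels build on, written as plain serial Python for a single query row with scalar values. It is a sketch of the arithmetic only; it says nothing about FA-4's warp assignment or pipelining, and the function names are assumptions for this example.

```python
import math


def attention_row_tiled(scores, values, tile=2):
    """softmax(scores) . values for one query row, processed tile by tile."""
    running_max = -math.inf
    denom = 0.0   # running softmax denominator
    acc = 0.0     # running un-normalized output
    for start in range(0, len(scores), tile):
        s_tile = scores[start:start + tile]
        v_tile = values[start:start + tile]
        new_max = max(running_max, max(s_tile))
        # Correction: rescale state accumulated under the old max.
        correction = math.exp(running_max - new_max)
        denom *= correction
        acc *= correction
        for s, v in zip(s_tile, v_tile):
            p = math.exp(s - new_max)
            denom += p
            acc += p * v
        running_max = new_max
    return acc / denom


def attention_row_reference(scores, values):
    """Same quantity computed with a plain, untiled softmax."""
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)


scores = [0.5, 2.0, -1.0, 3.0, 0.0]
values = [1.0, 2.0, 3.0, 4.0, 5.0]
tiled = attention_row_tiled(scores, values)
reference = attention_row_reference(scores, values)
print(tiled, reference)
```

The rescaling is only needed when a new tile raises the running max, which is why it can be handled off the critical path of the MMA and exp work.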
FLASHLIGHT: PyTorch Compiler Extensions to Accelerate Attention Variants. Compiler extensions in PyTorch that let users express attention variants and get FlashAttention-class kernels without writing CUDA. Take-away: the “long tail” of attention variants (sliding window, sparse masks, custom scoring) is the practical accessibility gap FlashAttention left open.
IntAttention: A Fully Integer Attention Pipeline for Efficient Edge Inference. Identifies softmax + QDQ as the bottleneck in quantized attention. Design goal: consume INT32, produce INT8/UINT8 directly, avoid FP32 exp/division/conversion, no retraining or QAT required. Take-away: most “INT8 attention” pipelines still float through softmax. Removing the float round-trip is what unlocks edge-scale speedups.
6. MoE Training & Serving
CRAFT: Fine-Grained Cost-Aware Expert Replication For Efficient Mixture-of-Experts Serving. Tackles expert load imbalance in MoE serving by replicating hot experts across layers, with a cost model that decides which/how many to replicate. Claims latency and accuracy improvements vs. existing schemes.
MoEBlaze: Breaking the Memory Wall for Efficient MoE Training on Modern GPUs.
- Zero-Buffer Token Dispatch: compact index + offset replace per-expert buffers; no padding, no token dropping.
- Fused Expert Computation: single-pass Triton kernels fuse gating, gather, GEMM, and activation.
- Smart Activation Checkpointing: recompute SwiGLU intermediates in backward; fuse two first-layer projections with SwiGLU gating in a single kernel.
Frames prior work as a choice between dropping (Switch/GShard, hurts quality) and dropless (MegaBlocks, materializes large buffers); MoEBlaze claims to remove both compromises.
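The "compact index + offset" idea resembles a counting sort over expert assignments: one flat array of token ids grouped by expert, plus an offsets array marking where each expert's slice begins. The sketch below assumes top-1 routing and uses made-up names; it illustrates the data layout only, not MoEBlaze's kernels.

```python
def build_dispatch(expert_ids, num_experts):
    """Return (index, offsets) such that expert e owns
    index[offsets[e]:offsets[e + 1]], with no padding and no dropped tokens."""
    counts = [0] * num_experts
    for e in expert_ids:
        counts[e] += 1

    # Exclusive prefix sum of the per-expert counts.
    offsets = [0] * (num_experts + 1)
    for e in range(num_experts):
        offsets[e + 1] = offsets[e] + counts[e]

    # Scatter token ids into their expert's slice, preserving token order.
    cursor = offsets[:-1].copy()
    index = [0] * len(expert_ids)
    for token, e in enumerate(expert_ids):
        index[cursor[e]] = token
        cursor[e] += 1
    return index, offsets


# Hypothetical router output: token t is sent to expert expert_ids[t].
expert_ids = [2, 0, 2, 1, 0, 2]
index, offsets = build_dispatch(expert_ids, num_experts=4)
print(index)    # [1, 4, 3, 0, 2, 5]
print(offsets)  # [0, 2, 3, 6, 6]
```

Expert 2 reads tokens `index[3:6]`, and expert 3, which received nothing, gets an empty slice rather than a padded buffer.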
7. Speculative Decoding
Speculative Decoding: Performance or Illusion? Systematic study of SD in vLLM across SD variants (n-gram, EAGLE, EAGLE-3, draft-model SD), workloads, models, batch sizes.
Findings:
- SD improves throughput broadly but speedup drops as batch size increases.
- Verification dominates runtime; drafting overhead is <2% for n-gram, <20% for EAGLE/EAGLE-3, large for draft-model SD.
- Sampling accounts for <1.7% of runtime; other vLLM overheads are often <12%.
- Acceptance: n-gram benefits from repetitive patterns mid-reasoning but drops near the conclusion.
Take-away: SD is not free; whether it pays depends on batch size and on which SD variant you pick.
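The n-gram finding is easier to interpret with the drafting mechanism in view. Below is a generic sketch of n-gram (lookup-style) drafting: match the trailing n-gram against earlier context and propose whatever followed it. It is not vLLM's implementation; the function name, parameters, and token lists are assumptions for illustration.

```python
def ngram_draft(tokens, n=2, max_draft=4):
    """Propose draft tokens by finding the most recent earlier occurrence of
    the trailing n-gram and copying the tokens that followed it."""
    if len(tokens) <= n:
        return []
    suffix = tokens[-n:]
    # Scan right to left, starting just before the suffix itself.
    for start in range(len(tokens) - n - 1, -1, -1):
        if tokens[start:start + n] == suffix:
            follow = start + n
            return tokens[follow:follow + max_draft]
    return []


# Repetitive context (e.g. code or a restated derivation): a match exists.
repetitive = ["x", "=", "a", "+", "b", ";", "y", "=", "a", "+"]
print(ngram_draft(repetitive))  # ['b', ';', 'y', '=']

# Novel context (e.g. a final conclusion): nothing to copy, so no draft.
novel = ["the", "answer", "is", "42"]
print(ngram_draft(novel))       # []
```

Drafting costs almost nothing because it is a lookup, but the drafter only helps when the continuation repeats earlier text, which is consistent with acceptance dropping near a conclusion.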
Efficient Reasoning Model Training and Serving with Sparse Self-Speculative Decoding. Two observations: (1) long generation in reasoning models is memory-bound; (2) SD helps, but existing SD methods require training. Key insight: self-SD with sparse attention, where the model is its own drafter and sparse-attention shortcuts drive the draft pass. Take-away: training-free SD specifically tuned for reasoning-model decode; pairs naturally with sparse-attention KV strategies.
PRISM: Parametrically Refactor Inference for Speculative Decoding Draft Models (University of Waterloo). Refactors the parameters of the draft model rather than redesigning the algorithm — a knob-tuning approach orthogonal to EAGLE/Medusa/etc.
ReSpec: Towards Optimizing Speculative Decoding in Reinforcement Learning Systems. Addresses why naive speculative decoding is suboptimal inside an RL rollout loop (e.g., draft distribution drifts as the actor changes) and proposes a tailored design.
Distribution-aware Speculative Decoding for RL Training (UIUC + UCSD). Companion theme to ReSpec: the draft and target distributions in RL training are non-stationary, so SD has to be distribution-aware to preserve reward quality + latency reduction. Reports both latency reduction and matched/improved reward curves.
8. RL Training Systems
HetRL: Efficient Reinforcement Learning for LLMs in Heterogeneous Environments. RL trainer designed for the reality that rollout/training resources are heterogeneous (mixed GPU SKUs, mixed devices).
Measuring the Effectiveness of RLVR in Low Data and Compute Regimes (Snorkel, University of Wisconsin–Madison; co-authors now at Reflection AI and Oxford). Builds three procedural datasets for a controlled study of RL-with-Verifiable-Rewards (RLVR).
Three findings:
- Composition outweighs volume: what’s in the dataset matters more than how much.
- Scaling data alone can be ineffective.
- Token limits can bind before data does: sample length, not sample count, becomes the bottleneck.
PyLO: Towards Accessible Learned Optimizers in PyTorch. Goal: make learned optimizers practically usable from a stock PyTorch workflow (vs. needing JAX/research stacks). Lowers the activation energy for adoption.
9. Distributed Training Systems
HexScale: Facilitating Large Language Model Training over Heterogeneous Hardware (HKUST, Peking, SJTU). Two-phase scheduling algorithm for mixing GPU types in one training job; comparison against homogeneous baselines. Companion in spirit to HetRL but for pretraining/SFT.
NEST: Network- and Memory-Aware Device Placement for Distributed Deep Learning. Joint optimization of placement using both network topology and memory constraints — most prior placement work optimizes one in isolation.
veScale-FSDP: Flexible and High-Performance FSDP at Scale (ByteDance Seed). Sections: structure-aware training demands, where existing sharding falls short, zero-copy communication, RaggedShard. Take-away: ByteDance’s production answer to “FSDP doesn’t fit our workload” — a sharding layer designed around modern, irregular training graphs rather than retrofitted to them.
AXLearn: Modular, Hardware-Agnostic Large Model Training. Sections: problem framing, system comparison, background, key idea, configuration modifier. Take-away: trains the same model across TPU and GPU using a config-modifier abstraction rather than per-backend forks.
ParallelKittens: Systematic and Practical Simplification of Multi-GPU AI Kernels. Motivation: networking can consume >50% of runtime in production training and inference. Insights: scheduling of networking alongside compute matters; GPU networking libraries and DSLs carry design choices that quietly cost performance. Result: fused All-Gather + GEMM kernels demonstrating the simplification + speedup pattern. Take-away: the ThunderKittens lineage extended to multi-GPU — argues the right abstraction layer is fused collectives, not separated comm libraries.
FreeScale: Distributed Training for Sequence Recommendation Models with Minimal Scaling Cost. Three techniques: Prioritized Embedding Updates (PEU), SM-free communication, Sequence Load Balancing. Targets DLRM-style architectures at scale; tackles straggler GPUs and blocking embedding communication head-on.
GUARD: Scalable Straggler Detection and Node Health Management for Large-Scale Training. Sections: background & motivation, training-slowdown analysis, system overview, online monitor, offline node-management system. Take-away: explicit two-tier (online detection + offline node-health pipeline) rather than a single monitor.
HipKittens: Fast and Furious AMD Kernels (Stanford, AMD, UC San Diego). Motivation: AMD has SOTA hardware but lacks a simple, performant kernel framework. Contributions: heterogeneous matrix-core shapes + explicit register pinning, chiplet-aware thread-block scheduling, evaluation against wave-specialization baselines. Take-away: ThunderKittens-style abstractions ported to AMD, with chiplet-aware scheduling as the new piece relative to NVIDIA.
10. Inference Serving & Deployment
MorphServe: Efficient & Workload-Aware LLM Serving via Runtime Quantized Layer Swapping and KV Cache Resizing (University of Virginia + Harvard). Two co-designed mechanisms: runtime quantized layer swapping (swap in lower-precision copies of layers under load), and KV cache resizing in response to workload fluctuations.
HELIOS: Adaptive Model And Early-Exit Selection for Efficient LLM Inference Serving. Two insights: (1) not meeting confidence is okay — many requests tolerate skipped exits; (2) maximize early exits to free memory and compute headroom. Result: higher throughput via faster token processing and larger batch sizes for EE-LLM serving. Take-away: early-exit serving is throughput-limited because of how exits are scheduled, not whether exits exist.
BatchLLM: Optimizing Large Batched LLM Inference with Global Prefix Sharing and Throughput-oriented Token Batching. Three techniques: global prefix identification, grouped scheduling, memory-centric batching. Reported throughput: 1.3×–10.8× over vLLM and SGLang across micro-benchmarks and industry workloads; 2.5× speedup on heavy-tail decoding. Take-away: most “prefix caching” systems are still per-request — global identification across the queue is the win.
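A rough way to see why queue-wide prefix identification matters is to count prefill tokens with and without sharing. The sketch below inserts every prompt in the queue into a trie: each new trie node is one token that must actually be prefilled. The prompts and function name are hypothetical, and the sketch ignores scheduling and memory limits entirely.

```python
def prefill_tokens(prompts):
    """Return (tokens prefilled without sharing, tokens with global sharing)."""
    trie = {}
    shared = 0
    for prompt in prompts:
        node = trie
        for tok in prompt:
            if tok not in node:
                node[tok] = {}
                shared += 1   # first time this prefix is seen: must compute it
            node = node[tok]
    naive = sum(len(p) for p in prompts)
    return naive, shared


# Hypothetical queue of tokenized prompts; three share a common system prefix.
queue = [
    [1, 2, 3, 4],
    [1, 2, 3, 5],
    [1, 2, 6],
    [7, 8],
]
naive, shared = prefill_tokens(queue)
print(naive, shared)  # 13 8
```

The gap between the two counts grows with the number of requests that share a prefix, which a per-request cache only captures if those requests happen to arrive close together.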
A Pragmatic Take on Inference Disaggregation (NVIDIA). Walks through design principles, disaggregation-in-practice numbers, deployment considerations (KV-cache transfer and storage), and traffic sensitivity (NVLink, dynamic rate matching). Take-away: explicit “what breaks when traffic shifts” framing — disagg gains are sensitive to interconnect topology and rate-matching policy, not just to PD ratio.
SHIP: SRAM-based Huge Inference Pipelines for Fast LLM Serving. Argues for pipelining inference around an SRAM-resident hot path. Treats SRAM as a first-class capacity tier rather than just a cache.
DriftBench: Measuring and Predicting Infrastructure Drift in LLM Serving Systems. Sections: “the assumption everyone makes”, a real safety failure, why existing monitors miss it, “not all drift is the same”, “what you should actually do”. Take-away: infra drift (silicon, compiler, library) shows up as quality regressions that look like model regressions. Calls for drift-aware monitoring.
OptiKIT: Meeting SLOs, Slashing Hours: Automated Enterprise LLM Optimization. Sections: “the production reality”, constraints, “what this bought us”. Targets enterprise tenants where SLO + cost optimization has to happen without code changes.
Optimizing Deployment Configurations for LLM Inference: Retrospective (Meta Inference Team). Honest retrospective on why deployment config is hard: motivation, “why is deployment configuration so hard?”, infrastructure challenges, performance-estimator details. Take-away: a Meta-internal take on what the config search problem actually looks like once you cross from research to fleet.
11. Compilers, DSLs & Agentic Kernel Generation
Wave: A Symbolic Python DSL and Compiler for High-Performance Machine Learning (AMD). A symbolic Python DSL for ML kernels backed by an AMD-aware compiler. Sized for cross-backend authoring rather than per-architecture rewrites.
ApproxMLIR: An Accuracy-Aware Compiler for Compound AI Systems (University of Illinois Urbana-Champaign). Motivation: in compound AI pipelines (e.g. BM25 + RAG + LLM), the LLM is only ~75% of end-to-end latency — approximating only the LLM “leaves most of the latency on the table.” Design: an MLIR dialect with approx-management ops, approx-transform passes, dynamic approximation driven by application state, and an approx runtime for inter-kernel coordination. Take-away: extends the approximation toolbox from “the model” to “the pipeline” — accuracy as a first-class compiler concern across stages.
DynaFlow: Transparent & Flexible Intra-Device Parallelism via Programmable Operator Scheduling (University of Washington / SyFI). Programmable scheduler for intra-device parallelism, with the framing that current frameworks bake in scheduling choices that users cannot override.
CATWILD: Compiler Autotuning for TPU Workloads in the Wild (Google). Design: fleet-profiling → autotuner → fleet-delivery loop. 100k+ autotuned configs used daily; coarse vs. fine-grained decisions; safe delivery of autotuned configs back to the fleet. Take-away: feedback loop between fleet telemetry and XLA — autotuning at deployment scale, not per-job.
Event Tensor: A Unified Abstraction for Compiling Dynamic Megakernels. Proposes “Event Tensor” as the abstraction for compiling dynamic megakernels — a counterpoint to static-shape kernel compilation.
Agentic Operator Generation for ML ASICs (Meta MTIA). Introduces TritorX as a kernel-generation framework targeting MTIA. Sections: scaling AI with MTIA (MTIA chip cycle), TritorX framework, scale and results (number of operators, ATen coverage, end-to-end model evaluation, generalizability, future impact). Take-away: Meta’s stated bet on agentic operator generation as the path to MTIA op coverage.
AccelOpt: A Self-Improving LLM Agent System for AI Accelerator Kernel Optimization. Closes the loop between an LLM agent and an accelerator profiler so the agent iteratively improves kernels. Companion in spirit to TritorX and Wave from the agent-driven kernel direction.
12. Diffusion / Alternative LM Architectures
CDLM: Consistency Diffusion Language Models For Faster Sampling. Diagnoses why DLMs are slow: cache incompatibility (bidirectional attention breaks KV caching) and excessive denoising steps. Design: train a block-wise causal student from a DLM teacher, enabling exact KV caching for previous blocks.
Pipeline: teacher trajectory collection → supervision pairs → block-causal student training → inference with exact KV cache + confidence thresholding. Training objective: distillation loss + consistency loss + DLM loss.
Results: 3.6×–14.5× lower latency, 3.4×–7.9× fewer refinement steps, competitive accuracy on math and coding. Take-away: practical recipe for distilling diffusion-LM gains into a cache-friendly causal model — bridges DLM research and existing serving stacks.
13. Profiling, Observability & Reproducibility
XProf: An Open, Scalable, and Extensible Profiling System for the Modern ML Stack (Google). Unified host–device profiling with deep XLA compiler integration, multi-level tooling (model analysis, memory, performance, thermals), fleet-wide synchrony via a Global Timestamp Counter (GTC) on TPUs. Open and extensible via OpenXLA PJRT’s Profiler C API (JAX, PyTorch, TensorFlow). Parallel processing across 1000s of accelerators using a MapReduce-style collection framework. Take-away: Google’s answer to “what does a fleet-scale profiler look like for the modern ML stack”, and its move to externalize it.
Profinfer: An eBPF-based Fine-Grained LLM Inference Profiler (Huawei + TUM). Online tracer (lightweight eBPF for llama.cpp) + offline trace analyzer with three views: ProfDAG (operator-level DAG), ProfStat (statistics across tokens, operators, experts), ProfTime (Perfetto-style timeline). MoE-specific tracing: activated expert IDs, per-iteration execution time, “average distance” — the average number of tokens between two consecutive activations of the same expert. Take-away: eBPF-based observability finally pointed at the LLM serving stack; the “expert activation distance” metric is a useful new lens on MoE routing.
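The "average distance" metric can be computed from a per-token trace of activated expert IDs. The sketch below takes the distance between two consecutive activations to be the difference of their token positions; that convention, the trace, and the function name are assumptions for illustration (the poster may count the tokens strictly in between instead).

```python
from collections import defaultdict


def average_expert_distance(activations):
    """activations[t] lists the expert ids activated at token position t.

    Returns {expert_id: mean gap in token positions between consecutive
    activations}. Experts activated fewer than twice are omitted.
    """
    last_seen = {}
    gaps = defaultdict(list)
    for t, experts in enumerate(activations):
        for e in experts:
            if e in last_seen:
                gaps[e].append(t - last_seen[e])
            last_seen[e] = t
    return {e: sum(g) / len(g) for e, g in gaps.items()}


# Hypothetical top-2 routing trace over 5 tokens.
trace = [[0, 1], [0, 2], [1, 2], [0, 1], [2, 3]]
distances = average_expert_distance(trace)
print(distances)
```

A small average distance means an expert is reused almost every token, while a large one marks experts that are touched rarely and may be candidates for offloading.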
Hawkeye: Reproducing GPU-Level Nondeterminism (Pearl Research Labs + Stanford). Learns the hidden numerics of a GPU: rounding direction, subnormal handling, accumulation order. Evaluates at FP16, FP32, FP64. Contribution: reproduces matrix-multiply results bit-exact across architectures and precisions. Impact framing: auditable ML inference, ML service-provider accountability, research reproducibility. Take-away: treats GPU nondeterminism as a security/auditability problem rather than just an annoyance.
MLCommons Chakra: Advancing Performance Benchmarking and Co-design via Standardized Execution Traces. Captures a distributed AI workload as a graph of compute / memory / communication ops with dependencies — without exposing model weights or IP-sensitive details. Use cases: trace analysis, simulator/emulator input, trace replay. Ongoing extensions: large-scale trace handling, compression + hierarchical indexing for partial-selective loading, infrastructure abstraction, inference workloads. Take-away: MLCommons is positioning Chakra as the lingua franca for system-level trace sharing — the obvious counterpart to MLPerf at the workload-trace layer.
14. Agentic Systems & Agent Safety
The OpenHands Software Agent SDK. Retrospective on OpenHands V0 → V1. Four tensions became V1 design principles: universal sandboxing vs. local flexibility, mutable configuration vs. deterministic state, monorepo vs. modular SDK, OpenHands V1 as composable SDK architecture. V1 evaluation: SDK reduced system-attributable production errors by 61% while preserving agent capability. Take-away: useful design document for anyone building a long-lived agent runtime — the V0→V1 tensions are recognizable in every internal agent stack.
PROMPTS: Performance Optimization via Multi-Agent Planning for LLM Training and Serving (University of Maryland College Park, Google, DeepMind). A multi-agent planner that optimizes training and serving configs. Sections: motivation, experimental setup, results, takeaways, example workflow. Take-away: agentic search applied to the LLM-systems config space — a programmatic complement to OptiKIT and CATWILD.
ADR: An Agent Detection and Response System for Enterprise Agent AI Security (University of Oxford). “How to detect enterprise AI agents at scale”: observability, evaluation, 200k+ sessions/day; a three-tier system (Tier 1 triage → Tier 2 reason → Explorer harden). Emerging challenges observed in deployment: excessive / insufficient agency, hallucination. Take-away: the security side of “agent ops” is starting to look like its own discipline — similar pattern to early WAF/DLP for traditional apps.
PARROT: Persuasion and Agreement Robustness Rating of Output Truth. Evaluates models for sycophancy: validating a user’s claim even when it reduces factual accuracy. Tests whether models preserve truth under adversarial assertions. Uses false-authority templates across abstract algebra, professional medicine, philosophy, computer science, global facts. Take-away: a dual-path failure protocol that distinguishes “doesn’t know” from “knew, then switched under social pressure”.
15. Edge & Mobile Inference
Efficient, VRAM-Constrained xLM Inference on Clients (NVIDIA). Motivation: lossless xLM inference at any user-specified VRAM budget, for in-game inference (IGI SDK) and Cosmos-Reason1 physical-AI VLM. Pipelined Sharding: install-time profiling of CPU + GPU kernels across quants/shapes/threads; per-invocation plan picking execution backend (CPU/GPU) and memory residency (sysRAM/VRAM) per sub-layer; prioritized assignment attn > KV > FFN > outputs. Reported: up to 8.2× batched throughput, Qwen-30B reaches 289 TPS at bs=64; 2× faster TTFT on PCIe Gen5 than on Gen3; Pareto-optimal SKU mapping. Co-run scenario: Qwen-30B (Q4, ~16 GB on disk) sharing GPU with Cyberpunk 2077. Take-away: client-side inference framed not as a research curio but as a shippable product surface (gaming + physical AI).
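The prioritized assignment (attn > KV > FFN > outputs) can be pictured as a greedy fill of the VRAM budget in priority order. The sketch below is a deliberately coarse illustration: it places whole component classes rather than individual sub-layers, the sizes are invented, and the rule of skipping a component that does not fit while still trying lower-priority ones is an assumption, not the poster's stated policy.

```python
# Priority order quoted in the notes: attention first, outputs last.
PRIORITY = ("attn", "kv", "ffn", "outputs")


def assign_residency(sizes_gb, vram_budget_gb):
    """Greedily place components in VRAM by priority; spill the rest to sysRAM."""
    placement = {}
    free = vram_budget_gb
    for name in PRIORITY:
        size = sizes_gb[name]
        if size <= free:
            placement[name] = "VRAM"
            free -= size
        else:
            placement[name] = "sysRAM"
    return placement


# Hypothetical component sizes (GB) and a user-specified 8 GB VRAM budget.
sizes = {"attn": 3, "kv": 2, "ffn": 10, "outputs": 1}
plan = assign_residency(sizes, vram_budget_gb=8)
print(plan)
```

With these numbers attention, KV cache, and outputs stay in VRAM while the FFN weights spill to system RAM, which is the kind of split where PCIe bandwidth starts to dominate latency.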
CORE: Rethinking DVFS for Mobile LLMs: Unified Energy-Aware Scheduling. Two goals: G1 — same energy budget, G2 — same latency target. Shows default Android DVFS leaves substantial energy and latency on the table for LLM workloads specifically.
Attribution-based Sparse Activation in Large Language Models. Problem: mobile-LLM inference is cost-/memory-bound; traditional zero-output-neuron deactivation only works for ReLU-era LLMs. Approach: deactivate neurons by contribution rather than magnitude — works for modern non-ReLU activations. Take-away: bridges “sparse activation” research into the SwiGLU / GELU era.
EarthSight: A Distributed Framework for Low-latency Satellite Intelligence. Distributed framework for satellite ML; uses global context and adaptive filter ordering. Reported: P90 image-delivery latency reduced from 51 minutes to ~21 minutes vs SOTA baselines (range reflects hardware-enhanced simulation). Take-away: explicit “edge = orbit” framing; ML systems are a real bottleneck for satellite intelligence pipelines.
16. Networking, Communication & HPC
fabric-lib: RDMA Point-to-Point Communication for LLM Systems. A P2P RDMA library built specifically for LLM serving topologies.
Sections: Transfer Engine, Weight Transfer, Disaggregated Prefill-Decode, Collectives for P2P, Synthesized Initialization, Operation Ordering, Usability Challenge.
Take-away: argues that NCCL-style collectives are the wrong primitive for disaggregated PD; P2P with the right ordering + init story is a better building block.
A Lightweight Collective-Capable NoC for Large-Scale ML Accelerators (ETH Zurich + University of Bologna). A network-on-chip design that natively supports collectives (all-reduce, etc.) inside the chip’s interconnect, sized for large-scale ML accelerators.
SAKURAONE: An Open Ethernet-Based AI HPC System and Its Observed Workload Dynamics. 100-node cluster, NVIDIA H100 SXM per node. Fully open networking stack: rail-optimized 800 GbE leaf-spine running SONiC + RoCEv2 rather than a proprietary fabric. Reports a TOP500 placement at ISC 2025 as the only top-100 system using a fully open networking stack. Workload observations on a single-tenant LLM project: empirical reference for mid-scale (~hundreds of GPUs) production GPU clusters. Take-away: rare published baseline for what fully-open-networking AI HPC actually looks like at scale.
17. Simulation & Fleet Efficiency
Charon: A Unified and Fine-Grained Simulator for Large-Scale LLM Training and Inference. Sections: background & motivation, proposed Charon simulation platform, design challenges in existing simulators, end-to-end-time / hardware-scale / memory-and-trace-generation experiments. Take-away: another entry in the “unified pre-deployment simulator” race; relevant to compare against Chakra-driven simulators.
Machine Learning Fleet Efficiency (Google). Four sections: abstract, Scheduling Goodput (SG) optimizations, ML Productivity Goodput, Runtime Goodput (RG) optimizations. Take-away: Google publicly framing fleet efficiency as a three-axis problem rather than just “GPU utilization”.
18. Misc
Massive-Scale Out-of-Core UMAP on the GPU. Sections: why UMAP doesn’t scale, failure of naive methods, the proposed approach, results. Out-of-core GPU UMAP for visualizing datasets larger than VRAM.
Spira: Exploiting Voxel Data Structural Properties for Efficient Sparse Convolution in Point Cloud Networks. Specialized sparse-conv kernels that leverage voxel-grid structure for point-cloud networks.
Cross-Cutting Themes
KV cache is the most-discussed topic across both sessions — OPKV, Kitty, FlexiCache, SkipKV, MTraining, MorphServe attack it from sparsity recall, 2-bit quant, head-stability, selective skipping, sparse-attention training, and runtime resizing.
FP4/NVFP4/BFP is the new quantization frontier. ScaleSearch, MixLLM, and FP8-Flow-MoE collectively show the community moving to FP8 and below with scale-search, mixed precision, and casting-free recipes.
Long-context is being attacked at every layer: context-parallel scheduling (FCP), attention disaggregation (DistCA), sparse-attention training (MTraining), KV management (FlexiCache / SkipKV / Kitty), and serving (MorphServe).
Disaggregation continues to spread beyond the original prefill/decode pattern — fabric-lib (P2P primitives for PD-disagg), A Pragmatic Take (deployment realities), DistCA (attention as a disaggregated stage), MorphServe (runtime layer swapping), FlashAttention-4 (CTA-level splitting via DSMEM), MoEBlaze (expert-level dispatching).
Heterogeneous hardware is now a first-class assumption (HetRL, HexScale, NEST, AXLearn) — no longer treated as an edge case.
Plugin / extensibility framing appears in OPKV (sparsity plugin API), PyLO (accessible learned optimizers), FLASHLIGHT (PyTorch compiler extensions), and OpenHands SDK — researchers are packaging systems as composable APIs rather than monolithic prototypes.
The kernel layer is being agentified. TritorX (Meta MTIA), AccelOpt, FLASHLIGHT, and Wave all point at the same thing: hand-written CUDA/HIP/Triton is being displaced by some combination of compilers, DSLs, and LLM agents. ParallelKittens and HipKittens are the human-written counterpoints in the same wave.
FlashAttention-4 makes the algorithm–hardware coupling explicit. The “Blackwell shifts the attention bottleneck” framing — softmax + SMEM are now the constraint, not MMAs — likely defines the next 12 months of attention-kernel work.
Speculative decoding has entered its “is this actually free?” phase. Five posters across both days (Speculative Decoding: Performance or Illusion?, Sparse Self-Speculative for reasoning, PRISM, ReSpec, Distribution-aware SD) push back against the default “SD is a strict win” assumption. Batch size, reasoning vs. chat, RL-rollout dynamics, and draft-model choice all change the verdict.
RL training systems have grown into their own subfield — HetRL, ReSpec, Distribution-aware SD, RLVR low-resource, and PyLO explicitly target post-training infra.
Profiling, observability, and reproducibility have a critical mass. XProf (Google), Profinfer (eBPF for llama.cpp), Hawkeye (GPU determinism), Chakra (standardized traces) — four serious infra posters on what was historically a fringe topic at MLSys.
Agent systems are now MLSys content. OpenHands SDK, PROMPTS, ADR, PARROT, AccelOpt span the agent stack from runtime to security to evaluation — a genuinely new theme relative to past MLSys years.
Mobile/edge LLMs are a first-class track. NVIDIA’s client xLM work, CORE (DVFS), IntAttention (integer pipeline), and Attribution-based Sparse Activation make a coherent edge story — and the NVIDIA poster signals the shipping side, not just the research side.
Operational reality is being published. Meta’s deployment-config retrospective, Google’s Fleet Efficiency and CATWILD posters, OpenHands’ V0→V1 retrospective, and DriftBench are all “here’s what actually breaks in production” posters. MLSys is increasingly willing to platform negative results and post-mortems.