Accepted Papers
Poster Session 1 (December 6 - 11:15 a.m.)
- Improving Value Estimation Critically Enhances Vanilla Policy Gradient
- Data-Dependent Regret Bounds for MABs with Constraints
- Policy Optimization in CMDPs with Bandit Feedback: Learning with Stochastic and Adversarial Constraints
- Test Time Risk Adaption with Mixture of Agents
- Optimal Regret Bounds for Policy Optimization in Contextual Bandits
- Thinking vs. Doing: Agents that Reason by Scaling Test-Time Interaction
- Large Language Model-Enhanced RL for Diverse and Novel Recommendations
- Convergence and Sample Complexity of First-Order Methods for Agnostic Reinforcement Learning
- Towards shutdownable agents via stochastic choice
- When Maximum Entropy Misleads Policy Optimization
- Regret Bounds for Adversarial Contextual Bandits with General Function Approximation and Delayed Feedback
- Unifying Agent Interaction and World Information for Multi-agent Coordination
- Efficiently Robust In-Context Reinforcement Learning with Adversarial Generalization and Adaptation
- What Makes a Reward Model a Good Teacher? An Optimization Perspective
- A Regularized Actor-Critic Algorithm for Bi-Level Reinforcement Learning
- Horizon Reduction Makes RL Scalable
- Policy Testing in Markov Decision Processes
- Scaling Offline RL via Efficient and Expressive Shortcut Models
- RL’s Razor: Why On-Policy Reinforcement Learning Forgets Less
- Real-World Reinforcement Learning of Active Perception Behaviors
- Linear Dynamics meets Linear MDPs: Closed-Form Optimal Policies via Reinforcement Learning
- Safe Guaranteed Dynamics Exploration with Probabilistic Models
- Idea: Sharpe Ratio-Optimized Thompson Sampling for Risk-Aware Online Learning
- Policy Compatible Skill Incremental Learning via Lazy Learning Interface
- Oracle-RLAIF: An Improved Fine-Tuning Framework for Multi-modal Video Models through Reinforcement Learning from Ranking Feedback
- Revisiting Actor-Critic Methods in Discrete Action Off-Policy Reinforcement Learning
- Speaking the Language of Teamwork: LLM-Guided Credit Assignment in Multi-Agent Reinforcement Learning
- Robust Constrained Offline Reinforcement Learning with Linear Function Approximation
- LLM-Driven Policy Diffusion: Enhancing Generalization in Offline Reinforcement Learning
- Hybrid Training for Enhanced Multi-task Generalization in Multi-agent Reinforcement Learning
- Structure Matters: Dynamic Policy Gradient
- SUSD: Structured Unsupervised Skill Discovery through State Factorization
- State Entropy Regularization for Robust Reinforcement Learning
- Shift Before You Learn: Enabling Low-Rank Representations in Reinforcement Learning
- Stackelberg Learning from Human Feedback: Preference Optimization as a Sequential Game
- How to Provably Improve Return Conditioned Supervised Learning?
- Accelerating Diffusion Planners in Offline RL via Reward-Aware Consistency Trajectory Distillation
- Exploration Implies Data Augmentation: Reachability and Generalisation in Contextual MDPs
- Learning a Pessimistic Reward in RLHF: KL Regularization is Not Necessary
- Beyond Marginals: Capturing Correlated Returns through Joint Distributional Reinforcement Learning
- Safe Exploration via Policy Priors
- Spectral Collapse Drives Loss of Plasticity in Deep Continual Learning
- Improved Training Mechanisms for Reinforcement Learning via Online Model Selection
- Ada-Diffuser: Latent-Aware Adaptive Diffusion for Decision-Making
- Improved Regret Bounds for Linear Bandits with Heavy-Tailed Rewards
- TARC: Time-Adaptive Robotic Control
- Quantile Reward Policy Optimization: Alignment with Pointwise Regression and Exact Partition Functions
- Policy Search via Bayesian Optimization with Temporal Difference Gaussian Processes
- Bigger, Regularized, Categorical: High-Capacity Value Functions are Efficient Multi-Task Learners
- Compute-Optimal Scaling for Value-Based Deep RL
- When Data is the Algorithm: A Systematic Study and Curation of Preference Optimization Datasets
Poster Session 2 (December 6 - 3:30 p.m.)
- A Reinforcement Learning Approach for Health-Behavioural Recommendations to Reduce Cancer Risk
- DHP: Discrete Hierarchical Planning for HRL Agents
- From Contextual Combinatorial Semi-Bandits to Bandit List Classification: Improved Sample Complexity with Sparse Rewards
- Efficient Adversarial Attacks on High-dimensional Offline Bandits
- Beyond RLHF and NLHF: Population-Proportional Alignment under an Axiomatic Framework
- Policy Gradient Guidance Enables Test Time Control
- Generalized Linear Bandits: Almost Optimal Regret with One-Pass Update
- Provably Efficient and Agile Randomized Q-Learning
- Idea: Bridging Theoretical Fairness Definitions with Multi-Agent Coordination in the Real World
- Human-Inspired Multi-Level Reinforcement Learning
- Safe, Trust Region Policy Optimization for Constrained Reinforcement Learning
- Long-Horizon Model-Based Offline Reinforcement Learning Without Conservatism
- Reward Model Overoptimisation in Iterated RLHF
- Demystifying the Mechanisms Behind Emergent Exploration in Goal-conditioned RL
- Uncertainty-Aware Policy-Preserving Abstractions with Abstention for One-Shot Decisions
- A Theoretical Analysis of Information Bottlenecks for Zero-Shot Transfer in Reinforcement Learning
- Robust Policy Gradient Optimization through Parameter Perturbation in Reinforcement Learning
- All Roads Lead to Likelihood: The Value of RL in Fine-Tuning
- Enhancing Diversity in Large Language Models via Determinantal Point Processes
- Efficient Restarts in Non-Stationary Model-Free Reinforcement Learning
- Bootstrap Ensemble Uncertainty for State-Adaptive Regularization in Offline Reinforcement Learning
- Behavior-Aware Off-Policy Selection in High-Stake Human-Centric Environments
- The Role of Preference Data and Unembeddings in the Convergence Rate of DPO
- Active Learning for Stochastic Contextual Linear Bandits
- Constrained Linear Thompson Sampling
- Towards Parameter-Free Temporal Difference Learning
- Generating Auxiliary Tasks with Reinforcement Learning
- Open Problem: Order Optimal Regret Bounds for Non-Markovian Rewards
- Exploring Time-Step Size in Reinforcement Learning for Sepsis Treatment
- Intent-Based Reward Inference for Value-Aligned Reinforcement Learning
- MOBODY: Model-Based Off-Dynamics Offline Reinforcement Learning
- Outcome-based Exploration for LLM Reasoning
- Idea: Fairness Constraints as Reliability Guarantees for RLHF Reward Models
- Principled Learning-to-Communicate in Cooperative MARL: An Information-Structure Perspective
- Steering Diffusion Policies with Value-Guided Denoising
- The Minimax Complexity of Preference-Based Decision Making in Multi-Objective Reinforcement Learning
- Sail into the Headwind: Alignment via Robust Rewards and Dynamic Labels against Reward Hacking
- Replicable Reinforcement Learning with Linear Function Approximation
- floq: Training Critics via Flow-Matching for Scaling Compute in Value-Based RL
- Optimistic Actor-Critic with Parametric Policies: Unifying Sample Efficiency and Practicality
- SpatialThinker: Reinforcing 3D Reasoning in Multimodal LLMs via Spatial Rewards
- Revisiting Mixture Policies in Entropy-Regularized Actor-Critic
- Bandit and Delayed Feedback in Online Structured Prediction
- Sharp Gap-Dependent Variance-Aware Regret Bounds for Tabular MDPs
- Fictive Learning Augments Model-Based Reinforcement Learning in the Two-Step Task
- On the relation of bisimulation, model irrelevance, and corresponding regret bounds
- Unsupervised Contrastive Goal Reaching
- Automatic Reward Shaping from Multi-Objective Human Heuristics
- Bandit Learning on Dynamic Graphs
- The Good, The Bad, and The Hybrid: A Reward Structure Showdown in Reasoning Models Training