Enterprise AI Analysis: AMARIS: A Memory-Augmented Rubric Improvement System for Rubric-Based Reinforcement Learning

Rubric-based reward shaping is effective for fine-tuning LLMs via RL, but current adaptive methods discard diagnostic information, limiting long-term learning. AMARIS addresses this by grounding rubric modifications in long-term training history, utilizing structured rollout analysis, step-level summaries, and hybrid static/dynamic memory retrieval. This asynchronous system consistently outperforms baselines across diverse domains, demonstrating that persistent evaluation memory can transform rubric-based reward shaping into an evidence-driven loop for RL training.

Key Enterprise Impact Metrics

AMARIS's innovative approach delivers tangible improvements in AI model performance and operational efficiency.

+1.6 Avg. Gain on GPQA-D
+1.0 Avg. Gain on IFBench
~5% Time Overhead (Async)
Reduction in Rubric Reversals

Deep Analysis & Enterprise Applications

Overview
Methodology
Performance Results
Ablation Study
Efficiency
Rubric Dynamics
+1.6 points average gain on GPQA-Diamond over the strongest baseline.
~5% additional training time overhead with asynchronous execution.

AMARIS (A Memory-Augmented Rubric Improvement System) addresses the limitations of current adaptive rubric methods by grounding rubric modifications in long-term training history. It achieves this through structured rollout analysis, step-level summaries, and hybrid static/dynamic memory retrieval. This asynchronous system consistently outperforms baselines across diverse domains (science, medicine, instruction following, creative writing), demonstrating that persistent evaluation memory can transform rubric-based reward shaping into an evidence-driven loop for RL training.

AMARIS System Workflow

Individual Rollout Analysis
Step-level Batch Summarization
Query Generation (Dynamic Retrieval)
Memory Retrieval (Static + Dynamic)
Rubric Improvement (Update/Create/Delete/Reweight/Merge/Split)

AMARIS's core methodology centers on a persistent evaluation memory. This memory stores all previous evaluations, including individual rollout analyses, step-level summaries, and rubric update records. Each document is indexed with structured metadata and an embedding, allowing for both static (recent steps) and dynamic (semantically matched) retrieval. This rich historical context enables AMARIS to make evidence-driven rubric modifications, avoiding the pitfalls of reactive, short-sighted adjustments common in previous adaptive systems.
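
A minimal sketch of the hybrid retrieval described above, under stated assumptions: the `MemoryDoc` fields, the `retrieve` function, and the reading of N as "number of most recent steps" and K as "number of semantically matched documents" are illustrative guesses rather than the paper's definitions, and embeddings are assumed to come from a model that is not shown.

```python
from dataclasses import dataclass
import math


@dataclass
class MemoryDoc:
    step: int               # training step at which the document was written
    kind: str               # e.g. "rollout_analysis", "step_summary", "rubric_update"
    text: str
    embedding: list[float]  # assumed to come from an embedding model (not shown)


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(memory, current_step, query_embedding, n_static=4, k_dynamic=10):
    """Return (static, dynamic) document lists for the current step."""
    # Static: every document written during the last n_static steps.
    static = [d for d in memory if current_step - n_static <= d.step < current_step]
    # Dynamic: the k_dynamic older documents most similar to the query.
    older = [d for d in memory if d.step < current_step - n_static]
    older.sort(key=lambda d: cosine(d.embedding, query_embedding), reverse=True)
    return static, older[:k_dynamic]


memory = [
    MemoryDoc(1, "rubric_update", "refusal penalty backfired", [1.0, 0.0]),
    MemoryDoc(2, "step_summary", "formatting drift", [0.0, 1.0]),
    MemoryDoc(5, "step_summary", "length padding", [0.0, 1.0]),
    MemoryDoc(6, "rollout_analysis", "over-refusal on low-risk question", [1.0, 0.0]),
]
static, dynamic = retrieve(memory, current_step=7, query_embedding=[1.0, 0.0],
                           n_static=2, k_dynamic=1)
print([d.step for d in static], [d.step for d in dynamic])
```

In this toy run the static window returns the documents from steps 5 and 6, while the dynamic query reaches back to step 1, the only older document that matches the query direction.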

40.4 Highest accuracy on GPQA-Diamond, achieved with per-instance rubrics.

AMARIS achieves the best performance on every benchmark across diverse domains. The per-instance rubric setting is generally the strongest; among the results tabulated below, the global-rubric variant leads only on WritingBench (57.9 vs. 57.3). Reported gains over the strongest non-AMARIS baseline are +1.6 points on GPQA-Diamond, +1.0 on HealthBench, +0.6 on IFEval, +1.1 on InfoBench, +1.0 on IFBench, +1.0 on WritingBench, and +1.1 on CW-v3. The table below covers five of these benchmarks.

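| Method | GPQA-D | HealthBench | IFEval | InfoBench | WritingBench |
| --- | --- | --- | --- | --- | --- |
| Naive (no RL) | 35.0 | 22.7 | 77.3 | 78.1 | 45.2 |
| RubricHub (RuRL) | 38.8 | 33.0 | 79.8 | 83.5 | 56.9 |
| RuscaRL | 38.5 | 32.9 | 79.0 | 83.2 | 56.1 |
| AMARIS (global rubric) | 39.9 | 33.6 | 80.6 | 84.8 | 57.9 |
| AMARIS (per-instance rubric) | 40.4 | 34.0 | 81.0 | 85.2 | 57.3 |
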
+1.9 points gain on GPQA-D with combined memory over the no-memory baseline.

Ablation studies confirm the critical role of memory in AMARIS's performance. Both static and dynamic memory retrieval contribute to performance gains, with their combination yielding the strongest results. Static memory excels in domains like science, where recent training trajectory is vital for addressing rubric drift. Dynamic memory, in contrast, benefits tasks like HealthBench, where suboptimal behaviors may recur across distant steps. This complementary strength ensures robust and adaptable rubric evolution.

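| Memory Setting | GPQA-D | HealthBench | IFEval | InfoBench |
| --- | --- | --- | --- | --- |
| No memory | 38.0 | 31.8 | 79.7 | 83.5 |
| Static only (N=4) | 38.9 | 33.1 | 80.2 | 84.2 |
| Dynamic only (K=10) | 38.6 | 33.3 | 80.1 | 84.4 |
| Static + Dynamic (N=4, K=10) | 39.9 | 33.6 | 80.6 | 84.8 |
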
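
The contribution of each memory setting can be read off the ablation table by subtracting the no-memory row. The short sketch below does that arithmetic with the table's own numbers; the function and variable names are illustrative only.

```python
# Scores copied from the ablation table above.
benchmarks = ["GPQA-D", "HealthBench", "IFEval", "InfoBench"]
scores = {
    "No memory": [38.0, 31.8, 79.7, 83.5],
    "Static only (N=4)": [38.9, 33.1, 80.2, 84.2],
    "Dynamic only (K=10)": [38.6, 33.3, 80.1, 84.4],
    "Static + Dynamic (N=4, K=10)": [39.9, 33.6, 80.6, 84.8],
}


def gains_over_baseline(scores, baseline="No memory"):
    """Per-benchmark point gain of every setting over the baseline row."""
    base = scores[baseline]
    return {
        name: [round(s - b, 1) for s, b in zip(row, base)]
        for name, row in scores.items()
        if name != baseline
    }


gains = gains_over_baseline(scores)
for name, row in gains.items():
    print(name, dict(zip(benchmarks, row)))
```

Static-only retrieval gains more than dynamic-only on GPQA-D (+0.9 vs. +0.6), dynamic-only gains more on HealthBench (+1.5 vs. +1.3), and the combination is highest on all four benchmarks (+1.9, +1.8, +0.9, +1.3).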
63.1% of pipeline time is spent on individual rollout analysis, the most token-intensive stage.

AMARIS is designed to add minimal overhead to RL training. The rubric improvement procedure runs asynchronously, in parallel with the normal RL loop, so training does not wait on it. Individual rollout analysis is the most time- and token-intensive component (63.1% of pipeline time). Run synchronously, the pipeline adds roughly 92% to training time relative to static rubrics; run asynchronously, the overhead drops to about 5%, at a cost of about 0.2 points on GPQA-D (39.9 vs. 40.1) and 0.1 points on HealthBench (33.6 vs. 33.7).

| Pipeline mode | GPQA-D | HealthBench | Time (h) | Overhead |
| --- | --- | --- | --- | --- |
| Sync | 40.1 | 33.7 | ~146 | ~92% |
| Async | 39.9 | 33.6 | ~80 | ~5% |
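
One way to picture the asynchronous mode is a training loop that hands rubric improvement to a background worker and keeps using the current rubric until the improved one is ready. The sketch below is only an illustration of that pattern, not the paper's implementation: `improve_rubric` and `train_step` are hypothetical stand-ins, and the rubric is reduced to a dictionary with a version counter.

```python
from concurrent.futures import ThreadPoolExecutor


def improve_rubric(rubric, rollouts):
    """Hypothetical stand-in for analysis, summarization, retrieval, and update."""
    return {**rubric, "version": rubric["version"] + 1}


def train_step(step, rubric):
    """Hypothetical stand-in for one RL step; returns the rollouts it produced."""
    return [f"rollout-{step}-{i}" for i in range(2)]


def train(num_steps, rubric):
    versions_used = []
    pending = None  # at most one rubric-improvement job in flight
    with ThreadPoolExecutor(max_workers=1) as pool:
        for step in range(num_steps):
            # Swap in the improved rubric only if the background job has finished;
            # otherwise keep training with the current rubric instead of waiting.
            if pending is not None and pending.done():
                rubric = pending.result()
                pending = None
            versions_used.append(rubric["version"])
            rollouts = train_step(step, rubric)
            if pending is None:
                pending = pool.submit(improve_rubric, rubric, rollouts)
    return rubric, versions_used


final_rubric, versions_used = train(5, {"version": 0})
print(versions_used)
```

A synchronous variant would call `pending.result()` right after submitting, which blocks every step on the rubric pipeline; that blocking is where the extra training time in the Sync row comes from.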

AMARIS Rubric Evolution Stages

Defensive (Early Training)
Curriculum Advancement (Mid Training)
Maintenance (Late Training)
Overall reduction in short-term rubric reversals across domains, demonstrating stability.

Case Study: Dynamic Correction of Reward Hacking

In the medical domain, AMARIS detected a recurring over-refusal pattern on general medication questions. In one example, the model refused outright in a common, low-risk scenario: the response earned reward from safety rubrics (e.g., "avoid dosing changes") while forgoing helpfulness. AMARIS diagnosed this as an overly broad refusal.

Dynamic retrieval was critical here. The query "refusal penalty backfire unsafe dosing advice" uncovered historical evidence: a previous refusal penalty had caused unwanted direct answers on high-risk questions, and successful updates involved low-risk guidance with extra steps. AMARIS applied precise reweighting and introduced new rubrics to reward safe low-risk guidance positively. This demonstrates how memory enables targeted corrections to prevent recurring suboptimal behaviors.

Key Takeaway: AMARIS uses historical context to counter reward hacking with targeted corrections rather than generic penalties, improving both safety and helpfulness.
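
The two operations used in this case study, reweighting an existing criterion and creating a new one, can be sketched as small functions over a rubric. Everything here is hypothetical: the `Criterion` fields, the criterion names and descriptions, and the weights are made up for illustration, and the source does not state which weights were changed or by how much.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Criterion:
    name: str
    description: str
    weight: float


def reweight(rubric, name, new_weight):
    """Return a copy of the rubric with one criterion's weight changed."""
    if not any(c.name == name for c in rubric):
        raise KeyError(name)
    return [replace(c, weight=new_weight) if c.name == name else c for c in rubric]


def create(rubric, criterion):
    """Return a copy of the rubric with a new criterion appended."""
    if any(c.name == criterion.name for c in rubric):
        raise ValueError(f"criterion {criterion.name!r} already exists")
    return rubric + [criterion]


# Hypothetical starting rubric for the medication-question example.
rubric = [
    Criterion("avoid_dosing_changes", "Does not recommend changing a prescribed dose.", 1.0),
    Criterion("helpfulness", "Gives the user actionable information.", 1.0),
]

# Illustrative weights only: soften the safety criterion that rewarded blanket
# refusals, then add a criterion that rewards safe guidance on low-risk questions.
rubric = reweight(rubric, "avoid_dosing_changes", 0.5)
rubric = create(
    rubric,
    Criterion("safe_low_risk_guidance", "Gives safe general guidance on low-risk questions.", 1.0),
)
print([(c.name, c.weight) for c in rubric])
```

Returning new lists instead of mutating in place keeps each earlier rubric version intact, which is convenient if update records are stored in memory for later retrieval.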

Your Path to Adaptive AI Rubrics

A typical implementation timeline for integrating AMARIS within an existing RL training pipeline, optimized for rapid value realization.

Phase 1: Foundation & Integration (Weeks 1-4)

Set up core AMARIS modules including individual rollout analysis and persistent memory. Integrate with existing RL pipeline in asynchronous mode for minimal disruption.

  • Key Activities: Data packet standardization, LLM API configuration, initial rubric set cold-start, asynchronous execution setup.

Phase 2: Adaptive Refinement & Optimization (Weeks 5-12)

Enable step-level summarization and memory-augmented rubric improvement. Monitor rubric evolution and fine-tune retrieval configurations for optimal performance.

  • Key Activities: A/B testing memory configurations (static/dynamic), analyzing rubric dynamics, iterative rubric adjustments, initial performance benchmarks.

Phase 3: Scaling & Enterprise Rollout (Weeks 13-20)

Scale AMARIS to wider LLM applications. Implement advanced curriculum advancement strategies and consolidate best practices.

  • Key Activities: Integration with multiple LLM projects, cross-domain rubric generalization, continuous monitoring & feedback loop establishment.
