Enterprise AI Analysis
Efficient LLM Serving for Agentic Workflows: A Data Systems Perspective
Noppanat Wadlom, Junyi Shen, and Yao Lu, National University of Singapore, Singapore
This paper introduces Helium, a novel workflow-aware serving framework designed to optimize agentic Large Language Model (LLM) workflows. These workflows consist of interdependent LLM calls and contain extensive redundancy, which existing LLM serving systems handle inefficiently. Helium models such workloads as query plans, treats LLM invocations as first-class operators, and integrates proactive caching with cache-aware scheduling. The result is significant speedups, demonstrating that end-to-end optimization is crucial for scalable and efficient LLM-based agents.
Quantifiable Impact for Your Enterprise
Helium boosts performance for LLM-powered agentic workflows by managing resources across the whole workflow and optimizing cross-call dependencies. The reported evaluation shows up to 100.92x speedup over naive vLLM and up to 4.32x over AgentScope.
Deep Analysis & Enterprise Applications
The Inefficiency of Current LLM Serving
Agentic workflows, critical for modern AI, involve complex sequences of LLM calls that often contain massive redundancy due to overlapping prompts and speculative exploration. Current LLM serving systems, like vLLM, primarily focus on optimizing individual inference calls and lack visibility into the broader workflow structure. This "operator-level myopia" prevents them from capturing cross-call commonalities, leading to significant inefficiencies. Helium addresses this by rethinking LLM serving from a holistic, data systems perspective.
Helium's Workflow-Aware Design
Helium adopts a multi-stage query processing architecture, treating agentic workflows as query plans where LLM invocations are first-class operators. It leverages a domain-specific language (DSL) to represent workflows as symbolic Directed Acyclic Graphs (DAGs). This allows for cross-operator continuous batching and enables the system to apply advanced optimizations, including proactive caching and cache-aware scheduling, beyond the scope of individual LLM calls.
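To make the idea of a workflow as a symbolic DAG concrete, here is a minimal sketch. It is not Helium's DSL: the `Op` class, its fields, and the `build_plan` helper are hypothetical names chosen for illustration. It only shows how LLM calls can be declared as operators with named dependencies and then ordered so that dependencies run first.

```python
from dataclasses import dataclass
from graphlib import TopologicalSorter


@dataclass(frozen=True)
class Op:
    """One node of a symbolic workflow DAG (hypothetical structure)."""
    name: str
    kind: str                      # e.g. "input" or "llm"
    inputs: tuple[str, ...] = ()   # names of upstream operators
    template: str = ""             # prompt template for "llm" operators


def build_plan(ops):
    """Return operator names in an order where dependencies come first."""
    graph = {op.name: set(op.inputs) for op in ops}
    return list(TopologicalSorter(graph).static_order())


# A small draft -> critique -> revise workflow, declared before anything runs.
workflow = [
    Op("question", "input"),
    Op("draft", "llm", ("question",), "Answer the question: {question}"),
    Op("critique", "llm", ("question", "draft"), "Critique this answer: {draft}"),
    Op("final", "llm", ("draft", "critique"), "Revise {draft} using {critique}"),
]
order = build_plan(workflow)
print(order)
```

Because the whole graph is known before execution, a serving layer can inspect every prompt template and dependency up front, which is the precondition for the optimizations described in the following sections.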
Intelligent Query Optimization
Helium's query optimizer rewrites logical DAGs to eliminate redundancy. It performs Operator Pruning to remove dead code and Common Subgraph Elimination (CSE) to consolidate identical subgraphs, preventing redundant computations. Furthermore, it employs a Global Prompt Cache that maps inputs of deterministic operators to their outputs, replacing cache hits with lightweight CacheFetch operators, thus converting computational dependencies into simple data retrievals. This dramatically reduces re-computation for repeated tasks.
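The two rewrites named above, Operator Pruning and Common Subgraph Elimination, can be sketched on a toy plan. This is an illustration under stated assumptions, not Helium's optimizer: the plan is a plain dict mapping a hypothetical operator name to `(operation, inputs)`, the dict is assumed to be listed in dependency order, and every operator is assumed to be deterministic so that identical operations on identical inputs can be merged safely.

```python
def prune(dag, outputs):
    """Operator pruning: keep only nodes reachable backwards from the outputs."""
    keep, stack = set(), list(outputs)
    while stack:
        name = stack.pop()
        if name not in keep:
            keep.add(name)
            stack.extend(dag[name][1])
    return {n: spec for n, spec in dag.items() if n in keep}


def eliminate_common_subgraphs(dag):
    """CSE: merge nodes with the same operation and the same (merged) inputs.

    Assumes `dag` lists dependencies before dependents.
    """
    canonical = {}  # (operation, inputs) signature -> surviving node name
    alias = {}      # original node name -> surviving node name
    result = {}
    for name, (op, inputs) in dag.items():
        merged_inputs = tuple(alias[i] for i in inputs)
        signature = (op, merged_inputs)
        if signature in canonical:
            alias[name] = canonical[signature]   # duplicate: reuse earlier node
        else:
            canonical[signature] = name
            alias[name] = name
            result[name] = (op, merged_inputs)
    return result, alias


# Toy trading-style plan: two identical summaries and one unused operator.
dag = {
    "news": ("input:news", ()),
    "sum_a": ("llm:Summarize {0}", ("news",)),
    "sum_b": ("llm:Summarize {0}", ("news",)),
    "bull": ("llm:Argue bullish on {0}", ("sum_a",)),
    "bear": ("llm:Argue bearish on {0}", ("sum_b",)),
    "debug": ("llm:Explain {0}", ("news",)),
    "decide": ("llm:Decide from {0} and {1}", ("bull", "bear")),
}
pruned = prune(dag, ["decide"])
optimized, alias = eliminate_common_subgraphs(pruned)
print(sorted(optimized), alias["sum_b"])
```

In this toy plan, pruning drops `debug` because nothing downstream of it is requested, and CSE folds `sum_b` into `sum_a`, so `bear` now reads from the single surviving summary.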
Proactive Cache Management & Scheduling
To maximize KV cache reuse, Helium constructs a Templated Radix Tree (TRT), a novel data structure that captures the prefix structure and dependencies of prompts. The TRT enables proactive caching: KV states for static prompt prefixes are precomputed and stored in GPU memory. A cost-based, cache-aware scheduling algorithm then uses the TRT to assign operators to workers and determine an execution order that balances load and maximizes shared-prefix reuse, minimizing makespan and prefill costs.
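A heavily simplified stand-in for the prefix-tree idea is sketched below. It is an assumption-laden illustration rather than the TRT itself: prompts are split into hypothetical `(kind, text)` segments, where `static` segments are fixed template text and `slot` segments are filled at runtime, and the tree is an ordinary trie over those segments. The sketch finds static prefixes shared by several operators (candidates for precomputed KV states) and produces an order that keeps operators with a common prefix adjacent. It ignores dependencies and costs, which a real scheduler would have to respect.

```python
STATIC, SLOT = "static", "slot"


class Node:
    def __init__(self):
        self.children = {}  # segment -> Node
        self.ops = []       # operators whose prompt ends at this node


def insert(root, op_name, segments):
    """Add one operator's prompt, given as a list of (kind, text) segments."""
    node = root
    for seg in segments:
        node = node.children.setdefault(seg, Node())
    node.ops.append(op_name)


def count_ops(node):
    return len(node.ops) + sum(count_ops(c) for c in node.children.values())


def shared_static_prefixes(node, path=()):
    """List (prefix, n_ops) for static-only prefixes used by 2+ operators."""
    out = []
    for (kind, text), child in node.children.items():
        if kind != STATIC:
            continue  # anything after a slot depends on runtime values
        prefix = path + (text,)
        if count_ops(child) >= 2:
            out.append((prefix, count_ops(child)))
        out.extend(shared_static_prefixes(child, prefix))
    return out


def prefix_grouped_order(node):
    """Depth-first order, so operators sharing a prefix end up adjacent."""
    order = list(node.ops)
    for child in node.children.values():
        order.extend(prefix_grouped_order(child))
    return order


root = Node()
analyst = (STATIC, "You are a trading analyst.")
insert(root, "bull", [analyst, (STATIC, "Argue the bullish case."), (SLOT, "summary")])
insert(root, "risk", [(STATIC, "You are a risk officer."), (SLOT, "bull"), (SLOT, "bear")])
insert(root, "bear", [analyst, (STATIC, "Argue the bearish case."), (SLOT, "summary")])

shared = shared_static_prefixes(root)
order = prefix_grouped_order(root)
print(shared, order)
```

Here the analyst system prompt is the only static prefix used by more than one operator, so it is the natural candidate for precomputation, and the depth-first order places `bull` and `bear` next to each other even though `risk` was registered between them.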
Empirical Performance Gains
Evaluations across diverse agentic workflows demonstrate Helium's robust performance. It achieves up to 100.92x speedup over naive vLLM, up to 4.32x over AgentScope, and outperforms other state-of-the-art systems like LangGraph, Parrot, and KVFlow. Ablation studies confirm that plan pruning, cache-aware scheduling, and prompt caching are critical contributors to these gains, validating Helium's holistic approach to end-to-end optimization.
Enterprise Process Flow: Helium's Optimization Steps
| Feature | Helium | Traditional LLM Serving (e.g., vLLM) | Agentic Orchestrators (e.g., LangGraph) | Workflow-Aware Serving (e.g., Parrot, KVFlow) |
|---|---|---|---|---|
| Workflow-Aware Optimization | | | | |
| Proactive Caching | | | | |
| Cache-Aware Scheduling | | | | |
| Cross-Call/Workflow Reuse | | | | |
| End-to-End Latency Improvement | | | | |
Case Study: Dynamic Optimization in Trading Workflow
Helium's approach transforms complex agentic workflows, such as the Trading workflow, into highly efficient execution plans. Initially, a raw workflow DAG presents numerous LLM calls, many of which are redundant or share common prefixes. Helium's query optimizer first identifies and eliminates these redundancies by replacing repeated computations with CacheFetch operators, significantly pruning the plan.
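The CacheFetch substitution described here can be illustrated with a small sketch. The names `cache_key` and `rewrite_with_cache`, the tuple layout of the plan, and the SHA-256 key are all assumptions made for the example, not Helium's implementation. The sketch also assumes the operators are deterministic and that their inputs are already known when the plan is rewritten.

```python
import hashlib


def cache_key(op, inputs):
    """Hash an operator's template and its concrete inputs into a cache key."""
    payload = repr((op, tuple(inputs))).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def rewrite_with_cache(plan, cache):
    """Replace operators whose result is already cached with CacheFetch nodes."""
    rewritten = []
    for name, op, inputs in plan:
        key = cache_key(op, inputs)
        if key in cache:
            # Cache hit: the LLM call becomes a plain lookup by key.
            rewritten.append((name, "CacheFetch", (key,)))
        else:
            rewritten.append((name, op, inputs))
    return rewritten


# A previous run already summarized this headline, so its output is cached.
headline = "AAPL earnings beat"
cache = {cache_key("Summarize: {0}", (headline,)): "Earnings exceeded forecasts."}

plan = [
    ("summary", "Summarize: {0}", (headline,)),
    ("bull", "Argue bullish: {0}", (headline,)),
]
rewritten = rewrite_with_cache(plan, cache)
print([op for _, op, _ in rewritten])
```

Only the `summary` operator is rewritten, because only its exact template and input were seen before; `bull` still requires an LLM call.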
Following this, a Templated Radix Tree is constructed to model prompt structures and dependencies. The cache-aware scheduler then uses this information to determine an execution sequence. For example, it groups operators that share prompt prefixes to maximize KV cache reuse, while interleaving independent tasks to maintain high GPU utilization. This holistic view, from initial DAG to dynamic execution, allows Helium to achieve substantial latency reductions and throughput improvements that systems lacking this global awareness cannot match.
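The trade-off between prefix reuse and load balance can be sketched with a simple greedy heuristic. This is not Helium's scheduling algorithm: the `assign` function, the abstract cost units, and the rule "pick the worker with the earliest finish time, skipping prefix prefill when that worker already holds the prefix" are assumptions chosen to show the shape of a cost-based, cache-aware decision.

```python
def assign(ops, n_workers):
    """Greedy cache-aware placement.

    ops: list of (name, prefix_id, prefix_cost, rest_cost) in abstract cost units.
    Returns (placement, makespan).
    """
    load = [0] * n_workers                      # accumulated cost per worker
    cached = [set() for _ in range(n_workers)]  # prefixes held by each worker
    placement = {}
    for name, prefix_id, prefix_cost, rest_cost in ops:

        def finish(worker):
            # Prefix prefill is free if this worker already cached the prefix.
            extra = 0 if prefix_id in cached[worker] else prefix_cost
            return load[worker] + extra + rest_cost

        best = min(range(n_workers), key=finish)
        load[best] = finish(best)
        cached[best].add(prefix_id)
        placement[name] = best
    return placement, max(load)


# Two prompt families; operators arrive interleaved.
ops = [
    ("bull", "analyst", 100, 20),
    ("news_a", "reporter", 100, 20),
    ("bear", "analyst", 100, 20),
    ("news_b", "reporter", 100, 20),
]
placement, makespan = assign(ops, n_workers=2)
print(placement, makespan)
```

With these numbers each worker ends up serving one prompt family, so every shared prefix is prefilled once and the two workers finish at the same time. A placement that ignored the cache, such as alternating workers by arrival order, would happen to match it here; reordering the operators so that the two families are not interleaved shows where the greedy rule and load balance pull in different directions.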
Your Path to Efficient LLM Agents
Our proven implementation roadmap ensures a seamless transition to Helium, maximizing your AI investment with minimal disruption.
Discovery & Strategy
In-depth assessment of current LLM workloads, identification of key agentic workflows, and alignment with business objectives. Develop a tailored strategy for Helium integration.
Pilot & Optimization
Set up a pilot deployment of Helium, integrate initial workflows, and run performance benchmarks. Use Helium's query optimizer for initial performance tuning and caching strategies.
Full Scale Deployment
Expand Helium across your enterprise, integrating all critical agentic workflows. Continuous monitoring and iterative optimization for sustained efficiency and scalability.
Continuous Improvement
Ongoing support, performance reviews, and updates to leverage the latest advancements in Helium. Adapt to evolving workload patterns and LLM models for peak performance.