- Overview
- Architecture
- Architecture Steps
- Plan Your Deployment
- Quick Start Guide
- Important Notes
- Notices
This solution implements a comprehensive, scalable ML inference architecture using Amazon EKS, leveraging both Graviton processors for cost-effective CPU-based inference and GPU instances for accelerated inference. The system provides a complete end-to-end platform for deploying large language models with agentic AI capabilities, including RAG (Retrieval Augmented Generation) and intelligent document processing.
Figure 1. Reference Architecture for the EKS cluster with add-ons
Figure 2. Reference Architecture for Gen AI applications deployed on the EKS cluster
The architecture diagrams illustrate our scalable ML inference solution with the following components:
- Amazon EKS Cluster: The foundation of our architecture, providing a managed Kubernetes environment with automated provisioning and configuration.
- Karpenter Auto-scaling: Dynamically provisions and scales compute resources based on workload demands across multiple node pools (see the NodePool sketch after this list).
- Node Pools:
  - Graviton-based nodes (ARM64): Cost-effective CPU inference using m8g/c8g instances
  - GPU-based nodes (x86_64): High-performance inference using NVIDIA GPU instances (g5, g6 families)
  - x86-based nodes: General-purpose compute for compatibility requirements
- Model Hosting Services:
  - Ray Serve: Distributed model serving with automatic scaling
  - Standalone Services: Direct model deployment for specific use cases
  - Multi-modal Support: Text, vision, and reasoning model capabilities
- Model Gateway:
  - LiteLLM Proxy: Unified OpenAI-compatible API gateway with load balancing and routing
  - Ingress Controller: External access management with SSL termination
- Agentic AI Applications:
  - RAG with OpenSearch: Intelligent document retrieval and question answering
  - Intelligent Document Processing (IDP): Automated document analysis and extraction
  - Multi-Agent Systems: Coordinated AI workflows with specialized agents
- Observability & Monitoring:
  - Langfuse: LLM observability and performance tracking
  - Prometheus & Grafana: Infrastructure monitoring and alerting
This architecture provides the flexibility to choose between cost-optimized CPU inference on Graviton processors and high-throughput GPU inference, based on your specific requirements, while maintaining elastic scalability through Kubernetes and Karpenter.
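To make the node-pool concept concrete, the following is a minimal sketch of a Graviton-oriented Karpenter NodePool. It assumes Karpenter v1 CRDs are installed and that an EC2NodeClass named `default` already exists; the pool name, instance families, and CPU limit are illustrative, not the configuration shipped with this guidance.

```bash
# Minimal sketch of a Graviton (arm64) NodePool for CPU inference.
# Assumes Karpenter v1 CRDs and an existing EC2NodeClass named "default";
# all names and limits here are illustrative placeholders.
cat <<'EOF' | kubectl apply -f -
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: graviton-inference
spec:
  template:
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
      requirements:
        - key: kubernetes.io/arch
          operator: In
          values: ["arm64"]
        - key: karpenter.k8s.aws/instance-family
          operator: In
          values: ["m8g", "c8g"]
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]
  limits:
    cpu: "64"
EOF
```

A GPU pool would follow the same pattern, with `kubernetes.io/arch` set to `amd64` and the g5/g6 instance families.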
- Foundation Setup: The foundation is an Amazon EKS cluster configured for application readiness, with its compute plane managed by Karpenter. This setup efficiently provisions both AWS Graviton and GPU-based instances across multiple Availability Zones (AZs), ensuring robust infrastructure distribution and high availability of the various Kubernetes services.
- User Request Entry: User interaction starts with HTTP requests directed to an exposed endpoint managed by Elastic Load Balancing (ELB). This entry point serves as the gateway for all user queries and ensures balanced distribution of incoming traffic while maintaining consistent accessibility.
- Orchestration and Analysis: The orchestrator agent, powered by the Strands Agent SDK, serves as the central coordination hub. It processes incoming queries by connecting to reasoning models served through LiteLLM and vLLM, analyzing each request, and determining the appropriate workflow and tools needed for response generation. LiteLLM functions as a unified API gateway for model management, providing centralized security controls and standardized access to both embedding and reasoning models (see the request sketch after these steps).
- Knowledge Validation: Knowledge validation begins as the orchestrator agent delegates to the RAG agent to verify that the knowledge base is current. When updates are needed, the RAG agent embeds the new information into the Amazon OpenSearch cluster, ensuring the knowledge base remains current and comprehensive.
- Embedding Process: The embedding process is handled within a KubeRay cluster, where the Ray head node dynamically manages worker node scaling based on resource demands. These worker nodes execute the embedding process through the llama.cpp framework, while the RAG agent simultaneously embeds user questions and searches for relevant information within the OpenSearch cluster.
- Quality Assurance: Quality assurance is performed by the Evaluation Agent, which leverages models hosted in Amazon Bedrock. This agent implements a feedback-based correction loop using RAGAS metrics to assess response quality and provides relevancy scores to the orchestrator agent for decision-making purposes.
- Web Search Fallback: When the RAG agent's responses receive low relevancy scores, the orchestrator agent initiates a web search process. It retrieves the Tavily web search API tool from the web search MCP server and performs dynamic queries to obtain supplementary or corrective information.
- Final Response Generation: The final response generation occurs on GPU instances running vLLM. The reasoning model processes the aggregated information, including both knowledge base data and web search results when applicable, refining and rephrasing the content to create a coherent and accurate response for the user.
- Comprehensive Tracking: Throughout this entire process, the Strands Agent SDK maintains comprehensive interaction tracking. All agent activities and communications are automatically traced, with the resulting data visualized through the Langfuse service, providing complete transparency and monitoring of the system's operations.
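To make the gateway step concrete, the hedged sketch below sends OpenAI-compatible requests through the LiteLLM proxy: a chat completion against a reasoning model and an embedding request for RAG indexing. The endpoint hostname, API key, and model aliases are placeholders; substitute whatever your own deployment registers in LiteLLM.

```bash
# Hedged sketch of calling the LiteLLM gateway through the ELB endpoint.
# The hostname, API key, and model aliases are placeholders, not values
# shipped with this guidance.
GATEWAY="http://<your-elb-dns-name>/v1"   # placeholder gateway endpoint
API_KEY="sk-..."                          # placeholder LiteLLM key

# Chat completion against a reasoning model served by vLLM behind LiteLLM
curl -s "$GATEWAY/chat/completions" \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3-14b",
    "messages": [{"role": "user", "content": "Summarize this admission note."}]
  }'

# Embedding request for RAG indexing against the embedding model
curl -s "$GATEWAY/embeddings" \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model": "snowflake-arctic-embed-s", "input": "patient presents with chest pain"}'
```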
You are responsible for the cost of the AWS services used while running this guidance. As of September 2025, the cost of running this guidance with the default settings in the US East (N. Virginia) Region (us-east-1) is approximately $447.47 per month.

We recommend creating a budget through AWS Cost Explorer to help manage costs; a sample AWS CLI command follows the cost table below. Prices are subject to change. For full details, refer to the pricing webpage for each AWS service used in this guidance.

The following table provides a sample cost breakdown for deploying this guidance with the default parameters in the us-east-1 (N. Virginia) Region for one month, based on the AWS Pricing Calculator output for the full deployment.
| AWS service | Dimensions | Cost per month [USD] |
|---|---|---|
| Amazon EKS | 1 cluster | $73.00 |
| Amazon VPC | 3 NAT Gateways | $98.67 |
| Amazon EC2 | 2 m6g.large instances | $112.42 |
| Amazon Managed Service for Prometheus (AMP) | Metric samples, storage, and queries | $100.60 |
| Amazon Managed Grafana (AMG) | Metrics visualization - Editor and Viewer users | $14.00 |
| Amazon EBS | gp2 storage volumes and snapshots | $17.97 |
| Application Load Balancer | 1 ALB for workloads | $16.66 |
| Amazon VPC | Public IP addresses | $3.65 |
| AWS Key Management Service (KMS) | Keys and requests | $7.00 |
| Amazon CloudWatch | Metrics | $3.00 |
| Amazon ECR | Data storage | $0.50 |
| TOTAL | | $447.47 |
For a more accurate estimate based on your specific configuration and usage patterns, we recommend using the AWS Pricing Calculator.
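As a hedged sketch of the budget recommendation above, the AWS Budgets CLI can create a monthly cost budget with an alert. The budget name, $450 limit, and email address below are illustrative placeholders.

```bash
# Minimal sketch: a monthly cost budget with an 80% actual-spend alert.
# The budget name, limit, and subscriber email are illustrative.
ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)
aws budgets create-budget \
  --account-id "$ACCOUNT_ID" \
  --budget '{"BudgetName":"eks-ml-inference-guidance","BudgetLimit":{"Amount":"450","Unit":"USD"},"TimeUnit":"MONTHLY","BudgetType":"COST"}' \
  --notifications-with-subscribers '[{"Notification":{"NotificationType":"ACTUAL","ComparisonOperator":"GREATER_THAN","Threshold":80,"ThresholdType":"PERCENTAGE"},"Subscribers":[{"SubscriptionType":"EMAIL","Address":"you@example.com"}]}]'
```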
This sample code utilizes various third-party packages, modules, models, and datasets, including but not limited to:
- Strands Agent SDK
- Qwen3-14B
- snowflake-arctic-embed-s model
- LiteLLM
- Langfuse
- Medical Text for Text Classification public dataset
Important Notice:
- Amazon Web Services (AWS) is not affiliated with these third-party entities or their components.
- The maintenance, updates, and security of these third-party dependencies are the sole responsibility of the customer/user.
- Users should regularly review and update these dependencies to ensure security and compatibility.
- Users are responsible for compliance with all applicable licenses and terms of use for these third-party components.
Please review and comply with all relevant licenses and terms of service for each third-party component before using them in your applications.
Before proceeding with this solution, ensure you have:
- AWS CLI configured with appropriate permissions for EKS, ECR, CloudFormation, and other AWS services
- kubectl installed and configured to access your target AWS region
- Docker installed and running (required for building and pushing container images)
- Sufficient AWS service quotas - This solution requires multiple EC2 instances, EKS clusters, and other AWS resources
- Valid Hugging Face token - Required for accessing models (see instructions below)
- Tavily API key - Required for web search functionality in agentic applications
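Both credentials are typically supplied as environment variables before deployment. The variable names below (HF_TOKEN, TAVILY_API_KEY) are common conventions, not necessarily the exact names this guidance's scripts read, so confirm them against the deployment scripts.

```bash
# Hedged sketch: export the required credentials before deploying.
# HF_TOKEN and TAVILY_API_KEY are common variable names; confirm the
# exact names expected by this guidance's deployment scripts.
export HF_TOKEN="hf_..."            # Hugging Face access token (placeholder)
export TAVILY_API_KEY="tvly-..."    # Tavily web search API key (placeholder)
```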
Recommended Setup Verification:
```bash
# Verify AWS CLI access
aws sts get-caller-identity

# Verify kubectl installation
kubectl version --client

# Verify Docker is running
docker ps

# Check available AWS regions and the On-Demand Standard instance vCPU quota
aws ec2 describe-regions
aws service-quotas get-service-quota --service-code ec2 --quota-code L-1216C47A
```
Cost Awareness: This solution will incur AWS charges. Review the cost breakdown in the Plan Your Deployment section above and set up billing alerts before deployment.
NOTE: For detailed instructions on deployment options for this guidance, running model inference and Agentic AI workflows, and uninstallation, please see the Detailed Installation Guide.
- Modularity: Each agent has specific responsibilities
- Scalability: Agents can be scaled independently (see the scaling sketch after this list)
- Reliability: Isolated failures don't affect the entire system
- Extensibility: Easy to add new capabilities
- Observability: Comprehensive monitoring and tracing via Strands SDK
- Standards Compliance: Uses MCP for tool integration and OpenTelemetry for tracing
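As a hedged illustration of independent agent scaling, assuming each agent runs as its own Kubernetes Deployment (the deployment names and namespace below are illustrative, not this guidance's actual manifests):

```bash
# Hedged sketch: scale one agent independently of the others.
# Deployment name and namespace are illustrative placeholders.
kubectl -n agents scale deployment rag-agent --replicas=3
kubectl -n agents get deployments
```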
- Single Codebase: No separate "enhanced" versions - all functionality is built into the standard agents
- Built-in Tracing: OpenTelemetry tracing is automatically enabled through Strands SDK
- Simplified Deployment: One main application with all features included
- Consistent API: All agents use the same tracing and configuration patterns
- Automatic Instrumentation: No manual trace management required
- Multiple Export Options: Console, OTLP, Jaeger, Langfuse support out of the box
- Environment-based Configuration: Easy setup through environment variables (see the sketch after this list)
- Clean Code Structure: Removed duplicate wrapper functions and complex manual tracing
- Async Warning Management: Clean test runner filters harmless async cleanup warnings
- Robust Error Handling: Fallback mechanisms ensure system reliability
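As a hedged sketch of environment-based tracing configuration: the OTEL_* variables are standard OpenTelemetry SDK settings and the LANGFUSE_* variables follow Langfuse's documented conventions, but the endpoints and service names below are illustrative, so verify them against your agent deployment manifests.

```bash
# Minimal sketch of environment-based tracing configuration.
# OTEL_* variables are standard OpenTelemetry settings; LANGFUSE_*
# variables follow Langfuse conventions. Endpoints are placeholders.
export OTEL_SERVICE_NAME="rag-orchestrator"
export OTEL_EXPORTER_OTLP_ENDPOINT="http://otel-collector:4318"   # hypothetical in-cluster collector
export LANGFUSE_HOST="http://langfuse.observability.svc:3000"     # hypothetical in-cluster Langfuse service
export LANGFUSE_PUBLIC_KEY="pk-lf-..."   # placeholder; use your project keys
export LANGFUSE_SECRET_KEY="sk-lf-..."   # placeholder
```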
Customers are responsible for making their own independent assessment of the information in this Guidance. This Guidance: (a) is for informational purposes only, (b) represents AWS current product offerings and practices, which are subject to change without notice, and (c) does not create any commitments or assurances from AWS and its affiliates, suppliers or licensors. AWS products or services are provided “as is” without warranties, representations, or conditions of any kind, whether express or implied. AWS responsibilities and liabilities to its customers are controlled by AWS agreements, and this Guidance is not part of, nor does it modify, any agreement between AWS and its customers.