- Overview
- Architecture
- Architecture Steps
- Plan Your Deployment
- Quick Start Guide
- Important Notes
- Notices
This solution implements a comprehensive, scalable ML inference architecture using Amazon EKS, leveraging both Graviton processors for cost-effective CPU-based inference and GPU instances for accelerated inference. The system provides a complete end-to-end platform for deploying large language models with agentic AI capabilities, including RAG (Retrieval Augmented Generation) and intelligent document processing.
Figure 1. Reference Architecture for the EKS cluster with add-ons
Figure 2. Reference Architecture for Gen AI applications deployed on the EKS cluster
The architecture diagrams illustrate our scalable ML inference solution with the following components:
- Amazon EKS Cluster: The foundation of our architecture, providing a managed Kubernetes environment with automated provisioning and configuration.
- Karpenter Auto-scaling: Dynamically provisions and scales compute resources based on workload demands across multiple node pools (see the NodePool sketch after this list).
- Node Pools:
  - Graviton-based nodes (ARM64): Cost-effective CPU inference using m8g/c8g instances
  - GPU-based nodes (x86_64): High-performance inference using NVIDIA GPU instances (g5, g6 families)
  - x86-based nodes: General-purpose compute for compatibility requirements
- Model Hosting Services:
  - Ray Serve: Distributed model serving with automatic scaling
  - Standalone Services: Direct model deployment for specific use cases
  - Multi-modal Support: Text, vision, and reasoning model capabilities
- Model Gateway:
  - LiteLLM Proxy: Unified OpenAI-compatible API gateway with load balancing and routing
  - Ingress Controller: External access management with SSL termination
- Agentic AI Applications:
  - RAG with OpenSearch: Intelligent document retrieval and question answering
  - Intelligent Document Processing (IDP): Automated document analysis and extraction
  - Multi-Agent Systems: Coordinated AI workflows with specialized agents
- Observability & Monitoring:
  - Langfuse: LLM observability and performance tracking
  - Prometheus & Grafana: Infrastructure monitoring and alerting
This architecture provides the flexibility to choose between cost-optimized CPU inference on Graviton processors and high-throughput GPU inference, based on your specific requirements, while maintaining elastic scalability through Kubernetes and Karpenter.
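To make the node-pool concept concrete, the following is a minimal sketch of a Graviton-oriented Karpenter NodePool. It assumes Karpenter v1 CRDs are installed and that an EC2NodeClass named `default` already exists; the pool name, instance families, and CPU limit are illustrative, not the configuration shipped with this guidance.

```bash
# Minimal sketch of a Graviton (arm64) NodePool for CPU inference.
# Assumes Karpenter v1 CRDs and an existing EC2NodeClass named "default";
# all names and limits here are illustrative placeholders.
cat <<'EOF' | kubectl apply -f -
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: graviton-inference
spec:
  template:
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
      requirements:
        - key: kubernetes.io/arch
          operator: In
          values: ["arm64"]
        - key: karpenter.k8s.aws/instance-family
          operator: In
          values: ["m8g", "c8g"]
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]
  limits:
    cpu: "64"
EOF
```

A GPU pool would follow the same pattern, with `kubernetes.io/arch` set to `amd64` and the g5/g6 instance families.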
- Foundation Setup: The foundation is an Amazon EKS cluster configured for application readiness, with its compute plane managed by Karpenter. This setup efficiently provisions both AWS Graviton and GPU-based instances across multiple Availability Zones (AZs), ensuring robust infrastructure distribution and high availability of the various Kubernetes services.
- User Request Entry: User interaction starts with HTTP requests directed to an exposed endpoint managed by Elastic Load Balancing (ELB). This entry point serves as the gateway for all user queries and ensures balanced distribution of incoming traffic while maintaining consistent accessibility.
- Orchestration and Analysis: The orchestrator agent, powered by the Strands Agent SDK, serves as the central coordination hub. It processes incoming queries by connecting to reasoning models served through LiteLLM and vLLM, analyzing each request, and determining the appropriate workflow and tools needed for response generation. LiteLLM functions as a unified API gateway for model management, providing centralized security controls and standardized access to both embedding and reasoning models (see the request sketch after these steps).
- Knowledge Validation: Knowledge validation begins as the orchestrator agent delegates to the RAG agent to verify that the knowledge base is current. When updates are needed, the RAG agent embeds the new information into the Amazon OpenSearch cluster, ensuring the knowledge base remains current and comprehensive.
- Embedding Process: The embedding process is handled within a KubeRay cluster, where the Ray head node dynamically manages worker node scaling based on resource demands. These worker nodes execute the embedding process through the llama.cpp framework, while the RAG agent simultaneously embeds user questions and searches for relevant information within the OpenSearch cluster.
- Quality Assurance: Quality assurance is performed by the Evaluation Agent, which leverages models hosted in Amazon Bedrock. This agent implements a feedback-based correction loop using RAGAS metrics to assess response quality and provides relevancy scores to the orchestrator agent for decision-making purposes.
- Web Search Fallback: When the RAG agent's responses receive low relevancy scores, the orchestrator agent initiates a web search process. It retrieves the Tavily web search API tool from the web search MCP server and performs dynamic queries to obtain supplementary or corrective information.
- Final Response Generation: The final response generation occurs on GPU instances running vLLM. The reasoning model processes the aggregated information, including both knowledge base data and web search results when applicable, refining and rephrasing the content to create a coherent and accurate response for the user.
- Comprehensive Tracking: Throughout this entire process, the Strands Agent SDK maintains comprehensive interaction tracking. All agent activities and communications are automatically traced, with the resulting data visualized through the Langfuse service, providing complete transparency and monitoring of the system's operations.
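To make the gateway step concrete, the hedged sketch below sends OpenAI-compatible requests through the LiteLLM proxy: a chat completion against a reasoning model and an embedding request for RAG indexing. The endpoint hostname, API key, and model aliases are placeholders; substitute whatever your own deployment registers in LiteLLM.

```bash
# Hedged sketch of calling the LiteLLM gateway through the ELB endpoint.
# The hostname, API key, and model aliases are placeholders, not values
# shipped with this guidance.
GATEWAY="http://<your-elb-dns-name>/v1"   # placeholder gateway endpoint
API_KEY="sk-..."                          # placeholder LiteLLM key

# Chat completion against a reasoning model served by vLLM behind LiteLLM
curl -s "$GATEWAY/chat/completions" \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3-14b",
    "messages": [{"role": "user", "content": "Summarize this admission note."}]
  }'

# Embedding request for RAG indexing against the embedding model
curl -s "$GATEWAY/embeddings" \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model": "snowflake-arctic-embed-s", "input": "patient presents with chest pain"}'
```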
You are responsible for the cost of the AWS services used while running this guidance. As of September 2025, the cost of running this guidance with the default settings in the US East (N. Virginia) Region (us-east-1) is approximately $447.47 per month.

We recommend creating a budget through AWS Cost Explorer to help manage costs; a sample AWS CLI command follows the cost table below. Prices are subject to change. For full details, refer to the pricing webpage for each AWS service used in this guidance.

The following table provides a sample cost breakdown for deploying this guidance with the default parameters in the us-east-1 (N. Virginia) Region for one month, based on the AWS Pricing Calculator output for the full deployment.
| AWS service | Dimensions | Cost per month [USD] |
|---|---|---|
| Amazon EKS | 1 cluster | $73.00 |
| Amazon VPC | 3 NAT Gateways | $98.67 |
| Amazon EC2 | 2 m6g.large instances | $112.42 |
| Amazon Managed Service for Prometheus (AMP) | Metric samples, storage, and queries | $100.60 |
| Amazon Managed Grafana (AMG) | Metrics visualization - Editor and Viewer users | $14.00 |
| Amazon EBS | gp2 storage volumes and snapshots | $17.97 |
| Application Load Balancer | 1 ALB for workloads | $16.66 |
| Amazon VPC | Public IP addresses | $3.65 |
| AWS Key Management Service (KMS) | Keys and requests | $7.00 |
| Amazon CloudWatch | Metrics | $3.00 |
| Amazon ECR | Data storage | $0.50 |
| TOTAL | | $447.47 |
For a more accurate estimate based on your specific configuration and usage patterns, we recommend using the AWS Pricing Calculator.
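As a hedged sketch of the budget recommendation above, the AWS Budgets CLI can create a monthly cost budget with an alert. The budget name, $450 limit, and email address below are illustrative placeholders.

```bash
# Minimal sketch: a monthly cost budget with an 80% actual-spend alert.
# The budget name, limit, and subscriber email are illustrative.
ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)
aws budgets create-budget \
  --account-id "$ACCOUNT_ID" \
  --budget '{"BudgetName":"eks-ml-inference-guidance","BudgetLimit":{"Amount":"450","Unit":"USD"},"TimeUnit":"MONTHLY","BudgetType":"COST"}' \
  --notifications-with-subscribers '[{"Notification":{"NotificationType":"ACTUAL","ComparisonOperator":"GREATER_THAN","Threshold":80,"ThresholdType":"PERCENTAGE"},"Subscribers":[{"SubscriptionType":"EMAIL","Address":"you@example.com"}]}]'
```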
This sample code utilizes various third-party packages, modules, models, and datasets, including but not limited to:
- Strands Agent SDK
- Qwen3-14B
- snowflake-arctic-embed-s model
- LiteLLM
- Langfuse
- Medical Text for Text Classification public dataset
Important Notice:
- Amazon Web Services (AWS) is not affiliated with these third-party entities or their components.
- The maintenance, updates, and security of these third-party dependencies are the sole responsibility of the customer/user.
- Users should regularly review and update these dependencies to ensure security and compatibility.
- Users are responsible for compliance with all applicable licenses and terms of use for these third-party components.
Please review and comply with all relevant licenses and terms of service for each third-party component before using them in your applications.
Before proceeding with this solution, ensure you have:
- AWS CLI configured with appropriate permissions for EKS, ECR, CloudFormation, and other AWS services
- kubectl installed and configured to access your target AWS region
- Docker installed and running (required for building and pushing container images)
- Sufficient AWS service quotas - This solution requires multiple EC2 instances, EKS clusters, and other AWS resources
- Valid Hugging Face token - Required for accessing models (see instructions below)
- Tavily API key - Required for web search functionality in agentic applications
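Both credentials are typically supplied as environment variables before deployment. The variable names below (HF_TOKEN, TAVILY_API_KEY) are common conventions, not necessarily the exact names this guidance's scripts read, so confirm them against the deployment scripts.

```bash
# Hedged sketch: export the required credentials before deploying.
# HF_TOKEN and TAVILY_API_KEY are common variable names; confirm the
# exact names expected by this guidance's deployment scripts.
export HF_TOKEN="hf_..."            # Hugging Face access token (placeholder)
export TAVILY_API_KEY="tvly-..."    # Tavily web search API key (placeholder)
```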
Recommended Setup Verification:
```bash
# Verify AWS CLI access
aws sts get-caller-identity

# Verify kubectl installation
kubectl version --client

# Verify Docker is running
docker ps

# Check available AWS regions and the On-Demand Standard instance vCPU quota
aws ec2 describe-regions
aws service-quotas get-service-quota --service-code ec2 --quota-code L-1216C47A
```
Cost Awareness: This solution will incur AWS charges. Review the cost breakdown in the Plan Your Deployment section above and set up billing alerts before deployment.
NOTE: For detailed instructions on deployment options for this guidance, running model inference and Agentic AI workflows, and uninstallation, please see the Detailed Installation Guide.
- Modularity: Each agent has specific responsibilities
- Scalability: Agents can be scaled independently (see the scaling sketch after this list)
- Reliability: Isolated failures don't affect the entire system
- Extensibility: Easy to add new capabilities
- Observability: Comprehensive monitoring and tracing via Strands SDK
- Standards Compliance: Uses MCP for tool integration and OpenTelemetry for tracing
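As a hedged illustration of independent agent scaling, assuming each agent runs as its own Kubernetes Deployment (the deployment names and namespace below are illustrative, not this guidance's actual manifests):

```bash
# Hedged sketch: scale one agent independently of the others.
# Deployment name and namespace are illustrative placeholders.
kubectl -n agents scale deployment rag-agent --replicas=3
kubectl -n agents get deployments
```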
- Single Codebase: No separate "enhanced" versions - all functionality is built into the standard agents
- Built-in Tracing: OpenTelemetry tracing is automatically enabled through Strands SDK
- Simplified Deployment: One main application with all features included
- Consistent API: All agents use the same tracing and configuration patterns
- Automatic Instrumentation: No manual trace management required
- Multiple Export Options: Console, OTLP, Jaeger, Langfuse support out of the box
- Environment-based Configuration: Easy setup through environment variables (see the sketch after this list)
- Clean Code Structure: Removed duplicate wrapper functions and complex manual tracing
- Async Warning Management: Clean test runner filters harmless async cleanup warnings
- Robust Error Handling: Fallback mechanisms ensure system reliability
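As a hedged sketch of environment-based tracing configuration: the OTEL_* variables are standard OpenTelemetry SDK settings and the LANGFUSE_* variables follow Langfuse's documented conventions, but the endpoints and service names below are illustrative, so verify them against your agent deployment manifests.

```bash
# Minimal sketch of environment-based tracing configuration.
# OTEL_* variables are standard OpenTelemetry settings; LANGFUSE_*
# variables follow Langfuse conventions. Endpoints are placeholders.
export OTEL_SERVICE_NAME="rag-orchestrator"
export OTEL_EXPORTER_OTLP_ENDPOINT="http://otel-collector:4318"   # hypothetical in-cluster collector
export LANGFUSE_HOST="http://langfuse.observability.svc:3000"     # hypothetical in-cluster Langfuse service
export LANGFUSE_PUBLIC_KEY="pk-lf-..."   # placeholder; use your project keys
export LANGFUSE_SECRET_KEY="sk-lf-..."   # placeholder
```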
Customers are responsible for making their own independent assessment of the information in this Guidance. This Guidance: (a) is for informational purposes only, (b) represents AWS current product offerings and practices, which are subject to change without notice, and (c) does not create any commitments or assurances from AWS and its affiliates, suppliers or licensors. AWS products or services are provided “as is” without warranties, representations, or conditions of any kind, whether express or implied. AWS responsibilities and liabilities to its customers are controlled by AWS agreements, and this Guidance is not part of, nor does it modify, any agreement between AWS and its customers.