⚠️ Disclaimer: This repository includes intentional fault injection and stress test scenarios designed to demonstrate the AWS DevOps Agent's investigation capabilities. These scripts deliberately introduce issues such as memory leaks, network partitions, database stress, and service latency. Do not run these scripts in production environments. They are intended for learning and demonstration purposes only.
📦 Source Code: The source code for the Retail Store Sample Application can be found at: https://github.com/aws-containers/retail-store-sample-app
A self-paced hands-on lab for learning ECS troubleshooting with AWS DevOps Agent
| Section | Description |
|---|---|
| Overview | Lab introduction and learning objectives |
| Application Architecture | Microservices and infrastructure components |
| Quick Start | Deploy the infrastructure |
| AWS DevOps Agent Setup | Configure the DevOps Agent |
| Troubleshooting Labs | 10 hands-on labs |
| Observability | CloudWatch monitoring setup |
| Cleanup | Destroy resources |
The lab scripts (inject/fix) require a Linux/macOS bash environment.
Windows Users: The fault injection scripts are shell scripts that will not run natively on Windows. You have two options:
- Recommended: Use AWS CloudShell - a browser-based shell with AWS CLI pre-installed
- Alternative: Use WSL2 (Windows Subsystem for Linux), Git Bash, or SSH into a Linux EC2 instance
Terraform commands can be run from any terminal (Windows PowerShell, CMD, or Linux/macOS).
If you're familiar with ECS and just want to get started:
git clone https://github.com/aws-samples/sample-devops-agent-ecs-workshop.git
cd sample-devops-agent-ecs-workshop/terraform/ecs/default
terraform init && terraform apply
This lab provides a production-ready Amazon ECS deployment environment for learning how to troubleshoot containerized applications using AWS DevOps Agent. You'll deploy a multi-service retail store application, inject real faults, and use the DevOps Agent to investigate and resolve issues.
| Lab Information | Details |
|---|---|
| Duration | 2-3 hours |
| Level | 300 (Advanced) |
| Target Audience | DevOps Engineers, SREs, Platform Engineers |
| Prerequisites | Basic AWS knowledge, familiarity with containers |
| Cost | ~$3-4/hour (remember to clean up!) |
This project is intended for educational purposes only and not for production use.
- Deploy a distributed microservices application to Amazon ECS using Terraform
- Configure AWS DevOps Agent to monitor your ECS infrastructure
- Execute chaos engineering experiments using fault injection
- Use DevOps Agent to investigate incidents and identify root causes
- Apply recommended mitigations to resolve issues
The lab deploys the AWS Retail Store Sample Application, a fully functional e-commerce application consisting of 5 microservices:
| Service | Language | Description | Backend |
|---|---|---|---|
| UI | Java (Spring Boot) | Store frontend, serves web pages | Calls other services |
| Catalog | Go | Product catalog API | RDS MariaDB |
| Cart | Java (Spring Boot) | Shopping cart management | DynamoDB |
| Checkout | Node.js (NestJS) | Checkout orchestration | ElastiCache Redis |
| Orders | Java (Spring Boot) | Order processing | RDS MariaDB + Amazon MQ |
Note: This lab uses pre-built container images from Amazon ECR. The application source code is available in the AWS Retail Store Sample App repository.
| Category | Components |
|---|---|
| Compute | ECS Cluster (Fargate), 5 ECS Services, Application Load Balancer |
| Data Stores | RDS MariaDB (Catalog, Orders), DynamoDB (Cart), ElastiCache Redis (Checkout), Amazon MQ (Orders) |
| Networking | VPC with public/private subnets, NAT Gateway, Security Groups, ECS Service Connect |
| Observability | CloudWatch Container Insights (Enhanced), CloudWatch Logs, Alarms, Dashboard |
All resources are tagged with ecsdevopsagent=true to enable AWS DevOps Agent discovery. This tag is applied to:
- ECS Cluster and Services
- RDS Database instances
- DynamoDB Tables
- ElastiCache clusters
- Application Load Balancer
- CloudWatch Log Groups
- IAM Roles
- Security Groups
- Git - Installation guide
- AWS CLI - Installed and configured with appropriate credentials (Installation guide)
- Terraform >= 1.0 - Installation guide
- Session Manager Plugin - Required for ECS Exec (Installation guide)
- jq - JSON processor for lab scripts (Installation guide)
- AWS Permissions - Administrator access recommended. The lab creates multiple AWS resources (ECS, RDS, DynamoDB, ElastiCache, Amazon MQ, VPC, IAM roles, etc.). Using limited permissions may result in deployment failures.
- Bash Shell (for lab scripts) - macOS/Linux terminal, AWS CloudShell, WSL2, or Git Bash on Windows
git clone https://github.com/aws-samples/sample-devops-agent-ecs-workshop.git
cd sample-devops-agent-ecs-workshop
# Navigate to Terraform directory
cd terraform/ecs/default
# Initialize Terraform
terraform init
# Preview changes (optional)
terraform plan
# Deploy (~15-20 minutes)
terraform apply
# Type 'yes' when prompted
After Terraform completes, it displays output values including the application URL:
Outputs:
ecs_cluster_name = "retail-store-ecs-cluster"
ui_service_url = "http://retail-xxxxx.us-east-1.elb.amazonaws.com"
Verify the application is running:
- Copy the ui_service_url from the Terraform output
- Open it in your browser - you should see the Retail Store home page
- Verify services in the ECS Console → Clusters → retail-store-ecs-cluster → Services
You should see all 5 services running with 1/1 tasks.
Optional: Verify via CLI (Linux/macOS/CloudShell only)
# Get application URL
APP_URL=$(terraform output -raw ui_service_url)
echo "Application URL: $APP_URL"
# Test the application
curl -I "$APP_URL"
# Verify all services are running
aws ecs describe-services \
--cluster $(terraform output -raw ecs_cluster_name) \
--services ui catalog carts checkout orders \
--query 'services[*].[serviceName,runningCount,desiredCount]' \
--output table
Open the APP_URL in your browser. You should see the Retail Store home page.
Test the application by walking through each part of the store:
- Home Page - Featured products and categories
- Catalog - Browse all products (powered by Catalog service)
- Cart - Add/remove items (powered by Carts service)
- Checkout - Complete your purchase (powered by Checkout service)
- Orders - Order confirmation (powered by Orders service)
AWS DevOps Agent is a frontier AI agent that helps accelerate incident response and improve system reliability. It investigates incidents and identifies operational improvements like an experienced DevOps engineer.
Note: AWS DevOps Agent is currently in public preview and available in US East (N. Virginia) (us-east-1). The agent can monitor applications deployed in any AWS region.
An Agent Space is a logical container that defines the tools and infrastructure that AWS DevOps Agent has access to. It represents the boundary of what the agent can access and investigate during incident response.
The agent uses a dual-console architecture:
- AWS Management Console - Administrators create and manage Agent Spaces, configure integrations, and set up access controls
- DevOps Agent Web App - Operations teams use this for day-to-day incident response, investigations, and viewing recommendations
- Navigate to the AWS DevOps Agent Console
- Click Begin setup (or Create Agent Space if you have existing spaces)
- Enter details:
  - Name: retail-store-ecs-lab
  - Description: Agent Space for ECS Troubleshooting Lab
- In Give this Agent Space AWS resource access, select Auto-create a new DevOps Agent role
- Review the permissions that will be granted to the role
- (Optional) Customize the role name if desired
Since this lab uses Terraform (not CloudFormation), you need to add a tag so the agent can discover your resources.
- In the Include AWS tags section, click Add tag
- Add tag: ecsdevopsagent=true
This tag enables the DevOps Agent to discover all lab resources including ECS cluster, services, RDS databases, DynamoDB tables, and related infrastructure.
- In Enabling the Agent Space Web App, select Auto-create a new AWS DevOps Agent role
- Review the permissions that will be granted
- Leave other settings as default
- Click Create
- Wait 1-2 minutes for the Agent Space to be created
- Click Admin access to open the Web App
- Navigate to DevOps Center to view the discovered topology
- Verify you can see the ECS cluster and services
You should see the following resources discovered:
- ECS Cluster: retail-store-ecs-cluster
- ECS Services: ui, catalog, carts, checkout, orders
- RDS Instances: catalog-db, orders-db
- DynamoDB Table: carts
- ElastiCache: checkout-redis
- Amazon MQ: RabbitMQ broker
Note: These commands require a bash shell (Linux/macOS/CloudShell)
# Verify ECS cluster tags
aws ecs describe-clusters --clusters retail-store-ecs-cluster \
--query 'clusters[0].tags' --output table
# List all resources with the ecsdevopsagent tag
aws resourcegroupstaggingapi get-resources \
--tag-filters Key=ecsdevopsagent,Values=true \
--query 'ResourceTagMappingList[].ResourceARN' --output table
From the DevOps Agent Web App:
- Click Start Investigation
- Enter a prompt describing what you want to investigate: Check the health of my ECS services in the retail-store-ecs-cluster
- Leave other options as default and click Start Investigating
- The agent will analyze your infrastructure and provide insights
| Mechanism | Description |
|---|---|
| Read-Only by Default | The agent only reads data; it does not modify resources |
| Scoped Access | Access is limited to resources within the Agent Space |
| Audit Logging | All agent actions are logged to CloudTrail |
| Human-in-the-Loop | Mitigation recommendations require human approval |
⚠️ Windows Users: The lab scripts require a bash shell environment. Use one of these options:
- AWS CloudShell (Recommended) - Browser-based, no setup required
- WSL2 (Windows Subsystem for Linux)
- Git Bash (comes with Git for Windows)
- SSH into a Linux EC2 instance
Before running lab scripts, ensure you have jq installed:
jq --version
The labs are organized into two categories:
These labs focus on common ECS misconfigurations that cause service failures:
| Lab | Issue | Service | Difficulty |
|---|---|---|---|
| Lab 1 | CloudWatch Logs Not Delivered | Catalog | Basic |
| Lab 2 | Unable to Pull Secrets | Orders | Basic |
| Lab 3 | Health Check Failures | UI | Basic |
| Lab 4 | Security Group Blocked (Database Connectivity) | Catalog → RDS | Intermediate |
| Lab 5 | Task Resource Limits (OOM) | Checkout | Intermediate |
| Lab 6 | Service Connect Communication Broken | UI → Catalog | Intermediate |
These labs inject real performance issues to simulate production incidents:
| Lab | Issue | Service | Difficulty |
|---|---|---|---|
| Lab 7 | CPU Stress | Catalog | Intermediate |
| Lab 8 | DDoS Attack Simulation | UI/ALB | Advanced |
| Lab 9 | DynamoDB Attack | Carts | Advanced |
| Lab 10 | Auto-Scaling Not Working | Catalog | Advanced |
Each lab follows a consistent pattern:
┌─────────────────────┐ ┌─────────────────────┐ ┌─────────────────────┐
│ 1. Inject Fault │────▶│ 2. Observe Symptoms│────▶│ 3. Start │
│ (run inject script)│ │ (check app/metrics)│ │ Investigation │
└─────────────────────┘ └─────────────────────┘ └──────────┬──────────┘
│
▼
┌─────────────────────┐ ┌─────────────────────┐ ┌─────────────────────┐
│ 6. Rollback Fault │◀────│ 5. Apply Fix │◀────│ 4. Agent Analyzes │
│ (run rollback │ │ (follow agent │ │ & Identifies Root │
│ script) │ │ recommendations) │ │ Cause │
└─────────────────────┘ └─────────────────────┘ └─────────────────────┘
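The six-step cycle can be wrapped in a small helper so that a fault is never left active by accident. This is a minimal sketch, assuming each lab directory holds an inject.sh plus either a fix.sh or a rollback.sh, as the lab descriptions indicate. The function names `cleanup_script` and `run_lab` are illustrative and are not part of the repository.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not shipped with the lab): inject a fault, pause for
# the investigation, then run whichever cleanup script the lab provides.

# Print the cleanup script for a lab directory: fix.sh if present,
# otherwise rollback.sh. Return 1 if the directory has neither.
cleanup_script() {
  if [ -f "$1/fix.sh" ]; then
    echo "$1/fix.sh"
  elif [ -f "$1/rollback.sh" ]; then
    echo "$1/rollback.sh"
  else
    return 1
  fi
}

run_lab() {
  local lab_dir cleanup reply
  lab_dir="$1"
  # Refuse to inject a fault that cannot be undone afterwards.
  cleanup="$(cleanup_script "$lab_dir")" || {
    echo "no fix.sh or rollback.sh in $lab_dir" >&2
    return 1
  }
  bash "$lab_dir/inject.sh" || return 1
  echo "Fault injected. Investigate in the DevOps Agent Web App."
  printf 'Press Enter to run %s ... ' "$cleanup"
  read -r reply   # steps 2-5 happen while the script waits here
  bash "$cleanup"
}

# Example: run_lab labs/lab1-logs-not-delivered
```

Checking for the cleanup script before injecting is a deliberate choice: the helper fails early instead of leaving a broken service behind.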
Scenario: The catalog service has stopped sending logs to CloudWatch. Without logs, you can't monitor the service's health or debug issues.
Inject:
./labs/lab1-logs-not-delivered/inject.sh
Symptoms:
- Catalog service tasks failing to start
- Service events showing ResourceInitializationError
- No new logs appearing in CloudWatch
Investigation Prompts:
Why is the catalog service failing to start new tasks?
Check the ECS service events for the catalog service
Root Cause: Task definition references a non-existent CloudWatch log group.
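To confirm this root cause by hand, the log group named in the task definition can be compared against the log groups that actually exist. The sketch below uses only `aws ecs describe-task-definition` and `aws logs describe-log-groups`. The function names are illustrative, and the task definition to pass in is whatever the catalog service currently references.

```shell
# Print the awslogs-group configured for each container of a task definition.
task_log_groups() {
  aws ecs describe-task-definition \
    --task-definition "$1" \
    --query 'taskDefinition.containerDefinitions[].logConfiguration.options."awslogs-group"' \
    --output text
}

# Succeed only if a log group with exactly this name exists.
log_group_exists() {
  local found
  found="$(aws logs describe-log-groups \
    --log-group-name-prefix "$1" \
    --query "logGroups[?logGroupName=='$1'].logGroupName" \
    --output text)"
  [ "$found" = "$1" ]
}

# Report each log group the task definition points at and whether it exists.
check_task_log_groups() {
  local group
  for group in $(task_log_groups "$1"); do
    if log_group_exists "$group"; then
      echo "OK $group"
    else
      echo "MISSING $group"
    fi
  done
}

# Example: check_task_log_groups <task-definition-arn-or-family>
```

The task definition attached to the service can be read with `aws ecs describe-services --cluster retail-store-ecs-cluster --services catalog --query 'services[0].taskDefinition' --output text`.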
Fix:
./labs/lab1-logs-not-delivered/fix.sh
Scenario: The orders service can't start because it can't retrieve database credentials from Secrets Manager.
Inject:
./labs/lab2-secrets-access-denied/inject.sh
Symptoms:
- Orders service tasks fail to start
- Error: "unable to pull secrets or registry auth"
- Customers cannot place orders
Investigation Prompts:
Why is the orders service failing to start?
What IAM permissions does the orders service task execution role have?
Root Cause: Task execution role is missing secretsmanager:GetSecretValue permission.
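The missing permission can also be checked directly with the IAM policy simulator. This sketch assumes you pass the orders task definition and, ideally, the ARN of the secret it reads; both are placeholders here, and the function names are illustrative. Without a secret ARN the simulation runs against `*`, which can report a deny for a policy that is scoped to one specific secret.

```shell
# Print the execution role ARN of a task definition.
execution_role_arn() {
  aws ecs describe-task-definition \
    --task-definition "$1" \
    --query 'taskDefinition.executionRoleArn' \
    --output text
}

# Ask IAM whether a role may call secretsmanager:GetSecretValue.
# $1 = role ARN, $2 = secret ARN (optional, defaults to "*").
# Prints the decision: allowed, implicitDeny, or explicitDeny.
can_read_secret() {
  aws iam simulate-principal-policy \
    --policy-source-arn "$1" \
    --action-names secretsmanager:GetSecretValue \
    --resource-arns "${2:-*}" \
    --query 'EvaluationResults[0].EvalDecision' \
    --output text
}

# Example:
#   role="$(execution_role_arn <orders-task-definition>)"
#   can_read_secret "$role" <secret-arn>
```

While the fault is active the decision should be `implicitDeny`; after the fix script runs it should read `allowed`.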
Fix:
./labs/lab2-secrets-access-denied/fix.sh
Scenario: The UI service tasks keep restarting every few minutes. Customers see intermittent 503 errors.
Inject:
./labs/lab3-health-check-failures/inject.sh
Symptoms:
- Tasks continuously restart
- Service never stabilizes
- Service events show "unhealthy" messages
Investigation Prompts:
Why does the UI service keep restarting tasks?
What health check configuration is the UI service using?
Root Cause: Health check path is misconfigured (/wrong-health-endpoint instead of /actuator/health).
Fix:
./labs/lab3-health-check-failures/fix.sh
Scenario: The product catalog stopped loading. The catalog service is running but returns errors when fetching products. Database connection timeouts appear in the logs.
Inject:
./labs/lab4-security-group-blocked/inject.sh
Symptoms:
- Catalog returns errors
- Service is running and healthy
- Database connection timeouts in logs
- RDS appears healthy
Investigation Prompts:
The catalog service can't connect to the database. What's wrong?
What security groups are attached to the catalog service and the RDS database?
Root Cause: RDS security group is missing ingress rule allowing traffic from catalog service on port 3306.
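A manual check of this root cause comes down to one question: does the RDS security group have an ingress rule that names the catalog service's security group and covers port 3306? The sketch below answers it with `aws ec2 describe-security-groups`. The security group IDs are placeholders you look up yourself, and the function names are illustrative.

```shell
# Print "protocol fromPort toPort" for every ingress rule on security group $1
# that allows traffic from source security group $2.
rules_from_source() {
  aws ec2 describe-security-groups \
    --group-ids "$1" \
    --query "SecurityGroups[0].IpPermissions[?UserIdGroupPairs[?GroupId=='$2']].[IpProtocol,FromPort,ToPort]" \
    --output text
}

# Read those rows on stdin; succeed if any of them covers TCP port $1.
covers_port() {
  awk -v port="$1" '
    $1 == "-1" { ok = 1 }                               # all protocols, all ports
    $1 == "tcp" && $2 <= port && port <= $3 { ok = 1 }  # TCP range includes port
    END { exit ok ? 0 : 1 }'
}

# Example:
#   if rules_from_source <rds-sg-id> <catalog-sg-id> | covers_port 3306; then
#     echo "3306 open from catalog"
#   else
#     echo "3306 blocked"
#   fi
```

The check only looks at rules that reference a source security group; rules that allow a CIDR range are ignored in this sketch.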
Fix:
./labs/lab4-security-group-blocked/fix.sh
Scenario: The checkout service is crashing repeatedly. Tasks start but crash within seconds due to memory exhaustion.
Inject:
./labs/lab5-task-resource-limits/inject.sh
Symptoms:
- Tasks crash shortly after starting
- Container shows OutOfMemoryError: Container killed due to memory usage
- Checkout unavailable - customers cannot complete purchases
- Rapid task cycling as ECS keeps trying to start new tasks
Investigation Prompts:
Why is the checkout service crashing? The tasks keep restarting.
What is the exit code for the stopped checkout tasks? Is it an OOM kill?
Show me the memory configuration for the checkout service task definition
Root Cause: A memory-stress sidecar container is consuming more memory than the task limit allows, causing OOM kills.
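The exit code the second prompt asks about can be read from the stopped tasks themselves. This sketch lists recently stopped tasks for a service and prints each container's name, exit code, and stop reason using `aws ecs list-tasks` and `aws ecs describe-tasks`. The function names are illustrative. An exit code of 137 means the process received SIGKILL (128 + 9), which is what an out-of-memory kill produces, although other causes of SIGKILL give the same code.

```shell
# Print "name exitCode reason" for containers of recently stopped tasks.
# $1 = cluster name, $2 = service name.
stopped_containers() {
  local tasks
  tasks="$(aws ecs list-tasks --cluster "$1" --service-name "$2" \
    --desired-status STOPPED --query 'taskArns' --output text)"
  # Nothing stopped recently: print nothing.
  if [ -z "$tasks" ] || [ "$tasks" = "None" ]; then
    return 0
  fi
  # $tasks is left unquoted on purpose so each ARN becomes its own argument.
  aws ecs describe-tasks --cluster "$1" --tasks $tasks \
    --query 'tasks[].containers[].[name,exitCode,reason]' --output text
}

# 137 = 128 + 9 (SIGKILL); consistent with an OOM kill, but not proof of one.
is_oom_exit() {
  [ "$1" = "137" ]
}

# Example: stopped_containers retail-store-ecs-cluster checkout
```

Pair the exit code with the reason column: the lab's symptom text, `OutOfMemoryError: Container killed due to memory usage`, is what distinguishes an OOM kill from any other SIGKILL.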
Fix:
./labs/lab5-task-resource-limits/fix.sh
Scenario: The UI loads but the product catalog is empty. The catalog service appears healthy but the UI can't communicate with it.
Inject:
./labs/lab6-service-connect-broken/inject.sh
Symptoms:
- UI loads but catalog is empty
- Catalog service is healthy
- UI logs show connection errors
Investigation Prompts:
The product catalog is empty but the catalog service looks healthy. What's wrong?
How does the UI service connect to the catalog service?
Root Cause: UI service environment variable points to wrong endpoint (http://catalog-broken instead of http://catalog).
Fix:
./labs/lab6-service-connect-broken/fix.sh
Scenario: Users report the product catalog is loading slowly. Page load times increased from under 1 second to 5-10 seconds.
Inject:
./labs/lab7-cpu-stress/inject.sh
Symptoms:
- Slow response times
- High CPU in Container Insights
- Service is running but slow
Investigation Prompts:
The catalog service is slow. Is there high CPU utilization?
Show me the CPU metrics for the catalog service from Container Insights
Root Cause: stress-ng process consuming CPU inside the container.
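The CPU spike can be confirmed from the command line as well. This sketch reads the service-level CPUUtilization metric from the standard AWS/ECS namespace rather than from Container Insights, and compares the peak against the 80% threshold the lab's alarms use. The function names and the 15-minute default window are assumptions made for illustration.

```shell
# Print the peak CPUUtilization (percent) for a service over the last N minutes.
# $1 = cluster name, $2 = service name, $3 = minutes (default 15).
service_cpu_max() {
  local end start
  end="$(date -u +%s)"
  start=$((end - ${3:-15} * 60))
  # The AWS CLI accepts UNIX epoch seconds for timestamp parameters.
  aws cloudwatch get-metric-statistics \
    --namespace AWS/ECS --metric-name CPUUtilization \
    --dimensions "Name=ClusterName,Value=$1" "Name=ServiceName,Value=$2" \
    --start-time "$start" --end-time "$end" \
    --period 60 --statistics Maximum \
    --query 'max(Datapoints[].Maximum)' --output text
}

# Succeed if value $1 is numerically greater than threshold $2.
# Adding 0 forces a numeric comparison, so "None" counts as 0.
above_threshold() {
  awk -v v="$1" -v t="$2" 'BEGIN { exit (v + 0 > t + 0) ? 0 : 1 }'
}

# Example:
#   peak="$(service_cpu_max retail-store-ecs-cluster catalog)"
#   above_threshold "$peak" 80 && echo "CPU is high: $peak%"
```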
Rollback:
./labs/lab7-cpu-stress/rollback.sh
# Or wait 5 minutes for auto-rollback
Scenario: The retail application is under attack! Users are reporting extremely slow page loads and timeouts. ALB metrics show a massive spike in request count - far beyond normal traffic levels.
Inject:
./labs/lab8-ddos-simulation/inject.sh
Symptoms:
- Slow page loads and timeouts
- ALB RequestCount through the roof (~300 req/s attack traffic)
- 5XX errors increasing
- Rogue ECS tasks running: http-flood-attack
Investigation Prompts:
The retail app is extremely slow. Users are complaining about timeouts. What's happening?
We're seeing a massive traffic spike on the ALB. Is this a DDoS attack?
Root Cause: Rogue ECS tasks flooding the ALB with HTTP requests using curl and GNU parallel.
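One way to spot rogue tasks by hand is to list every running task with its task group. Tasks launched by an ECS service carry a group of the form `service:<name>`, while tasks started directly with RunTask default to `family:<task-family>`. The sketch below assumes the attack tasks were started outside any service, which the lab text suggests but does not state; the function names are illustrative.

```shell
# Print "group taskArn" for every running task in cluster $1.
running_task_groups() {
  local tasks
  tasks="$(aws ecs list-tasks --cluster "$1" --desired-status RUNNING \
    --query 'taskArns' --output text)"
  if [ -z "$tasks" ] || [ "$tasks" = "None" ]; then
    return 0
  fi
  # $tasks is left unquoted on purpose so each ARN becomes its own argument.
  aws ecs describe-tasks --cluster "$1" --tasks $tasks \
    --query 'tasks[].[group,taskArn]' --output text
}

# Keep only tasks that were not launched by an ECS service.
standalone_only() {
  awk '$1 !~ /^service:/'
}

# Example: running_task_groups retail-store-ecs-cluster | standalone_only
```

In a cluster that should run only the five lab services, any line this prints deserves a closer look.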
Rollback:
./labs/lab8-ddos-simulation/fix.sh
Scenario: The shopping cart service is completely broken. Users cannot add items to cart - all operations are failing with throttling errors. CloudWatch shows massive spikes in DynamoDB ThrottledRequests. This looks like a DDoS attack on the database!
Inject:
./labs/lab9-dynamodb-attack/inject.sh
Symptoms:
- Cart operations failing with throttling errors
- Massive ThrottledRequests spike in CloudWatch
- Rogue ECS tasks running: dynamodb-stress-attack
- Service returning 500 errors
Investigation Prompts:
The carts service is completely broken. Users can't add items to cart. Check DynamoDB for issues.
DynamoDB is being throttled heavily. What's consuming all the read capacity?
Are there any suspicious ECS tasks running that might be attacking DynamoDB?
Root Cause: Rogue ECS tasks flooding DynamoDB with scan requests. Table switched to low provisioned capacity (5 RCU) which is easily overwhelmed.
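The capacity half of this root cause is visible in the table description. This sketch prints the billing mode and provisioned capacity with `aws dynamodb describe-table` and turns them into a one-line summary. The table name `carts` is taken from the discovered-resources list earlier in this README, and the function names are illustrative.

```shell
# Print "billingMode readCapacityUnits writeCapacityUnits" for table $1.
# billingMode may print as None for tables that were created as provisioned.
table_capacity() {
  aws dynamodb describe-table --table-name "$1" \
    --query 'Table.[BillingModeSummary.BillingMode, ProvisionedThroughput.ReadCapacityUnits, ProvisionedThroughput.WriteCapacityUnits]' \
    --output text
}

# Summarize a row from table_capacity. $1 = billing mode, $2 = RCU.
explain_capacity() {
  if [ "$1" = "PAY_PER_REQUEST" ]; then
    echo "on-demand"
  else
    echo "provisioned, $2 RCU"
  fi
}

# Example:
#   set -- $(table_capacity carts)
#   explain_capacity "$1" "$2"
```

During the lab the summary should report provisioned capacity at 5 RCU, matching the root cause above.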
Rollback:
./labs/lab9-dynamodb-attack/fix.sh
Scenario: The catalog service is experiencing high CPU load during a traffic spike. Auto-scaling should kick in to add more tasks, but the service isn't scaling. Users are complaining about slow response times.
Inject:
./labs/lab10-autoscaling-broken/inject.sh
Symptoms:
- High CPU utilization visible in CloudWatch metrics
- CloudWatch alarm in ALARM state
- Service does NOT scale out (stays at current task count)
- Application becomes slow/unresponsive
Investigation Prompts:
Why isn't my ECS service scaling even though CPU is high?
Check the auto-scaling configuration for the catalog service
Show me the CloudWatch alarms for the catalog service. Are the alarm actions enabled?
Root Cause: CloudWatch alarm actions are disabled, so even though the alarm fires, it doesn't trigger the scaling policy.
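Disabled alarm actions show up in the ActionsEnabled field of each alarm. This sketch lists alarms by name prefix with `aws cloudwatch describe-alarms` and keeps the ones that are in ALARM state while their actions are switched off. The alarm name prefix is a placeholder because the lab's alarm names are not given here, and the function names are illustrative.

```shell
# Print "AlarmName StateValue ActionsEnabled" (tab-separated) for alarms
# whose name starts with prefix $1.
alarm_actions() {
  aws cloudwatch describe-alarms --alarm-name-prefix "$1" \
    --query 'MetricAlarms[].[AlarmName,StateValue,ActionsEnabled]' \
    --output text
}

# Keep only alarms that are firing while their actions are disabled.
# The CLI's text output renders booleans as True/False.
silenced_alarms() {
  awk -F'\t' '$2 == "ALARM" && $3 == "False" { print $1 }'
}

# Example: alarm_actions <alarm-name-prefix> | silenced_alarms
```

An alarm found this way can be re-armed with `aws cloudwatch enable-alarm-actions --alarm-names <alarm-name>`; the lab's fix script presumably does the equivalent.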
Fix:
./labs/lab10-autoscaling-broken/fix.sh
The labs/ directory contains all lab scripts organized by lab number:
| Lab | Inject Script | Fix Script | Target | Duration |
|---|---|---|---|---|
| Lab 7 | labs/lab7-cpu-stress/inject.sh | labs/lab7-cpu-stress/fix.sh | catalog | Until fixed |
| Lab 8 | labs/lab8-ddos-simulation/inject.sh | labs/lab8-ddos-simulation/fix.sh | ui/ALB | Until fixed |
| Lab 9 | labs/lab9-dynamodb-attack/inject.sh | labs/lab9-dynamodb-attack/fix.sh | carts | Until fixed |
| Lab 10 | labs/lab10-autoscaling-broken/inject.sh | labs/lab10-autoscaling-broken/fix.sh | catalog | Until fixed |
| Variable | Default | Description |
|---|---|---|
| CLUSTER_NAME | retail-store-ecs-cluster | ECS cluster name |
| SERVICE_NAME | varies | Target ECS service |
| AWS_REGION | us-east-1 | AWS region |
| STRESS_DURATION | 300 | Duration in seconds |
| CPU_WORKERS | 2 | Number of CPU stress workers |
| MEMORY_PERCENT | 80 | Target memory percentage |
| LATENCY_MS | 500 | Network latency in milliseconds |
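Variables like these are typically read with shell default expansion, so a value set in the environment wins and the documented default applies otherwise. The sketch below shows that pattern for four of the variables. It illustrates the mechanism only; the lab scripts may resolve their settings differently, and `show_config` is a hypothetical name.

```shell
# Print the effective settings: the environment value if set and non-empty,
# otherwise the default from the table above.
show_config() {
  echo "cluster=${CLUSTER_NAME:-retail-store-ecs-cluster}" \
       "region=${AWS_REGION:-us-east-1}" \
       "duration=${STRESS_DURATION:-300}s" \
       "workers=${CPU_WORKERS:-2}"
}

# Example: override two values for a single command only.
#   STRESS_DURATION=120 CPU_WORKERS=4 ./labs/lab7-cpu-stress/inject.sh
```

Prefixing the command with the assignments keeps the override scoped to that one run, so later labs fall back to the defaults.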
The deployment includes production-grade observability to enable effective troubleshooting with AWS DevOps Agent:
Container Insights is enabled in enhanced mode, providing:
- CPU and memory utilization per service and task
- Network I/O metrics for traffic analysis
- Running task counts for availability monitoring
- Performance metrics at container level
All ECS tasks send application logs to CloudWatch Logs:
- Each service has its own log stream for easy isolation
- Configurable retention (default: 30 days)
- ECS Exec session logging for audit trails
- Optional KMS encryption
When cloudwatch_alarms_enabled = true (default), pre-configured alarms monitor:
- CPU utilization > 80% per service
- Memory utilization > 80% per service
- Running task count < 1 (service down)
- ALB 5XX errors spike detection
- ALB latency p95 > 2 seconds
A unified dashboard displays service health, resource utilization, ALB metrics, and error rates:
terraform output cloudwatch_dashboard_url
This observability stack provides AWS DevOps Agent with the data it needs to correlate symptoms, identify root causes, and recommend mitigations during incidents.
Important: Remember to destroy all resources to avoid ongoing charges!
The destroy script handles all dependencies automatically, ensuring a clean one-shot destruction:
./scripts/destroy.sh
This script will:
- Scale down and delete all ECS services
- Delete Load Balancers
- Delete VPC Endpoints (common blocker for subnet deletion)
- Delete NAT Gateways
- Clean up orphaned network interfaces
- Remove any terraform state locks
- Run terraform destroy
If you prefer manual control:
If you have any active lab faults, restore them first:
# Run fix scripts for any active labs
./labs/lab1-logs-not-delivered/fix.sh
./labs/lab2-secrets-access-denied/fix.sh
# ... etc
cd terraform/ecs/default
terraform destroy
# Type 'yes' when prompted
Destruction takes ~10-15 minutes.
If terraform destroy fails with DependencyViolation errors on subnets, there are likely resources still using them:
# Find what's blocking subnet deletion
aws ec2 describe-network-interfaces \
--filters "Name=subnet-id,Values=<subnet-id>" \
--query "NetworkInterfaces[*].{ID:NetworkInterfaceId,Type:InterfaceType,Description:Description}"
# Common blockers are VPC Endpoints - delete them first
aws ec2 describe-vpc-endpoints --filters "Name=vpc-id,Values=<vpc-id>" --query 'VpcEndpoints[*].VpcEndpointId'
aws ec2 delete-vpc-endpoints --vpc-endpoint-ids <endpoint-id>
# Then retry terraform destroy
terraform destroy
If you get a state lock error:
# Force unlock (use the lock ID from the error message)
terraform force-unlock <lock-id>
# Or remove the lock file for local state
rm -f .terraform.tfstate.lock.info
- Navigate to AWS DevOps Agent Console
- Select your Agent Space
- Click Delete and confirm
- AWS DevOps Agent Documentation
- Amazon ECS Documentation
- CloudWatch Container Insights
- ECS Exec Documentation
See CONTRIBUTING.md for guidelines.
This project is licensed under the MIT-0 License. See LICENSE for details.




