
While perusing Stackdriver’s blog, I happenstanced across this little nugget about common AWS problems, and it immediately solidified the “Chickens vs Cattle” metaphor for me.
Read these, and tell me how your legacy enterprise crapplication would cope if this happened to it!! I particularly love the ephemeral disk corruption one. The answer? SHOOT THE CHICKEN, GET ANOTHER!
Git yer cloud game on.
Availability
Ephemeral Disk Corruption
When you run an instance with ephemeral disks, those disks can become corrupt. The filesystem then becomes read-only, and you will be unable to write to disk. Messages related to filesystem corruption will also show up in dmesg.
- Solution: Relaunch the Instance
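If you want to spot the read-only symptom before your app falls over, here is a minimal sketch in Python. It assumes a Linux instance where /proc/mounts is readable; the function names are made up for illustration. Keep in mind that some mounts are read-only on purpose, so treat a hit as a hint and not a verdict.

```python
def read_only_mounts(mounts_text):
    """Return mount points whose options include 'ro'.

    Expects text in /proc/mounts format:
    device mount_point fs_type options dump pass
    """
    result = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        mount_point, options = fields[1], fields[3]
        if "ro" in options.split(","):
            result.append(mount_point)
    return result


def read_only_mounts_on_host(path="/proc/mounts"):
    """Convenience wrapper (assumes Linux): inspect the live mount table."""
    with open(path) as f:
        return read_only_mounts(f.read())
```

If an ephemeral mount such as /mnt shows up in that list and dmesg agrees, it is probably time to shoot the chicken.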
DNS Unavailable
It is possible for the internal DNS resolver on the hypervisor to crash, which causes DNS lookups to fail from within the instance.
- Solution: Change your DNS Resolver, Relaunch (Instance Store), Stop/Start (EBS Backed)
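A quick way to confirm that lookups are failing from inside the instance is to ask the system resolver directly. This is a rough sketch using only the Python standard library; `can_resolve` is a hypothetical helper name, and it only tells you that resolution failed, not why.

```python
import socket


def can_resolve(hostname):
    """Return True if the system resolver can resolve hostname, else False."""
    try:
        socket.getaddrinfo(hostname, None)
        return True
    except socket.gaierror:
        # Raised for address-related errors, including failed DNS lookups.
        return False
```

If a hostname you know is good starts returning False across the board, the resolver is the likely suspect.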
Disk Full
Although one of the draws of AWS (and cloud computing in general) is that you can scale up as needed, ephemeral disks unfortunately do not scale that way. EBS volumes can be upsized, but not automatically. You can see how full your disk is by checking disk usage metrics.
- Solution: Upsize disk or Remove unused files
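For a do-it-yourself disk usage check, something like the following would work. It is a small sketch using Python's `shutil.disk_usage`; the 90 percent threshold and the function names are assumptions for illustration, and the number may differ slightly from what `df` reports because of reserved blocks.

```python
import shutil


def disk_percent_used(path="/"):
    """Return roughly how full the filesystem containing path is, in percent."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        return 0.0  # avoid dividing by zero on odd pseudo-filesystems
    return 100.0 * usage.used / usage.total


def is_nearly_full(path="/", threshold=90.0):
    """True when usage is at or above threshold (an arbitrary example value)."""
    return disk_percent_used(path) >= threshold
```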
Performance
Instance at Capacity
One of the best features of EC2 is its range of instances, but this can also make it more difficult to find the right mix between cost and size. Many people misestimate the size that they require, resulting in an instance at capacity. You can discover that your instance is at capacity by checking the CPU, memory, and disk IO.
- Solution: Upsize your EC2 Instance
RDS at Capacity
Like instances, databases can also hit capacity. This can be discovered through checking the CPU, memory, or disk IO on the RDS instance. You may also notice slow application response times, or queries taking longer than expected.
- Solution: Resize RDS or Diagnose Slow Queries
Host Contention
When distributing server space, Amazon will often oversubscribe a server under the assumption that not all users will operate at full capacity at once. Using a command such as ‘iostat 1’, you can measure the amount of CPU Steal your instance is experiencing. High CPU Steal is usually an indicator of noisy neighbors.
- Solution: Relaunch (Instance Store), or Stop/Start (EBS Backed) in order to get a new host.
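If you would rather measure steal from a script than eyeball iostat, the same counter is exposed in /proc/stat on Linux. The sketch below assumes the standard layout of the aggregate `cpu` line (user, nice, system, idle, iowait, irq, softirq, steal, ...); the helper names are hypothetical, and what counts as "high" steal is a judgment call it does not make for you.

```python
import time


def parse_cpu_line(stat_text):
    """Return (total_jiffies, steal_jiffies) from /proc/stat-style text."""
    for line in stat_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "cpu":
            values = [int(v) for v in fields[1:]]
            steal = values[7] if len(values) > 7 else 0
            # Sum only the first eight fields: guest time is already
            # counted inside user and nice.
            total = sum(values[:8])
            return total, steal
    raise ValueError("no aggregate 'cpu' line found")


def steal_percent(before_text, after_text):
    """Percent of CPU time stolen between two /proc/stat snapshots."""
    total0, steal0 = parse_cpu_line(before_text)
    total1, steal1 = parse_cpu_line(after_text)
    elapsed = total1 - total0
    if elapsed <= 0:
        return 0.0
    return 100.0 * (steal1 - steal0) / elapsed


def sample_steal(interval=1.0, path="/proc/stat"):
    """Take two snapshots interval seconds apart (assumes Linux)."""
    with open(path) as f:
        before = f.read()
    time.sleep(interval)
    with open(path) as f:
        after = f.read()
    return steal_percent(before, after)
```

Calling `sample_steal()` a few times over a busy period gives a rough picture of whether the neighbors are being noisy.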
ELB Misconfiguration
Suboptimal configuration of your ELB can cause problems like latency, timeouts and 503 errors. It can be discovered by comparing your configuration with best practices. All of the instances in the load balancer should be healthy, and only AZs with instances in them should be configured for your load balancer.
- Solution: Reconfigure ELB
Cost
Underutilization of EC2 or RDS
Another effect of EC2’s wide variety is that resources can also be underutilized, which eats into your bill. You can discover this by looking at CPU, memory, or disk metrics. Consider your availability needs as well before removing an instance.
- Solution: Reduce instance size or the number running
Application
Security Group Misconfiguration
Not configuring your security groups correctly can cause problems like timeouts, 50x errors or app unavailability. Often you remember to open the local firewall and port on an instance, but forget to open it in the EC2 security group.
- Solution: Open port on security group
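To tell whether a port is actually reachable from another machine, a plain TCP connect test is usually enough. Here is a minimal sketch with Python's standard `socket` module; `port_is_reachable` is a made-up helper name and the 3 second timeout is an arbitrary choice.

```python
import socket


def port_is_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers timeouts, refused connections, and unreachable hosts.
        return False
```

As a rule of thumb, a connection that hangs until the timeout often means packets are being dropped somewhere along the way (a security group is a typical culprit), while an immediate refusal usually means you reached the host but nothing is listening on that port.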