
AWS Study Guide

36 questions · 7 domains · DevOps Engineer level
🔐 IAM & Security (6 questions)
01 What is the difference between an IAM Role and an IAM User?
An IAM User is a permanent identity with long-term credentials (access keys, password) tied to a specific person or service. An IAM Role is an identity with temporary credentials assumed by trusted entities — EC2 instances, Lambda functions, other AWS accounts, or federated users.
  • Users have static credentials; Roles issue short-lived STS tokens (15 min – 12 hrs)
  • Roles are the right pattern for EC2/Lambda/ECS — never embed access keys in compute resources
  • Cross-account access is always done via Role assumption, never by sharing User credentials
If asked "how does your app authenticate to AWS?" — the answer should always be "via an IAM Role attached to the compute resource, not hardcoded keys."
02 Explain the principle of least privilege and how you enforce it in AWS.
Least privilege means granting only the exact permissions needed to perform a task — nothing more. Enforcement in AWS:
  • IAM Access Analyzer — identifies unused permissions and generates least-privilege policies from CloudTrail activity
  • Service Control Policies (SCPs) — set guardrails at the org level that even an account's root user cannot override (the management account is exempt)
  • Permission Boundaries — cap the max permissions a role/user can ever have, regardless of attached policies
  • Conditions in policies — restrict by aws:SourceIp, aws:RequestedRegion, MFA required, etc.
  • Regular IAM Credential Reports to audit unused access keys and rotate or delete them
Mention that you use Access Analyzer to right-size policies rather than manually auditing JSON — shows tooling awareness.
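To make the condition bullets concrete, here is a minimal least-privilege policy document sketched as a Python dict — the bucket name and CIDR are placeholders, not from the source:

```python
import json

# Illustrative policy: allow reads on one bucket only, restricted by
# source IP and MFA. Bucket name and CIDR are made-up examples.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject"],
            "Resource": "arn:aws:s3:::example-app-config/*",
            "Condition": {
                "IpAddress": {"aws:SourceIp": "203.0.113.0/24"},
                "Bool": {"aws:MultiFactorAuthPresent": "true"},
            },
        }
    ],
}

print(json.dumps(policy, indent=2))
```

Note there is no broad `s3:*` and no `Resource: "*"` — the statement names one action, one resource, and two conditions.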
03 What are SCPs and how do they differ from IAM policies?
Service Control Policies (SCPs) are org-level guardrails applied to OUs or accounts. Key distinctions:
  • SCPs do not grant permissions — they define the maximum boundary of what IAM policies can grant
  • IAM policies are evaluated within the SCP boundary — if SCP denies ec2:TerminateInstances, no IAM policy can allow it
  • SCPs apply to all principals in the account including root, except the management account itself
  • Common pattern: deny leaving org, deny disabling CloudTrail, deny creating IAM users (force SSO)
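The common-pattern bullet can be sketched as an SCP document — remember it grants nothing, it only denies. Statement contents here are illustrative:

```python
import json

# Illustrative SCP: a guardrail, not a grant. Denies disabling CloudTrail
# and leaving the organization for every principal in attached accounts.
scp = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "ProtectCloudTrail",
            "Effect": "Deny",
            "Action": ["cloudtrail:StopLogging", "cloudtrail:DeleteTrail"],
            "Resource": "*",
        },
        {
            "Sid": "DenyLeavingOrg",
            "Effect": "Deny",
            "Action": "organizations:LeaveOrganization",
            "Resource": "*",
        },
    ],
}

print(json.dumps(scp, indent=2))
```

Every statement is a Deny — the default FullAWSAccess SCP supplies the allow side, and IAM policies inside each account do the actual granting.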
04 Walk me through how you would audit and remediate an IAM over-permissioned role in production.
Step 1 — Discover: Use IAM Access Analyzer → generate a policy based on CloudTrail activity for the last 90 days. This shows exactly which actions were called.

Step 2 — Compare: Diff the generated least-privilege policy against the attached policy to identify excess permissions.

Step 3 — Test safely: Attach a Permission Boundary scoped to the generated policy alongside the existing policy. If the generated policy covers all observed activity, nothing breaks; anything outside it is now denied, so real gaps surface before you swap policies.

Step 4 — Replace: Swap the attached managed policy for the tightened version after validating no app errors in staging.

Step 5 — Automate: Set up a recurring Access Analyzer scan with EventBridge + Lambda to alert on drift.
Using Permission Boundaries as a non-breaking intermediate step is a production-safe detail that shows real operational maturity.
05 What is AssumeRole and how does cross-account access work?
sts:AssumeRole requests temporary credentials from AWS STS to act as a different role. For cross-account:
  • Account B creates a Role with a Trust Policy allowing Account A's principal to assume it
  • Account A's principal calls aws sts assume-role --role-arn arn:aws:iam::ACCOUNT_B:role/MyRole --role-session-name my-session (the session name is required)
  • STS returns temporary AccessKeyId, SecretAccessKey, and SessionToken (valid 15 min – 12 hrs)
  • Both the trust policy (who can assume) and the role's permission policy (what they can do) must align
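The trust-policy side of the handshake can be sketched as a small helper — the account ID and external ID below are placeholders, not values from the source:

```python
import json

def trust_policy(trusted_account_id: str) -> dict:
    """Trust policy for a role in Account B that principals in the given
    account may assume. Account ID and ExternalId are placeholders."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"AWS": f"arn:aws:iam::{trusted_account_id}:root"},
            "Action": "sts:AssumeRole",
            # ExternalId condition guards against the confused-deputy problem
            "Condition": {"StringEquals": {"sts:ExternalId": "example-external-id"}},
        }],
    }

print(json.dumps(trust_policy("111122223333"), indent=2))
```

The trust policy answers "who may assume"; the role's separate permission policy answers "what they may do once assumed" — both must line up.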
06 How does AWS Secrets Manager differ from SSM Parameter Store?
Secrets Manager: Purpose-built for secrets. Supports automatic rotation (built-in Lambda for RDS, Redshift, DocumentDB), cross-account sharing, replication across regions. Has a cost per secret per month.

SSM Parameter Store: General config store. Standard tier is free; Advanced tier adds larger values and policies. No native rotation — you build it yourself. Better for non-sensitive config values.

Rule of thumb: Secrets Manager for anything that needs rotation (DB passwords, API keys). Parameter Store for environment config, feature flags, and non-sensitive values.
🌐 Networking & VPC (6 questions)
07 What is the difference between a Security Group and a NACL?
Security Groups (SGs): Stateful — if you allow inbound, return traffic is automatically allowed. Applied at the ENI/instance level. Allow-only (no explicit deny rules). Evaluate all rules before deciding.

NACLs: Stateless — you must explicitly allow both inbound and return traffic. Applied at the subnet level. Support both allow and deny rules. Rules evaluated in order (lowest number first); first match wins.
  • SGs are your primary tool — NACLs add a coarse subnet-level layer
  • Common pattern: use NACLs to block known bad IPs/ranges at the subnet boundary
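The NACL evaluation order can be sketched in a few lines — rule numbers and port ranges here are made up for illustration:

```python
# Sketch of NACL semantics: stateless, rules evaluated lowest number
# first, first match wins; anything unmatched hits the implicit deny.
def nacl_evaluate(rules, port):
    for num, action, (lo, hi) in sorted(rules):  # sorted by rule number
        if lo <= port <= hi:
            return action
    return "deny"  # implicit default '*' rule

rules = [
    (100, "allow", (443, 443)),   # allow HTTPS inbound
    (200, "deny",  (0, 65535)),   # explicitly deny everything else
]
print(nacl_evaluate(rules, 443))  # allow
print(nacl_evaluate(rules, 22))   # deny
```

Contrast with a Security Group: there is no rule ordering and no deny action — SGs simply collect all allow rules and permit anything that matches any of them.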
08 Explain VPC peering vs Transit Gateway. When do you use each?
VPC Peering: Direct 1:1 connection between two VPCs. No transitive routing — if A peers B and B peers C, A cannot reach C through B. Simple, low latency, low cost for small topologies.

Transit Gateway (TGW): Hub-and-spoke router that connects VPCs and on-prem networks. Supports transitive routing. Scales to thousands of VPCs. Supports inter-region peering between TGWs. Higher cost but far simpler to manage at scale.
  • Use peering: 2–3 VPCs, intra-region, simple topology
  • Use TGW: multiple accounts, multi-region, any-to-any routing needs, on-prem integration
09 How do you connect an on-premises data center to AWS? Compare the options.
Site-to-Site VPN: IPSec tunnel over the public internet. Quick to set up (<1hr). ~1.25 Gbps max. Variable latency. Good for: dev/test, backup path, fast initial connectivity.

AWS Direct Connect: Dedicated private fiber to an AWS edge location. 1–100 Gbps. Consistent, low latency. 4–12 weeks to provision. Good for: production workloads with consistent throughput/latency requirements, large data transfer volume.

Direct Connect + VPN: Use VPN as encrypted overlay on DX for compliance requirements needing encryption in transit on a private circuit.
Interviewers love the hybrid pattern: "We used Direct Connect as the primary path and a Site-to-Site VPN as an encrypted backup failover."
10 An EC2 instance in a private subnet can't reach the internet. Walk me through how you'd debug it.
Systematic layer-by-layer diagnosis:
  • Route table — does the private subnet's route table have a route for 0.0.0.0/0 pointing to a NAT Gateway (not IGW)?
  • NAT Gateway — is it in a public subnet? Does its subnet's route table have 0.0.0.0/0 → IGW?
  • IGW — is an Internet Gateway attached to the VPC?
  • Security Group — does the instance SG allow outbound on the relevant port (80/443)?
  • NACL — does the NACL allow outbound AND inbound return traffic (ephemeral ports 1024–65535)?
  • DNS — is enableDnsSupport and enableDnsHostnames true on the VPC?
Start from the routing layer, not the security layer — most "can't reach internet" issues are a missing NAT route or NAT GW in the wrong subnet.
11 What is a VPC Endpoint and when would you use one?
A VPC Endpoint allows instances in a VPC to communicate with AWS services without traversing the public internet.

Gateway Endpoint: Free. Supports only S3 and DynamoDB. Added to route tables.

Interface Endpoint (PrivateLink): Creates an ENI in your subnet. Supports most AWS services + SaaS. Has hourly + data processing charges.

Use cases: compliance mandates that no data leave the VPC perimeter, reduce NAT Gateway data charges (S3/DynamoDB traffic via Gateway Endpoint is free), access services in other VPCs/accounts privately.
12 What is the difference between an ALB and an NLB?
ALB (Layer 7): HTTP/HTTPS aware. Supports path-based and host-based routing, sticky sessions, WebSocket, HTTP/2, authentication (Cognito/OIDC), and Lambda targets. Best for microservices and web apps.

NLB (Layer 4): TCP/UDP/TLS. Extremely high throughput and ultra-low latency (on the order of 100 μs, not milliseconds). Preserves client source IP. Supports static IPs / Elastic IPs. Best for non-HTTP workloads: databases, MQTT, gaming, gRPC at very high RPS.
If asked about a 502 bad gateway — that's an ALB issue (layer 7). The target returned an invalid HTTP response or didn't respond at all.
⚙️ CI/CD & CodePipeline (5 questions)
13 Describe the AWS CI/CD toolchain. What does each service do?
  • CodeCommit — managed Git repository (largely superseded by GitHub/GitLab integrations)
  • CodeBuild — fully managed build service. Runs buildspec.yml. Scales to zero between builds. Outputs artifacts to S3.
  • CodeDeploy — automates deployments to EC2, ECS, Lambda, or on-prem. Supports rolling, blue/green, and canary strategies.
  • CodePipeline — orchestration layer that chains source → build → test → deploy stages. Integrates with third-party tools at each stage.
In practice: most teams use GitHub Actions or GitLab CI for build, then CodeDeploy or CDK Pipelines for the deploy stage into AWS. Know both patterns.
14 What deployment strategies does CodeDeploy support? When do you use each?
Rolling (In-place): Updates instances in batches. Faster and cheaper, but brief mix of old/new versions. Risk: if new version is broken, some requests already hit it.

Blue/Green: Provision a new environment (green), shift traffic, keep blue as rollback target. Zero-downtime, instant rollback by redirecting traffic. Best for production ECS/EKS services.

Canary: Route a small % of traffic (e.g., 10%) to new version first, then either roll forward or rollback based on CloudWatch alarms. Best for Lambda and high-risk changes.

All-at-once: Deploy to all instances simultaneously. Fastest but highest risk. Only for dev/test.
15 How would you design a pipeline that deploys to multiple environments with gates and approvals?
Example architecture: dev → staging → production with gates:

  • Source stage: GitHub trigger on merge to main
  • Build stage: CodeBuild runs tests, produces artifact, pushes image to ECR
  • Deploy to Dev: Automatic. CodeDeploy or ECS rolling update. Runs smoke tests via CodeBuild.
  • Gate: Manual approval action in CodePipeline (SNS notification to Slack). Blocks promotion to staging.
  • Deploy to Staging: Blue/green. CloudWatch alarm monitors error rate for 10 min. If alarm fires → auto-rollback.
  • Gate: Manual approval from release manager.
  • Deploy to Production: Canary 10% for 15 min, then full. Alarm-linked auto-rollback.
The combination of alarm-linked automatic rollback at canary stage + manual approval gates is a senior-level design pattern.
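The canary-stage rollback decision described above reduces to a small function — the 5% threshold and sample shape are illustrative, not from the source:

```python
# Sketch of the canary promotion decision: during the canary window,
# promote only if no observed error-rate sample breached the threshold.
# Threshold is an illustrative value.
def canary_decision(error_rates, threshold=0.05):
    """error_rates: per-minute error-rate samples seen during the canary."""
    if any(rate > threshold for rate in error_rates):
        return "rollback"
    return "promote"

print(canary_decision([0.00, 0.01, 0.02]))  # promote
print(canary_decision([0.00, 0.12]))        # rollback
```

In the real pipeline this decision is delegated to a CloudWatch alarm wired to the deployment: alarm fires, CodeDeploy rolls back; no human in the loop at the canary stage.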
16 How do you store and inject secrets securely into a CodeBuild build?
Never put secrets in environment variables in plain text or in the buildspec committed to source control.

Correct approach:
  • Store in Secrets Manager or SSM Parameter Store SecureString
  • Give the CodeBuild service role permission to read those specific ARNs
  • Reference in buildspec.yml under env.secrets-manager or env.parameter-store — CodeBuild fetches and injects at runtime
  • These values are masked in build logs
env:
  secrets-manager:
    MY_TOKEN: "arn:aws:secretsmanager:..."
17 What is immutable infrastructure and how does AMI baking fit into it?
Immutable infrastructure means servers are never modified after deployment — to update, you replace them with new instances built from a new image.

AMI baking: Use a tool like Packer or EC2 Image Builder to pre-bake your application, runtime, and config into an AMI at build time. At deploy time, just launch the new AMI and terminate old instances (via Auto Scaling or CodeDeploy).

Benefits: deployments are faster (no bootstrap time), all instances are identical, rollback = launch previous AMI version, no config drift.
Contrast with "snowflake servers" — unique, hand-configured instances nobody dares touch. AMI baking eliminates that pattern entirely.
📦 Containers (ECS / EKS) (5 questions)
18 ECS vs EKS — how do you decide which to use?
ECS: AWS-native, simpler ops model. Tight integration with ALB, IAM, CloudWatch. No Kubernetes expertise required. Fargate makes it serverless. Best when you want to run containers without managing K8s control plane complexity.

EKS: Managed Kubernetes. Choose when: your team has K8s expertise, you need K8s-native tooling (Helm, Argo CD, KEDA, Istio), multi-cloud portability matters, or you're migrating existing K8s workloads.

Rule: greenfield AWS-only? Start with ECS. Complex orgs with existing K8s investment? EKS.
19 What is Fargate and what problem does it solve?
Fargate is a serverless compute engine for containers — you define CPU/memory, and AWS manages the underlying EC2 hosts entirely. You never patch, scale, or right-size instances.

Tradeoffs:
  • Pro: No cluster management, per-task billing, per-task IAM role, strong workload isolation
  • Con: ~10–30% cost premium vs EC2 launch type at steady load, no GPU support, cold starts for rarely-invoked tasks
Best for: microservices, batch jobs, unpredictable burst traffic, teams without dedicated infrastructure engineers.
20 An ECS task keeps failing and restarting. How do you debug it?
1. Check stopped task reason: aws ecs describe-tasks on a stopped task — stoppedReason and containers[].reason often tell you everything.

2. CloudWatch Logs: Check the container's log stream (awslogs driver). Look for application exceptions, OOM, permission errors.

3. Exit code:
  • exit 1 — application error
  • exit 137 — OOM kill (SIGKILL). Increase memory reservation.
  • exit 143 — SIGTERM received, app didn't handle graceful shutdown fast enough
4. Health check: If ALB health check fails → task gets drained and replaced. Fix the health check endpoint or the app startup sequence.

5. IAM: AccessDeniedException in logs → task role missing permissions.
Exit code 137 (OOM) is extremely common — always look at memory metrics in CloudWatch Container Insights first.
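The exit codes above follow the standard Unix convention: values above 128 encode 128 plus the fatal signal number, which you can decode directly:

```python
import signal

# Container exit codes above 128 mean "killed by signal (code - 128)".
def decode_exit(code: int) -> str:
    if code > 128:
        sig = signal.Signals(code - 128)
        return f"killed by {sig.name}"
    return "application exit" if code else "clean exit"

print(decode_exit(137))  # killed by SIGKILL  (128 + 9, the OOM kill)
print(decode_exit(143))  # killed by SIGTERM  (128 + 15, graceful stop)
print(decode_exit(1))    # application exit
```

So 137 = 128 + 9 (SIGKILL, what the OOM killer sends) and 143 = 128 + 15 (SIGTERM, what ECS sends on task stop before escalating to SIGKILL).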
21 How does ECS service auto scaling work?
ECS Service Auto Scaling uses Application Auto Scaling to adjust desired task count based on CloudWatch metrics.

Target Tracking: Maintain a metric at a target value (e.g., CPU 50%). Simplest — automatically adds/removes tasks. Use for most workloads.

Step Scaling: Define thresholds and step adjustments. More control for predictable traffic patterns.

Scheduled Scaling: Pre-scale before known traffic spikes (business hours, marketing events).

Key: if using Fargate, capacity is instant. If using EC2 launch type, you also need Cluster Auto Scaling (EC2 instance scaling) via a Capacity Provider.
22 How do you give an ECS task permissions to access other AWS services?
Via a Task IAM Role — not an instance profile, not hardcoded credentials.

How it works: ECS injects credentials into the task via the Task Metadata Endpoint. The AWS SDKs automatically discover and refresh these credentials. Each task can have its own role with least-privilege permissions.
  • Specify taskRoleArn in the task definition
  • Different from executionRoleArn — the execution role is used by ECS agent to pull images and write to CloudWatch; the task role is used by your application code
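The two-role split shows up directly in the task definition — here as a minimal fragment sketched in Python, with placeholder account IDs, names, and ARNs:

```python
# Minimal ECS task-definition fragment showing the two distinct roles.
# All ARNs, names, and the image URI are placeholders.
task_definition = {
    "family": "example-service",
    # Used by YOUR application code (SDK calls to S3, DynamoDB, etc.)
    "taskRoleArn": "arn:aws:iam::111122223333:role/example-task-role",
    # Used by the ECS agent: pull the image from ECR, ship logs to CloudWatch
    "executionRoleArn": "arn:aws:iam::111122223333:role/example-execution-role",
    "containerDefinitions": [{
        "name": "app",
        "image": "111122223333.dkr.ecr.us-east-1.amazonaws.com/app:latest",
        "logConfiguration": {"logDriver": "awslogs"},
    }],
}

# The whole point: these are different roles with different audiences.
assert task_definition["taskRoleArn"] != task_definition["executionRoleArn"]
```

A quick interview check: if the app gets AccessDenied calling S3, fix the task role; if the task can't even start because the image pull fails, fix the execution role.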
🏗️ Terraform & IaC (5 questions)
23 What does terraform plan output and how do you read it?
terraform plan shows the diff between current state and desired state. Symbol key:
  • + — resource will be created
  • - — resource will be destroyed
  • ~ — resource will be updated in-place
  • -/+ — resource will be destroyed and recreated (forces replacement)
Always pay attention to forces replacement annotations — these indicate that changing this attribute requires the resource to be deleted and re-created, which can cause downtime or data loss (e.g., changing an EC2 instance's AMI, or renaming a security group).
Interviewers often show you a plan output and ask "what's the risk here?" — look for -/+ symbols on stateful resources like RDS, ElastiCache, and ASGs.
24 How do you manage Terraform state safely in a team environment?
Remote state: Store terraform.tfstate in S3 (with versioning enabled). Never commit state to Git.

State locking: Use a DynamoDB table as a lock backend — prevents concurrent applies from corrupting state. Configure in the backend "s3" block with dynamodb_table.

Encryption: Enable S3 SSE-KMS on the state bucket. State can contain sensitive resource attributes (passwords, private keys).

Workspaces or separate state files: Use separate state files per environment (dev/staging/prod) either via workspaces or separate S3 prefixes/keys.
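Pulling the four practices together, a typical backend "s3" block carries these fields — shown here as a Python dict for illustration; the bucket, table, and key names are placeholders:

```python
# Fields of a typical Terraform backend "s3" block, sketched as a dict.
# Bucket, key, and table names are illustrative placeholders.
backend_s3 = {
    "bucket": "example-tf-state",          # versioned, SSE-KMS encrypted
    "key": "envs/prod/terraform.tfstate",  # separate key per environment
    "region": "us-east-1",
    "dynamodb_table": "terraform-locks",   # state locking against concurrent applies
    "encrypt": True,                       # encrypt state at rest
}

for field, value in backend_s3.items():
    print(f"{field} = {value}")
```

Each environment (dev/staging/prod) gets its own `key`, so a broken apply in dev can never lock or corrupt the prod state object.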
25 Someone manually changed an AWS resource outside of Terraform. What happens and how do you handle it?
This is state drift. Terraform's state no longer matches reality.

terraform plan will show the drift as a diff — it'll want to revert the manual change back to what's in code.

Option 1 — Revert: Run terraform apply to bring reality back to the declared state. Correct if the manual change was unauthorized.

Option 2 — Accept: If the manual change was intentional, update your Terraform code to match reality, then run terraform plan to confirm it shows no changes.

Option 3 — terraform import: For resources created entirely outside Terraform that you now want to manage, import them into state.

Prevent drift: Use AWS Config + CloudTrail to detect out-of-band changes. terraform plan in CI on a schedule (drift detection).
26 What are Terraform modules and why should you use them?
Modules are reusable, parameterized packages of Terraform resources. Think of them as functions: inputs (variables), logic (resources), outputs.

Benefits:
  • DRY: Define a standard VPC or ECS service pattern once, reuse across dev/staging/prod with different inputs
  • Encapsulation: Hide complexity. Consumers only see the interface (variables/outputs), not internal resource details
  • Consistency: Org-wide modules enforce standards (tagging, naming conventions, security baselines)
  • Versioning: Pin module versions in environments to control rollout of infrastructure changes
27 Terraform vs AWS CDK vs CloudFormation — when do you use each?
CloudFormation: Native AWS, no third-party dependency. JSON/YAML only. Best when you need native AWS service support on day 0 (new services often appear in CFN before Terraform providers). Verbose for complex stacks.

CDK: Write infrastructure in TypeScript/Python/Java. Compiles to CloudFormation. Great for developer-friendly IaC, generating repetitive stacks, and CDK Pipelines (self-mutating CI/CD). AWS-only.

Terraform: Multi-cloud/multi-provider (AWS + Datadog + GitHub + PagerDuty in one plan). HCL is purpose-built for infra declaration. Strong ecosystem of modules. Best choice for organizations using more than just AWS.
📊 Observability & Monitoring (5 questions)
28 What are the three pillars of observability and how does AWS address each?
Metrics — quantitative measurements over time. AWS: CloudWatch Metrics (EC2, ECS, Lambda built-in), custom metrics via PutMetricData, Container Insights for K8s/ECS.

Logs — timestamped records of events. AWS: CloudWatch Logs, Log Insights for querying. Centralize all logs from all services into log groups.

Traces — distributed request paths across microservices. AWS: X-Ray for tracing request flow across Lambda, ECS, API Gateway. Identifies latency bottlenecks and error origins.
29 How do you set up alerting for a Lambda function that starts throwing errors?
  • CloudWatch Metric Alarm on the Errors metric for the function. Set threshold, evaluation periods, and datapoints-to-alarm to avoid noise.
  • Alarm triggers SNS topic → subscribers: email, PagerDuty, Slack (via Lambda), or OpsGenie.
  • For more nuanced detection, use CloudWatch Logs Metric Filter — create a custom metric from log pattern matches (e.g., ERROR or Exception) and alarm on that.
  • Monitor Throttles and Duration alongside Errors — throttles spike before errors become visible.
Don't alarm on every single error — use anomaly detection or alarm on error rate (Errors / Invocations) to reduce noise from bursty legitimate errors.
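The rate-not-count logic above can be sketched as a tiny decision function — the 5% threshold and minimum-traffic floor are illustrative values, not from the source:

```python
# Alarm on error *rate*, not raw error count. Thresholds are illustrative.
def should_alarm(errors: int, invocations: int,
                 rate_threshold: float = 0.05,
                 min_invocations: int = 20) -> bool:
    if invocations < min_invocations:
        return False  # too little traffic to judge -- avoids flapping
    return errors / invocations > rate_threshold

print(should_alarm(errors=3, invocations=10))    # False: below traffic floor
print(should_alarm(errors=30, invocations=100))  # True: 30% error rate
print(should_alarm(errors=2, invocations=1000))  # False: 0.2% error rate
```

In CloudWatch terms this is a metric math alarm on `Errors / Invocations`, with evaluation periods and datapoints-to-alarm playing the role of the traffic floor.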
30 Production is slow but error rate is flat. Walk me through how you'd investigate.
Latency increase without errors usually means resource contention or a slow dependency.

1. ALB metrics: Check TargetResponseTime P99 in CloudWatch. Is the latency at the ALB or inside the target?

2. X-Ray traces: Identify which service or downstream call is slow. Subsegment breakdown shows database, S3, external API call durations.

3. RDS/database: Check DatabaseConnections, ReadLatency, WriteLatency, CPUUtilization, FreeableMemory. Use RDS Performance Insights to see query-level breakdown — often a missing index or N+1 query.

4. EC2/ECS CPU/Memory: Is the compute tier saturated? Are tasks being throttled?

5. External dependencies: Third-party API slow? Check your outbound latency metrics or X-Ray external service segments.
31 What is CloudTrail and how is it different from CloudWatch?
CloudTrail: Records API call history — who called what AWS API, when, from where. Answers "who deleted that S3 bucket?" or "who changed this security group?" 90-day event history free; trails store to S3 for longer retention. Security, compliance, and audit tool.

CloudWatch: Operational monitoring — metrics, logs, alarms, dashboards. Answers "is my service healthy right now?" Real-time operational tool.

They're complementary: CloudWatch tells you something is wrong; CloudTrail tells you what changed that caused it.
32 How do you centralize logs from hundreds of EC2 instances?
Install the CloudWatch Unified Agent on each instance (via SSM State Manager for automated fleet-wide deployment). Configure it via amazon-cloudwatch-agent.json to tail log files and stream to CloudWatch Log Groups.

Architecture:
  • Group logs by service/environment: /app/myservice/prod
  • Use Log Insights for ad-hoc querying across all streams
  • Stream to OpenSearch (Elasticsearch) via subscription filter + Kinesis for full-text search at scale
  • Set retention policies on log groups — don't pay to store debug logs forever
High Availability & Cost (4 questions)
33 What is the difference between RTO and RPO?
RTO (Recovery Time Objective): How long can the system be down? Maximum acceptable time from failure to full restoration.

RPO (Recovery Point Objective): How much data can you afford to lose? Maximum acceptable age of the backup/snapshot you restore from.

Example: RTO = 1 hour means you must restore service within 1 hour of an outage. RPO = 15 minutes means you can tolerate losing at most 15 minutes of data, so you must back up at least every 15 minutes.

Always connect this to architecture: low RPO → continuous replication (RDS Multi-AZ, cross-region replica). Low RTO → pre-warmed standby (pilot light or warm standby patterns), not cold backups.
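The RPO arithmetic in the example is worth making explicit: worst-case data loss equals the time since the last backup, so the backup interval must never exceed the RPO.

```python
# RPO arithmetic: worst-case data loss == backup interval, so the
# interval must not exceed the RPO target.
def meets_rpo(backup_interval_minutes: int, rpo_minutes: int) -> bool:
    return backup_interval_minutes <= rpo_minutes

def worst_case_loss_minutes(backup_interval_minutes: int) -> int:
    # Failure can strike just before the next backup runs.
    return backup_interval_minutes

print(meets_rpo(backup_interval_minutes=15, rpo_minutes=15))  # True
print(meets_rpo(backup_interval_minutes=60, rpo_minutes=15))  # False
print(worst_case_loss_minutes(15))                            # 15
```

This is why a low RPO pushes you from periodic snapshots toward continuous replication: as the RPO approaches zero, no snapshot schedule can satisfy it.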
34 Design a multi-region active-active architecture for a web application.
Key components:
  • Route 53 with latency-based or geolocation routing → directs users to nearest region. Health checks + failover routing for automatic region-level failover.
  • Compute: ECS/EKS running in both regions behind ALBs. Deploy via pipeline to both simultaneously.
  • Database: Aurora Global Database — primary region handles writes, secondary has <1s replication lag, can be promoted in <1 min if primary fails.
  • Session/Cache: ElastiCache Global Datastore (Redis) for cross-region cache replication.
  • Static assets: S3 with Cross-Region Replication + CloudFront with an origin group (primary + failover origin).
  • Conflict resolution: If truly active-active writes, use a conflict-free data model or route write traffic to a single "home region" per user (active-active reads, active-passive writes).
35 How do you reduce AWS costs without sacrificing reliability?
Right-sizing: Use Compute Optimizer recommendations to downsize over-provisioned EC2/RDS. Start with instances using <20% CPU consistently.

Savings Plans / Reserved Instances: Commit to 1–3 years for stable baseline workloads. Compute Savings Plans are most flexible (cover EC2, Fargate, Lambda).

Spot Instances: For interruptible workloads (batch, CI/CD builders, dev environments). Up to 90% savings.

S3 Intelligent-Tiering: Automatically moves objects between access tiers. Eliminates manual lifecycle policy tuning.

CloudFront in front of S3/EC2: Reduce data transfer costs — serving cached content from the CDN edge avoids repeated origin egress charges.

Terminate idle resources: Use AWS Trusted Advisor and Cost Explorer to find unattached EBS volumes, idle ELBs, unused Elastic IPs.
36 Tell me about a time you dealt with a production outage or incident in AWS. How did you handle it?
This is your war story question. Use this 5-part framework:

1. Setup (30s): What system, what was the business impact? "Our payment processing pipeline on ECS was dropping 15% of transactions."

2. Discovery (45s): How did you find it? What was your first signal? "CloudWatch alarm on ALB 5xx rate. X-Ray showed database calls timing out."

3. Diagnosis (60s): Root cause — what was it actually? "RDS connection pool exhausted — a new deploy had increased connection count per task but we hadn't scaled the RDS instance class."

4. Remediation (30s): What did you do to fix it? "Rolled back the task definition to the prior revision, then scaled RDS to the next instance class in maintenance window."

5. Prevention (30s): What did you change so it doesn't happen again? "Added RDS DatabaseConnections to our pre-deploy checklist and a CloudWatch alarm at 80% connection limit."
Prepare 1–2 real stories from Carvana, PetSmart, or ADT before the interview. The details and specificity are what make this answer land.