// CI/CD study guide

CI/CD Study Guide

35 questions · 7 domains · DevOps engineer level
🔄 Pipeline Fundamentals (5 questions)
01 What is the difference between Continuous Integration, Continuous Delivery, and Continuous Deployment?
Continuous Integration (CI) is the practice of merging code changes frequently and automatically building and testing every commit. The goal is to catch integration issues early.

Continuous Delivery (CD) extends CI by ensuring the codebase is always in a deployable state. Every passing build is a release candidate; deployment to production is a manual, one-click decision.

Continuous Deployment removes the manual gate: every build that passes all automated checks deploys automatically to production with no human approval.

  • CI answers: "Does this code work in isolation and with the rest of the codebase?"
  • Continuous Delivery answers: "Is this code safe to ship whenever we choose?"
  • Continuous Deployment answers: "Is this code deployed automatically every time it passes?"
  • Most mature teams use Continuous Delivery, not Continuous Deployment, because production releases still require business approval gates
Interviewers often test whether you conflate Delivery and Deployment. Be precise: Delivery = always deployable, manual release. Deployment = fully automated release.
02 What are the core stages of a CI/CD pipeline?
A well-structured pipeline typically flows through these stages:
  • Source: trigger on a commit/PR to a branch
  • Build: compile code, build the Docker image, resolve dependencies
  • Test: unit tests, integration tests, linting, static analysis
  • Security scan: SAST (code), SCA (dependencies), image scanning
  • Artifact publishing: push the image to a registry, store build artifacts
  • Deploy to staging: apply to a non-production environment
  • Acceptance/smoke tests: verify the deployment is healthy
  • Deploy to production: with an approval gate (Continuous Delivery) or automatically (Continuous Deployment)
  • Post-deploy verification: smoke tests, synthetic monitoring
Mention that you fail fast: tests that catch the most common issues should run first to avoid wasting time on later stages.
03 What are DORA metrics and why do they matter?
DORA (DevOps Research and Assessment) metrics are the four key indicators that measure software delivery performance:
  • Deployment Frequency: how often you deploy to production. Elite: multiple times per day. Low: less than once a month.
  • Lead Time for Changes: time from commit to production. Elite: under 1 hour. Low: 1–6 months.
  • Change Failure Rate: the percentage of deployments that cause a production incident. Elite: 0–15%. Low: 46–60%.
  • Mean Time to Restore (MTTR): how long it takes to recover from a production failure. Elite: under 1 hour. Low: 1 week–1 month.
They matter because they're validated by research as predictors of organizational performance: teams with elite metrics also have better business outcomes.
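As a quick self-check, the elite/low bands above can be turned into a rough tier lookup. This is an illustrative sketch: the elite and low cutoffs follow the numbers quoted above, while the intermediate "high"/"medium" cutoffs are assumptions, not official DORA thresholds.

```python
# Rough DORA-style tier classification from two of the four metrics.
# Elite/low bands follow the figures above; the middle cutoffs are
# illustrative assumptions, not official DORA values.

def dora_tier(deploys_per_month: float, lead_time_hours: float) -> str:
    """Classify delivery performance from deploy frequency and lead time."""
    if deploys_per_month >= 30 and lead_time_hours < 1:
        return "elite"           # multiple deploys/day, sub-hour lead time
    if deploys_per_month >= 4 and lead_time_hours < 24 * 7:
        return "high"            # roughly weekly deploys, lead time under a week
    if deploys_per_month >= 1:
        return "medium"
    return "low"                 # less than monthly

print(dora_tier(60, 0.5))        # daily-plus deploys, sub-hour lead time
print(dora_tier(0.5, 2000))      # less than monthly, months of lead time
```

Being able to place a past team in a tier this way sets up the interview tip below.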
Know which DORA tier your current/past team falls into. "We deployed weekly and our MTTR was under 2 hours" is far more credible than "we had a good pipeline."
04 What is pipeline drift and how do you prevent it?
Pipeline drift occurs when environments or pipeline configurations diverge over time: production no longer matches staging, or your pipeline definition no longer matches reality.

Common causes:
  • Manual changes made directly to environments without updating IaC
  • Different pipeline configs per branch that diverge independently
  • Pinned dependency versions that differ between stages
  • Environment-specific scripts that aren't version-controlled
Prevention:
  • Store all pipeline config as code (Jenkinsfile, .github/workflows, .gitlab-ci.yml) in the same repo as the application
  • Use identical container images across all stages: build once, promote the same artifact
  • Enforce IaC (Terraform/Ansible) for all environment configuration; no manual console changes
  • Use drift detection tools like terraform plan in CI to catch infrastructure drift
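To make drift detection concrete, here is a toy sketch of the comparison that terraform plan performs: diff the desired state from IaC against the live state. The resource dictionaries and keys are invented for the example; real drift detection queries the cloud provider.

```python
# Toy drift detector: diff desired state (from IaC) against live state.
# `terraform plan` does this against real providers; this sketch only
# illustrates the comparison, using made-up resource attributes.

def find_drift(desired: dict, actual: dict) -> dict:
    """Return {key: (desired, actual)} for every mismatched setting."""
    drift = {}
    for key in desired.keys() | actual.keys():
        if desired.get(key) != actual.get(key):
            drift[key] = (desired.get(key), actual.get(key))
    return drift

desired = {"instance_type": "t3.medium", "encrypted": True}
actual  = {"instance_type": "t3.large",  "encrypted": True}  # manual console change
print(find_drift(desired, actual))
```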
05 A pipeline that used to take 8 minutes now takes 45. Walk me through how you'd diagnose and fix it.
Step 1: identify the bottleneck. Check the pipeline's stage timing history; most CI systems show per-stage duration. Find which stage grew from its baseline.

Step 2: common culprits by stage:
  • Build stage slow: cache miss (Docker layer cache invalidated, npm/pip cache missing). Fix: restore the dependency cache before the build, use multi-stage Docker builds, cache layers properly.
  • Test stage slow: the test count grew, flaky tests are retrying, or there's no parallelism. Fix: split tests across parallel runners, add --shard flags, identify and quarantine flaky tests.
  • Network slow: pulling large base images repeatedly, downloading packages from the internet. Fix: mirror images to an internal registry, cache package downloads.
  • Security scan slow: scanning a growing image with no cache. Fix: scan the base image separately and cache the results; only scan new layers.
Step 3: quick wins. Parallelize independent stages. Move slow optional checks (full integration tests) to a nightly pipeline rather than every commit.
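The "instrument first" idea from Step 1 can be sketched as a small helper that ranks stages by growth against their baseline timings. The stage names and durations here are made up for illustration.

```python
# Illustrative "instrument first" step: compare current per-stage timings
# (seconds) against a historical baseline and rank stages by growth.
# Stage names and numbers are invented for the sketch.

def slowest_regressions(baseline: dict, current: dict) -> list:
    """Stages sorted by how many seconds they grew versus baseline."""
    growth = {stage: current[stage] - baseline.get(stage, 0) for stage in current}
    return sorted(growth.items(), key=lambda kv: kv[1], reverse=True)

baseline = {"build": 120, "test": 240, "scan": 60}    # the old 8-minute pipeline
current  = {"build": 150, "test": 2400, "scan": 90}   # the new 45-minute pipeline
print(slowest_regressions(baseline, current)[0])      # worst offender first
```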
Frame your answer around "instrument first, then optimize": you don't fix what you haven't measured. Mentioning that you looked at historical timing data shows operational maturity.
⚙️ GitHub Actions (6 questions)
06 What are the core components of a GitHub Actions workflow?
  • Workflow: a YAML file in .github/workflows/. Defines the full automation process.
  • Trigger (on:): what starts the workflow, e.g. push, pull_request, schedule, workflow_dispatch, workflow_call
  • Job: a group of steps that run on the same runner. Jobs run in parallel by default.
  • Step: a single command or action within a job. Steps run sequentially.
  • Action: a reusable unit of code (from the GitHub Marketplace or your own repo) invoked with uses:
  • Runner: the machine that executes jobs. GitHub-hosted (ubuntu-latest) or self-hosted.
  • Environment: a deployment target with protection rules and secrets.
on:
  push:
    branches: [main]
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm ci && npm test
07 How do you share data between jobs in GitHub Actions?
Jobs run on separate runners, so they don't share filesystem state. Three mechanisms:
  • Artifacts: actions/upload-artifact / actions/download-artifact. Use for files (build outputs, test reports). Stored in GitHub, retained up to 90 days by default.
  • Job outputs: small string values set via $GITHUB_OUTPUT and referenced with needs.job-name.outputs.key. Use for version numbers, flags, short strings.
  • Cache: actions/cache for dependency caching (node_modules, pip packages). Keyed by a hash of the lockfile.
# In the "build" job: set a step output, then expose it via the job-level
# outputs: block (version: ${{ steps.version.outputs.version }})
- id: version
  run: echo "version=1.2.3" >> $GITHUB_OUTPUT
# In a downstream job that declares "needs: build"
- run: echo ${{ needs.build.outputs.version }}
Artifacts are for files; outputs are for strings. Getting this distinction right shows you've actually built multi-job workflows.
08 What is the difference between GitHub-hosted and self-hosted runners? When do you use each?
GitHub-hosted runners are ephemeral VMs managed by GitHub. Fresh environment every run. Use for: standard builds, open source projects, anything that doesn't need VPC access.
  • Pros: zero maintenance, always up to date, free tier included
  • Cons: no access to internal network, slower for large builds (cache cold), can't install persistent tooling
Self-hosted runners are machines you own and register with GitHub. Use for:
  • Builds that need access to private infrastructure (deploy to internal k8s cluster)
  • Compliance requirements (code never leaves your network)
  • Large compute needs (GPU, high-memory) where GitHub-hosted is too expensive
  • Faster builds with warm caches and pre-installed dependencies
Self-hosted runners on public repos are a security risk: anyone can fork the repo and run malicious code on your machines. Never use self-hosted runners on public repos without strict controls.
09 How do you implement a matrix build in GitHub Actions?
A matrix strategy runs the same job multiple times with different variable values, which is useful for testing against multiple OS versions, Node versions, or environment combinations.
jobs:
  test:
    runs-on: ${{ matrix.os }}
    strategy:
      matrix:
        os: [ubuntu-latest, windows-latest]
        node: [18, 20, 22]
      fail-fast: false  # don't cancel other matrix jobs on failure
    steps:
      - uses: actions/setup-node@v4
        with:
          node-version: ${{ matrix.node }}
      - run: npm test
  • fail-fast: false runs all combinations even if one fails, so you get the full picture
  • include: and exclude: add specific combinations or remove invalid ones
  • The matrix generates N parallel jobs: 2 OS × 3 Node versions = 6 parallel jobs
10 What are reusable workflows and composite actions? When do you use each?
Both allow reuse, but at different levels:

Composite Actions bundle multiple steps into a single action. They run in the calling job's runner. Use for: grouping steps you repeat within workflows (e.g., "setup and authenticate").
# .github/actions/setup-aws/action.yml
inputs:
  role-arn:
    required: true
runs:
  using: composite
  steps:
    - uses: aws-actions/configure-aws-credentials@v4
      with:
        role-to-assume: ${{ inputs.role-arn }}
Reusable Workflows define an entire job (or jobs) that can be called by other workflows. They run on their own runner. Use for: full deployment pipelines you want to centralize across repos.
# called from another workflow:
jobs:
  deploy:
    uses: org/shared-workflows/.github/workflows/deploy.yml@main
    with:
      environment: production
  • Composite = reusable steps within a job
  • Reusable workflow = reusable entire job, can have its own runner and environment
Platform teams should publish reusable workflows in a central repo so all product teams consume the same tested deployment logic; this reduces fragmentation.
11 How do you securely authenticate GitHub Actions to AWS without storing long-term credentials?
Use OIDC (OpenID Connect): GitHub acts as an identity provider and issues short-lived tokens that AWS trusts.

Setup:
  • In AWS: create an IAM OIDC identity provider for token.actions.githubusercontent.com
  • Create an IAM Role with a trust policy that allows the GitHub OIDC provider to assume it, scoped to your specific repo/branch
  • In the workflow: use aws-actions/configure-aws-credentials with role-to-assume
permissions:
  id-token: write  # required for OIDC
  contents: read
steps:
  - uses: aws-actions/configure-aws-credentials@v4
    with:
      role-to-assume: arn:aws:iam::123456789:role/github-deploy
      aws-region: us-east-1
  • No access keys are stored in GitHub Secrets; the workflow receives short-lived session credentials instead of permanent keys
  • The trust policy can restrict to a specific repo, branch, or environment for least privilege
If you're still using long-term access keys in GitHub Secrets for AWS auth, that's an immediate red flag in a security-conscious interview. OIDC is the right answer.
๐Ÿ—๏ธ
Jenkins 5 questions
12 What is a Jenkinsfile and what are the two pipeline syntax options?
A Jenkinsfile is a text file committed to your repo that defines the Jenkins pipeline as code. It enables version-controlled, reviewable pipeline definitions.

Declarative Pipeline: structured, opinionated syntax with a fixed schema. Easier to read and write, with built-in validation:
pipeline {
  agent any
  stages {
    stage('Build') { steps { sh 'npm run build' } }
    stage('Test')  { steps { sh 'npm test' } }
  }
}
Scripted Pipeline: full Groovy code, maximum flexibility, no structural constraints:
node {
  stage('Build') { sh 'npm run build' }
  stage('Test')  { sh 'npm test' }
}
  • Prefer Declarative unless you need logic it can't express; it's more maintainable
  • Declarative supports script {} blocks for when you need Groovy logic inside it
13 How do you implement parallel stages in Jenkins?
In Declarative, use the parallel directive inside a stage:
stage('Test') {
  parallel {
    stage('Unit Tests') {
      steps { sh 'npm run test:unit' }
    }
    stage('Integration Tests') {
      steps { sh 'npm run test:integration' }
    }
    stage('Lint') {
      steps { sh 'npm run lint' }
    }
  }
}
Key considerations:
  • Each parallel branch needs its own agent/executor slot; make sure enough executors are available
  • failFast true, set on the stage that contains the parallel block, cancels the remaining branches if one fails
  • Use parallel for independent stages only: test suites, code quality checks, build targets
  • In Scripted Pipeline, use the parallel() step with a map of closures
14 What are Jenkins Shared Libraries and why do you use them?
Shared Libraries are Groovy code stored in a separate repo and loaded into Jenkins pipelines to share common logic across multiple Jenkinsfiles.

Structure:
  • vars/: global pipeline steps callable as myStep() in any pipeline
  • src/: Groovy classes for more complex logic
  • resources/: static files accessible via libraryResource
Use cases:
  • Standard deploy function used by 20 different service pipelines
  • Common notification logic (Slack alerts on failure)
  • Security scanning steps that must run in every pipeline
// Jenkinsfile
@Library('my-shared-lib') _
pipeline {
  agent any
  stages {
    stage('Deploy') { steps { standardDeploy(env: 'staging') } }
  }
}
Shared libraries are Jenkins' equivalent of reusable workflows in GitHub Actions. Mentioning them shows you've worked in multi-team Jenkins environments.
15 How do you handle credentials securely in Jenkins?
  • Jenkins Credentials Store: store secrets centrally in Jenkins (Manage Jenkins → Credentials). Never hardcode credentials in Jenkinsfiles.
  • Use the credentials() binding in pipelines: environment { AWS_CREDS = credentials('aws-prod-creds') }
  • Credential types: username/password, secret text, SSH key, secret file, certificate
  • Scope: Global (all pipelines) vs System (only Jenkins internals) vs per-folder; use folder-scoped credentials to limit blast radius
  • Credentials are masked in console output automatically when bound via credentials()
  • For cloud auth, prefer role-based auth (IAM roles on EC2 agents) over stored credentials, the same principle as OIDC in GitHub Actions
If asked about secrets rotation, mention integrating Jenkins with Vault or AWS Secrets Manager via plugins; Jenkins' built-in credential store doesn't auto-rotate.
16 How do you configure Jenkins for high availability and scale?
Jenkins HA is complex because the controller is stateful. Approaches:

Scale build capacity (agents):
  • Use the Kubernetes plugin or EC2 plugin to spin up ephemeral agents dynamically; agents provision on demand and terminate after the job
  • Ephemeral agents eliminate "dirty" build environments and scale to zero cost when idle
Controller HA:
  • Jenkins HA plugin (CloudBees): active/standby controllers with shared storage
  • Or run Jenkins on Kubernetes with a persistent volume for JENKINS_HOME; on pod restart it recovers from disk
  • Take regular backups of JENKINS_HOME (jobs, credentials, plugins) using a backup plugin or by snapshotting the PVC
Configuration as Code: use the JCasC (Jenkins Configuration as Code) plugin so controller config is reproducible from a YAML file, which makes disaster recovery from scratch viable.
🚀 Deployment Strategies (5 questions)
17 What is the difference between blue/green, canary, and rolling deployments?
Blue/Green: Two identical environments. Blue = live. Green = new version. Switch traffic 100% from blue to green via load balancer or DNS. Rollback = switch back to blue.
  • Pros: instant rollback, zero downtime, full test of new environment before cutover
  • Cons: doubles infrastructure cost, database migrations must be backward-compatible
Canary: The new version is deployed to a small share of infrastructure (1–10% of traffic). Monitor for errors, then gradually increase traffic to the canary until it serves 100%.
  • Pros: real user validation before full rollout, limits blast radius
  • Cons: longer rollout time, requires metrics-based promotion logic, complex routing
Rolling: Replace instances one at a time (or in batches). Old and new versions run simultaneously during rollout.
  • Pros: no extra infrastructure cost, gradual
  • Cons: rollback requires another rolling update, old/new API versions must be compatible during transition
In Kubernetes: Rolling is the default. Blue/Green is done with two Deployments + Service swap. Canary uses weighted routing with a service mesh (Istio) or Argo Rollouts.
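The metrics-based promotion logic that a canary controller such as Argo Rollouts automates can be sketched roughly as follows; the traffic steps and error budget are invented for the example.

```python
# Sketch of metrics-based canary promotion: step traffic up while the
# canary's observed error rate stays under a budget, abort otherwise.
# Weights and threshold are illustrative, not from any real tool.

STEPS = [1, 10, 25, 50, 100]   # percent of traffic sent to the canary
ERROR_BUDGET = 0.01            # abort if more than 1% of requests fail

def promote(observed_error_rates: list) -> str:
    """Walk the traffic steps (one observation per step);
    return 'promoted' or the point of rollback."""
    for weight, err in zip(STEPS, observed_error_rates):
        if err > ERROR_BUDGET:
            return f"rolled back at {weight}%"
    return "promoted"

print(promote([0.001, 0.002, 0.001, 0.003, 0.002]))  # healthy canary
print(promote([0.001, 0.08]))                        # errors spike at 10%
```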
18 What is a feature flag and how does it decouple deployment from release?
A feature flag (feature toggle) is a conditional in code that enables or disables functionality at runtime without deploying new code.

Deployment vs Release:
  • Deployment = getting code onto servers
  • Release = making a feature visible to users
With feature flags you can deploy code daily (even incomplete features) while releasing only when business-ready. This is trunk-based development's superpower.

Use cases:
  • A/B testing: show a feature to 50% of users, measure conversion
  • Gradual rollout: enable for 1% → 10% → 100% of users
  • Kill switch: instantly disable a broken feature without rolling back a deploy
  • Beta programs: enable for internal users or specific accounts first
Tools: LaunchDarkly, AWS AppConfig, Unleash, Flagsmith.
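A minimal percentage-rollout flag can be sketched with a stable per-user hash, so each user gets a consistent answer across requests. The flag name and rollout percentage are invented; a real system would delegate this to one of the tools above.

```python
# Minimal percentage-rollout feature flag. A stable hash of flag + user
# id buckets each user into [0, 100), so decisions are consistent across
# requests. Flag name and percentage are made up for the sketch.
import hashlib

FLAGS = {"new-checkout": 10}  # percent of users who see the feature

def is_enabled(flag: str, user_id: str) -> bool:
    """Stable bucket from a hash; True for roughly FLAGS[flag]% of users."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return bucket < FLAGS.get(flag, 0)

# Same user always gets the same answer; unknown flags are off by default.
print(is_enabled("new-checkout", "user-42"))
```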
The key insight: feature flags make deployment boring and release strategic. Teams that deploy without flags conflate two separate concerns, which slows delivery.
19 How do you safely handle database migrations in a CI/CD pipeline?
Database migrations are the hardest part of zero-downtime deployments because the DB is shared between old and new code during rollout.

The expand-contract (parallel change) pattern:
  • Expand: Add new column/table while keeping old schema. Both old and new code work. Deploy this first.
  • Migrate: Backfill data into new structure. New code reads new column, writes to both.
  • Contract: Remove old column/table in a future deploy once all instances use new code.
Tools: Flyway, Liquibase for versioned migration scripts. Run migrations as a pipeline stage before deploying app code.
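The expand and migrate phases can be sketched with an in-memory SQLite database: the schema change is purely additive, and new code dual-writes while old code keeps working. Table and column names are invented for the demo.

```python
# Expand-contract sketch with sqlite3: add the new column alongside the
# old one and dual-write, so old and new code both work mid-rollout.
# Table and column names are invented for the demo.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
db.execute("INSERT INTO users (name) VALUES ('Ada Lovelace')")

# Expand: additive change only; old code that ignores the column still works
db.execute("ALTER TABLE users ADD COLUMN full_name TEXT")

# Migrate: backfill, then have new code write both columns
db.execute("UPDATE users SET full_name = name WHERE full_name IS NULL")
db.execute("INSERT INTO users (name, full_name) VALUES ('Grace', 'Grace Hopper')")

# Contract happens in a later deploy, once nothing reads `name` anymore
rows = db.execute("SELECT name, full_name FROM users").fetchall()
print(rows)
```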

Never:
  • Drop or rename columns in the same deploy that changes the code referencing them
  • Run migrations inside the deploy that ships the app code; run them as a separate, earlier step
This pattern shows you understand the hardest real-world constraint in zero-downtime deployments. Most candidates skip it.
20 How do you implement an automated rollback strategy in your pipeline?
Triggering rollback: Run post-deploy smoke tests and health checks. On failure, trigger rollback automatically. Common signals: HTTP error rate spike, latency increase, failed health check endpoint.

Rollback mechanisms by platform:
  • Kubernetes: kubectl rollout undo deployment/app reverts to the previous ReplicaSet; the previous image is cached
  • ECS: update the service to the previous task definition revision
  • Blue/Green: switch the load balancer target group back to blue
  • Argo Rollouts: kubectl argo rollouts abort automatically rolls the canary back
What to preserve:
  • Rollback the application code, never the database (schema changes must be backward-compatible)
  • Tag every production image with a git SHA, which makes rollback to any previous version deterministic
  • Store deploy history so you know exactly what image is running in each environment
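Putting the trigger and the deploy history together, the rollback decision itself is simple; this sketch uses invented image tags and an assumed 5% error threshold.

```python
# Sketch of the automatic-rollback decision: after a deploy, watch the
# error rate for a window and pick the previous image from deploy
# history if the new version is unhealthy. All names are illustrative.

DEPLOY_HISTORY = ["app:a1b2c3d", "app:9f8e7d6"]  # newest first, tagged by git SHA
ERROR_THRESHOLD = 0.05                            # assumed failure threshold

def post_deploy_check(error_rate: float) -> str:
    """Return the image that should be running after the check."""
    current, previous = DEPLOY_HISTORY[0], DEPLOY_HISTORY[1]
    if error_rate > ERROR_THRESHOLD:
        return previous   # trigger rollback, e.g. `kubectl rollout undo`
    return current

print(post_deploy_check(0.20))   # unhealthy: roll back to previous tag
print(post_deploy_check(0.01))   # healthy: keep the new deploy
```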
21 How do you design a pipeline that promotes the same artifact through dev โ†’ staging โ†’ production?
The principle is build once, promote many times: never rebuild the artifact for each environment. What changes is the configuration, not the code.

Implementation:
  • Build Docker image tagged with git SHA: app:a1b2c3d
  • Push to container registry (ECR, GHCR, Artifactory)
  • Dev pipeline deploys app:a1b2c3d to dev namespace with dev config (environment variables, secrets, replicas)
  • After dev tests pass, promotion job deploys same app:a1b2c3d to staging with staging config
  • Manual approval gate (or automated based on metrics) promotes same image to production
Configuration separation:
  • Environment-specific values in Kubernetes ConfigMaps/Secrets or Helm values files per environment
  • Twelve-factor app: all config from environment variables, never baked into the image
  • The image is immutable: the same bits that passed all tests are the bits that hit production
Rebuilding for each environment is a common antipattern: "works in staging, broken in prod" is often caused by a build that produced a different artifact. Promote; don't rebuild.
🔐 Secrets & Security in CI/CD (4 questions)
22 How do you prevent secret leakage in CI/CD pipelines?
Secrets leak in three ways: accidentally committed to the repo, printed to build logs, or exposed through pipeline artifacts.

Pre-commit prevention:
  • git-secrets or detect-secrets: pre-commit hooks that block commits containing credential patterns
  • GitHub secret scanning: automatically detects common credential patterns in pushes and alerts or blocks
  • .gitignore for .env, key files, and credentials files, though hooks are more reliable
In-pipeline protection:
  • Never echo secret values to stdout; they'll appear in the logs
  • Use the platform's native secret injection (GitHub Secrets, Jenkins Credentials); both mask values in logs automatically
  • Avoid writing secrets to disk; if unavoidable, clean up in a finally / post-pipeline step
Runtime secrets:
  • Pull secrets from Vault/AWS Secrets Manager at runtime rather than injecting them at build time; this shortens the exposure window
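A toy scanner in the spirit of git-secrets shows how the pre-commit hooks above work: match each line against known credential patterns. The patterns below are a small illustrative subset, not a production ruleset.

```python
# Toy pre-commit-style secret scanner: flag lines matching common
# credential patterns. Illustrative subset only, not a real ruleset.
import re

PATTERNS = {
    "aws_access_key_id": re.compile(r"AKIA[0-9A-Z]{16}"),
    "private_key_header": re.compile(r"-----BEGIN (RSA |EC )?PRIVATE KEY-----"),
    "generic_assignment": re.compile(r"(?i)(password|secret|token)\s*=\s*['\"][^'\"]+['\"]"),
}

def scan(text: str) -> list:
    """Return (line_number, rule_name) for every suspicious line."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                hits.append((lineno, name))
    return hits

sample = 'db_host = "localhost"\npassword = "hunter2"\n'
print(scan(sample))
```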
23 What is SAST vs DAST and how do you integrate them into a pipeline?
SAST (Static Application Security Testing) analyzes source code without running it. Catches: SQL injection patterns, hardcoded secrets, insecure functions, dependency vulnerabilities.
  • Tools: Semgrep, Snyk Code, SonarQube, Bandit (Python), ESLint security plugins
  • Where in the pipeline: early; run on every PR alongside unit tests for a fast feedback loop
  • Also run SCA (Software Composition Analysis) tools such as Snyk, Dependabot, or OWASP Dependency-Check for known CVEs in dependencies
DAST (Dynamic Application Security Testing) tests a running application by sending malicious inputs. Catches: runtime vulnerabilities, auth issues, injection flaws that only manifest at runtime.
  • Tools: OWASP ZAP, Burp Suite Enterprise, Nuclei
  • Where in pipeline: staging environment after deployment. Too slow for every PR.
Shift left: run SAST on every commit, DAST on staging deploys. Don't gate prod on DAST alone; its false positives would block pipelines.
24 How do you sign and verify container images in a CI/CD pipeline?
Image signing provides a cryptographic guarantee that an image came from your trusted pipeline and hasn't been tampered with.

Tools:
  • Cosign (Sigstore): the modern standard. Keyless signing using OIDC identity (no key management).
  • Notation (CNCF): enterprise signing with X.509 certificates, supported by AWS ECR
Pipeline flow with Cosign:
  • Build and push image to registry: docker push registry/app:sha123
  • Sign in CI: cosign sign --key cosign.key registry/app:sha123 (or keyless via OIDC)
  • Signature stored alongside image in registry
Enforce at deploy time:
  • Kubernetes admission controller (Kyverno or OPA/Gatekeeper) rejects unsigned images
  • Policy: only images signed by the CI pipeline's OIDC identity are admitted to the cluster
Keyless signing with Cosign + OIDC is the modern pattern: no key management, and the signing identity is tied to the CI job. Mention Kyverno policy enforcement to show you close the loop.
25 What is supply chain security and what is SLSA?
Software supply chain security addresses attacks that target the build and distribution process rather than the running application itself. The SolarWinds and XZ Utils attacks are canonical examples: the pipeline or a dependency was compromised, not the production system.

SLSA (Supply-chain Levels for Software Artifacts) is a security framework (originated at Google, now maintained under the OpenSSF) with four levels:
  • SLSA 1: the build process is scripted and produces provenance (a signed record of what produced the artifact)
  • SLSA 2: version-controlled build process; a hosted build service generates provenance
  • SLSA 3: hardened build platform; no unreviewed code can influence the build
  • SLSA 4: two-person review for all changes; hermetic, reproducible builds
Practical steps for SLSA 1–2:
  • Generate an SBOM (Software Bill of Materials) with syft or trivy sbom
  • Generate and sign provenance with slsa-github-generator in GitHub Actions
  • Pin all action versions to SHA (not tag) in GitHub Actions
🧪 Testing in Pipelines (5 questions)
26 What is the testing pyramid and how does it inform CI/CD pipeline design?
The testing pyramid (Mike Cohn) describes the ideal distribution of test types:

Unit tests (base, many): test individual functions in isolation. Fast (milliseconds), no external dependencies, cheap to maintain. Run on every commit.

Integration tests (middle, some): test how components interact (service + DB, API client + server). Slower (seconds), need running dependencies. Run on every PR.

E2E / UI tests (top, few): test full user flows through a real browser. Slow (minutes), brittle, expensive. Run on the main branch or pre-release.

Pipeline implications:
  • Put fast tests first; fail early before spending time on slow stages
  • Parallelise integration tests to avoid blocking the pipeline for minutes
  • Don't block every commit on E2E; it destroys developer velocity
  • A heavy E2E suite with no unit tests (the "ice cream cone" antipattern) makes CI slow and unreliable
27 What are flaky tests and how do you handle them in CI?
Flaky tests produce inconsistent results, passing and failing on the same code without any changes. They erode trust in the pipeline ("it's probably just a flaky test") and mask real failures.

Root causes:
  • Race conditions / timing issues (hardcoded sleeps instead of proper waits)
  • Shared mutable state between tests
  • External service dependency (network calls, real databases)
  • Order-dependent tests
Handling strategy:
  • Quarantine: move flaky tests to a separate suite that runs but doesn't block the pipeline. Track and fix them on a schedule.
  • Retry on failure: some CI systems support retry: 2. Acceptable for known-flaky tests, but it masks the root cause.
  • Track flakiness: GitHub Actions and JUnit reporters can identify historically flaky tests. Fix the ones that fail most.
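The "track flakiness" idea can be sketched as a pass/fail-history analysis: a test that sometimes fails on unchanged code is flaky, while one that always fails is simply broken. Test names and the 10% threshold are invented.

```python
# Illustrative flakiness tracker: given pass/fail history per test on
# unchanged code, compute a flake rate and a quarantine list.
# Test names and the 10% threshold are made up for the sketch.

HISTORY = {
    "test_login":    ["pass"] * 50,
    "test_checkout": ["pass"] * 45 + ["fail"] * 5,   # intermittent: flaky
    "test_search":   ["fail"] * 50,                  # always fails: just broken
}

def flake_rate(runs: list) -> float:
    return runs.count("fail") / len(runs)

def quarantine_candidates(history: dict, threshold: float = 0.10) -> list:
    """Flaky = sometimes fails, but not always (that's just broken)."""
    return [name for name, runs in history.items()
            if 0 < flake_rate(runs) < 1 and flake_rate(runs) >= threshold]

print(quarantine_candidates(HISTORY))
```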
Never auto-retry flaky tests and call it fixed. Quarantine + fix is the professional answer; it shows you care about test suite integrity.
28 How do you implement smoke tests and health checks post-deployment?
Health check endpoints: every service should expose /health or /ready:
  • /health (liveness): is the process alive? Return 200 if yes.
  • /ready (readiness): can the service handle traffic? Check the DB connection, cache, and dependencies.
Smoke tests: a small set of critical-path tests run against the deployed environment:
  • Hit the health endpoint and assert 200
  • Test the most critical user journeys (login, key API endpoints)
  • Should run in under 2 minutes; fast validation, not full regression
Pipeline integration:
- name: Wait for deployment
  run: kubectl rollout status deployment/app --timeout=5m
- name: Smoke test
  run: |
    curl -f https://staging.app.com/health
    curl -f https://staging.app.com/api/v1/ping
If smoke tests fail, trigger automatic rollback before the issue reaches more traffic.
29 How do you speed up a slow test suite in CI?
Parallelize:
  • Split test suite across multiple runners using sharding: jest --shard=1/4, pytest-xdist, RSpec parallel
  • In GitHub Actions, use a matrix strategy to run N shards simultaneously
Cache aggressively:
  • Cache node_modules, .venv, and Maven's ~/.m2, keyed on the lockfile hash
  • Cache Docker build layers properly (order Dockerfile instructions from least to most frequently changing)
Run less:
  • Affected-only testing: only run tests for changed modules (Nx, Turborepo, Bazel)
  • Move slow integration/E2E tests off the PR pipeline to a scheduled nightly run
Fix the tests:
  • Profile test runtime; usually a small percentage of tests takes the majority of the time
  • Replace slow real-database integration tests with fast in-memory or Docker-based equivalents
30 How do you test infrastructure code (Terraform, Ansible) in a CI pipeline?
Infrastructure code testing is often overlooked. A layered approach:

Static analysis (fast, every PR):
  • terraform validate: syntax and config validity
  • terraform fmt --check: formatting
  • tflint: provider-specific linting rules
  • tfsec / Checkov: security policy checks (no public S3 buckets, encrypted EBS)
  • ansible-lint for Ansible playbooks
Plan review (on PR):
  • terraform plan against a test account and post the plan diff as a PR comment (Atlantis, Terraform Cloud)
  • Require human approval of destructive changes (resource deletion)
Integration testing (on merge to main):
  • Terratest (Go): provision real infrastructure in a test account, assert it works, destroy it
  • Kitchen-Terraform: similar, for Terraform modules
  • Use isolated AWS accounts per test run; clean up with defer
🛠️ Troubleshooting & Optimization (5 questions)
31 How do you debug a pipeline that works locally but fails in CI?
This is one of the most common CI problems. Systematic approach:

Environment differences to check first:
  • Dependency versions: local has a cached or different version than CI. Fix: commit lockfiles (package-lock.json, Pipfile.lock) and use npm ci, not npm install
  • Environment variables: something set in your local shell profile that CI doesn't have
  • File permissions: scripts not executable in CI (fix: git update-index --chmod=+x script.sh)
  • OS differences: macOS vs Linux (line endings, filesystem case-sensitivity)
  • Network access: CI can't reach an internal service that your machine can
Debugging techniques:
  • Add env and pwd debug steps to print the CI environment
  • GitHub Actions: use tmate action to SSH into a running runner
  • Run the exact CI Docker image locally: docker run --rm -it ubuntu:latest bash
  • Use act (GitHub) to run workflows locally
32 How do you cache dependencies effectively in CI/CD?
Cache key strategy: Key on the lockfile hash. Cache is invalidated automatically when dependencies change.
# GitHub Actions
- uses: actions/cache@v4
  with:
    path: ~/.npm
    key: npm-${{ hashFiles('package-lock.json') }}
    restore-keys: npm-
What to cache by ecosystem:
  • Node.js: ~/.npm or node_modules (prefer npm cache dir)
  • Python: ~/.cache/pip or the virtualenv directory
  • Go: ~/go/pkg/mod
  • Maven: ~/.m2/repository
  • Docker: layer cache via --cache-from or BuildKit registry cache
Docker layer caching:
# Copy dependency files first, install, THEN copy source code
COPY package-lock.json package.json ./
RUN npm ci
COPY . .       # source changes don't invalidate the npm install layer
Order Dockerfile instructions from least to most frequently changing: dependency installs before source code copies. This single optimization often cuts build time by 60-80%.
33 How do you handle multi-service (microservices) CI/CD pipelines?
Microservices pipelines have unique challenges: independent deployability vs. cross-service integration, shared libraries, and testing interactions.

Repo structure:
  • Polyrepo: each service has its own repo and pipeline. Simple isolation, but cross-service changes require coordinating multiple PRs.
  • Monorepo: all services in one repo. Tools like Nx, Turborepo, and Bazel detect which services changed and only run their pipelines.
Versioning and contracts:
  • Each service publishes a versioned artifact (Docker image tagged with git SHA)
  • Use contract testing (Pact) to verify consumer/provider API contracts without deploying all services
  • Maintain backward-compatible APIs: new provider versions shouldn't break existing consumers
Deployment coordination:
  • Deploy services independently; that's the whole point of microservices
  • Use a service mesh or API gateway to handle version routing during transitions
  • Maintain a deployment manifest (which version of each service is in each env); GitOps manages this naturally
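The affected-only detection that monorepo tools implement can be sketched as a mapping from changed file paths to services; the directory layout and service names here are assumptions.

```python
# Monorepo sketch: map changed file paths to the services whose
# pipelines need to run, as Nx/Turborepo/Bazel do far more precisely.
# Directory layout and service names are invented assumptions.

SERVICES = ["auth", "billing", "search"]

def affected_services(changed_files: list) -> set:
    """A service is affected if a file under services/<name>/ changed;
    a change under shared/ affects every service."""
    affected = set()
    for path in changed_files:
        if path.startswith("shared/"):
            return set(SERVICES)
        for svc in SERVICES:
            if path.startswith(f"services/{svc}/"):
                affected.add(svc)
    return affected

print(affected_services(["services/auth/login.py", "docs/readme.md"]))
```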
34 How would you design a CI/CD system from scratch for a team moving from manual deployments?
This is a system design question. Show a phased, pragmatic approach:

Phase 1: automate the build (weeks 1–2):
  • Source control everything if not already (Git, branch protection on main)
  • Set up CI to run tests on every PR; even just "it compiles" is a win
  • Build and push a Docker image on merge to main
Phase 2: automate deployment (weeks 3–4):
  • One-click deploy to staging from the CI/CD tool
  • Manual approval gate to production
  • Document the deploy process as pipeline-as-code (Jenkinsfile/.github/workflows)
Phase 3: add safety nets (month 2):
  • Post-deploy smoke tests with automatic rollback
  • Secrets management via Vault or cloud-native secrets
  • Security scanning integrated into PR pipeline
Phase 4: optimize (month 3+):
  • Measure DORA metrics to establish baselines
  • Improve caching, parallelize tests, reduce pipeline time
  • Move toward Continuous Deployment for lower-risk services
Frame your answer as iterative: "automate the most painful thing first" wins over "build the perfect system." Teams that try to do everything at once usually ship nothing.
35 Tell me about a production incident caused by a CI/CD failure and how you handled it.
This is your war story question. Use the 5-part framework:

1. Setup (30s): What system, what was the business impact? "Our deployment pipeline auto-deployed a breaking API change to production because the integration tests didn't cover that contract."

2. Discovery (45s): How did you find it? "An error rate alert fired 3 minutes after deploy. On-call checked the deploy timeline; the last deploy was the culprit."

3. Diagnosis (60s): Root cause. "A schema change in our user service removed a field that two downstream services still read. The pipeline only tested the service in isolation, not the contract."

4. Remediation (30s): What fixed it? "Rolled back via the pipeline's rollback job. Production restored in 8 minutes."

5. Prevention (30s): What changed? "Added contract tests (Pact) between the three services. The broken contract would now fail the PR pipeline before merge."
Prepare 1–2 real stories before the interview. Specifics (service names, error rates, recovery times) make this answer land; vague stories don't.