// CI/CD study guide

CI/CD Study Guide

35 questions · 7 domains · DevOps engineer level
🔄 Pipeline Fundamentals (5 questions)
01 What is the difference between Continuous Integration, Continuous Delivery, and Continuous Deployment?
Continuous Integration (CI) is the practice of merging code changes frequently and automatically building and testing every commit. The goal is to catch integration issues early.

Continuous Delivery (CD) extends CI by ensuring the codebase is always in a deployable state. Every passing build is a release candidate; deployment to production is a manual, one-click decision.

Continuous Deployment removes the manual gate: every build that passes all automated checks deploys automatically to production with no human approval.

  • CI answers: "Does this code work in isolation and with the rest of the codebase?"
  • Continuous Delivery answers: "Is this code safe to ship whenever we choose?"
  • Continuous Deployment answers: "Is this code deployed automatically every time it passes?"
  • Most mature teams use Continuous Delivery, not Continuous Deployment, because production releases still require business approval gates
Interviewers often test whether you conflate Delivery and Deployment. Be precise: Delivery = always deployable, manual release. Deployment = fully automated release.
02 What are the core stages of a CI/CD pipeline?
A well-structured pipeline typically flows through these stages:
  • Source: trigger on a commit/PR to a branch
  • Build: compile code, build the Docker image, resolve dependencies
  • Test: unit tests, integration tests, linting, static analysis
  • Security scan: SAST (code), SCA (dependencies), image scanning
  • Artifact publishing: push the image to a registry, store build artifacts
  • Deploy to staging: apply to a non-production environment
  • Acceptance/smoke tests: verify the deployment is healthy
  • Deploy to production: with an approval gate (Continuous Delivery) or automatically (Continuous Deployment)
  • Post-deploy verification: smoke tests, synthetic monitoring
Mention that you fail fast: tests that catch the most common issues should run first to avoid wasting time on later stages.
03 What are DORA metrics and why do they matter?
DORA (DevOps Research and Assessment) metrics are the four key indicators that measure software delivery performance:
  • Deployment Frequency: how often you deploy to production. Elite: multiple times per day. Low: less than once a month.
  • Lead Time for Changes: time from commit to production. Elite: under 1 hour. Low: 1–6 months.
  • Change Failure Rate: the percentage of deployments that cause a production incident. Elite: 0–15%. Low: 46–60%.
  • Mean Time to Restore (MTTR): how long it takes to recover from a production failure. Elite: under 1 hour. Low: 1 week–1 month.
They matter because they're validated by research as predictors of organizational performance: teams with elite metrics also have better business outcomes.
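As a quick self-check, the elite/low bands above can be turned into a rough tier lookup. This is an illustrative sketch: the elite and low cutoffs follow the numbers quoted above, while the intermediate "high"/"medium" cutoffs are assumptions, not official DORA thresholds.

```python
# Rough DORA-style tier classification from two of the four metrics.
# Elite/low bands follow the figures above; the middle cutoffs are
# illustrative assumptions, not official DORA values.

def dora_tier(deploys_per_month: float, lead_time_hours: float) -> str:
    """Classify delivery performance from deploy frequency and lead time."""
    if deploys_per_month >= 30 and lead_time_hours < 1:
        return "elite"           # multiple deploys/day, sub-hour lead time
    if deploys_per_month >= 4 and lead_time_hours < 24 * 7:
        return "high"            # roughly weekly deploys, lead time under a week
    if deploys_per_month >= 1:
        return "medium"
    return "low"                 # less than monthly

print(dora_tier(60, 0.5))        # daily-plus deploys, sub-hour lead time
print(dora_tier(0.5, 2000))      # less than monthly, months of lead time
```

Being able to place a past team in a tier this way sets up the interview tip below.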
Know which DORA tier your current/past team falls into. "We deployed weekly and our MTTR was under 2 hours" is far more credible than "we had a good pipeline."
04 What is pipeline drift and how do you prevent it?
Pipeline drift occurs when environments or pipeline configurations diverge over time: production no longer matches staging, or your pipeline definition no longer matches reality.

Common causes:
  • Manual changes made directly to environments without updating IaC
  • Different pipeline configs per branch that diverge independently
  • Pinned dependency versions that differ between stages
  • Environment-specific scripts that aren't version-controlled
Prevention:
  • Store all pipeline config as code (Jenkinsfile, .github/workflows, .gitlab-ci.yml) in the same repo as the application
  • Use identical container images across all stages: build once, promote the same artifact
  • Enforce IaC (Terraform/Ansible) for all environment configuration; no manual console changes
  • Use drift detection tools like terraform plan in CI to catch infrastructure drift
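To make drift detection concrete, here is a toy sketch of the comparison that terraform plan performs: diff the desired state from IaC against the live state. The resource dictionaries and keys are invented for the example; real drift detection queries the cloud provider.

```python
# Toy drift detector: diff desired state (from IaC) against live state.
# `terraform plan` does this against real providers; this sketch only
# illustrates the comparison, using made-up resource attributes.

def find_drift(desired: dict, actual: dict) -> dict:
    """Return {key: (desired, actual)} for every mismatched setting."""
    drift = {}
    for key in desired.keys() | actual.keys():
        if desired.get(key) != actual.get(key):
            drift[key] = (desired.get(key), actual.get(key))
    return drift

desired = {"instance_type": "t3.medium", "encrypted": True}
actual  = {"instance_type": "t3.large",  "encrypted": True}  # manual console change
print(find_drift(desired, actual))
```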
05 A pipeline that used to take 8 minutes now takes 45. Walk me through how you'd diagnose and fix it.
Step 1: identify the bottleneck. Check the pipeline's stage timing history; most CI systems show per-stage duration. Find which stage grew from its baseline.

Step 2: common culprits by stage:
  • Build stage slow: cache miss (Docker layer cache invalidated, npm/pip cache missing). Fix: restore the dependency cache before the build, use multi-stage Docker builds, cache layers properly.
  • Test stage slow: the test count grew, flaky tests are retrying, or there's no parallelism. Fix: split tests across parallel runners, add --shard flags, identify and quarantine flaky tests.
  • Network slow: pulling large base images repeatedly, downloading packages from the internet. Fix: mirror images to an internal registry, cache package downloads.
  • Security scan slow: scanning a growing image with no cache. Fix: scan the base image separately and cache the results; only scan new layers.
Step 3: quick wins. Parallelize independent stages. Move slow optional checks (full integration tests) to a nightly pipeline rather than every commit.
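The "instrument first" idea from Step 1 can be sketched as a small helper that ranks stages by growth against their baseline timings. The stage names and durations here are made up for illustration.

```python
# Illustrative "instrument first" step: compare current per-stage timings
# (seconds) against a historical baseline and rank stages by growth.
# Stage names and numbers are invented for the sketch.

def slowest_regressions(baseline: dict, current: dict) -> list:
    """Stages sorted by how many seconds they grew versus baseline."""
    growth = {stage: current[stage] - baseline.get(stage, 0) for stage in current}
    return sorted(growth.items(), key=lambda kv: kv[1], reverse=True)

baseline = {"build": 120, "test": 240, "scan": 60}    # the old 8-minute pipeline
current  = {"build": 150, "test": 2400, "scan": 90}   # the new 45-minute pipeline
print(slowest_regressions(baseline, current)[0])      # worst offender first
```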
Frame your answer around "instrument first, then optimize": you don't fix what you haven't measured. Mentioning that you looked at historical timing data shows operational maturity.
⚙️ GitHub Actions (6 questions)
06 What are the core components of a GitHub Actions workflow?
  • Workflow: a YAML file in .github/workflows/. Defines the full automation process.
  • Trigger (on:): what starts the workflow, e.g. push, pull_request, schedule, workflow_dispatch, workflow_call
  • Job: a group of steps that run on the same runner. Jobs run in parallel by default.
  • Step: a single command or action within a job. Steps run sequentially.
  • Action: a reusable unit of code (from the GitHub Marketplace or your own repo) invoked with uses:
  • Runner: the machine that executes jobs. GitHub-hosted (ubuntu-latest) or self-hosted.
  • Environment: a deployment target with protection rules and secrets.
on:
  push:
    branches: [main]
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm ci && npm test
07 How do you share data between jobs in GitHub Actions?
Jobs run on separate runners, so they don't share filesystem state. Three mechanisms:
  • Artifacts: actions/upload-artifact / actions/download-artifact. Use for files (build outputs, test reports). Stored in GitHub, retained up to 90 days by default.
  • Job outputs: small string values set via $GITHUB_OUTPUT and referenced with needs.job-name.outputs.key. Use for version numbers, flags, short strings.
  • Cache: actions/cache for dependency caching (node_modules, pip packages). Keyed by a hash of the lockfile.
# In the "build" job: set a step output, then expose it via the job-level
# outputs: block (version: ${{ steps.version.outputs.version }})
- id: version
  run: echo "version=1.2.3" >> $GITHUB_OUTPUT
# In a downstream job that declares "needs: build"
- run: echo ${{ needs.build.outputs.version }}
Artifacts are for files; outputs are for strings. Getting this distinction right shows you've actually built multi-job workflows.
08 What is the difference between GitHub-hosted and self-hosted runners? When do you use each?
GitHub-hosted runners are ephemeral VMs managed by GitHub. Fresh environment every run. Use for: standard builds, open source projects, anything that doesn't need VPC access.
  • Pros: zero maintenance, always up to date, free tier included
  • Cons: no access to internal network, slower for large builds (cache cold), can't install persistent tooling
Self-hosted runners are machines you own and register with GitHub. Use for:
  • Builds that need access to private infrastructure (deploy to internal k8s cluster)
  • Compliance requirements (code never leaves your network)
  • Large compute needs (GPU, high-memory) where GitHub-hosted is too expensive
  • Faster builds with warm caches and pre-installed dependencies
Self-hosted runners on public repos are a security risk: anyone can fork the repo and run malicious code on your machines. Never use self-hosted runners on public repos without strict controls.
09 How do you implement a matrix build in GitHub Actions?
A matrix strategy runs the same job multiple times with different variable values, which is useful for testing against multiple OS versions, Node versions, or environment combinations.
jobs:
  test:
    runs-on: ${{ matrix.os }}
    strategy:
      matrix:
        os: [ubuntu-latest, windows-latest]
        node: [18, 20, 22]
      fail-fast: false  # don't cancel other matrix jobs on failure
    steps:
      - uses: actions/setup-node@v4
        with:
          node-version: ${{ matrix.node }}
      - run: npm test
  • fail-fast: false runs all combinations even if one fails, so you get the full picture
  • include: and exclude: add specific combinations or remove invalid ones
  • The matrix generates N parallel jobs: 2 OS × 3 Node versions = 6 parallel jobs
10 What are reusable workflows and composite actions? When do you use each?
Both allow reuse, but at different levels:

Composite Actions bundle multiple steps into a single action. They run in the calling job's runner. Use for: grouping steps you repeat within workflows (e.g., "setup and authenticate").
# .github/actions/setup-aws/action.yml
inputs:
  role-arn:
    required: true
runs:
  using: composite
  steps:
    - uses: aws-actions/configure-aws-credentials@v4
      with:
        role-to-assume: ${{ inputs.role-arn }}
Reusable Workflows define an entire job (or jobs) that can be called by other workflows. They run on their own runner. Use for: full deployment pipelines you want to centralize across repos.
# called from another workflow:
jobs:
  deploy:
    uses: org/shared-workflows/.github/workflows/deploy.yml@main
    with:
      environment: production
  • Composite = reusable steps within a job
  • Reusable workflow = reusable entire job, can have its own runner and environment
Platform teams should publish reusable workflows in a central repo so all product teams consume the same tested deployment logic; this reduces fragmentation.
11 How do you securely authenticate GitHub Actions to AWS without storing long-term credentials?
Use OIDC (OpenID Connect): GitHub acts as an identity provider and issues short-lived tokens that AWS trusts.

Setup:
  • In AWS: create an IAM OIDC identity provider for token.actions.githubusercontent.com
  • Create an IAM Role with a trust policy that allows the GitHub OIDC provider to assume it, scoped to your specific repo/branch
  • In the workflow: use aws-actions/configure-aws-credentials with role-to-assume
permissions:
  id-token: write  # required for OIDC
  contents: read
steps:
  - uses: aws-actions/configure-aws-credentials@v4
    with:
      role-to-assume: arn:aws:iam::123456789:role/github-deploy
      aws-region: us-east-1
  • No access keys are stored in GitHub Secrets; the workflow receives short-lived session credentials instead of permanent keys
  • The trust policy can restrict to a specific repo, branch, or environment for least privilege
If you're still using long-term access keys in GitHub Secrets for AWS auth, that's an immediate red flag in a security-conscious interview. OIDC is the right answer.
๐Ÿ—๏ธ
Jenkins 5 questions
12 What is a Jenkinsfile and what are the two pipeline syntax options?
A Jenkinsfile is a text file committed to your repo that defines the Jenkins pipeline as code. It enables version-controlled, reviewable pipeline definitions.

Declarative Pipeline: structured, opinionated syntax with a fixed schema. Easier to read and write, with built-in validation:
pipeline {
  agent any
  stages {
    stage('Build') { steps { sh 'npm run build' } }
    stage('Test')  { steps { sh 'npm test' } }
  }
}
Scripted Pipeline: full Groovy code, maximum flexibility, no structural constraints:
node {
  stage('Build') { sh 'npm run build' }
  stage('Test')  { sh 'npm test' }
}
  • Prefer Declarative unless you need logic it can't express; it's more maintainable
  • Declarative supports script {} blocks for when you need Groovy logic inside it
13 How do you implement parallel stages in Jenkins?
In Declarative, use the parallel directive inside a stage:
stage('Test') {
  parallel {
    stage('Unit Tests') {
      steps { sh 'npm run test:unit' }
    }
    stage('Integration Tests') {
      steps { sh 'npm run test:integration' }
    }
    stage('Lint') {
      steps { sh 'npm run lint' }
    }
  }
}
Key considerations:
  • Each parallel branch needs its own agent/executor slot; make sure enough executors are available
  • failFast true, set on the stage that contains the parallel block, cancels the remaining branches if one fails
  • Use parallel for independent stages only: test suites, code quality checks, build targets
  • In Scripted Pipeline, use the parallel() step with a map of closures
14 What are Jenkins Shared Libraries and why do you use them?
Shared Libraries are Groovy code stored in a separate repo and loaded into Jenkins pipelines to share common logic across multiple Jenkinsfiles.

Structure:
  • vars/: global pipeline steps callable as myStep() in any pipeline
  • src/: Groovy classes for more complex logic
  • resources/: static files accessible via libraryResource
Use cases:
  • Standard deploy function used by 20 different service pipelines
  • Common notification logic (Slack alerts on failure)
  • Security scanning steps that must run in every pipeline
// Jenkinsfile
@Library('my-shared-lib') _
pipeline {
  agent any
  stages {
    stage('Deploy') { steps { standardDeploy(env: 'staging') } }
  }
}
Shared libraries are Jenkins' equivalent of reusable workflows in GitHub Actions. Mentioning them shows you've worked in multi-team Jenkins environments.
15 How do you handle credentials securely in Jenkins?
  • Jenkins Credentials Store: store secrets centrally in Jenkins (Manage Jenkins → Credentials). Never hardcode credentials in Jenkinsfiles.
  • Use the credentials() binding in pipelines: environment { AWS_CREDS = credentials('aws-prod-creds') }
  • Credential types: username/password, secret text, SSH key, secret file, certificate
  • Scope: Global (all pipelines) vs System (only Jenkins internals) vs per-folder; use folder-scoped credentials to limit blast radius
  • Credentials are masked in console output automatically when bound via credentials()
  • For cloud auth, prefer role-based auth (IAM roles on EC2 agents) over stored credentials, the same principle as OIDC in GitHub Actions
If asked about secrets rotation, mention integrating Jenkins with Vault or AWS Secrets Manager via plugins; Jenkins' built-in credential store doesn't auto-rotate.
16 How do you configure Jenkins for high availability and scale?
Jenkins HA is complex because the controller is stateful. Approaches:

Scale build capacity (agents):
  • Use the Kubernetes plugin or EC2 plugin to spin up ephemeral agents dynamically; agents provision on demand and terminate after the job
  • Ephemeral agents eliminate "dirty" build environments and scale to zero cost when idle
Controller HA:
  • Jenkins HA plugin (CloudBees): active/standby controllers with shared storage
  • Or run Jenkins on Kubernetes with a persistent volume for JENKINS_HOME; on pod restart it recovers from disk
  • Take regular backups of JENKINS_HOME (jobs, credentials, plugins) using a backup plugin or by snapshotting the PVC
Configuration as Code: use the JCasC (Jenkins Configuration as Code) plugin so controller config is reproducible from a YAML file, which makes disaster recovery from scratch viable.
🚀 Deployment Strategies (5 questions)
17 What is the difference between blue/green, canary, and rolling deployments?
Blue/Green: Two identical environments. Blue = live. Green = new version. Switch traffic 100% from blue to green via load balancer or DNS. Rollback = switch back to blue.
  • Pros: instant rollback, zero downtime, full test of new environment before cutover
  • Cons: doubles infrastructure cost, database migrations must be backward-compatible
Canary: The new version is deployed to a small share of infrastructure (1–10% of traffic). Monitor for errors, then gradually increase traffic to the canary until it serves 100%.
  • Pros: real user validation before full rollout, limits blast radius
  • Cons: longer rollout time, requires metrics-based promotion logic, complex routing
Rolling: Replace instances one at a time (or in batches). Old and new versions run simultaneously during rollout.
  • Pros: no extra infrastructure cost, gradual
  • Cons: rollback requires another rolling update, old/new API versions must be compatible during transition
In Kubernetes: Rolling is the default. Blue/Green is done with two Deployments + Service swap. Canary uses weighted routing with a service mesh (Istio) or Argo Rollouts.
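The metrics-based promotion logic that a canary controller such as Argo Rollouts automates can be sketched roughly as follows; the traffic steps and error budget are invented for the example.

```python
# Sketch of metrics-based canary promotion: step traffic up while the
# canary's observed error rate stays under a budget, abort otherwise.
# Weights and threshold are illustrative, not from any real tool.

STEPS = [1, 10, 25, 50, 100]   # percent of traffic sent to the canary
ERROR_BUDGET = 0.01            # abort if more than 1% of requests fail

def promote(observed_error_rates: list) -> str:
    """Walk the traffic steps (one observation per step);
    return 'promoted' or the point of rollback."""
    for weight, err in zip(STEPS, observed_error_rates):
        if err > ERROR_BUDGET:
            return f"rolled back at {weight}%"
    return "promoted"

print(promote([0.001, 0.002, 0.001, 0.003, 0.002]))  # healthy canary
print(promote([0.001, 0.08]))                        # errors spike at 10%
```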
18 What is a feature flag and how does it decouple deployment from release?
A feature flag (feature toggle) is a conditional in code that enables or disables functionality at runtime without deploying new code.

Deployment vs Release:
  • Deployment = getting code onto servers
  • Release = making a feature visible to users
With feature flags you can deploy code daily (even incomplete features) while releasing only when business-ready. This is trunk-based development's superpower.

Use cases:
  • A/B testing: show a feature to 50% of users, measure conversion
  • Gradual rollout: enable for 1% → 10% → 100% of users
  • Kill switch: instantly disable a broken feature without rolling back a deploy
  • Beta programs: enable for internal users or specific accounts first
Tools: LaunchDarkly, AWS AppConfig, Unleash, Flagsmith.
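A minimal percentage-rollout flag can be sketched with a stable per-user hash, so each user gets a consistent answer across requests. The flag name and rollout percentage are invented; a real system would delegate this to one of the tools above.

```python
# Minimal percentage-rollout feature flag. A stable hash of flag + user
# id buckets each user into [0, 100), so decisions are consistent across
# requests. Flag name and percentage are made up for the sketch.
import hashlib

FLAGS = {"new-checkout": 10}  # percent of users who see the feature

def is_enabled(flag: str, user_id: str) -> bool:
    """Stable bucket from a hash; True for roughly FLAGS[flag]% of users."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return bucket < FLAGS.get(flag, 0)

# Same user always gets the same answer; unknown flags are off by default.
print(is_enabled("new-checkout", "user-42"))
```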
The key insight: feature flags make deployment boring and release strategic. Teams that deploy without flags conflate two separate concerns, which slows delivery.
19 How do you safely handle database migrations in a CI/CD pipeline?
Database migrations are the hardest part of zero-downtime deployments because the DB is shared between old and new code during rollout.

The expand-contract (parallel change) pattern:
  • Expand: Add new column/table while keeping old schema. Both old and new code work. Deploy this first.
  • Migrate: Backfill data into new structure. New code reads new column, writes to both.
  • Contract: Remove old column/table in a future deploy once all instances use new code.
Tools: Flyway, Liquibase for versioned migration scripts. Run migrations as a pipeline stage before deploying app code.
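The expand and migrate phases can be sketched with an in-memory SQLite database: the schema change is purely additive, and new code dual-writes while old code keeps working. Table and column names are invented for the demo.

```python
# Expand-contract sketch with sqlite3: add the new column alongside the
# old one and dual-write, so old and new code both work mid-rollout.
# Table and column names are invented for the demo.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
db.execute("INSERT INTO users (name) VALUES ('Ada Lovelace')")

# Expand: additive change only; old code that ignores the column still works
db.execute("ALTER TABLE users ADD COLUMN full_name TEXT")

# Migrate: backfill, then have new code write both columns
db.execute("UPDATE users SET full_name = name WHERE full_name IS NULL")
db.execute("INSERT INTO users (name, full_name) VALUES ('Grace', 'Grace Hopper')")

# Contract happens in a later deploy, once nothing reads `name` anymore
rows = db.execute("SELECT name, full_name FROM users").fetchall()
print(rows)
```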

Never:
  • Drop or rename columns in the same deploy that changes the code referencing them
  • Run migrations inside the deploy that ships the app code; run them as a separate, earlier step
This pattern shows you understand the hardest real-world constraint in zero-downtime deployments. Most candidates skip it.
20 How do you implement an automated rollback strategy in your pipeline?
Triggering rollback: Run post-deploy smoke tests and health checks. On failure, trigger rollback automatically. Common signals: HTTP error rate spike, latency increase, failed health check endpoint.

Rollback mechanisms by platform:
  • Kubernetes: kubectl rollout undo deployment/app reverts to the previous ReplicaSet; the previous image is cached
  • ECS: update the service to the previous task definition revision
  • Blue/Green: switch the load balancer target group back to blue
  • Argo Rollouts: kubectl argo rollouts abort automatically rolls the canary back
What to preserve:
  • Rollback the application code, never the database (schema changes must be backward-compatible)
  • Tag every production image with a git SHA, which makes rollback to any previous version deterministic
  • Store deploy history so you know exactly what image is running in each environment
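Putting the trigger and the deploy history together, the rollback decision itself is simple; this sketch uses invented image tags and an assumed 5% error threshold.

```python
# Sketch of the automatic-rollback decision: after a deploy, watch the
# error rate for a window and pick the previous image from deploy
# history if the new version is unhealthy. All names are illustrative.

DEPLOY_HISTORY = ["app:a1b2c3d", "app:9f8e7d6"]  # newest first, tagged by git SHA
ERROR_THRESHOLD = 0.05                            # assumed failure threshold

def post_deploy_check(error_rate: float) -> str:
    """Return the image that should be running after the check."""
    current, previous = DEPLOY_HISTORY[0], DEPLOY_HISTORY[1]
    if error_rate > ERROR_THRESHOLD:
        return previous   # trigger rollback, e.g. `kubectl rollout undo`
    return current

print(post_deploy_check(0.20))   # unhealthy: roll back to previous tag
print(post_deploy_check(0.01))   # healthy: keep the new deploy
```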
21 How do you design a pipeline that promotes the same artifact through dev โ†’ staging โ†’ production?
The principle is build once, promote many times: never rebuild the artifact for each environment. What changes is the configuration, not the code.

Implementation:
  • Build Docker image tagged with git SHA: app:a1b2c3d
  • Push to container registry (ECR, GHCR, Artifactory)
  • Dev pipeline deploys app:a1b2c3d to dev namespace with dev config (environment variables, secrets, replicas)
  • After dev tests pass, promotion job deploys same app:a1b2c3d to staging with staging config
  • Manual approval gate (or automated based on metrics) promotes same image to production
Configuration separation:
  • Environment-specific values in Kubernetes ConfigMaps/Secrets or Helm values files per environment
  • Twelve-factor app: all config from environment variables, never baked into the image
  • The image is immutable: the same bits that passed all tests are the bits that hit production
Rebuilding for each environment is a common antipattern: "works in staging, broken in prod" is often caused by a build that produced a different artifact. Promote; don't rebuild.
🔐 Secrets & Security in CI/CD (4 questions)
22 How do you prevent secret leakage in CI/CD pipelines?
Secrets leak in three ways: accidentally committed to the repo, printed to build logs, or exposed through pipeline artifacts.

Pre-commit prevention:
  • git-secrets or detect-secrets: pre-commit hooks that block commits containing credential patterns
  • GitHub secret scanning: automatically detects common credential patterns in pushes and alerts or blocks
  • .gitignore for .env, key files, and credentials files, though hooks are more reliable
In-pipeline protection:
  • Never echo secret values to stdout; they'll appear in the logs
  • Use the platform's native secret injection (GitHub Secrets, Jenkins Credentials); both mask values in logs automatically
  • Avoid writing secrets to disk; if unavoidable, clean up in a finally / post-pipeline step
Runtime secrets:
  • Pull secrets from Vault/AWS Secrets Manager at runtime rather than injecting them at build time; this shortens the exposure window
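A toy scanner in the spirit of git-secrets shows how the pre-commit hooks above work: match each line against known credential patterns. The patterns below are a small illustrative subset, not a production ruleset.

```python
# Toy pre-commit-style secret scanner: flag lines matching common
# credential patterns. Illustrative subset only, not a real ruleset.
import re

PATTERNS = {
    "aws_access_key_id": re.compile(r"AKIA[0-9A-Z]{16}"),
    "private_key_header": re.compile(r"-----BEGIN (RSA |EC )?PRIVATE KEY-----"),
    "generic_assignment": re.compile(r"(?i)(password|secret|token)\s*=\s*['\"][^'\"]+['\"]"),
}

def scan(text: str) -> list:
    """Return (line_number, rule_name) for every suspicious line."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                hits.append((lineno, name))
    return hits

sample = 'db_host = "localhost"\npassword = "hunter2"\n'
print(scan(sample))
```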
23 What is SAST vs DAST and how do you integrate them into a pipeline?
SAST (Static Application Security Testing) analyzes source code without running it. Catches: SQL injection patterns, hardcoded secrets, insecure functions, dependency vulnerabilities.
  • Tools: Semgrep, Snyk Code, SonarQube, Bandit (Python), ESLint security plugins
  • Where in the pipeline: early; run on every PR alongside unit tests for a fast feedback loop
  • Also run SCA (Software Composition Analysis) tools such as Snyk, Dependabot, or OWASP Dependency-Check for known CVEs in dependencies
DAST (Dynamic Application Security Testing) tests a running application by sending malicious inputs. Catches: runtime vulnerabilities, auth issues, injection flaws that only manifest at runtime.
  • Tools: OWASP ZAP, Burp Suite Enterprise, Nuclei
  • Where in pipeline: staging environment after deployment. Too slow for every PR.
Shift left: run SAST on every commit, DAST on staging deploys. Don't gate prod on DAST alone; its false positives would block pipelines.
24 How do you sign and verify container images in a CI/CD pipeline?
Image signing provides a cryptographic guarantee that an image came from your trusted pipeline and hasn't been tampered with.

Tools:
  • Cosign (Sigstore): the modern standard. Keyless signing using OIDC identity (no key management).
  • Notation (CNCF): enterprise signing with X.509 certificates, supported by AWS ECR
Pipeline flow with Cosign:
  • Build and push image to registry: docker push registry/app:sha123
  • Sign in CI: cosign sign --key cosign.key registry/app:sha123 (or keyless via OIDC)
  • Signature stored alongside image in registry
Enforce at deploy time:
  • Kubernetes admission controller (Kyverno or OPA/Gatekeeper) rejects unsigned images
  • Policy: only images signed by the CI pipeline's OIDC identity are admitted to the cluster
Keyless signing with Cosign + OIDC is the modern pattern: no key management, and the signing identity is tied to the CI job. Mention Kyverno policy enforcement to show you close the loop.
25 What is supply chain security and what is SLSA?
Software supply chain security addresses attacks that target the build and distribution process rather than the running application itself. The SolarWinds and XZ Utils attacks are canonical examples: the pipeline or a dependency was compromised, not the production system.

SLSA (Supply-chain Levels for Software Artifacts) is a security framework (originated at Google, now maintained under the OpenSSF) with four levels:
  • SLSA 1: the build process is scripted and produces provenance (a signed record of what produced the artifact)
  • SLSA 2: version-controlled build process; a hosted build service generates provenance
  • SLSA 3: hardened build platform; no unreviewed code can influence the build
  • SLSA 4: two-person review for all changes; hermetic, reproducible builds
Practical steps for SLSA 1–2:
  • Generate an SBOM (Software Bill of Materials) with syft or trivy sbom
  • Generate and sign provenance with slsa-github-generator in GitHub Actions
  • Pin all action versions to SHA (not tag) in GitHub Actions
🧪 Testing in Pipelines (5 questions)
26 What is the testing pyramid and how does it inform CI/CD pipeline design?
The testing pyramid (Mike Cohn) describes the ideal distribution of test types:

Unit tests (base, many): test individual functions in isolation. Fast (milliseconds), no external dependencies, cheap to maintain. Run on every commit.

Integration tests (middle, some): test how components interact (service + DB, API client + server). Slower (seconds), need running dependencies. Run on every PR.

E2E / UI tests (top, few): test full user flows through a real browser. Slow (minutes), brittle, expensive. Run on the main branch or pre-release.

Pipeline implications:
  • Put fast tests first; fail early before spending time on slow stages
  • Parallelise integration tests to avoid blocking the pipeline for minutes
  • Don't block every commit on E2E; it destroys developer velocity
  • A heavy E2E suite with no unit tests (the "ice cream cone" antipattern) makes CI slow and unreliable
27 What are flaky tests and how do you handle them in CI?
Flaky tests produce inconsistent results, passing and failing on the same code without any changes. They erode trust in the pipeline ("it's probably just a flaky test") and mask real failures.

Root causes:
  • Race conditions / timing issues (hardcoded sleeps instead of proper waits)
  • Shared mutable state between tests
  • External service dependency (network calls, real databases)
  • Order-dependent tests
Handling strategy:
  • Quarantine: move flaky tests to a separate suite that runs but doesn't block the pipeline. Track and fix them on a schedule.
  • Retry on failure: some CI systems support retry: 2. Acceptable for known-flaky tests, but it masks the root cause.
  • Track flakiness: GitHub Actions and JUnit reporters can identify historically flaky tests. Fix the ones that fail most.
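The "track flakiness" idea can be sketched as a pass/fail-history analysis: a test that sometimes fails on unchanged code is flaky, while one that always fails is simply broken. Test names and the 10% threshold are invented.

```python
# Illustrative flakiness tracker: given pass/fail history per test on
# unchanged code, compute a flake rate and a quarantine list.
# Test names and the 10% threshold are made up for the sketch.

HISTORY = {
    "test_login":    ["pass"] * 50,
    "test_checkout": ["pass"] * 45 + ["fail"] * 5,   # intermittent: flaky
    "test_search":   ["fail"] * 50,                  # always fails: just broken
}

def flake_rate(runs: list) -> float:
    return runs.count("fail") / len(runs)

def quarantine_candidates(history: dict, threshold: float = 0.10) -> list:
    """Flaky = sometimes fails, but not always (that's just broken)."""
    return [name for name, runs in history.items()
            if 0 < flake_rate(runs) < 1 and flake_rate(runs) >= threshold]

print(quarantine_candidates(HISTORY))
```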
Never auto-retry flaky tests and call it fixed. Quarantine + fix is the professional answer; it shows you care about test suite integrity.
28 How do you implement smoke tests and health checks post-deployment?
Health check endpoints: every service should expose /health or /ready:
  • /health (liveness): is the process alive? Return 200 if yes.
  • /ready (readiness): can the service handle traffic? Check the DB connection, cache, and dependencies.
Smoke tests: a small set of critical-path tests run against the deployed environment:
  • Hit the health endpoint and assert 200
  • Test the most critical user journeys (login, key API endpoints)
  • Should run in under 2 minutes; fast validation, not full regression
Pipeline integration:
- name: Wait for deployment
  run: kubectl rollout status deployment/app --timeout=5m
- name: Smoke test
  run: |
    curl -f https://staging.app.com/health
    curl -f https://staging.app.com/api/v1/ping
If smoke tests fail, trigger automatic rollback before the issue reaches more traffic.
29 How do you speed up a slow test suite in CI?
Parallelize:
  • Split test suite across multiple runners using sharding: jest --shard=1/4, pytest-xdist, RSpec parallel
  • In GitHub Actions, use a matrix strategy to run N shards simultaneously
Cache aggressively:
  • Cache node_modules, .venv, and Maven's ~/.m2, keyed on the lockfile hash
  • Cache Docker build layers properly (order Dockerfile instructions from least to most frequently changing)
Run less:
  • Affected-only testing: only run tests for changed modules (Nx, Turborepo, Bazel)
  • Move slow integration/E2E tests off the PR pipeline to a scheduled nightly run
Fix the tests:
  • Profile test runtime; usually a small percentage of tests takes the majority of the time
  • Replace slow real-database integration tests with fast in-memory or Docker-based equivalents
30 How do you test infrastructure code (Terraform, Ansible) in a CI pipeline?
Infrastructure code testing is often overlooked. A layered approach:

Static analysis (fast, every PR):
  • terraform validate: syntax and config validity
  • terraform fmt --check: formatting
  • tflint: provider-specific linting rules
  • tfsec / Checkov: security policy checks (no public S3 buckets, encrypted EBS)
  • ansible-lint for Ansible playbooks
Plan review (on PR):
  • terraform plan against a test account and post the plan diff as a PR comment (Atlantis, Terraform Cloud)
  • Require human approval of destructive changes (resource deletion)
Integration testing (on merge to main):
  • Terratest (Go): provision real infrastructure in a test account, assert it works, destroy it
  • Kitchen-Terraform: similar, for Terraform modules
  • Use isolated AWS accounts per test run; clean up with defer
🛠️ Troubleshooting & Optimization (5 questions)
31 How do you debug a pipeline that works locally but fails in CI?
This is one of the most common CI problems. Systematic approach:

Environment differences to check first:
  • Dependency versions: local has a cached or different version than CI. Fix: commit lockfiles (package-lock.json, Pipfile.lock) and use npm ci, not npm install
  • Environment variables: something set in your local shell profile that CI doesn't have
  • File permissions: scripts not executable in CI (fix: git update-index --chmod=+x script.sh)
  • OS differences: macOS vs Linux (line endings, filesystem case-sensitivity)
  • Network access: CI can't reach an internal service that your machine can
Debugging techniques:
  • Add env and pwd debug steps to print the CI environment
  • GitHub Actions: use tmate action to SSH into a running runner
  • Run the exact CI Docker image locally: docker run --rm -it ubuntu:latest bash
  • Use act (GitHub) to run workflows locally
32 How do you cache dependencies effectively in CI/CD?
Cache key strategy: Key on the lockfile hash. Cache is invalidated automatically when dependencies change.
# GitHub Actions
- uses: actions/cache@v4
  with:
    path: ~/.npm
    key: npm-${{ hashFiles('package-lock.json') }}
    restore-keys: npm-
What to cache by ecosystem:
  • Node.js: ~/.npm or node_modules (prefer npm cache dir)
  • Python: ~/.cache/pip or the virtualenv directory
  • Go: ~/go/pkg/mod
  • Maven: ~/.m2/repository
  • Docker: layer cache via --cache-from or BuildKit registry cache
Docker layer caching:
# Copy dependency files first, install, THEN copy source code
COPY package-lock.json package.json ./
RUN npm ci
COPY . .       # source changes don't invalidate the npm install layer
Order Dockerfile instructions from least to most frequently changing: dependency installs before source code copies. This single optimization often cuts build time by 60-80%.
33 How do you handle multi-service (microservices) CI/CD pipelines?
Microservices pipelines have unique challenges: independent deployability vs. cross-service integration, shared libraries, and testing interactions.

Repo structure:
  • Polyrepo: each service has its own repo and pipeline. Simple isolation, but cross-service changes require coordinating multiple PRs.
  • Monorepo: all services in one repo. Tools like Nx, Turborepo, and Bazel detect which services changed and only run their pipelines.
Versioning and contracts:
  • Each service publishes a versioned artifact (Docker image tagged with git SHA)
  • Use contract testing (Pact) to verify consumer/provider API contracts without deploying all services
  • Maintain backward-compatible APIs: new provider versions shouldn't break existing consumers
Deployment coordination:
  • Deploy services independently; that's the whole point of microservices
  • Use a service mesh or API gateway to handle version routing during transitions
  • Maintain a deployment manifest (which version of each service is in each env); GitOps manages this naturally
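The affected-only detection that monorepo tools implement can be sketched as a mapping from changed file paths to services; the directory layout and service names here are assumptions.

```python
# Monorepo sketch: map changed file paths to the services whose
# pipelines need to run, as Nx/Turborepo/Bazel do far more precisely.
# Directory layout and service names are invented assumptions.

SERVICES = ["auth", "billing", "search"]

def affected_services(changed_files: list) -> set:
    """A service is affected if a file under services/<name>/ changed;
    a change under shared/ affects every service."""
    affected = set()
    for path in changed_files:
        if path.startswith("shared/"):
            return set(SERVICES)
        for svc in SERVICES:
            if path.startswith(f"services/{svc}/"):
                affected.add(svc)
    return affected

print(affected_services(["services/auth/login.py", "docs/readme.md"]))
```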
34 How would you design a CI/CD system from scratch for a team moving from manual deployments?
This is a system design question. Show a phased, pragmatic approach:

Phase 1: automate the build (weeks 1–2):
  • Source control everything if not already (Git, branch protection on main)
  • Set up CI to run tests on every PR; even just "it compiles" is a win
  • Build and push a Docker image on merge to main
Phase 2: automate deployment (weeks 3–4):
  • One-click deploy to staging from the CI/CD tool
  • Manual approval gate to production
  • Document the deploy process as pipeline-as-code (Jenkinsfile/.github/workflows)
Phase 3: add safety nets (month 2):
  • Post-deploy smoke tests with automatic rollback
  • Secrets management via Vault or cloud-native secrets
  • Security scanning integrated into PR pipeline
Phase 4: optimize (month 3+):
  • Measure DORA metrics to establish baselines
  • Improve caching, parallelize tests, reduce pipeline time
  • Move toward Continuous Deployment for lower-risk services
Frame your answer as iterative: "automate the most painful thing first" wins over "build the perfect system." Teams that try to do everything at once usually ship nothing.
35 Tell me about a production incident caused by a CI/CD failure and how you handled it.
This is your war story question. Use the 5-part framework:

1. Setup (30s): What system, what was the business impact? "Our deployment pipeline auto-deployed a breaking API change to production because the integration tests didn't cover that contract."

2. Discovery (45s): How did you find it? "An error rate alert fired 3 minutes after deploy. On-call checked the deploy timeline; the last deploy was the culprit."

3. Diagnosis (60s): Root cause. "A schema change in our user service removed a field that two downstream services still read. The pipeline only tested the service in isolation, not the contract."

4. Remediation (30s): What fixed it? "Rolled back via the pipeline's rollback job. Production restored in 8 minutes."

5. Prevention (30s): What changed? "Added contract tests (Pact) between the three services. The broken contract would now fail the PR pipeline before merge."
Prepare 1–2 real stories before the interview. Specifics (service names, error rates, recovery times) make this answer land; vague stories don't.