Interview Questions DevOps — CI/CD, Kubernetes, Terraform, and What Companies Actually Ask

DevOps interviews have moved beyond “explain CI/CD.” Companies now test live troubleshooting, architecture design, and incident response. Here is what to expect — by company type.

DevOps interviews now test what you can do under pressure, not what you can recite from documentation.

The DevOps Interview Landscape

DevOps interviews test three things: whether you can build pipelines, manage infrastructure, and respond to incidents at 2 AM. Indian IT services companies test tool knowledge. Product companies test system design. GCCs test everything.

The role has evolved rapidly. Five years ago, knowing Jenkins and basic Linux was enough. Now, companies expect Kubernetes fluency, Infrastructure as Code expertise, and monitoring/observability knowledge. The interview process reflects this — expect 2-4 rounds covering CI/CD design, container orchestration, infrastructure automation, and incident response scenarios.

This guide covers the actual questions asked in DevOps interviews — organized by domain, with practical answers and the depth interviewers expect. Whether you are targeting a service company or a product company, these are the questions you will face.

The best DevOps engineers are not the ones who know the most tools. They are the ones who can design systems that do not need them at 2 AM.

CI/CD Questions

CI/CD is the foundation of DevOps. Every interview starts here. The questions range from basic pipeline design to complex deployment strategies and database migration handling.

Q1: Design a CI/CD pipeline for a microservices application

Why they ask: This is an architecture question that tests whether you understand the full lifecycle — from code commit to production deployment. It reveals how you think about testing, security, and reliability.

What the interviewer wants: A structured pipeline with clear stages, not just “build, test, deploy.”

# Pipeline Stages (GitHub Actions / Jenkins):

1. Code Quality Gate
   - Lint check, static analysis (SonarQube)
   - Unit tests with coverage threshold (>80%)

2. Build & Package
   - Docker build with multi-stage Dockerfile
   - Tag with git SHA (not "latest")
   - Push to container registry (ECR/GCR)

3. Integration Testing
   - Spin up dependencies (docker-compose)
   - Run API contract tests
   - Run integration test suite

4. Security Scan
   - Container image scan (Trivy/Snyk)
   - Dependency vulnerability check
   - SAST/DAST if applicable

5. Deploy to Staging
   - Kubernetes deployment (Helm chart)
   - Smoke tests against staging
   - Performance baseline check

6. Production Deployment
   - Canary deployment (5% → 25% → 100%)
   - Health check monitoring
   - Automatic rollback on error rate spike

What separates good answers: Mentioning rollback strategies, explaining why you chose canary over blue-green for microservices, and discussing how each service has its own pipeline but shares common templates.
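The stages above can be sketched as a minimal GitHub Actions workflow. This is an illustrative skeleton, not a reference implementation: the job names, `make` targets, registry variable, and Helm chart path are all assumptions.

```yaml
# Illustrative CI/CD skeleton (job names, targets, and paths are assumptions)
name: service-ci
on:
  push:
    branches: [main]

jobs:
  quality:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: make lint test          # lint, static analysis, unit tests with coverage gate

  build:
    needs: quality
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: docker build -t "$REGISTRY/app:$GITHUB_SHA" .
      - run: docker push "$REGISTRY/app:$GITHUB_SHA"   # tag with git SHA, never "latest"

  deploy-staging:
    needs: build
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: helm upgrade --install app ./chart --set image.tag="$GITHUB_SHA"
```

In a real setup, each microservice would get its own copy of this workflow generated from a shared template, which is exactly the point interviewers want you to raise.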

Q2: What is the difference between blue-green and canary deployments?

Why they ask: Deployment strategies directly impact uptime and risk. This tests whether you understand the tradeoffs, not just the definitions.

Blue-Green: Two identical environments. Route all traffic from blue (current) to green (new) at once. Instant rollback by switching back. Downside: requires double the infrastructure, and because the cutover is all-at-once, you never get to validate the new version against a small slice of real traffic first.

Canary: Route a small percentage of traffic (5-10%) to the new version. Monitor error rates and latency. Gradually increase traffic if metrics are healthy. Downside: more complex to implement, requires good monitoring, and both versions run simultaneously (API compatibility required).

When to use which: Blue-green for monoliths or when you need instant cutover. Canary for microservices where you want to validate under real traffic before full rollout. Most product companies in India use canary for their customer-facing services.
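The canary decision logic described above fits in a few lines. This is a sketch: the traffic stages and the 1% error threshold are illustrative assumptions, and in practice `get_error_rate` and `set_traffic_percent` would wrap your service mesh or load balancer API.

```python
# Sketch of a canary promotion loop (stages and threshold are illustrative)
STAGES = [5, 25, 100]      # percent of traffic routed to the new version
MAX_ERROR_RATE = 0.01      # abort the rollout if error rate exceeds 1%

def promote(get_error_rate, set_traffic_percent):
    """Advance the canary through traffic stages, rolling back on error spikes."""
    for percent in STAGES:
        set_traffic_percent(percent)
        if get_error_rate() > MAX_ERROR_RATE:
            set_traffic_percent(0)   # roll back: route all traffic to the old version
            return "rolled-back"
    return "promoted"
```

A healthy service walks through 5% → 25% → 100% and returns "promoted"; an error-rate spike at any stage resets traffic to 0% and returns "rolled-back".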

Q3: How do you handle database migrations in a CI/CD pipeline?

Why they ask: Database migrations are the hardest part of CI/CD. Code deployments can be rolled back instantly — database changes cannot. This tests whether you have dealt with real production systems.

# Tools: Flyway or Liquibase

# Key principles:
1. Migrations are versioned and immutable (V1__create_users.sql)
2. Always forward-only (never edit a migration that has been applied)
3. Separate migration deployment from code deployment
4. Use expand-contract pattern for breaking changes:

   # Step 1 (expand): Add new column, keep old column
   ALTER TABLE users ADD COLUMN email_new VARCHAR(255);

   # Step 2: Deploy code that writes to both columns
   # Step 3: Backfill data from old to new column
   # Step 4: Deploy code that reads from new column only
   # Step 5 (contract): Drop old column after verification

Rollback strategy: Never use DROP or DELETE in migrations without a corresponding undo script. For critical changes, always have a rollback migration ready. Test migrations against a production-size dataset in staging before applying to production.

Kubernetes & Docker Questions

Kubernetes questions appear in 65%+ of DevOps interviews. The questions range from basic concepts to practical troubleshooting scenarios that test real-world experience.

Q1: Explain the difference between a Pod, Deployment, and Service in Kubernetes

Why they ask: This is the most asked Kubernetes question. It tests whether you understand the building blocks and how they relate to each other.

Pod: The smallest deployable unit. Contains one or more containers that share network and storage. Pods are ephemeral — they can be killed and recreated at any time. You almost never create pods directly.

Deployment: Manages a set of identical pods (ReplicaSet). Handles rolling updates, rollbacks, and scaling. When you say “deploy my app,” you create a Deployment, not a Pod. The Deployment ensures the desired number of pods are always running.

Service: Provides a stable network endpoint for a set of pods. Pods get new IPs when recreated — Services provide a consistent DNS name and IP. Types: ClusterIP (internal), NodePort (external via node port), LoadBalancer (external via cloud LB).
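The relationship between the three is easiest to see in a manifest. This is a minimal, hypothetical example: the names, labels, image, and ports are placeholders.

```yaml
# Minimal Deployment + Service pair (names, image, and ports are placeholders)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 3                    # the Deployment keeps 3 identical pods running
  selector:
    matchLabels: {app: web}
  template:
    metadata:
      labels: {app: web}
    spec:
      containers:
        - name: web
          image: example/web:1.0
          ports:
            - containerPort: 8080
---
apiVersion: v1
kind: Service
metadata:
  name: web
spec:
  type: ClusterIP                # stable internal endpoint: web.<namespace>.svc
  selector:
    app: web                     # routes to any ready pod carrying this label
  ports:
    - port: 80
      targetPort: 8080
```

Note that the Service finds pods by label selector, not by name: when the Deployment replaces a pod, the new pod carries the same label and is picked up automatically.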

Q2: How do you debug a CrashLoopBackOff in Kubernetes?

Why they ask: This is a practical troubleshooting question. CrashLoopBackOff is the most common Kubernetes issue and how you debug it reveals your real-world experience.

# Step 1: Check pod status and events
kubectl describe pod <pod-name> -n <namespace>
# Look at: Events section, Exit Code, Restart Count

# Step 2: Check container logs
kubectl logs <pod-name> -n <namespace>
kubectl logs <pod-name> -n <namespace> --previous  # logs from crashed container

# Step 3: Common causes and fixes:
# Exit Code 1: Application error (check logs for stack trace)
# Exit Code 137: OOMKilled (increase memory limits)
# Exit Code 0: Container completed (check if command is correct)

# Step 4: If logs are empty, check if image exists
kubectl get events -n <namespace> --sort-by='.lastTimestamp'

# Step 5: Debug interactively
kubectl run debug --image=<image> --rm -it -- /bin/sh

Q3: What is the difference between a Docker image and a container?

Why they ask: This seems basic but many candidates confuse the two. The analogy that works best: an image is a class, a container is an object (instance of that class).

Image: A read-only template with application code, runtime, libraries, and dependencies. Built from a Dockerfile. Stored in a registry (Docker Hub, ECR, GCR). Immutable — once built, it never changes. Composed of layers (each Dockerfile instruction creates a layer).

Container: A running instance of an image. Has its own writable layer on top of the image layers. Has its own network interface, process space, and filesystem view. Multiple containers can run from the same image simultaneously. When a container is deleted, its writable layer is lost (unless you use volumes).
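The class/instance analogy can even be shown literally. This toy Python snippet is only an illustration of the relationship, not of how Docker works internally.

```python
# The class/instance analogy: one template, many independent instances
class Image:                     # an image is a read-only template
    tag = "example/web:1.0"      # shared, immutable "layers"

c1 = Image()                     # each container is a running instance
c2 = Image()
c1.state = "writable layer"      # per-instance writable state
assert c1.tag == c2.tag          # both instances share the template
assert not hasattr(c2, "state")  # one container's writes do not affect another
```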

DevOps is not about tools — it is about building systems that are reliable, observable, and recoverable.

Terraform & Infrastructure as Code

Infrastructure as Code has become a non-negotiable skill for DevOps roles. Terraform is the most tested IaC tool in interviews, followed by CloudFormation for AWS-specific roles.

Q1: What is Terraform state and why does it matter?

Why they ask: State management is where most Terraform problems occur in production. Understanding state separates someone who has done tutorials from someone who has managed real infrastructure.

What is state: Terraform state is a JSON file that maps your Terraform configuration to real-world resources. When you run terraform apply, Terraform compares your config against the state file to determine what needs to change. Without state, Terraform would not know what it has already created.

# Remote backend configuration (never use local state in production)
terraform {
  backend "s3" {
    bucket         = "my-terraform-state"
    key            = "prod/infrastructure.tfstate"
    region         = "ap-south-1"
    dynamodb_table = "terraform-locks"  # State locking
    encrypt        = true
  }
}

# Why remote state matters:
# 1. Team collaboration (everyone reads the same state)
# 2. State locking (prevents concurrent modifications)
# 3. Encryption at rest (state contains sensitive data)
# 4. Versioning (S3 versioning for state recovery)

Common pitfall: The state file contains sensitive data (database passwords, API keys) in plain text. Never commit it to git. Always use remote backends with encryption enabled.

Q2: How do you handle secrets in Terraform?

Why they ask: Secret management is a security-critical topic. Getting this wrong means credentials in state files, git history, or logs — all of which are security incidents.

# Rule 1: Never hardcode secrets in .tf files
# Rule 2: Mark sensitive variables
variable "db_password" {
  type      = string
  sensitive = true  # Prevents display in logs and plan output
}

# Rule 3: Use a secrets manager
data "aws_ssm_parameter" "db_password" {
  name = "/prod/database/password"
}

# Rule 4: Use environment variables for CI/CD
# export TF_VAR_db_password="secret"
# Terraform automatically reads TF_VAR_ prefixed env vars

# Rule 5: For complex setups, use HashiCorp Vault
provider "vault" {}
data "vault_generic_secret" "db" {
  path = "secret/data/production/db"
}

Key point: Even with sensitive = true, secrets still appear in the state file. This is why remote state with encryption and access controls is mandatory. The state file is the most sensitive artifact in your Terraform workflow.

Monitoring & SRE

Monitoring and SRE questions test whether you can keep systems running in production. These questions are increasingly common as companies adopt SRE practices alongside DevOps.

Q1: How would you set up monitoring for a production application?

Why they ask: This tests whether you understand observability as a system, not just individual tools. The interviewer wants to see a structured approach to monitoring.

The RED Method (for request-driven services):

# RED Method — what to monitor:
# Rate:     Requests per second
# Errors:   Error rate (5xx responses / total responses)
# Duration: Latency (p50, p95, p99)

# Stack: Prometheus + Grafana
# 1. Application metrics (expose /metrics endpoint)
#    - Request count, error count, latency histogram
#    - Business metrics (orders/min, signups/hour)

# 2. Infrastructure metrics (node-exporter)
#    - CPU, memory, disk, network per node
#    - Container resource usage (cAdvisor)

# 3. Alerting rules (Prometheus Alertmanager)
#    - Error rate > 1% for 5 minutes → Page on-call
#    - P99 latency > 2s for 10 minutes → Slack alert
#    - Disk usage > 85% → Ticket

# 4. Dashboards (Grafana)
#    - Service overview (RED metrics per service)
#    - Infrastructure overview (cluster health)
#    - On-call dashboard (active alerts, recent deployments)
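The first alerting rule above could be expressed like this in Prometheus. The metric name `http_requests_total` and its `status` label are conventional but assumed; your instrumentation may differ.

```yaml
# Illustrative Prometheus alerting rule (metric and label names are assumptions)
groups:
  - name: service-alerts
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m])) > 0.01
        for: 5m                            # must hold for 5 minutes before firing
        labels:
          severity: page                   # routes to the on-call pager
        annotations:
          summary: "Error rate above 1% for 5 minutes"
```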

Q2: What is an SLO and how do you define one?

Why they ask: SLOs are the bridge between engineering and business. Understanding the SLI → SLO → SLA chain shows you think about reliability from a user perspective, not just a technical one.

SLI (Service Level Indicator): A measurable metric. Example: the proportion of requests that complete in under 200ms.

SLO (Service Level Objective): A target for the SLI. Example: 99.9% of requests should complete in under 200ms over a 30-day window. This gives you an error budget of 0.1% — roughly 43 minutes of downtime per month.

SLA (Service Level Agreement): A contract with consequences. Example: if availability drops below 99.5%, the customer gets service credits. SLAs are always less strict than SLOs — you need a buffer.

Error budgets: If your SLO is 99.9%, you have a 0.1% error budget. When the budget is healthy, you can deploy faster and take more risks. When the budget is nearly exhausted, you freeze deployments and focus on reliability. This is how SRE teams balance velocity and stability.
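The 43-minute figure quoted earlier follows directly from the arithmetic:

```python
# Error budget for a 99.9% SLO over a 30-day window
slo = 0.999
window_minutes = 30 * 24 * 60           # 43,200 minutes in 30 days
budget_minutes = (1 - slo) * window_minutes
print(round(budget_minutes, 1))         # → 43.2
```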

How to Prepare — By Company Type

DevOps interview preparation varies dramatically by company type. Here is a realistic plan for each:

Service Companies (1-2 weeks)

Focus on tool knowledge: Jenkins/GitHub Actions pipeline syntax, Docker commands and Dockerfile best practices, basic Kubernetes concepts (Pod, Deployment, Service), Linux commands (process management, networking, file permissions), and basic AWS/Azure services. They test breadth — know 10 tools at surface level rather than 3 tools deeply. Practice explaining CI/CD concepts clearly.

Product Companies (3-4 weeks)

Focus on system design and troubleshooting: design a CI/CD pipeline for microservices, debug Kubernetes issues (CrashLoopBackOff, OOMKilled, networking), Terraform state management and module design, monitoring strategy (what to monitor, how to alert, incident response). They give you scenarios and expect you to think through solutions. Practice on a home lab — deploy a real application on Kubernetes, break it, and fix it.

GCCs (4-6 weeks)

Everything from product company prep, plus: coding (Python/Go scripting for automation, writing Kubernetes operators), SRE concepts (SLOs, error budgets, incident management, postmortems), advanced Kubernetes (custom controllers, admission webhooks, network policies), and cloud architecture (multi-region, disaster recovery, cost optimization). GCCs also test behavioral questions heavily — prepare STAR-format stories about incidents you have handled.

Practice With Real Interview Simulations

DevOps interviews test practical skills under pressure. Practice with timed mock interviews that simulate real troubleshooting scenarios and architecture design questions.


DevOps interviews reward builders, not memorizers. The candidate who has deployed a broken app to Kubernetes and fixed it at midnight will always outperform the candidate who read about it in a blog post.

DevOps is one of the fastest-growing roles in India. The demand for engineers who can build reliable pipelines, manage cloud infrastructure, and respond to incidents far exceeds the supply. The interviews are getting harder because the stakes are higher — a bad deployment can cost a company millions. Prepare by building real systems, breaking them intentionally, and learning to fix them under pressure. That experience is worth more than any certification.
