Linux Study Guide

35 questions, 7 domains, engineer level
🐧
Core & FHS 5 questions
01 What is the difference between the Linux kernel and user space?
The kernel runs in privileged mode: scheduling, memory management, drivers, syscalls. User space is everything else — shells, daemons, your apps — talking to hardware only through syscalls.

DevOps relevance: when you tune sysctl, load modules, or debug “kernel” vs “application” issues, you’re straddling this boundary.
Pair with “strace shows syscalls — that’s the boundary in action.”
02 Name important directories under the Filesystem Hierarchy Standard (FHS).
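/bin, /sbin: essential binaries (symlinks into /usr on many current distros); /etc: host configuration; /var: variable data (logs, spool, caches); /tmp: temporary files (often cleared at boot); /usr: shareable, mostly read-only programs, libraries, and documentation; /opt: add-on software; /home: user home directories; /dev: device nodes; /proc, /sys: virtual kernel interfaces.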

Interviewers want intuition: logs in /var/log, service configs in /etc.
03 Symbolic link vs hard link — differences and when each breaks?
Hard link — another directory entry pointing to the same inode; same permissions and data; cannot cross filesystems; deleting one name leaves others.

Symlink — special file containing a path; can span filesystems; if target moves or is deleted, the link is broken (dangling).

Symlinks are usual for “current” release dirs; hard links are rare in ops except dedup or backup tools.
Know readlink -f for canonical symlink target.
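The difference is easy to see in a scratch directory. The sketch below assumes GNU stat (for -c %h); the file names are made up for illustration.

```shell
# Scratch directory; all file names below are illustrative.
dir=$(mktemp -d)
echo "v1" > "$dir/original"

ln "$dir/original" "$dir/hard"       # second name for the same inode
ln -s "$dir/original" "$dir/soft"    # small file that stores a path

links_before=$(stat -c %h "$dir/original")   # link count after adding the hard link

rm "$dir/original"                   # remove the first name only

hard_content=$(cat "$dir/hard")      # data is still reachable via the other name
if [ -L "$dir/soft" ] && [ ! -e "$dir/soft" ]; then
    soft_state="dangling"            # the link exists, its target does not
else
    soft_state="ok"
fi

echo "links=$links_before hard=$hard_content soft=$soft_state"
# prints: links=2 hard=v1 soft=dangling
rm -rf "$dir"
```

The [ -e ] test follows symlinks, which is why it reports the dangling link as missing while [ -L ] still sees the link itself.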
04 What is an inode? What metadata lives there vs in the directory entry?
An inode stores metadata: owner, mode, size, timestamps, pointers to data blocks, link count. The directory entry maps a name → inode number.

ls -i shows inode numbers. Copying a file creates a new inode; moving within the same FS usually renames the directory entry only.

Inode exhaustion: df -h reports free space but df -i shows IUse% at 100%; common with millions of tiny files.
“No space left on device” but df -h OK → check inodes.
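A quick way to confirm the copy-versus-move behaviour is to compare inode numbers. This is a minimal sketch assuming GNU stat and a scratch directory on a single filesystem; the file names are illustrative.

```shell
dir=$(mktemp -d)
echo "data" > "$dir/a"
inode_a=$(stat -c %i "$dir/a")

cp "$dir/a" "$dir/copy"                      # new file, new inode
inode_copy=$(stat -c %i "$dir/copy")

mv "$dir/a" "$dir/renamed"                   # same filesystem: only the name changes
inode_renamed=$(stat -c %i "$dir/renamed")

echo "original=$inode_a copy=$inode_copy renamed=$inode_renamed"
rm -rf "$dir"
```

The original and renamed numbers match, while the copy gets its own inode. ls -i shows the same numbers interactively.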
05 How does systemd relate to SysV init and runlevels?
systemd is the init system + service manager on most distros now. Units are declarative (.service, .timer, .socket). Targets group units (roughly map to runlevels: multi-user.target, graphical.target).

systemctl isolate rescue.target — change target; systemctl get-default — boot target.

Legacy /etc/init.d scripts may still exist via compatibility layers.
🔐
Permissions & ACLs 5 questions
06 Decode chmod 755 and chmod 640 in rwx terms.
Octal digits = owner, group, others. Each digit is 4+2+1 for rwx.

755 = rwxr-xr-x — owner full, group/others read+execute (typical scripts).

640 = rw-r-----: owner rw, group r, others nothing (common for a secrets file whose group is the service's group).
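The 4+2+1 rule can be written out as a small decoder. The helper names digit_to_rwx and mode_to_rwx are hypothetical, and the sketch only handles three-digit modes (no setuid/setgid/sticky digit).

```shell
# digit_to_rwx DIGIT: decode one octal digit into an rwx triplet.
digit_to_rwx() {
    d=$1
    r=-; w=-; x=-
    if [ $((d & 4)) -ne 0 ]; then r=r; fi   # 4 = read
    if [ $((d & 2)) -ne 0 ]; then w=w; fi   # 2 = write
    if [ $((d & 1)) -ne 0 ]; then x=x; fi   # 1 = execute
    printf '%s%s%s' "$r" "$w" "$x"
}

# mode_to_rwx MODE: decode a three-digit mode (owner, group, others).
mode_to_rwx() {
    m=$1
    owner=${m%??}                # first digit
    rest=${m#?}
    group=${rest%?}              # middle digit
    others=${m#??}               # last digit
    printf '%s%s%s\n' "$(digit_to_rwx "$owner")" \
                      "$(digit_to_rwx "$group")" \
                      "$(digit_to_rwx "$others")"
}

mode_to_rwx 755   # prints rwxr-xr-x
mode_to_rwx 640   # prints rw-r-----
```

On a real file, stat -c %A (GNU stat) prints the same string with a leading file-type character.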
07 What is umask and how does it affect new files and directories?
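umask clears permission bits from the mode a process requests (typically 666 for files and 777 for directories): result = requested AND NOT umask. It is a bit mask, not arithmetic subtraction. umask 022 → files 644, dirs 755.

A stricter 027 also drops group write and removes all access for others: files 640, dirs 750. A service's umask can be set with UMask= in its systemd unit.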

Does not change existing files — only new creates.
If uploads are “too open,” check app umask vs ACLs vs inherited parent dir setgid.
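The effect of a mask can be checked directly. The helper name new_modes is hypothetical; the sketch assumes GNU stat and a temp directory without default ACLs or a setgid parent, either of which would change the resulting modes.

```shell
# new_modes MASK: create a file and a directory under MASK,
# then print "<file mode> <dir mode>".
new_modes() {
    (                                # subshell keeps the umask change local
        umask "$1"
        dir=$(mktemp -d)
        touch "$dir/file"            # requests 666, mask is applied by the kernel
        mkdir "$dir/sub"             # requests 777, mask is applied by the kernel
        echo "$(stat -c %a "$dir/file") $(stat -c %a "$dir/sub")"
        rm -rf "$dir"
    )
}

new_modes 022   # prints 644 755
new_modes 027   # prints 640 750
```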
08 Explain setuid, setgid, and the sticky bit with examples.
Setuid (4000): the executable runs with the file owner's effective UID (passwd, sudo). Security-sensitive; Linux ignores the bit on interpreted scripts.

Setgid (2000): on a directory, new files inherit the directory's group; on an executable, the process runs with the file's group.

Sticky (1000): on a world-writable directory such as /tmp, only a file's owner, the directory's owner, or root can delete or rename that file.

Show with ls -l (s/t in the mode string) or audit with find / -perm -4000.
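To see how the three bits show up in a mode string, they can be set on throwaway files. This sketch assumes GNU stat and find; the file and directory names are illustrative, and the empty file named tool is only a stand-in for an executable.

```shell
dir=$(mktemp -d)
mkdir "$dir/shared" "$dir/scratch"
touch "$dir/tool"

chmod 4755 "$dir/tool"       # setuid + rwxr-xr-x
chmod 2775 "$dir/shared"     # setgid + rwxrwxr-x
chmod 1777 "$dir/scratch"    # sticky + rwxrwxrwx

tool_mode=$(stat -c %A "$dir/tool")         # -rwsr-xr-x
shared_mode=$(stat -c %A "$dir/shared")     # drwxrwsr-x
scratch_mode=$(stat -c %A "$dir/scratch")   # drwxrwxrwt

# Same audit idea as find / -perm -4000, limited to the scratch tree.
suid_count=$(( $(find "$dir" -type f -perm -4000 | wc -l) ))

echo "$tool_mode $shared_mode $scratch_mode suid_files=$suid_count"
rm -rf "$dir"
```

The s replaces the owner or group x, and the t replaces the others x; an uppercase S or T would mean the special bit is set without the underlying execute bit.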
09 When would you use POSIX ACLs (setfacl) instead of only chmod/chown?
Classic Unix mode = one owner, one group, others. ACLs grant specific users/groups extra ACEs (e.g. two teams need different RW on same tree).

Use when you can’t refactor group membership or split paths. Downside: harder to reason about, and backup/restore must preserve ACLs (getfacl -R to dump, setfacl --restore to reapply).

Samba/NFS and shared build dirs are common cases.
10 Why is chmod -R 777 almost always wrong in production?
World-writable means any local user or compromised process can alter or trojan files. Breaks least privilege, breaks some daemons that refuse insecure perms, and hides the real fix (correct user, group, or ACL).

Prefer minimal mode + correct service user + possibly setgid directory for collaboration.
If someone “fixed” prod with 777, your answer is “revert, apply proper ownership, audit what broke.”
Processes & systemd 5 questions
11 SIGTERM vs SIGKILL — when do you use each?
SIGTERM (15) — polite: process can handle signal, flush, clean up. Default for kill.

SIGKILL (9) — cannot be caught or ignored; kernel tears down immediately — no cleanup, risk of corrupt state (DBs, caches).

Flow: TERM, wait, then KILL if truly stuck. In Kubernetes, SIGTERM then grace period mirrors this.
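The difference is visible with a worker that traps SIGTERM. In this sketch, worker and run_and_signal are hypothetical helpers, and the worker is only a stand-in for a real service; the short sleeps are there to keep the demo simple.

```shell
log=$(mktemp)

# worker: stand-in for a service that cleans up on SIGTERM.
worker() {
    trap 'echo "cleaned up" >> "$log"; exit 0' TERM
    echo "ready" >> "$log"
    while :; do sleep 1; done
}

# run_and_signal SIGNAL: start a worker, send SIGNAL, print its exit status.
run_and_signal() {
    : > "$log"
    worker > /dev/null &
    pid=$!
    until grep -q ready "$log"; do sleep 0.1; done   # wait until the trap is installed
    kill -s "$1" "$pid"
    wait "$pid"
    echo "$?"
}

term_status=$(run_and_signal TERM)                    # 0: handler ran and exited cleanly
term_cleanup=$(grep -c "cleaned up" "$log" || true)   # 1: cleanup line was written
kill_status=$(run_and_signal KILL)                    # 137 = 128 + 9: killed by signal
kill_cleanup=$(grep -c "cleaned up" "$log" || true)   # 0: no chance to clean up

echo "TERM: status=$term_status cleanup=$term_cleanup"
echo "KILL: status=$kill_status cleanup=$kill_cleanup"
rm -f "$log"
```

The trap runs only after the current sleep 1 returns, so the TERM case can take up to a second; that delay is a small-scale version of the grace period.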
12 What is a zombie process? How do you get rid of it?
A zombie is a terminated child whose parent hasn’t called wait(); the kernel keeps a minimal process-table entry holding its exit status. It uses almost no resources but clutters ps (state Z, shown as <defunct>).

Fix: fix or restart the parent so it reaps its children; if the parent exits, init (PID 1) adopts and reaps them. You cannot kill -9 a zombie (it is already dead).

Many zombies → a buggy parent that never reaps, or a runaway process forking faster than it waits.
Distinguish zombie (same thing as defunct: dead but not yet reaped) from orphan (still running, parent gone, reparented to init).
13 systemctl restart vs reload vs reload-or-restart?
restart — stop then start; drops connections, clears memory state.

reload: runs the unit's ExecReload= (often just sending SIGHUP); the service re-reads its config without a full restart if it implements this (nginx and sshd do).

reload-or-restart — reload if supported, else restart — handy in automation when unit capability varies.
14 How do you use journalctl effectively for service debugging?
journalctl -u nginx -f — follow unit; --since "10 min ago" — window; -b — current boot only; -p err — priority filter; -o json-pretty for structured export.

journalctl _PID=1234: entries logged by that PID. Persistent journals need Storage=persistent in journald.conf, or an existing /var/log/journal directory under the default Storage=auto.

Correlate with app logs in /var/log when services log both places.
Know journalctl -xe: jumps to the newest entries and adds explanatory text; classic first step after a failed deploy or unit start.
15 What happens under memory pressure? What is the OOM killer?
Under pressure Linux first reclaims page cache and swaps out anonymous pages; if that is not enough, the OOM killer selects a process (heuristic badness score, adjustable per process via oom_score_adj) and sends SIGKILL to free RAM.

Symptoms: mysterious restarts, Killed in shell, dmesg / journal OOM lines. Mitigations: limits (systemd MemoryMax=, cgroups), right-size workloads, fix leaks, add RAM/swap carefully.

vm.overcommit_memory tuning affects allocation behavior — know it exists, don’t over-tweak without measurement.
💾
Disk & filesystem 5 questions
16 Why can df and du disagree on disk usage?
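df reports block allocation for the whole filesystem; du walks a directory tree and sums the blocks of the files it can reach. They disagree because of: deleted files still held open (space is held until the FD is closed; check lsof +L1), data hidden underneath other mount points, sparse files (when comparing apparent sizes), and filesystem metadata or reserved blocks.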

Also bind mounts and snapshots (LVM, btrfs) complicate “where” space went.
Classic interview: “log rotated but process still writing deleted file — du path gone, df still full.”
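The deleted-but-open case can be reproduced in a few lines. This sketch assumes Linux with /proc mounted; fd 3 in the current shell plays the role of a daemon's log handle, and app.log is an illustrative name.

```shell
dir=$(mktemp -d)
exec 3> "$dir/app.log"           # keep the file open on descriptor 3
echo "still writing" >&3
rm "$dir/app.log"                # "rotation" removes the name only

if [ -e "$dir/app.log" ]; then name_state=present; else name_state=gone; fi

# /proc shows the old path with a "(deleted)" suffix while the fd is open.
fd_target=$(readlink "/proc/$$/fd/3")
echo "name=$name_state fd=$fd_target"

exec 3>&-                        # closing the descriptor lets the kernel free the blocks
rm -rf "$dir"
```

du no longer sees the file because the name is gone, while df keeps counting its blocks until the descriptor is closed. lsof +L1 lists the same situation for every process on the host.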
17 How do you detect inode exhaustion?
df -i — IUse% at 100%. Errors like “No space left on device” while df -h shows free space often mean inodes.

Fix: delete many small files, restructure (fewer files), or recreate FS with more inodes (mkfs options) — last resort.

Common culprits: session/cache dirs, CI artifacts, mail queues, container layers.
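Once df -i points at a filesystem, the next step is finding which directory holds the entries. The helper count_entries_per_dir is hypothetical; it is a rough sketch that assumes GNU find and directory names without newlines, and it counts each hard-linked name separately.

```shell
# count_entries_per_dir ROOT: print "<entries> <dir>" for each direct
# subdirectory of ROOT, largest first. Each entry is roughly one inode.
count_entries_per_dir() {
    find "$1" -mindepth 1 -maxdepth 1 -xdev -type d | while IFS= read -r d; do
        n=$(( $(find "$d" -xdev | wc -l) ))   # includes the directory itself
        printf '%d %s\n' "$n" "$d"
    done | sort -rn
}

# Illustrative tree: a busy cache directory next to a small log directory.
root=$(mktemp -d)
mkdir "$root/cache" "$root/logs"
touch "$root/cache/1" "$root/cache/2" "$root/cache/3" "$root/cache/4" "$root/cache/5"
touch "$root/logs/app.log" "$root/logs/app.log.1"

top=$(count_entries_per_dir "$root" | head -n 1)
echo "$top"                      # the cache directory, with 6 entries
rm -rf "$root"
```

On a real host the same function would be pointed at the mount point reported by df -i, then re-run on the largest subdirectory to drill down.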
18 What causes a filesystem to mount read-only? How do you investigate?
Causes: disk errors or journal failure (the kernel remounts ro, for example ext4 mounted with errors=remount-ro), admin mount -o remount,ro, cloud hypervisor storage glitch, full disk on root during boot.

Check dmesg, journalctl -b, SMART / cloud console for volume health. mount | column -t shows ro flags.

After fixing the underlying issue, run fsck offline if the kernel log asks for it (ext4 will say so), then remount rw with mount -o remount,rw.
19 What are typical fields in /etc/fstab?
device mountpoint fstype options dump pass — e.g. UUID=... / ext4 defaults 0 1.

pass = fsck order (root usually 1, others 2 or 0). dump legacy backup flag (often 0). Options include noatime, nodev, nfs specifics.

Use blkid for stable UUIDs instead of /dev/sdX names.
20 When do you run fsck and what are the risks?
fsck checks/repairs filesystem metadata. Run on unmounted or read-only FS (or boot maintenance). Running on mounted rw ext4 is unsafe.

Risks: data loss if severe corruption — fsck may delete orphaned inodes. Always have backups; for cloud EBS, snapshot first.

Containers often don’t need fsck on overlay — underlying host volume matters.
Different FS tools: xfs_repair, btrfs check — name awareness scores points.
🌐
Networking 5 questions
21 How do you read ss -tlnp output?
-t TCP, -l listening sockets, -n numeric (no DNS), -p process. Columns show State, Recv-Q/Send-Q, Local Address:Port, Peer, users:(("nginx",pid=...)).

Replaces much of netstat. Use ss -tulpn for UDP too.

If nothing listens on expected port → service down, wrong bind address (127.0.0.1 vs 0.0.0.0), or firewall.
22 How do you debug DNS from the shell?
dig +trace example.com — full resolution path; dig @8.8.8.8 — specific resolver; host/nslookup for quick checks.

Check /etc/resolv.conf, systemd-resolved (resolvectl), VPC DNS settings in cloud.

TTL and negative caching explain “DNS changed but box still resolves old IP.”
Mention getent hosts uses NSS — not only DNS.
23 Useful curl flags for API and load-balancer debugging?
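-v verbose TLS + headers; -I HEAD only; -w '%{http_code}\n' format output; --connect-timeout / --max-time; -H 'Host: ...' for vhost tests (it does not change the TLS SNI; use --resolve host:port:addr to send the real name to a specific backend); -k skip TLS verify (debug only).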

-L follow redirects. Combine with -o /dev/null -s for silent timing checks.
24 What is TCP TIME_WAIT and why do people mention it with high connection churn?
After close, the side that closed first keeps the socket in TIME_WAIT (~2×MSL; fixed at 60 s on Linux) so stray segments don’t confuse new connections with the same quad. Many short-lived outgoing connections → many TIME_WAIT sockets → ephemeral port or conntrack exhaustion.

Mitigations: connection pooling, widen net.ipv4.ip_local_port_range, net.ipv4.tcp_tw_reuse (careful, context-specific), fix the app to reuse connections.

Symptom: “Cannot assign requested address” under load.
25 How do iptables/nftables relate to cloud security groups?
Security groups (AWS etc.) filter outside the instance. iptables/nftables run on the host — another layer (forwarding, DNAT, pod networking).

Both must allow traffic for a service to be reachable. Kubernetes nodes often use iptables/IPVS for Service routing.

Debug: iptables -L -n -v or nft list ruleset (both need root; default rules vary by distro).
“SG says open but connection refused” → process not listening or local firewall.
⌨️
Shell & scripting 5 questions
26 What does set -euo pipefail do in bash?
-e — exit on first command failure (subtleties in conditionals). -u — error on unset variables. -o pipefail — pipeline fails if any stage fails (not only last).

Standard for safer CI/deploy scripts. Combine with IFS=$'\n\t' sometimes for parsing safety.

Know exceptions: some commands you expect to fail need || true or explicit check.
Mention set -x for verbose trace when debugging.
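Each option is easiest to understand in isolation. The sketch below assumes bash is installed; every case runs in its own bash -c so that a failure cannot end the calling shell, and NOT_SET_ANYWHERE is a made-up variable name.

```shell
# pipefail: without it, only the last stage of a pipeline decides the status.
plain=$(bash -c 'false | true; echo "$?"')                      # 0
strict=$(bash -c 'set -o pipefail; false | true; echo "$?"')    # 1

# -e: the script stops at the first unhandled failure, so echo never runs.
after_e=$(bash -c 'set -e; false; echo "still running"' || true)

# -e exception: a failure handled with || does not abort the script.
handled=$(bash -c 'set -e; false || true; echo "still running"')

# -u: expanding an unset variable is fatal instead of silently empty.
after_u=$(bash -c 'unset NOT_SET_ANYWHERE; set -u
                   echo "value=${NOT_SET_ANYWHERE}"; echo "still running"' 2>/dev/null || true)

echo "plain=$plain strict=$strict after_e=[$after_e] handled=[$handled] after_u=[$after_u]"
```

The empty after_e and after_u values show that neither script reached its final echo, while handled did.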
27 Why quote variables in shell scripts ("$var")?
Unquoted expansion undergoes word splitting and globbing — filenames with spaces, empty vars disappearing from argv, accidental * expansion.

"$var" preserves single argument. Exception: rare cases you want splitting — be explicit.

Same for command substitution: $(cmd) in quotes when passing to other commands.
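Counting arguments makes the splitting and globbing visible. The helper count_args and the sample values are illustrative; the sketch assumes the default IFS.

```shell
# count_args: print how many arguments the callee actually received.
count_args() { echo "$#"; }

name="release notes.txt"
empty=""

split_unquoted=$(count_args $name)       # 2: split at the space
whole_quoted=$(count_args "$name")       # 1: one argument, space preserved
empty_unquoted=$(count_args $empty)      # 0: the argument disappears
empty_quoted=$(count_args "$empty")      # 1: an empty string is still passed

# Globbing: an unquoted * expands to the files in the current directory.
dir=$(mktemp -d)
touch "$dir/a" "$dir/b" "$dir/c"
pattern="*"
glob_unquoted=$(cd "$dir" && count_args $pattern)     # 3: a b c
glob_quoted=$(cd "$dir" && count_args "$pattern")     # 1: a literal *
rm -rf "$dir"

echo "$split_unquoted $whole_quoted $empty_unquoted $empty_quoted $glob_unquoted $glob_quoted"
# prints: 2 1 0 1 3 1
```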
28 Why use find ... -print0 | xargs -0?
Plain find | xargs splits its input on whitespace and newlines, so names containing spaces, quotes, or newlines break. -print0 emits NUL-delimited records and xargs -0 reads them, which is safe for arbitrary filenames.

Alternative: find -exec cmd {} + avoids separate xargs process.

Critical for cleanup scripts over user-upload trees.
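A scratch tree with awkward names shows the difference. This is a sketch assuming find and xargs with -print0 / -0 support (GNU or BSD); the file names are invented for the demo.

```shell
dir=$(mktemp -d)
# Three files: one plain, one with a space, one with an embedded newline.
touch "$dir/plain.txt" "$dir/with space.txt" "$dir/$(printf 'two\nlines.txt')"

# Line-based counting is fooled by the newline inside a name.
naive_count=$(( $(find "$dir" -type f | wc -l) ))                        # 4

# Counting NUL terminators gives one record per file.
nul_count=$(( $(find "$dir" -type f -print0 | tr -dc '\0' | wc -c) ))    # 3

# NUL-delimited hand-off deletes exactly those files, whatever their names.
find "$dir" -type f -print0 | xargs -0 rm --
left=$(( $(find "$dir" -type f | wc -l) ))                               # 0

echo "naive=$naive_count nul=$nul_count left=$left"
rm -rf "$dir"
```

The -- after rm stops option parsing, which matters when a name starts with a dash.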
29 Cron vs systemd timer — when prefer each?
Cron — simple schedules, per-user crontabs, universal familiarity.

systemd timers: can depend on other units, log to the journal, support monotonic delays (OnBootSec=, OnUnitActiveSec=) and catch-up runs (Persistent=true), ship easily in packages alongside services, and have clear enable/disable semantics on homogeneous systemd fleets.

Many shops standardize new work on timers; cron won’t disappear soon.
30 What is strace / perf at a high level for troubleshooting?
strace traces syscalls — see file opens, connect failures, hang on read. Use strace -f -p PID or wrap command. Overhead can be high.

perf — CPU profiling, PMCs, flame graphs — find hot code paths.

Pair with app-level tracing. Avoid strace in prod except for brief, controlled captures: it slows the target and can leak paths and arguments.
“EACCES from open()/openat() in strace” → immediate permission debugging win.
🔧
Troubleshooting & behavioral 5 questions
31 A user gets “Permission denied” opening a file — what do you check?
Chain: file mode for user/group/others; correct user/group ownership; parent directory x (execute) for traversal; SELinux/AppArmor denials (ausearch/aa-status); NFS root squash; immutable flag (lsattr); read-only FS.

namei -l /path/to/file shows per-component permissions.

Don’t chmod 777 — fix ownership or ACL.
Order matters: deny on /home before you ever reach the file.
32 Load average is high but CPU idle — what might be wrong?
Load average includes runnable and uninterruptible sleep (D state) tasks — often disk or NFS hangs, not CPU. Check vmstat 1, iostat -xz, ps aux | awk '$8 ~ /^D/'.

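Tasks stuck in uninterruptible kernel waits (for example on a contended kernel lock or a hung mount) also inflate load without using CPU; threads sleeping on ordinary user-space locks do not count.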

Don’t assume “scale out” before finding the I/O or kernel wait chain.
Mention iotop, blktrace for deep disk — name-drop if you’ve used them.
33 How does SELinux or AppArmor change troubleshooting?
Mandatory Access Control can deny operations that DAC (chmod) allows. Symptoms: works with SELinux permissive, fails in enforcing.

Tools: getenforce, sestatus, ausearch -m avc, audit2why; AppArmor aa-status, logs in journal.

Fix with the proper booleans (setsebool), file contexts (semanage fcontext plus restorecon; chcon for a temporary test), or AppArmor profiles; don't disable enforcement permanently without policy approval.
34 What’s in your first 5 minutes on an unfamiliar broken Linux host?
Scope: who/what’s impacted; uptime, load; df -h / df -i; free -m; dmesg -T | tail; journalctl -p err -b --no-pager | tail; check critical service systemctl status; ss -tlnp for listeners.

Recent change? Deploy, kernel update, cert expiry. Capture evidence before restarting things.

Communicate impact and ETA early.
Structured method beats random command soup — interviewers listen for a repeatable playbook.
35 Describe a production Linux incident you debugged and how you verified the fix.
Use STAR: situation (symptom, blast radius), task (SLO/escalation), actions — commands and data you used (strace, inode check, journal, network trace), result — restored service, time to recover.

Verification: synthetic check passes and metrics return to baseline. Post-incident: runbook update, monitoring gap closed, blameless write-up.

Honest reflection beats exaggeration.
One concrete command that found the root cause is more memorable than ten buzzwords.