A deep dive into the Linux internals that matter for application developers: process lifecycle, memory management, networking stack, cgroups, eBPF, and the debugging tools that will save you at 3 AM.
Most application developers treat Linux like a black box. You deploy your code, it runs, and when something goes wrong you Google the error message and paste Stack Overflow commands until things work again. I did this for years. Then I started running production services on bare metal VPS instances, and the black box bit back — hard.
Your Node.js process got killed with no error message. Your Go service runs out of file descriptors at 3 AM. Your Python app is somehow using 4 GB of virtual memory despite holding a 200 MB dataset. Your Docker container creates zombie processes that slowly eat your PID space. Your server stops accepting connections even though CPU and memory look fine.
Every one of these problems has a clear, explainable cause rooted in Linux internals. Not obscure kernel-hacker-only internals — the fundamentals that the operating system is built on. Once you understand them, these problems go from mysterious to obvious. More importantly, you can prevent them instead of reacting to them.
This is not an operating systems textbook. This is the stuff I wish someone had explained to me before I spent years learning it through production incidents.
Every process on Linux begins with a single system call: fork(). When a process calls fork(), the kernel creates an almost identical copy of the calling process. The child gets a new PID, but it inherits a copy of the parent's memory space, file descriptors, signal handlers, and environment. The key word is "copy" — the child is independent. Changing memory in the child does not affect the parent, and vice versa (though the kernel is clever about this, as we will see with copy-on-write).
After fork(), the child process typically calls exec() to replace itself with a different program. This is the fork-exec pattern, and it is how virtually every program gets launched on Linux. When you type ls in your terminal, bash calls fork() to create a child, then the child calls exec("/bin/ls") to become the ls program.
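The fork-exec pattern is easy to reproduce from any language that exposes the raw syscalls. Here is a minimal Python sketch (a demo, not production code):

```python
import os

pid = os.fork()                 # duplicate the current process
if pid == 0:
    # Child: replace this process image with /bin/ls.
    # exec never returns on success -- the child *becomes* ls.
    os.execv("/bin/ls", ["ls", "/tmp"])
else:
    # Parent: wait for the child and collect its exit status
    _, status = os.waitpid(pid, 0)
    print("child exited with", os.WEXITSTATUS(status))
```

This is exactly what bash does on your behalf every time you run a command.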
# Watch the fork-exec dance in real time
strace -f -e trace=clone,execve bash -c "ls /tmp" 2>&1 | head -20

Here is where it gets interesting. When a child process exits, it does not immediately disappear. It enters a state called "zombie" — the process has finished executing, but its entry in the process table persists because the kernel needs to keep the exit status around until the parent reads it with wait() or waitpid(). This is by design. The parent needs to know how its children exited.
In a normal system, init (PID 1) acts as a "reaper" — it adopts orphaned processes and calls wait() on zombie children. Systemd, the most common init system, does this automatically. But in a Docker container, your application is PID 1. If your Node.js app spawns child processes (maybe through child_process.exec() or a native addon) and those children exit, your app needs to call wait() on them. If it does not, you accumulate zombies.
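You can watch a zombie appear and disappear in a few lines of Python; the state letter comes from /proc/&lt;pid&gt;/stat:

```python
import os, time

pid = os.fork()
if pid == 0:
    os._exit(0)                 # child exits immediately

time.sleep(0.2)                 # let the child finish

# Until the parent calls wait(), the child's process-table entry
# sticks around in state "Z"
with open(f"/proc/{pid}/stat") as f:
    state = f.read().rsplit(")", 1)[1].split()[0]
print("state before wait():", state)                      # Z

os.waitpid(pid, 0)              # reap: the zombie disappears
print("still in /proc?", os.path.exists(f"/proc/{pid}"))  # False
```

If the parent never calls waitpid(), that "Z" entry stays until the parent itself exits and init reaps the orphan.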
# Check for zombie processes
ps aux | awk '$8 ~ /^Z/'   # matches "Z", "Z+", "Zs", etc.
# Count zombies on a system
ps -eo stat | grep -c Z

Node.js handles this for child processes it spawns through child_process, but it does not handle zombies created by grandchild processes or native addons. In Docker, the fix is to use an init process like tini as your entrypoint:
RUN apt-get update && apt-get install -y tini
ENTRYPOINT ["tini", "--"]
CMD ["node", "server.js"]

Or use Docker's built-in init flag:
docker run --init your-image

PID namespaces are why your container thinks its main process is PID 1 even though the host sees it as PID 48372. Each PID namespace has its own PID numbering starting from 1. The first process in the namespace becomes PID 1 and takes on the init role — which is exactly why the zombie problem matters in containers.
# See the PID namespace of a running container
ls -la /proc/self/ns/pid
# Compare with the host namespace
sudo ls -la /proc/1/ns/pid

When a parent process dies before its children, the children become orphans. The kernel re-parents orphaned processes to the nearest "subreaper" process or to PID 1. This is usually fine, but it can cause confusion when you are trying to track down process trees. If you kill a shell script that spawned background processes, those processes do not die — they get adopted by init and keep running. This is why you sometimes find stale worker processes from a deploy three weeks ago still consuming resources.
# Mark current process as a subreaper (Linux 3.4+)
# Children of this process will be re-parented to it, not init
prctl(PR_SET_CHILD_SUBREAPER, 1)

Linux memory management is one of those topics where knowing the basics saves you from days of debugging. The first thing to understand is that virtual memory is a lie — a useful lie, but a lie nonetheless.
Every process gets its own virtual address space. On a 64-bit system, this is a 128 TB address space. Your process thinks it has access to a vast, contiguous block of memory. In reality, the kernel uses page tables to map virtual addresses to physical memory pages (typically 4 KB each). Pages that have not been accessed yet do not have physical memory backing them. This is why a freshly forked process does not immediately double your memory usage — the kernel uses copy-on-write, only allocating new physical pages when one of the processes actually writes to a page.
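Demand paging is directly observable: an anonymous mapping inflates the process's virtual size (VmSize) immediately, but resident memory (VmRSS) only grows as pages are actually touched. A small sketch, reading both fields from /proc/self/status:

```python
import mmap, re

def status_kb(field):
    # Parse a line like "VmSize:  123456 kB" out of /proc/self/status
    with open("/proc/self/status") as f:
        return int(re.search(rf"^{field}:\s+(\d+) kB", f.read(), re.M).group(1))

SIZE = 256 * 1024 * 1024                     # 256 MB
vsz0 = status_kb("VmSize")
buf = mmap.mmap(-1, SIZE)                    # anonymous mapping: no RAM used yet
print("VmSize grew by MB:", (status_kb("VmSize") - vsz0) // 1024)

rss0 = status_kb("VmRSS")
for off in range(0, SIZE, 4096):             # touch one byte per 4 KB page
    buf.seek(off)
    buf.write(b"x")
print("VmRSS grew by MB:", (status_kb("VmRSS") - rss0) // 1024)
```

The first delta jumps by the full 256 MB the instant mmap returns; the second only climbs as the loop faults each page in.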
Here is where things get dangerous. By default, Linux overcommits memory. When your application calls malloc(1GB), the kernel says "sure, here you go" — even if there is only 512 MB of free physical memory. The kernel is betting that you will not actually use all of the memory you requested. This is usually a good bet. Most applications allocate more memory than they use because memory allocators request chunks from the OS in large blocks.
The problem comes when the bet fails. If the system actually runs out of physical memory and swap, the kernel has no choice but to kill processes to free memory. This is the OOM (Out of Memory) Killer.
# Check current overcommit setting
cat /proc/sys/vm/overcommit_memory
# 0 = heuristic overcommit (default)
# 1 = always overcommit (never refuse malloc)
# 2 = don't overcommit (strict)
# Check available memory including buffers/cache
free -h
# Check OOM killer scores for running processes
# Higher score = more likely to be killed
for pid in $(ls /proc | grep -E '^[0-9]+$'); do
if [ -f /proc/$pid/oom_score ]; then
echo "PID $pid ($(cat /proc/$pid/comm 2>/dev/null)): $(cat /proc/$pid/oom_score 2>/dev/null)"
fi
done 2>/dev/null | sort -t: -k2 -rn | head -20

The OOM Killer picks its victim using a score derived mostly from each process's memory footprint, adjusted by the tunable oom_score_adj value. It favors killing processes that use a lot of memory but are not critical system services. The kill appears as a SIGKILL (signal 9), which cannot be caught or handled. Your process simply disappears. No error message, no stack trace, no log entry from your application. The only trace is in the kernel log.
# Check if your process was OOM-killed
dmesg | grep -i "oom\|out of memory\|killed process"
# Or via journalctl on systemd systems
journalctl -k | grep -i "oom\|killed process"

You can adjust the OOM score of critical processes to make them less likely to be killed:
# Make a process less likely to be OOM-killed (range: -1000 to 1000)
# Lowering the score requires root, so plain shell redirection won't work
echo -500 | sudo tee /proc/$(pgrep -f "node server.js")/oom_score_adj
# In a systemd unit file
[Service]
OOMScoreAdjust=-500

The lesson here is straightforward: monitor your memory. Set up alerts before you hit the wall. And always check dmesg when a process vanishes without explanation.
On Linux, everything is a file. Network sockets, pipes, device drivers, kernel interfaces, even process information — they are all accessed through file descriptors. A file descriptor is just an integer that serves as a handle to an open resource. When you open a file, the kernel returns the lowest available integer. STDIN is 0, STDOUT is 1, STDERR is 2, and everything else counts up from 3.
This design is elegant because it means one set of system calls (read, write, close, poll) works on almost everything. Your code reads from a TCP socket the same way it reads from a file on disk.
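A small Python sketch makes the point concrete: files and sockets are both just integers, and the same read path works on either:

```python
import os, socket

f = open("/etc/os-release", "rb")     # a regular file
sock = socket.socket()                # a TCP socket
print("file fd:", f.fileno(), "socket fd:", sock.fileno())

# os.read() does not care what kind of resource is behind the integer
print(os.read(f.fileno(), 16))

f.close()
sock.close()
```

Both descriptors count up from 3 because 0, 1, and 2 are already taken by STDIN, STDOUT, and STDERR.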
The /proc filesystem is a virtual filesystem that exposes kernel and process information as files. It does not exist on disk — the kernel generates the content on the fly when you read from it. This is your primary debugging interface for understanding what a running process is doing.
# Everything about a process, all in /proc
ls /proc/$(pgrep -f "node")/
# Memory map of a process
cat /proc/$(pgrep -f "node")/maps
# Open file descriptors
ls -la /proc/$(pgrep -f "node")/fd/
# Current working directory
readlink /proc/$(pgrep -f "node")/cwd
# The actual binary that's running
readlink /proc/$(pgrep -f "node")/exe
# Environment variables (null-separated)
cat /proc/$(pgrep -f "node")/environ | tr '\0' '\n'
# Network connections
cat /proc/$(pgrep -f "node")/net/tcp

While /proc is process-oriented, /sys exposes the device model and kernel subsystems. It is where you go to query or configure hardware, adjust kernel parameters at runtime, and inspect cgroup hierarchies.
# Block device info
cat /sys/block/sda/queue/scheduler
# CPU frequency information
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq
# Network interface statistics
cat /sys/class/net/eth0/statistics/rx_bytes

Signals are software interrupts delivered to processes. They are a core mechanism for process lifecycle management and simple inter-process communication. Understanding the difference between signals is the difference between graceful shutdowns and data corruption.
SIGTERM (15): The polite request. "Please shut down when you're ready." This is what kill sends by default, what Docker sends when you run docker stop, and what Kubernetes sends before pod termination. Your application should handle this signal, close connections, flush buffers, and exit cleanly.
SIGINT (2): The keyboard interrupt. What happens when you press Ctrl+C. Functionally similar to SIGTERM, but conventionally means "the user wants to stop this." Most applications handle it the same as SIGTERM.
SIGKILL (9): The nuclear option. Cannot be caught, blocked, or ignored. The kernel immediately terminates the process. No cleanup handlers run. Buffers are not flushed. Temporary files are not deleted. Database transactions are not rolled back. Use this only as a last resort.
SIGHUP (1): Historically means "the terminal hung up." Modern usage is overloaded — many daemons interpret it as "reload your configuration file." Nginx, for example, reloads its config on SIGHUP.
Every production service should handle SIGTERM. Here is a pattern that works for an HTTP server:
const server = app.listen(3000);
let isShuttingDown = false;
function gracefulShutdown(signal) {
if (isShuttingDown) return;
isShuttingDown = true;
console.log(`Received ${signal}, starting graceful shutdown...`);
// Stop accepting new connections
server.close(() => {
console.log('All connections closed, shutting down.');
process.exit(0);
});
// Force shutdown after timeout
setTimeout(() => {
console.error('Could not close connections in time, forcing shutdown.');
process.exit(1);
}, 30000);
}
process.on('SIGTERM', () => gracefulShutdown('SIGTERM'));
process.on('SIGINT', () => gracefulShutdown('SIGINT'));

The critical detail: Docker sends SIGTERM and waits 10 seconds (configurable via docker stop -t, or stop_grace_period in Compose), then sends SIGKILL. If your shutdown takes longer than 10 seconds, your process gets killed. Kubernetes defaults to 30 seconds (terminationGracePeriodSeconds). Set these values based on how long your longest in-flight request or transaction takes.
# Send SIGTERM to a process
kill $PID # same as kill -15 $PID
# Send SIGHUP to reload config
kill -HUP $PID
# Send SIGKILL as last resort
kill -9 $PID
# Send signal to a process group (negative PID)
kill -TERM -$PGID

If you are still using nohup node server.js & or a screen session to run production services, stop. Systemd is the init system on virtually all modern Linux distributions, and it provides everything you need for reliable service management: automatic restarts, dependency ordering, resource limits, logging, socket activation, and watchdog monitoring.
A systemd unit file tells the system how to run your service. Here is a production-quality example:
[Unit]
Description=My Application Server
After=network-online.target postgresql.service
Wants=network-online.target
Requires=postgresql.service
[Service]
Type=notify
User=appuser
Group=appgroup
WorkingDirectory=/var/www/myapp
ExecStart=/usr/bin/node server.js
ExecReload=/bin/kill -HUP $MAINPID
Restart=on-failure
RestartSec=5
StartLimitIntervalSec=60
StartLimitBurst=3
# Security hardening
NoNewPrivileges=true
ProtectSystem=strict
ProtectHome=true
PrivateTmp=true
ReadWritePaths=/var/www/myapp/data
# Resource limits
MemoryMax=2G
CPUQuota=200%
LimitNOFILE=65535
# Logging
StandardOutput=journal
StandardError=journal
SyslogIdentifier=myapp
# Environment
EnvironmentFile=/etc/myapp/env
# OOM handling
OOMScoreAdjust=-200
[Install]
WantedBy=multi-user.target

The key directives worth understanding:
Type=notify means the service sends a readiness notification to systemd. Better than Type=simple because systemd knows when the service is actually ready, not just when the process started.

Restart=on-failure with RestartSec=5 means systemd restarts the service 5 seconds after a failure. The StartLimitBurst and StartLimitIntervalSec settings prevent restart loops — if the service fails 3 times within 60 seconds, systemd stops trying.

The hardening directives (NoNewPrivileges, ProtectSystem, PrivateTmp) sandbox the service with minimal overhead. These are free security wins.

LimitNOFILE=65535 raises the file descriptor limit for the service, critical for servers handling many concurrent connections.

Socket activation is an underused systemd feature. Instead of your service binding to a port on startup, systemd creates the socket and passes it to your service. This means zero-downtime restarts — the socket stays open while the service restarts, and incoming connections queue up instead of getting refused.
# myapp.socket
[Socket]
ListenStream=3000
Accept=no
[Install]
WantedBy=sockets.target

# Enable socket activation
sudo systemctl enable --now myapp.socket
# The service starts automatically when a connection arrives

The systemd watchdog kills and restarts your service if it stops sending heartbeats. This catches deadlocks and hangs that a simple process check would miss.
[Service]
WatchdogSec=30
# Service must call sd_notify("WATCHDOG=1") every 30 seconds

Containers are not virtual machines. There is no hypervisor, no separate kernel, no hardware emulation. Containers are just regular Linux processes with two kernel features applied to them: namespaces for isolation and cgroups for resource limits. That is it. Understanding this demystifies containers entirely.
Namespaces partition kernel resources so that one set of processes sees one set of resources, and another set of processes sees a different set. Linux has several namespace types:
# See all namespaces for a process
ls -la /proc/$PID/ns/
# Create a new network namespace manually
sudo ip netns add test-ns
sudo ip netns exec test-ns ip addr
# You now have an isolated network environment
# Enter a container's namespaces manually
sudo nsenter -t $CONTAINER_PID -m -u -i -n -p /bin/bash

Cgroups limit, account for, and isolate resource usage. When you set --memory=512m on a Docker container, Docker creates a cgroup with a memory limit of 512 MB. When the processes in that cgroup exceed the limit, the kernel's OOM killer activates — but only within that cgroup, not system-wide.
# Cgroup v2 hierarchy (modern systems)
ls /sys/fs/cgroup/
# Check memory limit of a cgroup
cat /sys/fs/cgroup/system.slice/docker-$CONTAINER_ID.scope/memory.max
# Check current memory usage
cat /sys/fs/cgroup/system.slice/docker-$CONTAINER_ID.scope/memory.current
# CPU limits
cat /sys/fs/cgroup/system.slice/docker-$CONTAINER_ID.scope/cpu.max

This is why container memory limits are important. Without a limit, a runaway process in a container can consume all host memory and trigger the host-level OOM killer, potentially taking down other containers. With a limit, the damage is contained.
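From inside a process you can find your own cgroup and read the limit the kernel will enforce on you. A sketch that assumes the unified (v2) hierarchy and falls back gracefully elsewhere:

```python
# Find this process's cgroup and read its memory limit (cgroup v2).
with open("/proc/self/cgroup") as f:
    # v2 has a single line of the form "0::/some/path"
    cg_path = f.read().strip().rsplit(":", 1)[-1]

try:
    with open(f"/sys/fs/cgroup{cg_path}/memory.max") as f:
        print("memory.max:", f.read().strip())   # "max" means no limit
except OSError:
    print("no memory.max here (cgroup v1, or controller not enabled)")
```

Applications sometimes use this to size their own caches to the container limit rather than to total host RAM.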
When a packet arrives at your server's network interface, it takes a well-defined path through the kernel before your application sees it. Understanding this path explains why certain tuning parameters exist and why certain problems occur.
NIC receives the packet: The network interface card receives the electrical/optical signal and writes the packet data into a ring buffer in kernel memory via DMA (Direct Memory Access). The NIC triggers a hardware interrupt.
Interrupt and NAPI: The kernel acknowledges the interrupt and schedules a softirq to process the packets. Modern Linux uses NAPI (New API), which switches from interrupt-driven to polling mode under high load to prevent interrupt storms.
Netfilter/iptables: The packet passes through the netfilter framework, where iptables rules are evaluated. This happens at multiple points: PREROUTING (before routing decision), INPUT (destined for local process), FORWARD (being routed through), OUTPUT (generated locally), and POSTROUTING (leaving the system).
Connection tracking (conntrack): The netfilter connection tracking subsystem records the state of each network connection. This is what enables stateful firewall rules and NAT. It is also a common source of production problems when the conntrack table fills up.
Socket lookup: The kernel finds the socket that matches the packet's destination address and port.
Socket receive buffer: The packet data is placed into the socket's receive buffer. If the buffer is full, the packet is dropped.
Application read: Your application calls read() or recv() on the socket, which copies data from the kernel's socket buffer into user-space memory.
# Watch packets flow through netfilter
sudo iptables -t raw -A PREROUTING -p tcp --dport 3000 -j TRACE
sudo journalctl -k | grep TRACE
# Connection tracking table
sudo conntrack -L
sudo conntrack -C # count
# Check conntrack table size and maximum
sysctl net.netfilter.nf_conntrack_count
sysctl net.netfilter.nf_conntrack_max

This is a classic production problem. The conntrack table has a finite size (default is typically 65536 on many systems). When it fills up, new connections are dropped silently. You see this as random connection failures that look like network issues but are actually the kernel refusing to track new connections.
# Increase conntrack table size
sudo sysctl -w net.netfilter.nf_conntrack_max=262144
# Check if you're dropping packets due to conntrack
sudo dmesg | grep "table full"
# Or check the counter
cat /proc/sys/net/netfilter/nf_conntrack_count

TCP has dozens of tunable parameters, but most of them are fine at their defaults. Here are the ones that actually cause production problems.
When a TCP connection closes, the side that initiates the close enters the TIME_WAIT state for 2 times the Maximum Segment Lifetime (typically 60 seconds total on Linux). This exists to ensure that delayed packets from the old connection do not get confused with packets from a new connection on the same port. In production, a server handling many short-lived connections can accumulate thousands of TIME_WAIT sockets, eventually exhausting ephemeral ports.
# Count TIME_WAIT connections
ss -s | grep -i timewait
# Or more detail
ss -tan state time-wait | wc -l
# Enable TIME_WAIT socket reuse (safe for clients)
sudo sysctl -w net.ipv4.tcp_tw_reuse=1
# Check ephemeral port range
sysctl net.ipv4.ip_local_port_range
# Default: 32768 - 60999 (28,232 ports)
# Expand ephemeral port range
sudo sysctl -w net.ipv4.ip_local_port_range="1024 65535"

The TCP listen backlog is a queue for connections that have completed the three-way handshake but have not yet been accepted by the application. If your application is slow to call accept(), this queue fills up and new connections are either dropped or receive RST packets.
# Check current listen backlog
ss -tlnp | grep :3000
# The default maximum backlog size
sysctl net.core.somaxconn
# Default: 4096 on modern kernels, 128 on older ones
# Increase if needed
sudo sysctl -w net.core.somaxconn=65535
# Check for SYN queue overflow (SYN flood or slow accept)
netstat -s | grep "SYNs to LISTEN"

Your application also needs to set the backlog in its listen() call. In Node.js, server.listen({ port: 3000, backlog: 511 }). Nginx defaults to 511. If either the application backlog or somaxconn is lower, the lower value wins.
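The clamping is easy to check from userspace: whatever you pass to listen(), the kernel silently caps it at net.core.somaxconn. A quick sketch:

```python
import socket

with open("/proc/sys/net/core/somaxconn") as f:
    somaxconn = int(f.read())

s = socket.socket()
s.bind(("127.0.0.1", 0))
s.listen(65535)            # the kernel silently caps this at somaxconn
print("requested 65535, effective backlog:", min(65535, somaxconn))
s.close()
```

Raising only the application-side number without raising somaxconn (or vice versa) accomplishes nothing, which is a common misconfiguration.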
TCP keepalive sends probe packets on idle connections to detect if the other side has disappeared (crashed, network failure, etc.). The defaults are extremely conservative — the first probe is sent after 2 hours of inactivity. For server applications, you almost always want to reduce these values:
# Default keepalive settings
sysctl net.ipv4.tcp_keepalive_time # 7200 seconds (2 hours!)
sysctl net.ipv4.tcp_keepalive_intvl # 75 seconds between probes
sysctl net.ipv4.tcp_keepalive_probes # 9 probes before declaring dead
# Reasonable production settings
sudo sysctl -w net.ipv4.tcp_keepalive_time=300
sudo sysctl -w net.ipv4.tcp_keepalive_intvl=30
sudo sysctl -w net.ipv4.tcp_keepalive_probes=5

Nagle's algorithm batches small TCP writes into larger segments to reduce overhead. This is usually good for bulk data transfer but terrible for interactive protocols or WebSocket traffic where latency matters. If you are seeing strange 40ms delays in your application, Nagle's algorithm combined with TCP delayed ACK is probably the culprit.
// Disable Nagle's algorithm in Node.js
socket.setNoDelay(true);
// In a net.Server
server.on('connection', (socket) => {
socket.setNoDelay(true);
});

Disk I/O is the slowest thing your application does, by orders of magnitude. A random read from an NVMe SSD takes about 100 microseconds. A random read from a spinning HDD takes about 10 milliseconds. For context, reading from L1 CPU cache takes about 1 nanosecond. That means a disk read is 100,000 to 10,000,000 times slower than a cache hit.
Linux aggressively caches file data in RAM. When you read a file, the data goes into the page cache. Subsequent reads of the same data come from RAM, not disk. When you write to a file, the data goes into the page cache and is written back to disk later by the kernel's writeback threads. This is why free -h shows much of your RAM used by "buff/cache" — that is the page cache, and it is a good thing.
# See page cache usage
free -h
# The "available" column shows how much memory is truly available
# (free + reclaimable cache)
# Drop caches (for testing only, never in production; needs root)
echo 3 | sudo tee /proc/sys/vm/drop_caches
# Check if a specific file is in the page cache
vmtouch /var/log/syslog

The page cache creates a window where data exists only in RAM and has not reached the disk. If the system crashes (power failure, kernel panic) during this window, that data is lost. This is why databases use fsync() — it forces the kernel to flush dirty pages to disk and waits for the disk to confirm the write is durable.
Calling fsync() is expensive. A single fsync() on a spinning disk takes at least one rotation (8ms at 7200 RPM). Even on SSDs, fsync() is relatively expensive because the drive needs to flush its internal write cache. This is the fundamental tension in storage: performance versus durability.
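You can measure the gap yourself: write() returns as soon as the data is in the page cache, while fsync() blocks until storage acknowledges it. A sketch (note the caveat: tempfile usually lands in /tmp, which may be tmpfs, so the fsync cost there can be near zero):

```python
import os, tempfile, time

fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    data = b"x" * (4 * 1024 * 1024)    # 4 MB

    t0 = time.perf_counter()
    f.write(data)                      # returns once data is in the page cache
    t_write = time.perf_counter() - t0

    t0 = time.perf_counter()
    f.flush()
    os.fsync(f.fileno())               # blocks until storage confirms the write
    t_fsync = time.perf_counter() - t0

print(f"write(): {t_write * 1e3:.2f} ms, fsync(): {t_fsync * 1e3:.2f} ms")
os.remove(path)
```

Run it against a path on a real disk and the asymmetry is usually dramatic; this is exactly the cost databases pay for durability on every commit.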
# Monitor disk I/O in real time
iostat -xz 1
# Check disk write throughput and latency
# %util near 100% means the disk is saturated
# await is average I/O latency in milliseconds
# Check how much dirty data is waiting to be written
cat /proc/meminfo | grep Dirty
# Tune writeback behavior
sysctl vm.dirty_ratio # % of RAM before synchronous writeback (default 20)
sysctl vm.dirty_background_ratio # % of RAM before background writeback (default 10)

The kernel's I/O scheduler determines the order in which disk requests are processed. For SSDs, none (noop) or mq-deadline are appropriate because SSDs have no seek penalty. For HDDs, bfq or mq-deadline work well because they can coalesce and reorder requests to minimize seek time.
# Check current I/O scheduler
cat /sys/block/sda/queue/scheduler
# Change I/O scheduler (runtime)
echo mq-deadline | sudo tee /sys/block/sda/queue/scheduler

These are the tools I reach for first when diagnosing production issues. Learn them before you need them.
strace traces system calls made by a process. It shows you every interaction between your application and the kernel — every file open, every network call, every memory allocation. It is the single most useful debugging tool on Linux.
# Trace a running process
sudo strace -p $PID
# Trace with timestamps and follow child processes
sudo strace -f -tt -p $PID
# Only trace network-related syscalls
sudo strace -e trace=network -p $PID
# Only trace file-related syscalls
sudo strace -e trace=file -p $PID
# Trace a new process and count syscall frequency
strace -c ls /tmp

A word of caution: strace significantly slows down the traced process (often 10-100x). Do not use it on production processes under heavy load. For production tracing, use eBPF-based tools instead.
ss replaced netstat and is faster because it queries the kernel directly over netlink (the sock_diag interface) instead of parsing /proc/net. Use it for everything network-related.
# All TCP connections with process info
ss -tlnp
# Connections in TIME_WAIT state
ss -tan state time-wait
# Connections to a specific port
ss -tan dst :5432
# Summary statistics
ss -s
# Show socket buffer sizes
ss -tm

perf is the kernel's built-in profiler. It can profile CPU usage, cache misses, page faults, and much more using hardware performance counters.
# Profile CPU usage of a process for 10 seconds
sudo perf record -p $PID -g -- sleep 10
sudo perf report
# Count page faults
sudo perf stat -e page-faults -p $PID -- sleep 5
# See what functions are consuming CPU in real time
sudo perf top -p $PID

When something goes wrong at the kernel level — OOM kills, hardware errors, filesystem corruption, network driver issues — the evidence is in the kernel log.
# Kernel messages (recent, with timestamps)
dmesg -T | tail -50
# Follow kernel messages in real time
dmesg -w
# All logs for a specific service
journalctl -u myapp.service --since "1 hour ago"
# Kernel messages only
journalctl -k
# Follow logs in real time
journalctl -u myapp.service -f

# lsof: list open files and network connections for a process
lsof -p $PID
# vmstat: virtual memory statistics (1 second interval)
vmstat 1
# sar: system activity report (historical data)
sar -u 1 10 # CPU usage, 10 samples, 1 second apart
sar -r 1 10 # Memory usage
sar -n DEV 1 # Network interface statistics
# htop: interactive process viewer (better than top)
htop
# nstat: network statistics counters
nstat -sz

eBPF (extended Berkeley Packet Filter) lets you run sandboxed programs inside the Linux kernel without modifying kernel source code or loading kernel modules. It has fundamentally changed how we observe and debug production systems.
Before eBPF, you had two options for kernel-level visibility: modify the kernel source and compile a custom kernel, or use kernel modules that could crash your system if buggy. eBPF programs are verified by the kernel before execution — they cannot crash the kernel, access arbitrary memory, or run forever. This makes them safe for production use.
The bcc and bpftrace toolkits provide ready-to-use eBPF tools for common debugging tasks:
# Trace all files opened by a process
sudo opensnoop-bpfcc -p $PID
# Trace TCP connections
sudo tcpconnect-bpfcc
# Trace high-latency disk I/O
sudo biolatency-bpfcc
# Trace DNS requests
sudo gethostlatency-bpfcc
# Custom bpftrace one-liner: histogram of read() sizes
sudo bpftrace -e 'tracepoint:syscalls:sys_exit_read /pid == '$PID'/ { @bytes = hist(args->ret); }'
# Trace TCP retransmits (indicates network problems)
sudo tcpretrans-bpfcc
# Profile function call latency
sudo funclatency-bpfcc -p $PID 'readline'

eBPF is not just for debugging. It powers modern networking (Cilium uses eBPF for Kubernetes networking), security monitoring (Falco, Tetragon), and observability tools. If you are not familiar with eBPF yet, it is worth investing time in — it is becoming the standard way to extend and observe the Linux kernel.
These are the issues I have seen repeatedly across different teams and services. Every one of them is preventable.
Every open file, socket, pipe, and epoll instance uses a file descriptor. The default limit per process is typically 1024, which is far too low for a server handling many concurrent connections. When you hit the limit, calls to open(), socket(), and accept() fail with EMFILE (too many open files).
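You can provoke EMFILE deliberately by shrinking the soft limit, which is also how you test your application's behavior at the limit. A sketch using Python's resource module:

```python
import errno, resource

soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
resource.setrlimit(resource.RLIMIT_NOFILE, (32, hard))  # shrink the soft limit

files, err = [], None
try:
    for _ in range(64):                 # runs out well before 64 opens
        files.append(open("/dev/null"))
except OSError as e:
    err = e.errno
finally:
    for f in files:
        f.close()
    resource.setrlimit(resource.RLIMIT_NOFILE, (soft, hard))  # restore

print("failed with EMFILE:", err == errno.EMFILE)
```

Note that only the soft limit can be lowered and restored without privileges; raising the hard limit requires root (or a LimitNOFILE= directive in the unit file).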
# Check current limits for a process
cat /proc/$PID/limits | grep "open files"
# Check how many FDs a process is using
ls /proc/$PID/fd | wc -l
# Set limits in systemd unit file
[Service]
LimitNOFILE=65535
# Set system-wide limits
# /etc/security/limits.conf
# appuser soft nofile 65535
# appuser hard nofile 65535
# Also ensure the system-wide maximum is high enough
sysctl fs.file-max
sudo sysctl -w fs.file-max=2097152

The fix is simple: raise the limit. But also investigate why you are using so many file descriptors. Common culprits include connection pools that do not close connections, leaked socket handles, and applications that open files without closing them.
When your application makes outbound connections (to a database, a cache, an upstream API), the kernel assigns an ephemeral source port from a configured range. The default range is 32768-60999, giving you about 28,000 ports. With TCP connections in TIME_WAIT occupying ports for 60 seconds, a server making many short-lived outbound connections can exhaust this range.
# Check for port exhaustion
ss -tan | awk '{print $4}' | sort | uniq -c | sort -rn | head
# Count connections in TIME_WAIT to a specific destination
ss -tan state time-wait dst 10.0.0.5 | wc -l
# Solutions:
# 1. Expand the ephemeral port range
sudo sysctl -w net.ipv4.ip_local_port_range="1024 65535"
# 2. Enable TIME_WAIT reuse
sudo sysctl -w net.ipv4.tcp_tw_reuse=1
# 3. Use connection pooling (the real fix)
# Keep connections open instead of making new ones for every request

Each file on a filesystem uses an inode. On ext4, the number of inodes is set when the filesystem is created and cannot be increased without reformatting (XFS and Btrfs allocate inodes dynamically). Log files, small temp files, and session files can silently exhaust inodes while you still have plenty of disk space.
# Check inode usage
df -i
# Find directories with the most files
find / -xdev -printf '%h\n' | sort | uniq -c | sort -rn | head -20
# Common culprits:
# - /tmp with millions of session files
# - /var/spool with queued mail
# - /var/log with rotated log files
# - Container overlay filesystems

The symptoms are identical to "disk full" — you cannot create files, you cannot write to databases, log files stop growing. But df -h shows plenty of space. Always check df -i when you see write failures with available disk space.
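The same inode numbers are available programmatically through statvfs, which is handy for monitoring agents. A quick sketch (some filesystems, such as Btrfs, report zero for the totals, hence the guard):

```python
import os

st = os.statvfs("/")
total, free = st.f_files, st.f_ffree
used_pct = 100 * (total - free) / total if total else 0.0
print(f"inodes on /: {total - free}/{total} used ({used_pct:.1f}%)")
```

Alerting on inode usage the same way you alert on disk usage catches this failure mode before writes start failing.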
These show up in dmesg as "BUG: soft lockup" or "rcu_sched detected stalls." They mean a CPU has been stuck in kernel code for too long without yielding. Common causes include: a kernel bug, a badly behaving kernel module, an overloaded system where CPU interrupts cannot be serviced, or VM hypervisor issues where the host over-provisioned CPU time.
# Check for soft lockups
dmesg | grep -i "soft lockup\|rcu.*stall\|hung_task"
# Raise the lockup detection threshold if it's a false positive
# (e.g., in VMs with overcommitted CPU); the default is 10 seconds
sudo sysctl -w kernel.watchdog_thresh=30

All the sysctl changes shown above are lost on reboot. To make them persistent, write them to a file in /etc/sysctl.d/:
sudo tee /etc/sysctl.d/99-app-tuning.conf > /dev/null << 'EOF'
# Network performance
net.core.somaxconn = 65535
net.ipv4.tcp_tw_reuse = 1
net.ipv4.ip_local_port_range = 1024 65535
net.ipv4.tcp_keepalive_time = 300
net.ipv4.tcp_keepalive_intvl = 30
net.ipv4.tcp_keepalive_probes = 5
net.netfilter.nf_conntrack_max = 262144
# Memory
vm.overcommit_memory = 0
vm.dirty_ratio = 15
vm.dirty_background_ratio = 5
# File descriptors
fs.file-max = 2097152
EOF
# Apply immediately
sudo sysctl --system

The common thread through all of this is that Linux is transparent. Every kernel data structure, every network connection, every process state, every resource limit — it is all exposed through /proc, /sys, and a rich set of tools. The kernel does not hide information from you. But it also does not volunteer it. You have to know where to look and what to look for.
My advice: do not try to memorize all of this. Instead, when you hit a production problem, come back to the relevant section. Run the commands. Read the output. Build mental models of how the kernel manages resources. Over time, these concepts become intuitive, and you start anticipating problems before they happen.
The best sysadmins I have worked with do not have superhuman knowledge. They just have a systematic approach: check the logs (dmesg, journalctl), check the resources (free, df, ss), trace the process (strace, /proc), and read the documentation (man pages are excellent for system calls). If you adopt this approach, you will solve most Linux problems faster than the people who immediately start Googling.
Linux rewards curiosity. Dig into /proc, read the kernel documentation, break things in a VM. The internals are not as scary as they look from the outside. They are just code — well-documented, battle-tested, and running the majority of the internet's infrastructure.