How do I troubleshoot pod crashes in Kubernetes?

Troubleshooting pod crashes in Kubernetes involves several steps, depending on the root cause. Here’s a comprehensive guide to identifying and resolving them:


1. Identify the Problem

Start by gathering information about the pod that is crashing:
```bash
kubectl get pods
kubectl describe pod <pod-name>
kubectl logs <pod-name>
```

  • kubectl get pods: Check the pod status; it might show statuses like CrashLoopBackOff, Error, or Terminating.
  • kubectl describe pod: Look at the events at the bottom for insights into what triggered the crash (e.g., failed liveness/readiness probes, resource limits exceeded, etc.).
  • kubectl logs: Review the application logs for errors or exceptions (if the container logs are accessible).

2. Examine CrashLoopBackOff Issues

If the pod is in CrashLoopBackOff, the container is crashing repeatedly and Kubernetes is restarting it with an increasing back-off delay between attempts.

Common Causes:

  • Application Errors: The application inside the container might be encountering a runtime exception or misconfiguration.
      – Check container logs using kubectl logs.
      – Ensure that the application is properly configured (e.g., environment variables, database connections, etc.).
  • Resource Limits: The pod might be exceeding its resource limits (CPU or memory).
      – Check the pod specification for resource requests and limits in the YAML.
      – Use kubectl top pod to monitor resource usage.
  • Missing Dependencies: Verify that all dependencies (e.g., ConfigMaps, Secrets, volumes, or services) are properly configured.
  • Incorrect Command/Args: Check the command and args in the pod specification for typos or invalid syntax.
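
The resource-limits point above can be sketched in a pod spec like this (the name `my-app`, the image, and the values are illustrative placeholders, not from the original):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: my-app                # placeholder name
spec:
  containers:
    - name: my-app
      image: my-app:1.0       # placeholder image
      resources:
        requests:             # used by the scheduler to place the pod
          cpu: "250m"
          memory: "256Mi"
        limits:               # CPU beyond the limit is throttled; memory beyond it is OOM-killed
          cpu: "500m"
          memory: "512Mi"
```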


3. Check Liveness and Readiness Probes

Faulty probes can cause Kubernetes to kill and restart pods unnecessarily.

  • Inspect the probes defined in the pod spec (livenessProbe and readinessProbe).
  • Ensure the probe commands (e.g., HTTP GET, TCP socket check) are valid and pointing to the correct ports.
  • Temporarily disable probes to see if the pod stabilizes.
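
For reference, a sketch of HTTP probes in a container spec (the endpoint paths, port, and names are assumptions for illustration, not from the original):

```yaml
    containers:
      - name: my-app              # placeholder container name
        image: my-app:1.0         # placeholder image
        livenessProbe:
          httpGet:
            path: /healthz        # assumed health endpoint
            port: 8080            # assumed application port
          initialDelaySeconds: 10 # allow startup time before the first probe
          periodSeconds: 5
          failureThreshold: 3     # container restarts after 3 consecutive failures
        readinessProbe:
          httpGet:
            path: /ready          # assumed readiness endpoint
            port: 8080
          periodSeconds: 5
```

A too-short initialDelaySeconds is a common cause of restart loops: the liveness probe fails before the application finishes starting, Kubernetes kills the container, and the cycle repeats.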

4. Inspect Events

Use kubectl describe pod to check for events at the bottom of the output. Common events include:
  • Back-off restarting failed container: indicates frequent restarts due to a crash.
  • Failed scheduling: could indicate resource constraints or node affinity issues.


5. Analyze Node Conditions

The problem might be related to the node where the pod is scheduled:
– Verify node health:
```bash
kubectl get nodes
kubectl describe node <node-name>
```

– Check for disk pressure, memory pressure, or PID pressure (the node conditions DiskPressure, MemoryPressure, and PIDPressure) that might prevent pods from running.
– Ensure the node has enough capacity for the pod’s resource requests.


6. Examine Container Logs

If the pod is crashing due to an application error, the container logs are your best source of information:
```bash
kubectl logs <pod-name> --previous
```

This command retrieves logs from the previous container instance before the crash.


7. Investigate Image Issues

Verify that the container image is valid and accessible:
– Ensure the image exists in the container registry.
– Check for image pull errors (ErrImagePull or ImagePullBackOff).
– Verify that the correct image tag is being used.
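
One common image problem is pulling from a private registry without credentials. A minimal sketch (the secret name `regcred` and the registry/image reference are placeholders):

```yaml
spec:
  imagePullSecrets:
    - name: regcred             # created beforehand, e.g. with kubectl create secret docker-registry
  containers:
    - name: my-app
      image: registry.example.com/my-app:1.0  # placeholder private-registry image
```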


8. Check for OOMKilled (Out of Memory)

Pods can crash if they consume more memory than allocated:
```bash
kubectl describe pod <pod-name>
```

Look for Reason: OOMKilled under the container's Last State in the output.

Possible solutions:
– Increase memory limits in the pod spec.
– Optimize the application to use less memory.
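
In the describe output, the relevant fragment typically looks like this (the counts and timestamps are illustrative):

```
Last State:     Terminated
  Reason:       OOMKilled
  Exit Code:    137
Restart Count:  4
```

Exit code 137 corresponds to SIGKILL (128 + 9), the signal delivered when the kernel's OOM killer terminates the process.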


9. Verify Networking

If the pod depends on external services or APIs, networking issues could cause crashes:
– Test network connectivity using kubectl exec to run commands inside the pod.
– Ensure proper DNS resolution within the cluster.


10. Debugging with Ephemeral Containers

If the pod crashes too quickly for logs to be useful, use ephemeral containers to debug:
```bash
kubectl debug -it pod/<pod-name> --image=busybox --target=<container-name>
```

This attaches a temporary (ephemeral) debug container to the pod; busybox is just one choice of debug image. With --target, the debug container shares the target container's process namespace, so you can inspect its processes even while the main container is crashing.


11. Check Resource Quotas and Limits

Ensure your namespace has sufficient resources:
```bash
kubectl describe quota
kubectl describe limitrange
```

If the pod exceeds quota or limit constraints, adjust them accordingly.
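
A minimal ResourceQuota sketch for a namespace (the names and values are illustrative placeholders):

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-quota            # placeholder name
  namespace: my-namespace     # placeholder namespace
spec:
  hard:
    requests.cpu: "4"
    requests.memory: 8Gi
    limits.cpu: "8"
    limits.memory: 16Gi
    pods: "20"
```

Note that once a quota constrains CPU or memory, every pod in that namespace must specify requests and limits (or inherit them from a LimitRange), or pod creation is rejected.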


12. Review Deployment Configuration

If the pod is part of a Deployment, check the Deployment spec for issues:
```bash
kubectl describe deployment <deployment-name>
kubectl get rs   # check the ReplicaSets
```

Ensure that the Deployment strategy and replicas are correctly configured.
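
An explicit rolling-update strategy in a Deployment spec looks like this (the values are illustrative):

```yaml
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1   # at most one pod below the desired count during a rollout
      maxSurge: 1         # at most one extra pod above the desired count
```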


13. Check Storage Volumes

If the pod uses persistent volumes (PVs) or persistent volume claims (PVCs), verify that storage is correctly provisioned:
```bash
kubectl get pvc
kubectl describe pvc <pvc-name>
kubectl describe pv <pv-name>
```

Ensure the volume is mounted and accessible.
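
A minimal sketch of mounting a PVC into a pod (the claim name, image, and mount path are placeholders):

```yaml
spec:
  containers:
    - name: my-app
      image: my-app:1.0           # placeholder image
      volumeMounts:
        - name: data
          mountPath: /var/lib/data  # placeholder mount path
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: my-app-data    # must match an existing, Bound PVC
```

If the PVC is stuck in Pending, check that a matching StorageClass exists and that a PersistentVolume can satisfy the requested size and access mode.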


14. Use Kubernetes Dashboard or Monitoring Tools

Use tools like Kubernetes Dashboard, Prometheus, Grafana, or Lens to get a visual overview of the cluster and identify anomalies.


15. Escalate or Recreate

If all else fails:
– Delete the pod to allow Kubernetes to recreate it (if managed by a Deployment or ReplicaSet):
```bash
kubectl delete pod <pod-name>
```

– Revisit your application code or container image for deeper issues.


Summary:

Pod crashes can occur due to application errors, resource constraints, misconfigurations, or environmental issues. Follow a systematic approach to gather logs, inspect events, review configurations, and analyze resource usage to pinpoint and resolve the issue.
