Shashikant shah

Sunday, 2 March 2025

Troubleshooting for k8s.



 i).Checking Cluster Health

1.Check Cluster Status

# kubectl cluster-info

2.Check Node Health

# kubectl get nodes

# kubectl describe node <node-name>

3.Check Pods in the kube-system Namespace

# kubectl get pods -n kube-system

# kubectl describe pod <pod-name> -n kube-system

4.debug a failing pod.

# kubectl logs <pod-name> -n kube-system

5.Check Component Status.

# kubectl get componentstatuses





6.Check API Server Health.

# kubectl get --raw /readyz

 (It should return ok if the API server is healthy.)

7.check API server logs.

# kubectl logs -n kube-system -l component=kube-apiserver

8.Kubelet logs on the Node.

# journalctl -u kubelet -f


ii). Troubleshooting Pods Issues

1.View logs of a Pod.

Exit Code 1: Application error.

Exit Code 137: OOMKilled (Out of Memory).

Exit Code 126: Permission denied.

Exit Code 127: Command not found.

# kubectl run nginx --image=nginx --port=80

# kubectl logs <pod-name>

# kubectl logs <pod-name> -n <namespace>

# kubectl logs <pod-name> -n <namespace> -c <container-name>

# kubectl logs <pod-name> -n <namespace> -c <container-name> --previous

# kubectl logs --since=1h <pod-name> 

# kubectl logs --tail=20 <pod-name>

# kubectl logs <pod-name>  > pod.log 

2.Get detailed information about a Pod.

# kubectl describe pod my-pod -n my-namespace

3.List Pods and their statuses.

Running: Pod is working fine.

Pending: Pod is waiting for resources.

CrashLoopBackOff: The container keeps crashing and restarting.

ImagePullBackOff / ErrImagePull: Issues pulling the image.

OOMKilled: Container ran out of memory.

Completed: Pod has finished its job (for jobs or init containers).

 Terminating: Pod is shutting down (might be stuck).

# kubectl get pods -n <namespace>

# kubectl get pods -A

# kubectl get pods --all-namespaces

4.Open a shell inside a running container.

# kubectl exec -it <pod-name>  -- /bin/bash

# kubectl exec -it <pod-name>  -n <namespace>  -- /bin/bash

# kubectl exec -it <pod-name> -n <namespace> -- /bin/sh

5.Forward a local port to a Pod.

# kubectl port-forward my-app-pod -n my-namespace 8080:8080

6.Debugging with temporary container.

# kubectl debug -it my-pod --image=busybox --target=my-container

7.Check Events for Errors.

# kubectl get events --sort-by=.metadata.creationTimestamp

# kubectl get events --sort-by=.metadata.creationTimestamp -n <namespace>

# kubectl get events -w

# kubectl get events 

# kubectl get events -n kube-system

8.Check CPU or memory pods.

# kubectl top pods -n <namespace>

9.Check DNS Resolution Issues.

# kubectl exec -it <pod-name> -- nslookup google.com

# kubectl exec -it <pod-name> -- cat /etc/resolv.conf

# kubectl get pods -n kube-system -l k8s-app=kube-dns

# kubectl logs -n kube-system -l k8s-app=kube-dns

10.Restart CoreDNS if needed.

# kubectl rollout restart deployment coredns -n kube-system

11.Pod Stuck in Terminating.

# kubectl delete pod <pod-name> --grace-period=0 --force -n <namespace>

# kubectl patch pod <pod-name> -n <namespace> -p '{"metadata":{"finalizers":null}}'

12.Generating Kubernetes YAML Files Using Commands.

For Deployment

# kubectl create deployment nginx-deployment --image=nginx --replicas=2 --dry-run=client -o yaml > nginx-deployment.yaml

For Pod.

# kubectl run my-pod --image=nginx --restart=Never --dry-run=client -o yaml > my-pod.yaml

For Service (NodePort).

# kubectl expose deployment my-app --port=80 --type=NodePort --dry-run=client -o yaml > my-service.yaml

For configMap.

# kubectl create configmap my-config --from-literal=key1=value1 --dry-run=client -o yaml > my-config.yaml

For Secret.

# kubectl create secret generic my-secret --from-literal=username=admin --from-literal=password=1234 --dry-run=client -o yaml > my-secret.yaml

13.Understanding --dry-run=client

Check if a YAML File is Correct Before Creation

# kubectl create -f first_pod.yaml --dry-run=client

Check if Applying a YAML Will Work

# kubectl apply -f first_pod.yaml --dry-run=client

14.check pods IP address.

# kubectl get pods -o wide

15.List All Pods with Labels.

# kubectl get pods --show-labels

# kubectl get pod <pod-name> --show-labels

 

3.Troubleshooting Node Issues

1.Check Node Status.

When you run this command, you’ll see a list of all the nodes in your cluster, along with their current status. The status can be one of the following:

  • Ready: This is the ideal state. It means that the node is healthy and ready to accept and run containers.
  • NotReady: This status indicates that the node is not functioning correctly or is experiencing issues that prevent it from running containers. Nodes in this state might have resource constraints, network problems, or other issues.
  • SchedulingDisabled: This status means that the node is explicitly marked as unschedulable, preventing new containers from being scheduled on it. This can be useful for maintenance or troubleshooting purposes.
  • Unknown: In some cases, the node’s status might be unknown due to communication problems with the Kubernetes control plane.

 

# kubectl get nodes

2.Resource Metrics. (Metrics API required.)

# kubectl top nodes

# kubectl top pods -n my-namespace

3.Get Detailed Node Information.

# kubectl describe node <node-name>

4.Check Kubelet Logs

# journalctl -u kubelet -f --no-pager

5.Restart Kubelet and Docker/Container Runtime.

# systemctl status kubelet

# systemctl status  containerd

# systemctl status  docker

6.Verify Network Connectivity from node.

# ping <API-SERVER-IP>

# curl -k https://<API-SERVER-IP>:6443/healthz

7.List All Nodes with Their Labels.

# kubectl get nodes --show-labels

# kubectl get node <node-name> --show-labels

# kubectl get nodes -l env=prod

8.Remove a Failed Node.

# kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data

# kubectl delete node <node-name>

 

iv).Troubleshooting Networking Issues

1.Check Pod Networking

# kubectl get pods -o wide

# kubectl get nodes -o wide

# kubectl get pods -o wide -n <namespace>

2.Restart the network plugin (CNI)

# systemctl restart kubelet

# kubectl get pods -n kube-system

# kubectl get pods -n kube-system | grep cni

# kubectl get pods -n kube-system -l k8s-app=calico-node

# kubectl get pods -n kube-system -l k8s-app=flannel

# kubectl get pods -n kube-system -l k8s-app=cilium

# kubectl logs -n kube-system -l k8s-app=calico-node

# kubectl logs -n kube-system -l k8s-app=flannel

# kubectl logs -n kube-system -l k8s-app=cilium

# ls /etc/cni/net.d/

# cat /etc/cni/net.d/10-calico.conf

3.Restart CNI Pods

# kubectl delete pod -n kube-system -l k8s-app=calico-node

Common Networking Issues

i). Test Pod-to-Pod Communication.

# kubectl exec -it <pod-name> -n <namespace> -- ping <another-pod-ip>

# kubectl exec -it <pod-name> -n <namespace> -- ping <destination-pod>.<destination-namespace>.svc.cluster.local

# iptables -L -v -n

# kubectl logs -n kube-system -l k8s-app=kube-dns

restart CoreDNS:

# kubectl rollout restart deployment coredns -n kube-system

# kubectl get pods -n kube-system -l k8s-app=kube-proxy

# kubectl rollout restart ds kube-proxy -n kube-system

# kubectl logs -n kube-system -l k8s-app=kube-proxy

 

ii)Network Policies Blocking Traffic.

# kubectl get networkpolicy -n <namespace>

# kubectl describe networkpolicy <policy-name> -n <namespace>

# kubectl delete networkpolicy <policy-name> -n <namespace>

 

iii).Check Node-to-Node Networking.

# kubectl get nodes -o wide

# ping <node-ip>

# kubectl logs -n kube-system -l k8s-app=<cni-plugin>

 

iv)DNS Resolution Failures.

# kubectl get pods -n kube-system -l k8s-app=kube-dns

# kubectl get pods -n kube-system | grep coredns

# kubectl exec -it <pod-name> -- nslookup google.com

Restart CoreDNS

# kubectl delete pod -n kube-system -l k8s-app=kube-dns

# systemctl restart kubelet


v). External Traffic Not Reaching Services

# kubectl get svc -n <namespace>

Service Type

How External Traffic Works

ClusterIP

Only accessible inside the cluster

NodePort

Accessible on each node's external IP on a specific port

LoadBalancer

Uses a cloud provider’s LB to expose the service

Ingress

Routes external traffic to a service based on hostname/path

 

# kubectl get endpoints -n <namespace>

# kubectl describe svc my-service -n <namespace>

restart kube-proxy:

# kubectl rollout restart ds kube-proxy -n kube-system

# kubectl get events -n <namespace>

# kubectl get ingress -n <namespace>

# kubectl logs -n kube-system -l app.kubernetes.io/name=ingress-nginx

# kubectl describe ingress my-ingress -n <namespace>

# kubectl rollout restart ds kube-proxy -n kube-system

# kubectl rollout restart ds calico-node -n kube-system # If using Calico systemctl restart kubelet

5.Troubleshooting API Server Issues

# kubectl version





Client Version provides information about your local kubectl client.

Server Version provides information about the Kubernetes API server to which your kubectl client is connected.

Ensure that the client version matches or is compatible with the server version. 

1.Check API Server Status

# kubectl get componentstatuses

2.Check API Server Logs

# journalctl -u kubelet -f

3.Check API Server Status on Control Plane Nodes.

# crictl ps

4.Check API Server Health Endpoint.

# kubectl get --raw /readyz

# kubectl get --raw /livez

 # kubectl get --raw /healthz

5.Verify etcd is Healthy.

# ETCDCTL_API=3 etcdctl --endpoints=<etcd-endpoint> endpoint health

6.Check Certificate Issues.

# kubeadm certs check-expiration

7.Renew certificates if needed.

# kubeadm certs renew all

# systemctl restart kube-apiserver

8.Verify RBAC Configuration.

# kubectl auth can-i get pods --as <user>

If denied, check and fix RBAC policies:

# kubectl get clusterrolebindings

# kubectl get roles -n <namespace>

# kubectl get rolebindings -n <namespace>

9.Check for High CPU or Memory Usage.

# kubectl top nodes

# kubectl top pods -n kube-system

10.Viewing Cluster Events

# kubectl get events --all-namespaces

# kubectl get events

# kubectl get events --sort-by=.metadata.creationTimestamp

 

6.Troubleshooting Storage Issues













1.PVC Stuck in Pending State

# kubectl get pvc -n <namespace>

# kubectl describe pvc <pvc-name> -n <namespace>

# kubectl get storageclass

# kubectl describe storageclass <storageclass-name>

# kubectl get pv


2.Pod Failing to Mount Volumes

# kubectl get pvc -n <namespace>

# kubectl describe pvc <pvc-name> -n <namespace>

# kubectl get pod <pod-name> -n <namespace> -o yaml

# kubectl describe pod <pod-name> -n <namespace>


3.Data Not Accessible or Lost

# kubectl describe pod <pod-name> -n <namespace>


4.Storage Class Not Working

# kubectl get storageclass

# kubectl describe storageclass <storageclass-name>

# kubectl get pods -n kube-system | grep csi

# kubectl delete pod <csi-plugin-pod> -n kube-system


5.Volume Mount Permissions Issues

# kubectl exec -it <pod-name> -n <namespace> -- ls -l /path/to/mount

# kubectl get pod <pod-name> -n <namespace> -o yaml


6.Storage Provider-Specific Issues

# kubectl describe pvc <pvc-name> -n <namespace>

 

 Check All Resources in All Namespaces

# kubectl get all --all-namespaces

# kubectl get all -A

 

Check Specific Resources

Resource

Command

Pods

kubectl get pods -A

Services

kubectl get svc -A

Deployments

kubectl get deployments -A

StatefulSets

kubectl get statefulsets -A

ReplicaSets

kubectl get replicasets -A

DaemonSets

kubectl get daemonsets -A

Jobs

kubectl get jobs -A

CronJobs

kubectl get cronjobs -A

Persistent Volumes (PV)

kubectl get pv

Persistent Volume Claims (PVC)

kubectl get pvc -A

ConfigMaps

kubectl get configmaps -A

Secrets

kubectl get secrets -A

Ingress

kubectl get ingress -A

No comments:

Post a Comment