Troubleshooting a pending pod in a Kubernetes cluster (AKS)

Last week I had an interesting issue with one of our clients' AKS clusters, and in the spirit of sharing I thought I would document the solution and my findings.

I find that with Kubernetes issues, the solution itself is usually not the tricky part; it often comes down to a single command. The interesting part, and the part worth reading about, is how you find the information and troubleshoot your way there. So here's a short story of what happened and how I solved it in my case:

Pod status pending due to insufficient memory.

Our client noticed that one of our applications, responsible for rendering forms on the website, was no longer up and running, even though this is a self-healing service that should spin up a new pod if it ever goes down.

So the first thing I checked was all the pods running in the production cluster, and as it turns out, one of our pods had the status Pending.

kubectl get pods

NAME                       READY   STATUS
forms-13123213f5-dimdsid   0/1     Pending
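
As a small aside, if the cluster runs a lot of pods you can filter for the pending ones directly. This is plain kubectl and not specific to my setup:

kubectl get pods --all-namespaces --field-selector=status.phase=Pending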

The first thing I tried was to kill this pod and see if it would start up again. I thought that maybe something had gone wrong the last time the pod tried to start up, leaving it stuck in a pending state.

kubectl delete pod [pod-name]

This, however, did not solve the problem. The pod went away and came back, only to end up in the same pending state. The next thing I tried was to get some more information about the problem, since Pending on its own isn't very helpful. So I used the describe command.
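
Since the pod name is generated by the deployment's ReplicaSet and changes every time, it can be easier to restart the whole deployment instead of chasing pod names. Assuming the deployment is simply called forms (that name is just a placeholder here), it would look something like this:

kubectl rollout restart deployment/forms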

kubectl describe pod [pod-name]

This gave me some useful information:

0/3 nodes available: insufficient memory.
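
That message comes from the events at the bottom of the describe output. If you only want the events, you can also query them directly for the pod in question (using the pod name from earlier as an example):

kubectl get events --field-selector involvedObject.name=forms-13123213f5-dimdsid --sort-by=.lastTimestamp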

According to this, none of my nodes had enough memory to schedule new pods, and to confirm that I ran the following command:

kubectl top nodes

This gave me a list of all my nodes together with the amount of memory currently being used, which was averaging around 85-90%. So all of my nodes were under heavy pressure at the moment and therefore did not have enough resources to start any new pods.
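
If you want more detail on a specific node, describe shows how much memory and CPU the pods running on it have already requested; look for the Allocated resources section near the bottom of the output. The node name below is just a placeholder:

kubectl describe node aks-nodepool1-12345678-vmss000000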

This later turned out to be a combination of heavy traffic and heavy, long-running background jobs. By running kubectl top pods I got a list of all my pods and how much memory and CPU they were using. This is not really relevant for this story, since the reason for not having enough memory on your nodes will vary from case to case. The important part is how to solve it.
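
If you do go digging into pod usage, sorting by memory makes it easier to spot the worst offenders:

kubectl top pods --all-namespaces --sort-by=memory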

Once I had established that the problem was insufficient memory, I knew I had to add more nodes to my AKS cluster. You can do this from the command line as well (I've included an example after the steps below), but I found the Azure portal super easy to use for this kind of operation that you rarely have to do.

  1. Under resources, go to your Kubernetes service
  2. Click on Node pools
  3. Next click on Autoscaling (in this case autoscaling was Disabled; more on this later in this blog post)
  4. This will open up a dialog called Scale node pools.
  5. Increase the node count using the slider and then click Apply.
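
For reference, this is roughly what the command line version looks like with the Azure CLI. The resource group, cluster, and node pool names below are placeholders for your own:

az aks scale --resource-group my-resource-group --name my-aks-cluster --nodepool-name nodepool1 --node-count 4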

That’s it. In my case, adding one more node to the cluster was enough for the pod to spin up, and I saw an instant decrease of pressure on my cluster. Had I had autoscaling enabled from the start, this would probably never have happened. We will investigate together with the client whether this feature should be enabled, and possibly whether each node in the cluster needs more memory, but this is an ongoing investigation.
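
If we do end up enabling it, my understanding is that the cluster autoscaler is turned on per node pool, with something along these lines (again, the names and counts are placeholders, not our actual values):

az aks nodepool update --resource-group my-resource-group --cluster-name my-aks-cluster --name nodepool1 --enable-cluster-autoscaler --min-count 3 --max-count 6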

Maybe this will be useful to someone, and hopefully it didn't just sound like ramblings. As I've mentioned before, I'm no Kubernetes ninja, but I find that documenting my work helps when learning various Kubernetes tasks.

Cheers friends! ❤️