
Nodes down due to unhealthy calico-node on swarm-only cluster

Article ID: KB000865

Issue

If a calico-node pod becomes unhealthy or fails to schedule, the corresponding node is marked as Down in Universal Control Plane (UCP) with the error message "Calico-node pod is unhealthy: %!s(<nil>)". This problem can be encountered even on clusters with no deployed Kubernetes workloads.
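
As a quick check, the calico-node pods themselves can be inspected directly; the selector below matches the pods that UCP deploys:

    kubectl -n kube-system get pod -o wide --selector k8s-app=calico-node
    kubectl -n kube-system describe pod --selector k8s-app=calico-node

Pods stuck in a non-Running state, or shown as failing readiness checks in the describe output, correspond to the nodes that UCP reports as Down.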

Prerequisites

Note: On UCP 3.0.x, this procedure can be applied to all nodes. On UCP 3.1.x, it can only be applied to worker nodes; if it is applied to manager nodes on UCP 3.1.x, UCP will stop displaying CPU, memory, and disk utilization information in the UI.
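
If you are unsure which nodes are managers, one way to list them is via the node-role.kubernetes.io/master label (the same label the taint command in step 1 selects on) or the swarm CLI:

    kubectl get nodes --selector node-role.kubernetes.io/master
    docker node ls --filter role=manager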

Resolution

The health status of calico-node applies only to Kubernetes networking and has no direct effect on swarm functionality. Affected nodes still accept swarm services and containers and still participate in swarm multi-host networks. Moreover, the UCP and DTR _ping endpoints will continue to report good health.
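
As an illustration, the _ping endpoint of an affected cluster can be queried directly; ucp.example.com below is a placeholder for your UCP address, and -k skips verification of self-signed certificates:

    curl -k https://ucp.example.com/_ping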

To prevent nodes from being reported as Down due to calico-node health issues in a swarm-only cluster, apply a Kubernetes taint that disables pod scheduling, then delete the currently deployed calico-node pods:

  1. Apply a taint to disable scheduling:

    Warning: this command will disable Kubernetes scheduling on the nodes to which it is applied. Do not run this command on a cluster with Kubernetes workloads.

    On UCP 3.0.x, calico can be disabled on all nodes:

    kubectl taint node --all com.docker.ucp.orchestrator.swarm=true:NoSchedule
    

    On UCP 3.1.x, calico should be disabled on worker nodes only:

    kubectl taint node --selector node-role.kubernetes.io/master!= com.docker.ucp.orchestrator.swarm=true:NoSchedule
    
    Warning: Applying taints to manager nodes will disable UCP metrics in versions 3.1.x and higher.

    The taint key used here (com.docker.ucp.orchestrator.swarm) is arbitrary. Taints are not automatically applied to nodes subsequently added to the cluster; re-run the same command after adding new nodes. A way to spot-check the applied taints is shown after this procedure.

  2. Remove existing calico-node pods:

    kubectl -n kube-system delete pod --selector k8s-app=calico-node
    
  3. Confirm calico-node is running where desired:

    kubectl -n kube-system get pod -o wide --selector k8s-app=calico-node 
    

    Expected output on UCP 3.0.x is No resources found. Expected output on UCP 3.1.x is one pod per manager node.
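
To spot-check that the taint from step 1 is in place on a given node, list its taints; <node-name> below is a placeholder:

    kubectl describe node <node-name> | grep Taints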

If you want to re-enable Kubernetes scheduling at a later time, you can do so by removing the taint from all nodes:

kubectl taint node --all com.docker.ucp.orchestrator.swarm-
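
After the taint is removed, the calico-node DaemonSet should schedule pods back onto those nodes; this can be watched with the same selector used in step 3:

    kubectl -n kube-system get pod -o wide --watch --selector k8s-app=calico-node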