
How can I clean a failed UCP manager node and re-add it back into the cluster?

As long as your cluster is still functional and has not lost quorum (no more than (n-1)/2 manager nodes have failed), you can use the following steps to remove a failed UCP manager node, clean it, and add it back to the cluster.
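For example, a three-manager cluster keeps quorum with one manager down, and a five-manager cluster tolerates two manager failures.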

If your cluster has lost too many managers to maintain quorum, you must instead use the UCP backup and restore functionality to reset your cluster to a single-controller cluster.
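A minimal sketch of that path, assuming the docker/ucp bootstrapper image (run docker run --rm docker/ucp backup --help on your release to confirm the exact flags for your version), taken on a surviving controller node:

    $ docker run --rm -i --name ucp \
        -v /var/run/docker.sock:/var/run/docker.sock \
        docker/ucp backup --interactive > /tmp/ucp-backup.tar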

Steps

On a known good working UCP manager node, demote and remove the failed manager.

  1. First, using the command line on a known good working manager node, find the failed manager node by looking for a manager status of Unreachable:
    $ docker node ls
    ID                           HOSTNAME                       STATUS  AVAILABILITY  MANAGER STATUS
    5dedrbwh2tr8enjod24t0grl9    ip-172-31-2-146.ec2.internal   Ready   Active
    cm0tqq479j4p4quthx1itlomd    ip-172-31-5-182.ec2.internal   Ready   Active        Unreachable
    gjplwu97lsffn8x1nh8cmpore    ip-172-31-29-146.ec2.internal  Ready   Active
    krz68mapcv6yxmawv5n0gx6yz    ip-172-31-1-168.ec2.internal   Ready   Active
    p2dit0e30e2aavqu0jza0lxir    ip-172-31-7-139.ec2.internal   Ready   Active
    r4ou4v7prwksm61ymzdxy7wqs    ip-172-31-16-132.ec2.internal  Ready   Active
    s0xc7fnlz934dxlznk1weu4az *  ip-172-31-4-208.ec2.internal   Ready   Active        Leader
    w4ya2je2zwzww2qku1eir8a49    ip-172-31-22-95.ec2.internal   Ready   Active        Reachable
    
  2. Next, demote the unreachable node:
    $ docker node demote cm0tqq479j4p4quthx1itlomd
    
  3. Check that it is no longer a manager:
    $ docker node ls
    ID                           HOSTNAME                       STATUS  AVAILABILITY  MANAGER STATUS
    5dedrbwh2tr8enjod24t0grl9    ip-172-31-2-146.ec2.internal   Ready   Active
    cm0tqq479j4p4quthx1itlomd    ip-172-31-5-182.ec2.internal   Ready   Active      
    gjplwu97lsffn8x1nh8cmpore    ip-172-31-29-146.ec2.internal  Ready   Active
    krz68mapcv6yxmawv5n0gx6yz    ip-172-31-1-168.ec2.internal   Ready   Active
    p2dit0e30e2aavqu0jza0lxir    ip-172-31-7-139.ec2.internal   Ready   Active
    r4ou4v7prwksm61ymzdxy7wqs    ip-172-31-16-132.ec2.internal  Ready   Active
    s0xc7fnlz934dxlznk1weu4az *  ip-172-31-4-208.ec2.internal   Ready   Active        Leader
    w4ya2je2zwzww2qku1eir8a49    ip-172-31-22-95.ec2.internal   Ready   Active        Reachable
    
  4. Remove the node (--force may or may not be needed):
    $ docker node rm cm0tqq479j4p4quthx1itlomd  --force  
    $ docker node ls    # the cluster is now down to two manager nodes
    ID                           HOSTNAME                       STATUS  AVAILABILITY  MANAGER STATUS
    5dedrbwh2tr8enjod24t0grl9    ip-172-31-2-146.ec2.internal   Ready   Active
    gjplwu97lsffn8x1nh8cmpore    ip-172-31-29-146.ec2.internal  Ready   Active
    krz68mapcv6yxmawv5n0gx6yz    ip-172-31-1-168.ec2.internal   Ready   Active
    p2dit0e30e2aavqu0jza0lxir    ip-172-31-7-139.ec2.internal   Ready   Active
    r4ou4v7prwksm61ymzdxy7wqs    ip-172-31-16-132.ec2.internal  Ready   Active
    s0xc7fnlz934dxlznk1weu4az *  ip-172-31-4-208.ec2.internal   Ready   Active        Leader
    w4ya2je2zwzww2qku1eir8a49    ip-172-31-22-95.ec2.internal   Ready   Active        Reachable
    
  5. Go into etcd and confirm that etcd sees the correct number of nodes; if the failed node is still listed, remove it from the etcd cluster as well: https://docs.docker.com/datacenter/u...-failed-member
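    As a sketch, assuming a UCP 1.x deployment where etcd runs in the ucp-kv container with its TLS certificates at /etc/docker/ssl (verify these paths on your installation), cluster health and membership can be checked with etcdctl from a healthy manager:
    # Check overall etcd cluster health
    $ docker exec -it ucp-kv etcdctl \
        --endpoint https://127.0.0.1:2379 \
        --ca-file /etc/docker/ssl/ca.pem \
        --cert-file /etc/docker/ssl/cert.pem \
        --key-file /etc/docker/ssl/key.pem \
        cluster-health
    # If the failed node is still listed, find its ID with "member list"
    # and remove it with "member remove <member-id>" using the same flags.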
  6. On the failed node, remove all containers and verify that none remain:
    $ docker rm -f $(docker ps -aq)
    $ docker ps -a
    
  7. Update rethinkdb. First, inspect the ucp-auth-store container (which runs rethinkdb) and confirm the IP address and ports that container is using, typically an address in the 172.17.0.x range:
    $ docker inspect ucp-auth-store
                },
                "NetworkMode": "default",
                "PortBindings": {
                    "12383/tcp": [
                        {
                            "HostIp": "0.0.0.0",
                            "HostPort": "12383"
                        }
                    ],
                    "12384/tcp": [
                        {
                            "HostIp": "0.0.0.0",
                            "HostPort": "12384"
                        }
                    ]
                },
    .
    .
    .
                },
                "SandboxKey": "/var/run/docker/netns/9593fa46396c",
                "SecondaryIPAddresses": null,
                "SecondaryIPv6Addresses": null,
                "EndpointID": "1aa4ffe690724c5324e1c08477dac8dfe90e737cb1391a5cb426ae476923cf1b",
                "Gateway": "172.17.0.1",
                "GlobalIPv6Address": "",
                "GlobalIPv6PrefixLen": 0,
                "IPAddress": "172.17.0.13",
                "IPPrefixLen": 16,
                "IPv6Gateway": "",
                "MacAddress": "02:42:ac:11:00:0d",
                "Networks": {
                    "bridge": {
                        "IPAMConfig": null,
                        "Links": null,
                        "Aliases": null,
                        "NetworkID": "c59ef004735ade27bb3edc2b0e2ade90edd5a1c74b72e15937bb186d026895f8",
                        "EndpointID": "1aa4ffe690724c5324e1c08477dac8dfe90e737cb1391a5cb426ae476923cf1b",
                        "Gateway": "172.17.0.1",
                        "IPAddress": "172.17.0.13",
                        "IPPrefixLen": 16,
                        "IPv6Gateway": "",
                        "GlobalIPv6Address": "",
                        "GlobalIPv6PrefixLen": 0,
                        "MacAddress": "02:42:ac:11:00:0d"
                    }
                }
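    If you only need the container's IP address, a --format filter avoids reading the full inspect output; against the ucp-auth-store container on the default bridge network it returns the address used in the next step:
    $ docker inspect --format '{{ .NetworkSettings.IPAddress }}' ucp-auth-store
    172.17.0.13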
    
    
  8. Using the command below and the information captured from the ucp-auth-store container above, reconfigure rethinkdb. Set $NUM_REPLICAS to the number of working manager nodes in your cluster (not including the failed node), and use the IP address from the inspect output as $HOST_ADDRESS:
    docker run --rm -v ucp-auth-api-certs:/tls docker/ucp-auth:1.1.5 --db-addr "$HOST_ADDRESS:12383" reconfigure-db -n $NUM_REPLICAS --emergency-repair
    ~ $ docker run --rm -v ucp-auth-api-certs:/tls docker/ucp-auth:1.1.5 --db-addr "172.17.0.13:12383" reconfigure-db -n 2 --emergency-repair
    
  9. SSH to the failed node and verify that it is no longer part of any swarm and that no containers are running or exited. You need to keep or restore the UCP images if and when you add the node back into the cluster as a UCP manager node (see the note after the next step):
    $ docker node ls
    $ docker ps -a
    $ docker images
    
  10. Prepare the host to re-add:
    $ systemctl stop docker
    $ rm -fR /var/lib/docker
    $ rm /etc/docker/key.json
    $ systemctl start docker
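    Note that removing /var/lib/docker also deletes all local images, including the UCP images mentioned in the previous step. If they are gone, pull them again before rejoining, for example for the 1.1.5 release shown earlier (adjust the tags to match your UCP version):
    $ docker pull docker/ucp:1.1.5
    $ docker pull docker/ucp-auth:1.1.5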
    
  11. Once Docker starts up, you should be able to capture the node join command from the UCP UI and rejoin this node as a worker or as a manager (depending on which UCP version you are running).
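    For illustration only (the UI provides the exact command and token for your cluster), the join looks like one of the following, depending on UCP version:
    # UCP 1.x: run the bootstrapper on the node and follow the interactive prompts
    $ docker run --rm -it --name ucp \
        -v /var/run/docker.sock:/var/run/docker.sock \
        docker/ucp join --interactive
    # UCP 2.x and later: paste the swarm join command from the UI
    # (the token and manager address below are placeholders)
    $ docker swarm join --token <SWMTKN-...> <manager-ip>:2377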