
Debug RethinkDB failing to find the primary replica in a cluster

Article ID: KB000215

Issue

The DTR cluster is unhealthy, and the health status logs contain messages like:

"2541aa473fde": "Unhealthy replicas: 2541aa473fde; reasons: Rethink replica is currently in state: waiting_for_primary, Rethink replica is currently in state: waiting_for_quorum, Rethink replica is currently in state: waiting_for_primary, Rethink replica is currently in state: waiting_for_primary, Rethink replica is currently in state: waiting_for_primary, Rethink replica is currently in state: waiting_for_primary, Rethink replica is currently in state: waiting_for_primary, Rethink replica is currently in state: waiting_for_primary, Rethink replica is currently in state: waiting_for_primary, Rethink replica is currently in state: waiting_for_primary, Rethink replica is currently in state: waiting_for_primary, Rethink replica is currently in state: waiting_for_quorum, Rethink replica is currently in state: waiting_for_quorum, Rethink replica is currently in state: waiting_for_quorum, Rethink replica is currently in state: waiting_for_quorum, Rethink replica is currently in state: waiting_for_quorum, Rethink replica is currently in state: waiting_for_quorum, Rethink replica is currently in state: waiting_for_quorum, Rethink replica is currently in state: waiting_for_quorum, Rethink replica is currently in state: waiting_for_quorum, Rethink replica is currently in state: waiting_for_quorum",

Prerequisites

Exec into the dtr-rethinkdb-nnnnnnnnnnnn container on every DTR node and confirm that /config/replica_ids.csv in the container filesystem contains a comma-separated list of all the DTR replica IDs in the cluster.

Example in a 3-node DTR cluster:

$ docker exec dtr-rethinkdb-a2382528a00b cat /config/replica_ids.csv
a2382528a00b,22476e1b69ee,f759e4b0fdaa
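
The check above can also be scripted. The following is a minimal sketch, not part of DTR: the helper name missing_replica_ids is hypothetical, and it assumes the file holds a single comma-separated line with no surrounding whitespace. It prints every expected replica ID that does not appear in the file contents.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with DTR): print each expected replica ID
# that is missing from the contents of replica_ids.csv.
#   $1    contents of replica_ids.csv (comma-separated replica IDs)
#   $2... replica IDs that should be present
missing_replica_ids() {
    csv=$1
    shift
    for id in "$@"; do
        # Wrap both sides in commas so only whole IDs match.
        case ",$csv," in
            *",$id,"*) ;;                 # ID is present, nothing to report
            *) printf '%s\n' "$id" ;;     # ID is missing
        esac
    done
}

# Example usage on a DTR node (IDs taken from the 3-node example above):
#   csv=$(docker exec dtr-rethinkdb-a2382528a00b cat /config/replica_ids.csv)
#   missing_replica_ids "$csv" a2382528a00b 22476e1b69ee f759e4b0fdaa
```

An empty result means the file on that node lists every expected replica. Run it against the rethinkdb container on each DTR node.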

Resolution

  1. If the /config/replica_ids.csv file does NOT contain the replica IDs of all cluster members, use a command-line editor to add the missing IDs:
$ docker exec -it dtr-rethinkdb-a2382528a00b vi /config/replica_ids.csv
  2. Stop the rethinkdb containers on all DTR nodes so that they are all stopped at the same time:
$ docker stop dtr-rethinkdb-a2382528a00b
$ docker stop dtr-rethinkdb-22476e1b69ee
$ docker stop dtr-rethinkdb-f759e4b0fdaa
  3. Start one rethinkdb container on one DTR node and follow its logs:
$ docker start dtr-rethinkdb-a2382528a00b
$ docker logs -f dtr-rethinkdb-a2382528a00b
  4. Open another terminal and, while watching the logs from the first rethinkdb container, start the remaining rethinkdb containers and check for "Connected to server" messages.
  5. If the majority of containers still do not connect to each other, start a test container on the dtr-ol network and look up the IP addresses of the other replicas through the internal DNS resolver:
$ docker run --rm -it --net dtr-ol --entrypoint sh docker/dtr-rethink:$DTR_VERSION
$ getent hosts dtr-rethinkdb-$OTHER_REPLICA_ID
  6. Repeat the lookup several times, and confirm that the reply is consistently NXDOMAIN if that dtr-rethinkdb container is not currently running, or a single IP address if that replica is currently running.
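
For the step that watches the first container's logs for "Connected to server" messages, a small filter can make the check less error-prone than reading the stream by eye. This is only a sketch: the function name count_connected is hypothetical, and it assumes each successful peer connection produces one log line containing the text quoted above.

```shell
#!/bin/sh
# Hypothetical helper: count log lines on stdin that report a peer connection.
count_connected() {
    # grep -c prints the number of matching lines. It exits non-zero when
    # that number is 0, so "|| true" keeps the function safe under "set -e".
    grep -c 'Connected to server' || true
}

# Example usage on the node where the first container was started:
#   docker logs dtr-rethinkdb-a2382528a00b 2>&1 | count_connected
```

The count can exceed the number of peers if a replica disconnects and reconnects, so treat it as a quick indicator and not as an exact peer count.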

If there is a DNS entry for a dtr-rethinkdb container that is not currently running, or there are multiple DNS entries for the same dtr-rethinkdb container name, then there is an issue with overlay network service discovery. This can only be cleared by stopping the Docker daemon on all DTR nodes so that all daemons are stopped at the same time, then starting the Docker daemon on all nodes again.
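
The decision rule above can be written down as a small classifier. The sketch below is illustrative only: the function name classify_lookup and its result labels are hypothetical, and it assumes the lookup output has one line per address with the IP address in the first column. Pass it the combined output of several repeated lookups so that an address that changes between attempts also counts as a duplicate.

```shell
#!/bin/sh
# Hypothetical helper: classify the result of repeated lookups for one replica.
#   $1  "yes" if that dtr-rethinkdb container is currently running, else "no"
#   $2  combined output of the repeated "getent hosts" lookups (may be empty)
classify_lookup() {
    running=$1
    output=$2
    # Count distinct IP addresses (first column of every non-empty line).
    count=$(printf '%s\n' "$output" | awk 'NF { print $1 }' | sort -u | wc -l | tr -d ' ')
    if [ "$running" = "no" ] && [ "$count" -gt 0 ]; then
        echo "stale"        # DNS entry exists for a stopped container
    elif [ "$count" -gt 1 ]; then
        echo "duplicate"    # more than one address for the same name
    elif [ "$running" = "yes" ] && [ "$count" -eq 0 ]; then
        echo "missing"      # running container does not resolve
    else
        echo "ok"
    fi
}

# Example usage inside the test container on the dtr-ol network:
#   out=$(for i in 1 2 3 4 5; do getent hosts dtr-rethinkdb-$OTHER_REPLICA_ID; done)
#   classify_lookup yes "$out"
```

A result of "stale" or "duplicate" corresponds to the service discovery issue described above. The "missing" label is an addition of this sketch for a running replica that does not resolve at all.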