
How to recover from a split gossip cluster

Article ID: KB000759

Issue

If two or more containers on the same overlay network can't communicate with each other, it's possible that the nodes where the containers are running are members of a split gossip cluster.

The gossip cluster can get into a split state when all manager nodes are stopped/started at the same time while worker nodes are running.

Prerequisites

A cluster with more than one manager is needed for this issue to occur.

The following is an indication that you are hitting the gossip split cluster issue:

  • Two or more containers on the same overlay network can't communicate with each other.

  • The engine log message below, printed every 5 minutes, shows statistics for a specific network ID. If the netPeers value does not match the number of nodes in your cluster that participate in that network, it's a sign of a split gossip cluster.

    level=info msg="NetworkDB stats - netID:nyzt77p9kxn9yaw4ptln78wpr leaving:false netPeers:1 entries:2 Queue qLen:0 netMsg/s:0"
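To compare netPeers against your expected node count, you can extract the value from the log line. A minimal sketch, using the sample log line from above and an example expected count (on a live node you would read the engine logs, e.g. via `journalctl -u docker`, and determine the expected count from `docker node ls` on a manager):

```shell
# Sample NetworkDB stats line (copied from the engine log above).
log='level=info msg="NetworkDB stats - netID:nyzt77p9kxn9yaw4ptln78wpr leaving:false netPeers:1 entries:2 Queue qLen:0 netMsg/s:0"'

# Extract the netPeers value from the log line.
net_peers=$(echo "$log" | sed -n 's/.*netPeers:\([0-9]*\).*/\1/p')

# Example expected value: the number of nodes in your cluster that
# are attached to this network ID (an assumption for this sketch).
expected=3

if [ "$net_peers" -lt "$expected" ]; then
  echo "possible split: netPeers=$net_peers expected=$expected"
fi
```

If the reported netPeers stays below the expected count across several 5-minute intervals, the nodes are likely in separate gossip clusters.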
    

Root Cause

Due to a bug in libnetwork, at least one manager must remain available and connected to the worker nodes for gossip cluster reconciliation to work. If all managers are stopped at the same time while workers keep running, the workers can end up in separate gossip clusters that never re-merge on their own.

Resolution

To converge all nodes back into a single gossip cluster, restart the Docker engine on all worker nodes in the cluster.
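A rolling restart, one worker at a time, keeps containers on the other workers available while the cluster converges. A minimal sketch, assuming SSH access to the workers and that the engine runs under systemd; the hostnames are placeholders, and the loop prints each command rather than executing it so you can review it first:

```shell
# Placeholder worker hostnames -- on a manager, you could list them with:
#   docker node ls --filter role=worker --format '{{.Hostname}}'
workers="worker1 worker2 worker3"

for node in $workers; do
  # Dry run: print the restart command instead of executing it.
  # Remove the leading `echo` to actually restart the engine, and
  # wait for each node to rejoin before moving to the next one.
  echo ssh "$node" sudo systemctl restart docker
done
```

After each restart, re-check the NetworkDB stats log line on the affected nodes to confirm that netPeers converges to the expected count.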

What's Next

This bug has been addressed in 17.07.2-ee-16.

If you have any questions, please contact Docker Support.