
Server loses connection after Swarm init or join

Article ID: KB000780

Issue

Initializing a swarm or joining an existing swarm causes loss of network connection from a specific EC2 instance on AWS. This affects machines running Docker EE on Windows Server only.

In the AWS console, an error message like the following appears a few minutes after running docker swarm join or docker swarm init:

Instance reachability check failed at <Date>

Prerequisites

This issue affects servers that meet the following criteria:

  1. Windows Server 2016 is installed
  2. The server is an Amazon EC2 instance of type C3, C4, D2, I2, M4 (excluding m4.16xlarge), M5, or R3
  3. For on-premises installations, the virtual machine uses Intel(R) 82599 Virtual Function as its primary NIC driver
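
As a quick way to apply the instance-type criterion above, the following is a minimal Python sketch that checks an EC2 instance type string against the families listed in this article. The function name and constants are illustrative and not part of any Docker or AWS tool. Matching is on the exact family name, so variants that this article does not list (for example m5d) are treated as not affected.

```python
# Instance families listed in this article as affected (lowercase).
AFFECTED_FAMILIES = {"c3", "c4", "d2", "i2", "m4", "m5", "r3"}

# Specific sizes the article explicitly excludes.
EXCLUDED_TYPES = {"m4.16xlarge"}


def is_affected_instance_type(instance_type: str) -> bool:
    """Return True if the instance type falls in a family listed above."""
    normalized = instance_type.strip().lower()
    if normalized in EXCLUDED_TYPES:
        return False
    # The family is the part before the first dot, e.g. "m4" in "m4.large".
    family = normalized.split(".", 1)[0]
    return family in AFFECTED_FAMILIES


if __name__ == "__main__":
    for candidate in ("m4.large", "m4.16xlarge", "i3.large", "r4.xlarge"):
        print(candidate, is_affected_instance_type(candidate))
```

This covers only the instance-type criterion. The operating system and, for on-premises machines, the NIC driver still need to be checked separately.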

Root Cause

The affected NIC drivers may conflict with HNS (Host Networking Service) on the Windows host. This issue is still under investigation as of January 2019.

Resolution

The current workarounds for this issue are:

  1. Use a different AWS instance type, such as I3 or R4
  2. For on-premises installations, consider using an alternative virtual NIC

If your server has already lost its connection and you need to recover data, follow these steps on AWS:

  1. Copy the Instance ID of the affected server.
  2. Create a new network interface with a public IP.
  3. Select the network interface just created.
  4. Click Attach and specify the Instance ID from step 1.
  5. Connect to the instance using the new public IP from step 2.
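
The recovery steps above can also be scripted. The following is a rough Python sketch, assuming a boto3 EC2 client is passed in as `ec2`. The function name, the `subnet_id` and `security_group_id` parameters, and the choice of an Elastic IP as the public IP are assumptions for illustration and are not prescribed by this article. The subnet is assumed to be in the same Availability Zone as the instance.

```python
from typing import Any, Dict


def attach_recovery_interface(
    ec2: Any,
    instance_id: str,
    subnet_id: str,
    security_group_id: str,
    device_index: int = 1,
) -> Dict[str, str]:
    """Create a new network interface with a public IP and attach it.

    `ec2` is expected to behave like boto3.client("ec2").
    """
    # Step 2: create a new network interface in the given subnet.
    eni = ec2.create_network_interface(
        SubnetId=subnet_id,
        Groups=[security_group_id],
        Description=f"Recovery interface for {instance_id}",
    )
    eni_id = eni["NetworkInterface"]["NetworkInterfaceId"]

    # Assumption: the public IP is provided by a newly allocated Elastic IP.
    address = ec2.allocate_address(Domain="vpc")
    ec2.associate_address(
        AllocationId=address["AllocationId"],
        NetworkInterfaceId=eni_id,
    )

    # Steps 3-4: attach the interface to the affected instance.
    # Device index 0 is the primary interface, so a higher index is used.
    attachment = ec2.attach_network_interface(
        NetworkInterfaceId=eni_id,
        InstanceId=instance_id,
        DeviceIndex=device_index,
    )

    # Step 5: the caller connects to the instance using this public IP.
    return {
        "network_interface_id": eni_id,
        "public_ip": address["PublicIp"],
        "attachment_id": attachment["AttachmentId"],
    }
```

A call might look like `attach_recovery_interface(boto3.client("ec2"), instance_id, subnet_id, security_group_id)`, with the three identifiers taken from your own account. The security group must allow the remote access protocol you plan to use, and the Elastic IP and network interface should be released once data recovery is complete.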

What's Next

The Docker and AWS support teams are currently working on this issue. This article will be updated as progress is made. If you think your cluster is affected, please file a support case with both Docker Support and AWS Support.