
How to restore DTR 2.2+ from volume backups

Issue

It may sometimes be necessary to restore DTR from a set of volume backups, particularly when operating DTR 2.2 - 2.4 and experiencing a loss of quorum that necessitates an emergency scale-down of the DTR cluster. This procedure has been tested with DTR 2.2.x and 2.3.x, and is expected to work on DTR 2.4 as well. DTR 2.5 has a top-level "emergency recovery" command that can be used in lieu of this process.

Prerequisites

  1. It may be helpful to install the jq binary from your package repositories, or obtain the binary directly from the jq website. If your environment is air-gapped or requires validation of software packages before installing, this process can be performed without jq, but will require some manual file editing.

  2. You should have a set of DTR volume backups accessible in your current working directory:

    cd ~
    ls -lh | grep dtr
    
    >  drwxr-xr-x 3 root root 4.0K Mar 26 19:51 dtr-ca-8b7aa2291ede
    >  drwxr-xr-x 3 root root 4.0K Mar 26 19:51 dtr-postgres-8b7aa2291ede
    >  drwxr-xr-x 3 root root 4.0K Mar 26 19:51 dtr-registry-8b7aa2291ede
    >  drwxr-xr-x 3 root root 4.0K Mar 26 19:51 dtr-rethink-8b7aa2291ede
    

    You may or may not have existing docker volumes that are still usable. These instructions assume you are restoring from a blank slate.

    Each volume backup directory should have _data as its top-level folder:

    tree -L 2 ~/
    
    >  .
    >  ├── dtr-ca-8b7aa2291ede
    >  │   └── _data
    >  ├── dtr-postgres-8b7aa2291ede
    >  │   └── _data
    >  ├── dtr-registry-8b7aa2291ede
    >  │   └── _data
    >  └── dtr-rethink-8b7aa2291ede
    >      └── _data
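
    If you still have access to the node the volumes originally lived on, one way to produce backup directories in this layout is to copy them straight out of the Docker volume store. The sketch below assumes the default data root of /var/lib/docker/volumes and the example replica ID 8b7aa2291ede - adjust both for your environment:

    # Copy each DTR volume directory (including its _data subdirectory) into the home
    # directory, preserving ownership and permissions. Assumes the default Docker data root.
    for VOLUME in dtr-ca dtr-postgres dtr-registry dtr-rethink
    do
      cp -a /var/lib/docker/volumes/$VOLUME-8b7aa2291ede ~/
    done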
    
  3. Ensure that you don't already have an existing DTR installation in your UCP environment - the DTR bootstrapper only allows one DTR cluster per UCP cluster, and the restore process may fail if it encounters an existing DTR installation. Also make sure that your UCP cluster is otherwise healthy.

    Note: All commands executed in this process will be sent directly to the engine on the DTR node you wish to restore - do not execute this process through a UCP client bundle.
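
    If you are not sure whether DTR components are still present, a quick way to check is to look for DTR containers and volumes directly on each node (run against each engine, not through a client bundle):

    # Run on each node - any output indicates DTR remnants that should be cleaned up first.
    docker ps -a --filter name=dtr
    docker volume ls --filter name=dtr-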

  4. This process will reset your security certificates and CA certificate for your DTR cluster - ensure you have the original certificates, or make backups of your current CA, server cert, and key, before continuing.

  5. You will need the following two Docker images present on the node where you are running the commands:

    • dockerhubenterprise/rethinkcli:v2.2.0
    • romainbelorgey/dtr-global-change

Resolution

To restore a DTR cluster from a set of volume backups:

  1. First, export some shell variables that will be used later:

    export REPLICA_ID=8b7aa2291ede
    export DB_ADDR=$(docker info --format '{{.Swarm.NodeAddr}}')
    export DTR_VERSION=2.2.5
    export UCP_USERNAME=ucpadmin
    export UCP_PASSWORD=password
    export UCP_URL=https://ucp.example.com
    export DTR_EXTERNAL_URL=https://dtr.example.com
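
    The replica ID is the 12-character suffix on the backup directory names. If you are unsure of the value, you can derive it from the dtr-rethink backup directory, for example:

    # Derive the replica ID from the backup directory name (assumes a single set of backups in ~)
    export REPLICA_ID=$(ls -d ~/dtr-rethink-* | head -n1 | sed 's/.*dtr-rethink-//')
    echo $REPLICA_ID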
    
  2. Next, determine if the volumes for your replica need to be created:

    docker volume ls --filter=name=dtr-
    
  3. If the volumes for DTR do not exist on this node, create them and use a container to copy data into them:

    for VOLUME in dtr-ca dtr-postgres dtr-registry dtr-rethink
    do
      docker volume create $VOLUME-$REPLICA_ID
      docker run --rm -v $(pwd)/$VOLUME-$REPLICA_ID:/src -v $VOLUME-$REPLICA_ID:/dst alpine cp -r /src/_data/. /dst
    done
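
    Before moving on, it is worth spot-checking that the data landed in the new volumes with the expected contents and ownership, for example:

    # List the top level of each restored volume; the contents should mirror the _data
    # directories from your backups.
    for VOLUME in dtr-ca dtr-postgres dtr-registry dtr-rethink
    do
      echo "== $VOLUME-$REPLICA_ID =="
      docker run --rm -v $VOLUME-$REPLICA_ID:/dst alpine ls -la /dst
    done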
    
  4. From a manager node, check to see if the dtr-ol overlay network exists:

    docker network ls --filter=name=dtr-ol
    
    > NETWORK ID          NAME                DRIVER              SCOPE
    > iibmipcs6fhc        dtr-ol              overlay             swarm
    
  5. If it does not exist, create it:

    docker network create -d overlay --attachable dtr-ol
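
    You can confirm the result is an attachable, swarm-scoped overlay network with:

    # Expected to print something like: overlay swarm true
    docker network inspect dtr-ol --format '{{.Driver}} {{.Scope}} {{.Attachable}}'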
    
  6. Run a rethinkdb container:

    docker run -p 8080:8080 --name dtr-rethinkdb-$REPLICA_ID -d -e DTR_VERSION=$DTR_VERSION -e DTR_REPLICA_ID=$REPLICA_ID -v dtr-ca-$REPLICA_ID:/ca --entrypoint rethinkdb -v dtr-rethink-$REPLICA_ID:/data --net dtr-ol --restart unless-stopped docker/dtr-rethink:$DTR_VERSION --bind all --no-update-check --directory /data/rethink --driver-tls-key /ca/rethink-client/key.pem --driver-tls-cert /ca/rethink-client/cert.pem --driver-tls-ca /ca/rethink/cert.pem --cluster-tls-key /ca/rethink-client/key.pem --cluster-tls-cert /ca/rethink-client/cert.pem --cluster-tls-ca /ca/rethink/cert.pem --server-tag dtr_rethinkdb_$REPLICA_ID --server-name dtr_rethinkdb_$REPLICA_ID --canonical-address dtr-rethinkdb-$REPLICA_ID.dtr-ol
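
    Give the container a few seconds to start, then confirm it is up and check its logs for errors before continuing:

    # The container should show as "Up"; the logs should show rethinkdb starting and listening.
    docker ps --filter name=dtr-rethinkdb-$REPLICA_ID
    docker logs --tail 20 dtr-rethinkdb-$REPLICA_ID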
    
  7. Start a rethinkcli container:

    docker run --rm -it --net dtr-ol -v dtr-ca-$REPLICA_ID:/ca dockerhubenterprise/rethinkcli:v2.2.0 $REPLICA_ID
    
  8. See that there is a set of unhealthy tables in our single rethink replica:

    r.db('rethinkdb').table('table_status').pluck({'status':'ready_for_writes'})
    
  9. Emergency repair all the tables:

    r.db('rethinkdb').
    table('table_status').
    pluck('db', 'name', {'status':'ready_for_writes'}).
    forEach(function(table) {
        return r.branch(
            table('status')('ready_for_writes'),
            r.expr({}),
            r.db(table('db')).table(table('name')).reconfigure({'emergencyRepair': 'unsafe_rollback'})
        );
    });
    
  10. Check to see that all tables are now healthy:

    r.db('rethinkdb').table('table_status').pluck({'status':'ready_for_writes'})
    
  11. Check to ensure our tables are healthy by requesting the first row from each table:

    r.db('rethinkdb').table('table_config').pluck('db','name').map(function(row){ return r.db(row('db')).table(row('name')).nth(0).default({}); })
    
  12. Exit rethinkcli with CTRL+D.

  13. Verify the HA configuration present inside rethinkdb:

    docker run -i --rm --net dtr-ol -v dtr-ca-$REPLICA_ID:/ca -e DTR_REPLICA_ID=$REPLICA_ID romainbelorgey/dtr-global-change getReplicas
    
  14. Delete all replicas in the configuration other than the one you want to keep (the one set in the $REPLICA_ID variable):

    Repeat the command for each replica you want to delete, replacing ID with the ID of the replica to remove:

    docker run -i --rm --net dtr-ol -v dtr-ca-$REPLICA_ID:/ca -e DTR_REPLICA_ID=$REPLICA_ID romainbelorgey/dtr-global-change removeReplica --replica-id-to-remove ID
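
    For example, if two stale replicas need to be removed, a simple loop does it (the IDs below are placeholders - substitute the stale replica IDs reported by getReplicas):

    # Hypothetical stale replica IDs shown for illustration only.
    for STALE_ID in aaaaaaaaaaaa bbbbbbbbbbbb
    do
      docker run -i --rm --net dtr-ol -v dtr-ca-$REPLICA_ID:/ca -e DTR_REPLICA_ID=$REPLICA_ID romainbelorgey/dtr-global-change removeReplica --replica-id-to-remove $STALE_ID
    done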
    
  15. Validate that only one replica remains:

    docker run -i --rm --net dtr-ol -v dtr-ca-$REPLICA_ID:/ca -e DTR_REPLICA_ID=$REPLICA_ID romainbelorgey/dtr-global-change getReplicas
    
  16. Next, scale the tables down to 1 replica:

    docker exec dtr-rethinkdb-$REPLICA_ID rethinkops scale --replicas 1
    
  17. Bring up a "fake" registry container to work around docker/dtr reconfigure's registry detection:

    docker run --name dtr-registry-$REPLICA_ID -d -e DTR_VERSION=$DTR_VERSION --entrypoint sleep docker/dtr-registry:$DTR_VERSION 10000
    
  18. Rebuild DTR using docker/dtr reconfigure, specifying a "fake" domain for the external URL - this will be fixed in the next step:

    docker run -it --rm docker/dtr:$DTR_VERSION reconfigure --existing-replica-id $REPLICA_ID --ucp-url $UCP_URL --ucp-username $UCP_USERNAME --ucp-password $UCP_PASSWORD --ucp-insecure-tls --dtr-external-url example.com
    

    This command will reach the "Waiting for DTR" step and then time out after a few minutes - it's safe to press CTRL+C to exit the bootstrapper at this point and clean up the dtr-phase2 container, or to wait for the timeout and let the container exit on its own.
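
    If you interrupt the bootstrapper, the leftover phase2 container can be removed with something like:

    # Remove the leftover phase2 container created by the interrupted reconfigure run.
    docker rm -f $(docker ps -aq --filter name=dtr-phase2)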

  19. Finally, run docker/dtr reconfigure again, specifying the correct external URL for your DTR cluster:

    docker run -it --rm docker/dtr:$DTR_VERSION reconfigure --existing-replica-id $REPLICA_ID --ucp-url $UCP_URL --ucp-username $UCP_USERNAME --ucp-password $UCP_PASSWORD --ucp-insecure-tls --dtr-external-url $DTR_EXTERNAL_URL
    

    If you need to provide your own certificates, pass them to the reconfigure command. The three files ca.pem, cert.pem, and key.pem must be present in your current directory:

    docker run -it --rm docker/dtr:$DTR_VERSION reconfigure --existing-replica-id $REPLICA_ID --ucp-url $UCP_URL --ucp-username $UCP_USERNAME --ucp-password $UCP_PASSWORD --ucp-insecure-tls --dtr-external-url $DTR_EXTERNAL_URL --dtr-ca "$(cat ca.pem)" --dtr-cert "$(cat cert.pem)" --dtr-key "$(cat key.pem)"
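
    After the reconfigure completes, one way to sanity-check the certificate being served (assuming openssl is available on the node) is:

    # Inspect the certificate presented on the DTR external URL; its subject, issuer, and
    # validity dates should match the files you passed in. Replace the hostname with yours.
    echo | openssl s_client -connect dtr.example.com:443 2>/dev/null | openssl x509 -noout -subject -issuer -dates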
    
  20. Now you should have a fully functional DTR replica. Feel free to run docker/dtr join on your two remaining replica nodes to bring the DTR cluster into a highly-available configuration (an example join command is shown below).

    Before joining, be sure to clean up any old DTR data remaining on those nodes with these commands:

    docker rm $(docker ps -q -a -fname=dtr)
    docker volume rm $(docker volume ls -q -fname=dtr)
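
    Once the nodes are clean, a join typically looks like the following - the --ucp-node value is a placeholder for the hostname of the node to add, as it appears in UCP:

    # Run once per additional node, replacing <node-hostname> with the target node's name in UCP.
    docker run -it --rm docker/dtr:$DTR_VERSION join --ucp-url $UCP_URL --ucp-username $UCP_USERNAME --ucp-password $UCP_PASSWORD --ucp-insecure-tls --ucp-node <node-hostname> --existing-replica-id $REPLICA_ID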
    
