Inspecting Docker networks

Or how not to handle a non-responsive docker container

TL;DR

Handle “Cannot start service XY: driver failed programming external connectivity on endpoint XY (ContainerID): Bind for 0.0.0.0:PORT failed: port is already allocated” by inspecting Docker’s network stack and force-disconnecting containers from the network.

The Problem

Today we had a non-responsive Docker container that we couldn’t restart. docker-compose stop and the like didn’t work. The container was still running and – especially nasty – not responding either. The website it usually serves only returned a 503 – gateway not responding.

What to do?

Well… not what we did!

Please save yourself the trouble and try everything you can with docker stop, docker rm -f or the like! It will spare you a lot of grief. And you will not need to read on.
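If you do go down that route, a sensible escalation keeps Docker’s own bookkeeping consistent at every step. A minimal sketch, shown as a dry run so nothing is actually executed; app is a placeholder container name:

```shell
# Dry-run sketch of a sane escalation path. Drop the `echo` in run() to
# actually execute the commands. "app" is a placeholder container name.
run() { echo "+ $*"; }

run docker stop -t 30 app   # SIGTERM, wait up to 30 seconds, then SIGKILL
run docker kill app         # immediate SIGKILL, but Docker stays in the loop
run docker rm -f app        # force-remove so compose can recreate the container
```

Every one of these goes through the daemon, so Docker gets to clean up its network endpoints – which is exactly what a bare kill -9 on the process skips.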

Beware! Here be dragons!

OK. So you took the hard way.

So did we! We did so by killing the process that Docker had started. sudo kill -9 finally killed the container. We were happy!

Until we wanted to restart the container.

Suddenly we got this error message:

$ docker-compose up -d
Removing containername
dbcontainername is up-to-date
Recreating f4cbe66d539c_f4cbe66d539c_f4cbe66d539c_f4cbe66d539c_containername ...
Recreating f4cbe66d539c_f4cbe66d539c_f4cbe66d539c_f4cbe66d539c_containername ... error

ERROR: for f4cbe66d539c_f4cbe66d539c_f4cbe66d539c_f4cbe66d539c_containername  Cannot start service containername: driver failed programming external connectivity on endpoint containername (63cfecca06aab92e2536a8848c2812413bb1bb34be0f1ef7ea7329101737ca1c): Bind for 0.0.0.0:8079 failed: port is already allocated

ERROR: for containername  Cannot start service containername: driver failed programming external connectivity on endpoint containername (63cfecca06aab92e2536a8848c2812413bb1bb34be0f1ef7ea7329101737ca1c): Bind for 0.0.0.0:8079 failed: port is already allocated
ERROR: Encountered errors while bringing up the project.

Wait… What? But… we just killed the container, didn’t we? How can the port still be in use?

Some netstat -tulpn later we knew why. Of course: we had forgotten about the docker-proxy process that actually handles the port publishing. Well, that can be solved easily, right? Just kill the docker-proxy! Said and done. Now everything was running smoothly.
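To see who is holding a published port, filter the listener list for it. A small sketch – port_holders is a hypothetical helper of mine, not a standard tool; on a setup like ours the matching line names a docker-proxy process:

```shell
# Filter `ss -tlnp` (or `netstat -tulpn`) output for listeners on one port.
# port_holders is a hypothetical helper; it just greps stdin for ":PORT ".
port_holders() {
  grep ":$1 " || true   # no match is not an error here, hence `|| true`
}

# usage (root needed to see process names):
#   sudo ss -tlnp | port_holders 8079
```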

At least that’s what we thought. Until we tried the docker-compose up again. And got the “port is already allocated” message again…

OK. Now we had hit somewhat of a dead end. Asking our favorite search engine revealed a lot of “good” advice, most of it boiling down to: restart the Docker daemon. I wouldn’t call that good advice on a system running multiple Docker setups, some of which are more or less production. So I didn’t.

So back to the drawing board. And now it became really interesting: our first approach was to have a look at which process was now holding the port open. netstat -tulpn and lsof are your friends here. But they did not reveal anything. The port was not open at the system level. Yet Docker was sure it was.

Finally!

It took me some chatting with our DevOps team, the distraction of a meeting or two and some trial and error to finally figure out what was going on.

Somehow Docker seemed to think that something was still listening on the port in question. So what happens if I restart the network stack for this Docker project? Good question!

Let’s give it a try. But docker network restart is not a command. So let’s try docker network rm. After all, the network will be recreated on the next docker-compose up. So nothing to lose.

It doesn’t work, because there are still containers connected to the network. And, guess what: There’s no --force available.

OK. So let’s stop all the containers attached to that network. Which ones are those? Handily, docker network inspect projectname_default tells me exactly that.

[
  {
    "Name": "projectname_default",
    "Id": "eb5271a643b15e823a1227606099236c185dec50914464e5b7f8e879a9c8800c",
    "Created": "2017-12-07T18:39:28.214702501+01:00",
    "Scope": "local",
    "Driver": "bridge",
    "EnableIPv6": false,
    "IPAM": {
      "Driver": "default",
      "Options": null,
      "Config": [
        {
          "Subnet": "172.21.0.0/16",
          "Gateway": "172.21.0.1"
        }
      ]
    },
    "Internal": false,
    "Attachable": false,
    "Containers": {
      "26a26f936360450c03db41a0c5127674ab9170a01d1f9b3f779c9dd197439e63": {
        "Name": "dbcontainername",
        "EndpointID": "7bee6f37769d877a50079ef69b4e7d1b307bac3ba92413615af1a2c05999d92d",
        "MacAddress": "02:42:ac:15:00:03",
        "IPv4Address": "172.21.0.3/16",
        "IPv6Address": ""
      },
      "7cae9f9369bdc802304fbc2451fac66156f41bc0f0495e53b1f6bbc339da8748": {
        "Name": "7cae9f9369bd_containername",
        "EndpointID": "d53772a608428e992429378491cb7e73f07c2978439d51dc3a20e16e9256dee3",
        "MacAddress": "02:42:ac:15:00:02",
        "IPv4Address": "172.21.0.2/16",
        "IPv6Address": ""
      }
    },
    "Options": {},
    "Labels": {}
  }
]
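When the container list is long, that JSON can be boiled down to just the names and container IDs. A sketch, assuming python3 is available (jq would do the same job in one line); the network name is the example from above:

```shell
# Print "name container-id" pairs from `docker network inspect` JSON on stdin.
# list_network_containers is a hypothetical helper, not a Docker command.
# The keys of the "Containers" object are the container IDs.
list_network_containers() {
  python3 -c '
import json, sys
for cid, c in json.load(sys.stdin)[0]["Containers"].items():
    print(c["Name"], cid)
'
}

# usage:
#   docker network inspect projectname_default | list_network_containers
```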

Well then: let’s shut those containers down. A docker rm [ContainerID] for each container in the list should do the trick. And it did. Apart from one container. That just didn’t exist anymore, but was still listed here in the network. And docker rm --force didn’t help either: the container simply wasn’t there. So what other options were left?

Oh! Wait! There’s also docker network disconnect, which lets me disconnect a container from a network (or is it the other way around?). Of course, that didn’t do the trick by itself, because the container wasn’t there anymore. But luckily that command actually has a --force flag!

So after running docker network disconnect --force projectname_default [ContainerID], the output of docker network inspect projectname_default finally didn’t show any containers. And now a docker network rm projectname_default was finally successful. Not that it was still necessary, but I ran it anyhow.
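The whole cleanup can be put together in one sketch. stale_containers is a hypothetical helper of mine, not a Docker command, and the docker invocations are shown commented out as usage, since they only make sense against a live daemon:

```shell
NET=projectname_default

# stale_containers reads `docker network inspect` JSON on stdin and prints the
# container IDs still registered on the network (the keys of "Containers").
stale_containers() {
  python3 -c 'import json, sys; [print(k) for k in json.load(sys.stdin)[0]["Containers"]]'
}

# Force-disconnect every listed container, then remove the network:
#   docker network inspect "$NET" | stale_containers | while read -r id; do
#     docker network disconnect --force "$NET" "$id"
#   done
#   docker network rm "$NET"   # optional; compose recreates the network anyway
```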

And then – finally – our docker-compose up -d didn’t complain about blocked ports and just started the containers.

Next time you see the “port is already allocated” error with Docker: don’t worry. You do not need to restart the Docker daemon. Inspecting the network and force-disconnecting containers might also be an option.

Or do you have a further possibility (apart from not going down that rabbit hole in the first place)?