Guest Author: Jérôme Petazzoni
I was given the opportunity to test AppSwitch, a network stack for containers and hybrid setups that promises to be super easy to deploy and configure, while offering outstanding performance. Sounds too good to be true? Let’s find out.
A bit of context
One of the best perks of my job at Docker has been the incredible connections that I was able to make in the industry. That’s how I met Dinesh Subhraveti, one of the original authors of Linux Containers. Dinesh gave me a sneak peek at his new project, AppSwitch.
AppSwitch abstracts the networking stack of an application, just like containers (and Docker in particular) abstract the compute dimension of the application. At first, I found this statement mysterious (what does it mean exactly?), bold (huge if true!), and exciting (because container networking is hard).
The state of container networking
There are (from my perspective) two major options today for container networking: CNM and CNI.
CNM, the Container Network Model, was introduced by Docker. It lets you create networks that are secure by default, in the sense that they are isolated from each other. A given container can belong to zero, one, or many networks. This is conceptually similar to VLANs, a technology that has been used for decades to partition and segregate Ethernet networks. CNM doesn’t require you to use overlay networks, but in practice, most CNM implementations will create multiple overlay networks.
CNI, the Container Network Interface, was designed for Kubernetes, but is also used by other orchestrators. With CNI, all containers are placed on one big flat network, and they can all communicate with each other. Isolation is a separate feature (implemented in Kubernetes with network policies.) On the other hand, this simpler model is easier to understand and implement from a sysadmin and netadmin point of view, since a straightforward implementation can be done with plain routing and CIDR subnets. That being said, a lot of CNI plugins still rely on some sort of overlay network anyway.
Both approaches have pros and cons, and if you ask your developers, your sysadmins, and your security team what to do, it can be very difficult to get everyone to agree! In particular, if you need to blend different platforms: containers and VMs, Swarm and Kubernetes, Google Cloud and Azure, …
That’s why something like AppSwitch is relevant, since it abstracts all that stuff. OK, but how?
How AppSwitch works
Or rather: how I think it works, in a simplified way.
AppSwitch intercepts all networking calls made by a process (or group of processes). When you create a socket, AppSwitch “sees” it. When you connect() to something, if the destination address is known to AppSwitch, it will directly plumb the connection where it needs to go. And it learns about servers when they call bind().
“Intercepting network calls? Isn’t that … slow?”
No! Indeed, it would be slow if it were using standard ptrace(). But instead, it is using mechanisms that are similar to the ones used to execute e.g. kernel performance profiling. These mechanisms are specifically designed to have a very low overhead.
Furthermore, AppSwitch doesn’t intercept the data path calls (like read and write). That means, the IO speeds are at least as good as native. I say at least because AppSwitch may transparently shortcircuit the network endpoints over a fast UNIX connection when possible.
“Do I need to adapt or recompile my code?”
No, you don’t need to recompile, and as far as I understand, that works even if you’re not using the libc; and even if your binaries are statically linked. Cool.
The exact UX for AppSwitch is not finalized yet. In the version that I have tested, you execute your programs with a special ax executable, conceptually similar to sudo, chroot, nsenter, etc. A program started this way, and all its children, will be using AppSwitch for their network stack. It is also possible to tie AppSwitch to network namespaces or even other kind of namespaces so that the program doesn’t have to be started with ax.
There is a demo at the end of this post, but keep in mind that the final UX may be different.
This system has a handful of advantages. It abstracts the network stack, but it also simplifies the actual traffic on the network. In some scenarios, this is going to yield better performances. And in the long term, I believe that it will also bring subtle improvements, similar to what unikernels have done in the past (and will do in the future).
Let’s break that down quickly.
Independent of CNI, CNM, or what have you
Network mechanisms in container-land rely heavily on network namespaces. Virtual machines rely heavily on virtual NICs. AppSwitch abstracts both things away. The network API is now at the kernel level. You run Linux code? You’re good. (Windows apps are a different story, of course.)
You can connect together applications running in containers, VMs, physical machines, and it’s completely transparent.
Simpler networking stack
As we’ve seen in the introduction, overlay networks are very frequent in the world of containers. As a result, when a container communicates with another, we get network traffic looking like this:
Multiple levels of packet encapsulation
(Slide from Laurent Bernaille’s presentation Deeper Dive In Overlay Networks, DockerCon Europe 2017.)
The useful payload in that diagram is within the black rectangle. Everything else is overhead. Granted, that overhead is small (a few bytes each time), which is why overlay networks aren’t that bad in practice (if they are implemented correctly). But there is also a significant operational cost, as those layers add complexity to the system, making it rigid and/or difficult to setup and operate.
AppSwitch lets us get rid of these layers, because once a connection has been identified at the socket level, we do not need any of the other identity information. It reminds me a little bit of ATM (or the more recent MPLS), where packets do not contain full information like “this is a packet from host H1 on port P1 to host H2 to port P2.” Instead, each packet carries only a short label, and that label is enough for the recipient to know which flow the packet belongs to. AppSwitch somehow seems to do this without touching the packets or even having access to them.
Faster local communication
One thing that network people like to do is to complain about the performance of the Linux bridge. The core of the issue is that the Linux bridge code was single threaded (I don’t know if that’s still the case), and this would slow down container-to-container communication (as well as anything going over a Linux bridge). There are remediations (like using Open vSwitch, for instance) but AppSwitch lets us sidestep the problem entirely.
In the special case of two containers communicating locally, the classic flow would look like this:
write() -> TCP -> IP -> veth -> bridge -> veth -> IP -> TCP -> read()
And with AppSwitch, it becomes this:
write() -> UNIX -> read()
There is not even a real IP stack in that scenario and the application doesn’t even notice it!
Alright, a little less conversation a little more action!
Remember: the exact UX of AppSwitch will probably be different, but this is what I tested looks like.
First of all, since I wanted to understand the installation process from A to Z, I did carefully read the docs, and then I decided to reduce the instructions to the bare minimum.
First, I clone the AppSwitch repository:
git clone [email protected]:appswitch/appswitch
Then, I compile the kernel module. Wait, a kernel module? As I mentioned earlier, this version of AppSwitch works by intercepting network calls at the kernel level. But I’m told that in a future version, it will be able to offer the same functionality (and performance!) without needing the kernel module.
cp trap/ax.ko ~
Next up, I copy AppSwitch userland piece. That part is written in Go, and currently recommends Go 1.8. I’m using a trick to run the build with Go 1.8, regardless of the exact version of Go on my machine:
docker run \
-v /usr/local/sbin:/go/bin \
-v $(pwd):/go/src/github.com/appswitch/appswitch \
golang:1.8 go get -v github.com/appswitch/appswitch/ax
(If you just went “WAT!” at this and are curious, you can check this other blog post for crunchy details.)
Then I copy the module and userland binary to my test cluster. (The IP addresses of my test machines are in ~/hosts.txt, and I use parallel-ssh to control my machines.)
tar -cf- /usr/local/sbin/ax* ~/ax.ko |
parallel-ssh -I -h ~/hosts.txt -O StrictHostKeyChecking=no \
sudo tar -C/ -xf-
We can now load the module:
parallel-ssh -h ~/hosts.txt -O StrictHostKeyChecking=no \
sudo insmod ~/ax.ko
Next, we need to run the AppSwitch daemon on our cluster. AppSwitch is structured like the early versions of Docker: the daemon and client are packaged together as one statically linked binary. There is another similarity: the daemon exposes a REST API that the client consumes.
The exact command that we need to run is highlighted below; it boils down to running ax -daemon -service.neighbors X where X is one (or multiple comma-separated) address of another node. You don’t need to specify all nodes: AppSwitch will use Serf to establish cluster membership. I specify two nodes in the example below to be on the safe side if one node is down for an extended period of time.
The whole command is wrapped within a systemd unit, because why not.
parallel-ssh -I -h ~/hosts.txt sudo tee /etc/systemd/system/appswitch.service <