Friday, 23 August 2024

Linux Namespaces and Collabora Online

In Collabora Online (for the normal mode of operation) we have a single server process (coolwsd) that spawns a separate process (kit) to load and manage each individual document. Each of those per-document kit processes runs in its own isolated environment. See architecture for details.

Each environment contains a minimal file system (ideally bind mounted from a template dir for speed, but linked/copied if not possible) that each kit chroots into, limiting its access to that subtree.

That chroot requires the CAP_SYS_CHROOT capability (and the desirable mount requires the CAP_SYS_ADMIN capability), and granting those capabilities to the coolforkit and coolmount binaries is a root privilege that, for typical deb/rpm packages, is done automatically at install time.

But it would be far more convenient not to require these capabilities to be set to do this isolation. They grant online more ability to affect its host system than it uses: we only want to mount dirs and chroot into dirs that belong to online, and have no need or desire to make them available to any other process or user. It's also awkward, especially during development, to require root privileges to set these capabilities.

This scenario is not unique, and Linux provides namespaces, typically used by container implementations, to support achieving this. So recent work in Collabora Online leverages these namespaces to do its own layer of per-document kit isolation. (There's a good series of articles by Steve Ovens on the various namespaces, with the one on mount namespaces the most relevant here.)

In essence, a user level process can create its own namespace in which it is apparently root from its own perspective, while from the outside perspective it remains the original uid and is limited to operating on resources that the original uid is limited to accessing. So for each forkit, instead of requiring initial system capabilities and creating a system level bind mount, we have no specific initial capabilities and enter a new namespace, unique to each forkit, in which that forkit becomes king of its own castle with apparent full capabilities, and can create bind mounts and chroot into its minimal file system.
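To make the "apparent root" idea concrete, here is a minimal sketch in Python (not the C++ that coolwsd uses) of an unprivileged process entering new user and mount namespaces and mapping its own uid and gid to 0 inside them. The function names are made up for illustration, and it assumes a Linux system that permits unprivileged user namespaces and a caller that is single-threaded.

```python
import ctypes
import os

CLONE_NEWNS = 0x00020000    # new mount namespace
CLONE_NEWUSER = 0x10000000  # new user namespace


def id_map_line(inside_id, outside_id, count=1):
    """One line for /proc/self/uid_map or gid_map: inside, outside, length."""
    return f"{inside_id} {outside_id} {count}\n"


def become_root_in_new_namespace():
    """Enter new user and mount namespaces; map our uid/gid to 0 inside."""
    uid, gid = os.geteuid(), os.getegid()  # read before unshare changes our view
    libc = ctypes.CDLL(None, use_errno=True)
    if libc.unshare(CLONE_NEWUSER | CLONE_NEWNS) != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))
    # An unprivileged process has to deny setgroups before writing gid_map.
    with open("/proc/self/setgroups", "w") as f:
        f.write("deny")
    with open("/proc/self/uid_map", "w") as f:
        f.write(id_map_line(0, uid))
    with open("/proc/self/gid_map", "w") as f:
        f.write(id_map_line(0, gid))
```

After a successful call, os.geteuid() reports 0 inside the process, while tools outside the namespace still see the original uid.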

Which is pretty magical to me as the whole existence of namespaces passed me by entirely without notice despite debuting over a decade ago.

Nothing is ever simple however, so there were some hurdles along the way.

Entering the namespace "requires that the calling process is not threaded" (man 2 unshare), which is not a problem for the normal use case in each kit, but did pose a problem for the test coolwsd does in advance to probe if there are working namespaces on the system, to determine if it should operate kits in namespace mode or not. There it turned out that the Poco::Logger we use backs up existing logs when it creates a new one, and then by default spawns a thread to compress the old log.
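As an illustration of that constraint, one way to probe for working namespaces regardless of what threads the parent has is to do the attempt in a throwaway child, since only the forking thread exists in the child after fork(). This is a sketch under that assumption, with a hypothetical function name, and is not necessarily how coolwsd resolved the problem.

```python
import ctypes
import os

CLONE_NEWUSER = 0x10000000  # new user namespace


def namespaces_usable():
    """Return True if a throwaway child process can enter a new user namespace."""
    libc = ctypes.CDLL(None, use_errno=True)  # load before fork, not in the child
    pid = os.fork()
    if pid == 0:
        # Child: single-threaded here, so unshare's threading rule is satisfied.
        ok = libc.unshare(CLONE_NEWUSER) == 0
        os._exit(0 if ok else 1)
    _, status = os.waitpid(pid, 0)
    return os.WIFEXITED(status) and os.WEXITSTATUS(status) == 0
```

The parent never changes namespace itself, so the probe has no lasting effect on the calling process.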

I initially had the vague notion that I could treat a namespace as a sort of pseudo-sudo and switch back and forth freely between them, but that's not the model; typically it's a one-way journey. Namespaces can be stacked instead, with a namespace where the original uid is mapped to (apparent) root containing another namespace where the user is mapped back to the original uid again. So we do that: each forkit enters its initial namespace and is mapped to root, does the mounts, enters another nested namespace mapped back to the original uid, chroots, and drops all of the capabilities gained on entering a namespace. This aligns the namespace mode with the expectations of the non-namespace mode as to what effective uid the kit appears to run as.
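The stacking is easier to see in terms of the uid_map contents of the two namespaces. The sketch below only computes and interprets the map lines; it doesn't enter any namespace. The helper names are hypothetical. Each line has the form "inside outside length", and for a process writing its own map the "outside" id is expressed in the parent namespace, which is why the inner map refers to 0.

```python
def nested_uid_maps(original_uid):
    """Return the uid_map lines for the outer and the nested user namespace."""
    outer = f"0 {original_uid} 1\n"  # outer: original uid appears as root
    inner = f"{original_uid} 0 1\n"  # inner: that root appears as original uid
    return outer, inner


def to_parent_id(inside_id, map_line):
    """Translate an id seen inside a namespace to the id in its parent."""
    start_inside, start_outside, length = (int(x) for x in map_line.split())
    if not start_inside <= inside_id < start_inside + length:
        raise ValueError("id is not mapped")
    return start_outside + (inside_id - start_inside)
```

For an original uid of 1000, translating 1000 through the inner map gives 0 in the outer namespace, and translating that 0 through the outer map gives 1000 on the host, which is the round trip described above.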

The mounts that each forkit does are private to that forkit, so while in the non-namespace case the mounts are visible system-wide, in the namespace case the mounts are not visible either to other forkits or to the parent coolwsd. So how the document is provided by coolwsd to a child kit had to be adapted for the new mode of even less potential leakage between components.

There was a glitch in mounting, because when we bind mount dirs from our system template we want them to be readonly, which requires the typical Linux two-step process of mount and remount with readonly flags. This worked for the non-namespace case, but failed for namespaces even though the initial mount succeeded. Here we had an extra flag of MS_NOATIME when remounting to potentially shave a little time off use of the kit jail, but in namespaces removing that option from the underlying system mount isn't permitted.
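The two-step readonly bind mount looks roughly like the following when done through mount(2) directly. This is a sketch, not the Collabora Online code: the function names are invented, the flag values are the ones from the Linux headers, and the exact flag set that coolwsd ends up passing is not shown here. The point is only that the remount leaves MS_NOATIME out.

```python
import ctypes
import os

MS_RDONLY = 0x1     # mount read-only
MS_REMOUNT = 0x20   # alter flags of an existing mount
MS_NOATIME = 0x400  # do not update access times
MS_BIND = 0x1000    # bind mount

# Flags for the second step; deliberately without MS_NOATIME.
READONLY_REMOUNT_FLAGS = MS_BIND | MS_REMOUNT | MS_RDONLY


def _mount(source, target, flags):
    """Thin wrapper over mount(2) that raises OSError on failure."""
    libc = ctypes.CDLL(None, use_errno=True)
    libc.mount.argtypes = [ctypes.c_char_p, ctypes.c_char_p, ctypes.c_char_p,
                           ctypes.c_ulong, ctypes.c_void_p]
    libc.mount.restype = ctypes.c_int
    if libc.mount(os.fsencode(source), os.fsencode(target), None, flags, None) != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err), target)


def bind_mount_readonly(source, target):
    """Step 1: bind mount. Step 2: remount that bind mount read-only."""
    _mount(source, target, MS_BIND)
    _mount(source, target, READONLY_REMOUNT_FLAGS)
```

Calling bind_mount_readonly needs mount privileges, either real ones or the apparent ones gained inside a user and mount namespace.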

Despite that mount flag change giving working namespace-using kits directly inside a top-level OS, one of our lxc-using CI systems still refused to allow a readonly remount in a namespace to work. The catch here was that lxc is bundled with default apparmor rules which additionally restrict a readonly remount call to a certain set of arguments, which our remount effort didn't match, so that had to be adjusted, specifically the rather obscure MS_SILENT use.

Performance-wise, an unexpected (to me at least) side effect of using namespaces is that the coolwsd measurement of the time to spawn a forkit on my hardware has reduced from an average of 39.63ms per spawn to an average of 6.15ms per spawn, which wasn't the primary goal but is a nice benefit.

Surveying distros where namespaces are available by default suggests:

RHEL/CentOS

  • 8.0+ works with namespaces out of the box
  • 7.9 (EOL) not enabled by default, possible with
    • echo 10000 > /proc/sys/user/max_user_namespaces

Debian

  • 11+ (bullseye) works with namespaces out of the box
  • 10 (buster) EOL, not enabled by default, possible with
    • sudo sysctl -w kernel.unprivileged_userns_clone=1

Ubuntu

  • 16.04+ works with namespaces out of the box

Ubuntu 24.04 however, while supporting namespaces out of the box, has restricted namespaces via apparmor rules, which complicates things again, so Collabora Online .deb packages install an apparmor profile to enable it to use namespaces out of the box.
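For checking a given host, the two sysctls mentioned in the list above can be read straight from /proc. The sketch below uses hypothetical helper names and only reports the values. A missing file is returned as None, which on most current distros simply means that particular knob doesn't exist there. It says nothing about the apparmor restrictions.

```python
def read_int(path):
    """Return the integer stored in a /proc/sys file, or None if unavailable."""
    try:
        with open(path) as f:
            return int(f.read().strip())
    except (OSError, ValueError):
        return None


# sysctl name -> /proc/sys path, for the two settings listed above
USERNS_SYSCTLS = {
    "user.max_user_namespaces": "/proc/sys/user/max_user_namespaces",
    "kernel.unprivileged_userns_clone": "/proc/sys/kernel/unprivileged_userns_clone",
}


def userns_settings():
    """Map each sysctl name to its current value, or None if it is absent."""
    return {name: read_int(path) for name, path in USERNS_SYSCTLS.items()}
```

A value of 0 for either setting suggests unprivileged user namespaces are switched off on that host.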