Underneath the virtual world - Modern virtualization explained

Modern virtualization and its foundations explained

⚠️ This article may not be 100% accurate; do not hesitate to report any mistakes.

The concept of virtualization was born in the late 1960s and early 1970s, when IBM invested a lot of time and effort in developing robust time-sharing solutions. Time-sharing refers to the shared usage of computer resources among multiple users. Virtualization then evolved during the 1990s, keeping this idea of resource pooling in mind while improving overall performance and widening the field of possibilities, to become in the 2000s what I will call “modern virtualization”, the virtualization we know today.

Virtualization explained

The term virtualization includes notions like emulation, para-virtualization, containerization, full virtualization and more.

The notion of para-virtualization is a bit different from the others because the guest (or virtual machine) operating system has been specifically modified to support it. The intent is to reduce the time spent performing operations that are difficult to run in a virtual environment. We will not discuss para-virtualization any further, but rather focus on the other concepts listed above, which are already a big topic to address.

The main goal of virtualization is to abstract physical resources, that is, to create aggregated pools of logical resources (CPU, memory, storage, networking, …), thus maximizing resource usage and decreasing costs. It can also be used to isolate a specific environment from the rest of the system, as is the case with containerization.

But then what are the differences between full virtualization, emulation, and containerization?

Full virtualization, also called hardware virtualization, is what we usually just call “virtualization”. It refers to the creation of a virtual system that acts like a real system. The guest OS can be aware that it is a virtual machine, and some ship with optimizations for virtual machines, but this is closer to para-virtualization than to full virtualization. Full virtualization relies on a hypervisor, of which there are two types:

type 1 or bare-metal, which is basically an OS directly installed on a server (e.g. VMware vSphere, Xen)
type 2 or hosted, which is software installed on an existing OS (e.g. QEMU/KVM, VMware Workstation)

Full virtualization is nowadays hardware-assisted: the CPU lets the hypervisor run guest instructions directly on the physical processor as long as they cannot alter the state of anything outside the virtual machine. Other instructions must be intercepted and simulated to ensure that the virtual layer is not breached. Most modern CPUs and other hardware components provide support for hardware-assisted virtualization.
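To check whether a CPU advertises the hardware support mentioned above, one option on x86 Linux is to look at the flags in /proc/cpuinfo. The sketch below is only an illustration: it assumes an x86 machine, where the vmx flag stands for Intel VT-x and the svm flag for AMD-V (these names are not taken from this article), and the function name is made up for the example.

```python
from pathlib import Path

# Flag names as they appear in /proc/cpuinfo on x86 Linux (assumption: x86 only).
VIRT_FLAGS = {"vmx": "Intel VT-x", "svm": "AMD-V"}


def virtualization_extensions(cpuinfo_text: str) -> set:
    """Return the hardware virtualization extensions found in cpuinfo text."""
    found = set()
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            # The flags line is a space-separated list of CPU feature names.
            for flag in value.split():
                if flag in VIRT_FLAGS:
                    found.add(VIRT_FLAGS[flag])
    return found


if __name__ == "__main__":
    cpuinfo = Path("/proc/cpuinfo")
    if cpuinfo.exists():
        extensions = virtualization_extensions(cpuinfo.read_text())
        print(extensions or "no vmx/svm flag found")
    else:
        print("/proc/cpuinfo is not available on this system")
```

An empty result does not necessarily mean the CPU lacks the feature: the extension may simply be disabled in the firmware settings, or hidden because the system is itself a virtual machine.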

Emulation has the same goal of creating a virtual system. The difference is that emulation runs software that acts as the virtual hardware components, CPU included, instead of executing guest instructions directly on the physical hardware. The performance of emulation is therefore much lower than that of virtualization.

Containerization is very similar to virtualization and can be seen as lightweight virtualization. Containers are isolated environments that share the kernel with their host, so only the application layer needs to be virtualized, not the entire operating system.

Containerization and virtualization are similar concepts. Depending on how you use them, they share more or less the same goals: isolation and resource pooling. To differentiate them, I would say that containerization is lightweight virtualization with weaker isolation.

Under the virtual layer

Containerization and virtualization are similar but do not share the same underlying concepts or at least not in the same way.

In the following part, I will try to explain these underlying mechanics. I will only talk about Linux kernel mechanics because the kernel is open source, so much more information is available about them. We will therefore mainly talk about what lies under QEMU/KVM and Docker, which are among the most used virtualization and containerization solutions on Linux.

Docker is a containerization solution. It relies on two concepts that will be discussed later:

cgroups
namespaces

QEMU is an emulator. Associated with KVM (Kernel-based Virtual Machine), which is a kernel module, it “becomes” a hypervisor: KVM provides QEMU with support for hardware-assisted virtualization.

Because QEMU is an emulator, running a virtual machine only spawns a few tasks (processes) on your system. These tasks are responsible for hardware emulation.

The virtual machine is completely isolated from the host: the host is invisible from the VM, and the VM is only visible through the emulator (e.g. you will not be able to see from the host any process that is running in the VM).

KVM comes into play when instructions are to be executed on the physical CPU. Since KVM is a kernel module, it runs in kernel mode and therefore has full access to the CPU. QEMU hands over to KVM any instruction that cannot “breach” the virtual layer, which speeds up its execution. Instructions that would affect the host are run on the emulated CPU, where execution is much slower.
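Whether QEMU can rely on KVM on a given host can be roughly checked from user space: when the KVM module is loaded, it exposes the /dev/kvm device. The following is a minimal sketch under that assumption; the function name and the three status strings are invented for the example.

```python
import os

# Device node exposed by the KVM kernel module when it is loaded.
KVM_DEVICE = "/dev/kvm"


def kvm_status(device: str = KVM_DEVICE) -> str:
    """Return "missing", "no-permission" or "usable" for the given device path."""
    if not os.path.exists(device):
        # Module not loaded, no hardware support, or not a Linux host.
        return "missing"
    if os.access(device, os.R_OK | os.W_OK):
        return "usable"
    # The device exists but the current user may not open it read/write.
    return "no-permission"


if __name__ == "__main__":
    print(f"{KVM_DEVICE}: {kvm_status()}")
```

When the device is missing or not accessible, QEMU can still run the guest, but it has to fall back on pure emulation, with the performance cost described above.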

cgroups

Control groups (cgroups) are a feature of the Linux kernel. They limit, isolate, and monitor the resource usage of a group of processes. Limits can be set for all kinds of resources: CPU, memory, network, and IO (i.e. block device operations).

cgroups come in the form of a hierarchy, represented by a cgroup filesystem. In cgroups v1 there is one cgroup filesystem per resource kind, all located under /sys/fs/cgroup (cgroups v2 uses a single unified hierarchy instead). Each task on your system belongs to exactly one cgroup in each hierarchy (e.g. run cat /proc/self/cgroup to list the cgroups of your current shell).
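The content of /proc/self/cgroup is easy to read programmatically. Each line has the form hierarchy-ID:controller-list:cgroup-path, and on cgroups v2 there is a single line whose controller list is empty. The sketch below parses that format; the CgroupEntry type and the function name are made up for illustration.

```python
from pathlib import Path
from typing import List, NamedTuple, Tuple


class CgroupEntry(NamedTuple):
    hierarchy_id: int
    controllers: Tuple[str, ...]
    path: str


def parse_proc_cgroup(text: str) -> List[CgroupEntry]:
    """Parse the content of a /proc/<pid>/cgroup file."""
    entries = []
    for line in text.splitlines():
        if not line.strip():
            continue
        # Split on the first two colons only: the path itself may contain colons.
        hierarchy_id, controllers, path = line.split(":", 2)
        names = tuple(name for name in controllers.split(",") if name)
        entries.append(CgroupEntry(int(hierarchy_id), names, path))
    return entries


if __name__ == "__main__":
    proc_file = Path("/proc/self/cgroup")
    if proc_file.exists():
        for entry in parse_proc_cgroup(proc_file.read_text()):
            print(entry)
    else:
        print("/proc/self/cgroup is not available on this system")
```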

cgroups are not specifically meant for virtualization and have many other applications, such as reserving a certain amount of memory for a specific application to make sure it never runs out of memory. However, they fit the needs of virtualization and containerization well, and both use them (for all kinds of resources).
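As an illustration of the memory case, here is a small sketch that reads the memory limit of a cgroup. It assumes cgroups v2, where the memory controller exposes the hard limit in a file named memory.max that contains either a number of bytes or the word max (no limit). The function names and the directory used in the demo are assumptions made for the example.

```python
from pathlib import Path
from typing import Optional


def parse_memory_max(text: str) -> Optional[int]:
    """Parse the content of a cgroup v2 memory.max file.

    Returns the limit in bytes, or None when the file contains "max" (no limit).
    """
    value = text.strip()
    return None if value == "max" else int(value)


def read_memory_limit(cgroup_dir: str) -> Optional[int]:
    """Read memory.max from a cgroup v2 directory (raises OSError if absent)."""
    return parse_memory_max((Path(cgroup_dir) / "memory.max").read_text())


if __name__ == "__main__":
    # Hypothetical choice: replace with the directory of the cgroup to inspect.
    # The root cgroup has no memory.max file, so this may report an error.
    cgroup_dir = "/sys/fs/cgroup"
    try:
        limit = read_memory_limit(cgroup_dir)
        print("no limit" if limit is None else f"{limit} bytes")
    except OSError as exc:
        print(f"could not read memory.max: {exc}")
```

Setting a limit works the other way around, by writing a value into the same file, which normally requires root privileges or a delegated cgroup.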

Because Docker containers share the kernel with their host and cgroups is a kernel feature, listing the cgroups of a process running in a container displays the same output whether you list them from the host or from the container (unless the container runs in its own cgroup namespace). Inside the container, you will see that your process belongs to cgroups whose paths start with /docker, but these cgroups do not appear in the container’s own cgroup filesystem.

Note that cgroups are also used with QEMU to limit the resources allocated to the emulated systems.

Namespaces

Namespaces are another feature of the Linux kernel. Linux resources form hierarchies: for filesystems, the hierarchy starts at the root; for processes, it starts at PID 1 (init); and so on. The namespaces feature allows the creation of subtrees in these hierarchies in such a way that a subtree cannot access, or even see, the others.

There are multiple kinds of namespaces, each wrapping a specific resource type:

cgroup
IPC (Inter-Process Communication)
network
mount
PID
time
user
UTS (hostname and NIS domain name)
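The namespaces a process belongs to can be inspected through /proc/<pid>/ns, where each entry is a symbolic link whose target looks like pid:[4026531836]. Two processes are in the same namespace when these inode numbers are equal. The sketch below lists them for the current process; the helper names are invented for the example.

```python
import os
import re
from typing import Dict, Tuple

# Link targets in /proc/<pid>/ns look like "pid:[4026531836]".
NS_LINK = re.compile(r"^(?P<kind>[a-z_]+):\[(?P<inode>\d+)\]$")


def parse_ns_link(target: str) -> Tuple[str, int]:
    """Split a namespace link target into (namespace kind, inode number)."""
    match = NS_LINK.match(target)
    if match is None:
        raise ValueError(f"unexpected namespace link: {target!r}")
    return match["kind"], int(match["inode"])


def list_namespaces(pid: str = "self") -> Dict[str, int]:
    """Map each entry of /proc/<pid>/ns to its namespace inode number."""
    ns_dir = f"/proc/{pid}/ns"
    namespaces: Dict[str, int] = {}
    try:
        names = sorted(os.listdir(ns_dir))
    except OSError:
        return namespaces  # no such process, or not a Linux system
    for name in names:
        try:
            namespaces[name] = parse_ns_link(os.readlink(os.path.join(ns_dir, name)))[1]
        except (OSError, ValueError):
            continue  # entry not readable by the current user
    return namespaces


if __name__ == "__main__":
    for name, inode in list_namespaces().items():
        print(f"{name}: {inode}")
```

Running it once on the host and once inside a container, then comparing the numbers, shows which namespaces the container does not share with its host.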

For example, mount namespaces are useful for isolating a container’s filesystem. Docker creates a new filesystem (an overlay filesystem) for the container, which is mounted on the host filesystem and wrapped in a namespace. The container filesystem is thus visible from the host through its mount point (/var/lib/docker/overlay2/<digest>).
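From the host, these overlay mounts can be spotted in /proc/mounts, where each line reads: device, mount point, filesystem type, options, and two numeric fields. The following sketch filters the overlay entries and splits their options; the function name is made up, and any path used to try it out is purely hypothetical.

```python
from pathlib import Path
from typing import Dict, List, Tuple


def overlay_mounts(mounts_text: str) -> List[Tuple[str, Dict[str, str]]]:
    """Return (mount point, options) for every overlay entry in /proc/mounts text."""
    result = []
    for line in mounts_text.splitlines():
        fields = line.split()
        # Fields: device, mount point, filesystem type, options, dump, pass.
        if len(fields) >= 4 and fields[2] == "overlay":
            options = {}
            for option in fields[3].split(","):
                key, _, value = option.partition("=")
                options[key] = value  # flags such as "rw" get an empty value
            result.append((fields[1], options))
    return result


if __name__ == "__main__":
    mounts = Path("/proc/mounts")
    if mounts.exists():
        for mount_point, options in overlay_mounts(mounts.read_text()):
            print(mount_point, options.get("lowerdir", ""))
    else:
        print("/proc/mounts is not available on this system")
```

The lowerdir, upperdir and workdir options show which directories the overlay filesystem stacks together.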

The same concept applies to PIDs: we do not want the processes inside a container to be able to see the ones on the host. Even worse, if the container ran in privileged mode without a PID namespace, it would be able to kill host processes! Inside a PID namespace, the count restarts at 1, so inside the container the init process has PID 1. On the host side, however, the “real” PID appears and you can, of course, kill any process running inside the container since it is part of the host PID tree.
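This double numbering can be observed from the host: since Linux 4.1, /proc/<pid>/status contains an NSpid line listing the PID of the process in each PID namespace it belongs to, from the outermost namespace to the innermost one. Here is a minimal sketch that extracts it; the function name is an assumption made for the example.

```python
from pathlib import Path
from typing import List


def pids_across_namespaces(status_text: str) -> List[int]:
    """Return the PIDs on the NSpid line, outermost namespace first.

    Returns an empty list when the line is absent (kernels older than 4.1).
    """
    for line in status_text.splitlines():
        if line.startswith("NSpid:"):
            return [int(value) for value in line.split()[1:]]
    return []


if __name__ == "__main__":
    status = Path("/proc/self/status")
    if status.exists():
        print(pids_across_namespaces(status.read_text()))
    else:
        print("/proc/self/status is not available on this system")
```

For a container's init process read from the host, the list would hold the host-side PID first and 1 last. For a process that is not in a nested PID namespace, it holds a single value.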

Namespaces are a good way to protect host integrity and provide container isolation while keeping the container close to its host.

In this short article, I tried to explain, as best I could, the differences between all “kinds” of virtualization and to present some of their underlying mechanics. This is an extremely vast topic and I’ve tried to make it understandable, for you but also for me!

Did you know?

Here is some bonus information about virtualization:

A guest OS must use the same architecture (or instruction set) as the physical CPU to be virtualized; otherwise you will have to emulate it. The host OS architecture, on the other hand, is not important: you can run a 64-bit VM on a 32-bit host as long as you have a 64-bit processor. Funny, right?
Many people think that the well-known vSphere hypervisor (f.k.a. ESXi) is based on the Linux kernel, but it is not! In fact, it is not even part of the UNIX OS family. However, it does integrate BusyBox, which makes it “look like” a UNIX system from the user’s point of view. The vSphere hypervisor uses a proprietary kernel called VMkernel that implements a subset of the Linux kernel interfaces (mainly the driver part). VMware ESX (the ancestor of vSphere/ESXi) used to be based on Linux; back then, VMkernel was loaded as a kernel module. Now VMkernel is, despite its name, a whole OS.

References

Here is a non-exhaustive list of resources I consulted to write this article:

man (7) pages (cgroups, namespaces)
Arch Wiki
- QEMU
- KVM
- cgroups
Docker docs
IBM Cloud Learn Hub
Wikipedia
VMware
- The Architecture of VMware ESXi
- “It’s a Unix system, I know this!”
CGROUPS
PID namespaces
chroot, cgroups and namespaces - An overview
What is the difference between QEMU and KVM?
Containerization vs Virtualization
Oracle VM