A short dive into KVM

Intro

Since I have an interest in systems programming, it is time to learn how to really use KVM. To what purpose, you ask? To learn new things and have a bit of fun in the process!

KVM (Kernel-based Virtual Machine) is an interface provided by the Linux kernel that lets the kernel act as a hypervisor. A hypervisor is a program that supervises other operating systems running in virtual machines. By using the KVM interface we can use Linux as a hypervisor to run virtual machines with hardware acceleration. This is useful because it allows us to run multiple operating systems on the same hardware at the same time. It is also useful as a way to isolate untrusted programs from each other.

So far I have only used KVM via tools such as [libvirt] or [qemu] but sometimes I wonder how difficult it would be to build my own tools.

Therefore, in this blogpost we will learn about the KVM API by writing a C program that runs some machine code from a file. The program will create a virtual machine using the KVM API, map the code into the virtual machine, and finally run the virtual machine. The code from the file, running inside the virtual machine, will print a subsection of the ASCII table as a demonstration that it actually runs.

Some good resources to read

This blogpost does not follow the order in which I actually did things. The first thing I did was to look around for some good reading material on the topic. After some searching I found these two to be the most useful.

The Definitive KVM (Kernel-based Virtual Machine) API Documentation

https://www.kernel.org/doc/Documentation/virt/kvm/api.txt

The title says it all. In fact this blogpost is mostly my notes from following that article.

Using the KVM API

This was a good resource.

https://lwn.net/Articles/658511/

Getting started

Before we begin we have some boilerplate utilities to get out of the way. These headers and the PANIC() macro are used in all of the source code shown later.

// Define feature macros so that various things become available.
#define _DEFAULT_SOURCE
#define _POSIX_C_SOURCE 200809L

// For reading/writing to the console and calling exit().
// Precise integers since we are doing ABI things.
// Finally errno is useful in panic() for debugging.
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <errno.h>

// Then we need various headers since we will be opening and manipulating file descriptors. 
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>

// KVM is controlled mostly via the ioctl() system call.
#include <sys/ioctl.h>

// There is also a fair bit of mapped memory so we need mmap()
#include <sys/mman.h>

// Last but not least there are some structures that we need definitions for. 
#include <linux/kvm.h>

// A basic panic() that prints the function, source code line, a message and errno.
// We will use this everywhere there may be an error.
// Doing it this way keeps the examples short but is not recommended for real programs.
static void panic(const char* func, unsigned line, const char* msg) {
    fprintf(stderr, "%s:%u %s (errno=%d)\n", func, line, msg, errno);
    fflush(stderr);
    exit(EXIT_FAILURE);
}

// A macro that uses __func__ to get the function name and __LINE__ to get the source code line.
#define PANIC(msg) panic(__func__, __LINE__, (msg))

Initializing KVM

The KVM device /dev/kvm is the entrypoint into KVM and the first step is opening a file descriptor to it. Once we have a file descriptor it can be used with the ioctl() function to interact with the KVM subsystem in Linux.

int kvmfd = open("/dev/kvm", O_RDWR);
if (kvmfd == -1) {
    PANIC("open kvm");
}

Checking the KVM version

There have been a few incompatible versions of the KVM API during its development, so before we proceed we should check the KVM version. Technically we could skip this step since there is really only one KVM version currently in use. That version is version 12, which is frozen and comes with guarantees of backwards compatibility.

However, if the kernel that our program runs on is too old then the program will not work, and therefore we check the version.

int get_version_ret = ioctl(kvmfd, KVM_GET_API_VERSION, NULL);
if (get_version_ret == -1) {
    PANIC("could not get version");
} else if (get_version_ret != 12) {
    PANIC("unexpected kvm version");
}
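Since the version number is frozen, the API documentation describes KVM_CHECK_EXTENSION as the way to ask whether a particular capability is available. Below is a minimal sketch of such a check for KVM_CAP_USER_MEMORY, the capability behind the memory region call used later. The helper name kvm_has_user_memory() is made up for this post, and the program above does not perform this check.

```c
#include <fcntl.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/kvm.h>

// Hypothetical helper (the name is an assumption, not part of the KVM API).
// Returns 1 if KVM reports KVM_CAP_USER_MEMORY, 0 if it does not,
// and -1 if /dev/kvm could not be opened or queried.
static int kvm_has_user_memory(void) {
    int fd = open("/dev/kvm", O_RDWR);
    if (fd == -1) {
        return -1;
    }
    // KVM_CHECK_EXTENSION returns 0 for "not supported" and a positive value otherwise.
    int ret = ioctl(fd, KVM_CHECK_EXTENSION, KVM_CAP_USER_MEMORY);
    close(fd);
    if (ret == -1) {
        return -1;
    }
    return ret > 0 ? 1 : 0;
}
```

In the real program the check would reuse the already open kvmfd instead of opening the device a second time.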

Getting the vCPU mmap() size

Some parts of KVM use shared memory regions to communicate between userspace, kernel and guest. One such case is the struct kvm_run which is used when we call into the virtual machine, for example to actually run the virtual machine. Because the kvm_run structure has fields that are specific to the hardware being virtualized the structure has a platform specific size. So before we can map the memory of the structure we have to ask for the size of the structure.

int kvm_run_size = ioctl(kvmfd, KVM_GET_VCPU_MMAP_SIZE, NULL);
if (kvm_run_size < 0) {
    PANIC("could not get kvm_run size");
}

Creating a VM

KVM_CREATE_VM

Creating the virtual machine is done using the KVM_CREATE_VM request with ioctl(), which gives us a file descriptor representing the virtual machine. This file descriptor is then used to create other things related to the virtual machine, such as virtual CPUs, memory, or devices. To keep things brief we will only work with vCPUs and memory.

int vmfd = ioctl(kvmfd, KVM_CREATE_VM, (unsigned long) 0);
if (vmfd == -1) {
    PANIC("create vm");
}

Mapping code

Next thing is to add some memory to the virtual machine so that we have someplace to store code. This time we are only going to add a single memory region for the code that the VM will run, but in a real implementation we would probably map a lot more memory for general use.

We will get the code from a file that contains raw machine code with no other structure. By working with raw machine code we can avoid implementing a loader, relocations and other tricky things. Instead it is enough to open a file descriptor, mmap() the entire file, and then add that memory to the virtual machine.

// Open the file that has the code.
int codefd = open("ascii.bin", O_RDWR);
if (codefd == -1) {
    PANIC("open codefile");
}

// Get the stat (size) of the file.
struct stat cstat = { 0 };
if (fstat(codefd, &cstat)) {
    PANIC("fstat codefile");
}

// TODO Check that the file is not too large, because our virtual machine is 16-bit.

// NOTE: KVM_SET_USER_MEMORY_REGION expects the size of a region to be a multiple of the
// page size, and our code file is smaller than a page.
// So we round the size up to whole pages (assuming 4 KiB pages) and map a region
// larger than the file. This is supported by mmap().
// The space between the end of the file and the end of its last page reads as zero,
// and writes to it are not carried through to the file.
size_t map_size = ((size_t) cstat.st_size + 0xfff) & ~(size_t) 0xfff;
if (map_size == 0) {
    map_size = 0x1000;
}

// Map the file contents into memory.
// Note: with MAP_SHARED and PROT_WRITE, anything the guest writes to this memory is
// carried through to the file. MAP_PRIVATE would keep such writes out of the file.
void* code = mmap(NULL, map_size, PROT_READ | PROT_WRITE, MAP_SHARED, codefd, 0);
if (code == MAP_FAILED) {
    PANIC("mmap code");
}
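On the PROT_WRITE comment in the code above: a writable MAP_SHARED mapping passes writes through to the underlying file, so a guest that writes to its code region would modify ascii.bin. The sketch below shows one possible alternative, a copy-on-write mapping with MAP_PRIVATE. The helper map_file_private() is a made-up name for illustration and is not used by the program in this post.

```c
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Hypothetical helper: map a file copy-on-write.
// PROT_WRITE together with MAP_PRIVATE makes the memory writable, but writes land
// in private pages and never reach the file on disk.
// Returns MAP_FAILED on error, just like mmap() itself.
static void* map_file_private(const char* path, size_t* size_out) {
    int fd = open(path, O_RDONLY);
    if (fd == -1) {
        return MAP_FAILED;
    }
    struct stat st;
    if (fstat(fd, &st) == -1 || st.st_size == 0) {
        close(fd);
        return MAP_FAILED;
    }
    void* mem = mmap(NULL, (size_t) st.st_size, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
    // The mapping stays valid after the file descriptor is closed.
    close(fd);
    if (mem != MAP_FAILED && size_out != NULL) {
        *size_out = (size_t) st.st_size;
    }
    return mem;
}
```

Note that this sketch maps exactly the file size. A region handed to KVM would still need its size rounded up to whole pages.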

Setting userspace memory region

Now that the code has been mapped into our virtual address space, the next step is to attach the memory to the virtual machine. Right now the code only exists in the address space of this process, so there is no way for the virtual machine to read it.

struct kvm_userspace_memory_region region = {
    // The slot is just an identifier for this region, since there can be multiple
    // independent regions in one machine.
    .slot = 0,
    // Where in the guest's address space should this region start.
    // (This should ideally share lower bits with the userspace_addr, we will not do that here.)
    // (Also, we avoid 0x0000 since a real-mode guest expects its interrupt table there.)
    .guest_phys_addr = 0x1000,
    // How large is the memory region. In our case it is the size of the mapping we
    // created for the code file.
    .memory_size = map_size,
    // Finally we provide the address in this process address space.
    .userspace_addr = (uint64_t) code,
};

// Now we associate our (userspace) memory region to the virtual machine.
// From now on the VM can access the data in the memory we have provided.
if (ioctl(vmfd, KVM_SET_USER_MEMORY_REGION, &region) == -1) {
    PANIC("could not set user memory region");
}
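To make the relationship between guest_phys_addr and userspace_addr concrete, here is a small sketch of translating a guest physical address back to a pointer in our own process. The struct guest_region and the function guest_to_host() are assumptions made for this illustration. They only mirror the three fields we filled in above and are not part of the KVM API.

```c
#include <stddef.h>
#include <stdint.h>

// Hypothetical mirror of the fields we set in struct kvm_userspace_memory_region.
struct guest_region {
    uint64_t guest_phys_addr; // start of the region as seen by the guest
    uint64_t memory_size;     // size of the region in bytes
    void* host_addr;          // start of the same memory in our process
};

// Translate a guest physical address range to a pointer in our process.
// Returns NULL if [gpa, gpa + len) is not fully inside the region.
static void* guest_to_host(const struct guest_region* r, uint64_t gpa, uint64_t len) {
    if (gpa < r->guest_phys_addr) {
        return NULL;
    }
    uint64_t offset = gpa - r->guest_phys_addr;
    if (offset > r->memory_size || len > r->memory_size - offset) {
        return NULL;
    }
    return (uint8_t*) r->host_addr + offset;
}
```

With our values, guest address 0x1000 corresponds to the first byte of the mapped code file, and anything below 0x1000 or at 0x1000 + map_size and above has no backing memory.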

Creating a vCPU

To actually run any code inside the virtual machine we need a virtual central processing unit (vCPU). Once again this is an ioctl() call that creates a file descriptor representing the resource.

int vcpufd = ioctl(vmfd, KVM_CREATE_VCPU, (unsigned long) 0);
if (vcpufd == -1) {
    PANIC("could not create vcpu");
}

Mapping kvm_run of the vCPU

Communication between our userspace process and the vCPU is done via shared memory. This shared memory is a struct kvm_run that contains information about the state of the vCPU. So the first step of preparing a vCPU is to map this shared memory using mmap(). The size of the region to map is the kvm_run_size that we retrieved before.

struct kvm_run* kvm_run = mmap(NULL, kvm_run_size, PROT_READ | PROT_WRITE, MAP_SHARED, vcpufd, 0);
if (kvm_run == MAP_FAILED) {
    PANIC("mmap vcpufd");
}

There are a few rules for when the userspace process may read or write to this memory. The brief summary is that userspace can only touch the memory when KVM_RUN is not active. Since we are not starting other threads in this program this is not a problem.

Setup SREGS (cs base and selector)

Now we have a vCPU, but it is not initialized yet, and if we start it now it will trigger a fault. Proper initialization for a modern virtual machine with a modern operating system can be complex. It is also quite dependent on hardware and hypervisor details. On ARM, for example, the ioctl() is completely different since there are no SREGS. In this project we will do the bare minimum and only set up things that are absolutely necessary on a single hardware architecture.

That means we will skip setting up interrupts, protected mode, "extended features" and page tables. Essentially our vCPU will be a 16-bit [i8086] from somewhere in the 1980s.

By setting cs base and selector to zero we can mostly ignore segmentation.

struct kvm_sregs sregs;
if (ioctl(vcpufd, KVM_GET_SREGS, &sregs) == -1) {
    PANIC("could not get sregs");
}
sregs.cs.base = 0;
sregs.cs.selector = 0;
if (ioctl(vcpufd, KVM_SET_SREGS, &sregs) == -1) {
    PANIC("could not set sregs");
}
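Why does zeroing the code segment let us ignore segmentation? In real mode the CPU fetches instructions from the segment base plus the instruction pointer, and the base that belongs to a selector is the selector shifted left by four bits. The sketch below spells out that arithmetic. The function names are made up for this post.

```c
#include <stdint.h>

// In real mode the base address of a segment is its selector times 16.
static uint32_t real_mode_base(uint16_t selector) {
    return (uint32_t) selector << 4;
}

// The address an instruction is fetched from is the code segment base plus ip.
static uint64_t fetch_address(uint64_t cs_base, uint16_t ip) {
    return cs_base + ip;
}
```

With cs.base set to 0, fetch_address(0, 0x1000) is simply 0x1000, which is where we placed our code. The same address could also be reached with selector 0x0100 and ip 0, but keeping the segment at zero means rip alone decides where execution starts.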

Initialize REGS (rip and [rflags])

In much the same way we will initialize the regular registers to zero except for rip (the instruction pointer) and [rflags]. We need to initialize rip because it tells the vCPU at which address our code starts, which in this case is hardcoded to 0x1000. Then we need to initialize rflags to 2 because bit 1 of that register is reserved and must always be set, otherwise the vCPU will not start.

struct kvm_regs regs = {
    .rip = 0x1000,
    .rflags = 0x2,
};
if (ioctl(vcpufd, KVM_SET_REGS, &regs) == -1) {
    PANIC("could not set regs");
}

Minimal run loop

Now we are ready to actually run the virtual machine. To do this we will do another ioctl() call in a loop. When we call ioctl() with KVM_RUN the VM runs until it hits a condition that userspace has to handle. There are quite a few such conditions and we will only handle the bare minimum.

int spin = 1;
while (spin) {
    // We use ioctl(KVM_RUN) for this thread to enter the kernel and then the guest virtual
    // machine. It returns when some condition needs to be handled by this process.
    if (ioctl(vcpufd, KVM_RUN, NULL) == -1) {
        PANIC("error while running vcpu");
    }
    // Once it returns we use the memory mapped kvm_run structure of vcpufd to find out the
    // reason for the return ("exit").
    switch (kvm_run->exit_reason) {
    case KVM_EXIT_HLT:
        // KVM_EXIT_HLT means the vcpu executed the 'hlt' instruction.
        // We handle it by stopping the loop and then exiting the process.
        spin = 0;
        break;
    case KVM_EXIT_IO:
        // TODO We will handle IO in a later section. 
        spin = 0;
        break;
    default:
        // There are many possible exit reasons. Quite a few can be the reason even for this
        // small program if there are mistakes or if the hardware does not match expectations.
        // So if we get an unexpected exit reason we print the number for troubleshooting purposes
        // and then exit.
        // Take a look in <linux/kvm.h> or /usr/include/linux/kvm.h for a list of reasons
        // and definitions for the corresponding datastructures in kvm_run.
        fprintf(stderr, "exit_reason=%u\n", kvm_run->exit_reason);
        PANIC("unknown exit reason");
    }
}
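A bare exit reason number is tedious to look up by hand, so a small lookup function can make the default case friendlier. This is a sketch that only covers a handful of the constants defined in <linux/kvm.h>. The function name exit_reason_name() is made up for this post.

```c
#include <linux/kvm.h>

// Hypothetical helper: translate a few common exit reasons into their names.
// Everything not listed here falls through to a generic string.
static const char* exit_reason_name(unsigned reason) {
    switch (reason) {
    case KVM_EXIT_UNKNOWN:        return "KVM_EXIT_UNKNOWN";
    case KVM_EXIT_IO:             return "KVM_EXIT_IO";
    case KVM_EXIT_HLT:            return "KVM_EXIT_HLT";
    case KVM_EXIT_MMIO:           return "KVM_EXIT_MMIO";
    case KVM_EXIT_SHUTDOWN:       return "KVM_EXIT_SHUTDOWN";
    case KVM_EXIT_FAIL_ENTRY:     return "KVM_EXIT_FAIL_ENTRY";
    case KVM_EXIT_INTERNAL_ERROR: return "KVM_EXIT_INTERNAL_ERROR";
    default:                      return "(other, see <linux/kvm.h>)";
    }
}
```

In the default case of the run loop, the name can then be printed next to the number, for example with fprintf(stderr, "exit_reason=%u (%s)\n", kvm_run->exit_reason, exit_reason_name(kvm_run->exit_reason)).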

A minimal program to run

Now it is time to prepare a program that we can run inside our virtual machine. This time we will write a small program that outputs a subsection of the ASCII table. For the actual output we will use the 'out' instruction of the processor.

These input/output instructions were used in older systems to communicate with an external device, such as a serial port or an extension card. We will use this instruction because the interface is less complex than the alternatives and therefore requires less preparation.

// We will define the 'start' symbol in the text section.
// Mostly because the GNU toolchain expects it to exist.
.section .text
.globl start

// We use the .code16 directive to instruct the assembler tool to only generate 16bit code.
// With this the assembler will report errors for 32bit or 64bit instructions.
.code16

start:
    // Initialize a counter in %ax
    mov $0x20, %ax

loop:
    // Increment the counter and compare it with our limit value.
    // If the counter has reached the limit we stop the loop via 'je halt'.
    // Otherwise we output the lower 8 bits of the counter on the serial port.
    // Finally we continue the loop via 'jmp loop'
    add $0x01, %ax
    cmp $0x7f, %ax
    je halt
    // Usually different devices were allocated different port numbers.
    // The operating system would have configuration that specified which port numbers belonged
    // to which devices.
    // For our usecase it does not matter much since we do not have an OS in the virtual machine.
    // Or even a hypervisor that expects a certain usage of the ports.
    // So we will just use port zero.
    // (This also simplifies the assembly since we do not have to load a 16 bit port number.)
    out %al, $0x00
    jmp loop

halt:
    // We output a newline on the port to make the output a bit nicer to read.
    mov $0x0a, %al
    out %al, $0x00
    // Lastly we halt the processor.
    // This triggers KVM_EXIT_HLT
    hlt

Then we can assemble this program into a binary like this:

# We assemble our program into an ELF object file.
as -o ascii.o ascii.s
# Now we need to link it into a raw binary file (so no ELF file structure)
# We manually mark our entry point with '-e start'.
# Then specify raw binary output with '--oformat binary'.
ld -e start --oformat binary -o ascii.bin ascii.o
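If no assembler is at hand, the same program is small enough to encode by hand. The sketch below is my own hand-assembled encoding of the listing above, written out as a C array together with a made-up helper write_guest_code() that stores it in a file. The assembler is free to choose other encodings for the same instructions, so the bytes in ascii.bin may differ while behaving the same.

```c
#include <stdint.h>
#include <stdio.h>

// Hand-assembled 16-bit machine code for the listing above (an assumption,
// not the output of 'as'). All jumps are relative, so the load address does not matter.
static const uint8_t guest_code[] = {
    0xb8, 0x20, 0x00, // start: mov $0x20, %ax
    0x83, 0xc0, 0x01, // loop:  add $0x01, %ax
    0x83, 0xf8, 0x7f, //        cmp $0x7f, %ax
    0x74, 0x04,       //        je halt   (skip the next 4 bytes)
    0xe6, 0x00,       //        out %al, $0x00
    0xeb, 0xf4,       //        jmp loop  (12 bytes backwards)
    0xb0, 0x0a,       // halt:  mov $0x0a, %al
    0xe6, 0x00,       //        out %al, $0x00
    0xf4,             //        hlt
};

// Hypothetical helper: write the code to 'path'. Returns 0 on success, -1 on error.
static int write_guest_code(const char* path) {
    FILE* f = fopen(path, "wb");
    if (f == NULL) {
        return -1;
    }
    size_t n = fwrite(guest_code, 1, sizeof guest_code, f);
    if (fclose(f) != 0 || n != sizeof guest_code) {
        return -1;
    }
    return 0;
}
```

Calling write_guest_code("ascii.bin") would produce a 20 byte file that can stand in for the output of the 'as' and 'ld' commands.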

Handle KVM_EXIT_IO

If we run our "hypervisor" with the ascii.bin program the virtual machine will simply stop. This is because we have not handled KVM_EXIT_IO with anything more than stopping. So we will amend our program to handle the IO exit reason.

The struct kvm_run contains reason-specific structs with information about the exit. (These live in a union, so you have to access the struct that matches the exit reason.) Study the <linux/kvm.h> header for the details.

In our case we are only interested in the 'io' member that matches KVM_EXIT_IO.

struct {
	__u8 direction;
	__u8 size;
	__u16 port;
	__u32 count;
	__u64 data_offset;
} io;

As usual we will boil down this code to the bare minimum. We will only check that the values match our expectations.

case KVM_EXIT_IO:
    // The exit reason was an IO instruction (such as 'out').
    // We check that this is in the out direction since we are not going to implement input.
    // We also check that the size and count match our expectation of one byte.
    // Lastly we use 'data_offset' to create a pointer to the data byte.
    if (kvm_run->io.direction == KVM_EXIT_IO_OUT &&
            kvm_run->io.size == 1 &&
            kvm_run->io.port == 0 &&
            kvm_run->io.count == 1) {
        uint8_t* ptr = ((uint8_t*)kvm_run) + kvm_run->io.data_offset;
        // Once we have a pointer to the data byte we simply dereference it and output it
        // on stdout.
        fputc(*ptr, stdout);
    } else {
        // We are not going to try to handle anything unexpected.
        PANIC("unexpected IO");
    }
    break;
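The handler above insists on exactly one byte. According to the API documentation the data area is a packed array of 'count' items of 'size' bytes each, so a slightly more general handler could forward all of them. The sketch below shows that idea. The function name forward_port_output() is made up for this post, and it ignores the port number for brevity.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <linux/kvm.h>

// Hypothetical helper: forward every byte of an 'out' exit to the given stream.
// Returns the number of bytes written, or -1 if the exit is not an OUT exit.
static long forward_port_output(struct kvm_run* run, FILE* out) {
    if (run->exit_reason != KVM_EXIT_IO || run->io.direction != KVM_EXIT_IO_OUT) {
        return -1;
    }
    // The data lives inside the kvm_run mapping, data_offset bytes from its start.
    const uint8_t* data = (const uint8_t*) run + run->io.data_offset;
    size_t len = (size_t) run->io.size * run->io.count;
    return (long) fwrite(data, 1, len, out);
}
```

Inside the run loop this would be called as forward_port_output(kvm_run, stdout), with a negative return value treated as unexpected IO.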

The end

If we run our "hypervisor" with the 'ascii.bin' program we should see this line of ASCII:

./hypervisor
!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
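To double check that line, the guest loop can be mirrored in plain C. This is only a reference sketch of what the assembly computes, with a made-up function name. Nothing here runs inside the virtual machine.

```c
#include <stddef.h>

// Hypothetical reference: reproduce the bytes the guest program sends to port zero.
// 'buf' must have room for at least 95 bytes. Returns the number of bytes produced.
static size_t expected_guest_output(char* buf) {
    size_t n = 0;
    unsigned ax = 0x20;          // mov $0x20, %ax
    for (;;) {
        ax += 1;                 // add $0x01, %ax
        if (ax == 0x7f) {        // cmp $0x7f, %ax ; je halt
            break;
        }
        buf[n++] = (char) (ax & 0xff); // out %al, $0x00
    }
    buf[n++] = '\n';             // mov $0x0a, %al ; out %al, $0x00
    return n;
}
```

Because the counter is incremented before the first output, the line starts at '!' (0x21) rather than at the space character (0x20), and it ends at '~' (0x7e) followed by a newline.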

Troubleshooting tips

There are many places in the KVM API and the hardware manuals where there is no error handling. Essentially, if a mistake is made there may not be any nice feedback pointing it out. One such case is accessing memory that is not mapped in the guest virtual machine. There are no "signals" such as SIGSEGV to cause a core dump that we can load into a debugger. (But it is possible to set up similar tooling with GDB and others.)

Something more basic is to expand the handling of "unexpected" exit reasons a bit, such as this:

case KVM_EXIT_MMIO: {
    struct kvm_regs regs = { 0 };
    if (ioctl(vcpufd, KVM_GET_REGS, &regs) == -1) {
        PANIC("could not get regs");
    }
    fprintf(stderr, "exit_reason=KVM_EXIT_MMIO phys_addr=0x%016llx, len=%u, is_write=%d rip=0x%016llx rax=0x%016llx\n",
            kvm_run->mmio.phys_addr,
            kvm_run->mmio.len,
            kvm_run->mmio.is_write,
            regs.rip, regs.rax);
    // We only report the access, so stop the run loop afterwards.
    spin = 0;
    break;
}

If the program running in the guest virtual machine accesses a memory region that is not defined, it will cause an exit with the KVM_EXIT_MMIO reason. So we can catch that and report details. The 'phys_addr' is the address in the virtual machine memory and 'regs.rip' is the instruction pointer.

It is also a good idea to print the rest of the registers in the 'default' case. That way it is easier to see the state of the virtual machine.

Possible next steps

This was a tiny first step on what could be a long journey. If we were to continue, what would be the next step?

More hardware support

This code will (probably) only work on certain x86 hardware and it would be nice to also have it work on ARM. That would require quite different code since much of the initialization is different.

Direct long mode and 64 bits

When we created the vCPU we focused on minimalism, and this means the guest starts in [legacy-mode]. It is (probably?) possible for the guest to bootstrap itself into more modern modes. This would mean writing the bootstrap code in assembly, which is not really my goal right now.

One next step would be to write a bit more guest initialization code in userspace, so that the guest starts in a more modern mode (such as 64-bit mode).

Interrupts into the guest

Right now there is no code to interrupt the guest. So we cannot notify it about incoming messages, or stop it when we need to do other work.

Bulk IO

We used the port IO instructions since they are easy to work with, however they are not efficient for bulk data transfer. Their poor performance is one of the reasons they were phased out from active use. So we should look into how bulk data transfers can be implemented.

References

Legacy Mode

A mode of modern x86-64 processors where they pretend to be older 16 or 32 bit processors. See also 'real mode', 'virtual 8086 mode' and [i8086].

i8086

When the virtual guest starts, by default it emulates one of these little chips: https://en.wikipedia.org/wiki/Intel_8086

rflags

In x86-64 there is a register called FLAGS (RFLAGS in its 64-bit form) which KVM names rflags. According to the x86-64 specifications the second-lowest bit (bit 1) of that register is reserved and must be set, which gives us the hardcoded initial value 0x2.