Unable to unwrap the key protector for the virtual machine; details are included in the HostGuardianService-Client event log. What now? Back up Hyper-V's virtual TPM certificates

  As everyone knows, a prerequisite for installing Windows 11 is a computer with a Trusted Platform Module (TPM), and installing Windows 11 in a virtual machine is no exception. Since I use Hyper-V, this article focuses on Hyper-V; VMware and VirtualBox work somewhat differently. Microsoft appears to be repeating the EFS mistake, namely not adequately warning users about the risks of enabling the virtual TPM (hereafter vTPM, to save some breath).

  On a computer that has not joined a domain, the vTPM's encryption keys and certificates are stored in the machine's certificate store. Once you reinstall the operating system, those certificates are deleted along with it; after you reinstall the OS and Hyper-V and re-import the virtual machine, the VM can no longer be started.

  If BitLocker was not enabled inside the virtual machine, the situation is less dire: simply clear the “Enable TPM” checkbox and the VM will boot and work normally. But if BitLocker was enabled, the virtual machine is effectively a write-off; nobody can recover its data. The error shown when starting the VM reads: Unable to unwrap the key protector for virtual machine “VM name”. Details are included in the HostGuardianService-Client event log. The parameter is incorrect. (0x80070057).

  Does this feel like déjà vu of EFS back in the day? That's right, Microsoft deserves a slap for making the same mistake again. But slapping them won't help; the only thing to do is learn the lesson and avoid getting burned a second time.

Exporting (backing up) the vTPM certificates

  There are two ways to back up the vTPM certificates: one uses MMC, the other uses the PowerShell scripts written by Lars Iwer. If you prefer the scripts, scroll straight down; here I'll focus on backing up the vTPM certificates with MMC.

  Press Win+R to open the Run dialog, type mmc and press Enter. In the MMC window, click File 👉 Add or Remove Snap-ins.

  In the “Add or Remove Snap-ins” dialog, select “Certificates” and click “Add”. In the “Certificates snap-in” dialog, select the “Computer account” radio button, click “Next”, keep the default “Local computer”, and click “Finish”. Finally click “OK” to return to the MMC window.

  In the left pane, expand Certificates 👉 Shielded VM Local Certificates (may appear localized as “受防护的 VM 本地证书”) 👉 Certificates. Two certificates will appear in the list on the right: Shielded VM Encryption Certificate (UntrustedGuardian) and Shielded VM Signing Certificate (UntrustedGuardian).
Friendly reminder: some machines have more than two certificates, especially after repeatedly reinstalling the system and importing vTPMs. But there will always be certificates whose names start with Shielded VM, and they come in pairs.

  Select the Shielded VM certificate to export, click the “Action” menu, then click All Tasks 👉 Export.
1. On the wizard's welcome page, click “Next”.
2. At the “Export Private Key” step, be sure to select “Yes, export the private key”; otherwise the exported certificate contains no key, and importing it later still won't decrypt the vTPM. Then click “Next”.
3. On the export format page, change nothing and click “Next”.
4. On the “Security” page, check “Password” and set a password for the certificate. Remember that password, or write it down somewhere safe; if you forget it, the certificate cannot be imported. After setting the password, click “Next”.
5. Choose where to save the certificate, then click “Next”.
6. This is the final confirmation page; verify everything and click “Finish”.

Importing (restoring) the vTPM certificates

  After you have reinstalled the system and installed Hyper-V, go back to Shielded VM Local Certificates (may appear localized as “受防护的 VM 本地证书”), right-click Certificates, and point to All Tasks 👉 Import.

  1. On the wizard's welcome page, click “Next”.
  2. Browse to the saved certificate, then click “Next”. Tip: you must select “Personal Information Exchange” as the file type in the browse dialog, otherwise the certificate files will not be visible.
  3. On the private key protection page, enter the password you set when exporting, and check the “Mark this key as exportable. This will allow you to back up or transport your keys at a later time” checkbox; otherwise you will not be able to export the certificate's private key before your next reinstall. Don't forget this!
  4. On the “Certificate Store” page, simply click “Next”.
  5. This is the final confirmation page; verify everything and click “Finish”.

Exporting and importing the vTPM certificates with PowerShell scripts

  Note: the scripts below are the work of Lars Iwer and the copyright belongs to him.

Exporting the certificates

  Save the following as Export-UntrustedGuardian.ps1:

$GuardianName = 'UntrustedGuardian'
$CertificatePassword = Read-Host -Prompt 'Please enter a password to secure the certificate files' -AsSecureString

$guardian = Get-HgsGuardian -Name $GuardianName

if (-not $guardian)
{
    throw "Guardian '$GuardianName' could not be found on the local system."
}

$encryptionCertificate = Get-Item -Path "Cert:\LocalMachine\Shielded VM Local Certificates\$($guardian.EncryptionCertificate.Thumbprint)"
$signingCertificate = Get-Item -Path "Cert:\LocalMachine\Shielded VM Local Certificates\$($guardian.SigningCertificate.Thumbprint)"

if (-not ($encryptionCertificate.HasPrivateKey -and $signingCertificate.HasPrivateKey))
{
    throw 'One or both of the certificates in the guardian do not have private keys. ' + `
          'Please ensure the private keys are available on the local system for this guardian.'
}

Export-PfxCertificate -Cert $encryptionCertificate -FilePath ".\$GuardianName-encryption.pfx" -Password $CertificatePassword
Export-PfxCertificate -Cert $signingCertificate -FilePath ".\$GuardianName-signing.pfx" -Password $CertificatePassword

Importing the certificates

  Save the following as Import-UntrustedGuardian.ps1:

$NameOfGuardian = 'UntrustedGuardian'
$CertificatePassword = Read-Host -Prompt 'Please enter the password that was used to secure the certificate files' -AsSecureString
New-HgsGuardian -Name $NameOfGuardian -SigningCertificate ".\$NameOfGuardian-signing.pfx" -SigningCertificatePassword $CertificatePassword -EncryptionCertificate ".\$NameOfGuardian-encryption.pfx" -EncryptionCertificatePassword $CertificatePassword -AllowExpired -AllowUntrustedRoot

Usage

Export

  On the machine you want to export the certificates from:

  1. Open PowerShell as administrator
  2. Run Set-ExecutionPolicy -ExecutionPolicy Bypass -Scope Process
  3. Use cd to navigate to the directory containing Export-UntrustedGuardian.ps1
  4. Run Export-UntrustedGuardian.ps1
  5. Set a password for the keys; unlike the GUI, a password is not mandatory here
  6. After a successful export, the certificates are saved in the same directory as Export-UntrustedGuardian.ps1

Import

  Install Hyper-V on the machine you want to import the certificates to, then:

  1. Open PowerShell as administrator
  2. Run Set-ExecutionPolicy -ExecutionPolicy Bypass -Scope Process
  3. Use cd to navigate to the directory containing Import-UntrustedGuardian.ps1, which should also hold Export-UntrustedGuardian.ps1 and the certificate files
  4. Run Import-UntrustedGuardian.ps1
  5. Enter the password you set when exporting the certificates
  6. After a successful import, the UntrustedGuardian is recreated on this machine from the certificate files

A gentle reminder

  I recommend exporting the certificates used by the vTPM right after you have set up the system and enabled a virtual machine's TPM for the first time, so that you are not left without a backup once the system breaks. There is no need to export the certificates every time you enable a VM's TPM.

Afterword

  Just as Windows 7 improved on the Windows XP Encrypting File System, which made it far too easy for users to lose their encryption keys, I hope Microsoft adds an entry point for exporting the vTPM certificates to the Hyper-V console. The current design really is unfriendly to beginners and the careless, and if a virtual machine holds information that absolutely must not be lost, the damage would be severe.

Spawn a Virtual Machine in Kubernetes with kubevirt: A Deep Dive (2023)

Fig. kubevirt architecture overview

An introductory post before this deep dive: Virtual Machines on Kubernetes: Requirements and Solutions (2023)

Based on kubevirt v1.0.0, v1.1.0.



Fig. Architecture overview of the kubevirt solution

This post assumes there is already a running Kubernetes cluster, and kubevirt is correctly deployed in this cluster.

1 virt-handler startup

1.1 Agent responsibilities

As the node agent, virt-handler is responsible for managing the lifecycle of all VMs on its node: creating, destroying, pausing, …, freezing them. It functions similarly to OpenStack’s nova-compute, but with the added complexity that each VM runs inside a Kubernetes Pod, which requires collaboration with kubelet, Kubernetes’s node agent. For example,

  • When creating a VM, virt-handler must wait until kubelet has created the corresponding Pod;
  • When destroying a VM, virt-handler handles the VM destruction first, followed by kubelet performing the remaining cleanup steps (destroying the Pod).

1.2 Start and initialization (call stack)

Run                                           // cmd/virt-handler/virt-handler.go
  |-vmController := NewController()
  |-vmController.Run()
      |-Run()                                 // pkg/virt-handler/vm.go
         |-go c.deviceManagerController.Run()
         | 
         |-for domain in c.domainInformer.GetStore().List() {
         |     d := domain.(*api.Domain)
         |     vmiRef := v1.NewVMIReferenceWithUUID(...)
         |     key := controller.VirtualMachineInstanceKey(vmiRef)
         | 
         |     exists := c.vmiSourceInformer.GetStore().GetByKey(key)
         |     if !exists
         |         c.Queue.Add(key)
         |-}
         | 
         |-for i := 0; i < threadiness; i++ // 10 goroutine by default
               go c.runWorker
                  /
      /----------/
     /
runWorker
  |-for c.Execute() {
         |-key := c.Queue.Get()
         |-c.execute(key) // handle VM changes
              |-vmi, vmiExists := d.getVMIFromCache(key)
              |-domain, domainExists, domainCachedUID := d.getDomainFromCache(key)
              |-if !vmiExists && string(domainCachedUID) != ""
              |     vmi.UID = domainCachedUID
              |-if string(vmi.UID) == "" {
              |     uid := virtcache.LastKnownUIDFromGhostRecordCache(key)
              |     if uid != "" {
              |         vmi.UID = uid
              |     } else { // legacy support, attempt to find UID from watchdog file it exists.
              |         uid := watchdog.WatchdogFileGetUID(d.virtShareDir, vmi)
              |         if uid != ""
              |             vmi.UID = types.UID(uid)
              |     }
              |-}
              |-return d.defaultExecute(key, vmi, vmiExists, domain, domainExists)
    }

Steps performed during virt-handler bootstrap:

  1. Start necessary controllers, such as the device-related controller.
  2. Scan all VMs on the node and perform any necessary cleanups.
  3. Spawn goroutines to handle VM-related tasks.

    Each goroutine runs an infinite loop, monitoring changes to kubevirt’s VMI (Virtual Machine Instance) custom resources and responding accordingly. This includes actions like creating, deleting, …, unpausing VMs. For example, if a new VM is detected to be created on the node, the goroutine will initiate the creation process.
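
To make this pattern concrete, below is a minimal, dependency-free Go sketch of the same informer/work-queue idea: event handlers push VMI keys onto a queue, and a fixed number of worker goroutines pop keys and reconcile them. This is only an illustration of the pattern (kubevirt’s real code uses client-go informers and its workqueue package); every name in it is made up.

package main

import (
	"fmt"
	"sync"
	"time"
)

// workQueue is a tiny stand-in for client-go's workqueue: it de-duplicates
// keys and hands them out to worker goroutines one at a time.
type workQueue struct {
	mu    sync.Mutex
	cond  *sync.Cond
	items []string
	set   map[string]struct{}
}

func newWorkQueue() *workQueue {
	q := &workQueue{set: map[string]struct{}{}}
	q.cond = sync.NewCond(&q.mu)
	return q
}

func (q *workQueue) Add(key string) {
	q.mu.Lock()
	defer q.mu.Unlock()
	if _, dup := q.set[key]; dup {
		return // already queued, skip the duplicate
	}
	q.set[key] = struct{}{}
	q.items = append(q.items, key)
	q.cond.Signal()
}

func (q *workQueue) Get() string {
	q.mu.Lock()
	defer q.mu.Unlock()
	for len(q.items) == 0 {
		q.cond.Wait()
	}
	key := q.items[0]
	q.items = q.items[1:]
	delete(q.set, key)
	return key
}

// reconcile is where a real node agent would compare the desired VMI spec
// with the actual libvirt domain state and create/update/delete the VM.
func reconcile(worker int, key string) {
	fmt.Printf("worker %d: reconciling VMI %q\n", worker, key)
	time.Sleep(100 * time.Millisecond) // pretend to do the work
}

func main() {
	const threadiness = 3 // virt-handler uses 10 workers by default
	queue := newWorkQueue()

	// Simulated informer event handlers: every add/update/delete event for a
	// VMI object just enqueues its namespace/name key.
	go func() {
		for _, key := range []string{"default/vm-a", "default/vm-b", "default/vm-a"} {
			queue.Add(key)
		}
	}()

	for i := 0; i < threadiness; i++ {
		go func(worker int) {
			for { // the runWorker loop: pop a key, reconcile it, repeat
				reconcile(worker, queue.Get())
			}
		}(i)
	}
	time.Sleep(time.Second) // demo only; the real agent blocks on a stop channel
}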

1.3 Summary

Now that the agent is ready to handle VM-related tasks, let’s create a VM in this Kubernetes cluster and see what happens behind the scenes.

2 Create a VirtualMachine in Kubernetes

Let’s see how to create a KVM-based virtual machine (just like the ones you’ve created in OpenStack, or the EC2 instances you’re using on public clouds) with kubevirt, and what happens behind the scenes.

Fig. Workflow of creating a VM in kubevirt. Left: steps added by kubevirt; Right: vanilla procedure of creating a Pod in k8s. [2]

2.1 kube-apiserver: create a VirtualMachine CR

kubevirt introduces a VirtualMachine CRD, which lets users define the specification of a virtual machine, such as its CPU, memory, network, and disk configuration. Below is the spec of our to-be-created VM; it’s OK if you don’t understand all the fields:

apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  name: kubevirt-smoke-fedora
spec:
  running: true
  template:
    metadata:
      annotations:
        kubevirt.io/keep-launcher-alive-after-failure: "true"
    spec:
      nodeSelector:
        kubevirt.io/schedulable: "true"
      architecture: amd64
      domain:
        clock:
          timer:
            hpet:
              present: false
            hyperv: {}
            pit:
              tickPolicy: delay
            rtc:
              tickPolicy: catchup
          utc: {}
        cpu:
          cores: 1
        resources:
          requests:
            memory: 4G
        machine:
          type: q35
        devices:
          interfaces:
          - bridge: {}
            name: default
          disks:
          - disk:
              bus: virtio
            name: containerdisk
          - disk:
              bus: virtio
            name: emptydisk
          - disk:
              bus: virtio
            name: cloudinitdisk
        features:
          acpi:
            enabled: true
        firmware:
          uuid: c3ecdb42-282e-44c3-8266-91b99ac91261
      networks:
      - name: default
        pod: {}
      volumes:
      - containerDisk:
          image: kubevirt/fedora-cloud-container-disk-demo:latest
          imagePullPolicy: Always
        name: containerdisk
      - emptyDisk:
          capacity: 2Gi
        name: emptydisk
      - cloudInitNoCloud:
          userData: |-
            #cloud-config
            password: changeme               # password of this VM
            chpasswd: { expire: False }
        name: cloudinitdisk

Now just apply it:

(master) $ k apply -f kubevirt-smoke-fedora.yaml

2.2 virt-controller: translate VirtualMachine to VirtualMachineInstance and Pod

virt-controller, a control plane component of kubevirt, monitors VirtualMachine CRs/objects and generates corresponding VirtualMachineInstance objects, and further creates a standard Kubernetes Pod object to describe the VM. See renderLaunchManifest() for details.

A VirtualMachineInstance is a running instance of the corresponding VirtualMachine: for example, if you stop a VirtualMachine, the corresponding VirtualMachineInstance is deleted, and it is recreated when you start the VirtualMachine again.
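
Conceptually, the translation looks roughly like the sketch below: for a VirtualMachine whose spec says it should be running, ensure a VirtualMachineInstance with the same name exists, and render a virt-launcher Pod from that VMI. This is a simplified illustration with made-up Go types - the real logic lives in virt-controller (renderLaunchManifest() and friends) and is far more involved.

package main

import "fmt"

// Simplified stand-ins for the kubevirt.io/v1 API types.
type VirtualMachine struct {
	Name     string
	Running  bool
	Template VMISpec
}

type VMISpec struct {
	Cores    int
	MemoryMB int
	Disks    []string
}

type VirtualMachineInstance struct {
	Name string
	Spec VMISpec
}

type Pod struct {
	GenerateName string
	Containers   []string
	NodeSelector map[string]string
}

// reconcileVM mimics the idea: if the VM should be running and no VMI exists
// yet, materialize one from the template.
func reconcileVM(vm VirtualMachine, existingVMIs map[string]VirtualMachineInstance) *VirtualMachineInstance {
	if !vm.Running {
		return nil // a stopped VM has no VMI
	}
	if vmi, ok := existingVMIs[vm.Name]; ok {
		return &vmi // already materialized
	}
	return &VirtualMachineInstance{Name: vm.Name, Spec: vm.Template}
}

// renderLaunchPod mimics the idea of renderLaunchManifest(): wrap the VMI
// into a virt-launcher Pod that kube-scheduler and kubelet can handle.
func renderLaunchPod(vmi VirtualMachineInstance) Pod {
	return Pod{
		GenerateName: "virt-launcher-" + vmi.Name + "-",
		Containers:   []string{"compute", "volumecontainerdisk"},
		NodeSelector: map[string]string{"kubevirt.io/schedulable": "true"},
	}
}

func main() {
	vm := VirtualMachine{
		Name:     "kubevirt-smoke-fedora",
		Running:  true,
		Template: VMISpec{Cores: 1, MemoryMB: 4096, Disks: []string{"containerdisk", "emptydisk", "cloudinitdisk"}},
	}
	vmi := reconcileVM(vm, map[string]VirtualMachineInstance{})
	pod := renderLaunchPod(*vmi)
	fmt.Printf("VMI: %+v\nPod: %+v\n", *vmi, pod)
}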

$ k get vm
NAME                    AGE    STATUS    READY
kubevirt-smoke-fedora   ...

$ k get vmi
NAME                    AGE     PHASE     IP             NODENAME         READY   LIVE-MIGRATABLE   PAUSED
kubevirt-smoke-fedora   ...

$ k get pod -o wide | grep fedora
virt-launcher-kubevirt-smoke-fedora-2kx25   <status> ...

Once the Pod object is created, kube-scheduler takes over and selects a suitable node for the Pod. This is no different from scheduling a normal Kubernetes Pod.

The Pod’s YAML specification is lengthy; we’ll examine it piece by piece in the following sections.

2.3 kube-scheduler: schedule Pod

Based on Pod’s label selectors, kube-scheduler will choose a node for the Pod, then update the pod spec.

Fig. Architecture overview of the kubevirt solution

The steps described above, from applying a VirtualMachine CR to the scheduling of the corresponding Pod onto a node, all occur within the master node or control plane. The subsequent steps happen on the selected node.

2.4 kubelet: create Pod

Upon detecting that a Pod has been scheduled to its node, the kubelet on that node initiates the creation of the Pod from its specification.

While a standard Pod typically consists of a pause container for holding namespaces and a main container for executing user-defined tasks, Kubernetes also allows for multiple containers to be included within a single Pod. This is particularly useful in scenarios such as service mesh, where a sidecar container can be injected into each Pod to process network requests.

In the case of kubevirt, this “multi-container” property is leveraged even further. virt-controller describes 4 containers within the Pod:

  • 2 init containers for creating shared directories for containers in this Pod and copying files;
  • 1 volume container for holding volumes;
  • 1 compute container for holding the VM in this Pod.

2.4.1 pause container

crictl ps won’t show the pause container, but we can check it with ps:

(node) $ ps -ef | grep virt-launcher
qemu     822447 821556  /usr/bin/virt-launcher-monitor --qemu-timeout 288s --name kubevirt-smoke-fedora --uid 413e131b-408d-4ec6-9d2c-dc691e82cfda --namespace default --kubevirt-share-dir /var/run/kubevirt --ephemeral-disk-dir /var/run/kubevirt-ephemeral-disks --container-disk-dir /var/run/kubevirt/container-disks --grace-period-seconds 45 --hook-sidecars 0 --ovmf-path /usr/share/OVMF --run-as-nonroot --keep-after-failure
qemu     822464 822447  /usr/bin/virt-launcher         --qemu-timeout 288s --name kubevirt-smoke-fedora --uid 413e131b-408d-4ec6-9d2c-dc691e82cfda --namespace default --kubevirt-share-dir /var/run/kubevirt --ephemeral-disk-dir /var/run/kubevirt-ephemeral-disks --container-disk-dir /var/run/kubevirt/container-disks --grace-period-seconds 45 --hook-sidecars 0 --ovmf-path /usr/share/OVMF --run-as-nonroot
qemu     822756 822447  /usr/libexec/qemu-kvm -name ... # parent is virt-launcher-monitor

(node) $ ps -ef | grep pause
root     820808 820788  /pause
qemu     821576 821556  /pause
...

Process start information:

$ cat /proc/821576/cmdline | tr '\0' ' ' # the `pause` process
/pause

$ cat /proc/821556/cmdline | tr '\0' ' ' # the parent process
/usr/bin/containerd-shim-runc-v2 -namespace k8s.io -id 09c4b -address /run/containerd/containerd.sock

2.4.2 1st init container: install container-disk-binary to Pod

Snippet from Pod yaml:

  initContainers:
  - command:
    - /usr/bin/cp
    - /usr/bin/container-disk
    - /init/usr/bin/container-disk
    env:
    - name: XDG_CACHE_HOME
      value: /var/run/kubevirt-private
    - name: XDG_CONFIG_HOME
      value: /var/run/kubevirt-private
    - name: XDG_RUNTIME_DIR
      value: /var/run
    image: virt-launcher:v1.0.0
    name: container-disk-binary
    resources:
      limits:
        cpu: 100m
        memory: 40M
      requests:
        cpu: 10m
        memory: 1M
    securityContext:
      allowPrivilegeEscalation: true
      capabilities:
        drop:
        - ALL
      privileged: true
      runAsGroup: 107
      runAsNonRoot: false
      runAsUser: 107
    volumeMounts:
    - mountPath: /init/usr/bin
      name: virt-bin-share-dir

It copies a binary named container-disk from the container image to a directory of the Pod,

Source code: cmd/container-disk/main.c, ~100 lines of C code.

so this binary is shared among all containers of this Pod. virt-bin-share-dir is declared as a Kubernetes emptyDir; kubelet automatically creates a volume for it on the local disk:

For a Pod that defines an emptyDir volume, the volume is created when the Pod is assigned to a node. As the name says, the emptyDir volume is initially empty. All containers in the Pod can read and write the same files in the emptyDir volume, though that volume can be mounted at the same or different paths in each container. When a Pod is removed from a node for any reason, the data in the emptyDir is deleted permanently.

Check the container:

$ crictl ps -a | grep container-disk-binary # init container runs and exits
55f4628feb5a0   Exited   container-disk-binary   ...

Check the emptyDir created for it:

$ crictl inspect 55f4628feb5a0
    ...
    "mounts": [
      {
        "containerPath": "/init/usr/bin",
        "hostPath": "/var/lib/k8s/kubelet/pods/8364158c/volumes/kubernetes.io~empty-dir/virt-bin-share-dir",
      },

Check what's inside the directory:

$ ls /var/lib/k8s/kubelet/pods/8364158c/volumes/kubernetes.io~empty-dir/virt-bin-share-dir
container-disk # an executable that will be used by the other containers in this Pod

2.4.3 2nd init container: volumecontainerdisk-init

  - command:
    - /usr/bin/container-disk
    args:
    - --no-op                   # exit(0) directly
    image: kubevirt/fedora-cloud-container-disk-demo:latest
    name: volumecontainerdisk-init
    resources:
      limits:
        cpu: 10m
        memory: 40M
      requests:
        cpu: 1m
        ephemeral-storage: 50M
        memory: 1M
    securityContext:
      allowPrivilegeEscalation: true
      capabilities:
        drop:
        - ALL
      privileged: true
      runAsNonRoot: false
      runAsUser: 107
    volumeMounts:
    - mountPath: /var/run/kubevirt-ephemeral-disks/container-disk-data/413e131b-408d-4ec6-9d2c-dc691e82cfda
      name: container-disks
    - mountPath: /usr/bin
      name: virt-bin-share-dir

With the --no-op option, the container-disk program exits immediately with a return code of 0, indicating success.

So, what is the purpose of this container? It references a volume named container-disks, and it appears to exist as a workaround for certain edge cases: it ensures that the directory (emptyDir) is created before it is used by the subsequent container.

2.4.4 1st main container: volumecontainerdisk

  - command:
    - /usr/bin/container-disk
    args:
    - --copy-path
    - /var/run/kubevirt-ephemeral-disks/container-disk-data/413e131b-408d-4ec6-9d2c-dc691e82cfda/disk_0
    image: kubevirt/fedora-cloud-container-disk-demo:latest
    name: volumecontainerdisk
    resources:                         # needs little CPU & memory
      limits:
        cpu: 10m
        memory: 40M
      requests:
        cpu: 1m
        ephemeral-storage: 50M
        memory: 1M
    securityContext:
      allowPrivilegeEscalation: true
      capabilities:
        drop:
        - ALL
      privileged: true
      runAsNonRoot: false
      runAsUser: 107
    volumeMounts:
    - mountPath: /usr/bin
      name: virt-bin-share-dir
    - mountPath: /var/run/kubevirt-ephemeral-disks/container-disk-data/413e131b-408d-4ec6-9d2c-dc691e82cfda
      name: container-disks

This container uses two directories created by the init containers:

  1. virt-bin-share-dir: an emptyDir, created by the 1st init container;
  2. container-disks: an emptyDir, created by the 2nd init container;

--copy-path <path>:

  • Create this path if it does not exist;
  • Create a unix domain socket there, listen for requests and close them;

It seems that this container serves the purpose of holding the container-disk-data volume and does not perform any other significant tasks.
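
For intuition, here is a rough Go re-sketch of what the ~100-line C program does with --no-op and --copy-path. The real implementation is cmd/container-disk/main.c; paths and the ".sock" suffix below are only an approximation of its behavior.

package main

import (
	"flag"
	"log"
	"net"
	"os"
	"path/filepath"
)

func main() {
	noOp := flag.Bool("no-op", false, "exit immediately with status 0")
	copyPath := flag.String("copy-path", "", "path to create and serve a unix socket under")
	flag.Parse()

	if *noOp {
		os.Exit(0) // the init container only needs the emptyDir to be mounted
	}

	// Make sure the directory for the socket exists, e.g.
	// .../container-disk-data/<vmi-uid>/disk_0.
	if err := os.MkdirAll(filepath.Dir(*copyPath+".sock"), 0o755); err != nil {
		log.Fatal(err)
	}

	// Listen on a unix domain socket; accepting (and immediately closing)
	// connections is enough to signal that this container disk is in place.
	ln, err := net.Listen("unix", *copyPath+".sock")
	if err != nil {
		log.Fatal(err)
	}
	defer ln.Close()

	for {
		conn, err := ln.Accept()
		if err != nil {
			log.Fatal(err)
		}
		conn.Close() // no payload is exchanged, the socket just has to exist
	}
}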

2.4.5 2nd main container: compute

  - command:
    - /usr/bin/virt-launcher-monitor
    - --qemu-timeout
    - 288s
    - --name
    - kubevirt-smoke-fedora
    - --uid
    - 413e131b-408d-4ec6-9d2c-dc691e82cfda
    - --namespace
    - default
    - --kubevirt-share-dir
    - /var/run/kubevirt
    - --ephemeral-disk-dir
    - /var/run/kubevirt-ephemeral-disks
    - --container-disk-dir
    - /var/run/kubevirt/container-disks
    - --grace-period-seconds
    - "45"
    - --hook-sidecars
    - "0"
    - --ovmf-path
    - /usr/share/OVMF
    - --run-as-nonroot
    - --keep-after-failure
    env:
    - name: XDG_CACHE_HOME
      value: /var/run/kubevirt-private
    - name: XDG_CONFIG_HOME
      value: /var/run/kubevirt-private
    - name: XDG_RUNTIME_DIR
      value: /var/run
    - name: VIRT_LAUNCHER_LOG_VERBOSITY
      value: "6"
    - name: LIBVIRT_DEBUG_LOGS
      value: "1"
    - name: VIRTIOFSD_DEBUG_LOGS
      value: "1"
    - name: POD_NAME
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: metadata.name
    image: virt-launcher:v1.0.0
    name: compute
    resources:
      limits:
        devices.kubevirt.io/kvm: "1"
        devices.kubevirt.io/tun: "1"
        devices.kubevirt.io/vhost-net: "1"
      requests:
        cpu: 100m
        devices.kubevirt.io/kvm: "1"
        devices.kubevirt.io/tun: "1"
        devices.kubevirt.io/vhost-net: "1"
        ephemeral-storage: 50M
        memory: "4261567892"
    securityContext:
      allowPrivilegeEscalation: true
      capabilities:
        add:
        - NET_BIND_SERVICE
        drop:
        - ALL
      privileged: true
      runAsGroup: 107
      runAsNonRoot: false
      runAsUser: 107
    volumeMounts:
    - mountPath: /var/run/kubevirt-private
      name: private
    - mountPath: /var/run/kubevirt
      name: public
    - mountPath: /var/run/kubevirt-ephemeral-disks
      name: ephemeral-disks
    - mountPath: /var/run/kubevirt/container-disks
      mountPropagation: HostToContainer
      name: container-disks
    - mountPath: /var/run/libvirt
      name: libvirt-runtime
    - mountPath: /var/run/kubevirt/sockets
      name: sockets
    - mountPath: /var/run/kubevirt/hotplug-disks
      mountPropagation: HostToContainer
      name: hotplug-disks
    - mountPath: /var/run/kubevirt-ephemeral-disks/disk-data/containerdisk
      name: local

This container runs a binary called virt-launcher-monitor, which is a simple wrapper around virt-launcher. The main purpose of this wrapping layer is better cleanup when the process exits.

virt-launcher-monitor

All of virt-launcher-monitor’s arguments are passed to virt-launcher unchanged, except that --keep-after-failure is removed - it is a monitor-only flag.

// run virt-launcher process and monitor it to give qemu an extra grace period to properly terminate in case of crashes
func RunAndMonitor(containerDiskDir string) (int, error) {
    args := removeArg(os.Args[1:], "--keep-after-failure")

    cmd := exec.Command("/usr/bin/virt-launcher", args...)
    cmd.SysProcAttr = &syscall.SysProcAttr{
        AmbientCaps: []uintptr{unix.CAP_NET_BIND_SERVICE},
    }
    cmd.Start()

    exitStatus := make(chan int, 10) // receives virt-launcher's exit status (declaration omitted in the original excerpt)
    sigs := make(chan os.Signal, 10)
    signal.Notify(sigs, syscall.SIGINT, syscall.SIGTERM, syscall.SIGQUIT, syscall.SIGCHLD)
    go func() {
        for sig := range sigs {
            switch sig {
            case syscall.SIGCHLD:
                var wstatus syscall.WaitStatus
                wpid := syscall.Wait4(-1, &wstatus, syscall.WNOHANG, nil)

                log.Log.Infof("Reaped pid %d with status %d", wpid, int(wstatus))
                if wpid == cmd.Process.Pid {
                    exitStatus <- wstatus.ExitStatus()
                }
            default: // Log("signalling virt-launcher to shut down")
                cmd.Process.Signal(syscall.SIGTERM)
                sig.Signal()
            }
        }
    }()

    exitCode := <-exitStatus // wait for VM's exit
    // do cleanups here
}

virt-launcher call stack: start virtqemud/cmdserver

main // cmd/virt-launcher/virt-launcher.go
  |-NewLibvirtWrapper(*runWithNonRoot)
  |-SetupLibvirt(libvirtLogFilters)
  |-StartVirtquemud(stopChan)
  |    |-go func() {
  |    |     for {
  |    |         Run("/usr/sbin/virtqemud -f /var/run/libvirt/virtqemud.conf")
  |    |
  |    |         select {
  |    |         case <-stopChan:
  |    |             return cmd.Process.Kill()
  |    |         }
  |    |     }
  |    |-}()
  |
  |-domainConn := createLibvirtConnection() // "qemu+unix:///session?socket=/var/run/libvirt/virtqemud-sock" or "qemu:///system"
  |
  |-notifier := notifyclient.NewNotifier(*virtShareDir)
  |-domainManager := NewLibvirtDomainManager()
  |
  |-startCmdServer("/var/run/kubevirt/sockets/launcher-init-sock")
  |-startDomainEventMonitoring
  |-domain := waitForDomainUUID()
  |
  |-mon := virtlauncher.NewProcessMonitor(domainName,)
  |-mon.RunForever()
        |-monitorLoop()

It starts two processes inside the container:

  1. virtqemud: a libvirt component, runs as a daemon process;
  2. cmdserver: a gRPC server, provides VM operation (delete/pause/freeze/…) interfaces to the caller;

virtqemud: management for QEMU VMs

virtqemud is a server side daemon component of the libvirt virtualization management system,

  • one of a collection of modular daemons that replace functionality previously provided by the monolithic libvirtd daemon.
  • provide management for QEMU virtual machines.
  • listens for requests on a local Unix domain socket by default. Remote access via TLS/TCP and backwards compatibility with legacy clients expecting libvirtd is provided by the virtproxyd daemon.

Check the container that will hold the VM:

$ crictl ps -a | grep compute
f67f57d432534       Running     compute     0       09c4b63f6bca5       virt-launcher-kubevirt-smoke-fedora-2kx25

Configurations of virtqemud:

$ crictl exec -it f67f57d432534 sh
sh-5.1$ cat /var/run/libvirt/virtqemud.conf
listen_tls = 0
listen_tcp = 0
log_outputs = "1:stderr"
log_filters="3:remote 4:event 3:util.json 3:util.object 3:util.dbus 3:util.netlink 3:node_device 3:rpc 3:access 3:util.threadjob 3:cpu.cpu 3:qemu.qemu_monitor 1:*"

Process tree

There are several processes inside the compute container; their relationships can be shown with pstree:

$ pstree -p <virt-launcher-monitor pid> --hide-threads
virt-launcher-m(<pid>)─┬─qemu-kvm(<pid>) # The real VM process, we'll see this in the next chapter
                       └─virt-launcher(<pid>)─┬─virtlogd(<pid>)
                                              └─virtqemud(<pid>)

# Show the entire process arguments
$ pstree -p <virt-launcher-monitor pid> --hide-threads --arguments
$ virt-launcher-monitor --qemu-timeout 321s --name kubevirt-smoke-fedora ...
  ├─qemu-kvm            -name guest=default_kubevirt-smoke-fedora,debug-threads=on ...
  └─virt-launcher       --qemu-timeout 321s --name kubevirt-smoke-fedora ...
      ├─virtlogd        -f /etc/libvirt/virtlogd.conf
      └─virtqemud       -f /var/run/libvirt/virtqemud.conf

2.5 virt-launcher: reconcile VM state (create VM in this case)

Status so far:

Fig. Ready to create a KVM VM inside the Pod

  1. kubelet has successfully created a Pod and is reconciling the Pod status based on the Pod specification.

    Note that certain details, such as network creation for the Pod, have been omitted to keep this post concise. There are no differences from normal Pods.

  2. virt-handler is prepared to synchronize the status of the VirtualMachineInstance with a real KVM virtual machine on this node. As there is currently no virtual machine present, the first task of the virt-handler is to create the virtual machine.

Now, let’s delve into the detailed steps involved in creating a KVM virtual machine.

2.5.1 virt-handler/cmdclient -> virt-launcher/cmdserver: sync VMI

An informer is used in virt-handler to sync the VMI; it calls into the following stack:

defaultExecute
  |-switch {
    case shouldShutdown:
        d.processVmShutdown(vmi, domain)
    case shouldDelete:
        d.processVmDelete(vmi)
    case shouldCleanUp:
        d.processVmCleanup(vmi)
    case shouldUpdate:
        d.processVmUpdate(vmi, domain)
          |// handle migration if needed
          |
          |// handle vm create
          |-d.vmUpdateHelperDefault
               |-client.SyncVirtualMachine(vmi, options)
                   |-// lots of preparation work here
                   |-genericSendVMICmd("SyncVMI", c.v1client.SyncVirtualMachine, vmi, options)
  }

client.SyncVirtualMachine(vmi, options) does lots of preparation work, then calls the SyncVMI() gRPC method to synchronize the VM state - if the VM does not exist, it is created. This method is handled by the cmdserver in virt-launcher.
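
The cmdclient/cmdserver pair is plain gRPC over a unix domain socket inside the shared /var/run/kubevirt directory. kubevirt uses its own generated protobuf service (SyncVirtualMachine and friends); as a rough, self-contained stand-in, the sketch below wires the standard gRPC health service over a unix socket to show the transport pattern - the socket path and the service used here are placeholders, not kubevirt’s real ones.

package main

import (
	"context"
	"fmt"
	"log"
	"net"
	"os"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
	"google.golang.org/grpc/health"
	healthpb "google.golang.org/grpc/health/grpc_health_v1"
)

const sock = "/tmp/launcher-demo-sock" // stand-in for /var/run/kubevirt/sockets/...

func main() {
	_ = os.Remove(sock) // clean up any stale socket from a previous run

	// "cmdserver" side: a gRPC server listening on a unix domain socket.
	ln, err := net.Listen("unix", sock)
	if err != nil {
		log.Fatal(err)
	}
	defer ln.Close()

	srv := grpc.NewServer()
	healthpb.RegisterHealthServer(srv, health.NewServer()) // stand-in for the cmd service
	go srv.Serve(ln)
	defer srv.Stop()

	// "cmdclient" side: dial the socket and issue an RPC.
	conn, err := grpc.Dial(sock,
		grpc.WithTransportCredentials(insecure.NewCredentials()),
		grpc.WithContextDialer(func(ctx context.Context, addr string) (net.Conn, error) {
			return (&net.Dialer{}).DialContext(ctx, "unix", addr)
		}),
	)
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 3*time.Second)
	defer cancel()

	resp, err := healthpb.NewHealthClient(conn).Check(ctx, &healthpb.HealthCheckRequest{})
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("RPC over unix socket succeeded, status:", resp.GetStatus())
}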

2.5.2 virt-launcher/cmdserver: SyncVirtualMachine() -> libvirt C API virDomainCreateWithFlags()

SyncVirtualMachine // pkg/virt-launcher/virtwrap/cmd-server/server.go
  |-vmi, response := getVMIFromRequest(request.Vmi)
  |-domainManager.SyncVMI(vmi, l.allowEmulation, request.Options)
      |-domain := &api.Domain{}
      |-c := l.generateConverterContext // generate libvirt domain from VMI spec
      |-dom := l.virConn.LookupDomainByName(domain.Spec.Name)
      |-if notFound {
      |     domain = l.preStartHook(vmi, domain, false)
      |     dom = withNetworkIfacesResources(
      |         vmi, &domain.Spec,
      |         func(v *v1.VirtualMachineInstance, s *api.DomainSpec) (cli.VirDomain, error) {
      |             return l.setDomainSpecWithHooks(v, s)
      |         },
      |     )
      |
      |     l.metadataCache.UID.Set(vmi.UID)
      |     l.metadataCache.GracePeriod.Set( api.GracePeriodMetadata{DeletionGracePeriodSeconds: converter.GracePeriodSeconds(vmi)},)
      |     logger.Info("Domain defined.")
      |-}
      | 
      |-switch domState {
      |     case vm create:
      |         l.generateCloudInitISO(vmi, &dom)
      |         dom.CreateWithFlags(getDomainCreateFlags(vmi)) // start VirtualMachineInstance
      |     case vm pause/unpause:
      |     case disk attach/detach/resize disks:
      |     case hot plug/unplug virtio interfaces:
      |-}

As the above code shows, it eventually calls into the libvirt API to create a "domain".

https://unix.stackexchange.com/questions/408308/why-are-vms-in-kvm-qemu-called-domains

They’re not kvm exclusive terminology (xen also refers to machines as domains). A hypervisor is a rough equivalent to domain zero, or dom0, which is the first system initialized on the kernel and has special privileges. Other domains started later are called domU and are the equivalent to a guest system or virtual machine. The reason is probably that both are very similar as they are executed on the kernel that handles them.

2.5.3 libvirt API -> virtqemud: create domain (VM)

LibvirtDomainManager

All VM/VMI operations are abstracted into a LibvirtDomainManager struct:

// pkg/virt-launcher/virtwrap/manager.go

type LibvirtDomainManager struct {
    virConn cli.Connection

    // Anytime a get and a set is done on the domain, this lock must be held.
    domainModifyLock sync.Mutex
    // mutex to control access to the guest time context
    setGuestTimeLock sync.Mutex

    credManager *accesscredentials.AccessCredentialManager

    hotplugHostDevicesInProgress chan struct{}
    memoryDumpInProgress         chan struct{}

    virtShareDir             string
    ephemeralDiskDir         string
    paused                   pausedVMIs
    agentData                *agentpoller.AsyncAgentStore
    cloudInitDataStore       *cloudinit.CloudInitData
    setGuestTimeContextPtr   *contextStore
    efiEnvironment           *efi.EFIEnvironment
    ovmfPath                 string
    ephemeralDiskCreator     ephemeraldisk.EphemeralDiskCreatorInterface
    directIOChecker          converter.DirectIOChecker
    disksInfo                map[string]*cmdv1.DiskInfo
    cancelSafetyUnfreezeChan chan struct{}
    migrateInfoStats         *stats.DomainJobInfo

    metadataCache *metadata.Cache
}

libvirt C API

// vendor/libvirt.org/go/libvirt/domain.go

// See also https://libvirt.org/html/libvirt-libvirt-domain.html#virDomainCreateWithFlags
func (d *Domain) CreateWithFlags(flags DomainCreateFlags) error {
	C.virDomainCreateWithFlagsWrapper(d.ptr, C.uint(flags), &err)
}

2.5.4 virtqemud -> KVM subsystem

Create the VM (domain) from its XML spec.

The domain (VM) is created, and the VCPUs enter the running state unless special flags are specified.
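
To make this last hop concrete, here is a hedged sketch of creating a domain from an XML spec with the libvirt.org/go/libvirt bindings (the same bindings vendored by kubevirt). kubevirt builds the XML from the VMI spec and goes through its own connection and hook layers; the tiny XML and the session URI below are purely illustrative.

package main

import (
	"fmt"
	"log"

	libvirt "libvirt.org/go/libvirt"
)

func main() {
	// virt-launcher talks to virtqemud over a local socket; plain libvirt
	// clients typically use "qemu:///system" or "qemu:///session".
	conn, err := libvirt.NewConnect("qemu:///session")
	if err != nil {
		log.Fatalf("connect: %v", err)
	}
	defer conn.Close()

	// A deliberately tiny domain definition - real specs carry disks,
	// interfaces, machine type, firmware, and so on.
	domainXML := `
<domain type='kvm'>
  <name>demo-vm</name>
  <memory unit='MiB'>256</memory>
  <vcpu>1</vcpu>
  <os><type arch='x86_64' machine='q35'>hvm</type></os>
</domain>`

	// DomainCreateXML defines and starts a transient domain in one call,
	// similar in effect to calling virDomainCreateWithFlags() on a domain
	// that has already been defined.
	dom, err := conn.DomainCreateXML(domainXML, 0)
	if err != nil {
		log.Fatalf("create domain: %v", err)
	}
	defer dom.Free()

	name, _ := dom.GetName()
	fmt.Println("domain running:", name)
}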

2.6 Recap

Fig. A KVM VM is created inside the Pod

$ crictl ps -a | grep kubevirt
960d3e86991fa     Running     volumecontainerdisk        0   09c4b63f6bca5       virt-launcher-kubevirt-smoke-fedora-2kx25
f67f57d432534     Running     compute                    0   09c4b63f6bca5       virt-launcher-kubevirt-smoke-fedora-2kx25
e8b79067667b7     Exited      volumecontainerdisk-init   0   09c4b63f6bca5       virt-launcher-kubevirt-smoke-fedora-2kx25
55f4628feb5a0     Exited      container-disk-binary      0   09c4b63f6bca5       virt-launcher-kubevirt-smoke-fedora-2kx25

This finishes our journey of creating a VM in kubevirt.

What it looks like if we create two kubevirt VirtualMachines and they are scheduled to the same node:

Fig. Two KVM VMs on the node

3 Managing VM states

Control path of VM state changes:

kube-apiserver -> virt-handler -> cmdserver -> virtqemud -> KVM subsystem -> VM

 VMI states         VM agent     |-- virt-launcher pod --|     kernel

3.1 Resize CPU/Memory/Disk

Workflow (non-hotplug):

  1. The VirtualMachine/VirtualMachineInstance spec is modified;
  2. virt-handler receives the changes and modifies the KVM VM configuration via virtqemud -> KVM;
  3. The Pod and the KVM VM are restarted for the changes to take effect.

If hotplug is supported (e.g. when Kubernetes VPA is supported), kubevirt should be able to hot-reload such changes.

3.2 Delete VM

Similar workflow as the above.

4 Summary

This post illustrates what happens under the hood when a user creates a VirtualMachine in Kubernetes with kubevirt.

References

  1. github.com/kubevirt
  2. Virtual Machines on Kubernetes: Requirements and Solutions (2023)

Written by Human, Not by AI

Virtual Machines on Kubernetes: Requirements and Solutions (2023)

Fig. Running (full-feature) VMs inside containers, phasing out OpenStack. Solutions: kubevirt, etc



1 Introduction

Some may be puzzled by this topic: why do we still need virtual machines (from the past cloud computing era) when we already have containerized platforms in this cloud-native era? And further, why should we bother managing VMs on Kubernetes, the de-facto container orchestration platform?

Comparing VMs and containers as provisioning methods is a complex matter, and out of this post’s scope. We just highlight some practical reasons for deploying VMs on Kubernetes.

1.1 Practical reasons

Firstly, not all applications can be containerized. VMs provide a complete operating system environment and scratch space (stateful to users), while containers are most often used in a stateless fashion and share the same kernel as the node. Scenarios that are not suitable for containerization:

  • Applications that are tightly coupled with the operating system or depend on specific hardware;
  • GUI-based applications with complex display requirements - Windows being an example;

Secondly, applications with strict security requirements may not be suitable for container deployment:

  • VMs offer stronger isolation between workloads and better control over resource usage;
  • Hard multi-tenancy in OpenStack vs. soft multi-tenancy in Kubernetes;

Thirdly, not every transition from VMs to containers brings business benefits. While moving from VMs to containers reduces technical debt in most cases, mature and slowly evolving VM-based stacks may not benefit much from such a transition.

With all the above said, despite the benefits of containers, there are still many scenarios where VMs are necessary. The question then becomes: whether to keep maintaining them on standalone or legacy platforms like OpenStack, or to unify their management with Kubernetes - especially if your main focus and efforts are already on Kubernetes.

This post explores the latter case: managing VMs along with your container workloads with Kubernetes.

1.2 Resource provision and orchestration

Before moving forward, let’s see a simple comparison between two ages.

1.2.1 Cloud computing era

In this era, the focus lies primarily at the IaaS level, where virtualization is carried out on hardware to provide virtual CPUs, virtual network interfaces, virtual disks, etc. These virtual pieces are finally assembled into a virtual machine (VM) that looks to users just like a physical machine (blade server).

Users typically express their requirements as follows:

I’d like 3 virtual machines. They should,

  1. Have their own permanent IP addresses (immutable IP throughout their lifecycle).
  2. Have persistent disks for scratch space or stateful data.
  3. Be resizable in terms of CPU, memory, disk, etc.
  4. Be recoverable during maintenance or outages (through cold or live migration).

Once users log in to the machines, they can deploy their business applications and orchestrate their operations on top of these VMs.

Examples of platforms that cater to these needs:

  • AWS EC2
  • OpenStack

Focus of these platforms: resource sharing, hard multi-tenancy, strong isolation, security, etc.

1.2.2 Cloud Native era

In the cloud-native era, orchestration platforms still pay attention to the above-mentioned needs, but they operate at a higher level than IaaS. They address concerns such as elasticity, scalability, high availability, service load balancing, and model abstraction. The resulting platforms typically manage stateless workloads.

For instance, in the case of Kubernetes, users often express their requirements as follows:

I want an nginx service for serving a static website, which should:

  • Have a unique entrypoint for accessing (ServiceIP, etc).
  • Have 3 instances replicated across 3 nodes (affinity/anti-affinity rules).
  • Requests should be load balanced (ServiceIP to PodIPs load balancing).
  • Misbehaving instances should be automatically replaced with new ones (stateless, health-checking, and reconciliation mechanisms).

1.3 Summary

With the above discussions in mind, let’s see some open-source solutions for managing VM workloads on Kubernetes.

2 Managing VM workloads via Kubernetes: solutions

There are two typical solutions, both based on Kubernetes and capable of managing both container and VM workloads:

  1. VM inside container: suitable for teams that currently maintain both OpenStack and Kubernetes. They can leverage this solution to provision VMs to end users while gradually phasing out OpenStack.

  2. Container inside VM: suitable for teams that already enjoy the benefits and conveniences of the container ecosystem, but would like to strengthen the security and isolation of their container workloads.

2.1 Run VM inside Pod: kubevirt

Fig. Running (full-feature) VMs inside containers, phasing out OpenStack. Solutions: kubevirt, etc

kubevirt utilizes Kubernetes for VM provisioning.

  • Run on top of vanilla Kubernetes.
  • Introduce several CRDs and components to provision VMs.
  • Facilitate VM provisioning by embedding each VM into a container (pod).
  • Compatible with almost all Kubernetes facilities, e.g. Service load-balancing.

2.2 Run Pod inside VM: kata containers

Fig. Running containers inside (lightweight) VMs, with a proper container runtime. Solutions: kata containers, etc

Kata containers add a lightweight VM wrapper:

  • Deploy containers inside a lightweight and ultra-fast VM.
  • Enhance container security with this outer VM layer.
  • Need a dedicated container runtime (but no changes to Kubernetes).

3 Kubevirt solution overview

In this section, we’ll give a quick overview of the kubevirt project.

3.1 Architecture and components

High level architecture:

Fig. kubevirt architecture overview

Main components:

  • virt-api: kubevirt apiserver, for accepting requests like console streaming;
  • virt-controller: reconciles kubevirt objects like VirtualMachine, VirtualMachineInstance (VMI);
  • virt-handler: node agent (like nova-compute in OpenStack), collaborates with Kubernetes’s node agent kubelet;
  • virtctl: CLI, e.g. virtctl console <vm>

3.2 How it works

How a VM is created in kubevirt on top of Kubernetes:

Fig. Workflow of creating a VM in kubevirt. Left: steps added by kubevirt; Right: vanilla procedure of creating a Pod in k8s.

You can see that kubevirt only adds steps on top of the vanilla Kubernetes workflow; it changes nothing in that workflow itself.

An in-depth illustration: Spawn a Virtual Machine in Kubernetes with kubevirt: A Deep Dive.

3.3 Node internal topology

The internal view of the components inside a node:

Fig. A k8s/kubevirt node with two (KVM) VMs

3.4 Tech stacks

3.4.1 Computing

Still based on KVM/QEMU/libvirt, just like OpenStack.

3.4.2 Networking

Compatible with the CNI mechanism, can work seamlessly with popular network solutions like flannel, calico, and cilium.

The kubevirt node agent further creates the virtual machine network on top of the pod network. This is necessary because virtual machines run as userspace processes and need userspace-simulated network devices (such as TUN/TAP) instead of veth pairs.

Networking is a big topic; I’d like to write a dedicated post about it (if time permits).

3.4.3 Storage

Based on Kubernetes storage mechanisms (PV/PVC); advanced features such as VM snapshots, cloning, and live migration all rely on these mechanisms.

kubevirt also adds some extensions, for example containerDisk (embedding virtual machine images into container images).

4 Conclusion

This post discusses why there is a need to run VMs on Kubernetes, and gives a technical overview of the kubevirt project.

References

  1. github.com/kubevirt
  2. github.com/kata-containers
  3. Spawn a Virtual Machine in Kubernetes with kubevirt: A Deep Dive (2023)

Written by Human, Not by AI

[译] Creating a KVM virtual machine in ~100 lines of C code (2019)

Translator's preface

The core content of this post comes from a 2019 English blog post: KVM HOST IN A FEW LINES OF CODE

  1. First, implement a minimal virtual machine manager (think VirtualBox) in about 100 lines of C on top of the KVM API;
  2. Then write a minimal kernel in about 10 lines of assembly and turn it into a VM image (think Ubuntu/Linux);
  3. Finally, feed 2 into 1 to create and run a virtual machine.

This post reorganizes and annotates the core parts of the original, with some additional content, for personal study and reference. Out of respect for the original author's work, the title still starts with [译] (“translated”), but note that the content and its ordering no longer match the original closely. The code used in this post is available on github.

Due to the translator's limited ability, errors are inevitable; if in doubt, please consult the original.



KVM (Kernel-based Virtual Machine) is a virtualization technology provided by the Linux kernel that allows users to run multiple virtual machines (VMs) on a single Linux host; open-source VM orchestration systems such as OpenStack and kubevirt are built on top of KVM. So how does KVM work?

1 The kernel KVM subsystem

1.1 Interaction: the character device /dev/kvm

KVM is exposed to userspace through a special (character) device, /dev/kvm:

$ file /dev/kvm
/dev/kvm: character special

Character devices in Linux provide unbuffered access to data. It is used to communicate with devices that transfer data character by character, such as keyboards, mice, serial ports, and terminals. Character devices allow data to be read from or written to the device one character at a time, without any buffering or formatting.

The entire KVM API is based on file descriptors.

1.2 Interface: the KVM API

The KVM API is a set of ioctl() get/set operations that control VM behavior. By functional scope, they fall into the following levels:

  • System: operations on the KVM subsystem as a whole; also includes the ioctl that creates a VM.
  • VM: operations on a single VM, e.g. setting its memory layout; also includes the ioctls that create VCPUs and devices. Must be issued from the process (address space) that created the VM.
  • VCPU: operations on a single VCPU. Must be issued from the thread that created the VCPU (asynchronous VCPU ioctls excepted).
  • Device: operations on a single device. Must be issued from the process (address space) that created the VM.

1.3 Operations: the ioctl() system call

Once open("/dev/kvm") returns a KVM subsystem fd, you can allocate resources and start and manage VMs through ioctl(kvm_fd, ...) system calls.

2 Creating a KVM virtual machine in ~100 lines of C

Now for a complete example: how to create and run a virtual machine based on the API that KVM provides.

2.1 Open the KVM device: kvm_fd = open("/dev/kvm")

To interact with the KVM subsystem, open /dev/kvm in read-write mode and obtain a file descriptor:

    if ((kvm_fd = open("/dev/kvm", O_RDWR)) < 0) {
        fprintf(stderr, "failed to open /dev/kvm: %d\n", errno);
        return 1;
    }

The file descriptor kvm_fd is unique on the system; it separates our subsequent KVM operations from those of other users on the host (for example, several users or processes may be creating and managing their own virtual machines at the same time).

2.2 Create the VM shell: vm_fd = ioctl(kvm_fd, KVM_CREATE_VM)

With kvm_fd in hand, we can issue an ioctl request to the kernel KVM subsystem to create a virtual machine:

    if ((vm_fd = ioctl(kvm_fd, KVM_CREATE_VM, 0)) < 0) {
        fprintf(stderr, "failed to create vm: %d\n", errno);
        return 1;
    }

The returned file descriptor uniquely identifies this virtual machine. At this point, though, the “virtual machine” is merely an empty “chassis”: it has no CPUs and no memory.

2.3 Allocate VM memory: mmap()

A “virtual machine” is a userspace process emulating a complete machine, so the memory given to the “virtual machine” also has to come from userspace - specifically, userspace memory on the host. There are several ways to allocate userspace memory; here we use the relatively efficient mmap():

    if ((mem = mmap(NULL, 1 << 30, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0)) == NULL) {
        fprintf(stderr, "mmap failed: %d\n", errno);
        return 1;
    }

On success it returns mem, the start address of the mapped memory region.

2.4 Initialize the VM memory: ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION)

Initialize this memory region:

    struct kvm_userspace_memory_region region;
    memset(&region, 0, sizeof(region));
    region.slot = 0;
    region.guest_phys_addr = 0;
    region.memory_size = 1 << 30;
    region.userspace_addr = (uintptr_t)mem;
    if (ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &region) < 0) {
        fprintf(stderr, "ioctl KVM_SET_USER_MEMORY_REGION failed: %d\n", errno);
        return 1;
    }

Now the virtual machine image can be loaded into this memory region.

2.5 Load the VM image: open() + read()

Here we assume the first command-line argument is the file path of the virtual machine image:

    int img_fd = open(argv[1], O_RDONLY);
    if (img_fd < 0) {
        fprintf(stderr, "can not open binary guest file: %d\n", errno);
        return 1;
    }
    char *p = (char *)mem;
    for (;;) {
        int r = read(img_fd, p, 4096);
        if (r <= 0) {
            break;
        }
        p += r;
    }
    close(img_fd);

In a loop, the entire image file is copied into the VM's memory address space, 4 KB at a time.

KVM does not interpret CPU instructions one by one; it lets the real CPU execute them directly and only intercepts I/O requests, so the image (byte code) must match the current CPU architecture. This is also why KVM performance is very good, unless the VM performs a lot of I/O.

With this, the virtualization and initialization of the VM's memory is complete.

2.6 Create a VCPU: ioctl(vm_fd, KVM_CREATE_VCPU)

Next, create a virtual processor (VCPU) for the VM:

    if ((vcpu_fd = ioctl(vm_fd, KVM_CREATE_VCPU, 0)) < 0) {
        fprintf(stderr, "can not create vcpu: %d\n", errno);
        return 1;
    }

On success, a non-negative VCPU file descriptor is returned. This VCPU has its own registers and memory, and will emulate the execution of a physical CPU.

2.7 Initialize the VCPU control region: ioctl(kvm_fd, KVM_GET_VCPU_MMAP_SIZE) + mmap

When the VCPU stops running, it needs to hand some run state back to our controlling program. KVM does this through a special memory region, called KVM_RUN, that stores and passes this state.

The size of this memory region can be obtained via ioctl:

    int kvm_run_mmap_size = ioctl(kvm_fd, KVM_GET_VCPU_MMAP_SIZE, 0);
    if (kvm_run_mmap_size < 0) {
        fprintf(stderr, "ioctl KVM_GET_VCPU_MMAP_SIZE: %d\n", errno);
        return 1;
    }

Then allocate the memory via mmap:

    struct kvm_run *run = (struct kvm_run *)mmap(NULL, kvm_run_mmap_size, PROT_READ | PROT_WRITE, MAP_SHARED, vcpu_fd, 0);
    if (run == NULL) {
        fprintf(stderr, "mmap kvm_run: %d\n", errno);
        return 1;
    }

When the VCPU exits, the exit reason (for example, that it needs I/O) and other state information are written here.

2.8 Set the VCPU registers: ioctl(vcpu_fd, KVM_SET_SREGS/KVM_SET_REGS)

Next we need to initialize the VCPU's registers. First, fetch them:

    struct kvm_regs regs;
    struct kvm_sregs sregs;
    if (ioctl(vcpu_fd, KVM_GET_SREGS, &(sregs)) < 0) {
        perror("can not get sregs\n");
        exit(1);
    }

For simplicity, we require the virtual machine image to be in 16-bit mode, i.e. memory addresses and registers are all 16 bits wide.

Set the special-purpose registers: initialize a few segment pointers, which represent memory offsets [2]:

  • CS: code segment
  • SS: stack segment
  • DS: data segment
  • ES: extra segment
#define CODE_START 0x0000

    sregs.cs.selector = CODE_START;  // code
    sregs.cs.base = CODE_START * 16;
    sregs.ss.selector = CODE_START;  // stack
    sregs.ss.base = CODE_START * 16;
    sregs.ds.selector = CODE_START;  // data
    sregs.ds.base = CODE_START * 16;
    sregs.es.selector = CODE_START;  // extra
    sregs.es.base = CODE_START * 16;
    sregs.fs.selector = CODE_START;
    sregs.fs.base = CODE_START * 16;
    sregs.gs.selector = CODE_START;

    if (ioctl(vcpu_fd, KVM_SET_SREGS, &sregs) < 0) {
        perror("can not set sregs");
        return 1;
    }

Set the general-purpose registers:

    regs.rflags = 2;
    regs.rip = 0;

    if (ioctl(vcpu_fd, KVM_SET_REGS, &(regs)) < 0) {
        perror("KVM SET REGS\n");
        return 1;
    }

At this point all the initialization work is done and we can start the virtual machine.

2.9 Run the VM: ioctl(vcpu_fd, KVM_RUN)

Start an infinite loop that does two things:

  1. Call ioctl(vcpu_fd, KVM_RUN, 0) to let the VCPU run until it exits of its own accord;
  2. After the VCPU exits, read the KVM_RUN control region to determine the exit reason, then act accordingly:
    for (;;) {
        int ret = ioctl(vcpu_fd, KVM_RUN, 0);
        if (ret < 0) {
            fprintf(stderr, "KVM_RUN failed\n");
            return 1;
        }

        switch (run->exit_reason) {
            case KVM_EXIT_IO:
                printf("IO port: %x, data: %x\n", run->io.port,
                        *(int *)((char *)(run) + run->io.data_offset));
                sleep(1);
                break;
            case KVM_EXIT_SHUTDOWN:
                goto exit;
        }
    }

Only two exit reasons are handled here:

  1. If the VCPU exited because it wants to perform an I/O operation, read the data it wants to input/output from the KVM_RUN region and handle it on its behalf - here, we simply print it;
  2. If it is a normal shutdown, break out of the infinite loop - for this simple program, the practical effect is shutting down and destroying the virtual machine.

2.10 Recap

That is all the code needed to create, initialize, and run a VM - about 130 lines in total (fewer than 100 if you don't count header includes and some printing code). To test it, the only thing still missing is a virtual machine image.

For a deeper understanding, we will now write a minimal virtual machine (kernel) in assembly ourselves and turn it into an image.

3 A minimal VM image

3.1 A minimal kernel: 8 lines of assembly

We will implement a tiny 16-bit guest VM “kernel” that:

  1. Initializes a variable to 0;
  2. Enters an infinite loop: it first writes the variable's value to debug port 0x10, then increments the variable and starts the next iteration.

The code is below, with a comment on every line:

# A tiny 16-bit guest "kernel" that infinitely prints an incremented number to the debug port

.globl _start
.code16          # 16-bit mode, so KVM runs the VCPU in "real" mode
_start:          # code entry point
  xorw %ax, %ax  # set %ax = 0. XOR-ing a register with itself yields 0, so this resets register ax.
loop:            # start a loop
  out %ax, $0x10 # write the value of register ax to I/O port 0x10
  inc %ax        # increment register ax by 1
  jmp loop       # jump to the next iteration

For basic x86 assembly syntax, see the (translated) post “简明 x86 汇编指南 (2017)” (a concise guide to x86 assembly).

A KVM VCPU can run in several modes (16/32-bit, etc.); 16-bit mode is used here because it is the simplest. In addition, real mode uses direct memory addressing and needs no descriptor tables, which makes initializing the registers very convenient.

3.2 Building the VM image

All it takes is assembling and linking:

$ make image
as -32 guest.S -o guest.o
ld -m elf_i386 --oformat binary -N -e _start -Ttext 0x10000 -o guest guest.o

  • assemble: turn the assembly code into an object file
  • link: link the object file (and its dependencies) into the final image - here a flat binary, because of --oformat binary

What we end up with is a binary (byte code) for the same CPU architecture as the host:

$ file guest
guest: data

The host CPU can execute these instructions directly.

4 Testing

4.1 Compile

Our C code only depends on the kernel headers. On CentOS, install them like this:

$ yum install kernel-headers

Then compile with gcc or clang:

$ make kvm
gcc kvm-vmm.c

$ ls
a.out  guest  guest.o  guest.S  kvm-vmm.c  Makefile

4.2 Run

$ ./a.out guest
IO port: 10, data: 0
IO port: 10, data: 1
IO port: 10, data: 2
^C

5 Further reading

The original post discusses, and partially verifies, how to bring the guest kernel closer to a real one:

  • Directions:

    • Add a timer, an interrupt controller, and so on via ioctl;
    • The bzImage format;
    • The boot protocol;
    • I/O driver support for disks, keyboards, graphics, etc.
  • In practice you would not use the KVM API directly, but the higher-level libvirt, which wraps low-level virtualization technologies such as KVM and bhyve;
  • To learn KVM in more depth, reading the kvmtool source code is recommended; it is not much code and is easier to understand than QEMU.


References

  1. The Definitive KVM (Kernel-based Virtual Machine) API Documentation, kernel.org
  2. x86 Assembly/16, 32, and 64 Bits, wikipedia