
K8s Scheduling Framework Design and a Scheduler Plugins Development/Deployment Example (2024)



1 Introduction

The K8s scheduling framework provides a plugin mechanism for extending scheduling behavior, which is very useful when you need custom scheduling logic.

  • If a pod spec does not specify the schedulerName field, the default scheduler is used;
  • If it does, the pod is handled by the corresponding scheduler / scheduling plugins.

This post collects some related material and shows how to implement a simple stick-to-one-node scheduling plugin in roughly 300 lines of code. The code is based on k8s v1.28.

1.1 Scheduling framework extension points

As shown in the figure below, the K8s scheduling framework defines a number of extension points,

Fig. Scheduling framework extension points.

Users can write their own scheduler plugins, register them at these extension points, and thereby implement the desired scheduling logic. Each extension point usually has multiple plugins registered, executed in registration order.

Depending on whether they affect scheduling decisions, extension points fall into two categories.

1.1.1 Extension points that affect scheduling decisions

Most extension points affect scheduling decisions,

  • As we will see later, these functions return a success/failure field that decides whether the pod is allowed or denied to enter the next processing stage;
  • If any of these extension points fails, scheduling of this pod fails.

1.1.2 Informational extension points (do not affect scheduling decisions)

A few extension points are informational,

  • These functions have no return value, so they cannot affect scheduling decisions;
  • However, they can modify pod/node information or perform cleanup work.

1.2 Scheduler plugin categories

Depending on whether they are maintained in the k8s repository itself, plugins fall into two categories.

1.2.1 in-tree plugins

These live in the k8s source tree under pkg/scheduler/framework/plugins and are compiled together with the built-in scheduler. There are a dozen or so plugins there, most of them commonly used:

$ ll pkg/scheduler/framework/plugins
defaultbinder/
defaultpreemption/
dynamicresources/
feature/
imagelocality/
interpodaffinity/
names/
nodeaffinity/
nodename/
nodeports/
noderesources/
nodeunschedulable/
nodevolumelimits/
podtopologyspread/
queuesort/
schedulinggates/
selectorspread/
tainttoleration/
volumebinding/
volumerestrictions/
volumezone/

With the in-tree approach, adding a new plugin or modifying an existing one requires changing kube-scheduler code, then recompiling and redeploying kube-scheduler, which is rather heavyweight.

1.2.2 out-of-tree plugins

out-of-tree plugins are written and maintained by users and deployed independently; they require no code or configuration changes to k8s itself.

Essentially, out-of-tree plugins are still compiled together with the kube-scheduler code, but the relevant kube-scheduler code has been extracted into a standalone project, github.com/kubernetes-sigs/scheduler-plugins. Users only need to import this package, write their own scheduler plugin, and deploy it as an ordinary pod (other deployment methods, e.g. running it as a binary, also work); a minimal registration sketch is shown after the list below. The compiled result is a single scheduler binary that contains the default scheduler plus all out-of-tree plugins:

  • it has all the capabilities of the built-in scheduler;
  • it also has the capabilities of the out-of-tree schedulers.

There are two ways to use it:

  • deploy it alongside the existing scheduler and let it manage only specific pods;
  • replace the existing scheduler entirely, since it is feature-complete anyway.
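
Below is a minimal sketch of how such an out-of-tree plugin is typically compiled into its own scheduler binary. The package path example.com/stickyvm/pkg/plugin/stickyvm and the exported Name/New symbols are placeholders; app.NewSchedulerCommand and app.WithPlugin are the real kube-scheduler helpers, and the factory signature assumed for New matches k8s v1.28:

package main

import (
    "os"

    "k8s.io/kubernetes/cmd/kube-scheduler/app"

    // Hypothetical package containing the custom plugin implementation.
    "example.com/stickyvm/pkg/plugin/stickyvm"
)

func main() {
    // Register the custom plugin alongside all in-tree plugins; the resulting binary
    // is a full kube-scheduler that additionally knows about the custom plugin.
    cmd := app.NewSchedulerCommand(
        app.WithPlugin(stickyvm.Name, stickyvm.New),
    )
    if err := cmd.Execute(); err != nil {
        os.Exit(1)
    }
}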

1.3 Which built-in plugins run at each extension point

The built-in scheduling plugins, and the extension points each of them works at, are listed in the official documentation. For example,

  • node selectors and node affinity are implemented by the NodeAffinity plugin;
  • taints/tolerations are implemented by the TaintToleration plugin.

2 Pod scheduling workflow

The complete scheduling of a pod consists of two phases:

  1. scheduling cycle: select a node for the pod, analogous to querying and filtering in a database;
  2. binding cycle: carry out that selection, analogous to handling all the associated pieces and writing the result back to the database.

For example, even if the scheduling cycle has picked a node for the pod, the whole scheduling attempt still counts as failed if creating a persistent volume for the pod on that node fails in the subsequent binding cycle, and the pod goes back to the very first step to be rescheduled. Together, the two phases are called a scheduling context.

In addition, before entering a scheduling context a pod sits in a scheduling queue; users can plug in their own algorithm to sort the pods in the queue and decide which pods enter the scheduling flow first. The overall flow is shown below:

Fig. queuing/sorting and scheduling context

Let's go through these stages one by one.

2.1 Waiting to be scheduled

2.1.1 PreEnqueue

The pod is in the ready-for-scheduling stage. For the internals, see sig-scheduling/scheduler_queues.md.

If this step does not pass, the pod does not even enter the scheduling queue, let alone the scheduling flow.

2.1.2 QueueSort

Sorts the pods in the scheduling queue and decides which pods are scheduled first; a minimal sketch follows.
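
A minimal sketch of a QueueSort plugin (the plugin name FIFOSort and the sorting rule are made up for illustration, not from the original post): it simply dequeues pods in creation-time order. Note that a scheduler profile enables exactly one QueueSort plugin at a time.

package main

import (
    "k8s.io/kubernetes/pkg/scheduler/framework"
)

// FIFOSort is a hypothetical QueueSort plugin: pods created earlier are dequeued first.
type FIFOSort struct{}

var _ framework.QueueSortPlugin = &FIFOSort{}

func (*FIFOSort) Name() string { return "FIFOSort" }

// Less returns true if p1 should be scheduled before p2.
func (*FIFOSort) Less(p1, p2 *framework.QueuedPodInfo) bool {
    return p1.Pod.CreationTimestamp.Before(&p2.Pod.CreationTimestamp)
}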

2.2 Scheduling cycle

2.2.1 PreFilter: pre-process and check the pod, abort scheduling early if it does not qualify

Plugins here can pre-process the pod or perform condition checks. The function signature:

// https://github.com/kubernetes/kubernetes/blob/v1.28.4/pkg/scheduler/framework/interface.go#L349-L367

// PreFilterPlugin is an interface that must be implemented by "PreFilter" plugins.
// These plugins are called at the beginning of the scheduling cycle.
type PreFilterPlugin interface {
    Plugin
    // PreFilter is called at the beginning of the scheduling cycle. All PreFilter
    // plugins must return success or the pod will be rejected. PreFilter could optionally
    // return a PreFilterResult to influence which nodes to evaluate downstream. This is useful
    // for cases where it is possible to determine the subset of nodes to process in O(1) time.
    // When it returns Skip status, returned PreFilterResult and other fields in status are just ignored,
    // and coupled Filter plugin/PreFilterExtensions() will be skipped in this scheduling cycle.
    PreFilter(ctx context.Context, state *CycleState, p *v1.Pod) (*PreFilterResult, *Status)

    // PreFilterExtensions returns a PreFilterExtensions interface if the plugin implements one,
    // or nil if it does not. A Pre-filter plugin can provide extensions to incrementally
    // modify its pre-processed info. The framework guarantees that the extensions
    // AddPod/RemovePod will only be called after PreFilter, possibly on a cloned
    // CycleState, and may call those functions more than once before calling
    // Filter again on a specific node.
    PreFilterExtensions() PreFilterExtensions
}
  • Input:

    • p *v1.Pod is the pod to be scheduled;
    • the second parameter, state, can be used to save some state information and retrieve it at later extension points (e.g. in the Filter() stage);
  • Output:

    • if any plugin returns failure, scheduling of this pod fails;
    • in other words, the pod only moves on to the next stage after all registered PreFilter plugins have returned success (see the sketch below).
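
A minimal sketch of a PreFilter plugin (the plugin name, state key and annotation are all made up for illustration): it rejects pods missing an annotation and stashes the annotation value into CycleState for later stages to read.

package main

import (
    "context"

    v1 "k8s.io/api/core/v1"
    "k8s.io/kubernetes/pkg/scheduler/framework"
)

const demoStateKey framework.StateKey = "example.com/demo" // hypothetical key

// demoState carries data from PreFilter to later extension points.
type demoState struct{ tier string }

func (s *demoState) Clone() framework.StateData { c := *s; return &c }

// DemoPreFilter is a hypothetical plugin.
type DemoPreFilter struct{}

var _ framework.PreFilterPlugin = &DemoPreFilter{}

func (*DemoPreFilter) Name() string { return "DemoPreFilter" }

func (*DemoPreFilter) PreFilter(ctx context.Context, state *framework.CycleState, p *v1.Pod) (*framework.PreFilterResult, *framework.Status) {
    tier, ok := p.Annotations["example.com/tier"] // hypothetical annotation
    if !ok {
        // Any failed status returned here rejects the pod for this scheduling cycle.
        return nil, framework.NewStatus(framework.UnschedulableAndUnresolvable, "missing example.com/tier annotation")
    }
    state.Write(demoStateKey, &demoState{tier: tier}) // read it back later via state.Read(demoStateKey)
    return nil, framework.NewStatus(framework.Success, "")
}

func (*DemoPreFilter) PreFilterExtensions() framework.PreFilterExtensions { return nil }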

2.2.2 Filter: exclude all nodes that do not meet the requirements

Plugins here filter out the nodes that cannot run the pod (the equivalent of Predicates in the old scheduling Policy),

  • for each node, the scheduler runs the filter plugins in their configured order;
  • if any plugin returns failure, the node is excluded;
// https://github.com/kubernetes/kubernetes/blob/v1.28.4/pkg/scheduler/framework/interface.go#L349C1-L367C2

// FilterPlugin is an interface for Filter plugins. These plugins are called at the
// filter extension point for filtering out hosts that cannot run a pod.
// This concept used to be called 'predicate' in the original scheduler.
// These plugins should return "Success", "Unschedulable" or "Error" in Status.code.
// However, the scheduler accepts other valid codes as well.
// Anything other than "Success" will lead to exclusion of the given host from running the pod.
type FilterPlugin interface {
    Plugin
    // Filter is called by the scheduling framework.
    // All FilterPlugins should return "Success" to declare that
    // the given node fits the pod. If Filter doesn't return "Success",
    // it will return "Unschedulable", "UnschedulableAndUnresolvable" or "Error".
    // For the node being evaluated, Filter plugins should look at the passed
    // nodeInfo reference for this particular node's information (e.g., pods
    // considered to be running on the node) instead of looking it up in the
    // NodeInfoSnapshot because we don't guarantee that they will be the same.
    // For example, during preemption, we may pass a copy of the original
    // nodeInfo object that has some pods removed from it to evaluate the
    // possibility of preempting them to schedule the target pod.
    Filter(ctx context.Context, state *CycleState, pod *v1.Pod, nodeInfo *NodeInfo) *Status
}
  • Input:

    • nodeInfo is the information of the node currently being evaluated; Filter() decides whether this node meets the pod's requirements;
  • Output:

    • accept or reject.

For a given node, only if all Filter plugins return success does the node pass the filtering and become one of the candidate nodes. A minimal sketch follows.
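
A minimal sketch of a Filter plugin (the plugin name and label key are made up): it admits only nodes that carry a given label, illustrating the per-node accept/reject semantics described above.

package main

import (
    "context"

    v1 "k8s.io/api/core/v1"
    "k8s.io/kubernetes/pkg/scheduler/framework"
)

// DemoFilter is a hypothetical Filter plugin.
type DemoFilter struct{}

var _ framework.FilterPlugin = &DemoFilter{}

func (*DemoFilter) Name() string { return "DemoFilter" }

func (*DemoFilter) Filter(ctx context.Context, state *framework.CycleState, pod *v1.Pod, nodeInfo *framework.NodeInfo) *framework.Status {
    node := nodeInfo.Node()
    if node == nil {
        return framework.NewStatus(framework.Error, "node not found")
    }
    if _, ok := node.Labels["example.com/dedicated"]; !ok { // hypothetical label
        // Unschedulable only excludes this node; other candidate nodes are still evaluated.
        return framework.NewStatus(framework.Unschedulable, "node lacks example.com/dedicated label")
    }
    return nil // nil status means Success
}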

2.2.3 PostFilter: remediation stage, runs only when no node survives Filter

This stage only runs if all nodes have been filtered out after the Filter phase, i.e. not a single node is left; otherwise the plugins at this stage are not executed.

// https://github.com/kubernetes/kubernetes/blob/v1.28.4/pkg/scheduler/framework/interface.go#L392C1-L407C2

// PostFilterPlugin is an interface for "PostFilter" plugins. These plugins are called after a pod cannot be scheduled.
type PostFilterPlugin interface {
    Plugin
    // A PostFilter plugin should return one of the following statuses:
    // - Unschedulable: the plugin gets executed successfully but the pod cannot be made schedulable.
    // - Success: the plugin gets executed successfully and the pod can be made schedulable.
    // - Error: the plugin aborts due to some internal error.
    //
    // Informational plugins should be configured ahead of other ones, and always return Unschedulable status.
    // Optionally, a non-nil PostFilterResult may be returned along with a Success status. For example,
    // a preemption plugin may choose to return nominatedNodeName, so that framework can reuse that to update the
    // preemptor pod's .spec.status.nominatedNodeName field.
    PostFilter(ctx context.Context, state *CycleState, pod *v1.Pod, filteredNodeStatusMap NodeToStatusMap) (*PostFilterResult, *Status)
}
  • Plugins are executed in their configured order; as soon as one plugin declares the pod schedulable, it counts as success and the remaining PostFilter plugins are skipped.

A typical example is preemption (e.g. the preemption-toleration plugin): there is no usable node left after Filter(), so this stage picks a pod/node and preempts its resources.

2.2.4 PreScore

PreScore/Score/NormalizeScore all take part in scoring nodes so that the most suitable one is eventually selected. We won't expand on them here; the function signatures are in the same source file referenced above.

2.2.5 Score

The scoring plugins are called for each node in turn, producing a score per node.

2.2.6 NormalizeScore
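
NormalizeScore rescales the raw scores of a Score plugin into the framework's [0, MaxNodeScore] range before plugin weights are applied. Below is a minimal sketch of a Score plugin together with its ScoreExtensions (the plugin name and the raw scoring rule are made up for illustration):

package main

import (
    "context"

    v1 "k8s.io/api/core/v1"
    "k8s.io/kubernetes/pkg/scheduler/framework"
)

// DemoScore is a hypothetical Score plugin.
type DemoScore struct{}

var _ framework.ScorePlugin = &DemoScore{}

func (*DemoScore) Name() string { return "DemoScore" }

// Score returns a raw, plugin-specific score for one node (placeholder rule here).
func (*DemoScore) Score(ctx context.Context, state *framework.CycleState, p *v1.Pod, nodeName string) (int64, *framework.Status) {
    return int64(len(nodeName)), nil
}

func (ds *DemoScore) ScoreExtensions() framework.ScoreExtensions { return ds }

// NormalizeScore rescales the raw scores of all nodes into [0, MaxNodeScore].
func (*DemoScore) NormalizeScore(ctx context.Context, state *framework.CycleState, p *v1.Pod, scores framework.NodeScoreList) *framework.Status {
    var highest int64 = 1
    for _, s := range scores {
        if s.Score > highest {
            highest = s.Score
        }
    }
    for i := range scores {
        scores[i].Score = scores[i].Score * framework.MaxNodeScore / highest
    }
    return nil
}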

2.2.7 Reserve: informational, maintains plugin state

// https://github.com/kubernetes/kubernetes/blob/v1.28.4/pkg/scheduler/framework/interface.go#L444C1-L462C2

// ReservePlugin is an interface for plugins with Reserve and Unreserve
// methods. These are meant to update the state of the plugin. This concept
// used to be called 'assume' in the original scheduler. These plugins should
// return only Success or Error in Status.code. However, the scheduler accepts
// other valid codes as well. Anything other than Success will lead to
// rejection of the pod.
type ReservePlugin interface {
    Plugin
    // Reserve is called by the scheduling framework when the scheduler cache is
    // updated. If this method returns a failed Status, the scheduler will call
    // the Unreserve method for all enabled ReservePlugins.
    Reserve(ctx context.Context, state *CycleState, p *v1.Pod, nodeName string) *Status
    // Unreserve is called by the scheduling framework when a reserved pod was
    // rejected, an error occurred during reservation of subsequent plugins, or
    // in a later phase. The Unreserve method implementation must be idempotent
    // and may be called by the scheduler even if the corresponding Reserve
    // method for the same plugin was not called.
    Unreserve(ctx context.Context, state *CycleState, p *v1.Pod, nodeName string)
}

The two methods here are both informational, i.e. they do not affect the scheduling decision; plugins that maintain runtime state (aka "stateful plugins") can use them to receive information from the scheduler (a sketch of such a plugin follows the list).

  1. Reserve

    Used to avoid errors caused by race conditions while the scheduler waits for the bind operation to finish. Only when all Reserve plugins succeed does the pod move on to the next stage; otherwise the scheduling cycle is aborted.

  2. Unreserve

    Executed during rollback when scheduling fails at a later stage. Unreserve() must be idempotent and must not fail.
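
A minimal sketch of such a stateful Reserve plugin (the plugin name and the in-memory bookkeeping are made up): it records a pod-to-node assumption in Reserve and rolls it back in Unreserve.

package main

import (
    "context"
    "sync"

    v1 "k8s.io/api/core/v1"
    "k8s.io/kubernetes/pkg/scheduler/framework"
)

// DemoReserve is a hypothetical stateful plugin.
type DemoReserve struct {
    mu      sync.Mutex
    assumed map[string]string // pod UID -> node name
}

var _ framework.ReservePlugin = &DemoReserve{}

func NewDemoReserve() *DemoReserve {
    return &DemoReserve{assumed: map[string]string{}}
}

func (*DemoReserve) Name() string { return "DemoReserve" }

// Reserve remembers the assumption before binding actually happens.
func (r *DemoReserve) Reserve(ctx context.Context, state *framework.CycleState, p *v1.Pod, nodeName string) *framework.Status {
    r.mu.Lock()
    defer r.mu.Unlock()
    r.assumed[string(p.UID)] = nodeName
    return nil
}

// Unreserve rolls the assumption back; it is idempotent and never fails.
func (r *DemoReserve) Unreserve(ctx context.Context, state *framework.CycleState, p *v1.Pod, nodeName string) {
    r.mu.Lock()
    defer r.mu.Unlock()
    delete(r.assumed, string(p.UID))
}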

2.2.8 Permit: approve/deny/wait before entering the binding cycle

This is the last extension point of the scheduling cycle; it can prevent or delay the binding of a pod to the candidate node.

// PermitPlugin is an interface that must be implemented by "Permit" plugins.
// These plugins are called before a pod is bound to a node.
type PermitPlugin interface {
    Plugin
    // Permit is called before binding a pod (and before prebind plugins). Permit
    // plugins are used to prevent or delay the binding of a Pod. A permit plugin
    // must return success or wait with timeout duration, or the pod will be rejected.
    // The pod will also be rejected if the wait timeout or the pod is rejected while
    // waiting. Note that if the plugin returns "wait", the framework will wait only
    // after running the remaining plugins given that no other plugin rejects the pod.
    Permit(ctx context.Context, state *CycleState, p *v1.Pod, nodeName string) (*Status, time.Duration)
}

Three possible outcomes:

  1. approve: once all Permit plugins approve, the pod moves on to the binding phase below;
  2. deny: if any single Permit plugin denies, the pod cannot enter the binding phase; this triggers the Unreserve() methods of the Reserve plugins;
  3. wait (with a timeout): if any Permit plugin returns "wait", the pod is put into an internal "waiting" pods list (a sketch is shown below).
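
A minimal sketch of a Permit plugin returning "wait" (the plugin name and the gating annotation are made up): the pod is parked in the waiting list for up to 30 seconds unless it is already marked as released; another component can later approve or reject the waiting pod through framework.Handle, or the timeout rejects it.

package main

import (
    "context"
    "time"

    v1 "k8s.io/api/core/v1"
    "k8s.io/kubernetes/pkg/scheduler/framework"
)

// DemoPermit is a hypothetical Permit plugin.
type DemoPermit struct{}

var _ framework.PermitPlugin = &DemoPermit{}

func (*DemoPermit) Name() string { return "DemoPermit" }

func (*DemoPermit) Permit(ctx context.Context, state *framework.CycleState, p *v1.Pod, nodeName string) (*framework.Status, time.Duration) {
    if p.Annotations["example.com/released"] == "true" { // hypothetical annotation
        return framework.NewStatus(framework.Success, ""), 0
    }
    // "Wait" parks the pod in the waiting list until it is allowed, rejected, or times out.
    return framework.NewStatus(framework.Wait, "waiting for release"), 30 * time.Second
}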

2.3 Binding cycle

Fig. Scheduling framework extension points.

2.3.1 PreBind: preparation before Bind, e.g. mounting a volume on the node

For example, before binding the pod to a node, first mount a network volume for it on that node.

// PreBindPlugin is an interface that must be implemented by "PreBind" plugins.
// These plugins are called before a pod being scheduled.
type PreBindPlugin interface {
    Plugin
    // PreBind is called before binding a pod. All prebind plugins must return
    // success or the pod will be rejected and won't be sent for binding.
    PreBind(ctx context.Context, state *CycleState, p *v1.Pod, nodeName string) *Status
}
  • If any PreBind plugin fails, the pod is rejected and the Unreserve() methods of the Reserve plugins are invoked;

2.3.2 Bind: associate the pod with the node

Bind only starts after all PreBind plugins have completed.

// https://github.com/kubernetes/kubernetes/blob/v1.28.4/pkg/scheduler/framework/interface.go#L497

// Bind plugins are used to bind a pod to a Node.
type BindPlugin interface {
    Plugin
    // Bind plugins will not be called until all pre-bind plugins have completed. Each
    // bind plugin is called in the configured order. A bind plugin may choose whether
    // or not to handle the given Pod. If a bind plugin chooses to handle a Pod, the
    // remaining bind plugins are skipped. When a bind plugin does not handle a pod,
    // it must return Skip in its Status code. If a bind plugin returns an Error, the
    // pod is rejected and will not be bound.
    Bind(ctx context.Context, state *CycleState, p *v1.Pod, nodeName string) *Status
}
  • all Bind plugins are executed in their configured order;
  • each plugin may choose whether or not to handle the given pod;
  • if a plugin chooses to handle it, the remaining Bind plugins are skipped, i.e. at most one Bind plugin actually performs the binding (see the sketch below).
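
A minimal sketch of a Bind plugin (the plugin name and annotation are made up, and a client-go clientset is assumed to be injected at construction time): it handles only pods carrying a specific annotation and returns Skip for everything else, leaving those pods to the next bind plugin (e.g. the DefaultBinder).

package main

import (
    "context"

    v1 "k8s.io/api/core/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
    "k8s.io/kubernetes/pkg/scheduler/framework"
)

// DemoBind is a hypothetical Bind plugin.
type DemoBind struct {
    client kubernetes.Interface
}

var _ framework.BindPlugin = &DemoBind{}

func (*DemoBind) Name() string { return "DemoBind" }

func (b *DemoBind) Bind(ctx context.Context, state *framework.CycleState, p *v1.Pod, nodeName string) *framework.Status {
    if p.Annotations["example.com/custom-bind"] != "true" { // not ours
        return framework.NewStatus(framework.Skip, "") // let the next bind plugin handle it
    }
    binding := &v1.Binding{
        ObjectMeta: metav1.ObjectMeta{Namespace: p.Namespace, Name: p.Name, UID: p.UID},
        Target:     v1.ObjectReference{Kind: "Node", Name: nodeName},
    }
    if err := b.client.CoreV1().Pods(p.Namespace).Bind(ctx, binding, metav1.CreateOptions{}); err != nil {
        return framework.AsStatus(err)
    }
    return nil
}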

2.3.3 PostBind: informational, optional, cleanup

This is an informational extension point, i.e. it cannot affect the scheduling decision (it has no return value).

  • only pods that have been bound successfully reach this stage;
  • as the last stage of the binding cycle, it is typically used to clean up related resources.
// https://github.com/kubernetes/kubernetes/blob/v1.28.4/pkg/scheduler/framework/interface.go#L473

// PostBindPlugin is an interface that must be implemented by "PostBind" plugins.
// These plugins are called after a pod is successfully bound to a node.
type PostBindPlugin interface {
    Plugin
    // PostBind is called after a pod is successfully bound. These plugins are informational.
    // A common application of this extension point is for cleaning
    // up. If a plugin needs to clean-up its state after a pod is scheduled and
    // bound, PostBind is the extension point that it should register.
    PostBind(ctx context.Context, state *CycleState, p *v1.Pod, nodeName string)
}

3 Developing a minimal sticky-node scheduler plugin (out-of-tree)

Using kubevirt's pin-VM-to-one-host scheduling as an example, this section shows how to implement an out-of-tree scheduler plugin in a few hundred lines of code.

3.1 Design

3.1.1 Background

Some background [2,3]:

  1. VirtualMachine is a virtual machine CRD;
  2. each VirtualMachine has a corresponding VirtualMachineInstance, which is a running instance of that VirtualMachine;
  3. each VirtualMachineInstance corresponds to one Pod.

On failures, the VirtualMachineInstance and Pod may be recreated and rescheduled, but the VirtualMachine stays the same; the VirtualMachine <--> VirtualMachineInstance/Pod relationship is similar to StatefulSet <--> Pod.

3.1.2 Requirement

Once a VM has been scheduled to some node after creation, it must always be scheduled to that same node afterwards, no matter what failures occur (unless a human intervenes).

A possible scenario: the VM mounts a local disk of the host, so moving it to another host would lose the data. In failure scenarios it is acceptable for the machine or container to be temporarily unavailable; the micro-service layer handles health checking and traffic draining on its own. The infrastructure only needs to guarantee that the host does not change, so that the data is still there after recovery.

In technical terms:

  • after a user creates a VirtualMachine, it is scheduled to a node and created normally;
  • afterwards, no matter what happens (pod crash/eviction/recreate, node restart, ...), this VirtualMachine must be scheduled to that same node.

3.1.3 Solution

  1. When a user creates a VirtualMachine, the default scheduler assigns it a node, and we record that node on the VirtualMachine CR;
  2. if the VirtualMachineInstance/Pod is deleted or recreated, the scheduler first finds the corresponding VirtualMachine CR; if the CR already records a node, use it; otherwise (this must be the first scheduling), go to step 1.

3.2 Implementation

Implementing the above requires registering extension functions at three points:

  1. PreFilter
  2. Filter
  3. PostBind

The code is based on k8s v1.28.

3.2.1 PreFilter()

It mainly does some checks and preparation:

  1. if the pod is not ours: return success directly and leave it to other plugins;
  2. if it is ours, look up the associated VMI/VM CR; there are two cases:

    1. found: the pod has been scheduled before (probably the pod was deleted and is being rescheduled); parse out the original node for the later Filter() stage;
    2. not found: this is the first scheduling; do nothing and let the default scheduler pick the initial node.
  3. save the pod and the node chosen for it (empty if none) into a state context, which is passed on to the later Filter() stage.
// PreFilter invoked at the preFilter extension point.
func (pl *StickyVM) PreFilter(ctx context.Context, state *framework.CycleState, pod *v1.Pod) (*framework.PreFilterResult, *framework.Status) {
    s := &stickyState{false, ""}
    // Save the state up front so that Filter()/PostBind() can always read it, even on
    // the early-return paths below; later mutations of s are visible through the pointer.
    state.Write(stateKey, s)

    // Get pod owner reference
    podOwnerRef := getPodOwnerRef(pod)
    if podOwnerRef == nil {
        return nil, framework.NewStatus(framework.Success, "Pod owner ref not found, return")
    }

    // Get VMI
    vmiName := podOwnerRef.Name
    ns := pod.Namespace

    vmi, err := pl.kubevirtClient.VirtualMachineInstances(ns).Get(context.TODO(), vmiName, metav1.GetOptions{ResourceVersion: "0"})
    if err != nil {
        return nil, framework.NewStatus(framework.Error, "get vmi failed")
    }

    vmiOwnerRef := getVMIOwnerRef(vmi)
    if vmiOwnerRef == nil {
        return nil, framework.NewStatus(framework.Success, "VMI owner ref not found, return")
    }

    // Get VM
    vmName := vmiOwnerRef.Name
    vm, err := pl.kubevirtClient.VirtualMachines(ns).Get(context.TODO(), vmName, metav1.GetOptions{ResourceVersion: "0"})
    if err != nil {
        return nil, framework.NewStatus(framework.Error, "get vm failed")
    }

    // Annotate sticky node to VM
    s.node, s.nodeExists = vm.Annotations[stickyAnnotationKey]
    return nil, framework.NewStatus(framework.Success, "Check pod/vmi/vm finish, return")
}

3.2.2 Filter()

Based on the pod's nodeSelector and other constraints, the scheduler first produces a set of candidate nodes, then iterates over them and calls each plugin's Filter() method to decide whether a node is acceptable. Pseudocode:

// For a given pod
for node in selectedNodes:
    for pl in plugins:
        pl.Filter(ctx, customState, pod, node)

Our plugin logic: first parse the state/pod/node passed in,

  1. if the state contains a saved node,

    1. if that saved node is exactly the node passed to this Filter() call, return success;
    2. for every other node, return failure.

    The net effect: once the pod has been scheduled to some node, it keeps being scheduled to that node, i.e. "sticky node" scheduling.

  2. if the state has no saved node, this is the first scheduling; also return success, and the default scheduler will assign us a node. We persist that node in the later PostBind stage.

func (pl *StickyVM) Filter(ctx context.Context, state *framework.CycleState, pod *v1.Pod, nodeInfo *framework.NodeInfo) *framework.Status {
    s, err := state.Read(stateKey)
    if err != nil {
        return framework.NewStatus(framework.Error, fmt.Sprintf("read preFilter state fail: %v", err))
    }

    r, ok := s.(*stickyState)
    if !ok {
        return framework.NewStatus(framework.Error, fmt.Sprintf("convert %+v to stickyState fail", s))
    }
    if !r.nodeExists {
        return nil
    }

    if r.node != nodeInfo.Node().Name {
        // returning "framework.Error" will prevent process on other nodes
        return framework.NewStatus(framework.Unschedulable, "already stick to another node")
    }

    return nil
}

3.2.3 PostBind()

Reaching this stage means a node has been chosen for the pod. We only need to check whether that node has already been recorded on the VM CR, and record it if not.

func (pl *StickyVM) PostBind(ctx context.Context, state *framework.CycleState, pod *v1.Pod, nodeName string) {
    s, err := state.Read(stateKey)
    if err != nil {
        return
    }

    r, ok := s.(*stickyState)
    if !ok {
        klog.Errorf("PostBind: pod %s/%s: convert failed", pod.Namespace, pod.Name)
        return
    }

    if r.nodeExists {
        klog.Errorf("PostBind: VM already has sticky annotation, return")
        return
    }

    // Get pod owner reference
    podOwnerRef := getPodOwnerRef(pod)
    if podOwnerRef == nil {
        return
    }

    // Get VMI owner reference
    vmiName := podOwnerRef.Name
    ns := pod.Namespace

    vmi, err := pl.kubevirtClient.VirtualMachineInstances(ns).Get(context.TODO(), vmiName, metav1.GetOptions{ResourceVersion: "0"})
    if err != nil {
        return
    }

    vmiOwnerRef := getVMIOwnerRef(vmi)
    if vmiOwnerRef == nil {
        return
    }

    // Add sticky node to VM annotations
    retry.RetryOnConflict(retry.DefaultRetry, func() error {
        vmName := vmiOwnerRef.Name
        vm, err := pl.kubevirtClient.VirtualMachines(ns).Get(context.TODO(), vmName, metav1.GetOptions{ResourceVersion: "0"})
        if err != nil {
            return err
        }

        if vm.Annotations == nil {
            vm.Annotations = make(map[string]string)
        }

        vm.Annotations[stickyAnnotationKey] = nodeName
        if _, err := pl.kubevirtClient.VirtualMachines(pod.Namespace).Update(ctx, vm, metav1.UpdateOptions{}); err != nil {
            return err
        }
        return nil
    })
}

As mentioned earlier, this stage is informational: it cannot affect the scheduling decision, so it has no return value.

3.2.4 Other notes

That is all of the core code; adding some initialization code and the necessary scaffolding makes it compile and run. The full code is here (dependencies not included).

In practice, golang dependency management may be the painful part; you have to sort it out yourself according to your k8s version, scheduler-plugins version, golang version, kubevirt version, and so on.

3.3 Deployment

Scheduler plugins differ from CNI network plugins: the latter are executables (binaries) dropped into a well-known directory, whereas scheduler plugins are long-running services.

3.3.1 Configuration

Create a configuration for our StickyVM scheduler:

$ cat ksc.yaml
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
clientConnection:
  kubeconfig: "/etc/kubernetes/scheduler.kubeconfig"
profiles:
- schedulerName: stickyvm
  plugins:
    preFilter:
      enabled:
      - name: StickyVM
      disabled:
      - name: NodeResourceFit
    filter:
      enabled:
      - name: StickyVM
      disabled:
      - name: NodePorts
      # - name: "*"
    reserve:
      disabled:
      - name: "*"
    preBind:
      disabled:
      - name: "*"
    postBind:
      enabled:
      - name: StickyVM
      disabled:
      - name: "*"

A single KubeSchedulerConfiguration can describe multiple profiles, which start multiple independent schedulers. Since this configuration is consumed by kube-scheduler rather than kube-apiserver,

# content of the file passed to "--config"
apiVersion: kubescheduler.config.k8s.io/v1alpha1
kind: KubeSchedulerConfiguration

neither k api-resources nor k get KubeSchedulerConfiguration will find this resource.

A pod selects a profile by setting the corresponding schedulerName. If none is specified, default-scheduler is used.

3.3.2 Running

No k8s configuration changes are needed; just deploy and run it as an ordinary pod (the appropriate ClusterRole etc. must be created).

For convenience, here we start it directly from a dev machine with the k8s cluster admin certificate, which suits fast iteration during development:

$ ./bin/stickyvm-scheduler --leader-elect=false --config ksc.yaml
Creating StickyVM scheduling plugin
Creating kubevirt clientset
Create kubevirt clientset successful
Create StickyVM scheduling plugin successful
"Starting Kubernetes Scheduler" version="v0.0.20231122"
"Golang settings" GOGC="" GOMAXPROCS="" GOTRACEBACK=""
Serving securely on [::]:10259
"Starting DynamicServingCertificateController"

3.4 Testing

We only need to specify the scheduler name in the VM CR spec.

3.4.1 Creating a VM for the first time

Workflow when a new VM is created:

  1. specify schedulerName: stickyvm in the yaml;
  2. the k8s default scheduler picks a node automatically;
  3. StickyVM follows the owner references to get the VMI/VM, then writes the node into the VM annotations in the PostBind hook.

Logs:

Prefilter: start
Prefilter: processing pod default/virt-launcher-kubevirt-smoke-fedora-nd4hp
PreFilter: parent is VirtualMachineInstance kubevirt-smoke-fedora
PreFilter: found corresponding VMI
PreFilter: found corresponding VM
PreFilter: VM has no sticky node, skip to write to scheduling context
Prefilter: finish
Filter: start
Filter: pod default/virt-launcher-kubevirt-smoke-fedora-nd4hp, sticky node not exist, got node-1, return success
PostBind: start: pod default/virt-launcher-kubevirt-smoke-fedora-nd4hp
PostBind: annotating selected node node-1 to VM
PostBind: parent is VirtualMachineInstance kubevirt-smoke-fedora
PostBind: found corresponding VMI
PostBind: found corresponding VM
PostBind: annotating node node-1 to VM: kubevirt-smoke-fedora

3.4.2 Deleting the VMI/Pod and rescheduling

After deleting the VMI or the pod, the StickyVM plugin reads the node from the annotation in the PreFilter stage, then checks it in the Filter stage and returns success only for that node, which produces the sticky-node scheduling behavior:

Prefilter: start
Prefilter: processing pod default/virt-launcher-kubevirt-smoke-fedora-m8f7v
PreFilter: parent is VirtualMachineInstance kubevirt-smoke-fedora
PreFilter: found corresponding VMI
PreFilter: found corresponding VM
PreFilter: VM already sticky to node node-1, write to scheduling context
Prefilter: finish
Filter: start
Filter: default/virt-launcher-kubevirt-smoke-fedora-m8f7v, already stick to node-1, skip node-2
Filter: start
Filter: default/virt-launcher-kubevirt-smoke-fedora-m8f7v, given node is sticky node node-1, return success
Filter: finish
Filter: start
Filter: default/virt-launcher-kubevirt-smoke-fedora-m8f7v, already stick to node-1, skip node-3
PostBind: start: pod default/virt-launcher-kubevirt-smoke-fedora-m8f7v
PostBind: VM already has sticky annotation, return

At this point the VM already carries the annotation, so the PostBind stage has nothing to do.

4 Summary

This post went through the k8s scheduling framework and its extension plugins, and demonstrated the development and deployment process with an example.

References

  1. github.com/kubernetes-sigs/scheduler-plugins
  2. Virtual Machines on Kubernetes: Requirements and Solutions (2023)
  3. Spawn a Virtual Machine in Kubernetes with kubevirt: A Deep Dive (2023)
  4. Scheduling Framework, kubernetes.io

Written by Human, Not by AI

Spawn a Virtual Machine in Kubernetes with kubevirt: A Deep Dive (2023)

Fig. kubevirt architecture overview

An introductory post before this deep dive: Virtual Machines on Kubernetes: Requirements and Solutions (2023)

Based on kubevirt v1.0.0, v1.1.0.



Fig. Architecture overview of the kubevirt solution

This post assumes there is already a running Kubernetes cluster, and kubevirt is correctly deployed in this cluster.

1 virt-handler startup

1.1 Agent responsibilities

As the node agent, virt-handler is responsible for managing the lifecycle of all VMs on that node, such as creating, destroying, pausing, …, freezing those VMs. It functions similarly to OpenStack's nova-compute, but with the added complexity of running each VM inside a Kubernetes Pod, which requires collaboration with kubelet, Kubernetes's node agent. For example,

  • When creating a VM, virt-handler must wait until kubelet creates the corresponding Pod,
  • When destroying a VM, virt-handler handles the VM destruction first, followed by kubelet performing the remaining cleanup steps (destroying the Pod).

1.2 Start and initialization (call stack)

Run                                           // cmd/virt-handler/virt-handler.go
  |-vmController := NewController()
  |-vmController.Run()
      |-Run()                                 // pkg/virt-handler/vm.go
         |-go c.deviceManagerController.Run()
         | 
         |-for domain in c.domainInformer.GetStore().List() {
         |     d := domain.(*api.Domain)
         |     vmiRef := v1.NewVMIReferenceWithUUID(...)
         |     key := controller.VirtualMachineInstanceKey(vmiRef)
         | 
         |     exists := c.vmiSourceInformer.GetStore().GetByKey(key)
         |     if !exists
         |         c.Queue.Add(key)
         |-}
         | 
         |-for i := 0; i < threadiness; i++ // 10 goroutine by default
               go c.runWorker
                  /
      /----------/
     /
runWorker
  |-for c.Execute() {
         |-key := c.Queue.Get()
         |-c.execute(key) // handle VM changes
              |-vmi, vmiExists := d.getVMIFromCache(key)
              |-domain, domainExists, domainCachedUID := d.getDomainFromCache(key)
              |-if !vmiExists && string(domainCachedUID) != ""
              |     vmi.UID = domainCachedUID
              |-if string(vmi.UID) == "" {
              |     uid := virtcache.LastKnownUIDFromGhostRecordCache(key)
              |     if uid != "" {
              |         vmi.UID = uid
              |     } else { // legacy support, attempt to find UID from watchdog file it exists.
              |         uid := watchdog.WatchdogFileGetUID(d.virtShareDir, vmi)
              |         if uid != ""
              |             vmi.UID = types.UID(uid)
              |     }
              |-}
              |-return d.defaultExecute(key, vmi, vmiExists, domain, domainExists)
    }

Steps done during virt-handler bootstrap:

  1. Start necessary controllers, such as the device-related controller.
  2. Scan all VMs on the node and perform any necessary cleanups.
  3. Spawn goroutines to handle VM-related tasks.

    Each goroutine runs an infinite loop, monitoring changes to kubevirt’s VMI (Virtual Machine Instance) custom resources and responding accordingly. This includes actions like creating, deleting, …, unpausing VMs. For example, if a new VM is detected to be created on the node, the goroutine will initiate the creation process.

1.3 Summary

Now that the agent is ready to handle VM-related tasks, let's create a VM in this Kubernetes cluster and see what happens behind the scenes.

2 Create a VirtualMachine in Kubernetes

Let's see how to create a KVM-based virtual machine (just like the ones you've created in OpenStack, or the EC2 instances you're using on public clouds) with kubevirt, and what happens behind the scenes.

Fig. Workflow of creating a VM in kubevirt. Left: steps added by kubevirt; Right: vanilla procedure of creating a Pod in k8s. [2]

2.1 kube-apiserver: create a VirtualMachine CR

kubevirt introduces a VirtualMachine CRD, which allows users to define the specifications of virtual machines, such as CPU, memory, network, and disk configurations. Below is the spec of our to-be-created VM; it's OK if you don't understand all the fields:

apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  name: kubevirt-smoke-fedora
spec:
  running: true
  template:
    metadata:
      annotations:
        kubevirt.io/keep-launcher-alive-after-failure: "true"
    spec:
      nodeSelector:
        kubevirt.io/schedulable: "true"
      architecture: amd64
      domain:
        clock:
          timer:
            hpet:
              present: false
            hyperv: {}
            pit:
              tickPolicy: delay
            rtc:
              tickPolicy: catchup
          utc: {}
        cpu:
          cores: 1
        resources:
          requests:
            memory: 4G
        machine:
          type: q35
        devices:
          interfaces:
          - bridge: {}
            name: default
          disks:
          - disk:
              bus: virtio
            name: containerdisk
          - disk:
              bus: virtio
            name: emptydisk
          - disk:
              bus: virtio
            name: cloudinitdisk
        features:
          acpi:
            enabled: true
        firmware:
          uuid: c3ecdb42-282e-44c3-8266-91b99ac91261
      networks:
      - name: default
        pod: {}
      volumes:
      - containerDisk:
          image: kubevirt/fedora-cloud-container-disk-demo:latest
          imagePullPolicy: Always
        name: containerdisk
      - emptyDisk:
          capacity: 2Gi
        name: emptydisk
      - cloudInitNoCloud:
          userData: |-
            #cloud-config
            password: changeme               # password of this VM
            chpasswd: { expire: False }
        name: cloudinitdisk

Now just apply it:

(master) $ k apply -f kubevirt-smoke-fedora.yaml

2.2 virt-controller: translate VirtualMachine to VirtualMachineInstance and Pod

virt-controller, a control plane component of kubevirt, monitors VirtualMachine CRs/objects and generates corresponding VirtualMachineInstance objects, and further creates a standard Kubernetes Pod object to describe the VM. See renderLaunchManifest() for details.

VirtualMachineInstance is a running instance of the corresponding VirtualMachine; for example, if you stop a VirtualMachine, the corresponding VirtualMachineInstance is deleted, and it will be recreated when you start the VirtualMachine again.

$ k get vm
NAME                    AGE    STATUS    READY
kubevirt-smoke-fedora   ...

$ k get vmi
NAME                    AGE     PHASE     IP             NODENAME         READY   LIVE-MIGRATABLE   PAUSED
kubevirt-smoke-fedora   ...

$ k get pod -o wide | grep fedora
virt-launcher-kubevirt-smoke-fedora-2kx25   <status> ...

Once the Pod object is created, kube-scheduler takes over and selects a suitable node for the Pod. This has no differences compared with scheduling a normal Kubernetes pod.

The Pod’s yaml specification is very lengthy, we’ll see them piece by piece in following sections.

2.3 kube-scheduler: schedule Pod

Based on Pod’s label selectors, kube-scheduler will choose a node for the Pod, then update the pod spec.

Fig. Architecture overview of the kubevirt solution

The steps described above, from applying a VirtualMachine CR to the scheduling of the corresponding Pod onto a node, all occur within the master node or control plane. Subsequent steps happen within the selected node.

2.4 kubelet: create Pod

Upon detecting a Pod has been scheduled to this node, kubelet on that node initiates the creation of the Pod using its specifications.

While a standard Pod typically consists of a pause container for holding namespaces and a main container for executing user-defined tasks, Kubernetes also allows for multiple containers to be included within a single Pod. This is particularly useful in scenarios such as service mesh, where a sidecar container can be injected into each Pod to process network requests.

In the case of kubevirt, this "multi-container" property is leveraged even further. virt-controller describes 4 containers within the Pod:

  • 2 init containers for creating shared directories for containers in this Pod and copying files;
  • 1 volume container for holding volumes;
  • 1 compute container for holding the VM in this Pod.

2.4.1 pause container

crictl ps won’t show the pause container, but we can check it with ps:

(node) $ ps -ef | grep virt-launcher
qemu     822447 821556  /usr/bin/virt-launcher-monitor --qemu-timeout 288s --name kubevirt-smoke-fedora --uid 413e131b-408d-4ec6-9d2c-dc691e82cfda --namespace default --kubevirt-share-dir /var/run/kubevirt --ephemeral-disk-dir /var/run/kubevirt-ephemeral-disks --container-disk-dir /var/run/kubevirt/container-disks --grace-period-seconds 45 --hook-sidecars 0 --ovmf-path /usr/share/OVMF --run-as-nonroot --keep-after-failure
qemu     822464 822447  /usr/bin/virt-launcher         --qemu-timeout 288s --name kubevirt-smoke-fedora --uid 413e131b-408d-4ec6-9d2c-dc691e82cfda --namespace default --kubevirt-share-dir /var/run/kubevirt --ephemeral-disk-dir /var/run/kubevirt-ephemeral-disks --container-disk-dir /var/run/kubevirt/container-disks --grace-period-seconds 45 --hook-sidecars 0 --ovmf-path /usr/share/OVMF --run-as-nonroot
qemu     822756 822447  /usr/libexec/qemu-kvm -name ... # parent is virt-launcher-monitor

(node) $ ps -ef | grep pause
root     820808 820788  /pause
qemu     821576 821556  /pause
...

Process start information:

$ cat /proc/821576/cmdline | tr '\0' ' ' # the `pause` process
/pause

$ cat /proc/821556/cmdline | tr '\0' ' ' # the parent process
/usr/bin/containerd-shim-runc-v2 -namespace k8s.io -id 09c4b -address /run/containerd/containerd.sock

2.4.2 1st init container: install container-disk-binary to Pod

Snippet from Pod yaml:

  initContainers:
  - command:
    - /usr/bin/cp
    - /usr/bin/container-disk
    - /init/usr/bin/container-disk
    env:
    - name: XDG_CACHE_HOME
      value: /var/run/kubevirt-private
    - name: XDG_CONFIG_HOME
      value: /var/run/kubevirt-private
    - name: XDG_RUNTIME_DIR
      value: /var/run
    image: virt-launcher:v1.0.0
    name: container-disk-binary
    resources:
      limits:
        cpu: 100m
        memory: 40M
      requests:
        cpu: 10m
        memory: 1M
    securityContext:
      allowPrivilegeEscalation: true
      capabilities:
        drop:
        - ALL
      privileged: true
      runAsGroup: 107
      runAsNonRoot: false
      runAsUser: 107
    volumeMounts:
    - mountPath: /init/usr/bin
      name: virt-bin-share-dir

It copies a binary named container-disk from the container image to a directory of the Pod (source code: cmd/container-disk/main.c, about ~100 lines of C), so this binary is shared among all containers of this Pod.

virt-bin-share-dir is declared as a Kubernetes emptyDir; kubelet automatically creates a volume for it on the local disk:

For a Pod that defines an emptyDir volume, the volume is created when the Pod is assigned to a node. As the name says, the emptyDir volume is initially empty. All containers in the Pod can read and write the same files in the emptyDir volume, though that volume can be mounted at the same or different paths in each container. When a Pod is removed from a node for any reason, the data in the emptyDir is deleted permanently.

Check the container:

$ crictl ps -a | grep container-disk-binary # init container runs and exits
55f4628feb5a0   Exited   container-disk-binary   ...

Check the emptyDir created for it:

$ crictl inspect 55f4628feb5a0
    ...
    "mounts": [
      {
        "containerPath": "/init/usr/bin",
        "hostPath": "/var/lib/k8s/kubelet/pods/8364158c/volumes/kubernetes.io~empty-dir/virt-bin-share-dir",
      },

Check what's inside the directory:

$ ls /var/lib/k8s/kubelet/pods/8364158c/volumes/kubernetes.io~empty-dir/virt-bin-share-dir
container-disk # an executable that will be used by the other containers in this Pod

2.4.3 2nd init container: volumecontainerdisk-init

  - command:
    - /usr/bin/container-disk
    args:
    - --no-op                   # exit(0) directly
    image: kubevirt/fedora-cloud-container-disk-demo:latest
    name: volumecontainerdisk-init
    resources:
      limits:
        cpu: 10m
        memory: 40M
      requests:
        cpu: 1m
        ephemeral-storage: 50M
        memory: 1M
    securityContext:
      allowPrivilegeEscalation: true
      capabilities:
        drop:
        - ALL
      privileged: true
      runAsNonRoot: false
      runAsUser: 107
    volumeMounts:
    - mountPath: /var/run/kubevirt-ephemeral-disks/container-disk-data/413e131b-408d-4ec6-9d2c-dc691e82cfda
      name: container-disks
    - mountPath: /usr/bin
      name: virt-bin-share-dir

With --no-op option, the container-disk program will exit immediately with a return code of 0, indicating success.

So, what is the purpose of this container? It appears that it references a volume named container-disks, suggesting that it uses this approach as a workaround for certain edge cases. This ensures that the directory (emptyDir) is created before being utilized by the subsequent container.

2.4.4 1st main container: volumecontainerdisk

  - command:
    - /usr/bin/container-disk
    args:
    - --copy-path
    - /var/run/kubevirt-ephemeral-disks/container-disk-data/413e131b-408d-4ec6-9d2c-dc691e82cfda/disk_0
    image: kubevirt/fedora-cloud-container-disk-demo:latest
    name: volumecontainerdisk
    resources:                         # needs little CPU & memory
      limits:
        cpu: 10m
        memory: 40M
      requests:
        cpu: 1m
        ephemeral-storage: 50M
        memory: 1M
    securityContext:
      allowPrivilegeEscalation: true
      capabilities:
        drop:
        - ALL
      privileged: true
      runAsNonRoot: false
      runAsUser: 107
    volumeMounts:
    - mountPath: /usr/bin
      name: virt-bin-share-dir
    - mountPath: /var/run/kubevirt-ephemeral-disks/container-disk-data/413e131b-408d-4ec6-9d2c-dc691e82cfda
      name: container-disks

This container uses two directories created by the init containers:

  1. virt-bin-share-dir: an emptyDir, created by the 1st init container;
  2. container-disks: an emptyDir, created by the 2nd init container;

--copy-path <path>:

  • Create this path if it does not exist;
  • Create a unix domain socket, listen for requests and close them (a rough sketch of this behavior follows below);

It seems that this container serves the purpose of holding the container-disk-data volume and does not perform any other significant tasks.
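
A rough Go sketch of the behavior described above (the real container-disk is ~100 lines of C; the flag handling and the socket naming here are assumptions, not the original implementation): ensure the target directory exists, then hold a unix socket open, accepting and immediately closing connections.

package main

import (
    "flag"
    "net"
    "os"
    "path/filepath"
)

func main() {
    copyPath := flag.String("copy-path", "", "path to create and hold a socket next to")
    flag.Parse()

    // Create the parent directory of the path if it does not exist.
    if err := os.MkdirAll(filepath.Dir(*copyPath), 0o755); err != nil {
        os.Exit(1)
    }

    // Hold a unix domain socket; its presence signals that the volume is ready.
    l, err := net.Listen("unix", *copyPath+".sock")
    if err != nil {
        os.Exit(1)
    }
    defer l.Close()

    for {
        conn, err := l.Accept()
        if err != nil {
            continue
        }
        conn.Close() // accept and close: the process only needs to keep the volume alive
    }
}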

2.4.5 2nd main container: compute

  - command:
    - /usr/bin/virt-launcher-monitor
    - --qemu-timeout
    - 288s
    - --name
    - kubevirt-smoke-fedora
    - --uid
    - 413e131b-408d-4ec6-9d2c-dc691e82cfda
    - --namespace
    - default
    - --kubevirt-share-dir
    - /var/run/kubevirt
    - --ephemeral-disk-dir
    - /var/run/kubevirt-ephemeral-disks
    - --container-disk-dir
    - /var/run/kubevirt/container-disks
    - --grace-period-seconds
    - "45"
    - --hook-sidecars
    - "0"
    - --ovmf-path
    - /usr/share/OVMF
    - --run-as-nonroot
    - --keep-after-failure
    env:
    - name: XDG_CACHE_HOME
      value: /var/run/kubevirt-private
    - name: XDG_CONFIG_HOME
      value: /var/run/kubevirt-private
    - name: XDG_RUNTIME_DIR
      value: /var/run
    - name: VIRT_LAUNCHER_LOG_VERBOSITY
      value: "6"
    - name: LIBVIRT_DEBUG_LOGS
      value: "1"
    - name: VIRTIOFSD_DEBUG_LOGS
      value: "1"
    - name: POD_NAME
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: metadata.name
    image: virt-launcher:v1.0.0
    name: compute
    resources:
      limits:
        devices.kubevirt.io/kvm: "1"
        devices.kubevirt.io/tun: "1"
        devices.kubevirt.io/vhost-net: "1"
      requests:
        cpu: 100m
        devices.kubevirt.io/kvm: "1"
        devices.kubevirt.io/tun: "1"
        devices.kubevirt.io/vhost-net: "1"
        ephemeral-storage: 50M
        memory: "4261567892"
    securityContext:
      allowPrivilegeEscalation: true
      capabilities:
        add:
        - NET_BIND_SERVICE
        drop:
        - ALL
      privileged: true
      runAsGroup: 107
      runAsNonRoot: false
      runAsUser: 107
    volumeMounts:
    - mountPath: /var/run/kubevirt-private
      name: private
    - mountPath: /var/run/kubevirt
      name: public
    - mountPath: /var/run/kubevirt-ephemeral-disks
      name: ephemeral-disks
    - mountPath: /var/run/kubevirt/container-disks
      mountPropagation: HostToContainer
      name: container-disks
    - mountPath: /var/run/libvirt
      name: libvirt-runtime
    - mountPath: /var/run/kubevirt/sockets
      name: sockets
    - mountPath: /var/run/kubevirt/hotplug-disks
      mountPropagation: HostToContainer
      name: hotplug-disks
    - mountPath: /var/run/kubevirt-ephemeral-disks/disk-data/containerdisk
      name: local

This container runs a binary called virt-launcher-monitor, which is a simple wrapper around virt-launcher. The main purpose of this wrapping layer is better cleanup when the process exits.

virt-launcher-monitor

All of virt-launcher-monitor's arguments are passed to virt-launcher unchanged, except --keep-after-failure, which is removed - it is a monitor-only flag.

// run virt-launcher process and monitor it to give qemu an extra grace period to properly terminate in case of crashes
func RunAndMonitor(containerDiskDir string) (int, error) {
    args := removeArg(os.Args[1:], "--keep-after-failure")

    cmd := exec.Command("/usr/bin/virt-launcher", args...)
    cmd.SysProcAttr = &syscall.SysProcAttr{
        AmbientCaps: []uintptr{unix.CAP_NET_BIND_SERVICE},
    }
    cmd.Start()

    sigs := make(chan os.Signal, 10)
    signal.Notify(sigs, syscall.SIGINT, syscall.SIGTERM, syscall.SIGQUIT, syscall.SIGCHLD)
    go func() {
        for sig := range sigs {
            switch sig {
            case syscall.SIGCHLD:
                var wstatus syscall.WaitStatus
                wpid := syscall.Wait4(-1, &wstatus, syscall.WNOHANG, nil)

                log.Log.Infof("Reaped pid %d with status %d", wpid, int(wstatus))
                if wpid == cmd.Process.Pid {
                    exitStatus <- wstatus.ExitStatus()
                }
            default: // Log("signalling virt-launcher to shut down")
                cmd.Process.Signal(syscall.SIGTERM)
                sig.Signal()
            }
        }
    }()

    exitCode := <-exitStatus // wait for VM's exit
    // do cleanups here
}
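
The snippet above calls removeArg(), which is not shown; a minimal sketch of what such a helper could look like (an assumption, not the original kubevirt code):

// removeArg returns a copy of args with every occurrence of arg dropped.
func removeArg(args []string, arg string) []string {
    out := make([]string, 0, len(args))
    for _, a := range args {
        if a != arg {
            out = append(out, a)
        }
    }
    return out
}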

virt-launcher call stack: start virtqemud/cmdserver

main // cmd/virt-launcher/virt-launcher.go
  |-NewLibvirtWrapper(*runWithNonRoot)
  |-SetupLibvirt(libvirtLogFilters)
  |-StartVirtquemud(stopChan)
  |    |-go func() {
  |    |     for {
  |    |         Run("/usr/sbin/virtqemud -f /var/run/libvirt/virtqemud.conf")
  |    |
  |    |         select {
  |    |         case <-stopChan:
  |    |             return cmd.Process.Kill()
  |    |         }
  |    |     }
  |    |-}()
  |
  |-domainConn := createLibvirtConnection() // "qemu+unix:///session?socket=/var/run/libvirt/virtqemud-sock" or "qemu:///system"
  |
  |-notifier := notifyclient.NewNotifier(*virtShareDir)
  |-domainManager := NewLibvirtDomainManager()
  |
  |-startCmdServer("/var/run/kubevirt/sockets/launcher-init-sock")
  |-startDomainEventMonitoring
  |-domain := waitForDomainUUID()
  |
  |-mon := virtlauncher.NewProcessMonitor(domainName,)
  |-mon.RunForever()
        |-monitorLoop()

It starts two processes inside the container:

  1. virtqemud: a libvirt component, runs as a daemon process;
  2. cmdserver: a gRPC server, provides VM operation (delete/pause/freeze/…) interfaces to the caller;

virtqemud: management for QEMU VMs

virtqemud is a server side daemon component of the libvirt virtualization management system,

  • one of a collection of modular daemons that replace functionality previously provided by the monolithic libvirtd daemon.
  • provide management for QEMU virtual machines.
  • listens for requests on a local Unix domain socket by default. Remote access via TLS/TCP and backwards compatibility with legacy clients expecting libvirtd is provided by the virtproxyd daemon.

Check the container that will hold the VM:

$ crictl ps -a | grep compute
f67f57d432534       Running     compute     0       09c4b63f6bca5       virt-launcher-kubevirt-smoke-fedora-2kx25

Configurations of virtqemud:

$ crictl exec -it f67f57d432534 sh
sh-5.1$ cat /var/run/libvirt/virtqemud.conf
listen_tls = 0
listen_tcp = 0
log_outputs = "1:stderr"
log_filters="3:remote 4:event 3:util.json 3:util.object 3:util.dbus 3:util.netlink 3:node_device 3:rpc 3:access 3:util.threadjob 3:cpu.cpu 3:qemu.qemu_monitor 1:*"

Process tree

There are several processes inside the compute container, show their relationships with pstree:

$ pstree -p <virt-launcher-monitor pid> --hide-threads
virt-launcher-m(<pid>)─┬─qemu-kvm(<pid>) # The real VM process, we'll see this in the next chapter
                       └─virt-launcher(<pid>)─┬─virtlogd(<pid>)
                                              └─virtqemud(<pid>)

# Show the entire process arguments
$ pstree -p <virt-launcher-monitor pid> --hide-threads --arguments
$ virt-launcher-monitor --qemu-timeout 321s --name kubevirt-smoke-fedora ...
  ├─qemu-kvm            -name guest=default_kubevirt-smoke-fedora,debug-threads=on ...
  └─virt-launcher       --qemu-timeout 321s --name kubevirt-smoke-fedora ...
      ├─virtlogd        -f /etc/libvirt/virtlogd.conf
      └─virtqemud       -f /var/run/libvirt/virtqemud.conf

2.5 virt-launcher: reconcile VM state (create VM in this case)

Status so far:

Fig. Ready to create a KVM VM inside the Pod

  1. kubelet has successfully created a Pod and is reconciling the Pod status based on the Pod specification.

    Note that certain details, such as network creation for the Pod, have been omitted to keep this post concise. There are no differences from normal Pods.

  2. virt-handler is prepared to synchronize the status of the VirtualMachineInstance with a real KVM virtual machine on this node. As there is currently no virtual machine present, the first task of the virt-handler is to create the virtual machine.

Now, let’s delve into the detailed steps involved in creating a KVM virtual machine.

2.5.1 virt-handler/cmdclient -> virt-launcher/cmdserver: sync VMI

An informer is used in virt-handler to sync VMI, it will call to the following stack:

defaultExecute
  |-switch {
    case shouldShutdown:
        d.processVmShutdown(vmi, domain)
    case shouldDelete:
        d.processVmDelete(vmi)
    case shouldCleanUp:
        d.processVmCleanup(vmi)
    case shouldUpdate:
        d.processVmUpdate(vmi, domain)
          |// handle migration if needed
          |
          |// handle vm create
          |-d.vmUpdateHelperDefault
               |-client.SyncVirtualMachine(vmi, options)
                   |-// lots of preparation work here
                   |-genericSendVMICmd("SyncVMI", c.v1client.SyncVirtualMachine, vmi, options)
  }

client.SyncVirtualMachine(vmi, options) does lots of preparation work, then calls SyncVMI() gRPC method to synchronize VM status - if not exist, then create it. This method will be handled by the cmdserver in virt-launcher.

2.5.2 virt-launcher/cmdserver: SyncVirtualMachine() -> libvirt C API virDomainCreateWithFlags()

SyncVirtualMachine // pkg/virt-launcher/virtwrap/cmd-server/server.go
  |-vmi, response := getVMIFromRequest(request.Vmi)
  |-domainManager.SyncVMI(vmi, l.allowEmulation, request.Options)
      |-domain := &api.Domain{}
      |-c := l.generateConverterContext // generate libvirt domain from VMI spec
      |-dom := l.virConn.LookupDomainByName(domain.Spec.Name)
      |-if notFound {
      |     domain = l.preStartHook(vmi, domain, false)
      |     dom = withNetworkIfacesResources(
      |         vmi, &domain.Spec,
      |         func(v *v1.VirtualMachineInstance, s *api.DomainSpec) (cli.VirDomain, error) {
      |             return l.setDomainSpecWithHooks(v, s)
      |         },
      |     )
      |
      |     l.metadataCache.UID.Set(vmi.UID)
      |     l.metadataCache.GracePeriod.Set( api.GracePeriodMetadata{DeletionGracePeriodSeconds: converter.GracePeriodSeconds(vmi)},)
      |     logger.Info("Domain defined.")
      |-}
      | 
      |-switch domState {
      |     case vm create:
      |         l.generateCloudInitISO(vmi, &dom)
      |         dom.CreateWithFlags(getDomainCreateFlags(vmi)) // start VirtualMachineInstance
      |     case vm pause/unpause:
      |     case disk attach/detach/resize disks:
      |     case hot plug/unplug virtio interfaces:
      |-}

As the above code shows, it eventually calls into libvirt API to create a "domain".

https://unix.stackexchange.com/questions/408308/why-are-vms-in-kvm-qemu-called-domains

They’re not kvm exclusive terminology (xen also refers to machines as domains). A hypervisor is a rough equivalent to domain zero, or dom0, which is the first system initialized on the kernel and has special privileges. Other domains started later are called domU and are the equivalent to a guest system or virtual machine. The reason is probably that both are very similar as they are executed on the kernel that handles them.

2.5.3 libvirt API -> virtqemud: create domain (VM)

LibvirtDomainManager

All VM/VMI operations are abstracted into a LibvirtDomainManager struct:

// pkg/virt-launcher/virtwrap/manager.go

type LibvirtDomainManager struct {
    virConn cli.Connection

    // Anytime a get and a set is done on the domain, this lock must be held.
    domainModifyLock sync.Mutex
    // mutex to control access to the guest time context
    setGuestTimeLock sync.Mutex

    credManager *accesscredentials.AccessCredentialManager

    hotplugHostDevicesInProgress chan struct{}
    memoryDumpInProgress         chan struct{}

    virtShareDir             string
    ephemeralDiskDir         string
    paused                   pausedVMIs
    agentData                *agentpoller.AsyncAgentStore
    cloudInitDataStore       *cloudinit.CloudInitData
    setGuestTimeContextPtr   *contextStore
    efiEnvironment           *efi.EFIEnvironment
    ovmfPath                 string
    ephemeralDiskCreator     ephemeraldisk.EphemeralDiskCreatorInterface
    directIOChecker          converter.DirectIOChecker
    disksInfo                map[string]*cmdv1.DiskInfo
    cancelSafetyUnfreezeChan chan struct{}
    migrateInfoStats         *stats.DomainJobInfo

    metadataCache *metadata.Cache
}

libvirt C API

// vendor/libvirt.org/go/libvirt/domain.go

// See also https://libvirt.org/html/libvirt-libvirt-domain.html#virDomainCreateWithFlags
func (d *Domain) CreateWithFlags(flags DomainCreateFlags) error {
	C.virDomainCreateWithFlagsWrapper(d.ptr, C.uint(flags), &err)
}

2.5.4 virtqemud -> KVM subsystem

Create VM with xml spec.

The domain (VM) will be created, and the VCPU will enter running state unless special flags are specified.

2.6 Recap

Fig. A KVM VM is created inside the Pod

$ crictl ps -a | grep kubevirt
960d3e86991fa     Running     volumecontainerdisk        0   09c4b63f6bca5       virt-launcher-kubevirt-smoke-fedora-2kx25
f67f57d432534     Running     compute                    0   09c4b63f6bca5       virt-launcher-kubevirt-smoke-fedora-2kx25
e8b79067667b7     Exited      volumecontainerdisk-init   0   09c4b63f6bca5       virt-launcher-kubevirt-smoke-fedora-2kx25
55f4628feb5a0     Exited      container-disk-binary      0   09c4b63f6bca5       virt-launcher-kubevirt-smoke-fedora-2kx25

This finishes our journey of creating a VM in kubevirt.

This is what it looks like if we create two kubevirt VirtualMachines and they are scheduled to the same node:

Fig. Two KVM VMs on the node

3 Managing VM states

Control path of VM state changes:

kube-apiserver -> virt-handler -> cmdserver -> virtqemud -> KVM subsystem -> VM

 VMI states         VM agent     |-- virt-launcher pod --|     kernel

3.1 Resize CPU/Memory/Disk

Workflow (non hotplug):

  1. The VirtualMachine/VirtualMachineInstance spec is modified;
  2. virt-handler receives the changes and modifies KVM VM configurations via virtqemud->KVM;
  3. Restart pod and KVM VM to make the changes take effect.

If hotplug is supported (e.g. Kubernetes VPA is supported), kubevirt should be able to hot-reload the changes.

3.2 Delete VM

Similar workflow as the above.

4 Summary

This post illustrates what happens under the hood when a user creates a VirtualMachine in Kubernetes with kubevirt.

References

  1. github.com/kubevirt
  2. Virtual Machines on Kubernetes: Requirements and Solutions (2023)

Written by Human, Not by AI

Virtual Machines on Kubernetes: Requirements and Solutions (2023)

Fig. Running (full-feature) VMs inside containers, phasing out OpenStack. Solutions: kubevirt, etc



1 Introduction

Some may be puzzled by this topic: why do we still need virtual machines (from the past cloud computing era) when we already have containerized platforms in this cloud-native era? And further, why should we bother managing VMs on Kubernetes, the de-facto container orchestration platform?

Comparing VMs and containers as provisioning methods is a complex matter and out of this post's scope. We just highlight some practical reasons for deploying VMs on Kubernetes.

1.1 Practical reasons

Firstly, not all applications can be containerized. VMs provide a complete operating system environment and scratch space (stateful to users), while containers are most frequently used in a stateless fashion and share the same kernel as the node. Scenarios that are not suitable for containerization:

  • Applications that are tightly coupled with operating systems or have dependencies on specific hardware;
  • GUI-based applications with complex display requirements - Windows as an example;

Secondly, applications with strict security requirements may not be suitable for container deployment:

  • VMs offer stronger isolation between workloads and better control over resource usage;
  • Hard multi-tenancy in OpenStack vs. soft multi-tenancy in Kubernetes;

Thirdly, not all transitions from VMs to containers bring business benefits. While moving from VMs to containers can reduce technical debt in most cases, mature and slowly evolving VM-based stacks may not benefit from such a transition.

With all the above said, despite the benefits of containers, there are still many scenarios where VMs are necessary. The question then becomes: whether to maintain them as standalone or legacy platforms like OpenStack, or to unify management with Kubernetes - especially if your main focus and efforts are already on Kubernetes.

This post explores the latter case: managing VMs along with your container workloads with Kubernetes.

1.2 Resource provision and orchestration

Before moving forward, let’s see a simple comparison between two ages.

1.2.1 Cloud computing era

In this era, the focus primarily lies on IAAS-level, where virtualization is carried out on hardware to provide virtual CPUs, virtual network interfaces, virtual disks, etc. These virtual pieces are finally assembled into a virtual machine (VM), just like a physical machine (blade server) for users.

Users typically express their requirements as follows:

I’d like 3 virtual machines. They should,

  1. Have their own permanent IP addresses (immutable IP throughout their lifecycle).
  2. Have persistent disks for scratch space or stateful data.
  3. Be resizable in terms of CPU, memory, disk, etc.
  4. Be recoverable during maintenance or outages (through cold or live migration).

Once users log in to the machines, they can deploy their business applications and orchestrate their operations on top of these VMs.

Examples of platforms that cater to these needs:

  • AWS EC2
  • OpenStack

Focus of these platforms: resource sharing, hard multi-tenancy, strong isolation, security, etc.

1.2.2 Cloud Native era

In the cloud-native era, orchestration platforms still pay attention to the above-mentioned needs, but they operate at a higher level than IAAS. They address concerns such as elasticity, scalability, high availability, service load balancing, and model abstraction. The resulting platforms typically manage stateless workloads.

For instance, in the case of Kubernetes, users often express their requirements as follows:

I want an nginx service for serving a static website, which should:

  • Have a unique entrypoint for accessing (ServiceIP, etc).
  • Have 3 instances replicated across 3 nodes (affinity/anti-affinity rules).
  • Requests should be load balanced (ServiceIP to PodIPs load balancing).
  • Misbehaving instances be automatically replaced with new ones (stateless, health-checking, and reconciliation mechanisms).

1.3 Summary

With the above discussions in mind, let’s see some open-source solutions for managing VM workloads on Kubernetes.

2 Managing VM workloads via Kubernetes: solutions

There are two typical solutions, both based on Kubernetes and capable of managing both container and VM workloads:

  1. VM inside container: suitable for teams that currently maintain both OpenStack and Kubernetes. They can leverage this solution to provision VMs to end users while gradually phasing out OpenStack.

  2. Container inside VM: for teams that already enjoy the benefits and conveniences of the container ecosystem, but would like to strengthen the security and isolation of their container workloads.

2.1 Run VM inside Pod: kubevirt

Fig. Running (full-feature) VMs inside containers, phasing out OpenStack. Solutions: kubevirt, etc

kubevirt utilizes Kubernetes for VM provisioning.

  • Run on top of vanilla Kubernetes.
  • Introduce several CRDs and components to provision VMs.
  • Facilitate VM provisioning by embedding each VM into a container (pod).
  • Compatible with almost all Kubernetes facilities, e.g. Service load-balancing.

2.2 Run Pod inside VM: kata containers

Fig. Running containers inside (lightweight) VMs, with a proper container runtime. Solutions: kata containers, etc

Kata containers wrap workloads in a lightweight VM (see the sketch after the list below),

  • Deploy containers inside a lightweight and ultra-fast VM.
  • Enhance container security with this out-layer VM.
  • Need a dedicated container runtime (but no changes to Kubernetes).
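
To make this concrete, below is a minimal, hypothetical Go sketch (using client-go) of the only Kubernetes-side configuration involved: a RuntimeClass whose handler points at the kata runtime, plus a Pod that opts in via runtimeClassName. The handler name kata and the nginx image are illustrative assumptions; the real handler name depends on how the node’s CRI runtime (containerd/CRI-O) is configured.

package main

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	nodev1 "k8s.io/api/node/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)
	ctx := context.TODO()

	// A RuntimeClass mapping to a "kata" handler, which the node's CRI runtime
	// (containerd/CRI-O) is assumed to be configured with.
	rc := &nodev1.RuntimeClass{
		ObjectMeta: metav1.ObjectMeta{Name: "kata"},
		Handler:    "kata",
	}
	if _, err := client.NodeV1().RuntimeClasses().Create(ctx, rc, metav1.CreateOptions{}); err != nil {
		fmt.Println("create RuntimeClass:", err)
	}

	// An otherwise ordinary Pod: the only kata-specific bit is RuntimeClassName,
	// which is why no changes to Kubernetes itself are needed.
	runtimeClass := "kata"
	pod := &corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: "nginx-in-kata", Namespace: "default"},
		Spec: corev1.PodSpec{
			RuntimeClassName: &runtimeClass,
			Containers:       []corev1.Container{{Name: "nginx", Image: "nginx:alpine"}},
		},
	}
	if _, err := client.CoreV1().Pods("default").Create(ctx, pod, metav1.CreateOptions{}); err != nil {
		fmt.Println("create Pod:", err)
	}
}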

3 Kubevirt solution overview

In this section, we’ll take a quick look at the kubevirt project.

3.1 Architecture and components

High level architecture:

Fig. kubevirt architecture overview

Main components:

  • virt-api: kubevirt apiserver, for accepting requests like console streaming;
  • virt-controller: reconciles kubevirt objects like VirtualMachine, VirtualMachineInstance (VMI);
  • virt-handler: node agent (like nova-compute in OpenStack), collaborates with Kubernetes’s node agent kubelet;
  • virtctl: CLI, e.g. virtctl console <vm>

3.2 How it works

How a VM is created in kubevirt on top of Kubernetes:

Fig. Workflow of creating a VM in kubevirt. Left: steps added by kubevirt; Right: the vanilla procedure of creating a Pod in k8s.

You can see that kubevirt only adds steps on top of the vanilla Kubernetes workflow; it does not change that workflow itself.

An in-depth illustration: Spawn a Virtual Machine in Kubernetes with kubevirt: A Deep Dive.

3.3 Node internal topology

The internal view of the components inside a node:

Fig. A k8s/kubevirt node with two (KVM) VMs

3.4 Tech stacks

3.4.1 Computing

Still based on KVM/QEMU/libvirt, just like OpenStack.

3.4.2 Networking

Compatible with the CNI mechanism, can work seamlessly with popular network solutions like flannel, calico, and cilium.

The kubevirt agent further creates the virtual machine network on top of the pod network. This is necessary because virtual machines run as userspace processes and require userspace-emulated network devices (such as TUN/TAP) instead of veth pairs.

Networking is a big topic; I’d like to write a dedicated post about it (if time permits).

3.4.3 Storage

Based on Kubernetes storage mechanisms (PV/PVC); advanced features like VM snapshot, clone and live migration all rely on these mechanisms.

kubevirt also makes some extensions, for example containerDisk (embedding virtual machine images into container images).

4 Conclusion

This post discusses why there is a need to run VMs on Kubernetes, and gives a technical overview of the kubevirt project.

References

  1. github.com/kubevirt
  2. github.com/kata-containers
  3. Spawn a Virtual Machine in Kubernetes with kubevirt: A Deep Dive (2023)

Written by Human, Not by AI

K8s 的核心是 API 而非容器(二):从开源项目看 k8s 的几种 API 扩展机制(2023)

Fig. kube-apiserver internal flows when processing a request. Image source Programming Kubernetes, O'Reilly

第一篇介绍了 k8s 的 API 设计。本文作为第二篇,通过具体开源项目来了解 k8s API 的几种扩展机制。



1 引言

1.1 扩展 API 的需求

上一篇已经看到,k8s 所有资源都通过 kube-apiserver 以 API 的形式暴露给各组件和用户, 例如通过 /api/v1/pods/... 可以对 pod 执行增删查改操作。 但如果用户有特殊需求,无法基于现有 API 实现某些目的,该怎么办呢?

有特殊需求的场景很多,举一个更具体的例子: 假设我们想加一个类似于 /api/v1/pods/namespaces/{ns}/{pod}/hotspots 的 API, 用于查询指定 pod 的某些热点指标(用户自己采集和维护)。针对这个需求有两种常见的解决思路:

  1. 直接改 k8s 代码,增加用户需要的 API 和一些处理逻辑;
  2. 为 k8s 引入某种通用的扩展机制,能让用户在不修改 k8s 代码的情况下, 也能实现新增 API 的功能。

显然,第二种方式更为通用,而且能更快落地,因为修改 k8s 代码并合并到上游通常是一个漫长的过程。 实际上,k8s 不仅提供了这样的机制,而且还提供了不止一种。 本文就这一主题展开介绍。

1.2 K8s Resource & API 回顾

在深入理解 API 扩展机制之前,先简单回顾下 k8s 的 API 设计。更多信息可参考前一篇。

1.2.1 API Resources

K8s 有很多内置的对象类型,包括 pod、node、role、rolebinding、networkpolicy 等等, 在 k8s 术语中,它们统称为“Resource”(资源)。 资源通过 kube-apiserver 的 API 暴露出来,可以对它们执行增删查改操作(前提是有权限)。 用 kubectl 命令可以获取这个 resource API 列表:

$ k api-resources
# 名称         # 命令行简写  # API 版本   # 是否区分 ns   # 资源类型
NAME           SHORTNAMES    APIVERSION   NAMESPACED      KIND
configmaps     cm            v1           true            ConfigMap
events         ev            v1           true            Event
namespaces     ns            v1           false           Namespace
nodes          no            v1           false           Node
pods           po            v1           true            Pod
...

组合以上几个字段值,就可以拼出 API。例如针对内置资源类型,以及是否区分 ns,

  1. Namespaced resource

    • 格式:/api/{version}/namespaces/{namespace}/{resource}
    • 举例:/api/v1/namespaces/default/pods
  2. Unnamespaced resource

    • 格式:/api/{version}/{resource}
    • 举例:/api/v1/nodes

1.2.2 API 使用方式

有两种常见的使用方式:

  1. 通过 SDK(例如 client-go)或裸代码,直接向 API 发起请求。适合程序使用, 例如各种自己实现的 controller、operator、apiserver 等等(本小节末尾附一个最小示例)。

  2. 通过 kubectl 命令行方式,它会将各种 CLI 参数拼接成对应的 API。适合人工交互使用,例如问题排查;

     # 直接增删查改指定资源(或资源类型)
     $ k get pods -n kube-system -o wide
    
     # 向指定 API 发起请求
     $ kubectl get --raw "/apis/metrics.k8s.io/v1beta1/nodes/" | jq . | head -n 20
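
上面演示的是第 2 种(kubectl)方式;第 1 种(client-go)方式的一个最小示例如下(仅为示意,假设本地 ~/.kube/config 可用),效果等价于 k get pods -n kube-system:

package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// 读取 ~/.kube/config,构造 clientset(集群内可改用 rest.InClusterConfig())
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// 底层请求的就是 /api/v1/namespaces/kube-system/pods
	pods, err := client.CoreV1().Pods("kube-system").List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	for _, p := range pods.Items {
		fmt.Println(p.Name)
	}
}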
    

1.3 小结

有了以上铺垫,接下来我们将深入分析 k8s 提供的两种 API 扩展机制

  1. CRD (Custom Resource Definition),自定义资源
  2. Kubernetes API Aggregation Layer (APIRegistration),直译为 API 聚合层

2 扩展机制一:CRD

扩展 k8s API 的第一种机制称为 CRD (Custom Resource Definition), 在第一篇中已经有了比较详细的介绍。

简单来说,这种机制要求用户将自己的自定义资源类型描述注册到 k8s 中, 这种自定义资源类型称为 CRD,这种类型的对象称为 CR,后面会看到具体例子。 从名字 Custom Resource 就可以看出,它们本质上也是资源, 只不过是用户自定义资源,以区别于 pods/nodes/services 等内置资源

2.1 案例需求:用 k8s 管理虚拟机

第一篇中已经有关于 CRD 创建和使用的简单例子。这里再举一个真实例子: k8s 只能管理容器,现在我们想让它连虚拟机也一起管理起来,也就是通过引入 "VirtualMachine" 这样一个抽象 (并实现对应的 apiserver/controller/agent 等核心组件), 实现通过 k8s 来创建、删除和管理虚拟机等目的。

实际上已经有这样一个开源项目,叫 kubevirt, 已经做到生产 ready。本文接下来就拿它作为例子。

实际上 kubevirt 引入了多个 CRD,但本文不是关于 kubevirt 的专门介绍,因此简单起见这里只看最核心的“虚拟机”抽象。

2.2 引入 VirtualMachine CRD

我们自定义的虚拟机资源最终要对应到 k8s object, 因此要符合后者的格式要求。从最高层来看,它非常简单:

// https://github.com/kubevirt/kubevirt/blob/v1.0.0/staging/src/kubevirt.io/api/core/v1/types.go#L1327-L1343

// The VirtualMachine contains the template to create the VirtualMachineInstance.
type VirtualMachine struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`
	Spec VirtualMachineSpec `json:"spec" valid:"required"`
	Status VirtualMachineStatus `json:"status,omitempty"`
}

这就是一个标准 k8s object 结构,

  • type/object metadata 字段是每个 k8s object 都要带的,
  • Spec 描述这个“虚拟机”对象长什么样(期望的状态),

    里面包括了 CPU 架构(x86/arm/..)、PCIe 设备、磁盘、网卡等等关于虚拟机的描述信息; 这里就不展开了,有兴趣可以移步相应代码链接;

  • Status 描述这个“虚拟机”对象现在是什么状态。

将以上结构体用 OpenAPI schema 描述,就变成 k8s 能认的格式, 然后将其注册到 k8s,相当于

$ k apply -f virtualmachine-cr.yaml

VirtualMachine 这个 CRD 就注册完成了。用第一篇中的类比,这就相当于在数据库中创建了一张表。 可以用 kubectl explain 等方式来查看这张“表”的字段描述:

$ k explain virtualmachine
GROUP:      kubevirt.io
KIND:       VirtualMachine
VERSION:    v1
...

$ k get crd virtualmachines.kubevirt.io -o yaml
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
...

2.3 使用 kubectl 增删查改 VirtualMachine

CRD 创建好之后,就可以创建这种自定义类型的对象了。

比如下面的 vm-cirros.yaml:

apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  labels:
    kubevirt.io/vm: vm-cirros
  name: vm-cirros
spec:
  running: false
  template:
    metadata:
      labels:
        kubevirt.io/vm: vm-cirros
    spec:
      domain:
        devices:
          disks:
          - disk:
              bus: virtio
            name: containerdisk
          - disk:
              bus: virtio
            name: cloudinitdisk
        resources:
          requests:
            memory: 128Mi
      terminationGracePeriodSeconds: 0
      volumes:
      - containerDisk:
          image: registry:5000/kubevirt/cirros-container-disk-demo:devel
        name: containerdisk

用 kubectl apply 以上 yaml,就创建了一个虚拟机(的描述)。接下来还可以继续用 kubectl 对这个虚拟机执行查删改等操作,与对 pods/nodes 等原生资源的操作类似:

$ k get virtualmachines.kubevirt.io # or 'k get vm'
NAME                    AGE   STATUS    READY
vm-cirros               1h    Running   True

要让虚拟机正确运行,还需要实现必要的虚拟机创建和处理逻辑, 这是 kubevirt 的几个控制组件(apiserver/controller/agent)做的事情,但这不是本文重点,所以不展开。

2.4 背后的 VirtualMachine API

之所以用 kubectl 操作 VirtualMachine,是因为在创建 CRD 时,k8s 自动帮我们生成了一套对应的 API, 并同样通过 kube-apiserver 提供服务。在命令行加上适当的日志级别就能看到这些 API 请求:

$ k get vm -v 10 2>&1 | grep -v Response | grep apis
curl -v -XGET ... 'https://xxx:6443/apis?timeout=32s'
GET https://xxx:6443/apis?timeout=32s 200 OK in 2 milliseconds
curl -v -XGET ...  'https://xx:6443/apis/kubevirt.io/v1/namespaces/default/virtualmachines?limit=500'
GET https://xxx:6443/apis/kubevirt.io/v1/namespaces/default/virtualmachines?limit=500 200 OK in 6 milliseconds

更具体来说,CRD 的 API 会落到下面这个扩展 API 组里:

  • 格式:/apis/{apiGroup}/{apiVersion}/namespaces/{namespace}/{resource}
  • 举例:/apis/kubevirt.io/v1/namespaces/default/virtualmachines

k api-resources 会列出所在 k8s 集群所有的 API,包括内置类型和扩展类型:

$ k api-resources
NAME               SHORTNAMES   APIGROUP       NAMESPACED   KIND
virtualmachines    vm,vms       kubevirt.io    true         VirtualMachine
...
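
除了 kubectl,程序里也可以用 client-go 的 dynamic client 直接访问这组由 CRD 自动生成的 API。下面是一个极简示意(仅为演示,假设本地 kubeconfig 可用、集群中已安装 kubevirt),效果等价于上面抓到的 GET /apis/kubevirt.io/v1/namespaces/default/virtualmachines 请求:

package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := dynamic.NewForConfigOrDie(cfg)

	// CRD 注册后,kube-apiserver 自动提供的 API:
	// /apis/kubevirt.io/v1/namespaces/{ns}/virtualmachines
	gvr := schema.GroupVersionResource{Group: "kubevirt.io", Version: "v1", Resource: "virtualmachines"}

	vms, err := client.Resource(gvr).Namespace("default").List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	for _, vm := range vms.Items {
		fmt.Println(vm.GetName())
	}
}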

2.5 小结

本节介绍了第一种 API 扩展机制,对于需要引入自定义资源的场景非常有用。 但如果用户没有要引入的新资源类型,只是想对现有的(内置或自定义)资源类型加一些新的 API, CRD 机制就不适用了。我们再来看另一种机制。

3 扩展机制二:Aggregated API Server (APIService)

Aggregated API Server(一些文档中也缩写为 AA)也提供了一种扩展 API 的机制。 这里,“聚合”是为了和处理 pods/nodes/services 等资源的 “核心” apiserver 做区分。

注意,AA 并不是独立组件,而是 kube-apiserver 中的一个模块, 运行在 kube-apiserver 进程中。

什么情况下会用到 AA 提供的扩展机制呢?

3.1 用户需求

如果没有要引入的自定义资源,只是想(给已有的资源)加一些新的 API,那 CRD 方式就不适用了。 两个例子,

  1. 用户想引入一个服务从所有 node 收集 nodes/pods 数据,聚合之后通过 kube-apiserver 入口提供服务(而不是自己提供一个 server 入口);

    这样集群内的服务,包括 k8s 自身、用户 pods 等,都可以直接通过 incluster 方式获取这些信息(前提是有权限)。

  2. 想给上一节引入的虚拟机 API apis/kubevirt.io/v1/namespaces/{ns}/virtualmachines/{vm} 增加一层 sub-url,

    • apis/kubevirt.io/v1/namespaces/{ns}/virtualmachines/{vm}/start
    • apis/kubevirt.io/v1/namespaces/{ns}/virtualmachines/{vm}/stop
    • apis/kubevirt.io/v1/namespaces/{ns}/virtualmachines/{vm}/pause
    • apis/kubevirt.io/v1/namespaces/{ns}/virtualmachines/{vm}/migrate

3.2 方案设计

3.2.1 引入 kube-aggregator 模块和 APIService 抽象

  • APIService 表示的是一个有特定 GroupVersion 的 server。
  • APIService 一般用于对原有资源(API)加 subresource。

这样一个模块+模型,就能支持用户注册新 API 到 kube-apiserver。 举例,

  • 用户将 apis/kubevirt.io/v1/namespaces/{ns}/virtualmachines/{vm}/start 注册到 kube-apiserver;
  • kube-apiserver 如果收到这样的请求,就将其转发给指定的 Service 进行处理,例如 kubevirt namespace 内名为 virt-api 的 Service。

kube-apiserver 在这里相当于用户服务(virt-api)的反向代理。 下面看一下它内部的真实工作流。

3.2.2 kube-apiserver 内部工作流(delegate)

kube-apiserver 内部实现了下面这样一个 workflow,

Fig. kube-apiserver internal flows when processing a request. Image source Programming Kubernetes, O'Reilly

进入到 kube-apiserver 的请求会依次经历四个阶段(列表之后附一段简化的代码示意):

  1. kube-aggregator:处理本节这种反向代理需求,将请求转发给 API 对应的用户服务;如果没有命中,转 2;
  2. kube resources:处理内置的 pods, services 等内置资源;如果没有命中,转 3;
  3. apiextensions-apiserver:处理 CRD 资源的请求;如果没有命中,转 4;
  4. 返回 404。
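
用一段极简的 Go 伪代码来表达这种 delegate(委托)链(仅为示意,并非 kube-apiserver 的真实实现):

package main

import (
	"log"
	"net/http"
)

// 每一层都是一个 http.Handler:自己处理不了的请求,就委托(delegate)给下一层
type delegatingHandler struct {
	canHandle func(*http.Request) bool // 判断请求是否归自己处理
	handle    http.HandlerFunc
	next      http.Handler
}

func (d *delegatingHandler) ServeHTTP(w http.ResponseWriter, r *http.Request) {
	if d.canHandle(r) {
		d.handle(w, r)
		return
	}
	d.next.ServeHTTP(w, r)
}

func main() {
	notFound := http.NotFoundHandler() // 4. 都没命中:返回 404

	apiExtensions := &delegatingHandler{ // 3. apiextensions-apiserver:处理 CRD 资源
		canHandle: func(r *http.Request) bool { return false /* 是否为已注册 CRD 的 API */ },
		handle:    func(w http.ResponseWriter, r *http.Request) { /* 读写 CR 对象 */ },
		next:      notFound,
	}
	kubeResources := &delegatingHandler{ // 2. 内置资源:pods/services/...
		canHandle: func(r *http.Request) bool { return false /* 是否为内置资源 API */ },
		handle:    func(w http.ResponseWriter, r *http.Request) { /* 读写内置对象 */ },
		next:      apiExtensions,
	}
	aggregator := &delegatingHandler{ // 1. kube-aggregator:命中 APIService 就反向代理给对应 service
		canHandle: func(r *http.Request) bool { return false /* 是否匹配某个 APIService */ },
		handle:    func(w http.ResponseWriter, r *http.Request) { /* 转发给 virt-api 等后端服务 */ },
		next:      kubeResources,
	}

	log.Fatal(http.ListenAndServe(":8080", aggregator))
}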

下面看两个具体案例。

3.3 案例一:k8s 官方 metrics-server

AA 机制的一个官方例子是 github.com/kubernetes-sigs/metrics-server。 它启动一个 metrics-server 从所有 kubelet 收集 pods/nodes 的 CPU、Memory 等信息, 然后向 kube-apiserver 注册若干 API,包括

  • /apis/metrics.k8s.io/v1beta1/nodes
  • /apis/metrics.k8s.io/v1beta1/pods

HPA、VPA、scheduler 等组件会通过这些 API 获取数据, 供自动扩缩容、动态调度等场景决策使用。

3.3.1 注册 APIService

apiVersion: apiregistration.k8s.io/v1
kind: APIService
metadata:
  labels:
    k8s-app: metrics-server
  name: v1beta1.metrics.k8s.io
spec:
  group: metrics.k8s.io       # 所有到 /apis/metrics.k8s.io/v1beta1/ 的请求
  groupPriorityMinimum: 100
  insecureSkipTLSVerify: true # 用 http 转发请求
  service:                    # 请求转发给这个 service
    name: metrics-server
    namespace: kube-system
  version: v1beta1
  versionPriority: 100

以上 yaml 表示,如果请求的 URL 能匹配到 API 前缀 /apis/metrics.k8s.io/v1beta1/,那么 kube-apiserver 就用 HTTP(insecure)的方式将请求转发给 kube-system/metrics-server 进行处理。

我们能进一步在 api-resources 列表中看到 metrics-server 注册了哪些 API:

$ k api-resources | grep metrics.k8s.io
nodes   metrics.k8s.io     false        NodeMetrics
pods    metrics.k8s.io     true         PodMetrics
...

这两个 API 对应的完整 URL 是 /apis/metrics.k8s.io/v1beta1/{nodes,pods}

3.3.2 验证注册的扩展 API

用 kubectl 访问 metrics-server 注册的 API,这个请求会发送给 kube-apiserver:

$ kubectl get --raw "/apis/metrics.k8s.io/v1beta1/nodes/" | jq . | head -n 20
{
  "kind": "NodeMetricsList",
  "apiVersion": "metrics.k8s.io/v1beta1",
  "metadata": {
    "selfLink": "/apis/metrics.k8s.io/v1beta1/nodes/"
  },
  "items": [
    {
      "metadata": {
        "name": "node1",
        "selfLink": "/apis/metrics.k8s.io/v1beta1/nodes/node1",
      },
      "timestamp": "2023-10-14T16:26:56Z",
      "window": "30s",
      "usage": {
        "cpu": "706808951n",
        "memory": "6778764Ki"
      }
    },
...

成功拿到了所有 node 的 CPU 和 Memory 使用信息。

直接 curl API 也可以,不过 kube-apiserver 是 https 服务,所以要加上几个证书才行。

$ cat curl-k8s-apiserver.sh
curl -s --cert /etc/kubernetes/pki/admin.crt --key /etc/kubernetes/pki/admin.key --cacert /etc/kubernetes/pki/ca.crt $@

$ ./curl-k8s-apiserver.sh https://localhost:6443/apis/metrics.k8s.io/v1beta1/nodes/

类似地,获取指定 pod 的 CPU/Memory metrics:

$ kubectl get --raw "/apis/metrics.k8s.io/v1beta1/namespaces/default/pods/cilium-smoke-0" | jq '.'
{
  "kind": "PodMetrics",
  "apiVersion": "metrics.k8s.io/v1beta1",
  "metadata": {
    "name": "cilium-smoke-0",
    "namespace": "default",
    "selfLink": "/apis/metrics.k8s.io/v1beta1/namespaces/default/pods/cilium-smoke-0",
  },
  "timestamp": "2023-10-14T16:28:37Z",
  "window": "30s",
  "containers": [
    {
      "name": "nginx",
      "usage": {
        "cpu": "7336n",
        "memory": "3492Ki"
      }
    }
  ]
}
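
除了 kubectl 和 curl,程序(例如 HPA 这类控制器)也可以直接访问这些聚合 API。下面用官方的 metrics 客户端(k8s.io/metrics)做一个极简示意,集群外用 kubeconfig,集群内可改用 rest.InClusterConfig():

package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/tools/clientcmd"
	metricsclient "k8s.io/metrics/pkg/client/clientset/versioned"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	mc, err := metricsclient.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	// 底层请求的就是 /apis/metrics.k8s.io/v1beta1/nodes
	nodes, err := mc.MetricsV1beta1().NodeMetricses().List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	for _, n := range nodes.Items {
		fmt.Printf("%s cpu=%s memory=%s\n", n.Name, n.Usage.Cpu(), n.Usage.Memory())
	}
}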

3.3.3 命令行支持:k top node/pod

metrics-server 是官方项目,所以它还在 kubectl 里面加了几个子命令来对接这些扩展 API, 方便集群管理和问题排查:

$ k top node
NAME     CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
node-1   346m         1%     6551Mi          2%
node-2   743m         1%     8439Mi          3%
node-4   107m         0%     6606Mi          2%
node-3   261m         0%     8759Mi          3%

一般的 AA 项目是不会动 kubectl 代码的。

3.4 案例二:kubevirt

3.4.1 APIService 注册

注册一个名为 v1.subresources.kubevirt.io 的 APIService 到 k8s 集群:

具体到 kubevirt 代码,这个 APIService 是由 virt-operator 注册的,见 pkg/virt-operator/resource/generate/components/apiservices.go。

$ k get apiservices v1.subresources.kubevirt.io -o yaml
apiVersion: apiregistration.k8s.io/v1
kind: APIService
metadata:
  name: v1.subresources.kubevirt.io
spec:
  group: subresources.kubevirt.io  # 所有到 /apis/subresources.kubevirt.io/v1/ 的请求
  version: v1
  groupPriorityMinimum: 1000
  caBundle: LS0tLS1C...0tLS0K      # https 转发请求,用这个证书
  service:                         # 转发给这个 service
    name: virt-api
    namespace: kubevirt
    port: 443
  versionPriority: 15
status:
  conditions:
    message: all checks passed     # 所有检查都通过了,现在是 ready 状态
    reason: Passed
    status: "True"
    type: Available

以上表示,所有到 /apis/subresources.kubevirt.io/v1/ 的请求,kube-apiserver 应该用 HTTPS 转发给 kubevirt/virt-api 这个 service 处理。 查看这个 service

$ k get svc -n kubevirt virt-api -o wide
NAME       TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)   AGE   SELECTOR
virt-api   ClusterIP   10.7.10.6    <none>        443/TCP   1d    kubevirt.io=virt-api

另外注意,status 里面有个 available 字段, 用来指示后端 service 健康检测是否正常。状态不正常时的表现:

$ k get apiservice -o wide | grep kubevirt
v1.kubevirt.io                Local                   True                       5h2m
v1.subresources.kubevirt.io   kubevirt/virt-api       False (MissingEndpoints)   5h1m

提示 service 没有 endpoints(pods)。

3.4.2 Sub-url handler 注册(virt-api)

virt-api 这个服务在启动时会注册几十个 subresource handler,例如:

  • /apis/subresources.kubevirt.io/v1/namespaces/default/virtualmachineinstances/{name}/console
  • /apis/subresources.kubevirt.io/v1/namespaces/default/virtualmachineinstances/{name}/restart
  • /apis/subresources.kubevirt.io/v1/namespaces/default/virtualmachineinstances/{name}/freeze

可以看到这些都会命中上面注册的 APIService,因此当有这样的请求到达 kube-apiserver 时, 就会通过 https 将请求转发给 virt-api 进行处理。
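
被代理的一端(virt-api)本质上就是一个注册了这些 sub-url 的 HTTP(S) server。kubevirt 实际使用 go-restful 注册路由,这里用标准库 net/http 做一个极简示意(路径前缀和端口均为演示假设):

package main

import (
	"fmt"
	"log"
	"net/http"
)

func main() {
	mux := http.NewServeMux()

	// kube-apiserver(kube-aggregator)会把命中 APIService 的请求反向代理到这里,
	// 路径形如 .../virtualmachineinstances/{name}/console、/restart、/freeze 等
	base := "/apis/subresources.kubevirt.io/v1/"
	mux.HandleFunc(base, func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintf(w, "subresource request: %s %s\n", r.Method, r.URL.Path)
	})

	// 真实场景下必须是 HTTPS,证书与 APIService 里的 caBundle 对应
	log.Fatal(http.ListenAndServe(":8443", mux))
}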

3.4.3 测试

在 master node 上用命令 virtctl console kubevirt-smoke-fedora 登录 VM 时,下面是抓取到的 kube-apiserver audit log

* username: system:unsecured
* user_groups: ["system:masters","system:authenticated"]
* request_uri: /apis/subresources.kubevirt.io/v1/namespaces/default/virtualmachineinstances/kubevirt-smoke-fedora/console

可以看到,请求的确实是以上 sub-url。这个请求的大致路径:

virtctl (CLI) <-> kube-apiserver <-> kube-aggregator (in kube-apiserver) <-> virt-api service <-> virt-api pods <-> virt-handler (agent)

3.5 其他案例

  1. podexec/podlogs,都在 apiserver-builder 项目内, 分别是 k exec <pod> 和 k logs <pod> 背后调用的 API:

     $ k -v 10 exec -it -n kube-system coredns-pod-1 bash 2>&1 | grep -v Response | grep api | grep exec
     curl -v -XPOST ... 'https://xx:6443/api/v1/namespaces/kube-system/pods/coredns-pod-1/exec?command=bash&container=coredns&stdin=true&stdout=true&tty=true'
     POST https://xx:6443/api/v1/namespaces/kube-system/pods/coredns-pod-1/exec?command=bash&container=coredns&stdin=true&stdout=true&tty=true 403 Forbidden in 36 milliseconds
    
  2. custom-metrics-server

    这跟前面介绍的 metrics-server 并不是同一个项目,它收集的是自定义指标(custom metrics)。 metrics-server 只用到了 APIService,这个项目还用到了 subresource。

3.6 APIService 分类:Local/external

查看集群中所有 apiservice

$ k get apiservices
NAME                                   SERVICE             AVAILABLE   AGE
v1.                                    Local               True        26d
v1.acme.cert-manager.io                Local               True        4d5h
v1.admissionregistration.k8s.io        Local               True        26d
v1.apiextensions.k8s.io                Local               True        26d
v1.kubevirt.io                         Local               True        2d7h
v1.subresources.kubevirt.io            kubevirt/virt-api   True        2d7h
...

第二列有些是 Local,有些是具体的 Service <ns>/<svc name>。这种 Local 的表示什么意思呢? 挑一个看看:

$ k get apiservice v1.kubevirt.io -o yaml
apiVersion: apiregistration.k8s.io/v1
kind: APIService
metadata:
  labels:
    kube-aggregator.kubernetes.io/automanaged: "true"  # kube-aggregator 自动管理的
                                                       # kube-aggregator 并不是一个独立组件,而是集成在 kube-apiserver 中
  name: v1.kubevirt.io
  selfLink: /apis/apiregistration.k8s.io/v1/apiservices/v1.kubevirt.io
spec:
  group: kubevirt.io
  version: v1
  groupPriorityMinimum: 1000
  versionPriority: 100
status:
  conditions:
    status: "True"
    type: Available                                    # 类型:可用
    reason: Local
    message: Local APIServices are always available    # Local APIService 永远可用

  • 状态是 Available,reason 是 Local
  • 没有 service 字段,说明没有独立的后端服务

实际上,这种 Local 类型对应的请求是由 kube-apiserver 直接处理的;这种 APIService 也不是用户注册的,是 kube-aggregator 模块自动创建的。 更多关于 kube-apiserver 的实现细节可参考 [3]。

4 两种机制的对比:CRD vs. APIService

4.1 所在的资源组不同

$ k api-resources
NAME                        SHORTNAMES   APIVERSION                NAMESPACED   KIND
customresourcedefinitions   crd,crds     apiextensions.k8s.io/v1   false        CustomResourceDefinition
apiservices                              apiregistration.k8s.io/v1 false        APIService
...

二者位于两个不同的资源组,对应的 API:

  • CRD: /apis/apiextensions.k8s.io/{version}/...
  • APIService: /apis/apiregistration.k8s.io/{version}/...

4.2 目的和场景不同

CRD 主要目的是让 k8s 能处理新的对象类型(new kinds of object), 只要用户按规范提交一个自定义资源的描述(CRD),k8s 就会自动为它生成一套 CRUD API。

聚合层的目的则不同。 从设计文档可以看出, 当时引入聚合层有几个目的:

  1. 提高 API 扩展性:可以方便地定义自己的 API,以 kube-apiserver 作为入口,而无需修改任何 k8s 核心代码
  2. 加速新功能迭代:新的 API 先通过聚合层引入 k8s,如有必要再合入 kube-apiserver 核心(修改后者是一个漫长的过程);
  3. 作为 experimental API 试验场
  4. 提供一套标准的 API 扩展规范:否则用户都按自己的意愿来,最后社区管理将走向混乱。

4.3 使用建议

两个官方脚手架项目:kubebuilder(基于 CRD)和 apiserver-builder(基于 APIService)。

官方建议:用脚手架项目;优先考虑用 CRD,实在不能满足需求再考虑 APIService 方式。这样的特殊场景包括:

  1. 希望使用其他 storage API,将数据存储到 etcd 之外的其他地方;
  2. 希望支持 long-running subresources/endpoints,例如 websocket;
  3. 希望对接外部系统;

5 Webhook 机制

Webhook 并不是设计用来扩展 API 的,但它提供的注册机制确实也实现了添加 API 的功能, 另外它也在 kube-apiserver 内部,所以本文也简单列一下,参照学习。

5.1 Webhook 位置及原理

Fig. k8s API request. Image source github.com/krvarma/mutating-webhook

两种 webhook:

  • mutating webhook:拦截指定的资源请求,判断操作是否允许,或者动态修改资源

    • 举例:如果 pod 打了 sidecar-injector 相关标签,就会在这一步给它注入 sidecar
  • validating webhook:同样拦截指定的资源请求,但只做校验(允许/拒绝),不能修改对象;它在所有 mutating webhook 之后执行。

5.2 Mutating 案例:过滤所有 create/update virtualmachine 请求

kubevirt 通过注册如下 mutating webhook, 实现对 CREATE/UPDATE /apis/kubevirt.io/v1/virtualmachines 请求的拦截,并转发到 virt-api.kubevirt:443/virtualmachines-mutate 进行额外处理:

apiVersion: admissionregistration.k8s.io/v1
kind: MutatingWebhookConfiguration
metadata:
  name: virt-api-mutator
webhooks:
- admissionReviewVersions:
  - v1
  clientConfig:
    caBundle: LS0tL...
    service:
      name: virt-api
      namespace: kubevirt
      path: /virtualmachines-mutate
      port: 443
  name: virtualmachines-mutator.kubevirt.io
  rules:
  - apiGroups:
    - kubevirt.io
    apiVersions:
    - v1
    operations:
    - CREATE
    - UPDATE
    resources:
    - virtualmachines
    scope: '*'
...

组件 virt-api 中实现了这些额外的处理逻辑。
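
这类处理逻辑本质上就是一个 HTTPS 服务:接收 kube-apiserver POST 过来的 AdmissionReview,返回带决策(以及可选 JSONPatch)的 AdmissionResponse。下面是一个极简的 Go 示意(并非 kubevirt 真实代码,patch 内容和路径均为演示假设):

package main

import (
	"encoding/json"
	"log"
	"net/http"

	admissionv1 "k8s.io/api/admission/v1"
)

// mutating webhook handler:kube-apiserver 将 AdmissionReview POST 过来,
// 我们在 Response 里返回是否允许,以及一个可选的 JSONPatch 用于修改对象
func mutate(w http.ResponseWriter, r *http.Request) {
	var review admissionv1.AdmissionReview
	if err := json.NewDecoder(r.Body).Decode(&review); err != nil || review.Request == nil {
		http.Error(w, "bad AdmissionReview", http.StatusBadRequest)
		return
	}

	// 演示用的 patch:给对象加一个 label
	patch := []byte(`[{"op":"add","path":"/metadata/labels/mutated-by","value":"example-webhook"}]`)
	patchType := admissionv1.PatchTypeJSONPatch

	review.Response = &admissionv1.AdmissionResponse{
		UID:       review.Request.UID, // 必须原样带回请求的 UID
		Allowed:   true,
		Patch:     patch,
		PatchType: &patchType,
	}
	review.Request = nil

	w.Header().Set("Content-Type", "application/json")
	json.NewEncoder(w).Encode(review)
}

func main() {
	http.HandleFunc("/virtualmachines-mutate", mutate)
	// 真实部署必须是 HTTPS,证书与 webhook 配置中的 caBundle 对应;这里仅为演示
	log.Fatal(http.ListenAndServe(":8443", nil))
}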

5.3 Validating 案例:拦截驱逐 virtualmachines 请求

apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: virt-api-validator
webhooks:
- admissionReviewVersions:
  - v1
  clientConfig:
    caBundle: LS0t
    service:
      name: virt-api
      namespace: kubevirt
      path: /launcher-eviction-validate
      port: 443
  name: virt-launcher-eviction-interceptor.kubevirt.io
  rules:
  - apiGroups:
    - ""
    apiVersions:
    - v1
    operations:
    - '*'
    resources:
    - pods/eviction
    scope: '*'
...

这样可以在虚拟机被驱逐之前做一次额外判断,例如禁止驱逐。

6 结束语

本文梳理了几种 k8s API 的扩展机制,并拿几个开源项目做了实际解读,以便加深理解。 两种机制在使用时都有相应的脚手架项目,应避免自己完全从头写代码。

参考资料

  1. Aggregated API Servers 设计文档, 2019
  2. Patterns of Kubernetes API Extensions, ITNEXT, 2018
  3. Kubernetes apiExtensionsServer 源码解析, 2020