Conversation

everzakov

Add the vTPM (virtual Trusted Platform Module) specification to the documentation, config.go, and the schema description. The runtime uses this specification to create vTPMs and pass them to the container. The virtual module can be used to create quotes and signatures and to perform Direct Anonymous Attestation.

Users can also specify that the vTPM should be manufactured with pre-created certificates, activated PCR banks, and a populated Endorsement Key pair.

The following is an example of a vTPM description that is found under the path /linux/resources/vtpms:

    "vtpms": [
        {
            "statePath": "/var/lib/runc/myvtpm1",
            "statePathIsManaged": false,
            "vtpmVersion": "2",
            "createCerts": false,
            "runAs": "tss",
            "pcrBanks": "sha1,sha512",
            "encryptionPassword": "mysecret",
            "vtpmName": "tpm0",
            "vtpmMajor": 100,
            "vtpmMinor": 1
        }
    ]
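
For reference, here is a rough Go sketch of what the corresponding addition to config.go might look like. The struct and field names are inferred from the JSON keys above and are illustrative only, not necessarily what this PR defines:

    package specs

    // LinuxVTPM describes one virtual TPM instance to be created for the
    // container (hypothetical shape, mirroring the JSON example above).
    type LinuxVTPM struct {
        // StatePath is the host directory holding the vTPM state.
        StatePath string `json:"statePath"`
        // StatePathIsManaged indicates whether the runtime manages (creates
        // and deletes) the state directory itself.
        StatePathIsManaged bool `json:"statePathIsManaged,omitempty"`
        // VTPMVersion selects the TPM specification version (e.g. "2").
        VTPMVersion string `json:"vtpmVersion,omitempty"`
        // CreateCerts requests pre-created certificates at manufacture time.
        CreateCerts bool `json:"createCerts,omitempty"`
        // RunAs is the user the vTPM emulator runs as (e.g. "tss").
        RunAs string `json:"runAs,omitempty"`
        // PCRBanks is a comma-separated list of PCR banks to activate.
        PCRBanks string `json:"pcrBanks,omitempty"`
        // EncryptionPassword protects the vTPM state encryption.
        EncryptionPassword string `json:"encryptionPassword,omitempty"`
        // VTPMName is the device name exposed to the container (e.g. "tpm0").
        VTPMName string `json:"vtpmName,omitempty"`
        // VTPMMajor and VTPMMinor are the device numbers to use.
        VTPMMajor int64 `json:"vtpmMajor,omitempty"`
        VTPMMinor int64 `json:"vtpmMinor,omitempty"`
    }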

This PR is based on #920

Add the vTPM specification to the documentation, config.go, and
schema description. The following is an example of a vTPM description
that is found under the path /linux/resources/vtpms:

    "vtpms": [
        {
            "statePath": "/var/lib/runc/myvtpm1",
            "vtpmVersion": "2",
            "createCerts": false,
            "runAs": "tss",
            "pcrBanks": "sha1,sha512"
        }
    ]

Signed-off-by: Stefan Berger <stefanb@linux.ibm.com>
… the container

Signed-off-by: Efim Verzakov <efimverzakov@gmail.com>
@tianon
Member

tianon commented Aug 16, 2025

If I understand correctly, the idea is that a runtime is expected to start an instance of swtpm behind the scenes, and wire up the result inside the container. Is that accurate?

This is perhaps mirroring some of the concerns expressed in #920, but what's the benefit of doing that over running swtpm explicitly and mapping the device or socket from it?

To maybe help explain why this makes me nervous, what do we do if the container dies? The runtime is typically long gone at that point, so what makes sure swtpm shuts down? What if swtpm has a problem and shuts down before the container does? At the level of the runtime, there's no "orchestrator" monitoring processes, there's just a container process and a bunch of kernel resources tied to that process (most of which clean themselves up pretty reasonably when the container exits or dies).

Another aspect is how non-container runtimes (VMs, etc) are expected to implement this. If they can't support this, they should probably simply error, right? The same if swtpm is not installed?

So in short, why is the runtime layer the appropriate place for this and not, say, the orchestrators like containerd, Docker, kubernetes, etc?

@everzakov
Author

runc vTPM support is part of a container remote attestation solution, in which the vTPM is the device that stores the attestation results.
This PR implements the atomic capability of managing vTPMs directly in runc.

[Figure: vTPM architecture]

Container remote attestation process:

  1. An admin deploys a container via the k8s API.
  2. runc creates a vTPM for verification.
  3. (planned) runc creates a vCRTM for measurements.
    Note: since there is no de-facto open-source vCRTM project, this functionality will be planned once one exists.
  4. runc starts the new business container.
  5. The vCRTM is used to measure the business container's files and content.
  6. The signature verification result is stored in the vTPM device.
  7. A report is sent to the remote attestation service.

vTPM: virtualized Trusted Platform Module
vCRTM: virtualized Core Root of Trust for Measurement
RAC: remote attestation container, which manages the attestation process
RAS: remote attestation service, which stores the keys for verification

In this solution, the responsibilities of each component are:

  • k8s API: handles configuration and lifecycle of the vTPM for a pod; decides whether the vTPM is created/deleted/cleaned/recreated (similar behavior to other devices).
  • kubelet: monitors the status of the container and the vTPM device and reports to the k8s API server; node-level lifecycle management of the vTPM.
  • runc: atomic capability to create/delete/clean a vTPM based on the request from containerd.

@everzakov
Author

So in short, why is the runtime layer the appropriate place for this and not, say, the orchestrators like containerd, Docker, kubernetes, etc?

This is a good question. If I understand correctly, we have several container extension points:

  1. Kubelet plugins - Dynamic Resource Allocation / Device Manager
  2. containerd Node Resource Interface (NRI) plugins
  3. Runtime Hooks

We cannot use runtime hooks (e.g. createContainer) because the runtime/runc reads the container config only once, so we would not be able to extend the Linux devices.

We cannot use kubelet Device Manager plugins because there is a possible use case of sharing the same vTPM between several containers in a pod.
From KEP 4381 (https://github.com/kubernetes/enhancements/blob/master/keps/sig-node/4381-dra-structured-parameters/README.md):
the device plugin API could be extended to share devices between containers in a single pod,
but supporting sharing between pods would need an API similar to Dynamic Resource Allocation.

We cannot use only kubelet Dynamic Resource Allocation plugins because NodePrepareResources (https://github.com/kubernetes/kubernetes/blob/v1.33.4/pkg/kubelet/cm/dra/manager.go#L179)
and UnprepareResources (https://github.com/kubernetes/kubernetes/blob/v1.33.4/pkg/kubelet/cm/dra/manager.go#L405) are called only once per pod.
As for the GetResources function (https://github.com/kubernetes/kubernetes/blob/v1.33.4/pkg/kubelet/cm/dra/manager.go#L362), it reads the CDI devices for each container from its cache.
So if we want the vTPM internal state to be recreated every time a container is (re)created (for example on a retry), we have to combine this with another extension point.

As for the Node Resource Interface plugins, NRI can be a good candidate for implementing the vTPM feature (because it can apply container config adjustments to pass the device / device cgroup entry).
However, it has some weak points:

  1. NRI plugins are started only once, when containerd starts, so they cannot use the k8s API to handle vTPM configuration.
    We would need to combine NRI with another extension point.
  2. containerd restarts. When containerd is restarted, it recovers existing containers (https://github.com/containerd/containerd/blob/v2.1.4/internal/cri/server/restart.go#L55):
    CRI sandboxes and containers are repopulated, and NRI plugins have a mechanism for synchronizing containers/sandboxes (https://github.com/containerd/containerd/blob/v2.1.4/internal/nri/nri.go#L455).
    swtpm processes are forked, so they survive a containerd restart; their PIDs can be saved in container annotations and retrieved from the container spec.
    However, if an swtpm process is killed and another process is started in its place, should the NRI plugin delete the second process or return an error?
    If the NRI plugin returns an error, the CRI plugin will be locked in an error state. If it deletes the second process and recreates the first,
    then the component above the second swtpm process will be in an error state (because it will try to recreate the second process and fail).
    Also, the Container interface (https://github.com/containerd/containerd/blob/v2.1.4/client/container.go#L52) does not have a method to update the spec (with the new PID of the swtpm process).

If swtpm is run by the runtime, we can add its PID to the container state file, so this problem does not exist:
when the container is deleted, we only need to kill/delete the corresponding swtpm processes.

However, the main weak point of using any container extension point other than the runtime is how the runtime works with devices.
We have several use cases, and the most common one is passing different vTPM devices with the same device path (e.g. /dev/tpm0) to different containers.
This can be done by creating several devices with generated host names and mknod-ing them into the container using their major/minor. However,
if the runtime runs in a non-default user namespace or has to create a new user namespace, it uses a bind mount instead of mknod (https://github.com/opencontainers/runc/blob/v1.3.0/libcontainer/rootfs_linux.go#L916).
In such cases we would have to pass the generated host name in the container config.

As for the lack of monitoring in the runtime, containerd has a function that monitors task exit (https://github.com/containerd/containerd/blob/v2.1.4/internal/cri/server/events.go#L147).
After a task exits, the task is deleted, so the swtpm processes will be stopped.

@everzakov
Author

Another aspect is how non-container runtimes (VMs, etc) are expected to implement this.

We assign runc to create the vTPM because we want to align with the same architecture design as the VM platform.
In the VM scenario the vBIOS does this job, while in containers, looking at the component call sequence, runc plays a similar role and sits in a similar position.

[Figure: vTPM in the VM stack vs. the container stack]

@everzakov
Author

If they can't support this, they should probably simply error, right? The same if swtpm is not installed?

If the runtime does not have the vTPM feature, or swtpm is not installed, then an error should be returned.

@everzakov
Author

Sorry for the late reply, I was on PTO :(

@tianon
Member

tianon commented Aug 27, 2025

we assign runc to create vtpm because we want to allign the same architecture design as VM platform.
In VM scenario it is vBIOS do the job.

This isn't quite true though, right? In QEMU at least (I'm not sure about other VM platforms), TPM support requires the operator to pre-launch an instance of swtpm, and manage it themselves outside of QEMU: https://qemu-project.gitlab.io/qemu/specs/tpm.html#the-qemu-tpm-emulator-device, https://wiki.archlinux.org/title/QEMU#Trusted_Platform_Module_emulation

If we take a similar approach in runc, then the swtpm devices or sockets are no different than anything else you might share with the container in the bundle spec.

My biggest concern is the lifecycle management of that swtpm process, because again, runc is not running anymore once the container is up, so from the perspective of runc and the container, nothing will be left behind to manage (or stop/cleanup) that swtpm process.

@everzakov
Author

everzakov commented Aug 27, 2025

If we take a similar approach in runc, then the swtpm devices or sockets are no different than anything else you might share with the container in the bundle spec.

Yes. However, I have a concern: we want to create independent vTPM devices (backed by several swtpm processes) and pass them to different containers under the same device path inside the container (e.g. /dev/tpm0). To do this we need to make sure their host device paths differ, and pass their major and minor together with the required container device path (/dev/tpm0).
runc can use two mechanisms to create devices under the rootfs: mknod and bind mount (https://github.com/opencontainers/runc/blob/main/libcontainer/rootfs_linux.go#L916).
If mknod is used, this approach works.
However, if a bind mount is used, an error is returned (there is no /dev/tpm0 on the host). To solve this we can pass the device's host path instead of the required device path (/dev/tpm0).

So my concern is the following: only at the runtime level can we be sure which mechanism runc will use, and this affects the value of the device path in the container device config. Either way, we need to extend the current device config or add a new field to the runtime spec.
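
To illustrate the difference, here is a minimal Go sketch (not runc's actual code) of the two mechanisms, assuming a hypothetical generated host node /dev/tpm-generated-0 with major 100, minor 1:

    package main

    import (
        "log"

        "golang.org/x/sys/unix"
    )

    // createViaMknod makes a brand-new character device node at the desired
    // container path; only major/minor are needed, no host path at all.
    func createViaMknod(path string, major, minor uint32) error {
        return unix.Mknod(path, unix.S_IFCHR|0o600, int(unix.Mkdev(major, minor)))
    }

    // createViaBind bind-mounts an existing host node onto the container path
    // (which must already exist as a plain file); the source must exist on the
    // host, which is why the generated host path, not /dev/tpm0, has to be
    // known in this case.
    func createViaBind(hostPath, containerPath string) error {
        return unix.Mount(hostPath, containerPath, "", unix.MS_BIND, "")
    }

    func main() {
        // Both calls require privileges; this only shows the two shapes.
        if err := createViaMknod("/tmp/rootfs/dev/tpm0", 100, 1); err != nil {
            log.Printf("mknod: %v", err)
        }
        if err := createViaBind("/dev/tpm-generated-0", "/tmp/rootfs/dev/tpm0"); err != nil {
            log.Printf("bind mount: %v", err)
        }
    }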

@everzakov
Author

My biggest concern is the lifecycle management of that swtpm process, because again, runc is not running anymore once the container is up, so from the perspective of runc and the container, nothing will be left behind to manage (or stop/cleanup) that swtpm process.

I understand your concern that runc is only invoked while the container is being set up. As far as I know, runc delete is also called when the container needs to be cleaned up; in that command we can stop/clean up all the swtpm processes created for the container. If your concern is that runc delete may return an error code and the swtpm processes then won't be stopped/cleaned up, I think we can add additional checks to stop/clean up swtpm in UnprepareResources (when a terminating pod is synced): https://github.com/kubernetes/kubernetes/blob/v1.33.4/pkg/kubelet/cm/dra/manager.go#L405 .

@everzakov
Author

If I understand correctly, the idea is that a runtime is expected to start an instance of swtpm behind the scenes, and wire up the result inside the container. Is that accurate?

This is perhaps mirroring some of the concerns expressed in #920, but what's the benefit of doing that over running swtpm explicitly and mapping the device or socket from it?

To maybe help explain why this makes me nervous, what do we do if the container dies? The runtime is typically long gone at that point, so what makes sure swtpm shuts down? What if swtpm has a problem and shuts down before the container does? At the level of the runtime, there's no "orchestrator" monitoring processes, there's just a container process and a bunch of kernel resources tied to that process (most of which clean themselves up pretty reasonably when the container exits or dies).

Another aspect is how non-container runtimes (VMs, etc) are expected to implement this. If they can't support this, they should probably simply error, right? The same if swtpm is not installed?

So in short, why is the runtime layer the appropriate place for this and not, say, the orchestrators like containerd, Docker, kubernetes, etc?

Hello @tianon, we have reconsidered our approach to passing the vTPM to the container:

  1. There will be a DRA (k8s Dynamic Resource Allocation, https://kubernetes.io/docs/concepts/scheduling-eviction/dynamic-resource-allocation/) plugin that creates/deletes/monitors swtpm processes.
    In NodePrepareResources it will start an swtpm_cuse process, create a CDI (Container Device Interface, https://github.com/cncf-tags/container-device-interface) file, and return the CDI ID in the response.
  2. I repeat my concern that only at the runtime level can we be sure which mechanism (mknod/mount) runc will use, which is why we need to pass both the "container" path and the host path in the container config.
  3. In containerd, CDI devices will be parsed (https://github.com/containerd/containerd/blob/main/internal/cri/server/container_create_linux.go#L104) and the vTPMs will be applied to the container runtime config.
  4. In runc we only need to decide which "device" path should be passed to the devices config.
  5. Since v1.34 it is possible to create a device health monitoring stream (https://kubernetes.io/docs/concepts/scheduling-eviction/dynamic-resource-allocation/#device-health-monitoring).
    If the swtpm process is killed and cannot be restarted, the device will be marked as unhealthy in the pod status.

Possible problems:

  1. We create/mknod devices in the vTPM plugin. To do this, their major/minor must be in the vTPM plugin container's device cgroup allowlist.
    This can be done, e.g., with an NRI plugin: https://github.com/containerd/nri/blob/main/pkg/runtime-tools/generate/generate.go#L288 .
  2. In the vTPM plugin we cannot be sure when a CRI container will be (re)started (it may be possible using the GetContainerEvents call, https://github.com/containerd/containerd/blob/main/internal/cri/server/container_events.go,
    but I think there can be a time lag), so the swtpm state cannot be fully recreated (new Endorsement Key pair, etc.) every time the container is recreated.

In runtime-spec the changes would only be:

"vtpms": [
  {
    "containerPath": "/dev/tpm0",
    "hostPath": "/dev/tpm-generated-0",
    "vtpmMajor": 100,
    "vtpmMinor": 1
  }
],
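
A rough Go sketch of how this trimmed-down entry could be expressed in config.go (names inferred from the JSON above, purely illustrative):

    package specs

    // LinuxVTPM in the reduced form discussed here: the runtime no longer
    // manufactures the vTPM, it only wires an already-created host device
    // into the container (hypothetical shape).
    type LinuxVTPM struct {
        // ContainerPath is the device path expected inside the container.
        ContainerPath string `json:"containerPath"`
        // HostPath is the generated device node created by the DRA plugin.
        HostPath string `json:"hostPath"`
        // VTPMMajor and VTPMMinor identify the CUSE device for the device
        // cgroup allowlist and for mknod when that path is taken.
        VTPMMajor int64 `json:"vtpmMajor"`
        VTPMMinor int64 `json:"vtpmMinor"`
    }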

I'm now working on a vTPM plugin PoC to validate this approach.

@rata
Member

rata commented Sep 11, 2025

@everzakov Handling the swtpm process creation/lifecycle outside of the runtime as @tianon was saying is great, I think it is a blocker otherwise.

Some questions:

  1. With that example runtime-spec config you shared, what would runc need to do from a high-level point of view? Is it a bind mount of the hostPath in the containerPath? In that case, why do we need the major/minor?
  2. My understanding is that we can create as many vTPMs on a host as we want, is this right? I'm not a DRA expert, but do all DRA device drivers need a capacity or can they say "infinite"? Checking the doc you linked, it doesn't seem like it's possible today to have "no capacity" in a DRA device driver. How would that work in this case?

I think if there isn't any host-device we need to "consume" for swtpm, then not sure why we are using DRA. Or can DRA model things with "infinite" capacity too?

@everzakov
Author

  1. With that example runtime-spec config you shared, what would runc need to do from a high-level point of view? Is it a bind mount of the hostPath in the containerPath? In that case, why do we need the major/minor?

Thank you for your comment, and for reminding me that we can also use the mounts config to pass the device :)

We chose DRA because we want to pass the swtpm config to the container via resource claims.
In that case, the user only needs to set parameters such as the vTPM version, vtpmName, etc. in the resource claim config.

At start-up, the vTPM plugin will publish a resource slice with the possible devices and their major/minor values.
The scheduler will allocate some devices from this resource slice for the given pod,
and their major/minor will be passed when starting swtpm_cuse.

In my opinion, after CUSE has created the host device, runc only needs to pass/create this device in the test container
(and, if possible, with the specified uid/gid inside the container).
runc also needs to add the created device's major/minor, which are the keys for the virtual devices managed by the kernel, to the device cgroup allowlist
(otherwise any operation on this device by a user in the created container will be rejected with a permission error);
that is why additional fields like vtpmMajor and vtpmMinor are passed.
This could be done with an NRI plugin: https://github.com/containerd/nri/blob/main/pkg/runtime-tools/generate/generate.go#L287 .
However, the necessary major/minor is only fixed once the pod is scheduled and the device is allocated (so we cannot pass them as an annotation).
If we use CDI to pass the device as a mount, it won't work because
we would need to add an additional device node (https://github.com/cncf-tags/container-device-interface/blob/main/pkg/cdi/container-edits.go#L112).
Or we can add the vTPM as a container edit :)
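
As an illustration of the allowlist part, here is a minimal sketch using the existing runtime-spec Go types (assuming major 100, minor 1; this is not code from this PR):

    package main

    import (
        "fmt"

        specs "github.com/opencontainers/runtime-spec/specs-go"
    )

    func main() {
        major, minor := int64(100), int64(1)
        // Device cgroup rule permitting read/write/mknod for the CUSE-backed
        // vTPM character device inside the container.
        allow := specs.LinuxDeviceCgroup{
            Allow:  true,
            Type:   "c",
            Major:  &major,
            Minor:  &minor,
            Access: "rwm",
        }
        fmt.Printf("%+v\n", allow)
    }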

Also, I tried to mount the device and hit a "bug":

  1. If I understand correctly, when the tpm2-tools package is installed on the host,
    the 60-tpm-udev.rules file (https://packages.ubuntu.com/jammy/all/tpm-udev/filelist) is created.
    The rule (https://salsa.debian.org/debian/tpm-udev/-/blob/master/debian/tpm-udev.udev?ref_type=heads) sets the tss owner on device names matching tpm[0-9]*.
  2. If we mount the device into the container rootfs, the device will have the UID of the host tss user (not the container's).
  3. We get a Permission denied error if we try to use this device when the host and container tss users have different UIDs.

If we use mknod, the device will have a root UID.

@everzakov
Author

2. My understanding is that we can create as many vTPMs on a host as we want, is this right? I'm not a DRA expert, but do all DRA device drivers need a capacity or can they say "infinite"? Checking the doc you linked, it doesn't seem like it's possible today to have "no capacity" in a DRA device driver. How would that work in this case?

If I understand correctly, it is currently impossible to have "infinite" capacity,
because all devices have to be defined before they are allocated / the pod is scheduled:
https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/dynamic-resource-allocation/structured/internal/stable/allocator_stable.go#L410 .
In my opinion, when deploying the whole solution, the cluster admin should set the device major to use (or the container can run in privileged security mode to allow mknod of all possible devices).
On start-up, the vTPM plugin will exec swtpm with minor 0 to reserve the major and make sure other CUSE devices are not allocated the same major.
After that it will report all the possible devices in a resource slice (the maximum is 2^20 - 1, but I think it will be far fewer than that :-).
So it will consume the minor values for the devices.

@everzakov
Author

Also, I think I need to test the whole solution with the use case where hostUsers=false in the pod spec. The logic should be the same as for volumes.
However, CDI does not pass uid_mappings and gid_mappings when applying a mount (https://github.com/cncf-tags/container-device-interface/blob/main/pkg/cdi/oci.go#L34),
and the DRA plugin does not know about the created user namespace (https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/cm/dra/manager.go#L226).

In short, I will now try to pass the device created by the DRA plugin to the test container with a specified user inside the test container.

@rata
Member

rata commented Sep 19, 2025

@everzakov thanks!

  1. I'm not sure what is the point of a hostPath in the json if we will create it with mknod just using the major/minor. What am I missing?
  2. It feels like an abuse to use DRA with something that has "infinite" capacity, DRA is currently designed to assign consumable devices to containers and track that. We don't have that problem here. It seems like using DRA for something it is not designed to do.
  3. uid_mappings and those won't work with character devices. Those are designed to be used with filesystems. The device will need to have a uid/gid and be set to that. I guess that is missing in the json you proposed. In the userns case, you want to use the hostUID/hostGID that the root user is mapped to (or whichever user inside the container you want to be the owner of it).

I'm probably missing something obvious, but why can't we just use the existing devices section instead of adding a vtpm section? It seems to have everything we need already:

  • It has a path to use inside the container
  • major/minor
  • uid/gid
  • supports character devices

If the CUSE daemon is running on the host and the config is set with the right major/minor, then IIUC runc will do the mknod, the CUSE driver will do its thing to provision this, and it will just work.
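
For illustration, a minimal sketch of such a devices entry with the existing specs-go types (the uid/gid and major/minor values below are hypothetical):

    package main

    import (
        "fmt"
        "os"

        specs "github.com/opencontainers/runtime-spec/specs-go"
    )

    func main() {
        mode := os.FileMode(0o600)
        // Hypothetical tss uid/gid inside the container.
        uid, gid := uint32(59), uint32(59)

        // Entry in linux.devices: asks the runtime to create /dev/tpm0 inside
        // the container with the allocated major/minor and ownership.
        dev := specs.LinuxDevice{
            Path:     "/dev/tpm0",
            Type:     "c",
            Major:    100,
            Minor:    1,
            FileMode: &mode,
            UID:      &uid,
            GID:      &gid,
        }
        fmt.Printf("%+v\n", dev)
    }

A matching rule in linux.resources.devices (as sketched earlier in the thread) would then allow access in the device cgroup.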

Am I missing something?

If that works, we can later think a little bit about the vtpm state: how can we give the same state to this container when it runs on another node later? Can we do that? Or what would be the limitations? Is there something we need to take into account to "isolate" the vtpm of different containers?

But let's leave those questions for later

@everzakov
Author

  1. I'm not sure what is the point of a hostPath in the json if we will create it with mknod just using the major/minor. What am I missing?

Thank you for your questions.

The main requirement for the vTPM feature is to create several containers with independent TPM state under the same container device path (e.g. /dev/tpm0).
To do this, we need to create several generated devices and pass them under the required path. The main problem is that runc uses a bind mount instead of mknod when it is running in a non-default user namespace or when userns is in the namespaces config (https://github.com/opencontainers/runc/blob/main/libcontainer/rootfs_linux.go#L916).
In that situation, if the container path (e.g. /dev/tpm0) is passed in the devices config field, we get a Not Found error (the container path does not exist on the host).
If the hostPath (the generated device path) is passed in the devices config field, we can use the vTPM device (apart from the fact that we won't have the nice device name).

I think only at the runtime level can we be sure which mechanism runc will use.

@everzakov
Author

Also, I tried to mount the device and hit a "bug":

  1. If I understand correctly, when the tpm2-tools package is installed on the host,
    the 60-tpm-udev.rules file (https://packages.ubuntu.com/jammy/all/tpm-udev/filelist) is created.
    The rule (https://salsa.debian.org/debian/tpm-udev/-/blob/master/debian/tpm-udev.udev?ref_type=heads) sets the tss owner on device names matching tpm[0-9]*.
  2. If we mount the device into the container rootfs, the device will have the UID of the host tss user (not the container's).
  3. We get a Permission denied error if we try to use this device when the host and container tss users have different UIDs.

Also regarding the "bug" with udev rules when mount command is used. I think we can add a CreateContainerHook to chown the device with the necessary container tss user uid/guid.
We can't use StartContainerHook because the process user doesn't have enough permissions to change owner of a device.
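
A minimal sketch of what such a createContainer hook binary could look like (purely illustrative; the device path and uid/gid would be passed as hook arguments, and the exact path resolution depends on which mount namespace the hook runs in):

    package main

    import (
        "log"
        "os"
        "strconv"
    )

    // Usage as a hook: chown-vtpm <device-path> <uid> <gid>
    // The OCI state JSON arrives on stdin but is not needed for this case.
    func main() {
        if len(os.Args) != 4 {
            log.Fatalf("usage: %s <device-path> <uid> <gid>", os.Args[0])
        }
        uid, err := strconv.Atoi(os.Args[2])
        if err != nil {
            log.Fatalf("bad uid: %v", err)
        }
        gid, err := strconv.Atoi(os.Args[3])
        if err != nil {
            log.Fatalf("bad gid: %v", err)
        }
        // Chown the (bind-mounted) vTPM device node to the container's tss user.
        if err := os.Chown(os.Args[1], uid, gid); err != nil {
            log.Fatalf("chown %s: %v", os.Args[1], err)
        }
    }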

@rata
Member

rata commented Sep 29, 2025

@everzakov I'd say, let's forget the container hook for now. If we do use the devices array, we can specify the uid/gid there. If we do something else, we should probably give the option to specify the uid/gid, so I think we won't need that hook.

But right, we can't mknod inside a userns (that is something the kernel imposes). So this needs more thought on who will create the device and how. It can't be runc, but it has to happen somewhere in the picture. containerd/cri-o might be the right place, but it's not obvious how that should look either.

I think DRA doesn't seem the right place to model this, as this has "infinite" capacity and it can't model that. It's for finite, consumable devices.

@everzakov
Author

But right, we can't mknod inside a userns (that is something the kernel imposes). So this needs more thought on who will create the device and how. It can't be runc, but it has to happen somewhere in the picture. containerd/cri-o might be the right place, but it's not obvious how that should look either.

@rata Thank you for your comment.

Yes, an upper component can create the device path and pass it to the container. However, the main problem is that we need to be able to pass several "/dev/tpm0" devices with different TPM state at the same time.

I have created a runc branch with vTPM tests: https://github.com/everzakov/runc/blob/without-vtpm/tests/integration/vtpm.bats .
If you run them, you will see a failure with a simple custom user namespace. The problem is that CUSE creates the device with 0600 permissions (the udev rule sets 0660, https://salsa.debian.org/debian/tpm-udev/-/blob/master/debian/tpm-udev.udev?ref_type=heads). When the container has a user namespace in its config, the device's uid/gid is only changed to the container root (https://github.com/opencontainers/runc/blob/main/libcontainer/specconv/spec_linux.go#L1072); the default allowed devices have 0666 permissions, which is why a user can access them in the new user namespace (https://github.com/opencontainers/runc/blob/main/libcontainer/specconv/spec_linux.go#L225). So to make this test pass, the device's permissions would have to be changed to 0666, which I don't think is the right option when operating a TPM device.

The branch https://github.com/everzakov/runc/tree/vtpm-simple has code changes to pass these tests. There is a problem that we change the user/group of the host device (because the container device is bind-mounted). However, I don't think this is a big problem, because the device is generated per pod, and containers in the pod have the same uid/gid mappings in the hostUsers: false scenario.
