
KEP-5027: DRA: admin-controlled device attributes #5034

Open
wants to merge 1 commit into base: master

Conversation

pohly
Contributor

@pohly pohly commented Jan 10, 2025

/cc @johnbelamaric

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Jan 10, 2025
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: pohly
Once this PR has been reviewed and has the lgtm label, please assign johnbelamaric for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory sig/node Categorizes an issue or PR as relevant to SIG Node. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Jan 10, 2025
@k8s-ci-robot
Contributor

@pohly: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name                  Commit   Details  Required  Rerun command
pull-enhancements-verify   531a905  link     true      /test pull-enhancements-verify

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@pohly
Contributor Author

pohly commented Jan 12, 2025

/cc @KobayashiD27

For the "device priority" use case.

/cc @byako

For device health.

@k8s-ci-robot k8s-ci-robot requested a review from byako January 12, 2025 12:31
@k8s-ci-robot
Contributor

@pohly: GitHub didn't allow me to request PR reviews from the following users: KobayashiD27.

Note that only kubernetes members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to this:

/cc @KobayashiD27

For the "device priority" use case.

/cc @byako

For device health.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Capacity map[QualifiedName]DeviceCapacity
}

// AttributeNamePriority is a standardized attribute name. Its value must be an integer.
Contributor Author

@pohly pohly Jan 12, 2025


Or this?

Suggested change
// AttributeNamePriority is a standardized attribute name. Its value must be an integer.
// AttributeNamePriority is an attribute name defined by Kubernetes. Its value must be an integer.

/cc @johnbelamaric

@pohly
Contributor Author

pohly commented Jan 13, 2025

/wg device-management
/sig node

@k8s-ci-robot k8s-ci-robot added the wg/device-management Categorizes an issue or PR as relevant to WG Device Management. label Jan 13, 2025
The scheduler must merge these additional attributes with the ones provided by
the DRA drivers. The "kubernetes.io/offline" string attribute contains a
free-form explanation why the device is not currently available. Such a device
must be ignored by the scheduler. The "kubernetes.io/priority" integer defines
Contributor Author


@eero-t asked in #5027 (comment):

How should an admin run test workload(s) on a device for which scheduling has been disabled (e.g. for a firmware upgrade), to know whether it can be enabled again (for production workloads)?

With node taints, one would use a taint toleration for this, but I don't see from the KEP description how a similar thing would be achieved for DRA devices.

This is indeed not possible as described here. How about making it configurable whether an offline device is used?

The "normal" DeviceClass that users should pick for production workloads could have a selector which excludes offline devices.

Then there is a second DeviceClass which doesn't exclude them. There's nothing that would prevent users from using that, but if they do, they do so at their own risk. This is on par with node taints.
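
A rough sketch of what those two DeviceClasses might look like (illustrative only; the apiVersion and the exact CEL are my assumptions, the kubernetes.io/offline attribute is the one proposed in this KEP):

apiVersion: resource.k8s.io/v1beta1
kind: DeviceClass
metadata:
  name: gpu-production            # the "normal" class for production workloads
spec:
  selectors:
  - cel:
      # Exclude any device that an admin or controller has marked offline.
      expression: '!("offline" in device.attributes["kubernetes.io"])'
---
apiVersion: resource.k8s.io/v1beta1
kind: DeviceClass
metadata:
  name: gpu-unfiltered            # no offline filter; whoever uses this does so at their own risk
spec: {}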


@eero-t eero-t Jan 13, 2025


Matching all offline devices is not enough, as there can be multiple reasons for being offline, e.g. health and administration => the selection would need to be specific to a given offline reason, and not match if there are other reasons.

(With taints, one could use e.g. a fw-upgrade taint and its toleration. While one could still taint whole nodes, that could be rather disruptive, whereas by offlining devices one-by-one, upgrades would cause only slight service degradation while they are being performed / tested / verified.)

Contributor Author

@pohly pohly Jan 13, 2025


The admin can create a custom DeviceClass with a selector which matches exactly the reason they chose when taking the device offline. The ResourceSliceOverride then has kubernetes.io/offline: fw-upgrade and the workload's DeviceClass has device.attributes["kubernetes.io"].offline == "fw-upgrade".
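
A rough sketch of that combination (the ResourceSliceOverride fields are a guess based on this discussion, and the apiVersion values are placeholders):

apiVersion: resource.k8s.io/v1alpha1        # hypothetical
kind: ResourceSliceOverride
metadata:
  name: gpu-0-fw-upgrade
spec:
  # (a selector identifying the affected device(s) would also be needed; omitted here)
  attributes:                               # assumed field name
    kubernetes.io/offline: fw-upgrade       # reason chosen by the admin
---
apiVersion: resource.k8s.io/v1beta1
kind: DeviceClass
metadata:
  name: gpu-fw-upgrade-test                 # custom class for the admin's test workloads
spec:
  selectors:
  - cel:
      expression: 'device.attributes["kubernetes.io"].offline == "fw-upgrade"'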

But that doesn't cover the case where a manually created ResourceSliceOverride contains such a kubernetes.io/offline: fw-upgrade and another, automatically created one has kubernetes.io/offline: unhealthy. The admin can make sure that "its" value wins via resourcesliceoverride.spec.rank, but then the kubernetes.io/offline: unhealthy gets lost.

We could specify a different merging strategy for this well-known attribute: instead of keeping exactly one entry, the different instances could be numbered, leading to kubernetes.io/offline: fw-upgrade; kubernetes.io/offline-1: unhealthy. The CEL expressions become a bit more complex, but it would work.

Yet another alternative is to extend the CEL environment so that device.attributes["kubernetes.io"].offline is a list of strings. This might be better than the numbered-name "hack".
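
For illustration, the check in a DeviceClass selector might then look roughly like this under the two alternatives (both expressions are hypothetical):

Numbered-attribute variant:
  device.attributes["kubernetes.io"].offline == "fw-upgrade" ||
  device.attributes["kubernetes.io"]["offline-1"] == "fw-upgrade"

List-of-strings variant:
  "fw-upgrade" in device.attributes["kubernetes.io"].offline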

Contributor

@everpeace everpeace Jan 14, 2025


Thanks for the KEP. And sorry to cut into this discussion. I'm curious about the use case of exposing the device health via ResourceSliceOverride.

For this use case, are there any ideas on how online/offline status changes could affect running workloads? It might be useful for users to introduce a device-level toleration for more flexible control?

Below is an imaginary spec; I know this is just a rough suggestion (it definitely needs deeper consideration):

apiVersion: resource.k8s.io/v1alpha2
kind: ResourceClaim 
spec:
  devices:
    requests:
    - name: gpu
      deviceClassName: gpu.nvidia.com
      # New field
      #   - It might be better to introduce taint information on the device side, too?
      #   - Should the toleration be defined on the ResourceClass side?
      tolerations:
      - cel:
          expression: 'device.attributes["kubernetes.io"].offline != ""'
        effect: NoExecute | NoSchedule
        tolerationSeconds: 30s # effective only with 'NoExecute'

Contributor Author


Reacting to offline on the node for a running workload is currently out-of-scope, but I can see how it would be useful to do something even if that means making the ResourceClaim API more complex. It also means that the kubelet needs to become aware of this because a controller cannot force containers to stop, can it?

We could start without it in 1.33, then add such an API in 1.34 (still as alpha!).

Contributor

@everpeace everpeace Jan 16, 2025


But devices also get deallocated when their consuming pod(s) are in a known final state where they won't run any containers anymore, so it's not necessary to fully remove a pod to reuse devices. The advantage would be that one can still retrieve logs or inspect the pod object to determine what it did (exit code, termination message).

Got it. That makes sense.

I just don't see how an external controller can force a pod into that state, so we would have to go the same route as node tainting.

I agree, because there is no such API (to force pods to reach a final state), as you stated.

I think we can define device.attributes["kubernetes.io"].offline != "" as a check that, if true, means that the device cannot and/or should not be used. With that definition, not scheduling and evicting running pods seem like the right default behavior if a ResourceClaim doesn't list tolerations.

Yeah. That's simpler than introducing taints.

Here is my pros/cons analysis:

  • Option 1: Unhealthiness via device attributes (e.g. kubernetes.io/offline):
    • Pros:
      • Simple (just adding a toleration in ResourceClaim)
    • Cons:
      • The user (ResourceClaim) has to be aware of which attributes define device unhealthiness in order to define their toleration. These should be documented in the DRA driver documentation.
  • Option 2: Unhealthiness via taints (in ResourceSlice (by the DRA driver) or ResourceSliceOverride (by an admin or external controller)):
    • Pros:
      • The user (ResourceClaim) can follow the standard taint/toleration semantics.
      • Taints can express the abnormal use case, i.e. the default behavior (in case there are no tolerations), via an effect (NoSchedule | NoExecute | PreferNoSchedule | etc.) for each failure/offline mode, for example (see the sketch after this list):
        • offline=suspect-failure:PreferNoSchedule
        • offline=investigation:NoSchedule
        • offline=fw-upgrade:NoSchedule and offline=fw-upgrade:NoExecute
        • offline=hardware-failure:NoSchedule and offline=hardware-failure:NoExecute
    • Cons:
      • Complex
      • Even in this case, the user (ResourceClaim) has to be aware of which taint keys/values are exposed by the DRA driver in order to define tolerations. That should be documented in the DRA driver documentation.
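
Purely as an illustration of Option 2 (the placement and field names are hypothetical; a real design would need its own KEP, as discussed below), a device taint set by an admin might look like:

apiVersion: resource.k8s.io/v1alpha1        # hypothetical
kind: ResourceSliceOverride
metadata:
  name: gpu-0-fw-upgrade
spec:
  taints:                                   # hypothetical field
  - key: offline
    value: fw-upgrade
    effect: NoExecute                       # evict running pods that don't tolerate it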

Hmm. It's actually difficult for me to choose which is better. I personally lean a little bit towards the standard taint/toleration. But I worry about the API complexity.

WDYT??

Contributor Author


I had already put something into the KEP under alternatives about using fields instead of pre-defined attribute names. The original argument was that it would be a small step from having this override mechanism to standardizing some attributes for specific purposes.

But that argument is starting to break down: for kubernetes.io/offline we would already need special merging that combines all values in a list of strings, and one half of the API would be a pre-defined attribute name while the other half would be fields (ResourceClaim.Spec.Tolerations).

I'm leaning towards dropping kubernetes.io/offline and replacing it with a "proper" API. It still fits into this KEP because it relies on ResourceSliceOverride.

Contributor

@everpeace everpeace Jan 16, 2025


I had already put something into the KEP under alternatives about using fields instead of pre-defined attribute names.

Oh, thanks for the clarification. I found it in the Alternatives section 🙇

I'm leaning towards dropping kubernetes.io/offline and replacing it with a "proper" API.

👍

It still fits into this KEP because it relies on ResourceSliceOverride.

Honestly, I think it depends on what kind of "proper" API design will be agreed on. IF the agreed API WERE taint/toleration in ResourceSlice(Override)/ResourceClaim, then I would prefer to do this in a separate KEP to isolate the intention, even though it relies on ResourceSliceOverride, because taint/toleration can work without ResourceSliceOverride. WDYT??

Contributor Author


You may be right. But it's tricky to have two separate KEPs in-flight at the same time. What if someone disables DRAAdminControlledDeviceAttributes (this KEP) and enables DRADeviceTaints (some new KEP)? Is that a valid setup? Perhaps... the device taint could be stored in the ResourceSlice, just not in the ResourceSliceOverride, because that's disabled.

Okay, let me try two different KEPs.

Contributor Author


I think I'll leave out kubernetes.io/priority. The same arguments against having device health with taints here apply to it, too (separate feature!), and with the increased complexity of device taints I don't want to bite off more than I (and my reviewers) can chew in this release cycle.

Comment on lines +863 to +868
Instead of ResourceSliceOverride as a separate type, new fields in the
ResourceSlice status could be modified by an admin. That has the problem that
the ResourceSlice object might get deleted while doing cluster maintenance like
a driver update, in which case the admin intent would get lost. A driver would
not be able to publish a new ResourceSlice where a device is immediately marked
as offline because creating a ResourceSlice strips the status.
Contributor

@everpeace everpeace Jan 16, 2025


JFYI:

@johnbelamaric suggested another alternative, namely a drop-in file style to override/extend device attributes, here:

I also think it could be useful for the driver (actually, the base driver framework that we would prefer all drivers to use) to have a hook to allow VM architects to augment the device attributes published by the driver.

For example, dropping a file on the node that can tell you which external network each NIC is plumbed to.

Patrick's KEP gives the cluster admin an opportunity to enhance attributes. That could be sufficient to do what I am saying. But it may also be helpful to have an on-node way of doing this.
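
For concreteness, such a drop-in file might look something like this (entirely hypothetical; no path, format, or schema is defined anywhere, this only illustrates the idea):

# /etc/example-dra-driver/attributes.d/nic-overrides.yaml (hypothetical)
devices:
- name: nic-0
  attributes:
    example.com/external-network: fabric-a    # which external network this NIC is plumbed to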

Contributor Author


If DRA driver authors want to support a way of doing this, then they certainly can. But I don't think we as Kubernetes should standardize and require supporting such a feature. If we want to offer a common API, then this KEP looks like a better approach to me, in particular because accessing the apiserver is easier than creating files on nodes...

Contributor


If we want to offer a common API, then this KEP looks like a better approach to me, in particular because accessing the apiserver is easier than creating files on nodes...

I also support the ResourceSliceOverride approach. Although drop-in files might fit node-local devices, DRA can now support broader device models, e.g. non-node-local devices (i.e. fabric-attached devices).

Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory sig/node Categorizes an issue or PR as relevant to SIG Node. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. wg/device-management Categorizes an issue or PR as relevant to WG Device Management.

4 participants