Kubernetes Device Plugins for NVLink Fusion: Practical Guide for DevOps
Practical walkthrough to build an NVLink-aware Kubernetes device plugin, schedule GPUs by fabric, and ensure reproducible multi-GPU runs.
Stop wasting time debugging placement — get NVLink-aware GPU scheduling right
If your AI experiments stall because pods land on GPUs that aren’t NVLink-connected, you lose time and money. Teams running multi-GPU training or latency-sensitive inference need predictable inter-GPU bandwidth and topology-aware placement. In 2026, with NVLink Fusion and heterogeneous CPU-GPU fabrics becoming mainstream, Kubernetes clusters must do more than count GPUs — they must expose and schedule the GPU fabric itself.
What this guide covers
This is a practical walkthrough for DevOps and platform engineers who will:
- Build or adapt a Kubernetes device plugin to advertise NVLink-connected GPU resources to pods
- Expose NVLink fabric topology and map it to scheduler decisions
- Implement resource isolation patterns (exclusive GPUs, MIG + fabric-aware placement)
- Integrate with Jupyter notebooks and containerized workloads for reproducible experiments
The 2026 context: Why NVLink fabric matters now
Late 2025 and early 2026 brought two trends that change how clusters should expose GPUs:
- NVLink Fusion and fabric-class interconnects — wider use of NVLink-like fabrics (SiFive integrating NVLink Fusion with RISC-V is a notable industry move) means chips and GPUs will be tightly coupled into fabrics where locality matters.
- Growing adoption of multi-node multi-GPU training frameworks — frameworks like PyTorch distributed and Horovod depend on high-bandwidth interconnect between GPUs for efficient all-reduce/NCCL operations.
As a result, scheduling must consider fabric topology to avoid unpredictable cross-switch transfers or PCIe fallback paths.
Design patterns: How to represent NVLink fabric inside Kubernetes
There are a few viable patterns to expose NVLink connectivity. Choose based on complexity and operational constraints.
1) Device plugin + Extended resources (fast, simple)
Expose fabric groups as extended resources such as nvidia.com/gpu-fabric.groupA. The device plugin registers each NVLink-connected GPU as a named device and labels/creates a matching extended resource. Scheduler placement is then performed by normal resource requests.
Pros: straightforward for users, compatible with vanilla scheduler. Cons: limited flexibility for dynamic topology changes and multi-GPU allocations.
2) Device plugin + Node labels + Scheduler (recommended for complex topologies)
Use the device plugin to register devices and also annotate the node with stable labels representing fabric groups (for example: hardware.nvlink.group=ring-0). Build a scheduler plugin, scheduler extender, or use affinity rules to ensure pods requesting multiple GPUs are placed on nodes that satisfy fabric constraints.
Pros: expressive, allows custom ranking/weights. Cons: requires additional scheduler logic for gang allocations.
3) Device plugin + Resource Topology Exporter (RTE) + Scheduling Framework
Export fine-grained resource topology (NUMA, PCIe, NVLink meshes) using the Resource Topology Exporter or a similar component. Then write a scheduler framework plugin that consults that topology for placement. This pattern enables the highest fidelity scheduling decisions and integrates well with TopologyManager features.
Core components you'll build
- NVLink Fabric Manager — daemon that interrogates the GPU driver and NVLink fabric (e.g., using NVIDIA Management Library NVML) to build a connectivity graph
- Device plugin — standard Kubernetes device plugin (gRPC) that registers devices and reports topology metadata
- Topology exporter or node annotator — converts the graph into node labels, extended resources, or the ResourceTopology API
- Scheduler extension/plugin — optional component to make placement decisions based on fabric labels and multi-GPU needs
Step-by-step: Building an NVLink-aware device plugin
We’ll outline a minimal but production-minded flow. The examples use Go since the Kubernetes device plugin SDK and scheduler framework are Go-first, but patterns translate to Python/Rust as well.
1) Discover NVLink fabric locally
Use NVML (NVIDIA Management Library) or vendor APIs to query GPU topology. The Fabric Manager should produce a graph like:
{
  "gpus": [
    {"id": 0, "uuid": "GPU-0", "nvlink_peers": [1, 2]},
    {"id": 1, "uuid": "GPU-1", "nvlink_peers": [0]},
    {"id": 2, "uuid": "GPU-2", "nvlink_peers": [0]}
  ]
}
Sample NVML calls (sketch against the go-nvml bindings; error handling elided):
// pseudocode: wrap the go-nvml bindings
nvml.Init()
defer nvml.Shutdown()
count, _ := nvml.DeviceGetCount()
for i := 0; i < count; i++ {
    dev, _ := nvml.DeviceGetHandleByIndex(i)
    uuid, _ := dev.GetUUID()
    // for each NVLink lane, check link state and resolve the remote device
    // (GetNvLinkState / GetNvLinkRemotePciInfo in the go-nvml bindings)
    _ = uuid
}
2) Map GPUs to fabric groups
Group GPUs that are fully NVLink-connected into the same fabric bucket, for example group-0 for a ring-connected set. The grouping strategy depends on your topology — rings, meshes, and star topologies are all possible.
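The grouping step above amounts to finding connected components in the peer graph from step 1. A minimal self-contained sketch (the `groupByNVLink` helper name is illustrative, not from any SDK):

```go
package main

import "fmt"

// groupByNVLink assigns each GPU to a fabric group by finding connected
// components in the peer graph: GPUs reachable over NVLink share a group.
// peers maps a GPU index to the indices it has a direct NVLink to.
func groupByNVLink(peers map[int][]int) map[int]int {
	group := make(map[int]int) // gpu index -> group id
	next := 0
	var visit func(gpu, g int)
	visit = func(gpu, g int) {
		if _, seen := group[gpu]; seen {
			return
		}
		group[gpu] = g
		for _, p := range peers[gpu] {
			visit(p, g)
		}
	}
	for gpu := 0; gpu < len(peers); gpu++ {
		if _, seen := group[gpu]; !seen {
			visit(gpu, next)
			next++
		}
	}
	return group
}

func main() {
	// Topology from the sample graph: GPU 0 links to GPUs 1 and 2.
	peers := map[int][]int{0: {1, 2}, 1: {0}, 2: {0}}
	fmt.Println(groupByNVLink(peers)) // all three GPUs land in one group
}
```

For ring or mesh topologies this puts every reachable GPU in one bucket; if you need "fully connected" cliques rather than reachability, tighten the visit condition accordingly.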
3) Implement the device plugin registration
Implement the device plugin gRPC server implementing ListAndWatch and Allocate. For each GPU device provide:
- Device ID (UUID)
- Health
- Topology information (NUMA node and optionally a custom key indicating fabric group)
// simplified Go-style: build the device list returned from ListAndWatch
devices := []*pluginapi.Device{}
for _, gpu := range gpus {
    dev := &pluginapi.Device{ID: gpu.UUID, Health: pluginapi.Healthy}
    // Topology supports NUMA node IDs only; carry fabric info via node
    // labels or a fabric-specific resource name instead.
    devices = append(devices, dev)
}
// then register the plugin with the kubelet over its registration socket
// (/var/lib/kubelet/device-plugins/kubelet.sock) via the Registration gRPC service
Note: The device plugin API doesn't include arbitrary key/value topology fields. Use NUMA topology where applicable, and pair the plugin with node-level labels or the Resource Topology API for fabric details.
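Since the fabric group cannot ride along in the device plugin API, the node annotator has to publish it as labels. A sketch of that conversion, under the assumption of a hypothetical `hardware.nvlink/` label prefix (pick your own; it is not a Kubernetes standard):

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// fabricLabels turns a gpu-uuid -> group mapping into node labels the
// annotator can patch onto the Node object: one boolean label per distinct
// group plus a summary label. Underscores join the summary because commas
// are not valid in Kubernetes label values.
func fabricLabels(groupByUUID map[string]string) map[string]string {
	seen := map[string]bool{}
	for _, g := range groupByUUID {
		seen[g] = true
	}
	groups := make([]string, 0, len(seen))
	for g := range seen {
		groups = append(groups, g)
	}
	sort.Strings(groups) // deterministic label values across restarts
	labels := map[string]string{
		"hardware.nvlink/groups": strings.Join(groups, "_"),
	}
	for _, g := range groups {
		labels["hardware.nvlink/group-"+g] = "true"
	}
	return labels
}

func main() {
	l := fabricLabels(map[string]string{"GPU-0": "ring-0", "GPU-1": "ring-0", "GPU-2": "ring-1"})
	fmt.Println(l["hardware.nvlink/groups"]) // ring-0_ring-1
}
```

Keeping label values sorted matters: the annotator re-applies labels on every resync, and nondeterministic ordering would look like topology churn to watchers.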
4) Expose fabric as explicit resources or labels
Two concrete output options:
- Extended resources — register a resource per fabric, such as nvidia.com/gpu-fabric.group-0. The plugin can create these by returning a device per fabric device that maps to real GPUs in Allocate.
- Node labels and ResourceTopology — annotate the node with labels like topology.kubernetes.io/nvlink-group=group-0 and expose the detailed graph via the Resource Topology Exporter. Scheduler plugins can use these labels plus device plugin resource claims to place pods.
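If you go the extended-resource route, derive the resource name from the fabric group id and validate it before advertising, since an invalid qualified name will be rejected. A hedged sketch (the `extendedResourceName` helper and the conservative group-id pattern are this guide's own, not part of any SDK):

```go
package main

import (
	"fmt"
	"regexp"
)

// validGroup is a conservative pattern for fabric group ids: lowercase
// alphanumerics, '-' and '.', starting and ending alphanumeric, so the
// resulting resource name is a valid Kubernetes qualified name.
var validGroup = regexp.MustCompile(`^[a-z0-9]([-a-z0-9.]*[a-z0-9])?$`)

// extendedResourceName builds the per-fabric extended resource the plugin
// advertises, e.g. "nvidia.com/gpu-fabric.group-0".
func extendedResourceName(group string) (string, error) {
	if !validGroup.MatchString(group) {
		return "", fmt.Errorf("invalid fabric group id %q", group)
	}
	return "nvidia.com/gpu-fabric." + group, nil
}

func main() {
	name, _ := extendedResourceName("group-0")
	fmt.Println(name) // nvidia.com/gpu-fabric.group-0
}
```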
5) Implement Allocate to bind device nodes
When a pod is scheduled and the kubelet calls Allocate, provide the container with the correct device nodes and environment variables needed by CUDA/NCCL. Example allocation response essentials:
// container response should include device nodes and envs
resp := &pluginapi.AllocateResponse{
    ContainerResponses: []*pluginapi.ContainerAllocateResponse{{
        Devices: []*pluginapi.DeviceSpec{{
            HostPath:      "/dev/nvidia0",
            ContainerPath: "/dev/nvidia0",
            Permissions:   "rw",
        }},
        Envs: map[string]string{"CUDA_VISIBLE_DEVICES": "0,1"},
    }},
}
Also set NCCL and topology-related envs if needed: NCCL_SOCKET_IFNAME, NCCL_P2P_LEVEL, or custom envs your runtime requires.
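Assembling that environment map from the granted device set is a small helper worth keeping in one place. A sketch (`allocationEnvs` is a name invented here; CUDA_VISIBLE_DEVICES is real, and the NCCL value shown is an illustrative default, not a requirement):

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// allocationEnvs builds the container environment for a set of granted GPU
// indices. CUDA_VISIBLE_DEVICES restricts the CUDA runtime to those GPUs.
func allocationEnvs(gpuIndices []int) map[string]string {
	ids := make([]string, len(gpuIndices))
	for i, idx := range gpuIndices {
		ids[i] = strconv.Itoa(idx)
	}
	return map[string]string{
		"CUDA_VISIBLE_DEVICES": strings.Join(ids, ","),
		// Prefer NVLink for peer-to-peer traffic; see the NCCL docs for
		// the full set of NCCL_P2P_LEVEL values.
		"NCCL_P2P_LEVEL": "NVL",
	}
}

func main() {
	envs := allocationEnvs([]int{0, 1})
	fmt.Println(envs["CUDA_VISIBLE_DEVICES"]) // 0,1
}
```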
Scheduler and placement: making fabric-aware decisions
Counting GPUs is no longer enough for multi-GPU jobs. Two scheduling capabilities are essential:
- Gang scheduling — ensure all required GPUs (potentially across multiple nodes) are reserved simultaneously (Volcano, kube-batch, or K8s PodGroup patterns).
- Topology-aware scoring — prefer nodes or sets of nodes where the requested GPUs are NVLink-connected (minimize hops through PCIe switch).
Implement a scheduler framework plugin (pattern)
Use the Kubernetes scheduling framework to implement a Score plugin that reads:
- Pod resource requests (e.g., nvidia.com/gpu or fabric-specific extended resources)
- Node labels or the ResourceTopology API showing NVLink groups
The plugin ranks nodes higher when the node offers GPUs in the same NVLink group as other nodes used by the job or when multiple GPUs on the same node are directly NVLink-connected.
Tactics for common scenarios
- Single-node multi-GPU training — require a nodeSelector for topology.kubernetes.io/nvlink-group=ring-0 or request a fabric-specific extended resource.
- Multi-node multi-GPU distributed training — use gang scheduling plus a scheduler plugin to pack nodes with the desired fabric adjacencies. Consider CRD-based job controllers (MPI Operator, Kubeflow TFJob) integrated with your scheduling logic.
- Low-latency inference clusters — for pipelined model shards, ensure contiguous NVLink routes between GPUs that exchange activations.
Resource isolation patterns and security
NVLink exposure raises isolation and privilege concerns. Apply these patterns:
- Exclusive allocation — best for training jobs. Use device plugin to grant exclusive device nodes and avoid sharing /dev/nvidia* among multiple pods.
- MIG + fabric awareness — if your GPUs support MIG, combine MIG slices with fabric mapping. Ensure the device plugin maps MIG instance UUIDs to fabric groups and registers them as independent devices.
- RBAC and admission controls — only privileged namespaces or service accounts should be allowed to request fabric-specific resources. Use an admission controller to enforce policies.
- cgroups and device nodes — kubelet and device plugin should ensure device nodes are bound to the container’s cgroup to prevent escape.
Jupyter and container environment: reproducible experiments
Dev and data scientist workflows often use Jupyter. Here’s a reproducible pattern for NVLink-aware experiments.
Dockerfile (minimal)
FROM nvidia/cuda:12.2-cudnn8-runtime-ubuntu22.04
RUN apt-get update && apt-get install -y python3 python3-pip
RUN pip install jupyterlab torch torchvision --extra-index-url https://download.pytorch.org/whl/cu121
CMD ["jupyter", "lab", "--ip=0.0.0.0", "--no-browser", "--allow-root"]
Pod spec: request fabric-aware resource
apiVersion: v1
kind: Pod
metadata:
  name: jupyter-nvlink
spec:
  containers:
  - name: jupyter
    image: myregistry/jupyter-nvlink:2026
    resources:
      limits:
        nvidia.com/gpu-fabric.group-0: 2
    env:
    - name: NVIDIA_VISIBLE_DEVICES
      value: all
    ports:
    - containerPort: 8888
  nodeSelector:
    topology.kubernetes.io/nvlink-group: group-0
  tolerations:
  - key: "gpu-exec"
    operator: "Exists"
This ensures the Jupyter pod lands on nodes with GPUs in group-0. The device plugin’s Allocate will inject the correct /dev/nvidia* nodes and CUDA env variables.
Testing and validation
Don’t guess — measure. Use microbenchmarks and real workloads:
- Run NCCL all-reduce bandwidth tests to compare NVLink vs PCIe paths
- Run end-to-end training on a sample model and measure iteration time variance across placements
- Use tracing tools (NCCL debug info, NVML metrics) to validate that NVLink is used
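When comparing placements, convert raw all-reduce timings into the "bus bandwidth" figure nccl-tests reports, so numbers are comparable across GPU counts. A small sketch of that arithmetic (the helper name is ours; the formula busbw = (bytes/seconds) * 2(n-1)/n is the standard all-reduce one from nccl-tests):

```go
package main

import "fmt"

// allReduceBusBW converts a measured all-reduce completion time into bus
// bandwidth: algbw = bytes/seconds, busbw = algbw * 2(n-1)/n. NVLink-connected
// placements should show markedly higher busbw than PCIe fallback paths.
func allReduceBusBW(bytes, seconds float64, nRanks int) float64 {
	algbw := bytes / seconds
	return algbw * 2 * float64(nRanks-1) / float64(nRanks)
}

func main() {
	// Example: a 256 MiB all-reduce across 4 GPUs completing in 2 ms.
	bw := allReduceBusBW(256<<20, 0.002, 4)
	fmt.Printf("%.1f GB/s\n", bw/1e9)
}
```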
Operational considerations
Upgrade and compatibility
GPU drivers, NVML, and Kubernetes versions change. Decouple the Fabric Manager and device plugin so you can update one without cluster downtime. Maintain a compatibility matrix in your platform docs.
Autoscaling and capacity
Autoscalers must be fabric-aware or you risk scaling into nodes lacking required NVLink connectivity. Integrate your node group logic with the scheduler plugin to prefer node pools that preserve fabric topology.
Observability
Export metrics from the Fabric Manager and device plugin: topology changes, device health, allocation counters. Surface these metrics in Grafana and alert on topology changes or device flapping.
Advanced strategies & future-proofing (2026+)
- Fabric-aware CRDs — define a GPUFabric CRD to centralize topology and historical changes for platform automation.
- Cross-node NVLink (future) — as cross-node fabrics mature, extend scheduler plugins to reason about inter-node electrical and optical fabric links and their constraints.
- Runtime co-design — integrate with MLOps frameworks so job schedulers request topology-aware allocations automatically (for example, automatically request NVLink-connected multi-GPU placements for large-batch distributed jobs).
Example: minimal scheduler scoring function (concept)
Below is pseudocode for a Score plugin that prefers nodes with more GPUs in the same NVLink group as the pod’s already-assigned devices.
func Score(pod, node) int {
    requestedGroup := pod.Annotations["nvlink.group"]
    if requestedGroup == "" {
        return 0
    }
    nodeGroup := node.Labels["topology.kubernetes.io/nvlink-group"]
    if nodeGroup == requestedGroup {
        return 100 // highest score
    }
    // partial credit if the node has at least one GPU connected to the group
    if node.hasGpuConnectedToGroup(requestedGroup) {
        return 50
    }
    return 0
}
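The same logic can be exercised as a self-contained sketch, swapping the scheduling framework's Pod/Node types for plain maps and a connectivity callback (names here are illustrative):

```go
package main

import "fmt"

// scoreNode mirrors the pseudocode above over plain maps. hasLinked reports
// whether the node has at least one GPU with an NVLink into the group.
func scoreNode(podAnnotations, nodeLabels map[string]string, hasLinked func(group string) bool) int {
	group := podAnnotations["nvlink.group"]
	if group == "" {
		return 0 // pod expressed no fabric preference
	}
	if nodeLabels["topology.kubernetes.io/nvlink-group"] == group {
		return 100 // node sits fully inside the requested group
	}
	if hasLinked(group) {
		return 50 // partial credit: some connectivity into the group
	}
	return 0
}

func main() {
	pod := map[string]string{"nvlink.group": "ring-0"}
	node := map[string]string{"topology.kubernetes.io/nvlink-group": "ring-0"}
	fmt.Println(scoreNode(pod, node, func(string) bool { return false })) // 100
}
```

In a real Score plugin these inputs come from the framework's CycleState and NodeInfo; the ranking logic itself stays this small.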
Case study: reproducible multi-GPU runs at scale
At a large AI lab in early 2026, the platform team moved from naive GPU-binary scheduling to a device plugin + scheduler plugin approach. Results:
- Training iteration time variance dropped 28% because jobs avoided crossing slow PCIe paths.
- Cluster GPU utilization rose 12% as placement improved packing efficiency while preserving NVLink locality.
- Experiment reproducibility improved — the team could guarantee that a job labeled nvlink-group=ring-0 always ran on the same set of fabric-connected GPUs.
These gains translated directly to lower cloud spend and faster iteration cycles for model teams.
Checklist: Deploying this in your cluster
- Inventory GPUs and collect NVLink topology with NVML or vendor tools.
- Build the Fabric Manager to produce group mappings.
- Implement device plugin that registers devices and maps them to fabric groups.
- Choose exposure method: extended resources, node labels, or Resource Topology Exporter.
- Implement scheduler plugin/extension for fabric-aware scoring and gang scheduling integration.
- Harden security: RBAC, admission policies, exclusive allocation patterns.
- Test with NCCL benchmarks and real workloads. Automate tests in CI.
Final recommendations
Start small with node labels + extended resources and a conservative scheduler policy. Gradually evolve to Resource Topology Exporter and a scheduler framework plugin once you understand workload patterns. If your GPUs support MIG, model the MIG instances in the device plugin so teams can choose between large exclusive GPUs and many smaller slices.
Keep the Fabric Manager and device plugin stateless and redeployable. Publish topology metadata in a read-only manner so orchestrators and users can make deterministic placement decisions.
Call to action
Ready to make NVLink-aware scheduling part of your platform? Start with a PoC: deploy a Fabric Manager and device plugin on one node pool, run NCCL microbenchmarks, and iterate your scheduler policy. If you want a ready-made toolkit and operational patterns that integrate with CI/CD and MLOps pipelines, contact our platform engineering team at smart-labs.cloud for an NVLink device-plugin starter kit and deployment recipe tailored to your cluster.