Resource Sizing

This page collects the current per-component memory and CPU recommendations for a Defakto deployment on Kubernetes. The numbers below come from a combination of internal testing and production metrics.

Treat these values as starting points, not targets

Resource needs depend on the underlying platform (kernel version, page size, Go runtime version), workload count, attestation frequency, SVID rotation rate, and per-node pod density. Use these values as a starting point, then adjust based on the metrics in Kubernetes Platform Monitoring and the per-component guides linked below.

Trust Domain Server

Based on internal testing, the following requests and limits are a reasonable starting point. Add these values to your Trust Domain Server Helm values.yaml — see Deploy Trust Domain Servers for the full file structure.

trustDomainDeployment:
  deployment:
    resources:
      requests:
        memory: "256Mi"
        cpu: "250m"
      limits:
        memory: "512Mi"
        cpu: "1000m"

Server memory usage scales with the number of connected agents. For autoscaling guidance, see Trust Domain Server Metrics — Resource Management and Autoscaling.
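
As a rough illustration only (not taken from that guide), memory-based autoscaling with a standard autoscaling/v2 HorizontalPodAutoscaler could look like the sketch below; the Deployment name, namespace, and replica bounds are placeholders to replace with your own values.

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: trust-domain-server            # placeholder name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: trust-domain-server          # placeholder; match the Deployment your release creates
  minReplicas: 2
  maxReplicas: 6
  metrics:
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 70       # target average memory utilization, measured against requests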

Agent

These initial settings are based on real-world operations; adjust them to match the number of nodes and pods in your cluster and the capabilities of the underlying hardware.

agent:
  resources:
    requests:
      memory: "64Mi"
      cpu: "25m"
    limits:
      memory: "256Mi"
      cpu: "125m"

CSI Driver

These resource settings work across a broad range of workloads and scaling characteristics; adjust them as needed based on your own metrics.

csiDriver:
  resources:
    requests:
      cpu: "50m"
      memory: "64Mi"
    limits:
      cpu: "150m"
      memory: "196Mi"

The fix in CSI driver 0.2.11 reduces peak memory in the mount-info parsing path that the driver hits during kubelet's periodic per-volume health checks. Use a spirl-system release that bundles CSI driver 0.2.11 or later. If your spirl-system chart's bundled CSI driver is older than 0.2.11, set images.csiDriver.tag: "0.2.11" in your Helm values until you upgrade to a release that bundles the patched driver by default.
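
For example, while your bundled driver is older than 0.2.11, the override in your spirl-system Helm values looks like:

images:
  csiDriver:
    tag: "0.2.11"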

Reflector

If you are running the Reflector, use the following formula as a starting point for memory sizing:

Memory (MiB) = 50 + (0.5 × Agents) + (0.004 × DistinctSVIDs)
  • 50 MiB — base memory plus offline-events buffer overhead.
  • 0.5 MiB per agent — gRPC connection overhead.
  • 0.004 MiB per SVID — ~4 KiB per distinct SVID.

Example: 50 agents + 100 SVIDs = 50 + 25 + 0.4 = 75.4 MiB.

The formula was calibrated against small-to-medium deployments. For clusters with several hundred agents or more, treat it as a lower bound and validate against observed memory usage; if you see numbers significantly higher than the formula predicts, contact Defakto Support.
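
To turn the estimate into requests and limits, round up and leave headroom for bursts. A minimal sketch for the 50-agent / 100-SVID example above (the reflector key path is illustrative; check your Reflector chart's values.yaml for the actual structure):

reflector:
  resources:
    requests:
      memory: "128Mi"   # 75.4 MiB estimate, rounded up
    limits:
      memory: "256Mi"   # headroom for bursts; validate against observed peaks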

Common pitfalls

  • No memory limit at all. The Helm charts ship without resource requests or limits set by default. In production, set both — without them, the Kubernetes scheduler can't make good placement decisions and pods become BestEffort QoS, which means they're killed first under node memory pressure.
  • Sizing only against steady-state memory. Several components (notably the CSI driver) have transient spikes during scale-out that can exceed steady state by 10× or more. Size against observed peaks, not averages.

Validating your sizing

After rolling out the recommended values, watch the following for one to two weeks before tightening or relaxing limits:

  1. Peak memory vs. limit. Track container_memory_working_set_bytes against kube_pod_container_resource_limits (resource="memory"). Aim for peaks below 70% of the limit during normal operation. PromQL examples are in Kubernetes Platform Monitoring, and a sketch of alert rules built on these metrics follows this list.
  2. OOM events. Track container_oom_events_total. Any non-zero count for a Defakto component means the limit is too low — increase it and continue monitoring. The runbooks (agent, server) cover the diagnostic flow.
  3. CPU throttling. Track container_cpu_cfs_throttled_seconds_total. Sustained throttling means the CPU limit is too low and is likely contributing to elevated request latency.
  4. Burst vs. steady state. During a node scale-out or autoscaling event, capture peak memory for the CSI driver and Reflector and compare against steady state. Components with large burst-to-steady ratios should be sized for the peak, not the average.
  5. After two weeks of clean numbers, you can tighten limits — but leave headroom so future spikes don't trigger an OOM. If you're unsure, leave the recommended values in place; the cost of over-provisioning is small relative to the cost of an OOM during a customer-visible event.
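
As a concrete starting point for steps 1-3, the following Prometheus alerting-rule sketch uses only the metrics named above. The namespace selector and thresholds are placeholders; adapt them to the namespace your charts deploy into and to your own tolerance for noise.

groups:
  - name: defakto-resource-sizing
    rules:
      - alert: MemoryNearLimit
        # Step 1: working set above 70% of the memory limit for 30 minutes.
        # namespace="spirl-system" is a placeholder selector.
        expr: |
          max by (namespace, pod, container) (container_memory_working_set_bytes{namespace="spirl-system",container!="",container!="POD"})
            / on (namespace, pod, container)
          max by (namespace, pod, container) (kube_pod_container_resource_limits{namespace="spirl-system",resource="memory"})
          > 0.70
        for: 30m
        labels:
          severity: warning
      - alert: OOMKilled
        # Step 2: any OOM event means the limit is too low.
        expr: increase(container_oom_events_total{namespace="spirl-system"}[1h]) > 0
        labels:
          severity: critical
      - alert: CPUThrottling
        # Step 3: sustained CFS throttling points at a CPU limit that is too low.
        expr: rate(container_cpu_cfs_throttled_seconds_total{namespace="spirl-system",container!=""}[15m]) > 0.1
        for: 1h
        labels:
          severity: warning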