Trust Domain Server Metrics

This guide covers metrics collection, configuration, and monitoring for Trust Domain Servers. Servers expose Prometheus-compatible metrics for SVID operations, attestations, gRPC performance, and resource utilization.

Enabling Metrics

Trust Domain Servers expose metrics on a configurable port (default: 9090) via a /metrics endpoint.

Enable metrics in your Helm chart values file:

trust-domain-values.yaml
telemetry:
  enabled: true
  collectors:
    grpc:
      emitLatencyMetrics: true # Optional: enables gRPC latency histograms (produces ~500 additional metric series per instance)
  metricsAPI:
    port: 9090

Apply the configuration:

helm upgrade --install <trust-domain-name> \
  oci://ghcr.io/spirl/charts/spirl-server \
  --values trust-domain-values.yaml

See the Metrics Reference for the complete list of available metrics.

Verifying Metrics Endpoint

Test that metrics are accessible:

# Locate a server pod (replace <namespace> with your trust domain deployment namespace)
kubectl -n <namespace> get po -l app.kubernetes.io/name=spirl-server

# Port-forward to a server pod
kubectl port-forward -n <namespace> spirl-server-0 9090:9090

# In a separate shell, query the metrics endpoint
curl http://localhost:9090/metrics

Example output:

# HELP go_gc_duration_seconds A summary of the wall-time pause (stop-the-world) duration in garbage collection cycles.
# TYPE go_gc_duration_seconds summary
go_gc_duration_seconds{quantile="0"} 0.000582792
go_gc_duration_seconds{quantile="0.25"} 0.00085675
...
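
Once the endpoint responds, you can point Prometheus at it. The following is a minimal static scrape config sketch; the job name and target address are illustrative, and most clusters would use Kubernetes service discovery or a ServiceMonitor instead:

scrape_configs:
  - job_name: spirl-server # hypothetical job name
    static_configs:
      - targets: ["spirl-server.<namespace>.svc:9090"] # assumes an in-cluster Service on the default metrics port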

Key Metrics to Monitor

gRPC Performance

  • grpc_server_handled_total - Total gRPC requests completed
  • grpc_server_handling_seconds - Request latency histogram
  • grpc_server_started_total - Total gRPC requests started
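
In PromQL, these combine into standard rate and quantile queries; for example (assuming the usual go-grpc-prometheus labels such as grpc_method):

# Request rate per gRPC method over the last 5 minutes
sum by (grpc_method) (rate(grpc_server_handled_total[5m]))

# p99 request latency (requires emitLatencyMetrics: true for the histogram buckets)
histogram_quantile(0.99, sum by (le, grpc_method) (rate(grpc_server_handling_seconds_bucket[5m])))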

Resource Utilization

  • go_memstats_alloc_bytes - Current heap memory allocation
  • go_goroutines - Number of goroutines
  • process_cpu_seconds_total - Total user and system CPU time, in seconds
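
Because process_cpu_seconds_total is a counter, rate it before graphing or alerting; two illustrative queries (the job label is an assumption from your scrape config):

# Average CPU cores consumed over the last 5 minutes
rate(process_cpu_seconds_total{job="spirl-server"}[5m])

# Current heap allocation in bytes
go_memstats_alloc_bytes{job="spirl-server"}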

Kubernetes Runtime

See Kubernetes Metrics for guidance on monitoring the underlying Kubernetes runtime.

Resource Management and Autoscaling

Resource Requests and Limits

The Defakto Helm charts support configuring resource requests and limits for all server components. Setting these values is vital for:

  • Ensuring the Kubernetes scheduler places pods on nodes with sufficient capacity
  • Preventing resource contention with other workloads
  • Enabling accurate capacity planning and monitoring
  • Supporting Horizontal Pod Autoscaling (HPA)

Configuring Resources

By default, the Helm charts do not configure resource requests or limits. While this provides flexibility, setting explicit values is recommended for production deployments.

Usage Varies by Pattern

Resource requirements depend on your specific usage patterns, including attestation frequency, SVID rotation rates, and API request volume. Start with conservative estimates and adjust based on observed metrics. Additionally, ensure sufficient headroom to support failovers.

Set resource requests and limits in your Helm values file (values shown are for example purposes only):

trust-domain-values.yaml
trustDomainDeployment:
  deployment:
    resources:
      requests:
        cpu: "500m" # Request 0.5 CPU cores
        memory: "512Mi" # Request 512 MiB memory
      limits:
        cpu: "1000m" # Limit to 1 CPU core
        memory: "1Gi" # Limit to 1 GiB memory

Horizontal Pod Autoscaling (HPA)

Defakto Helm charts support configuring the Horizontal Pod Autoscaler for server components. HPA automatically adjusts the number of pod replicas based on observed metrics.

CPU-Based Scaling Recommended

Because Defakto components are written in Go, whose runtime manages memory with garbage collection, we recommend scaling on CPU usage only. Memory usage in Go applications tends to stay relatively flat and correlates less strongly with load than CPU does.

Enabling HPA

Configure HPA in your Helm values file:

trust-domain-values.yaml
trustDomainDeployment:
  deployment:
    hpa:
      enabled: true
      minReplicas: 2
      maxReplicas: 10
      targetCPUUtilizationPercentage: 70 # Scale when CPU > 70% of requested
      # behavior: # Optional: configure scale-up/scale-down behavior
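
With these values, the autoscaler keeps average CPU at or below 70% of each pod's CPU request, within the 2-10 replica range. The standard Kubernetes formula is desiredReplicas = ceil(currentReplicas × currentUtilization / targetUtilization); for example, 2 replicas running at 105% of their request scale to ceil(2 × 105 / 70) = 3.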

HPA Requirements

For HPA to function properly, you must:

  1. Set resource requests - HPA calculates utilization as a percentage of requested resources:

    resources:
      requests:
        cpu: "500m" # Required for CPU-based HPA
  2. Deploy metrics-server - HPA requires the metrics-server to retrieve pod resource usage:

    helm repo add metrics-server https://kubernetes-sigs.github.io/metrics-server/
    helm upgrade --install metrics-server metrics-server/metrics-server
    Tip: For additional configuration options, see the metrics-server Helm chart documentation.

  3. Verify metrics availability:

    kubectl top pods -n <namespace>

Monitoring HPA Behavior

Check HPA status:

# View HPA status
kubectl get hpa -n <namespace>

# View detailed HPA status
kubectl describe hpa <hpa-name> -n <namespace>

# Monitor HPA events
kubectl get events -n <namespace> --field-selector involvedObject.kind=HorizontalPodAutoscaler

Example HPA status output:

NAME           REFERENCE                 TARGETS   MINPODS   MAXPODS   REPLICAS   AGE
spirl-server   Deployment/spirl-server   45%/70%   2         10        2          5d

The TARGETS column shows current vs target utilization (45% current, 70% target).

HPA Prometheus Metrics

Monitor HPA decisions with Prometheus (requires kube-state-metrics):

# Current replica count (actual running pods)
kube_horizontalpodautoscaler_status_current_replicas{namespace="<namespace>"}

# Desired replica count (what HPA wants)
kube_horizontalpodautoscaler_status_desired_replicas{namespace="<namespace>"}

# Maximum replica limit
kube_horizontalpodautoscaler_spec_max_replicas{namespace="<namespace>"}

# Check if HPA is scaling (current != desired)
kube_horizontalpodautoscaler_status_desired_replicas{namespace="<namespace>"} != kube_horizontalpodautoscaler_status_current_replicas{namespace="<namespace>"}

# Alert: HPA at max capacity (cannot scale further)
kube_horizontalpodautoscaler_status_current_replicas{namespace="<namespace>"} >= kube_horizontalpodautoscaler_spec_max_replicas{namespace="<namespace>"}

# HPA target metrics (e.g., CPU utilization target)
kube_horizontalpodautoscaler_status_target_metric{namespace="<namespace>"}

Out of Autoscaling Capacity

If kube_horizontalpodautoscaler_status_current_replicas equals kube_horizontalpodautoscaler_spec_max_replicas, the HPA has reached its maximum scaling limit and cannot add more replicas, even if load continues to increase.
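
To turn the at-max-capacity check into a standing alert, here is a minimal Prometheus alerting rule sketch; the group name, for: duration, and severity label are illustrative:

groups:
  - name: hpa-capacity # hypothetical rule group
    rules:
      - alert: HPAAtMaxReplicas
        expr: kube_horizontalpodautoscaler_status_current_replicas >= kube_horizontalpodautoscaler_spec_max_replicas
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "HPA {{ $labels.horizontalpodautoscaler }} in {{ $labels.namespace }} has been at max replicas for 10 minutes"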

Resource Management Best Practices

  • Start conservatively - Begin with generous resource requests and monitor actual usage. Defakto runs single processes written in Go, so 0.5 cores and 128 MB of RAM are reasonable starting points.
  • Monitor utilization - Use Prometheus queries to track CPU/memory usage against requests/limits and check for throttling, which can impact service behavior (see the example queries after this list).
  • Avoid memory-based HPA - For Go applications, CPU is a more reliable scaling signal.
  • Set reasonable max replicas - Prevent runaway scaling that exhausts cluster capacity.
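
For example, assuming cAdvisor and kube-state-metrics metrics are available (standard in most Kubernetes monitoring stacks), queries along these lines track usage against requests and surface CPU throttling:

# CPU usage as a fraction of the CPU request, per pod
sum by (pod) (rate(container_cpu_usage_seconds_total{namespace="<namespace>"}[5m]))
  /
sum by (pod) (kube_pod_container_resource_requests{namespace="<namespace>", resource="cpu"})

# Fraction of CPU periods throttled (sustained values near 1 indicate heavy throttling)
rate(container_cpu_cfs_throttled_periods_total{namespace="<namespace>"}[5m])
  /
rate(container_cpu_cfs_periods_total{namespace="<namespace>"}[5m])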

Troubleshooting Server Metrics

Metrics Endpoint Not Accessible

Test the endpoint directly:

kubectl port-forward -n <namespace> <pod-name> 9090:9090
curl http://localhost:9090/metrics
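
If curl returns nothing, confirm that telemetry.enabled is true in the deployed values and that metricsAPI.port matches the port you are forwarding; for example (pod name as in the earlier examples):

# Confirm the telemetry settings actually deployed with the release
helm get values <trust-domain-name>

# Inspect the ports declared by the server pod's containers
kubectl -n <namespace> get pod spirl-server-0 -o jsonpath='{.spec.containers[*].ports}'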

Next Steps​