Trust Domain Server Metrics

This guide covers metrics collection, configuration, and monitoring for Trust Domain Servers. Servers expose Prometheus-compatible metrics for SVID operations, attestations, gRPC performance, and resource utilization.

Enabling Metrics

Trust Domain Servers expose metrics on a configurable port (default: 9090) via a /metrics endpoint.

Enable metrics in your Helm chart values file:

trust-domain-values.yaml
telemetry:
  enabled: true
  collectors:
    grpc:
      emitLatencyMetrics: true # Optional: enables gRPC latency histograms (produces ~500 additional metric series per instance)
  metricsAPI:
    port: 9090

Apply the configuration:

helm upgrade --install <trust-domain-name> \
  oci://ghcr.io/spirl/charts/spirl-server \
  --values trust-domain-values.yaml

See the Metrics Reference for the complete list of available metrics.

Verifying Metrics Endpoint

Test that metrics are accessible:

# Locate a server pod (replace <namespace> with your trust domain deployment namespace)
kubectl -n <namespace> get po -l app.kubernetes.io/name=spirl-server

# Port-forward to a server pod
kubectl port-forward -n <namespace> spirl-server-0 9090:9090

# In a separate shell, query the metrics endpoint
curl http://localhost:9090/metrics

Example output:

# HELP go_gc_duration_seconds A summary of the wall-time pause (stop-the-world) duration in garbage collection cycles.
# TYPE go_gc_duration_seconds summary
go_gc_duration_seconds{quantile="0"} 0.000582792
go_gc_duration_seconds{quantile="0.25"} 0.00085675
...
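
Once the endpoint responds, you can point Prometheus at it. The following is a minimal static scrape config sketch; the job name and target address are illustrative, and most clusters would use Kubernetes service discovery or a ServiceMonitor instead:

scrape_configs:
  - job_name: spirl-server # hypothetical job name
    static_configs:
      - targets: ["spirl-server.<namespace>.svc:9090"] # assumes an in-cluster Service on the default metrics port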

Key Metrics to Monitor

gRPC Performance

  • grpc_server_handled_total - Total gRPC requests completed
  • grpc_server_handling_seconds - Request latency histogram
  • grpc_server_started_total - Total gRPC requests started
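
In PromQL, these combine into standard rate and quantile queries; for example (assuming the usual go-grpc-prometheus labels such as grpc_method):

# Request rate per gRPC method over the last 5 minutes
sum by (grpc_method) (rate(grpc_server_handled_total[5m]))

# p99 request latency (requires emitLatencyMetrics: true for the histogram buckets)
histogram_quantile(0.99, sum by (le, grpc_method) (rate(grpc_server_handling_seconds_bucket[5m])))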

Resource Utilization

  • go_memstats_alloc_bytes - Current heap memory allocation
  • go_goroutines - Number of goroutines
  • process_cpu_seconds_total - Total user and system CPU time, in seconds
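
Because process_cpu_seconds_total is a counter, rate it before graphing or alerting; two illustrative queries (the job label is an assumption from your scrape config):

# Average CPU cores consumed over the last 5 minutes
rate(process_cpu_seconds_total{job="spirl-server"}[5m])

# Current heap allocation in bytes
go_memstats_alloc_bytes{job="spirl-server"}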

Kubernetes Runtime

See Kubernetes Metrics for guidance on monitoring the underlying Kubernetes runtime.

Resource Management and Autoscaling

Resource Requests and Limits

The Defakto Helm charts support configuring resource requests and limits for all server components. Setting these values is vital for:

  • Ensuring the Kubernetes scheduler places pods on nodes with sufficient capacity
  • Preventing resource contention with other workloads
  • Enabling accurate capacity planning and monitoring
  • Supporting Horizontal Pod Autoscaling (HPA)

Configuring Resources

By default, the Helm charts do not configure resource requests or limits. While this provides flexibility, setting explicit values is recommended for production deployments.

Usage Varies by Pattern

Resource requirements depend on your specific usage patterns, including attestation frequency, SVID rotation rates, and API request volume. Start with conservative estimates and adjust based on observed metrics. Additionally, ensure sufficient headroom to support failovers.

Set resource requests and limits in your Helm values file (values shown are for example purposes only):

trust-domain-values.yaml
trustDomainDeployment:
  deployment:
    resources:
      requests:
        cpu: "500m" # Request 0.5 CPU cores
        memory: "512Mi" # Request 512 MiB memory
      limits:
        cpu: "1000m" # Limit to 1 CPU core
        memory: "1Gi" # Limit to 1 GiB memory

Horizontal Pod Autoscaling (HPA)

Defakto Helm charts support configuring the Horizontal Pod Autoscaler for server components. HPA automatically adjusts the number of pod replicas based on observed metrics.

CPU-Based Scaling Recommended

Because Defakto components are written in Go, whose runtime manages memory with garbage collection, we recommend scaling on CPU usage only. Memory usage in Go applications tends to stay relatively flat and correlates less strongly with load than CPU does.

Enabling HPA

Configure HPA in your Helm values file:

trust-domain-values.yaml
trustDomainDeployment:
  deployment:
    hpa:
      enabled: true
      minReplicas: 2
      maxReplicas: 10
      targetCPUUtilizationPercentage: 70 # Scale when CPU > 70% of requested
      # behavior: # Optional: configure scale-up/scale-down behavior
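
With these values, the autoscaler keeps average CPU at or below 70% of each pod's CPU request, within the 2-10 replica range. The standard Kubernetes formula is desiredReplicas = ceil(currentReplicas × currentUtilization / targetUtilization); for example, 2 replicas running at 105% of their request scale to ceil(2 × 105 / 70) = 3.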

HPA Requirements

For HPA to function properly, you must:

  1. Set resource requests - HPA calculates utilization as a percentage of requested resources:

    resources:
      requests:
        cpu: "500m" # Required for CPU-based HPA
  2. Deploy metrics-server - HPA requires the metrics-server to retrieve pod resource usage:

    helm repo add metrics-server https://kubernetes-sigs.github.io/metrics-server/
    helm upgrade --install metrics-server metrics-server/metrics-server
    Tip: For additional configuration options, see the metrics-server Helm chart documentation.

  3. Verify metrics availability:

    kubectl top pods -n <namespace>

Monitoring HPA Behavior

Check HPA status:

# View HPA status
kubectl get hpa -n <namespace>

# View detailed HPA status
kubectl describe hpa <hpa-name> -n <namespace>

# Monitor HPA events
kubectl get events -n <namespace> --field-selector involvedObject.kind=HorizontalPodAutoscaler

Example HPA status output:

NAME           REFERENCE                 TARGETS   MINPODS   MAXPODS   REPLICAS   AGE
spirl-server   Deployment/spirl-server   45%/70%   2         10        2          5d

The TARGETS column shows current vs target utilization (45% current, 70% target).

HPA Prometheus Metrics

Monitor HPA decisions with Prometheus (requires kube-state-metrics):

# Current replica count (actual running pods)
kube_horizontalpodautoscaler_status_current_replicas{namespace="<namespace>"}

# Desired replica count (what HPA wants)
kube_horizontalpodautoscaler_status_desired_replicas{namespace="<namespace>"}

# Maximum replica limit
kube_horizontalpodautoscaler_spec_max_replicas{namespace="<namespace>"}

# Check if HPA is scaling (current != desired)
kube_horizontalpodautoscaler_status_desired_replicas{namespace="<namespace>"} != kube_horizontalpodautoscaler_status_current_replicas{namespace="<namespace>"}

# Alert: HPA at max capacity (cannot scale further)
kube_horizontalpodautoscaler_status_current_replicas{namespace="<namespace>"} >= kube_horizontalpodautoscaler_spec_max_replicas{namespace="<namespace>"}

# HPA target metrics (e.g., CPU utilization target)
kube_horizontalpodautoscaler_status_target_metric{namespace="<namespace>"}

Out of Autoscaling Capacity

If kube_horizontalpodautoscaler_status_current_replicas equals kube_horizontalpodautoscaler_spec_max_replicas, the HPA has reached its maximum scaling limit and cannot add more replicas, even if load continues to increase.
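
To turn the at-max-capacity check into a standing alert, here is a minimal Prometheus alerting rule sketch; the group name, for: duration, and severity label are illustrative:

groups:
  - name: hpa-capacity # hypothetical rule group
    rules:
      - alert: HPAAtMaxReplicas
        expr: kube_horizontalpodautoscaler_status_current_replicas >= kube_horizontalpodautoscaler_spec_max_replicas
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "HPA {{ $labels.horizontalpodautoscaler }} in {{ $labels.namespace }} has been at max replicas for 10 minutes"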

Resource Management Best Practices

  • Start conservatively - Begin with generous resource requests and monitor actual usage. Defakto runs single processes written in Go, so 0.5 cores and 128 MB of RAM are reasonable starting points.
  • Monitor utilization - Use Prometheus queries to track CPU/memory usage against requests/limits and check for throttling, which can impact service behavior (see the example queries after this list).
  • Avoid memory-based HPA - For Go applications, CPU is a more reliable scaling signal.
  • Set reasonable max replicas - Prevent runaway scaling that exhausts cluster capacity.
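
For example, assuming cAdvisor and kube-state-metrics metrics are available (standard in most Kubernetes monitoring stacks), queries along these lines track usage against requests and surface CPU throttling:

# CPU usage as a fraction of the CPU request, per pod
sum by (pod) (rate(container_cpu_usage_seconds_total{namespace="<namespace>"}[5m]))
  /
sum by (pod) (kube_pod_container_resource_requests{namespace="<namespace>", resource="cpu"})

# Fraction of CPU periods throttled (sustained values near 1 indicate heavy throttling)
rate(container_cpu_cfs_throttled_periods_total{namespace="<namespace>"}[5m])
  /
rate(container_cpu_cfs_periods_total{namespace="<namespace>"}[5m])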

Troubleshooting Server Metrics

Metrics Endpoint Not Accessible

Test the endpoint directly:

kubectl port-forward -n <namespace> <pod-name> 9090:9090
curl http://localhost:9090/metrics
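
If curl returns nothing, confirm that telemetry.enabled is true in the deployed values and that metricsAPI.port matches the port you are forwarding; for example (pod name as in the earlier examples):

# Confirm the telemetry settings actually deployed with the release
helm get values <trust-domain-name>

# Inspect the ports declared by the server pod's containers
kubectl -n <namespace> get pod spirl-server-0 -o jsonpath='{.spec.containers[*].ports}'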

Next Steps​