# Trust Domain Server Metrics
This guide covers metrics collection, configuration, and monitoring for Trust Domain Servers. Servers expose Prometheus-compatible metrics for SVID operations, attestations, gRPC performance, and resource utilization.
## Enabling Metrics

Trust Domain Servers expose metrics on a configurable port (default: 9090) via a `/metrics` endpoint.
Enable metrics in your Helm chart values file:
```yaml
telemetry:
  enabled: true
  collectors:
    grpc:
      emmitLatencyMetrics: true # Optional: enables gRPC latency histograms (produces ~500 additional metric series per instance)
  metricsAPI:
    port: 9090
```
Apply the configuration:
```shell
helm upgrade --install <trust-domain-name> \
  oci://ghcr.io/spirl/charts/spirl-server \
  --values trust-domain-values.yaml
```
See the Metrics Reference for the complete list of available metrics.
## Verifying the Metrics Endpoint
Test that metrics are accessible:
```shell
# Locate a server pod (replace <namespace> with your trust domain deployment namespace)
kubectl -n <namespace> get po -l app.kubernetes.io/name=spirl-server

# Port-forward to a server pod
kubectl port-forward -n <namespace> spirl-server-0 9090:9090

# In a separate shell, query the metrics endpoint
curl http://localhost:9090/metrics
```
Example output:
```
# HELP go_gc_duration_seconds A summary of the wall-time pause (stop-the-world) duration in garbage collection cycles.
# TYPE go_gc_duration_seconds summary
go_gc_duration_seconds{quantile="0"} 0.000582792
go_gc_duration_seconds{quantile="0.25"} 0.00085675
...
```
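If you want to spot-check values programmatically rather than eyeball the raw output, a minimal parser for the Prometheus text exposition format is straightforward. This sketch assumes simple samples like those above (it does not handle label values containing spaces, timestamps, or escaping); the sample input is illustrative, not real server output.

```python
def parse_metrics(text: str) -> dict:
    """Map 'name{labels}' -> float value, skipping HELP/TYPE comments and blanks."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # comment or blank line: no sample to record
        # The value is the last space-separated token on the line.
        key, _, value = line.rpartition(" ")
        samples[key] = float(value)
    return samples

sample = """\
# HELP go_gc_duration_seconds A summary of GC pause durations.
# TYPE go_gc_duration_seconds summary
go_gc_duration_seconds{quantile="0"} 0.000582792
go_goroutines 42
"""

metrics = parse_metrics(sample)
print(metrics["go_goroutines"])  # -> 42.0
```

In practice you would feed this the body of `curl http://localhost:9090/metrics`; for anything beyond a quick check, use a proper client library or Prometheus itself.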
## Key Metrics to Monitor
### gRPC Performance

- `grpc_server_handled_total` - Total gRPC requests completed
- `grpc_server_handling_seconds` - Request latency histogram
- `grpc_server_started_total` - Total gRPC streams started
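As an illustration, PromQL queries over these metrics might look like the following. The `grpc_method` and `grpc_code` label names follow the standard go-grpc-prometheus conventions and are assumptions here; verify them against your actual metric output.

```promql
# p99 gRPC request latency per method over 5m (requires latency histograms to be enabled)
histogram_quantile(0.99, sum by (le, grpc_method) (rate(grpc_server_handling_seconds_bucket[5m])))

# Rate of non-OK gRPC responses
sum(rate(grpc_server_handled_total{grpc_code!="OK"}[5m]))
```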
### Resource Utilization

- `go_memstats_alloc_bytes` - Current memory allocation
- `go_goroutines` - Number of goroutines
- `process_cpu_seconds_total` - CPU time
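For example, resource usage can be graphed with queries like these; the `job="spirl-server"` selector is an assumption about your scrape configuration and should be adjusted to match it.

```promql
# CPU cores consumed, averaged over 5m
rate(process_cpu_seconds_total{job="spirl-server"}[5m])

# Currently allocated heap memory
go_memstats_alloc_bytes{job="spirl-server"}
```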
### Kubernetes Runtime
See Kubernetes Metrics for guidance on monitoring the Kubernetes runtime for issues.
## Resource Management and Autoscaling

### Resource Requests and Limits
The Defakto Helm charts support configuring resource requests and limits for all server components. Setting these values is vital for:
- Ensuring the Kubernetes scheduler places pods on nodes with sufficient capacity
- Preventing resource contention with other workloads
- Enabling accurate capacity planning and monitoring
- Supporting Horizontal Pod Autoscaling (HPA)
### Configuring Resources
By default, the Helm charts do not configure resource requests or limits. While this provides flexibility, setting explicit values is recommended for production deployments.
Resource requirements depend on your specific usage patterns, including attestation frequency, SVID rotation rates, and API request volume. Start with conservative estimates and adjust based on observed metrics. Additionally, ensure sufficient headroom to support failovers.
Set resource requests and limits in your Helm values file (values shown are for example purposes only):
```yaml
trustDomainDeployment:
  deployment:
    resources:
      requests:
        cpu: "500m"     # Request 0.5 CPU cores
        memory: "512Mi" # Request 512 MiB memory
      limits:
        cpu: "1000m"    # Limit to 1 CPU core
        memory: "1Gi"   # Limit to 1 GiB memory
```
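To validate these values against observed usage, queries along these lines can help. They assume cAdvisor and kube-state-metrics are being scraped; the metric names are the standard ones exposed by those components.

```promql
# Per-pod CPU usage as a fraction of the CPU request
sum by (pod) (rate(container_cpu_usage_seconds_total{namespace="<namespace>"}[5m]))
  /
sum by (pod) (kube_pod_container_resource_requests{namespace="<namespace>", resource="cpu"})
```

A sustained ratio near or above 1.0 suggests the request is too low; a ratio well below 1.0 suggests room to reduce it.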
### Horizontal Pod Autoscaling (HPA)
Defakto Helm charts support configuring the Horizontal Pod Autoscaler for server components. HPA automatically adjusts the number of pod replicas based on observed metrics.
Since Defakto components are written in Go, which uses efficient garbage collection and memory management, we recommend scaling based on CPU usage only. Memory usage in Go applications tends to remain relatively stable and doesn't correlate as strongly with load as CPU does.
#### Enabling HPA
Configure HPA in your Helm values file:
```yaml
trustDomainDeployment:
  deployment:
    hpa:
      enabled: true
      minReplicas: 2
      maxReplicas: 10
      targetCPUUtilizationPercentage: 70 # Scale when CPU > 70% of requested
      # behavior: # Optional: configure scale-up/scale-down behavior
```
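If you do configure `behavior`, a sketch might look like the following. The field names come from the Kubernetes `autoscaling/v2` HorizontalPodAutoscaler spec; the values are illustrative, and it is assumed the chart passes `behavior` through to the HPA object verbatim.

```yaml
trustDomainDeployment:
  deployment:
    hpa:
      enabled: true
      minReplicas: 2
      maxReplicas: 10
      targetCPUUtilizationPercentage: 70
      behavior:
        scaleUp:
          stabilizationWindowSeconds: 0    # scale up immediately under load
        scaleDown:
          stabilizationWindowSeconds: 300  # wait 5m of lower load before removing pods
          policies:
            - type: Pods
              value: 1                     # remove at most one pod per minute
              periodSeconds: 60
```

Slowing scale-down like this avoids replica flapping when load is bursty.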
#### HPA Requirements
For HPA to function properly, you must:

1. **Set resource requests** - HPA calculates utilization as a percentage of requested resources:

   ```yaml
   resources:
     requests:
       cpu: "500m" # Required for CPU-based HPA
   ```

2. **Deploy metrics-server** - HPA requires the metrics-server to retrieve pod resource usage:

   ```shell
   helm repo add metrics-server https://kubernetes-sigs.github.io/metrics-server/
   helm upgrade --install metrics-server metrics-server/metrics-server
   ```

   > **Tip:** For additional configuration options, see the metrics-server Helm chart documentation.

3. **Verify metrics availability**:

   ```shell
   kubectl top pods -n <namespace>
   ```
#### Monitoring HPA Behavior
Check HPA status:
```shell
# View HPA status
kubectl get hpa -n <namespace>

# View detailed HPA status
kubectl describe hpa <hpa-name> -n <namespace>

# Monitor HPA events
kubectl get events -n <namespace> --field-selector involvedObject.kind=HorizontalPodAutoscaler
```
Example HPA status output:
```
NAME           REFERENCE                 TARGETS   MINPODS   MAXPODS   REPLICAS   AGE
spirl-server   Deployment/spirl-server   45%/70%   2         10        2          5d
```

The `TARGETS` column shows current vs. target utilization (45% current, 70% target).
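The HPA derives its desired replica count from these numbers using the formula from the Kubernetes documentation: `desired = ceil(currentReplicas * currentMetric / targetMetric)`, clamped to the configured bounds. A small sketch of that arithmetic:

```python
import math

def desired_replicas(current_replicas: int, current_utilization: float,
                     target_utilization: float, min_r: int, max_r: int) -> int:
    """Kubernetes HPA scaling formula:
    desired = ceil(current * currentMetric / targetMetric), clamped to [min, max]."""
    desired = math.ceil(current_replicas * current_utilization / target_utilization)
    return max(min_r, min(max_r, desired))

# With the example status above: 2 replicas at 45% CPU against a 70% target.
print(desired_replicas(2, 45, 70, min_r=2, max_r=10))  # -> 2 (no scaling needed)

# If utilization climbed to 90%, the HPA would add a replica.
print(desired_replicas(2, 90, 70, min_r=2, max_r=10))  # -> 3
```

(The real controller also applies tolerances and the optional `behavior` policies, so this is the core calculation only.)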
#### HPA Prometheus Metrics
Monitor HPA decisions with Prometheus (requires kube-state-metrics):
```promql
# Current replica count (actual running pods)
kube_horizontalpodautoscaler_status_current_replicas{namespace="<namespace>"}

# Desired replica count (what HPA wants)
kube_horizontalpodautoscaler_status_desired_replicas{namespace="<namespace>"}

# Maximum replica limit
kube_horizontalpodautoscaler_spec_max_replicas{namespace="<namespace>"}

# Check if HPA is scaling (current != desired)
kube_horizontalpodautoscaler_status_desired_replicas{namespace="<namespace>"}
  !=
kube_horizontalpodautoscaler_status_current_replicas{namespace="<namespace>"}

# Alert: HPA at max capacity (cannot scale further)
kube_horizontalpodautoscaler_status_current_replicas{namespace="<namespace>"}
  >=
kube_horizontalpodautoscaler_spec_max_replicas{namespace="<namespace>"}

# HPA target metrics (e.g., CPU utilization target)
kube_horizontalpodautoscaler_status_target_metric{namespace="<namespace>"}
```
If `kube_horizontalpodautoscaler_status_current_replicas` equals `kube_horizontalpodautoscaler_spec_max_replicas`, the HPA has reached its maximum scaling limit and cannot add more replicas, even if load continues to increase.
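This condition is worth alerting on. A sketch of a Prometheus alerting rule for it follows; the group, alert name, duration, and severity label are illustrative choices, not part of any shipped configuration.

```yaml
groups:
  - name: hpa-alerts
    rules:
      - alert: HPAAtMaxReplicas
        expr: |
          kube_horizontalpodautoscaler_status_current_replicas
            >= kube_horizontalpodautoscaler_spec_max_replicas
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "HPA {{ $labels.horizontalpodautoscaler }} has been at max replicas for 15 minutes"
```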
### Resource Management Best Practices
- **Start conservatively** - Begin with generous resource requests and monitor actual usage. Defakto servers run as single Go processes, so 0.5 CPU cores and 128 MiB of memory are reasonable starting points.
- **Monitor utilization** - Use Prometheus queries to track CPU/memory usage against requests/limits, and check for throttling, which can impact service behavior.
- **Avoid memory-based HPA** - For Go applications, CPU is a more reliable scaling signal.
- **Set reasonable max replicas** - Prevent runaway scaling that exhausts cluster capacity.
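CPU throttling in particular can be checked with cAdvisor's CFS metrics; these metric names are standard for cAdvisor, though availability depends on your monitoring stack.

```promql
# Fraction of CPU scheduling periods in which a container was throttled
sum by (pod) (rate(container_cpu_cfs_throttled_periods_total{namespace="<namespace>"}[5m]))
  /
sum by (pod) (rate(container_cpu_cfs_periods_total{namespace="<namespace>"}[5m]))
```

A persistently non-trivial throttled fraction usually means the CPU limit is too tight for the workload.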
## Troubleshooting Server Metrics

### Metrics Endpoint Not Accessible
Test the endpoint directly:
```shell
kubectl port-forward -n <namespace> <pod-name> 9090:9090
curl http://localhost:9090/metrics
```
## Next Steps
- Agent Metrics - Configure metrics for SPIRL Agents
- Review All Metrics - Complete metrics reference