Defakto Trust Domain Server Runbook
This runbook provides diagnostic and remediation steps for operational issues with Trust Domain Servers. Use it alongside your monitoring stack when you observe unusual metric readings.
All examples use metric names and PromQL that work with any Prometheus-compatible backend (Thanos, VictoriaMetrics, etc.). Grafana users can use these queries in the Explore view or add them to panels.
Understanding the Server's Role
The Defakto server handles requests from agents to mint SVIDs. It:
- Attests agents — verifies that the connecting agent belongs to an authorized cluster
- Mints SVIDs — receives requests and generates an SVID based on configured policies, signing it with the built-in or an external CA
- Rotates the authority — when using the built-in authority, rotates the CA on a configured schedule and distributes those updates
- Distributes Trust Bundles — polls any configured federation endpoints and builds and distributes a SPIFFE trust bundle to agents for all configured trust domains
- Connects to the Defakto control plane — retrieves its configuration and streams events to the Defakto control plane for use in the Console
Key Metrics Reference
The following metrics are the most operationally significant for Trust Domain Servers.
Note: The grpc_server_handling_seconds and spirl_server_mint_svid_duration_seconds histograms are only populated when emitLatencyMetrics is enabled in your server configuration. See the Server installation guide.
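As a rough illustration only, enabling the flag might look like the following in Helm values. The exact key path for emitLatencyMetrics depends on your chart version, so confirm it against the Server installation guide before applying:

```yaml
# Hypothetical sketch: the "server" key path is an assumption, not a
# Defakto-documented value; verify against the installation guide.
trustDomainDeployment:
  server:
    emitLatencyMetrics: true
```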
Application Metrics
The application metrics focus on measuring the SVID signing operations of the Server.
| Metric | Labels | What it measures |
|---|---|---|
| spirl_server_mint_svid_total | svid_type, status_code | Counter of SVID minting operations, by outcome |
| spirl_server_mint_svid_duration_seconds | svid_type, status_code | Prometheus histogram of SVID minting timing |
The status_code label on spirl_server_mint_svid_total is the
primary indicator of what went wrong when minting fails. Key values:
| status_code | Meaning |
|---|---|
| attestation_failed | Server could not attest the workload using a plugin |
| auth_failed | Agent was rejected due to an invalid auth session |
| invalid_public_key | The public key in the Mintx509SVID request from the agent could not be parsed |
| invalid_request | Malformed request from the agent |
| ok | Successful |
| signing_failed | Signing authority (Internal/KMS/key manager) returned an error when producing a signature for the SVID |
| template_error | SPIFFE ID or certificate template evaluation failed |
gRPC Metrics
Defakto uses gRPC for communication between services and reports metrics per gRPC method invoked. The server acts as either a client or a server depending on the gRPC service that is being invoked.
| Metric | Key Labels | What it measures |
|---|---|---|
| grpc_server_handled_total | grpc_code, grpc_method, grpc_service | Completed RPCs, by result code |
| grpc_server_started_total | grpc_method, grpc_service | Inbound streaming RPC start rate |
| grpc_server_handling_seconds | grpc_code, grpc_method | Inbound RPC latency histogram (requires emitLatencyMetrics) |
| grpc_client_handled_total | grpc_code, grpc_method, grpc_service | Outbound RPCs (to the control plane), by result code |
| grpc_client_handling_seconds | grpc_code, grpc_method | Outbound RPC latency histogram (requires emitLatencyMetrics) |
Services where the Defakto Server operates as a server:
| grpc_service | Meaning |
|---|---|
| com.spirl.agent.v1.agent.API | API used by the agent to request SVIDs from the Server |
| com.spirl.agent.v1.configuration.API | Used to send configuration when using the reflector |
| com.spirl.agent.v1.identityexchange.API | Used by Developer Identity to exchange SVIDs for developers |
| com.spirl.agent.v1.session.API | API used by agents to authenticate with the Server and retrieve a temporary session |
| com.spirl.private.common.api.resource.v1.Sink | Used to distribute configuration changes between components |
| com.spirl.private.common.api.resource.v1.Source | Used to distribute configuration changes between components |
| com.spirl.private.signer.api.v1.configuration.API | Used by the Defakto control-plane to send configuration changes to the Servers |
| com.spirl.private.signer.api.v1.federation.API | Used by the control-plane to trigger a refresh of any configured trust-domain federations |
| com.spirl.private.signer.api.v1.regionauthority.API | Used by the Defakto control-plane to manage key rotations and trust bundle coordination within the trust domain. Issues on this service generally cause no immediate impact unless they persist for more than 24 hours. |
| com.spirl.serverless.alpha.SpiffeWorkloadAPI | Alpha service to support serverless workloads |
| grpc.health.v1.Health | Health checks of the gRPC connection |
Services where the Defakto Server operates as a client:
| grpc_service | Meaning |
|---|---|
| com.spirl.events.v1.ingest.API | Used to send events to the Defakto control-plane to be shown in the Console |
All gRPC metrics carry constant labels for spirl_component (e.g. "agent", "server"), spirl_trust_domain, spirl_trust_domain_deployment, and spirl_app_version.
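These constant labels are useful for scoping any of the queries in this runbook to a single trust domain or deployment. For example (the label values "example.org" and "prod-us" are placeholders for your own trust domain and deployment names):

```promql
# Completed inbound RPC rate for a single trust domain deployment
sum by (grpc_service) (
  rate(grpc_server_handled_total{
    spirl_component="server",
    spirl_trust_domain="example.org",
    spirl_trust_domain_deployment="prod-us"
  }[5m])
)
```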
Healthy Baseline
| Signal | Expected state | Notes |
|---|---|---|
| SVID mint success rate | >99% | |
| gRPC API success rate | >99% | |
| p95 mint latency | <100ms under normal load | |
| Error rate on control plane connection | 0% or transient only | |
| Pod restarts | 0 in the past hour | |
| CPU usage | <50% of limit | Sustained usage should stay well below this limit. |
| Memory usage | <60% of limit | Memory usage is a function of the number of agent connections. |
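The baselines above can be turned into Prometheus alerting rules. A minimal sketch for the SVID mint success rate (the alert name, severity label, and 10-minute `for` duration are illustrative choices, not Defakto-provided defaults):

```yaml
groups:
  - name: defakto-server          # illustrative group name
    rules:
      - alert: DefaktoSVIDMintSuccessRateLow   # hypothetical alert name
        expr: |
          sum(rate(spirl_server_mint_svid_total{status_code="ok"}[5m]))
            /
          sum(rate(spirl_server_mint_svid_total[5m])) < 0.99
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "SVID mint success rate below 99%"
```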
Checking Overall Health
Before diving into a specific symptom, run these PromQL queries to get a snapshot:
# SVID mint success rate
# Expected value: 1 (100%)
sum(rate(spirl_server_mint_svid_total{status_code="ok"}[5m]))
/
sum(rate(spirl_server_mint_svid_total[5m]))
# gRPC error rate from agents
# Expected value: 0 (0% errors)
sum(rate(grpc_server_handled_total{spirl_component="server", grpc_service=~"com.spirl.agent.*", grpc_code!~"OK|Canceled"}[5m]))
/
sum(rate(grpc_server_handled_total{spirl_component="server", grpc_service=~"com.spirl.agent.*"}[5m]))
# Control plane error rate
# Expected value: 0 (0% errors)
(
(sum(rate(grpc_server_handled_total{spirl_component="server", grpc_service=~"com.spirl.private.signer.*", grpc_code!~"OK|Canceled"}[5m])) or vector(0))
+
(sum(rate(grpc_client_handled_total{spirl_component="server", grpc_service=~"com.spirl.events.*", grpc_code!~"OK|Canceled"}[5m])) or vector(0))
)
/
(
sum(rate(grpc_server_handled_total{spirl_component="server", grpc_service=~"com.spirl.private.signer.*"}[5m]))
+
sum(rate(grpc_client_handled_total{spirl_component="server", grpc_service=~"com.spirl.events.*"}[5m]))
)
Also check the server cluster(s):
# List server pods and their status (replace <namespace> with your trust domain namespace, e.g. tdd-myorg)
kubectl get pods -n <namespace> -l app.kubernetes.io/name=spirl-server
# Check recent events for a server pod
kubectl events -n <namespace> --for pod/<pod-name>
# Tail server logs
kubectl logs -n <namespace> <pod-name> --tail=100
Scenarios
1. Elevated API error rate
Symptom: grpc_server_handled_total shows >1% of requests
completing with a non-OK, non-Canceled code.
What it means: The server is sending back error responses to agents on the agent-to-signer protocol. Agents use this protocol to request SVIDs (aka "mint"), update trust bundles, and receive configuration changes. Errors could indicate problems with any of these operations, which may or may not impact service to workloads.
Diagnosis
First, identify which gRPC error code, service, and method are dominant:
# Error breakdown by gRPC code
sum by (grpc_code, grpc_service, grpc_method) (
rate(grpc_server_handled_total{
spirl_component="server",
grpc_code!~"OK|Canceled"
}[5m])
) > 0
If related to SVID issuance, check the application-level status codes for more detail:
# Issuance failures by status_code
sum by (svid_type, status_code) (
rate(spirl_server_mint_svid_total{status_code!="ok"}[5m])
)
Use the grpc_code to guide further investigation:
| grpc_code | Likely cause | Next step |
|---|---|---|
| FailedPrecondition | Control plane config not yet received, or feature unavailable | See Control plane connectivity errors |
| Unauthenticated | Agent session token rejected or expired | Check server logs (kubectl logs) for "auth_failed" |
| PermissionDenied | Agent authenticated but issuance fails during attestation | Check the server logs from the instance that produced the metric |
| InvalidArgument | Malformed agent request (often caused by a missing attribute used in a PathTemplate) | Check agent logs for attestation errors using the Agent Runbook — Workload Attestation Failures |
| NotFound | Call-dependent, check server logs | When returned for the NewSession API call, this indicates an incorrect agent private key |
| Internal / Unknown | Signing failure or unexpected server-side error | See SVID issuance failures |
Remediation
# Search server logs for the relevant error code (example: attestation failures)
kubectl logs -n <namespace> -l app.kubernetes.io/name=spirl-server \
--since=15m | grep -i "attestation_failed\|FailedPrecondition\|auth_failed"
If errors persist across all pods, the issue is likely configuration or upstream (control plane or key manager). Contact Defakto Support with the relevant log output.
2. SVID issuance failures
Symptom: spirl_server_mint_svid_total shows a sustained non-zero
rate with status_code != "ok".
What it means: The server has received a gRPC request to mint an SVID, but has failed to issue it due to an application-layer rejection. See the Application Metrics status code table to determine the reason for the rejection.
The server is receiving agent requests but cannot fulfill them. Workloads will not receive fresh SVIDs.
Workloads with existing SVIDs will continue to function without interruption for as long as those SVIDs remain valid. New workloads, and existing workloads whose SVIDs expire before issuance recovers, will be impacted.
Diagnosis
# SVID failure rate by cause
sum by (svid_type, status_code) (
rate(spirl_server_mint_svid_total{status_code!="ok"}[5m])
)
# Issuance duration: Elevated durations often precede failures and can be an indication of an overloaded server or resource constraint
histogram_quantile(0.95,
sum by (le, svid_type) (
rate(spirl_server_mint_svid_duration_seconds_bucket[5m])
)
)
The next steps depend on the predominant status code returned in the first query:
attestation_failed: The server could not attest the workload. This status can only occur when a server-side attestation extension is in use to collect additional evidence about a workload.
# Look for attestation details
kubectl logs -n <namespace> -l app.kubernetes.io/name=spirl-server \
--since=15m | grep -i "attestation"
Troubleshoot the extension to validate it is functioning normally and not returning errors to requests.
signing_failed: The signing authority returned an error.
Review the key manager configuration. For AWS KMS deployments, check KMS service health in the AWS console and verify IAM permissions are intact.
template_error: A SPIFFE ID template or certificate
customization template failed to evaluate.
Check your trust domain configuration for recently changed path templates or X.509 customization rules. Refer to the path template documentation and X.509 customization documentation for syntax reference.
Also check the agents for failures, attribute redaction configuration changes, and whether there are problems collecting/providing attestation attributes.
A frequent cause of template rendering issues is referring to an attribute that wasn't attested to. This could be a typo in the attribute reference, or commonly, an attribute that doesn't exist for every workload in the cluster.
auth_failed: Agent sessions are being rejected.
The authentication session was considered invalid after originally being accepted at the transport layer. This generally represents a product issue and should be reported to the Defakto support team if observed.
invalid_request and invalid_public_key: The server is rejecting the request as invalid.
Ensure that the agent and server versions are compatible with each other. Check the release notes for the server and agent for any notes on compatibility with enabled features.
3. Control plane connectivity errors
The Trust Domain Server maintains a bidirectional relationship with the Defakto control plane:
- Inbound (control plane → server): The control plane initiates gRPC calls to the server to push configuration changes, coordinate key rotations, and manage trust bundles. These calls are reflected in grpc_server_handled_total with grpc_service=~"com.spirl.private.signer.*".
- Outbound (server → control plane): The server sends events back to the control plane for display in the Console. These calls are reflected in grpc_client_handled_total with grpc_service=~"com.spirl.events.*".
The Defakto team monitors the inbound control-plane-to-server connection. If errors are detected on this path, the Defakto team will proactively reach out.
Symptom: grpc_client_handled_total (server-to-control-plane)
shows a non-zero error rate, or grpc_server_handled_total for
com.spirl.private.signer.* services shows elevated errors.
What it means: The server and the Defakto control plane cannot communicate reliably. Outbound errors mean events used to populate the Console Dashboard are not being delivered. Events are buffered and retried, but the in-memory buffer size is limited and cannot sustain persistent disruptions in connectivity without losing visibility on the Console. Inbound errors mean the control plane cannot push configuration or coordinate key rotations with the server. Even with inbound errors, the server and agents will continue to operate normally for at least 7 days.
Diagnosis
# Outbound: server-to-control-plane event delivery errors
sum by (grpc_code, grpc_method) (
rate(grpc_client_handled_total{
spirl_component="server",
grpc_code!~"OK|Canceled"
}[5m])
)
# Inbound: control-plane-to-server errors (monitored by Defakto)
sum by (grpc_code, grpc_method) (
rate(grpc_server_handled_total{
spirl_component="server",
grpc_service=~"com.spirl.private.signer.*",
grpc_code!~"OK|Canceled"
}[5m])
)
# Check for control plane connectivity errors in logs
kubectl logs -n <namespace> -l app.kubernetes.io/name=spirl-server \
  --since=15m | grep -iE "control.plane|Unavailable|connection refused"
# Verify control plane connectivity using spirldbg
kubectl debug -it <TARGET_SERVER_POD> --image=public.ecr.aws/d1i7q6j7/spirldbg:latest \
--target=<TARGET_CONTAINER> --profile=general
$ spirldbg network-diagnostics
Running network diagnostics with "api.spirl.com" ...
✓ DNS lookup successful
✓ TLS dial successful
✓ HTTP/2 request successful
✓ gRPC unary request successful
✓ gRPC streaming request successful
# (Expected output for successful connection)
Remediation
- Verify network egress to the Defakto control plane is open (check firewall rules and NetworkPolicy if on Kubernetes). Ensure that the server is configured to use any proxies required for reaching services outside the Kubernetes cluster.
- If using a TLS-intercepting proxy, confirm the proxy's CA is included in controlPlane.relay.webPKISupplementalRootCAs so the server can validate the control plane connection.
- The server will continue to operate independently of the control plane using a snapshot of the last received configuration. This snapshot is persisted inside Kubernetes, so server instances can restart or scale without disruption. While disconnected, the following impacts may be observed:
- The server will not receive configuration changes.
- The server will not be able to rotate its signing keys.
- The server will not receive trust bundle updates (including the inclusion of key rotations from other trust domain deployments).
The default settings for key rotations provide a 7-day window between generating a new key and beginning to use it for SVID signing. The control-plane connection needs to remain offline for several days before a service impact in SVID signing will occur.
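To gauge whether the outbound path has been down long enough to matter, check whether any events were delivered successfully over a longer window (the 1-hour window here is an arbitrary starting point; widen it to match your tolerance):

```promql
# Returns 1 when no successful event deliveries occurred in the past hour,
# a rough proxy for a sustained outbound outage
(
  sum(rate(grpc_client_handled_total{spirl_component="server",
    grpc_service=~"com.spirl.events.*", grpc_code="OK"}[1h]))
  or vector(0)
) == 0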
Check the Defakto status page, and contact support if the errors are not explained by outbound network connectivity issues on your side.
4. Elevated latency
Symptom: p95 SVID mint latency
(spirl_server_mint_svid_duration_seconds) exceeds 500ms, or gRPC
latency (grpc_server_handling_seconds) climbs above 1s.
Diagnosis
# p95 SVID mint latency by type
histogram_quantile(0.95,
sum by (le, svid_type) (
rate(spirl_server_mint_svid_duration_seconds_bucket[5m])
)
)
# p95 gRPC latency by method
histogram_quantile(0.95,
sum by (le, grpc_method) (
rate(grpc_server_handling_seconds_bucket{spirl_component="server"}[5m])
)
)
If latency is elevated, check for resource saturation — CPU throttling is the most common cause.
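Throttling can be confirmed directly from the standard cAdvisor CFS metrics, assuming your cluster exposes them (most kubelet/cAdvisor setups do):

```promql
# Fraction of CFS periods in which the container was throttled, per pod;
# sustained values above ~0.25 usually correlate with visible latency
sum by (pod) (
  rate(container_cpu_cfs_throttled_periods_total{namespace=~"^tdd-.*"}[5m])
)
/
sum by (pod) (
  rate(container_cpu_cfs_periods_total{namespace=~"^tdd-.*"}[5m])
)
```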
5. Resource Saturation
Symptom: CPU usage approaches the configured limit, memory usage exceeds 80% of limit, or OOM events are recorded.
CPU Saturation
# CPU usage as a percentage of limit, per pod
100 * sum by (pod) (
rate(container_cpu_usage_seconds_total{namespace=~"^tdd-.*"}[5m])
)
/ sum by (pod) (
kube_pod_container_resource_limits{namespace=~"^tdd-.*", resource="cpu"}
)
If CPU usage is consistently above 70% of the limit, consider:
- Enabling or adjusting HPA
- Increasing the CPU limit
- Distributing agents across more trust domain deployments
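If you manage the HPA yourself, a minimal sketch targeting CPU utilization follows. The Deployment name spirl-server, the namespace, the replica bounds, and the 70% target are all illustrative; match them to your installation:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: spirl-server          # illustrative; match your Deployment name
  namespace: tdd-myorg        # your trust domain namespace
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: spirl-server
  minReplicas: 2
  maxReplicas: 6
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```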
Memory Saturation
Memory usage as a percentage of limit, per pod:
100 * sum by (pod) (
container_memory_working_set_bytes{namespace=~"^tdd-.*"}
)
/ sum by (pod) (
kube_pod_container_resource_limits{namespace=~"^tdd-.*", resource="memory"}
)
The Defakto services are written in Go and tend to maintain stable memory usage; usage scales with the number of agents connected to the server. If memory pressure is high, increase the memory requests/limits for the servers or add replicas to the deployment.
If you observe memory usage trending upward over time without a corresponding increase in agent count, please contact Defakto support.
For OOM events:
# OOM kill count
sum by (pod) (
container_oom_events_total{namespace=~"^tdd-.*"}
)
Increase the memory limit if OOM events are occurring. Start with a limit of 512Mi and adjust based on observed usage.
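If the server is Helm-managed, set the limit in your values file rather than patching the Deployment directly (a direct patch would be reverted on the next upgrade). A hypothetical sketch; the resources key path is an assumption, so confirm it against the Server installation guide for your chart version:

```yaml
# Hypothetical values sketch, not a documented Defakto key path
trustDomainDeployment:
  deployment:
    resources:
      limits:
        memory: 512Mi
```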
6. Pod Restarts
Symptom: Pod restart count increases unexpectedly.
# Recent restart rate
rate(kube_pod_container_status_restarts_total{namespace=~"^tdd-.*"}[15m]) > 0
# Get restart count and last restart time
kubectl get pods -n <namespace> -o wide
# Check what the pod exited with
kubectl describe pod -n <namespace> <pod-name> | grep -A5 "Last State"
# Get logs from the previous (crashed) container
kubectl logs -n <namespace> <pod-name> --previous
OOMKilled: Increase the memory limit (see Resource Saturation).
Exit code 1 or 2 (startup failure): Usually a configuration error. Check the logs from the previous container for parsing or validation errors.
Repeated restarts (CrashLoopBackOff): Kubernetes applies an exponentially increasing backoff delay between restart attempts of a crashing container. Review the previous container's logs for a root cause before taking further action.
Understanding Server Health Checks
The server includes a health check subsystem (the readyChecker) that monitors the status of internal components. Kubernetes liveness and readiness probes use this subsystem to determine whether the server is healthy and ready to accept agent connections.
During startup, the server must complete several prerequisite steps before the readyChecker is initialized:
- Load or receive configuration — on first boot, the server connects to the Defakto control plane, authenticates, establishes a relay connection, and receives its trust domain configuration via PushResources. On subsequent startups, the server loads its configuration from an in-cluster cache (a Kubernetes Custom Resource) and can start independently of the control plane.
- Initialize the readyChecker — individual health checks are registered and begin in a down state.
- Start service subsystems — the agent-facing gRPC server, key manager, federation poller, and other subsystems begin starting.
Once the readyChecker is running, it monitors the following checks:
| Health check | What it waits for |
|---|---|
configCache | Trust domain configuration has been received and cached |
bundleCache | A trust bundle has been assembled from partial bundles coordinated with the control plane |
agentServer | The agent-facing gRPC server is listening and ready to accept connections |
jwtKeyCache | JWT signing keys have been loaded — used to issue sessions to agents (this isn't referring to the JWT Signing Authority used for JWT SVIDs; these are used for short-term sessions with agents) |
jwksCache | Cache of JWKS keysets from external issuers has been initialized (used for example for PSAT Agent Attestation) |
The server will not pass readiness probes until all health checks report up. This is by design — agents should not be routed to a server that is not ready to service requests.
On a first installation, the server must wait for the control plane to deliver its configuration, generate signing keys, and build the initial trust bundle. This typically completes within 6 minutes.
On subsequent startups, the server loads its configuration and trust bundle from an in-cluster cache and can become ready without waiting for the control plane. This means servers can restart, scale up, or recover from node failures even if the control plane is temporarily unreachable. Once the control plane connection is re-established, the server will refresh its configuration and trust bundle in the background.
If the server is stuck in a not-ready state or failing liveness probes, inspect the health check status in the logs. Look for readyChecker log entries that report the current status and the state of individual checks:
kubectl logs -n <namespace> <pod-name> | grep readyChecker
Each health check entry includes:
- status — the overall health status (up or down)
- checks — a map of individual subsystem checks, each showing Status, ContiguousFails, LastSuccessAt, and LastFailureAt
A health check that remains down indicates the specific subsystem that is blocking startup:
- configCache stays down: The server has not received or loaded its configuration. On first boot, this requires a connection to the control plane — check Control plane connectivity errors. On subsequent startups, the configuration is loaded from the in-cluster cache; check that the data.spirl.com Custom Resource in the trust domain namespace has not been deleted.
- bundleCache stays down: The trust bundle has not been assembled. This requires the control plane to coordinate key rotations across all deployments in the trust domain. Check the control plane connection and look for errors in regionAuthorityCache log entries.
- agentServer stays down: The agent-facing gRPC server has not started. Check the server logs for binding errors or port conflicts.
- jwtKeyCache or jwksCache stays down: JWT signing key material has not been loaded. This typically resolves once the configuration is available.
Diagnostic Command Reference
Increasing Log Verbosity
For deeper troubleshooting, enable debug logging on the server by setting the SPIRL_LOG_DEBUG environment variable. Add the following to your server Helm values and restart:
trustDomainDeployment:
deployment:
env:
SPIRL_LOG_DEBUG: "true"
Debug logging increases log volume. Disable it after troubleshooting is complete.
Useful Commands
# List all server pods across trust domain namespaces
kubectl get pods -A -l app.kubernetes.io/name=spirl-server
# Check metrics endpoint on a specific pod
kubectl port-forward -n <namespace> <pod-name> 9090:9090
# In another shell:
curl -s http://localhost:9090/metrics | grep spirl_server
# Recent error logs across all server pods
kubectl logs -n <namespace> -l app.kubernetes.io/name=spirl-server \
--since=30m | grep -E "ERROR|WARN"
# Check HPA status
kubectl get hpa -n <namespace>
kubectl describe hpa -n <namespace>
Escalation
If an issue cannot be resolved using this runbook:
- Collect the relevant log output (minimum 15 minutes before and after the incident)
- Note the specific metric values and PromQL queries that showed the anomaly
- Record the trust domain name and deployment region
- Contact Defakto Support with this information
For issues affecting certificate delivery to production workloads, treat as high priority.