Defakto Trust Domain Server Runbook

This runbook provides diagnostic and remediation steps for operational issues with Trust Domain Servers. Use it alongside your monitoring stack when you observe unusual metric readings.

PromQL examples

All examples use metric names and PromQL that work with any Prometheus-compatible backend (Thanos, VictoriaMetrics, etc.). Grafana users can use these queries in the Explore view or add them to panels.

Understanding the Server's Role

The Defakto server handles requests from agents to mint SVIDs. It:

  1. Attests agents — verifies that the connecting agent belongs to an authorized cluster
  2. Mints SVIDs — receives requests and generates an SVID based on configured policies, signing it with the built-in or an external CA
  3. Rotates the authority — when using the built-in authority, rotates the CA on a configured schedule and distributes those updates
  4. Distributes trust bundles — polls any configured federation endpoints, builds a SPIFFE trust bundle covering all configured trust domains, and distributes it to agents
  5. Connects to the Defakto control plane — retrieves its configuration and streams events to the Defakto control plane for use in the Console

Key Metrics Reference

The following metrics are the most operationally significant for Trust Domain Servers.

Latency metrics require configuration

The grpc_server_handling_seconds and spirl_server_mint_svid_duration_seconds histograms are populated only when emitLatencyMetrics is enabled in your server configuration. See the Server installation guide.

Application Metrics

The application metrics focus on measuring the SVID signing operations of the Server.

| Metric | Labels | What it measures |
| --- | --- | --- |
| spirl_server_mint_svid_total | svid_type, status_code | Counter of SVID minting operations, by outcome |
| spirl_server_mint_svid_duration_seconds | svid_type, status_code | Prometheus histogram of SVID minting timing |

The status_code label on spirl_server_mint_svid_total is the primary indicator of what went wrong when minting fails. Key values:

| status_code | Meaning |
| --- | --- |
| attestation_failed | Server could not attest the workload using a plugin |
| auth_failed | Agent was rejected due to an invalid auth session |
| invalid_public_key | The public key in the Mintx509SVID request from the agent could not be parsed |
| invalid_request | Malformed request from the agent |
| ok | Successful |
| signing_failed | Signing authority (Internal/KMS/key manager) returned an error when producing a signature for the SVID |
| template_error | SPIFFE ID or certificate template evaluation failed |

gRPC Metrics

Defakto uses gRPC for communication between services and reports metrics per gRPC method invoked. The server acts as either a client or a server depending on the gRPC service that is being invoked.

| Metric | Key Labels | What it measures |
| --- | --- | --- |
| grpc_server_handled_total | grpc_code, grpc_method, grpc_service | Completed RPCs, by result code |
| grpc_server_started_total | grpc_method, grpc_service | Inbound streaming RPC start rate |
| grpc_server_handling_seconds | grpc_code, grpc_method | Inbound RPC latency histogram (requires emitLatencyMetrics) |
| grpc_client_handled_total | grpc_code, grpc_method, grpc_service | Outbound RPCs (to the control plane), by result code |
| grpc_client_handling_seconds | grpc_code, grpc_method | Outbound RPC latency (requires emitLatencyMetrics) |

Services where the Defakto Server operates as a server:

| grpc_service | Meaning |
| --- | --- |
| com.spirl.agent.v1.agent.API | API used by the agent to request SVIDs from the Server |
| com.spirl.agent.v1.configuration.API | Used to send configuration when using the reflector |
| com.spirl.agent.v1.identityexchange.API | Used by Developer Identity to exchange SVIDs for developers |
| com.spirl.agent.v1.session.API | API used by agents to authenticate with the Server and retrieve a temporary session |
| com.spirl.private.common.api.resource.v1.Sink | Used to distribute configuration changes between components |
| com.spirl.private.common.api.resource.v1.Source | Used to distribute configuration changes between components |
| com.spirl.private.signer.api.v1.configuration.API | Used by the Defakto control plane to send configuration changes to the Servers |
| com.spirl.private.signer.api.v1.federation.API | Used by the control plane to trigger a refresh of any configured trust-domain federations |
| com.spirl.private.signer.api.v1.regionauthority.API | Used by the Defakto control plane to manage key rotations and trust bundle coordination within the trust domain. Errors here generally have no immediate impact unless they persist for more than 24 hours. |
| com.spirl.serverless.alpha.SpiffeWorkloadAPI | Alpha service to support serverless workloads |
| grpc.health.v1.Health | Health checks of the gRPC connection |

Services where the Defakto Server operates as a client:

| grpc_service | Meaning |
| --- | --- |
| com.spirl.events.v1.ingest.API | Used to send events to the Defakto control-plane to be shown in the Console |

All gRPC metrics carry constant labels for spirl_component (e.g. "agent", "server"), spirl_trust_domain, spirl_trust_domain_deployment, and spirl_app_version.

Healthy Baseline

| Signal | Expected state | Notes |
| --- | --- | --- |
| SVID mint success rate | >99% | |
| gRPC API success rate | >99% | |
| p95 mint latency | <100ms under normal load | |
| Error rate on control plane connection | 0%, or transient only | |
| Pod restarts | 0 in the past hour | |
| CPU usage | <50% of limit | Sustained usage should stay well below the limit. |
| Memory usage | <60% of limit | Memory usage is a function of the number of agent connections. |
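
These baselines can be encoded as alerting rules so regressions page before users notice. A minimal sketch of a Prometheus rules file, assuming the metric names above; the group name, thresholds, and "for" durations are illustrative and should be tuned to your environment:

```yaml
groups:
  - name: spirl-server  # illustrative group name
    rules:
      - alert: SVIDMintSuccessRateLow
        expr: |
          sum(rate(spirl_server_mint_svid_total{status_code="ok"}[5m]))
          /
          sum(rate(spirl_server_mint_svid_total[5m])) < 0.99
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "SVID mint success rate below 99% for 10 minutes"
      - alert: SVIDMintLatencyHigh
        # Requires emitLatencyMetrics to be enabled
        expr: |
          histogram_quantile(0.95,
            sum by (le) (rate(spirl_server_mint_svid_duration_seconds_bucket[5m]))
          ) > 0.1
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "p95 SVID mint latency above 100ms"
```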

Checking Overall Health

Before diving into a specific symptom, run these PromQL queries to get a snapshot:

# SVID mint success rate
# Expected value: 1 (100%)
sum(rate(spirl_server_mint_svid_total{status_code="ok"}[5m]))
/
sum(rate(spirl_server_mint_svid_total[5m]))

# gRPC error rate from agents
# Expected value: 0 (0% errors)
sum(rate(grpc_server_handled_total{spirl_component="server", grpc_service=~"com.spirl.agent.*", grpc_code!~"OK|Canceled"}[5m]))
/
sum(rate(grpc_server_handled_total{spirl_component="server", grpc_service=~"com.spirl.agent.*"}[5m]))

# Control plane error rate
# Expected value: 0 (0% errors)
(
(sum(rate(grpc_server_handled_total{spirl_component="server", grpc_service=~"com.spirl.private.signer.*", grpc_code!~"OK|Canceled"}[5m])) or vector(0))
+
(sum(rate(grpc_client_handled_total{spirl_component="server", grpc_service=~"com.spirl.events.*", grpc_code!~"OK|Canceled"}[5m])) or vector(0))
)
/
(
sum(rate(grpc_server_handled_total{spirl_component="server", grpc_service=~"com.spirl.private.signer.*"}[5m]))
+
sum(rate(grpc_client_handled_total{spirl_component="server", grpc_service=~"com.spirl.events.*"}[5m]))
)

Also check the server cluster(s):

# List server pods and their status (replace <namespace> with your trust domain namespace, e.g. tdd-myorg)
kubectl get pods -n <namespace> -l app.kubernetes.io/name=spirl-server

# Check recent events for a server pod
kubectl events -n <namespace> --for pod/<pod-name>

# Tail server logs
kubectl logs -n <namespace> <pod-name> --tail=100

Scenarios

1. Elevated API error rate

Symptom: grpc_server_handled_total shows >1% of requests completing with a non-OK, non-Canceled code.

What it means: The server is sending back error responses to agents on the agent-to-signer protocol. This protocol is used by the agent to request SVIDs (aka "mint"), update trust bundles, and receive configuration changes. Errors could indicate problems with any of these operations, which may or may not impact services to workloads.

Diagnosis

First, identify which gRPC error code, service, and method are dominant:

# Error breakdown by gRPC code
sum by (grpc_code, grpc_service, grpc_method) (
  rate(grpc_server_handled_total{
    spirl_component="server",
    grpc_code!~"OK|Canceled"
  }[5m])
) > 0

If related to SVID issuance, check the application-level status codes for more detail:

# Issuance failures by status_code
sum by (svid_type, status_code) (
  rate(spirl_server_mint_svid_total{status_code!="ok"}[5m])
)

Use the grpc_code to guide further investigation:

| grpc_code | Likely cause | Next step |
| --- | --- | --- |
| FailedPrecondition | Control plane config not yet received, or feature unavailable | See Control plane connectivity errors |
| Unauthenticated | Agent session token rejected or expired | Check server logs (kubectl logs) for "auth_failed" |
| PermissionDenied | Agent authenticated but issuance fails during attestation | Check the server logs from the instance that produced the metric |
| InvalidArgument | Malformed agent request (often caused by a missing attribute used in a PathTemplate) | Check agent logs for attestation errors using the Agent Runbook — Workload Attestation Failures |
| NotFound | Call-dependent; check server logs | When returned for the NewSession API call, this indicates an incorrect agent private key |
| Internal / Unknown | Signing failure or unexpected server-side error | See SVID issuance failures |

Remediation

# Search server logs for the relevant error code (example: attestation failures)
kubectl logs -n <namespace> -l app.kubernetes.io/name=spirl-server \
  --since=15m | grep -iE "attestation_failed|FailedPrecondition|auth_failed"

If errors persist across all pods, the issue is likely configuration or upstream (control plane or key manager). Contact Defakto Support with the relevant log output.


2. SVID issuance failures

Symptom: spirl_server_mint_svid_total shows a sustained non-zero rate with status_code != "ok".

What it means: The server has received a gRPC request to mint an SVID, but has failed to issue it due to an application-layer rejection. See the Application Metrics status code table to determine the reason for the rejection.

The server is receiving agent requests but cannot fulfill them. Workloads will not receive fresh SVIDs.

tip

Workloads with existing SVIDs will continue to function without interruption, as long as the SVID remains valid. Only new workloads will be impacted.

Diagnosis

# SVID failure rate by cause
sum by (svid_type, status_code) (
  rate(spirl_server_mint_svid_total{status_code!="ok"}[5m])
)

# Issuance duration: elevated durations often precede failures
# and can indicate an overloaded server or a resource constraint
histogram_quantile(0.95,
  sum by (le, svid_type) (
    rate(spirl_server_mint_svid_duration_seconds_bucket[5m])
  )
)

The next steps depend on the predominant status code returned in the first query:

attestation_failed: The server could not attest the workload. This occurs when a server-side attestation extension is in use to collect additional evidence about a workload.

# Look for attestation details
kubectl logs -n <namespace> -l app.kubernetes.io/name=spirl-server \
--since=15m | grep -i "attestation"

Troubleshoot the extension to validate it is functioning normally and not returning errors to requests.

signing_failed: The signing authority returned an error.

Review the key manager configuration. For AWS KMS deployments, check KMS service health in the AWS console and verify IAM permissions are intact.

template_error: A SPIFFE ID template or certificate customization template failed to evaluate.

Check your trust domain configuration for recently changed path templates or X.509 customization rules. Refer to the path template documentation and X.509 customization documentation for syntax reference.

Also check the agents for failures, attribute redaction configuration changes, and whether there are problems collecting/providing attestation attributes.

tip

A frequent cause of template rendering issues is referring to an attribute that wasn't attested to. This could be a typo in the attribute reference, or commonly, an attribute that doesn't exist for every workload in the cluster.

auth_failed: Agent sessions are being rejected.

The authentication session was considered invalid after originally being accepted at the transport layer. This generally represents a product issue and should be reported to the Defakto support team if observed.

invalid_request and invalid_public_key: The server is rejecting the request as invalid.

Ensure that the agent and server versions are compatible with each other. Check the release notes for the server and agent for any notes on compatibility with enabled features.


3. Control plane connectivity errors

The Trust Domain Server maintains a bidirectional relationship with the Defakto control plane:

  • Inbound (control plane → server): The control plane initiates gRPC calls to the server to push configuration changes, coordinate key rotations, and manage trust bundles. These calls are reflected in grpc_server_handled_total with grpc_service=~"com.spirl.private.signer.*".
  • Outbound (server → control plane): The server sends events back to the control plane for display in the Console. These calls are reflected in grpc_client_handled_total with grpc_service=~"com.spirl.events.*".
Inbound errors are monitored by Defakto

The Defakto team monitors the inbound control-plane-to-server connection. If errors are detected on this path, the Defakto team will proactively reach out.

Symptom: grpc_client_handled_total (server-to-control-plane) shows a non-zero error rate, or grpc_server_handled_total for com.spirl.private.signer.* services shows elevated errors.

What it means: The server and the Defakto control plane cannot communicate reliably. Outbound errors mean events used to populate the Console Dashboard are not being delivered. Events are buffered and retried, but the in-memory buffer size is limited and cannot sustain persistent disruptions in connectivity without losing visibility on the Console. Inbound errors mean the control plane cannot push configuration or coordinate key rotations with the server. Even with inbound errors, the server and agents will continue to operate normally for at least 7 days.
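
To make the buffering behavior concrete, here is an illustrative Python sketch of a bounded retry buffer. This is not Defakto's actual implementation; the capacity and drop-oldest eviction policy are assumptions. The point it demonstrates: once the buffer fills during an outage, the oldest events are evicted and that window of Console visibility is lost.

```python
from collections import deque


class BoundedEventBuffer:
    """Illustrative sketch (not Defakto's implementation): events are
    buffered for retry while the control plane is unreachable; when the
    buffer is full, the oldest events are silently dropped."""

    def __init__(self, capacity):
        self.events = deque(maxlen=capacity)
        self.dropped = 0  # count of events lost to eviction

    def enqueue(self, event):
        if len(self.events) == self.events.maxlen:
            self.dropped += 1  # deque(maxlen=...) evicts the oldest entry
        self.events.append(event)

    def flush(self, send):
        """Drain the buffer once connectivity is restored."""
        while self.events:
            send(self.events.popleft())
```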

Diagnosis

# Outbound: server-to-control-plane event delivery errors
sum by (grpc_code, grpc_method) (
  rate(grpc_client_handled_total{
    spirl_component="server",
    grpc_code!~"OK|Canceled"
  }[5m])
)

# Inbound: control-plane-to-server errors (monitored by Defakto)
sum by (grpc_code, grpc_method) (
  rate(grpc_server_handled_total{
    spirl_component="server",
    grpc_service=~"com.spirl.private.signer.*",
    grpc_code!~"OK|Canceled"
  }[5m])
)

# Check for control plane connectivity errors in logs
kubectl logs -n <namespace> -l app.kubernetes.io/name=spirl-server \
  --since=15m | grep -iE "control.plane|Unavailable|connection refused"

# Verify control plane connectivity using spirldbg
kubectl debug -it <TARGET_SERVER_POD> --image=public.ecr.aws/d1i7q6j7/spirldbg:latest \
--target=<TARGET_CONTAINER> --profile=general
$ spirldbg network-diagnostics
Running network diagnostics with "api.spirl.com" ...
✓ DNS lookup successful
✓ TLS dial successful
✓ HTTP/2 request successful
✓ gRPC unary request successful
✓ gRPC streaming request successful
# (Expected output for successful connection)

Remediation

  1. Verify network egress to the Defakto control plane is open (check firewall rules and NetworkPolicy if on Kubernetes). Ensure that the server is configured to use any proxies required for reaching services outside the Kubernetes cluster.
  2. If using a TLS-intercepting proxy, confirm the proxy's CA is included in controlPlane.relay.webPKISupplementalRootCAs so the server can validate the control plane connection.
  3. The server will continue to operate independently of the control plane using a snapshot of the last received configuration. This snapshot is persisted inside Kubernetes, so server instances can restart or scale without disruption. While disconnected, the following impacts may be observed:
    • The server will not receive configuration changes.
    • The server will not be able to rotate its signing keys.
    • The server will not receive trust bundle updates (including key rotations from other trust domain deployments).
tip

The default settings for key rotations provide a 7-day window between generating a new key and beginning to use it for SVID signing. The control-plane connection needs to remain offline for several days before a service impact in SVID signing will occur.

tip

Check the Defakto status page and contact support if the issue isn't explained by any outbound connectivity issues from the network.


4. Elevated latency

Symptom: p95 SVID mint latency (spirl_server_mint_svid_duration_seconds) exceeds 500ms, or gRPC latency (grpc_server_handling_seconds) climbs above 1s.

Diagnosis

# p95 SVID mint latency by type
histogram_quantile(0.95,
  sum by (le, svid_type) (
    rate(spirl_server_mint_svid_duration_seconds_bucket[5m])
  )
)

# p95 gRPC latency by method
histogram_quantile(0.95,
  sum by (le, grpc_method) (
    rate(grpc_server_handling_seconds_bucket{spirl_component="server"}[5m])
  )
)
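
Keep in mind that these p95 values are interpolated estimates from histogram buckets, not exact measurements. A simplified Python sketch of the interpolation histogram_quantile performs (real PromQL also special-cases the lowest bucket, so treat this as an approximation):

```python
def histogram_quantile(q, buckets):
    """Approximate PromQL histogram_quantile. `buckets` is a sorted list
    of (upper_bound, cumulative_count) pairs, ending with a +Inf bucket;
    the result is linearly interpolated within the target bucket."""
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if bound == float("inf"):
                return prev_bound  # cannot interpolate into the +Inf bucket
            in_bucket = count - prev_count
            frac = (rank - prev_count) / in_bucket if in_bucket else 0.0
            return prev_bound + (bound - prev_bound) * frac
        prev_bound, prev_count = bound, count
    return prev_bound
```

For example, with 200 observations of which 180 fall at or below 100ms, the p95 lands inside the 100ms-500ms bucket and is interpolated between those bounds; widening your buckets widens the error bars on the estimate.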

If latency is elevated, check for resource saturation — CPU throttling is the most common cause.


5. Resource Saturation

Symptom: CPU usage approaches the configured limit, memory usage exceeds 80% of limit, or OOM events are recorded.

CPU Saturation

# CPU usage as a percentage of limit, per pod
100 * sum by (pod) (
  rate(container_cpu_usage_seconds_total{namespace=~"^tdd-.*"}[5m])
)
/ sum by (pod) (
  kube_pod_container_resource_limits{namespace=~"^tdd-.*", resource="cpu"}
)

If CPU usage is consistently above 70% of the limit, consider:

  • Enabling or adjusting HPA
  • Increasing the CPU limit
  • Distributing agents across more trust domain deployments
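
If you opt for HPA, a minimal sketch of an autoscaling/v2 manifest follows. The Deployment name spirl-server, the replica bounds, and the 70% CPU target are assumptions; match them to your Helm release and observed load:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: spirl-server
  namespace: <namespace>
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: spirl-server   # assumed Deployment name; verify in your install
  minReplicas: 2
  maxReplicas: 6
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```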

Memory Saturation

# Memory usage as a percentage of limit, per pod
100 * sum by (pod) (
  container_memory_working_set_bytes{namespace=~"^tdd-.*"}
)
/ sum by (pod) (
  kube_pod_container_resource_limits{namespace=~"^tdd-.*", resource="memory"}
)

The Defakto services are written in Go and tend to maintain stable memory usage, scaling roughly in proportion to the number of agents connected to the server. If memory saturation persists, increase the memory requests/limits for the servers or the number of replicas in the deployment.

tip

If you observe memory usage trending upward over time without a corresponding increase in agent count, please contact Defakto support.

For OOM events:

# OOM kill count
sum by (pod) (
  container_oom_events_total{namespace=~"^tdd-.*"}
)

Increase the memory limit if OOM events are occurring. Start with a limit of 512Mi and adjust based on observed usage.


6. Pod Restarts

Symptom: Pod restart count increases unexpectedly.

# Recent restart rate
rate(kube_pod_container_status_restarts_total{namespace=~"^tdd-.*"}[15m]) > 0

# Get restart count and last restart time
kubectl get pods -n <namespace> -o wide

# Check what the pod exited with
kubectl describe pod -n <namespace> <pod-name> | grep -A5 "Last State"

# Get logs from the previous (crashed) container
kubectl logs -n <namespace> <pod-name> --previous

OOMKilled: Increase the memory limit (see Resource Saturation).

Exit code 1 or 2 (startup failure): Usually a configuration error. Check the logs from the previous container for parsing or validation errors.

Repeated restarts (CrashLoopBackOff): Kubernetes restarts a repeatedly crashing container with an exponentially increasing delay, so restarts become progressively slower. Review the previous container's logs for a root cause before taking further action.
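
For reference, a sketch of the backoff schedule the kubelet applies, based on its documented behavior: the delay starts at 10s, doubles per restart, and caps at 5 minutes (it resets after the container runs cleanly for 10 minutes). The helper below is illustrative, not a Kubernetes API:

```python
def crashloop_delays(restarts, base=10.0, cap=300.0):
    """Approximate kubelet CrashLoopBackOff delays (seconds) for the
    first `restarts` restarts: 10s, 20s, 40s, ... capped at 300s."""
    return [min(base * 2 ** i, cap) for i in range(restarts)]
```

So after a handful of crashes the pod sits idle for 5 minutes between attempts, which is why a CrashLoopBackOff pod can look "stuck" while you are watching it.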

Understanding Server Health Checks

The server includes a health check subsystem (the readyChecker) that monitors the status of internal components. Kubernetes liveness and readiness probes use this subsystem to determine whether the server is healthy and ready to accept agent connections.

During startup, the server must complete several prerequisite steps before the readyChecker is initialized:

  1. Load or receive configuration — on first boot, the server connects to the Defakto control plane, authenticates, establishes a relay connection, and receives its trust domain configuration via PushResources. On subsequent startups, the server loads its configuration from an in-cluster cache (a Kubernetes Custom Resource) and can start independently of the control plane.
  2. Initialize the readyChecker — individual health checks are registered and begin in a down state
  3. Start service subsystems — the agent-facing gRPC server, key manager, federation poller, and other subsystems begin starting

Once the readyChecker is running, it monitors the following checks:

| Health check | What it waits for |
| --- | --- |
| configCache | Trust domain configuration has been received and cached |
| bundleCache | A trust bundle has been assembled from partial bundles coordinated with the control plane |
| agentServer | The agent-facing gRPC server is listening and ready to accept connections |
| jwtKeyCache | JWT signing keys used to issue short-term sessions to agents have been loaded (these are not the JWT Signing Authority keys used for JWT SVIDs) |
| jwksCache | Cache of JWKS keysets from external issuers has been initialized (used, for example, for PSAT Agent Attestation) |

The server will not pass readiness probes until all health checks report up. This is by design — agents should not be routed to a server that is not ready to service requests.

Startup time expectations

On a first installation, the server must wait for the control plane to deliver its configuration, generate signing keys, and build the initial trust bundle. This typically completes within 6 minutes.

On subsequent startups, the server loads its configuration and trust bundle from an in-cluster cache and can become ready without waiting for the control plane. This means servers can restart, scale up, or recover from node failures even if the control plane is temporarily unreachable. Once the control plane connection is re-established, the server will refresh its configuration and trust bundle in the background.

If the server is stuck in a not-ready state or failing liveness probes, inspect the health check status in the logs. Look for readyChecker log entries that report the current status and the state of individual checks:

kubectl logs -n <namespace> <pod-name> | grep readyChecker

Each health check entry includes:

  • status — the overall health status (up or down)
  • checks — a map of individual subsystem checks, each showing Status, ContiguousFails, LastSuccessAt, and LastFailureAt

A health check that remains down indicates the specific subsystem that is blocking startup:

  • configCache stays down: The server has not received or loaded its configuration. On first boot, this requires a connection to the control plane — check Control plane connectivity errors. On subsequent startups, the configuration is loaded from the in-cluster cache; check that the data.spirl.com Custom Resource in the trust domain namespace has not been deleted.
  • bundleCache stays down: The trust bundle has not been assembled. This requires the control plane to coordinate key rotations across all deployments in the trust domain. Check the control plane connection and look for errors in regionAuthorityCache log entries.
  • agentServer stays down: The agent-facing gRPC server has not started. Check the server logs for binding errors or port conflicts.
  • jwtKeyCache or jwksCache stays down: JWT signing key material has not been loaded. This typically resolves once the configuration is available.
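
The aggregation the readyChecker performs can be sketched as follows. This is illustrative only, not Defakto's implementation; the field names mirror the log fields listed above:

```python
import time


class Check:
    """One subsystem check, mirroring the readyChecker log fields
    (Status, ContiguousFails, LastSuccessAt, LastFailureAt)."""

    def __init__(self):
        self.status = "down"          # checks start down until first success
        self.contiguous_fails = 0
        self.last_success_at = None
        self.last_failure_at = None

    def report(self, ok):
        now = time.time()
        if ok:
            self.status = "up"
            self.contiguous_fails = 0
            self.last_success_at = now
        else:
            self.status = "down"
            self.contiguous_fails += 1
            self.last_failure_at = now


def overall_status(checks):
    """Readiness is up only when every registered check is up."""
    return "up" if all(c.status == "up" for c in checks.values()) else "down"
```

This is why a single stuck subsystem (for example, bundleCache) keeps the whole pod out of the Service endpoints even when everything else is healthy.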

Diagnostic Command Reference

Increasing Log Verbosity

For deeper troubleshooting, enable debug logging on the server by setting the SPIRL_LOG_DEBUG environment variable. Add the following to your server Helm values and restart:

trustDomainDeployment:
  deployment:
    env:
      SPIRL_LOG_DEBUG: "true"

caution

Debug logging increases log volume. Disable it after troubleshooting is complete.

Useful Commands

# List all server pods across trust domain namespaces
kubectl get pods -A -l app.kubernetes.io/name=spirl-server

# Check metrics endpoint on a specific pod
kubectl port-forward -n <namespace> <pod-name> 9090:9090
# In another shell:
curl -s http://localhost:9090/metrics | grep spirl_server

# Recent error logs across all server pods
kubectl logs -n <namespace> -l app.kubernetes.io/name=spirl-server \
--since=30m | grep -E "ERROR|WARN"

# Check HPA status
kubectl get hpa -n <namespace>
kubectl describe hpa -n <namespace>

Escalation

If an issue cannot be resolved using this runbook:

  1. Collect the relevant log output (minimum 15 minutes before and after the incident)
  2. Note the specific metric values and PromQL queries that showed the anomaly
  3. Record the trust domain name and deployment region
  4. Contact Defakto Support with this information

For issues affecting certificate delivery to production workloads, treat as high priority.