Defakto Agent Runbook
This runbook provides diagnostic and remediation steps for operational issues with Defakto Agents. Use it when workloads are not receiving credentials, when agent metrics indicate a problem, or when agents appear disconnected from Trust Domain Servers.
All examples use metric names and PromQL that work with any Prometheus-compatible backend (Thanos, VictoriaMetrics, etc.). Grafana users can use these queries in the Explore view or add them to panels.
Understanding the Agent's Role
The agent sits between workloads and the Trust Domain Server. It:
- Attests workloads — identifies the process or pod making an identity request, collects evidence about the process, and forwards the collected evidence to the server
- Assists in SVID issuance — forwards workload requests with collected evidence to the server for signing and relays the answer back to the workload
- Distributes Trust Bundles — monitors for changes to the trust configuration of the trust domain and distributes updates to workloads
- Notifies workloads — workload clients that maintain a connection to the agent will be notified of a new SVID (X.509 only) for rotations or attribute changes
The agent cannot serve credentials independently. Every mint operation requires a round trip to the Trust Domain Server, so server-side issues will surface in agent metrics. The optional Downtime Protection feature can be enabled in the agent to provide a cluster cache of SVIDs for reuse during server or network interruptions.
Key Metrics Reference
SVID Lifecycle Metrics
| Metric | Labels | What it measures |
|---|---|---|
| spirl_agent_mint_svid_total | svid_type, status_code | SVID minting attempts, by outcome |
| spirl_agent_mint_svid_duration_seconds | svid_type, status_code | Histogram of time taken to mint SVIDs (covering key generation, the RPC to the server, and response parsing) |
| spirl_agent_svid_expiring | svid_type | Gauge of SVIDs that have reached more than 1 minute past their half-life (for workloads maintaining an active streaming connection to the agent) |
| spirl_agent_svid_outdated | svid_type | Gauge of SVIDs that are currently expired (for workloads maintaining an active streaming connection to the agent) |
| spirl_agent_bundle_update_total | svid_type, status_code | Counter of trust bundle updates sent to workloads |
Workload API Metrics
| Metric | Labels | What it measures |
|---|---|---|
| spirl_agent_workload_attestation_total | svid_type, status_code | Workload attestation attempts |
| spirl_agent_workload_attestation_duration_seconds | svid_type, status_code | Histogram of how long it takes to collect evidence (attestation data) about a workload |
| spirl_agent_workload_api_connections_total | status_code | Total workload connections opened |
| spirl_agent_workload_api_connections | — | Currently active workload connections (gauge) |
gRPC Metrics
| Metric | Key Labels | What it measures |
|---|---|---|
| grpc_server_handled_total | grpc_code, grpc_method, grpc_service | Completed Workload API calls (the agent is the server for workload requests) |
| grpc_server_handling_seconds | grpc_code, grpc_method | Workload API latency histogram (requires emitLatencyMetrics) |
| grpc_client_handled_total | grpc_code, grpc_method, grpc_service | Outbound API calls (the agent is the client of the Defakto Server) |
| grpc_client_handling_seconds | grpc_code, grpc_method | Outbound API call latency histogram (requires emitLatencyMetrics) |
All gRPC metrics carry constant labels for spirl_component, spirl_trust_domain, spirl_trust_domain_deployment, and spirl_app_version.
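These constant labels can scope any query in this runbook to a single trust domain or deployment. For example (the trust domain value below is a placeholder):
# Agent-to-server errors for one trust domain only
sum by (grpc_code) (
  rate(grpc_client_handled_total{spirl_component="agent", spirl_trust_domain="example.org", grpc_code!~"OK|Canceled"}[5m])
)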
Key status_code Values for Agent Metrics
| status_code | Metric context | Meaning |
|---|---|---|
| ok | All | Successful |
| signer_mint_failed | mint_svid, svid_rotation | An error occurred when requesting the server to mint the SVID (network failure, server error, server rejection, etc.) |
| key_generation_failed | mint_svid | The agent could not generate a key pair |
| attestation_failed | mint_svid, workload_attestation | Workload or agent attestation failed |
Healthy Baseline
| Signal | Expected state |
|---|---|
| SVID mint success rate | >99% |
| SVID expiring | 0 (non-zero indicates the agent is falling behind on proactive credential rotation) |
| SVID outdated | 0 (non-zero indicates a workload has an expired credential) |
| Workload attestation success rate | >99% |
| Agent-to-server gRPC error rate | transient only |
| Pod/process restarts | 0 in the past hour |
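If you alert on this baseline, the thresholds translate directly into PromQL. The expressions below are a minimal sketch; tune the thresholds and windows to your environment:
# Fires when the SVID mint success rate drops below the 99% baseline
(
  sum(rate(spirl_agent_mint_svid_total{status_code="ok"}[5m]))
  /
  sum(rate(spirl_agent_mint_svid_total[5m]))
) * 100 < 99
# Fires when any workload holds an expired credential
sum(spirl_agent_svid_outdated) > 0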
Checking Overall Health
Before diving into a specific symptom, run these queries for a snapshot:
# SVID mint success rate (last 5 minutes)
(
sum(rate(spirl_agent_mint_svid_total{status_code="ok"}[5m]))
/
sum(rate(spirl_agent_mint_svid_total[5m]))
) * 100
# Workload attestation success rate
(
sum(rate(spirl_agent_workload_attestation_total{status_code="ok"}[5m]))
/
sum(rate(spirl_agent_workload_attestation_total[5m]))
) * 100
# Agent-to-server error rate (expect transient only)
(
sum(rate(grpc_client_handled_total{spirl_component="agent", grpc_code!~"OK|Canceled"}[5m]))
/
sum(rate(grpc_client_handled_total{spirl_component="agent"}[5m]))
) * 100
# SVID cache health (expected to be 0 when healthy)
spirl_agent_svid_expiring
spirl_agent_svid_outdated
On Kubernetes:
# List all agent pods
kubectl get pods -n spirl-system -l app=spirl-agent
# Check a specific agent pod's status
kubectl describe pod -n spirl-system <pod-name>
# Get last 100 logs from the pod
kubectl logs -n spirl-system --tail=100 <pod-name>
On Linux:
# Check service status
systemctl status spirl-agent
# Tail logs
journalctl -u spirl-agent -n 100
Scenarios
1. Workloads Not Receiving SVIDs
Symptom: Workloads report certificate errors or timeouts, SPIFFE socket errors, or connection refused; FetchX509SVID / FetchJWTSVID calls are failing or hanging; or spirl_agent_svid_expiring has a non-zero value.
What it means: The workload is not receiving a valid SVID from the agent. This can be caused by infrastructure issues, workload misconfiguration, or agent/server failures. How this failure presents to the workload depends on the type of request:
- X.509 SVIDs (streaming): The agent handles minting and rotation on behalf of the workload. If the agent encounters a server-side or network error, it does not forward the error to the workload. Instead, the agent retries internally and the workload's request hangs until the agent successfully mints an SVID, or the workload's own timeout is reached. The workload will not see a gRPC error — it will simply not receive an SVID message.
- JWT SVIDs (unary): Each JWT is individually requested and signed. Unlike X.509, server-side errors are returned directly to the workload as gRPC errors. The workload or SPIFFE SDK must handle retries itself.
- Trust bundles (streaming): The agent serves trust bundles from its own local copy and will not begin listening on the workload socket until it has received an initial copy from the server. If the agent has not yet completed startup, workloads will see socket not found or connection refused errors.
Start with the preliminary checks below before moving to metrics-based diagnosis.
Check: Is the workload socket accessible?
On Kubernetes, the SPIFFE Workload API socket must be mounted into each workload pod. The Defakto CSI driver handles this automatically using a mutating admission webhook that injects the CSI volume mount and adds the SPIFFE_ENDPOINT_SOCKET environment variable.
1. Verify the socket file exists inside the workload container:
# The default path is shown below — check the SPIFFE_ENDPOINT_SOCKET env variable for the actual path
kubectl exec -n <namespace> <workload-pod> -- ls -la /spirl-agent-socket/agent.sock
If the socket exists, the CSI driver is working and the socket file is accessible to the workload; skip ahead to Check: Is the workload consuming SVIDs correctly? If the socket is not found, continue to step 2.
2. Check if the workload pod has the socket volume mounted:
kubectl describe pod -n <namespace> <workload-pod> | grep -A3 -B2 "spiffe"
If no volume is shown with driver csi.spiffe.io, the admission webhook did not inject the CSI volume. Continue to step 3.
3. Check the admission webhook configuration:
Not all customers use the admission webhook to mount the CSI driver. For alternate configurations, see Integrating SPIFFE in your environment.
# View the webhook and its selector rules
kubectl get mutatingwebhookconfigurations spirl-controller-webhook -o yaml
Look at the objectSelector or namespaceSelector in the output to determine what the webhook matches on. By default, the webhook uses an object selector matching pods with the label k8s.spirl.com/spiffe-csi: enabled, but this can be overridden during installation.
4. Verify the workload has the correct label:
The webhook will only inject the CSI volume into pods that match its selector. Verify the workload has the required label as per the webhook configuration verified above:
Pod label (default):
# Check the pod's labels for the expected selector label
kubectl get pod -n <namespace> <workload-pod> --show-labels
By default, the pod must have the label k8s.spirl.com/spiffe-csi: enabled. If the label is missing, add it to the pod's template in the owning Deployment, StatefulSet, or DaemonSet, and roll out the change.
Namespace label:
# Check the namespace labels for the expected selector label
kubectl get ns <namespace> --show-labels
If the webhook is configured with a namespaceSelector, the namespace must carry the matching label. If the label is missing, add it:
kubectl label ns <namespace> <label-key>=<label-value>
Existing pods in the namespace will need to be restarted to pick up the CSI volume injection.
5. Verify the CSI driver DaemonSet is running on the workload's node:
# Find which node the workload pod is running on
kubectl get pod -n <namespace> <workload-pod> -o wide
# Note the NODE column in the output
# Verify a CSI driver pod is running on that same node
kubectl get pods -n spirl-system -l app=spiffe-csi-driver -o wide | grep <node-name>
# Check the DaemonSet health
kubectl get daemonset -n spirl-system
EKS Fargate pods do not support DaemonSets, so the CSI driver cannot deliver a SPIFFE socket to those pods. See the EKS Fargate documentation for more information on Fargate's limitations.
If you wish to use Defakto with Fargate, please reach out to Defakto support as other integration options may be available.
6. Restart the workload pod.
If this is a new installation, pods created before the webhook was installed will not have the mount injected. Restart the pod so the SPIFFE injection can occur, as shown below.
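If the pod is managed by a Deployment, one way to do this (the deployment name is a placeholder):
# Recreate the pods so the webhook can inject the CSI volume on admission
kubectl rollout restart deployment -n <namespace> <workload-deployment>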
Check: Is the workload consuming SVIDs correctly?
The SPIFFE Workload API uses different patterns for X.509 and JWT SVIDs:
- X.509 SVIDs are delivered over a streaming gRPC call (FetchX509SVID). The workload should maintain an open stream and receive updates when the SVID is rotated or the trust bundle changes. If the workload makes a single fetch and closes the connection, it will not receive rotated SVIDs and the credential will eventually expire.
- JWT SVIDs are delivered over a unary gRPC call (FetchJWTSVID). The workload must request a new JWT before the current one expires.
If workloads are losing credentials over time but the agent and server are healthy, check whether the workload's SPIFFE SDK client is maintaining a persistent stream (for X.509) or that the workload is re-fetching before expiry (for JWT).
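One heuristic for spotting fetch-and-close clients from the agent's metrics is to compare how many connections are opened against how many are held open. A high open rate alongside a near-zero active-connection gauge suggests workloads are polling rather than streaming:
# Connections opened per second (high values with few held open suggest fetch-and-close clients)
sum(rate(spirl_agent_workload_api_connections_total[5m]))
# Connections currently held open
sum(spirl_agent_workload_api_connections)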
Check: Are SVIDs expiring or outdated?
The spirl_agent_svid_expiring and spirl_agent_svid_outdated metrics indicate whether the agent is falling behind on SVID renewals for workloads that are maintaining a stream for updates.
# SVIDs past their half-life that have not been renewed
spirl_agent_svid_expiring
# SVIDs that have fully expired without being renewed
spirl_agent_svid_outdated
A non-zero spirl_agent_svid_expiring value means the agent has not been able to renew the SVID after reaching its half-life. The metric includes a 1-minute buffer before considering an SVID expiring, so any non-zero value should be considered abnormal. The agent is still attempting to renew and will continue retrying. This is often caused by a transient server or network issue, but can also be caused by configuration errors.
A non-zero spirl_agent_svid_outdated value means an SVID has fully expired. This is a more urgent condition — the workload's credential is no longer valid and the agent has failed to distribute a fresh credential. Check for persistent server connectivity or issuance failures.
Diagnosis (metrics-based)
If the preliminary checks above do not reveal the issue, use metrics to identify the failure:
# SVID mint failure breakdown by cause
sum by (svid_type, status_code) (
rate(spirl_agent_mint_svid_total{status_code!="ok"}[5m])
)
# SVID rotation failure rate
sum by (svid_type, status_code) (
rate(spirl_agent_svid_rotation_total{status_code!="ok"}[5m])
)
# JWT-specific errors returned to workloads
sum by (grpc_code) (
rate(grpc_server_handled_total{
spirl_component="agent",
grpc_method="FetchJWTSVID",
grpc_code!~"OK|Canceled"
}[5m])
)
The grpc_server_handled_total metric tracks the agent's workload-facing API. For X.509 SVIDs, this metric is unlikely to show errors because the agent absorbs server-side failures and retries internally — the workload just sees a hanging request.
For JWT SVIDs, errors are returned directly to the workload and will appear in this metric. Focus on spirl_agent_mint_svid_total and grpc_client_handled_total (agent-to-server) for the most reliable picture of failures.
Use the status_code to narrow down the cause:
signer_mint_failed: The agent received an error when attempting to contact the server.
This indicates either a networking issue or a rejection of the mint request by the server. Check the gRPC metrics for a breakdown of the failure reason (see Server Connectivity Issues), or check server health using the Server Runbook.
When using an X509Source the agent will be responsible for retries and will stream the X.509 SVID to the workload when available. The agent internally applies retry logic using an exponential backoff algorithm: starting at ~500ms and capping at 1 minute.
When using a JWTSource, the FetchJWTSVID call in most SPIFFE libraries will not retry automatically and needs to be retried by the caller upon receiving an error, including transient issues.
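To attribute signer_mint_failed to a transport-level cause, break the agent-to-server calls down by gRPC code and method:
# Failure reasons for agent-to-server RPCs
sum by (grpc_code, grpc_method) (
  rate(grpc_client_handled_total{spirl_component="agent", grpc_code!~"OK|Canceled"}[5m])
)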
attestation_failed: The agent failed to attest the workload.
See Workload Attestation Failures to troubleshoot attestation issues.
key_generation_failed: The agent could not generate a key pair.
This is rare and typically indicates a resource issue on the agent host (CPU or entropy starvation). Check CPU utilization on the affected pod/host and the kernel logs on the host.
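If entropy starvation is suspected on a Linux host, a quick check (on modern kernels the pool is effectively always full, so a low value here is notable):
# Available kernel entropy; values well below 256 are suspicious
cat /proc/sys/kernel/random/entropy_avail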
2. Elevated Latency Serving Workloads
Symptom: grpc_server_handling_seconds with spirl_component="agent" shows high p95 or p99 latency for workload requests.
grpc_server_handling_seconds is only populated when emitLatencyMetrics is enabled. See the Agent installation guide.
What it means: The agent is taking longer than expected to serve workload requests. Because the agent must contact the server for every SVID mint, elevated server-side latency or network latency will be included in these calculations.
Diagnosis
# Workload API p95 latency by method
histogram_quantile(0.95,
sum by (le, grpc_method) (
rate(grpc_server_handling_seconds_bucket{spirl_component="agent"}[5m])
)
)
# Agent-to-server latency (root cause check)
histogram_quantile(0.95,
sum by (le, grpc_method) (
rate(grpc_client_handling_seconds_bucket{spirl_component="agent"}[5m])
)
)
If agent-to-server latency is high, see the Server Runbook — Elevated Latency.
If agent-to-server latency is normal but workload API latency is high, check the agent's CPU usage and check for workload attestation slowness:
# Attestation duration p95
histogram_quantile(0.95,
sum by (le, svid_type) (
rate(spirl_agent_workload_attestation_duration_seconds_bucket[5m])
)
)
High attestation duration can indicate:
- Kubelet API server is slow to respond (for pod attestation)
- High number of concurrent workload SVID requests
- CPU saturation on the agent node
- Using an attestation plugin that is introducing latency
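One rough way to isolate agent-local overhead is to subtract the upstream p95 from the workload-facing p95. This is an approximation, not a true decomposition, and both histograms require emitLatencyMetrics:
# Approximate agent-local p95 overhead (workload-facing latency minus agent-to-server latency)
histogram_quantile(0.95, sum by (le) (rate(grpc_server_handling_seconds_bucket{spirl_component="agent"}[5m])))
-
histogram_quantile(0.95, sum by (le) (rate(grpc_client_handling_seconds_bucket{spirl_component="agent"}[5m])))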
3. Workload Attestation Failures
Symptom: spirl_agent_workload_attestation_total shows failures, or spirl_agent_mint_svid_total is returning status_code attestation_failed.
What it means: The agent failed to collect all of the evidence it was configured to collect about the workload. This failure prevents the agent from requesting an SVID from the server.
Diagnosis
# Attestation failure breakdown by cause
sum by (svid_type, status_code) (
rate(spirl_agent_workload_attestation_total{status_code!="ok"}[5m])
)
On Kubernetes:
kubectl logs -n spirl-system <pod-name> --since=15m
On Linux:
journalctl -u spirl-agent --since "15 minutes ago"
Check the status_code label on spirl_agent_mint_svid_total for more detail:
attestation_failed: The agent failed to gather attestation evidence about a workload.
The configured workload attestor on the agent was unable to complete evidence collection about a workload due to an error. This can happen if:
- For Kubernetes workloads, the Kubelet API may be unavailable — this can occur on freshly booted workers that start up workloads before kubelet provisioning is complete
- The node's container runtime socket is not accessible to the agent
- A custom workload attestor returned an error or timed out (see workload attestation extension docs)
Some attestors are optional, and the agent will still attempt the mint SVID request to the server even if attestation fails. If none of the attributes produced by that workload attestor are needed by the server, the server can still issue an SVID despite the attestation failure.
When using an X509Source, the results of attestation are cached and attestation is re-executed every 30 minutes. If the workload attributes change, such as a runtime reconfiguration of a process, a new SVID will be requested. However, if the server returns an InvalidArgument error to the gRPC request, the agent discards the cached attestation results and re-attests the workload on the next attempt to retrieve an SVID. This is a self-healing behavior — you may see additional attestation attempts in the logs while the agent recovers from a temporary issue with attestation.
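A rough signal for re-attestation churn is the ratio of attestation attempts to new workload connections. Because SVID requests also trigger attestation, a ratio above 1 can be normal; treat a sudden rise as the signal, not the absolute value:
# Attestation attempts per workload connection over 30 minutes
sum(rate(spirl_agent_workload_attestation_total[30m]))
/
sum(rate(spirl_agent_workload_api_connections_total[30m]))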
4. Server Connectivity Issues
Symptom: grpc_client_handled_total with spirl_component="agent" shows a rising error rate, with grpc_code="Unavailable".
What it means: The agent cannot reach the Trust Domain Server, so new SVID minting will fail. Workloads that already hold a valid SVID can continue to use it until it expires.
Diagnosis
# Agent-to-server errors by code
sum by (grpc_code) (
rate(grpc_client_handled_total{
spirl_component="agent",
grpc_code!~"OK|Canceled"
}[5m])
)
On Kubernetes:
# Check for connectivity errors in logs
kubectl logs -n spirl-system <pod-name> --since=15m
On Linux:
# Check for connectivity errors in logs
journalctl -u spirl-agent --since "15 minutes ago"
# Check the server health endpoint directly
curl -s https://<server-endpoint>/health
Remediation
The agent uses multiple strategies to detect a failed network connection, including an immediate redial upon receiving an Unavailable result from a server or load balancer. This puts the agent into a recovery mode in which it dials the server with an exponential backoff strategy, starting at 1 second and capping at 2 minutes, until it successfully connects to a healthy server.
Multiple endpoints can be configured in the agent configuration and are used in priority order. This supports regional failover, or covers the case where a geo load balancer used as the first target is slow to react to site outages. See Endpoint Configuration for how to configure multiple server targets.
By default, the agent will keep a gRPC connection for up to 30 minutes. This allows connections to be rebalanced if a site that was previously unavailable becomes available again. The agent only abandons the current connection for a newly dialed one once it passes a health check, so this redialing behavior causes no interruption to service.
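To distinguish a flapping connection from a hard outage, look at how Unavailable responses accumulate over a longer window; steady bursts suggest an intermittent path or an unhealthy load balancer target:
# Unavailable responses from the server over the past hour
sum(increase(grpc_client_handled_total{spirl_component="agent", grpc_code="Unavailable"}[1h]))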
On Kubernetes:
# Check agent configuration for server endpoint
kubectl get pods -n spirl-system <pod-name> -o jsonpath='{.spec.containers[0].env}' | grep -A1 SPIRL_ENDPOINT
On Linux:
# Review the agent config file for the endpoint
grep -i "endpoint\|server" /etc/spirl-agent/config.yaml
If connectivity is persistently failing:
- Check network policies — on Kubernetes, verify that NetworkPolicy allows egress from spirl-system to the servers.
- Check firewall rules — ensure the node can reach the configured gRPC server.
- Check TLS — verify the server certificate is valid and the agent has the correct CA configured. Look for tls: errors in logs (see the openssl check after this list).
- Check server endpoint configuration — verify the agent's endpoint points to the expected address.
- Check load balancers — verify any load balancers in use are configured as expected and have the proper targets passing health checks.
- Check server health — use the Server Runbook to verify servers are healthy.
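For the TLS check above, you can inspect the certificate chain the server presents from the agent's host. The port below is a placeholder; use the port your agents are configured to dial:
# Show the server's certificate chain as seen from this host
openssl s_client -connect <server-endpoint>:443 -showcerts </dev/null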
5. Resource Saturation
Symptom: Agent CPU or memory is approaching configured limits, or OOM events are occurring.
# CPU usage as a percentage of limit, per agent pod
100 * sum by (pod) (
rate(container_cpu_usage_seconds_total{namespace="spirl-system"}[5m])
)
/ sum by (pod) (
kube_pod_container_resource_limits{namespace="spirl-system", resource="cpu"}
)
# Memory usage as a percentage of limit
100 * sum by (pod) (
container_memory_working_set_bytes{namespace="spirl-system"}
)
/ sum by (pod) (
kube_pod_container_resource_limits{namespace="spirl-system", resource="memory"}
)
# OOM events
sum by (pod) (
container_oom_events_total{namespace="spirl-system"}
)
Agents run as a single instance per node; you cannot scale by adding more agent instances to a node. Mitigation options:
- Increase CPU/memory limits in your Helm values
- Reduce the number of concurrent workloads on heavily loaded nodes
- If high CPU correlates with attestation activity, check for workloads that are connecting and disconnecting at a high rate, since each connection or SVID request triggers a fresh attestation (see the query below)
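A quick check for that churn, assuming the default instance label added at scrape time:
# New workload connections per second, per agent instance
sum by (instance) (rate(spirl_agent_workload_api_connections_total[5m]))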
On Kubernetes:
# Check current resource usage
kubectl top pods -n spirl-system
To update resource limits, see the Helm Values Reference — Agent Resources.
On Linux:
# Check resource usage
ps -o pid,ppid,cmd,%mem,%cpu --sort=-%cpu | head -20
# For systemd services, update resource limits
systemctl edit spirl-agent
# Add:
# [Service]
# MemoryMax=256M
# CPUQuota=50%
systemctl daemon-reload
systemctl restart spirl-agent
6. Agent Pod or Process Restarts
Symptom: Pod restart count increases, the agent enters CrashLoopBackOff, or the agent process is repeatedly cycling.
# Agent restart rate
rate(kube_pod_container_status_restarts_total{namespace="spirl-system"}[15m]) > 0
On Kubernetes:
# Check restart count and status (look for CrashLoopBackOff or Error)
kubectl get pods -n spirl-system -l app=spirl-agent
# View exit reason and restart count
kubectl describe pod -n spirl-system <pod-name> | grep -A5 "Last State"
# View logs from the crashed container
kubectl logs -n spirl-system <pod-name> --previous
# View logs from the current container (if it's still starting)
kubectl logs -n spirl-system <pod-name>
On Linux:
systemctl show spirl-agent --property=NRestarts
journalctl -u spirl-agent -b -1 --no-pager | tail -50
OOMKilled: Increase memory limits (see Resource Saturation).
Exit code 1 at startup: The agent failed to start due to invalid configuration or an unrecoverable error.
Check the agent logs from the failed container (kubectl logs --previous) for the specific error message.
CrashLoopBackOff: The agent is repeatedly crashing and Kubernetes is backing off restarts. This is most commonly caused by a persistent configuration error (exit code 1) or an OOM kill, but can also be caused by network issues causing the agent to not pass health checks. Check kubectl describe pod for the exit code and reason, then follow the relevant guidance above.
Understanding Agent Health Checks
The agent includes a health check subsystem that monitors the status of internal components. Kubernetes liveness and readiness probes use this subsystem to determine whether the agent is healthy.
During startup, individual health checks begin in a down state and transition to up once their subsystem is operational. For example, the bundleCache health check starts as down and transitions to up once the agent has received and cached an initial trust bundle from the server.
If the agent is stuck in a not-ready state or failing liveness probes, inspect the health check status in the logs. Look for readyChecker log entries that report the current status and the state of individual checks:
On Kubernetes:
kubectl logs -n spirl-system <pod-name> | grep readyChecker
On Linux:
journalctl -u spirl-agent | grep readyChecker
Each health check entry includes:
- status — the overall health status (up or down)
- checks — a map of individual subsystem checks, each showing Status, ContiguousFails, LastSuccessAt, and LastFailureAt
A health check that remains down indicates the specific subsystem that is blocking startup. For example, if bundleCache stays down, the agent has not been able to retrieve an initial trust bundle from the server — check Server Connectivity Issues.
Diagnostic Command Reference
Increasing Log Verbosity
For deeper troubleshooting, enable debug logging on the agent. See the Agent logging documentation for instructions on enabling SPIRL_LOG_DEBUG for Kubernetes, Linux, and Node Group deployments.
Debug logging increases log volume and can be unwieldy. Disable it after troubleshooting is complete.
Escalation
If an issue cannot be resolved using this runbook:
- Collect agent logs covering at least 15 minutes before and after the incident
- Note the specific metric values and PromQL queries that showed the anomaly
- Record the cluster name, node name (for DaemonSet agents), and trust domain
- If the issue affects server connectivity, run the Server Runbook in parallel
- Contact Defakto Support with this information
For issues resulting in workloads losing all credentials (SVID expired, cache empty), treat the incident as high priority.