Kubernetes Metrics
This guide focuses on monitoring the Kubernetes platform itself: Pod health, resource utilization, node status, and cluster state. These metrics are essential for understanding the runtime health of any Kubernetes workload, including Defakto components.
Overview
Monitoring the Kubernetes runtime provides visibility into:
- Resource utilization - CPU, memory, network, and storage usage
- Pod health - Readiness, restarts, and lifecycle events
- Node status - Capacity, conditions, and pressure indicators
- Cluster state - Overall health and resource availability
This platform-level monitoring complements application-specific metrics (see Server Metrics and Agent Metrics) to provide comprehensive observability.
Kubernetes Runtime Metrics
Monitoring pod and container metrics is critical to understanding the operational health of your Defakto installation. The Kubernetes runtime collects and exposes data on the host resources made available to Defakto components and on how much of those resources the components actually use.
Essential Kubernetes Metrics
Monitor these key metrics for Defakto pods and the cluster:
Pod-Level Metrics
- CPU usage - container_cpu_usage_seconds_total
- Memory usage - container_memory_working_set_bytes
- Memory limits - container_spec_memory_limit_bytes
- CPU throttling - container_cpu_cfs_throttled_seconds_total
- Network I/O - container_network_receive_bytes_total, container_network_transmit_bytes_total
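The CPU, throttling, and network metrics above are cumulative counters, so what matters is how fast they grow, not their raw value. As a rough illustration of what a query such as rate(container_cpu_usage_seconds_total[5m]) works out, the Python sketch below derives a per-second rate from two samples. The function name and sample values are assumptions made for this example, and Prometheus's own rate() also extrapolates to the window boundaries, which this sketch does not.

```python
def counter_rate(v_start, v_end, seconds):
    """Per-second rate of a cumulative counter between two samples."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    # A decrease means the counter was reset (for example, the container
    # restarted), so the increase is counted from zero.
    increase = v_end - v_start if v_end >= v_start else v_end
    return increase / seconds


# Illustrative samples taken 300 seconds apart (values are made up).
cpu_cores = counter_rate(1200.0, 1275.0, 300)      # average cores in use
throttled_per_s = counter_rate(40.0, 46.0, 300)    # seconds throttled per second
print(cpu_cores, throttled_per_s)
```

A CPU rate of 0.25 here means the container used a quarter of one core on average over the window.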
Pod Health & Lifecycle
- Pod restarts - kube_pod_container_status_restarts_total
- Pod status - kube_pod_status_phase (Running, Pending, Failed, etc.)
- Container ready - kube_pod_container_status_ready
- OOM kills - kube_pod_container_status_terminated_reason (OOMKilled)
Node-Level Metrics
- Node capacity - kube_node_status_capacity (CPU, memory, pods)
- Node allocatable - kube_node_status_allocatable
- Node conditions - kube_node_status_condition (Ready, DiskPressure, MemoryPressure)
- Disk pressure - Monitor for nodes under disk pressure affecting agents
Recommended: kube-state-metrics
kube-state-metrics is useful for collecting Kubernetes object state metrics. It exposes metrics about Kubernetes API objects like pods, deployments, nodes, and more.
Installing kube-state-metrics
Using Helm:
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install kube-state-metrics prometheus-community/kube-state-metrics \
--namespace kube-system \
--create-namespace
Key Metrics from kube-state-metrics
The following metrics are particularly useful for Defakto monitoring:
For Trust Domain Servers (deployed in tdd-* namespaces):
# Trust Domain Server replica status
kube_deployment_spec_replicas{namespace=~"^tdd-.*"}
kube_deployment_status_replicas{namespace=~"^tdd-.*"}
# Pod readiness for servers
kube_pod_status_ready{namespace=~"^tdd-.*"}
# Resource requests vs limits for servers
kube_pod_container_resource_requests{namespace=~"^tdd-.*"}
kube_pod_container_resource_limits{namespace=~"^tdd-.*"}
For Agents (deployed in spirl-system namespace):
# Count of agents running vs desired
kube_daemonset_status_desired_number_scheduled{namespace="spirl-system"}
kube_daemonset_status_number_ready{namespace="spirl-system"}
# Agent controller deployment status (webhook handler)
kube_deployment_spec_replicas{namespace="spirl-system"}
kube_deployment_status_replicas{namespace="spirl-system"}
# Pod readiness across agent namespace
kube_pod_status_ready{namespace="spirl-system"}
# Resource requests vs limits for agents
kube_pod_container_resource_requests{namespace="spirl-system"}
kube_pod_container_resource_limits{namespace="spirl-system"}
Recommended: metrics-server
metrics-server provides near-real-time resource utilization metrics for pods and nodes. It is required for kubectl top commands and for horizontal pod autoscaling based on CPU or memory.
Installing metrics-server
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml
For local or development clusters whose kubelets use self-signed certificates, you may also need to disable kubelet TLS verification:
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml
# Patch for self-signed certificates (development only)
kubectl patch deployment metrics-server -n kube-system --type='json' \
-p='[{"op": "add", "path": "/spec/template/spec/containers/0/args/-", "value": "--kubelet-insecure-tls"}]'
Using metrics-server
Once installed, use kubectl to view resource usage:
# View pod resource usage
kubectl top pods -n spirl-system
# View node resource usage
kubectl top nodes
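If you want to post-process the output of kubectl top pods in a script, the quantities need converting from Kubernetes notation (for example 3m CPU or 28Mi memory). The Python sketch below parses a captured sample. The pod name and values are invented, and the parser assumes the default three-column layout and only the suffixes shown.

```python
MEM_UNITS = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30}


def parse_cpu(quantity):
    """Convert a CPU quantity such as '250m' or '2' to cores."""
    if quantity.endswith("m"):
        return float(quantity[:-1]) / 1000
    return float(quantity)


def parse_memory(quantity):
    """Convert a memory quantity such as '28Mi' to bytes."""
    for suffix, factor in MEM_UNITS.items():
        if quantity.endswith(suffix):
            return int(quantity[:-2]) * factor
    return int(quantity)


def parse_top_pods(output):
    """Map pod name to (cores, bytes), skipping the header row."""
    rows = {}
    for line in output.strip().splitlines()[1:]:
        name, cpu, mem = line.split()[:3]
        rows[name] = (parse_cpu(cpu), parse_memory(mem))
    return rows


# Captured sample output; the pod name and numbers are illustrative only.
sample = """\
NAME                CPU(cores)   MEMORY(bytes)
spirl-agent-x7k2p   3m           28Mi
"""
usage = parse_top_pods(sample)
print(usage)
```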
Recommended: cAdvisor Metrics
cAdvisor (Container Advisor) provides container resource usage and performance metrics. It's built into the kubelet, so it's automatically available in most Kubernetes distributions.
Key cAdvisor Metrics
cAdvisor exposes metrics through the kubelet on each node:
- container_cpu_usage_seconds_total - Total CPU time consumed
- container_memory_working_set_bytes - Current working set memory
- container_memory_rss - Resident set size
- container_network_receive_bytes_total - Network bytes received
- container_network_transmit_bytes_total - Network bytes transmitted
Prometheus can scrape these directly from the kubelet /metrics/cadvisor endpoint.
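The endpoint serves the Prometheus text exposition format. To give a sense of what a scraper sees, the Python sketch below pulls selected cAdvisor samples out of a captured response. The sample text and label values are invented for this example, the parser handles only simple, unescaped lines, and fetching the real endpoint (which typically requires kubelet authentication) is left out.

```python
import re

# name{labels} value [timestamp]
SAMPLE_RE = re.compile(
    r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)(?:\s+(-?\d+))?$'
)
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_samples(text, wanted):
    """Return (metric name, labels dict, value) for each wanted metric."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks, HELP and TYPE lines
            continue
        match = SAMPLE_RE.match(line)
        if match is None or match.group(1) not in wanted:
            continue
        labels = dict(LABEL_RE.findall(match.group(2) or ""))
        samples.append((match.group(1), labels, float(match.group(3))))
    return samples


# Illustrative excerpt; the values and label sets are made up.
scrape = """\
# TYPE container_memory_working_set_bytes gauge
container_memory_working_set_bytes{container="agent",namespace="spirl-system",pod="agent-abc"} 1.2e+07 1700000000000
container_memory_rss{container="agent",namespace="spirl-system",pod="agent-abc"} 9e+06 1700000000000
"""
found = parse_samples(scrape, {"container_memory_working_set_bytes"})
print(found)
```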
Kubernetes Platform Metrics
Beyond Defakto-specific metrics, monitor the platform:
Cluster Health:
# Nodes ready
sum(kube_node_status_condition{condition="Ready",status="true"})
# Pods in Failed state
sum(kube_pod_status_phase{namespace=~"spirl-system|^tdd-.*",phase="Failed"})
# Pods not ready (should be 0 for healthy systems)
sum(kube_pod_status_ready{namespace=~"spirl-system|^tdd-.*",condition="false"})
# Defakto pods restarting
rate(kube_pod_container_status_restarts_total{namespace=~"spirl-system|^tdd-.*"}[5m])
Resource Saturation:
# CPU usage as a percentage of the CPU limit
# (container!="" excludes the pod-level cgroup series, which would otherwise double-count usage)
100 * sum(rate(container_cpu_usage_seconds_total{namespace=~"spirl-system|^tdd-.*",container!=""}[5m])) by (namespace, pod)
/ sum(kube_pod_container_resource_limits{namespace=~"spirl-system|^tdd-.*",resource="cpu"}) by (namespace, pod)
# Memory usage as a percentage of the memory limit
100 * sum(container_memory_working_set_bytes{namespace=~"spirl-system|^tdd-.*",container!=""}) by (namespace, pod)
/ sum(kube_pod_container_resource_limits{namespace=~"spirl-system|^tdd-.*",resource="memory"}) by (namespace, pod)
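One consequence of the division in these queries is easy to miss: a pod with no limit set has no matching series on the right-hand side, so it drops out of the result entirely. The Python sketch below mirrors that matching behavior on plain dictionaries. The pod names and numbers are invented, and the keys stand in for whatever label set the query groups by.

```python
def percent_of_limit(usage, limits):
    """Usage as a percentage of limit, for entries present on both sides."""
    return {
        key: 100.0 * used / limits[key]
        for key, used in usage.items()
        # The zero check is a guard for this sketch; PromQL itself would
        # return +Inf for a zero divisor instead of dropping the series.
        if key in limits and limits[key] > 0
    }


# CPU cores in use and CPU limits, keyed by (namespace, pod); values are illustrative.
usage = {
    ("spirl-system", "agent-a"): 0.25,
    ("spirl-system", "agent-b"): 0.10,   # no limit configured
    ("tdd-example", "server-0"): 0.50,
}
limits = {
    ("spirl-system", "agent-a"): 0.5,
    ("tdd-example", "server-0"): 2.0,
}
saturation = percent_of_limit(usage, limits)
print(saturation)
```

Because agent-b never appears in the output, a dashboard built only on these ratios would not show it. Comparing the pod count against kube_pod_status_ready is one way to notice such gaps.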
Capacity Planning:
- Node CPU and memory allocatable vs requested
- Network bandwidth trends
Next Steps
- Server Metrics - Application-specific metrics for Trust Domain Servers
- Agent Metrics - Application-specific metrics for SPIRL Agents
- Review All Metrics - Complete metrics reference