Prometheus metrics within Kubernetes — an aerial view
An integral element of a system’s architecture is its monitoring tooling. At Mosaic Learning, we use Kubernetes for our infrastructure management and Prometheus as our metrics tool. To get monitoring set up systematically and quickly, we utilize the prometheus-community Helm charts.
As this repository contains a LOT of charts, some of which we use and some of which we do not, I wanted to take some time to understand what each one does and how they interact with each other. Additionally, I wanted to use this as an opportunity to review the full metrics flow within Kubernetes and beyond, into Prometheus itself.
I have created a rough diagram of how metrics flow from Kubernetes into Prometheus (and ultimately Grafana). (Note: the boxes shaded grey represent resources which automatically get deployed together; see the section below on the Kube Prometheus Stack.)
Although a picture is worth a thousand words, let’s walk through this together.
Kubernetes Core metrics
As stated in the documentation, Kubernetes by default exposes metrics (in Prometheus format) for many of its core services at the path /metrics. The kubelet, which has cAdvisor built into its binary, exposes its metrics at /metrics/cadvisor and also defines the endpoints /metrics/probes and /metrics/resource.
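To get a feel for this text exposition format, here is a minimal Python sketch that parses a few metric lines into names, labels, and values. The sample lines are illustrative, not real kubelet output, and a real consumer would use an official Prometheus client library rather than this simplified parser (it does not handle commas or escapes inside quoted label values):

```python
# Illustrative sample in Prometheus text exposition format, similar in shape
# to what the kubelet serves at /metrics (values and labels are made up).
sample = """\
# HELP kubelet_running_pods Number of pods with a running pod sandbox.
# TYPE kubelet_running_pods gauge
kubelet_running_pods 12
container_cpu_usage_seconds_total{container="app",pod="web-0"} 4.2
"""

def parse_metrics(text):
    """Return {metric_name: [(labels_dict, value), ...]} from exposition text."""
    metrics = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comment lines
        name_part, _, value = line.rpartition(" ")
        labels = {}
        if "{" in name_part:
            name, label_str = name_part.split("{", 1)
            for pair in label_str.rstrip("}").split(","):
                key, val = pair.split("=", 1)
                labels[key] = val.strip('"')
        else:
            name = name_part
        metrics.setdefault(name, []).append((labels, float(value)))
    return metrics

parsed = parse_metrics(sample)
print(parsed["kubelet_running_pods"])  # → [({}, 12.0)]
```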
A very good diagram can be found here. In the diagram, we can see that the kubelet’s metrics get exported via the Summary API to the Metrics Server, which exposes them further to the HPA and kubectl top. Let’s discuss this at a deeper level.
Prometheus Adapter
In actuality, the Summary API is consumed not just by the Metrics Server but anything extending the Kubernetes API Aggregation Layer, hooking into the existing metrics API. (see https://github.com/kubernetes/kube-aggregator). One example of this is the Prometheus Adapter.
If one looks in the prometheus-adapter chart, you’ll see 3 files:
- resource-metrics-apiservice.yaml
- custom-metrics-apiservice.yaml
- external-metrics-apiservice.yaml
Each one of these manifests creates an APIService object, setting its group to metrics.k8s.io, custom.metrics.k8s.io, or external.metrics.k8s.io respectively. We now have an implementation of the Metrics API backed by data from the Prometheus server.
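As a sketch, the APIService object registering the custom metrics group looks roughly like this; the backing service name and namespace are assumptions and depend on how the chart is installed:

```yaml
apiVersion: apiregistration.k8s.io/v1
kind: APIService
metadata:
  name: v1beta1.custom.metrics.k8s.io
spec:
  group: custom.metrics.k8s.io          # the API group this service implements
  version: v1beta1
  groupPriorityMinimum: 100
  versionPriority: 100
  insecureSkipTLSVerify: true
  service:
    name: prometheus-adapter            # assumed service name
    namespace: monitoring               # assumed namespace
```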
Kube State Metrics
Kube State Metrics is another service deployed into the cluster which also queries the Kubernetes API Server. However, kube-state-metrics does not consume the Summary API; rather, it utilizes the client-go client (https://kubernetes.io/docs/tasks/administer-cluster/access-cluster-api/#go-client) to access the cluster (see the API spec). As an example, when detecting the state of a StatefulSet, kube-state-metrics utilizes both the CoreV1().Pods API and the AppsV1().StatefulSets API.
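For a sense of what this looks like in practice, here are a few illustrative kube-state-metrics series for a hypothetical 3-replica StatefulSet named web (the metric names are real; the values and labels are made up):

```
kube_statefulset_replicas{namespace="default",statefulset="web"} 3
kube_statefulset_status_replicas_ready{namespace="default",statefulset="web"} 2
kube_pod_status_phase{namespace="default",pod="web-2",phase="Pending"} 1
```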
The metrics exposed by the Metrics API (such as CPU and memory usage) pertain to the internal functioning of a pod. These metrics are therefore useful for autoscaling purposes (i.e. the Horizontal Pod Autoscaler). When a pod’s CPU/memory usage hits a certain percentage, the HPA knows to add another replica of the pod to the cluster. On the other hand, the metrics exposed by kube-state-metrics are, as the name indicates, used to describe the state of the cluster. Understanding the number of failed pods, unavailable nodes, or any other resource-level event is crucial for the health management of the cluster.
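To sketch how the resource metrics feed autoscaling, here is a minimal HorizontalPodAutoscaler manifest that scales a Deployment on CPU utilization; the Deployment name and thresholds are hypothetical:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: example-app-hpa          # hypothetical name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: example-app            # hypothetical target Deployment
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 75   # add replicas above 75% average CPU
```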
Prometheus Node Exporter
Prometheus Node Exporter is another crucial element in our metrics architecture. This exporter does not access the Kubernetes API Server at all, and sometimes does not even exist inside the cluster. Even when the kube-prometheus-stack installs it inside the cluster (as one of its dependencies), it does so as a DaemonSet and therefore lives on every node. It has access to and exposes node-level (host) metrics, which live outside the cluster’s own resource model, to aid in monitoring the health of the node itself.
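For reference, the node-level series the node exporter exposes look roughly like the following (the metric names are real node exporter metrics; the values and label sets are illustrative):

```
node_cpu_seconds_total{cpu="0",mode="idle"} 4.53e+06
node_memory_MemAvailable_bytes 8.1e+09
node_filesystem_avail_bytes{device="/dev/sda1",mountpoint="/"} 2.4e+10
```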
Kube Prometheus Stack
We finally arrive at the kube-prometheus-stack chart. This chart provides a full metrics stack — Prometheus, Grafana, Alertmanager, etc. — for actually pulling, processing, and eventually visualizing (with Grafana) the cluster’s metrics data. Let’s review the contents of this chart a bit to understand how it works.
CRDs
This chart is the only chart in the prometheus-community Helm charts repository which defines Custom Resource Definitions (CRDs). These are located in the /crds subfolder and contain the following definitions, all under the monitoring.coreos.com group:
- AlertManagerConfig
- AlertManager
- PodMonitor
- Probe
- Prometheus
- PrometheusRule
- ServiceMonitor
- ThanosRuler
The main CRD I would like to focus on is the ServiceMonitor. The ServiceMonitor resource is a custom resource used by Prometheus for pulling metrics. In /templates/exporters, you’ll find ServiceMonitors defined for each core component at the metrics paths discussed above. (When a path is not defined, the default is just /metrics.) The Prometheus Operator (see below) watches the ServiceMonitor resources in the cluster and configures Prometheus to scrape metrics from the endpoints they define. For almost all of our applications, we define additional ServiceMonitor resources, and those are automatically picked up by the operator and pulled into Prometheus.
(If you are interested about how operators differ from controllers in Kubernetes, I found this article to be excellent.)
Components and Dependencies
When reviewing the /templates folder, you will see that this chart installs Prometheus, the Prometheus Operator, and Prometheus Alertmanager. Additionally, it installs the Kube State Metrics, Prometheus Node Exporter, and Grafana Helm charts as dependencies (with the first two charts actually being part of the prometheus-community Helm charts repository itself).
The Prometheus Adapter chart, while not an actual dependency of this stack, is also located in the same repository.
Service Monitors
Here is an example of how to use ServiceMonitors to expose custom metrics:
Our application is built on PHP and running php-fpm. In order for us to expose our php-fpm metrics to Prometheus, we would take the following steps:
1. Create a sidecar metrics exporter container in the deployment’s pod spec:
- name: fpm-metrics
  image: hipages/php-fpm_exporter
  imagePullPolicy: IfNotPresent
  ports:
    - name: metrics
      containerPort: 6999
      protocol: TCP
2. Create a Service to expose this port for this pod
apiVersion: v1
kind: Service
metadata:
  name: my-service
  labels:
    type: my-metrics
spec:
  ports:
    - port: 9216
      targetPort: metrics
      protocol: TCP
      name: metrics
  selector:
    app.kubernetes.io/name: example-app
3. Create a Service Monitor to select those Service objects
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-phpfpm-monitor
  labels:
    app.kubernetes.io/name: my-phpfpm-monitor
    app.kubernetes.io/instance: {{ .Release.Name }}
    app.kubernetes.io/managed-by: {{ .Release.Service }}
    release: prometheus
spec:
  endpoints:
    - port: metrics
      interval: 30s
      scrapeTimeout: 10s
  namespaceSelector:
    matchNames:
      - {{ .Release.Namespace }}
  selector:
    matchLabels:
      type: my-metrics
See here for more details on how to set it up, and here for the API spec.
Once this is done, you can configure Prometheus Adapter to generate further custom metrics with its configuration values.
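As a rough sketch of what such a configuration rule could look like, assuming the exporter exposes phpfpm_active_processes and phpfpm_total_processes series, the adapter’s rules could derive a phpfpm_process_utilization metric like this (the exact series names and query are assumptions):

```yaml
rules:
  - seriesQuery: 'phpfpm_active_processes{namespace!="",pod!=""}'
    resources:
      overrides:
        namespace: {resource: "namespace"}
        pod: {resource: "pod"}
    name:
      matches: "^phpfpm_active_processes$"
      as: "phpfpm_process_utilization"
    # ratio of active to total FPM processes, per pod
    metricsQuery: >-
      sum(phpfpm_active_processes{<<.LabelMatchers>>}) by (<<.GroupBy>>)
      / sum(phpfpm_total_processes{<<.LabelMatchers>>}) by (<<.GroupBy>>)
```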
Metrics API
As a final point of discussion, I wanted to show how one would manually access metrics from the Metrics API (via the Kubernetes API).
Metrics API
kubectl get --raw "/apis/metrics.k8s.io/v1beta1"
kubectl get --raw "/apis/metrics.k8s.io/v1beta1/nodes"
kubectl get --raw "/apis/metrics.k8s.io/v1beta1/pods"
kubectl get --raw "/apis/metrics.k8s.io/v1beta1/namespaces/ubc/pods" | jq '.'
Custom Metrics API
The following example assumes that a custom metric named phpfpm_process_utilization was created.
kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1/namespaces/<namespace>/pods/*/phpfpm_process_utilization"
If you want to see a full list of all custom metrics, you can use this:
kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1" | jq '. | {metrics: .resources[].name}'
Note: If you run kubectl api-resources | grep metrics, you would expect to see NodeMetrics and PodMetrics for metrics.k8s.io, but nothing for custom.metrics.k8s.io or external.metrics.k8s.io. The reason for this is that custom and external metrics are not Kubernetes resource kinds themselves; rather, those API groups expose the custom metrics data directly.
Conclusion
Kubernetes is an amazing system for the management and networking of application infrastructure. This organization allows it to expose crucial metrics that help ensure system stability. I would definitely recommend installing the Kube Prometheus Stack chart together with the Prometheus Adapter chart to take advantage of what this package can offer.