A Detailed Guide to Probes in Kubernetes

Usamah Jassat
Published in DevOps.dev · 10 min read · Jan 31, 2024


When you’re striving to enhance the stability, availability, and durability of your applications deployed on Kubernetes, probes become indispensable tools at your disposal. In this guide, I’ll walk you through the three different types of probes applicable to containers running on pods, explaining their functionalities and when to employ them. Additionally, I’ll delve into the four distinct mechanisms for probing containers. If this sounds overwhelming, fear not! By the end of this read, you’ll be a Kubernetes probing pro!

Introduction to probes

A probe is a diagnostic that the Kubelet, the Kubernetes node agent, employs to evaluate the health of a running container. Based on the probe’s outcome, the Kubelet can initiate specific actions concerning the container. There are three types of probes that can be configured for the containers in a pod, each serving a distinct purpose and triggering different actions:

  • Startup
  • Liveness
  • Readiness

The mechanism for probing the container can either be a network request against the container or code execution within it. When a container is probed there can be one of three outcomes:

  • Success — The container successfully passed the diagnostic (e.g. a network connection was established successfully or a command returned an exit code of 0)
  • Failure — The container failed the diagnostic check
  • Unknown — The container failed the diagnostic check for an unknown reason, like an error in the Kubelet when trying to run the diagnostic. If this is the case the Kubelet will retry the probe.

Success and Failure are the most important and most common outcomes of probes.

Probe Types

Startup Probes

Startup probes evaluate containers during their initialization phase, signalling to Kubernetes when the application inside has finished starting up. This probe is particularly beneficial for applications with prolonged startup times, for instance applications that need to establish connections to external services or warm up caches by fetching data from sources like S3 before they can serve requests.

When a startup probe is defined for a container, all other probes are paused until the startup probe succeeds. This probe is typically used in combination with a liveness probe for containers that have an extended start-up duration: the startup probe is given a higher failure threshold than the liveness probe, giving the container time to start. Once the container has started, the liveness probe takes over with a smaller failure threshold, which keeps the detection time for container failures short while still allowing a lenient startup time.

If the startup probe fails, the Kubelet will terminate the container and restart it if the parent pod’s restartPolicy is set to Always or OnFailure.

Liveness Probes

Liveness probes continuously check whether the container is still healthy. If a startup probe is defined, liveness probes only start after it has succeeded.

This probe is useful for containers that can enter an error state while the actual process keeps running. If the probe fails, the Kubelet restarts the container (based on the pod’s restartPolicy). For example, suppose a container runs multiple threads: one doing data processing and one serving a REST API. If the data processing keeps running but the REST API goes down, or if the application runs into a deadlock, the best way to get the container back into a healthy state may be to restart the entire container. This is what the liveness probe enables.

Like the startup probe, if the probe fails then the Kubelet will kill the container and restart it if the parent pod’s restartPolicy is set to Always or OnFailure.

Readiness Probes

Readiness probes determine whether a container is ready and able to respond to requests.

These probes are crucial for containers that serve some sort of traffic through a Kubernetes Service. They are useful for applications that may need to go into some sort of maintenance mode, where the container should stop serving traffic but stay alive. In cases like these, the liveness and readiness probes of the container should be different, such that in maintenance mode the liveness probes still pass but the readiness probes do not.

Unlike liveness and startup probes, readiness probe failures do not trigger container restarts. Instead, a failure results in the Pod’s IP address being removed from the endpoints of all Services that match that Pod.
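
As a rough sketch of this split (assuming the application exposes a hypothetical /healthz endpoint that keeps passing as long as the process is alive, and a hypothetical /ready endpoint that starts failing once maintenance mode is entered), the two probes could look like this:

livenessProbe:
  httpGet:
    path: /healthz # hypothetical endpoint that passes while the process is alive
    port: 8080
  periodSeconds: 10
readinessProbe:
  httpGet:
    path: /ready # hypothetical endpoint that fails while in maintenance mode
    port: 8080
  periodSeconds: 10

With this split, entering maintenance mode only removes the Pod from the matching Services’ endpoints; the container itself keeps running.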

Probe Mechanisms

Now that we know about the different types of probes in Kubernetes, we can explore the different mechanisms that Kubernetes uses to probe containers and how we can apply them to our containers.

The four probe mechanisms that are available are:

  • HTTP Get — probes by sending an HTTP GET request to an endpoint on the container
  • Execution — probes by executing a command in the container
  • TCP Socket — probes by opening a TCP socket to a port on the container
  • gRPC — probes by using the gRPC health checking protocol to a gRPC endpoint on the container

We’ll delve deeper into these mechanisms in the examples below.

Defining probes

When defining probes there are five shared fields that define how the Kubelet will execute the probes, classify their results and act on them; a short sketch showing all five together follows this list.

  • periodSeconds (int) — specifies how often the Kubelet will perform the probe. Defaults to 10 and requires a minimum of 1.
  • timeoutSeconds (int) — the number of seconds after which the probe times out and fails. Defaults to 1 and requires a minimum of 1.
  • failureThreshold (int) — how many consecutive failed probes need to occur for the overall probe to be considered a failure and the probe’s action to be triggered. Defaults to 3 and requires a minimum of 1.
  • successThreshold (int) — how many consecutive successful probes need to occur, after a failed probe, for the overall probe to be considered successful. Defaults to 1, requires a minimum of 1, and must be 1 for liveness and startup probes.
  • terminationGracePeriodSeconds (int) (optional) — used for probes that terminate a container (liveness and startup). The grace period given to the container to shut down after a SIGTERM is sent; after this grace period the container will be forcibly killed with a SIGKILL. If this value isn’t set, the pod’s terminationGracePeriodSeconds is used; otherwise this value overrides the value in the pod spec. If it is set to 0, the container is killed immediately without any grace period.
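
As a minimal, illustrative sketch (the endpoint and values here are placeholders rather than recommendations), a liveness probe that sets all five fields might look like this:

livenessProbe:
  httpGet:
    path: /healthz # hypothetical health endpoint
    port: 8080
  periodSeconds: 10 # probe every 10 seconds
  timeoutSeconds: 2 # fail the probe if no response arrives within 2 seconds
  failureThreshold: 3 # 3 consecutive failures trigger the probe's action (here, a restart)
  successThreshold: 1 # must be 1 for liveness and startup probes
  terminationGracePeriodSeconds: 30 # time between SIGTERM and SIGKILL when restarting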

With these in mind let’s look at examples of the different probing mechanisms.

HTTPGet Liveness/Startup Probes

apiVersion: v1
kind: Pod
metadata:
  labels:
    test: liveness
  name: liveness-exec
spec:
  restartPolicy: OnFailure
  containers:
  - name: liveness
    image: registry.k8s.io/liveness
    args:
    - /server
    livenessProbe:
      httpGet:
        path: /healthz # (Optional) The path to request on the HTTP server
        port: 8080 # The name or number of the port to access on the container
        httpHeaders: # (Optional) Any custom headers to set in the request
        - name: Custom-Header
          value: Awesome
        scheme: HTTP # (Optional) (Defaults to HTTP) The scheme (HTTP/HTTPS) to use for the connection
        # host: # (Optional) (Defaults to the Pod's IP) The host name to connect to
      initialDelaySeconds: 1 # (Optional) The time to wait before starting the probes
      periodSeconds: 3
      failureThreshold: 1
      successThreshold: 1

Here we have defined a liveness probe that initially waits 1 second before starting. This probe sends an HTTP GET request to port 8080; if it receives a response with a status code greater than or equal to 200 and less than 400, the probe is marked as a success. If the request receives any other code, or times out, the probe is considered failed.

The registry.k8s.io/liveness container returns 200s for the first 10 seconds and 500s after that, so if this example is run the first 3 probes should succeed, but after that the probes will fail and the Kubelet will kill and restart the container.

This example sets failureThreshold to 1, which is not a good idea. Such a low threshold can make the probe flaky: a transient network failure or dropped packet can cause the Kubelet to kill and restart the container. It is better practice to set this value higher; the default of 3 is a good place to start.

The initialDelaySeconds field is an alternative way to give the container time to get ready without a startup probe. The downside is that the delay is a fixed time, so genuine startup failures take longer to be caught. A better practice is to use a startup probe along with a liveness probe:

startupProbe:
  httpGet:
    path: /healthz
    port: 8080
    httpHeaders:
    - name: Custom-Header
      value: Awesome
    scheme: HTTP
  failureThreshold: 30
  periodSeconds: 10
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
    httpHeaders:
    - name: Custom-Header
      value: Awesome
    scheme: HTTP
  periodSeconds: 3

This startup probe will be sent every 10 seconds up to 30 times, so the container is given a total of 5 minutes (30 * 10 seconds) to start before being restarted. During this period the liveness probes won’t be sent; they only begin once the startup probe succeeds. Once the startup probe succeeds, startup probes stop and won’t run again unless the container is restarted.

TCP Readiness Probe

apiVersion: v1
kind: Pod
metadata:
  name: goproxy
  labels:
    app: goproxy
spec:
  containers:
  - name: goproxy
    image: registry.k8s.io/goproxy:0.1
    ports:
    - containerPort: 8080
    readinessProbe:
      tcpSocket:
        port: 8080 # The name or number of the port to access on the container
        # host: # (Optional) (Defaults to the Pod's IP) The host name to connect to
      initialDelaySeconds: 15
      periodSeconds: 10
      failureThreshold: 5

Here a readiness probe is defined that will attempt to connect to port 8080 on the container every 10 seconds, after waiting 15 seconds from the container starting. The TCP probe simply attempts to open a TCP socket to the container on the specified port; if a connection can be made the probe is marked as a success, otherwise it is considered a failure.

We have increased the failure threshold in this example to 5, which means the probe needs to fail 5 consecutive times (up to 50 seconds at a 10-second period) before the Pod is removed from the Service’s endpoints. This makes the Pod more resilient to transient problems, but it also means it takes longer to take action if something is genuinely wrong with the container.

Command Liveness probe

apiVersion: v1
kind: Pod
metadata:
  labels:
    test: liveness
  name: liveness-exec
spec:
  containers:
  - name: liveness
    image: registry.k8s.io/busybox
    args:
    - /bin/sh
    - -c
    - touch /tmp/healthy; sleep 30; rm -f /tmp/healthy; sleep 600
    livenessProbe:
      exec:
        command: # The command to run on the container
        - cat
        - /tmp/healthy
      initialDelaySeconds: 5
      periodSeconds: 5

Here the liveness probe is set to the command cat /tmp/healthy; any command can be set and it will run in the root directory of the container’s filesystem. The command is not run inside a shell, it is simply executed, so shell features (e.g. ‘|’) won’t work. To run a command in a shell, the shell needs to be called explicitly. For example, to run a command that greps /tmp/healthy for the word “ready”, a probe like the following can be used:

exec:
  command:
  - /bin/sh
  - -c
  - cat /tmp/healthy | grep ready

The probe is marked as a success if the return code of the command is 0, otherwise it is marked a failure.

In this example the container creates the /tmp/healthy file, sleeps for 30 seconds, deletes the file and then sleeps for 10 minutes. The probe will therefore succeed during those first 30 seconds but fail afterwards, causing the Kubelet to restart the container (restartPolicy defaults to Always).

GRPC Liveness Probe

apiVersion: v1
kind: Pod
metadata:
  name: etcd-with-grpc
spec:
  containers:
  - name: etcd
    image: registry.k8s.io/etcd:3.5.1-0
    command: [ "/usr/local/bin/etcd", "--data-dir", "/var/lib/etcd", "--listen-client-urls", "http://0.0.0.0:2379", "--advertise-client-urls", "http://127.0.0.1:2379", "--log-level", "debug" ]
    ports:
    - containerPort: 2379
    livenessProbe:
      grpc:
        port: 2379 # Port number of the gRPC service
        # service: Health # (Optional) The name of the service to health check
      initialDelaySeconds: 10

Here we have set a liveness probe that probes on port 2379. The gRPC probe uses the gRPC health checking protocol. If a service isn’t defined then the server’s overall health is checked; if a service is provided then the health of that specific service is queried. You can read more about the gRPC health checking protocol at https://github.com/grpc/grpc/blob/master/doc/health-checking.md.

It is worth noting that the gRPC probe does not currently support any authentication parameters such as TLS, so in some situations an HTTP probe has to be used instead if TLS is a requirement.
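
For example, if the application only serves its health endpoint over TLS, one possible workaround is an httpGet probe with the HTTPS scheme (the Kubelet skips certificate verification for probe requests). A minimal sketch, assuming a hypothetical /healthz endpoint on port 8443:

livenessProbe:
  httpGet:
    path: /healthz # assumed health endpoint served over TLS
    port: 8443
    scheme: HTTPS # the Kubelet skips certificate verification for probes
  periodSeconds: 10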

Good Practices

  • Employing liveness probes for all long-running pods is highly recommended, as they automate recovery for containers when something goes wrong. They are particularly powerful against problems like deadlocks.
  • In practice it is best to have a relatively high periodSeconds to prevent overloading your services with excessive probes, which can degrade performance by consuming resources.
  • Ensure that probe endpoints are lightweight and simple. Opt for endpoints that perform minimal logic, focusing solely on checking the application’s health.
  • Select an appropriate value for failureThreshold. If the threshold is too low, such as 1 or 2, the probe is prone to false positives: an intermittent failure, like the server temporarily being too busy, can cause the probe to fail and the container to be restarted unnecessarily. The default of 3 is typically a good value, balancing resilience and time to recognise failures.
  • Set timeoutSeconds appropriately. If your application is known to take a long time responding to calls on its endpoints, increase the timeout to give it more time. This will prevent the container being restarted needlessly.
  • Use a startup probe for applications that are known to take a long time to start, instead of relying on a high initialDelaySeconds. This ensures the liveness probes begin as soon as the application is ready, facilitating early detection of container issues (see the combined sketch after this list).
  • When using liveness and startup probes, ensure that the restartPolicy is set to Always or OnFailure for your pod if you want a failed probe to restart the container.
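
To pull these recommendations together, here is a sketch of a container spec (assuming hypothetical /healthz and /ready endpoints on port 8080; the image name and values are placeholders, not recommendations):

containers:
- name: app
  image: example.com/app:1.0 # placeholder image
  ports:
  - containerPort: 8080
  startupProbe: # gives the application up to 5 minutes (30 * 10s) to start
    httpGet:
      path: /healthz
      port: 8080
    failureThreshold: 30
    periodSeconds: 10
  livenessProbe: # restarts the container if it becomes unhealthy after startup
    httpGet:
      path: /healthz
      port: 8080
    periodSeconds: 10
    timeoutSeconds: 2
    failureThreshold: 3
  readinessProbe: # removes the Pod from Service endpoints when it can't serve traffic
    httpGet:
      path: /ready
      port: 8080
    periodSeconds: 10
    failureThreshold: 3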

Takeaways

Probes play a pivotal role in ensuring the stability of your Kubernetes deployments. By adhering to best practices, they can efficiently offload workload management from service operators onto Kubernetes. Make the most of probes by implementing these recommended practices!
