The Kubernetes Series - Scheduling

Scheduling in Kubernetes is the process by which Pods are assigned to nodes. We briefly covered this topic in the kube-scheduler section of the previous post on the Master node, so let's have a quick recap.

Kube-Scheduler in brief

Kube-scheduler, if you recall, is a service that runs on the master node and monitors the cluster for unscheduled Pods. It does this by checking for a property called nodeName on a created Pod's definition. If the nodeName property is not set, it knows it should find a node to run the Pod on, and it then proceeds with the following steps;

  • It starts by filtering out nodes that do not meet the Pod/container resource requirements
  • It ranks the filtered nodes out of 10, according to how much of their resources would be left available
  • It then narrows down the node options even further, by taking any other specific rules you might have defined into account.

But before we go into defining the parameters for controlling kube-scheduler in more detail, let's look at how to schedule a Pod to a node manually.

Manual Scheduling

When you have no scheduling service running on your cluster (if you didn't install the kube-scheduler binary on the master, or didn't create your cluster with kubeadm), any Pods you create will stay in a Pending state when you run kubectl get pods.

You can manually assign a Pod to a node on creation, by setting the node name in the nodeName property in your Pod YAML file;

apiVersion: v1
kind: Pod
metadata:
  name: web-app-pod
  labels:
    name: web-app
spec:
  containers:
  - name: web-app-container
    image: web-app-image
    ports:
    - containerPort: 80
  nodeName: node01

When a Pod has already been created but not assigned to a node, things get a little more complicated: you have to manually send a Binding object to the Pod's binding API. Let's not get into that in detail right now and merely suggest that if you want to manually schedule a Pod that is already created (and thus Pending), it's easiest to just delete it and create it again with the nodeName property set.
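
For the curious, a Binding object is itself just a small Kubernetes object. A minimal sketch, reusing the web-app-pod and node01 names from above, would look something like the following; you would POST it (converted to JSON) to the Pod's binding endpoint, e.g. /api/v1/namespaces/default/pods/web-app-pod/binding;

apiVersion: v1
kind: Binding
metadata:
  name: web-app-pod   # the pending Pod you want to schedule
target:
  apiVersion: v1
  kind: Node
  name: node01        # the node to bind it to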

Node Resource Requirements

The kube-scheduler looks at 3 resource elements in order to decide if a node is capable of accepting a Pod. These are;

  • CPU –> The current minimum default for a container is 0.5 CPU units
  • Memory –> The current minimum default for a container is 256Mi of RAM.
  • Disk space –> The minimum storage is determined by your cloud provider

If you know your application has higher minimum resource requirements, you should set the minimum requirements in your pod YAML definition, under the spec/containers dictionary, like so;

apiVersion: v1
kind: Pod
metadata:
  name: web-app-pod
  labels:
    name: web-app
spec:
  containers:
  - name: web-app-container
    image: web-app-image
    ports:
    - containerPort: 80
    resources:
      requests:
        memory: "2Gi"
        cpu: 1
  nodeName: node01

Note that resource changes do not get propagated to existing Pods. If you want to change the resource requirements and limits for Pods, you'll need to delete the existing ones and spin up new ones with your updated definition files.

CPU Resource Units

CPU units work as follows - 1 CPU is roughly equal to 1 core or hyperthread. On AWS it's 1 vCPU.
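
You can also express fractions of a CPU with the m (milli) suffix - 500m is the same as 0.5 CPU. For example, a fragment of the resources dictionary from the earlier Pod definition;

resources:
  requests:
    cpu: "500m"   # half a CPU core, equivalent to cpu: 0.5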

RAM Resource Units

  1. 1 G = Gigabyte = 1,000,000,000 bytes
  2. 1 Gi = Gibibyte = 1,073,741,824 bytes
  3. 1 M = Megabyte = 1,000,000 bytes
  4. 1 Mi = Mebibyte = 1,048,576 bytes
  5. 1 K = Kilobyte = 1,000 bytes
  6. 1 Ki = Kibibyte = 1,024 bytes

Resource Limits

By default Kubernetes limits a container to 1 CPU and 512Mi of RAM. Custom limits can also be set in your Pod definition file, under the resources dictionary.

Limits are enforced differently for RAM than for CPU usage. A Pod that tries to use more CPU than its limit simply gets throttled, but a Pod that tries to use more RAM than its limit allows gets terminated.
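
As a sketch, a limits dictionary sits next to requests in the Pod definition - the limit values below are just illustrative;

apiVersion: v1
kind: Pod
metadata:
  name: web-app-pod
spec:
  containers:
  - name: web-app-container
    image: web-app-image
    resources:
      requests:
        memory: "2Gi"
        cpu: 1
      limits:
        memory: "4Gi"   # exceeding this gets the container terminated
        cpu: 2          # exceeding this gets the container throttled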

Taints and Tolerations

Nodes and Pods have properties that help control which Pods are able to run on which nodes. These properties are called Taints and Tolerations.

Taints are properties set on a node which only allow Pods to run on it when those Pods have an equivalent Toleration property. No Pod without the appropriate Toleration can be launched on a node with a Taint set. Incidentally, this is why no Pods you create yourself get scheduled on the Master node - the Master node automatically gets a Taint preventing this when it is created. You can of course change this behaviour or add the appropriate master Tolerations to your own Pods, but this is a no-no and considered bad practice.
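
You can see the Master node's Taint for yourself with kubectl describe (master01 below being whatever your master node is called). On a kubeadm cluster the exact taint key depends on your Kubernetes version, but it looks something like this;

kubectl describe node master01 | grep Taint
# Taints:  node-role.kubernetes.io/master:NoSchedule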

Taints on Nodes

Taints can be added to a node from the command line, like so;

kubectl taint nodes node01 app=web-app:NoSchedule

The NoSchedule above is a taint effect, or taint execution strategy. There are three taint effects;

  • NoSchedule –> no pod will be scheduled on this node unless it has a matching toleration
  • PreferNoSchedule –> Same as NoSchedule, but not a strict rule, so a Pod might still be launched on this node even if taints and tolerations don't match.
  • NoExecute –> Any non-matching pods already running on the node will be evicted and not be run on the node again in the future.
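
To remove a Taint again, you run the same kubectl taint command with a minus sign appended to the effect;

kubectl taint nodes node01 app=web-app:NoSchedule-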

Tolerations on Pods

You can define a Pod's tolerations in its YAML file like so;

apiVersion: v1
kind: Pod
metadata:
  name: web-app-pod
spec:
  containers:
  - name: web-app-container
    image: web-app-image
  tolerations:
  - key: "app"
    operator: "Equal"
    value: "web-app"
    effect: "NoSchedule"

That's cool, so we have some control of which Pods should not run on what nodes.

Pods with Tolerations set can run both on nodes with a matching Taint and on nodes with no Taints at all. If you want a Pod to only be allowed on a particular node, you need to set a Node Affinity on that Pod.

But before we look at Node Affinity, we need to look at Node Labels.

Node Labels

Just like we can add labels to Pods, ReplicaSets and Deployments, we can add labels to a particular node. Let's set one with the CLI;

kubectl label nodes node01 size=monster-beast

We just gave node01 the label size=monster-beast.
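
You can verify the label was applied with the --show-labels flag;

kubectl get nodes node01 --show-labels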

Now we can create a Pod that will only run on node01 with the following definition file, glutton-pod.yml, using the added nodeSelector property;

apiVersion: v1
kind: Pod
metadata:
  name: gluttonous-app-pod
spec:
  containers:
  - name: gluttonous-container
    image: gluttonous-image
  nodeSelector:
    size: monster-beast

The gluttonous-app-pod will now run on node01 exclusively.

But what if we want even more control, like having operator expressions - the equivalent of AND, OR and NOT, for instance? That's where Node Affinity comes in.

Node Affinity

With Node Affinity we can be extremely specific by saying we want a Pod to be on all nodes labeled something, or any node except those labeled something, or all nodes labeled something and something else.

Node Affinity can be set in your Pod YAML file and there are currently 2 types (and 1 future one);

  • requiredDuringSchedulingIgnoredDuringExecution
  • preferredDuringSchedulingIgnoredDuringExecution
  • requiredDuringSchedulingRequiredDuringExecution (available in the future)

requiredDuringSchedulingIgnoredDuringExecution

RequiredDuringSchedulingIgnoredDuringExecution specifies that a Pod can only be scheduled to a node meeting the specified conditions, and that if no node matching those conditions can be found, the Pod shouldn't be deployed at all. It's thus a hard enforcement of the affinity rules.

preferredDuringSchedulingIgnoredDuringExecution

PreferredDuringSchedulingIgnoredDuringExecution specifies that a Pod should preferably be scheduled to a node meeting the specified conditions, but if no node matching those conditions can be found, the Pod can be deployed on the next-best node available. In other words, it's a softer enforcement of the Node Affinity rules.
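
As a sketch of the syntax, a preferred rule takes a weight between 1 and 100, which the scheduler adds to a node's score when the rule matches - the label value here just reuses our earlier example;

affinity:
  nodeAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 1
      preference:
        matchExpressions:
        - key: size
          operator: In
          values:
          - monster-beast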

requiredDuringSchedulingRequiredDuringExecution (upcoming)

RequiredDuringSchedulingRequiredDuringExecution will be supported in a future version of Kubernetes, and will be the same as requiredDuringSchedulingIgnoredDuringExecution, except that if the scheduler finds a Pod running on a node that no longer meets the conditions of the affinity rules, that Pod will be evicted from that node.

Let's look at the previous Pod YAML file, re-written with Node Affinity;

apiVersion: v1
kind: Pod
metadata:
  name: gluttonous-app-pod
spec:
  containers:
  - name: gluttonous-container
    image: gluttonous-image
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: size
            operator: In
            values:
            - monster-size
            - large-size
            - medium-size

Now our Pod will only be able to launch on nodes labelled monster-size, large-size or medium-size. We can also change the operator to NotIn, Exists and more - see the Node Affinity Design Docs for the full list.
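
For example, NotIn excludes nodes with certain label values, and Exists only checks that the label key is present on the node and takes no values list at all. A small sketch, with tiny-size being a hypothetical label value;

affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: size          # the node must have a size label...
          operator: Exists
        - key: size          # ...whose value is not tiny-size
          operator: NotIn
          values:
          - tiny-size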

Node Affinity VS Taints/Tolerations

So, to recap the differences between these two scheduling controls;

  • Taints/Tolerations limit which Pods are allowed on which nodes
  • Node Affinity limits which specific nodes a Pod wants to be launched on.

So if you want to limit a Pod to be only allowed on a certain node, and not allow any other Pods on that node either, you'll use a combination of Taints/Tolerations and Node Affinity rules to enforce that restriction.
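
A sketch of what that combination could look like for node01, assuming it carries both the app=web-app Taint and the size=monster-beast label from the earlier sections - the Toleration lets the Pod onto the tainted node, and the affinity rule keeps it off every other node;

apiVersion: v1
kind: Pod
metadata:
  name: web-app-pod
spec:
  containers:
  - name: web-app-container
    image: web-app-image
  tolerations:
  - key: "app"
    operator: "Equal"
    value: "web-app"
    effect: "NoSchedule"
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: size
            operator: In
            values:
            - monster-beast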

Conclusion

We've now had a look at how to exert more control over how our Pods get scheduled.

We looked at how to manually provision Pods to nodes, how to control the amount of resources they require, and how to control which Pods are allowed to run on which nodes with Taints and Tolerations. We also looked at controlling which node a specific Pod gets launched on with Node Affinity.

Next up we'll explore DaemonSets.
