Kubernetes: Advanced Q&A

I’d like to share some questions we asked the Google Kubernetes team, since the answers might be useful to others. Enjoy!

  1. Does each pod only have access to the resources of one node (the node on which it runs)?
    • Yes
  2. If I have two node pools, one with low memory and one with high memory, is Kubernetes smart enough to put the pods that use a lot of memory in the pool with high memory? Can it do it on the fly?
    • You can define requests and limits to ensure that pods are scheduled on nodes with sufficient resources. You can also use node affinity/anti-affinity to influence where pods go, or taints and tolerations to block pools from accepting pods unless explicitly specified. More details on how Kubernetes handles resources can be found here.
  3. How do preemptible nodes in the cluster work? If I have two nodes in the cluster and each node has 4 pods, when one node is preempted, will Kubernetes move the 4 pods to another node? And when the other node is available again, will Kubernetes rebalance the location of the pods?
    • Preemptible nodes are created as a node pool. If your node pool is preempted and becomes unavailable, GKE will attempt to reschedule your pods on other available nodes (unless your deployment spec is set up to deploy only to preemptible nodes). When the node pool becomes available again, GKE will schedule appropriate pods there (it will not rebalance automatically).
  4. How does in-memory disk cache work in the cluster? I’m referring to this cache. Does each pod have its own cache, or is the cache shared between all the pods on the node?
    • Each pod gets its own storage/cache, isolated from other pods, as an emptyDir volume created along with the pod. By default, emptyDir volumes are stored on whatever medium is backing the node: that might be disk, SSD or network storage, depending on your environment. However, you can set the emptyDir.medium field to “Memory” to tell Kubernetes to mount a tmpfs (RAM-backed filesystem) for you instead. While tmpfs is very fast, be aware that, unlike disks, tmpfs is cleared on node reboot and any files you write will count against your container’s memory limit.
  5. We have an application that uses the name of the machine on which it is running to decide what it will do. The machine names must form a sequence from 0 to the total number of machines minus 1. Example: maq-0, maq-1, maq-2. Each application instance also needs to know the total number of machines. How can this be done in Kubernetes? I was able to do it using a StatefulSet, but the total number of pods ended up hardcoded in the application. Any suggestions?
    • Assuming that we are looking at machines as pods, then StatefulSets are required. StatefulSets provide a stable network ID which allows for the naming convention specified. The current number of replicas can be acquired using the StatefulSetStatus v1 apps Kubernetes API.
  6. Is it possible to create a StatefulSet without a headless Service? I am asking because the application I want to run in the StatefulSet does not need to be accessed by anyone (neither inside nor outside the cluster).
    • No, StatefulSets require a headless Service. You can use network policies to deny ingress.
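
For question 5, here is a minimal shell sketch of how a pod could work out its own index from its StatefulSet hostname instead of relying on a hardcoded value. The function name pod_ordinal and the StatefulSet name maq are only illustrative, and the kubectl call in the comment assumes the pod's service account is allowed to read StatefulSets.

```shell
# Sketch: derive a pod's ordinal from its StatefulSet name (e.g. "maq-2" -> 2).
# StatefulSet pods are named <statefulset-name>-<ordinal>.
pod_ordinal() {
  name="$1"
  ordinal="${name##*-}"   # drop everything up to and including the last "-"
  case "$ordinal" in
    ''|*[!0-9]*)
      echo "not a StatefulSet pod name: $name" >&2
      return 1
      ;;
  esac
  echo "$ordinal"
}

pod_ordinal "maq-2"   # prints 2

# Inside a pod, the hostname is the pod name, so this would be:
#   pod_ordinal "$(hostname)"
# The replica count could be read from the API rather than hardcoded, e.g.
# (assuming the StatefulSet is called "maq" and kubectl has access):
#   kubectl get statefulset maq -o jsonpath='{.status.replicas}'
```

Reading the count at startup keeps the application in sync with the StatefulSet, but it will not notice a later scale-up or scale-down unless it queries again.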

How I reduced 48% of my cloud cost by using Google Cloud preemptible instances

Google Cloud Platform has an amazing feature that few people use, partly because it is little known, but mainly because it is hard to design a system architecture that can take advantage of it. This feature is preemptible instances. How does it work? Simple: you get a virtual machine like any other, except that it will be shut down at an unpredictable moment within 24 hours and may occasionally be unavailable for short periods. The advantage: preemptible instances cost less than half the price of regular ones.

Usually, people use this kind of machine for servers that run workers or asynchronous jobs, the kind of application that does not need 24/7 availability. In my case, I could use preemptible instances for my internal API, an application that does need 24/7 availability. This internal API can’t go offline, so I solved the availability problem by running many servers in parallel behind an HAProxy load balancer. In basically 3 steps, I reduced my API infrastructure cost by almost 50%.

Step 1 – Set up the client to be fault tolerant

My code is written in Scala. Basically, I made the client retry a request whenever it fails. This is necessary because, even with the API machines behind the load balancer, the load balancer takes some time (seconds) to notice that a specific machine is down, so it occasionally sends requests to unavailable machines. The client code snippet is:

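import scala.concurrent.ExecutionContext.Implicits.global

// Assumes api.query returns a scala.concurrent.Future; Params stands in for the request type.
def query(params: Params, retries: Int = 0): Unit = {
  val response = api.query(params)
  response.onSuccess {
    case result => // handle the successful response here
  }
  response.onFailure {
    case x =>
      LOG.error(s"Failure on $retries try of API request: " + x.getMessage)
      Thread.sleep(retries * 3000) // this sleep is optional
      query(params, retries + 1) // there could be a maximum number of retries here
  }
}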

Step 2 – Put all servers behind a load balancer

I created an HAProxy config file that I can auto-update based on a list of servers that I get from the gcloud command line. Here is the script that rewrites the HAProxy config file with a list of all servers that have a specific substring in their names:

# Keep every line of the current config that does not mention the servers
EMPTY_FILE=$(grep -v "$SERVER_SUBSTRING" /etc/haproxy/haproxy.cfg)
# Build one "server" line per matching instance, using its internal IP (4th column)
NEW_LINES=$(gcloud compute instances list | grep "$SERVER_SUBSTRING" | sed 's/true//g' | sed 's/ [ ]*/ /g' | cut -d" " -f4 | awk '{print " server playax-fingerprint" $NF " " $NF ":9000 check inter 5s rise 1 fall 1 weight 1"}')
echo "$EMPTY_FILE" >new_config
echo "$NEW_LINES" >>new_config
sudo cp new_config /etc/haproxy/haproxy.cfg
sudo ./restart.sh

The restart script reloads the haproxy configuration without any outage.
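
As a rough sketch of the same idea, the line-building step can be isolated into a small function that is easy to test without calling gcloud. The function name server_lines is made up for this example; the server name prefix, port and check options are copied from the script above. The gcloud flags in the comment are an assumption about how the IP list could be fetched directly, not what the original script does.

```shell
# Sketch: turn a list of IPs (one per line on stdin) into haproxy "server" lines.
# Blank input lines are skipped.
server_lines() {
  awk 'NF { print " server playax-fingerprint" $1 " " $1 ":9000 check inter 5s rise 1 fall 1 weight 1" }'
}

# Example with fixed input. In practice the IPs could come from something like:
#   gcloud compute instances list --filter="name~$SERVER_SUBSTRING" \
#     --format="value(networkInterfaces[0].networkIP)"
printf '10.0.0.2\n10.0.0.3\n' | server_lines
```

Before copying a generated file over /etc/haproxy/haproxy.cfg, it can be checked with haproxy -c -f new_config, so that a malformed list never reaches the running load balancer.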

Step 3 – Create an instance group for these servers

By creating an instance template and an instance group, I can easily add servers to the infrastructure or remove them from it. The preemptible setting is on the instance template page in the Google Cloud console.

  1. Create an instance template with the preemptible option checked
  2. Create an instance group that uses that template

One very important warning: plan your capacity so that 20% of your servers can be down at any time (remember that preemptible instances are occasionally unavailable). In my case, I had 20 servers before using the preemptible option. With the preemptible option on, I changed the group to 25 servers.

                              Before    After
Servers                       20        24
Cost per server per hour      $0.07     $0.03
Total cost per hour           $1.40     $0.72
Total cost per month          $1,008    $518

Price reduction: $490 per month, or 48.6%
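
The numbers in the table can be reproduced with a few lines of shell. This is only a sanity check, assuming a 30-day month with the servers running 24 hours a day, which is what the monthly totals imply; monthly_cost is a helper name made up for this sketch.

```shell
# Sketch: monthly cost = servers * hourly price * 24 hours * 30 days.
monthly_cost() {
  # $1 = number of servers, $2 = hourly price per server in USD
  awk -v n="$1" -v p="$2" 'BEGIN { printf "%.2f\n", n * p * 24 * 30 }'
}

before=$(monthly_cost 20 0.07)   # 1008.00
after=$(monthly_cost 24 0.03)    # 518.40

# Absolute and relative saving between the two setups
awk -v b="$before" -v a="$after" \
  'BEGIN { printf "saved $%.2f per month (%.1f%%)\n", b - a, (b - a) / b * 100 }'
# prints: saved $489.60 per month (48.6%)
```

The saving holds even though the preemptible setup runs more servers, because the per-server price drops by more than the server count grows.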

[Figure: graphs of server usage over one day. Note how many outages there are, yet the application ran perfectly.]

Comparing cloud services for Startups

Every startup that has services online needs a cloud provider. Startups do not have time to build their own physical server infrastructure. They need to focus on developing their product or service. But which cloud should they use? There are so many different options, and CTOs do not have time to test each one of them. Maybe this post will help new startups choose among the cloud providers available.

The experience that I had with Playax was not typical, for two reasons. The first is that I have a lot of experience working with cloud services: after working at Locaweb for 5 years, developing software for its internal cloud team, I spent one year of my PhD studying cloud services. The second is that Playax’s product is highly dependent on the cloud. We are a big data company. We needed a big infrastructure from day one. Our MVP needed a lot of cloud resources to be useful to our customers. Most startups do not need that much infrastructure, at least not before they start growing fast.
