How I reduced 48% of my cloud cost by using Google Cloud preemptible instances

Google Cloud Platform has an amazing feature that few people use, partially because it is unknown, but mainly because it is very difficult to set up a system architecture that allows you to use. This feature is preemptible instances. How does it work? Simple: you have a virtual machine like any other, except that this VM will shutdown unexpectedly within 24 hours and be eventually unavailable for short periods. The advantage: this preemptive instances cost less than 50% compared to the ordinary machine.

Usually, people use this kind of machine for servers that run workers or asynchronous jobs, a kind of application that does not need 24/7 availability. In my case, I could use the preemptible instances for my internal API, an application that do need 24/7 availability. This internal API can’t stay offline, so the way I solved the unavailability problem was by running many servers in parallel  behind a haproxy load balancer. So, in basically 3 steps I could reduce my API infrastructure cost by 50%.

Step 1 – Setup the client to be fault tolerant

My code is in Scala language. Basically, I made the client to repeat a request when it eventually failed. This is necessary because, even if the API machines are behind the load balancer, the load balancer takes some time (seconds) to realize that a specific machine is down, so eventually it sends some requests to unavailable machines. The client code snippet is:

def query(params, retries = 0) {
  val response = api.query(params)
  response.onSuccess {
    codeForSuccess()
  }
  response.onFailure {
    case x => {
      LOG.error(s"Failure on $retries try of API request: " + x.getMessage)
      Thread.sleep(retries * 3000) //this sleep is optional
      query(params, retries + 1) //the could be a maximum number of retries here
    }
  }
}

Step 2 – put all servers behind a load balancer

I created a haproxy config file that I can auto-update based on a list of servers that I get from the gcloud command line. Here is the script that re-writes the haproxy config file with a list of all servers that has a specific substring in their names:

#!/bin/bash
SERVER_SUBSTRING=playax-fingerprint
EMPTY_FILE=`cat /etc/haproxy/haproxy.cfg |grep -v $SERVER_SUBSTRING`
NEW_LINES=`gcloud compute instances list |grep $SERVER_SUBSTRING | sed 's/true//g' |sed 's/ [ ]*/ /g'|cut -d" " -f4|awk '{print " server playax-fingerprint" $NF " " $NF ":9000 check inter 5s rise 1 fall 1 weight 1"}'`
echo "$EMPTY_FILE" >new_config
echo "$NEW_LINES" >>new_config
sudo cp new_config /etc/haproxy/haproxy.cfg
sudo ./restart.sh

The restart script reloads the haproxy configuration without any outage.

Step 3 – create an instance group for these servers

By creating an instance template and an instance group, I can easily add or remove servers to the infrastructure. The preemptible configuration is inside the instance template page in google cloud panel.

  1. Create an instance template with preemptible option checked
  2. Create an instance group that uses that template

Screen Shot 2016-05-04 at 10.40.58 PM

 

Screen Shot 2016-05-04 at 10.41.18 PM

One very important warning is that you need to plan your capacity to allow 20% of your servers to be down (remember that preemptible instances eventually are out). In my case, I had 20 servers before using the preemptible option. With the preemptible on, I changed the group to 25 servers.

Before After
Servers 20 24
Cost per server $0.07 $0.03
Total cost per hour $1.4 $0.72
Total cost per month $1,008 $518

Price reduction:  $490 or 48.6%

Graphs of server usage along 1 day (observe how many outages there are, but application ran perfectly ):

Screen Shot 2016-05-04 at 11.12.36 PM

Comparing cloud services for Startups

nuvemEvery Startup that has services online needs a cloud provider. Startups do not have time to build their own physical server infrastructure. They need to focus on their product or service development. But what cloud to use? There are so many different options, and CTOs do not have time to test each one of them. Maybe this post will help new Startups  to choose between all cloud providers available.

The experience that I had with Playax was not typical, for two reasons: the first was that I have a lot of experience working with cloud. After working at Locaweb for 5 years, and developing software for internal cloud team, I spent one year in my PhD studying cloud services. The second reason is that Playax product is highly dependent from cloud. We are a BigData company. We needed a big infrastructure from day one. Our MVP needed a lot of cloud resources to be useful to our customers. Most of Startups do not need that much infrastructure, at least not before it starts growing fast.

Continue reading

DevOps patterns to scale web applications using cloud services

This article was accepted to publication at SPLASH 2013Wavefront Experience track.

Scaling a web applications can be easy for simple CRUD software running when you use Platform as a Service Clouds (PaaS). But if you need to deploy a complex software, with many components and a lot users, you will need have a mix of cloud services in PaaS, SaaS and IaaS layers. You will also need knowledge in architecture patterns to make all these software components communicate accordingly.

In this article, we share our experience of using cloud services to scale a web application. We show usage examples of load balancing, session sharing, e-mail delivery, asynchronous processing, logs processing, monitoring, continuous deployment, realtime user monitoring (RUM). These are a mixture of development and system operations (DevOps) that improved our application availability, scalability and performance.
Continue reading

Small EC2 cloud usage demo for choreographies

The CHOReOS middleware must be capable of providing the required runtime support to deploy, enact, monitor, and dynamically reconfigure large-scale choreographies. These choreographies might be large scale in one or more of the following dimensions: number of requests, users, roles, services, nodes, and communication among services. For instance, the middleware should be scalable enough to accommodate choreography with 1 thousand simultaneous users or with 100 different roles, or with 100 services for a given role, or with thousands of messages exchanged per second.

Continue reading

DevOpsDays – Agilidade em todos os níveis

Sexta passada estive no evento DevOpsDays, que aconteceu em Santa Clara (Califórnia), no escritório central do LinkedIn. O termo DevOps (criado por Patrick Debois) surgiu no final do ano passado, mais ou menos na época em que Andrew Schafer e Paul Nasrat deram uma palestra na Agile 2009 sobre Infraestrutura Ágil.

Onde Surgiu DevOps?

DevOps tem vários signifcados. O mais óbvio deles, como o próprio termo já diz, significa a união de Desenvolvedores (devs) e Operadores (ops) de Sistemas (também conhecidos como SysAdmins).

Em startups, é muito comum que não exista separação entre Devs e Ops. Nessas empresas, os técnicos sabem tanto escrever o software como dar manutenção e administrar os servidores de produção. Conforme as empresas crescem, começa a surgir a necessidade de especialização nas áreas de desenvolvimento e sysadmins (DBAs, Storages, Rede, Linux, Windows, etc). Problemas começam a surgir quando criam-se silos e a empresa fica dividida entre aqueles que criam o software e aqueles que mantém tudo funcionando em produção. Essa divisão pode ser muito nociva para a empresa, uma vez que os profissionais, ao invés de colaborarem para o sucesso da empresa, ficam num jogo de apontar o dedo um para o outro, na busca de um culpado que, convenhamos, pouco importa para o negócio.

DevOps tem o objetivo de trazer os conceitos e boas práticas aprendidas pelos Engenheiros de Software Ágeis para o mundo dos SysAdmins. Não só isso, DevOps também procura clarear para os desenvolvedores as preocupações (justas) e práticas dos SysAdmins. O principal trabalho do SysAdmin é manter tudo no ar. Qualquer coisa a mais que o desenvolvedor quiser, coloca em risco o trabalho o SysAdmin. O desenvolvedor tem que entender isso e trabalhar como parceiro do SysAdmin. Ele tem que se preocupar para que nada quebre em produção e estar disponível para ajudar o administrador caso algo dê errado. Faz parte do trabalho de devs e ops estarem alinhados e colaborarem um com o outro.
Continue reading

Cloud Computing x Grid Computing

We all already know that Cloud Computing is a buzzword since its unknown origin. Moreover, we also know that many concepts of cloud computing are not as knew as some companies and their marketing department pretend they are. Specially when compared to Grid Computing, which is a more than 13 year old paradigm, there are many similarities (and also some differences). I’ll try to expose some of these points here.

Question no. 1: Is “Cloud Computing” just a new name for Grid?

YES, in the sense that both aim to reduce cost of computing, increase reliability and flexibility

But NO: Grid is more than 10 years ago, when we didn’t have the computer power available today. They context and scale it was made to operate (expensive hundreds of machines clusters) is different of today’s available infrastructure (hundreds of thousands of “low cost” computers and virtual machines created within them). Grid and Cloud operate in different scales.

Nevertheless, YES: Cloud and Grid problems are mostly the same. The details are different, but both deal with the same issues.

Question no. 2: What is Cloud Computing?

Yes, I know, there are a lot of different definitions of what Cloud Computing is and not much consensus between those definitions [2]. Then, I choose the best and most complete definition I found til now. It is the US National Institute of Standards and Technology (NIST) definition. Their definition is short and complete, when they say that Cloud Computing is

 “a model for enabling convenient, on-demand network access to a shared pool of configurable resources (e.g., networks, servers, storage, applications, and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction” [1]

In other words, Cloud is meant to be:

  1. Massively scalable
  2. Encapsulated as an abstract entity that delivers different levels of services
  3. Driven by economies of scale [3]
  4. Dynamically configured

Question no. 3: What is Grid Computing?

Grid Computing, through well defined standard protocols, aims to

“enable  resource sharing and coordinated problem solving in dynamic, multi-institutional virtual organizations” [4][5]

this means that Grid:

  1. Is distributed computing
  2. Operates across multiple federated organizations
  3. Coordinates resources that are not subject to centralized control
  4. Uses standards, open, and general-purpose protocols
  5. Delivers non-trivial QoS
While points 1, 2 and 3 holds true also for Cloud Computing, points 4 and 5 are still a challenge in the Cloud area.

Question no. 4: How to compare Cloud and Grid side-by-side? [6]

The following Figure is an insightful overview when we try to compare Grid and Cloud:
 

References

[1] MELL, P. and GRANCE, T. 2009. Draft NIST Working Definition of Cloud Computing.
[2] “Twenty Experts Define Cloud Computing”, SYS-CON Media Inc, 2008.
[3] J. Silvestre. “Economies and Diseconomies of Scale,” The New Palgrave: A Dictionary of Economics, v. 2, pp. 80–84, 1987.
[4] I. Foster, C. Kesselman, S. Tuecke. The anatomy of the Grid: Enabling scalable virtual organization. The Intl. Jrnl. of High Performance Computing Applications, 15(3):200–222, 2001.
[5] I. Foster. What is the Grid? A Three Point Checklist, July 2002.
[6] FOSTER, I., ZHAO, Y., RAICU, I. and LU, S. 2008. Cloud Computing and Grid Computing 360-Degree Compared. In Grid Computing Environments Workshop (GCE ’08), Austin, Texas, USA, November 2008, 1-10.