I have a serious concern with the EC2 Container Service. The way it is built facilitates use of infrastructure and not the use by applications. Just my opinion based on what I know so far.
I am still amazed that many developers, when they create a feature or fix a bug, don’t write a unit test (or functional test) to ensure that things work as expected. Years ago I had a developer on my team who would not write tests. The impact was that they would report a bug as fixed, pass the application over to QA for testing, and then QA would fail the test and the developer would have to continue to work on fixing that bug. No matter how hard I tried, this developer just would not test their code or write a test to ensure that things were working as expected. Unfortunately that behavior led to us letting them go.
Just today I was reading an email from one of the insider communities I belong to. There was a bug reported with a new release of software that seemed to me to be an obvious bug, one that was most likely seen before. It was a bug that certainly should have been covered by a test when the feature was created. It made me think “How could this bug have gotten to a release version of software?” If only they had a test to cover the feature or bug.
I definitely recommend that you write tests as you are writing code and hopefully your software is relatively bug free. However, when bugs happen, and they do, writing a test to ensure that the bug is fixed is a great way to increase the number of tests for your application over time.
I recently had a conversation with another architect about docker. We were discussing where I thought docker was applicable. Our discussion highlighted that he was concerned that docker was labeled “The Next Big Thing” and that everyone wants to use it because of the hype. He referred to docker as a “panacea” commenting on the fact that everyone expects docker to just solve all their problems. I couldn’t agree with him more. Docker has its uses and it is a great technology. It is not going to solve all your problems. However, that does not prevent one from using docker to solve some problems that continue to vex us today.
One area where I think docker is especially useful and still untapped is in solving resource utilization issues. I think this is so game changing that we will see a shift similar to what we saw when virtualization and cloud computing became mainstream. Let’s start by discussing applications and how they use resources.
Applications Use Resources
Developers write applications. Those applications use resources. What resources do they use? That depends on the application. Some applications like mathematical models and machine learning algorithms are processor intensive and therefore use a lot of CPU. Other applications like databases, file servers, and web servers are disk intensive and therefore use a lot of disk I/O. Yet other applications such as graphics-based applications use lots of memory. Of course all applications use some combination of resources (i.e. CPU, memory, disk, and networking). However, they don’t use all these resources all the time.
Then you need to take into account the various parts or tiers of an application. Most applications that are built today are distributed applications. That means they are partitioned in some way either by tiers (such as front, middle, and back-end) or functionality (such as user interface, authentication, management, etc.) or both. The effect of this is usually more and more servers dedicated to a particular function or tier of your application. The result though is less resource utilization per server. Wait, that is not all. You need to make each of these functions highly available. So now you are adding additional virtual machines for high availability, further decreasing the resource utilization of your servers. Don’t forget disaster recovery! You may need to allocate more servers there too. Welcome to server sprawl.
Stacking Our Application
So what is the solution to this server sprawl? That is easy! Start comingling functionality together. I like to call this “stacking” your application. What I mean by “stacking” is to deploy multiple parts of an application to virtual machines to better utilize the resources available on that virtual machine. Often this is done by comingling all the parts of an application across all virtual machines. That way you don’t have a single virtual machine dedicated to one part of your application. Unfortunately comingling your application in this way is not always the best approach and can lead to disaster. Let’s discuss a real world example.
Mobile Application with Cloud Services
Last year we were developing a mobile application that had three sets of cloud-based services: authentication, management, and application. Authentication was an OAuth token service that was responsible for handing out tokens to access our management and application services. Management was for user management and provided functions like change password, create user, etc. Our application services were a set of core APIs that provided the functionality of the application. Each of these services had a different resource profile. For example, authentication was very CPU intensive because of the encryption and hashing algorithms, management was a mix of both CPU and I/O (both network and disk), and application was almost purely network and disk based operations. When you looked at the resource profile of our application in CloudWatch you could easily see that the various parts of our application were wildly different. Unfortunately what that meant was that there were resources on each of the machines that were not being utilized. You know what they say, “an idle resource is the devil’s play thing”.
You might want to start comingling all the parts of the application together to have fewer servers and therefore drive down costs. Unfortunately we could not do this. We learned early on in development that comingling all the services together was not the right thing to do. The “not so funny” story was that our initial QA environment had all services comingled on one server. We did this to save money. Unfortunately we ran into a scenario where a mobile client with poor network connectivity would treat the resulting errors as an authentication failure. Those few mobile clients would then request a new access token from the authentication service. As you know, the authentication service is CPU intensive. The single server that was handling our QA environment was overwhelmed with authentication requests. The negative effect of this was that all mobile clients were impacted, even those with great (5 bars) network connectivity. Why? Those clients had good authentication tokens and were trying to access the management and application services on the same machine. Unfortunately that machine was swamped with authentication requests. Yes, we denial-of-service attacked ourselves. Not intentionally of course and thankfully in QA. Fortunately for us this was very early on in development and we knew what was happening immediately. Thank you Dynatrace.
So what did we do? We deployed our services to physically separate servers. Our services were already built as functionally partitioned services. So we were able to solve this problem in a matter of minutes. Thank you Octopus Deploy! Unfortunately deploying to completely separate servers is the opposite end of the spectrum. This is comingling of nothing.
Somewhere In Between
When we finally deployed our application to production, we did so by deploying all three services onto their own set of hardware. We completely functionally partitioned the application. At the time this was really the best choice given the circumstances. Docker at the time was not natively supported by either Amazon Web Services or Azure. Our services were built on .NET and Windows. Even now Windows does not support containerization using docker. It’s coming, but it still is not here yet. We could have gone with Ubuntu running Mono/.NET. This was definitely possible, but we had enough things to worry about and could not add this to the mix. All told we had about 9-12 servers to support our needs. Looking back I find this very frustrating. If technology had just caught up or if we had more time we could have spent the time and gotten our server count to about 5 servers. That is a 44-58% reduction in our server count and therefore a similar reduction in our costs. If only we had docker to help us with this.
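The 44-58% figure is easy to check. Here is a quick sketch in Python; the function name `reduction_percent` is just something I made up for illustration, and the only real inputs are the server counts mentioned above.

```python
def reduction_percent(before: int, after: int) -> float:
    """Percentage reduction when going from `before` servers to `after` servers."""
    return (before - after) / before * 100


# Server counts from the text: 9-12 servers deployed, about 5 if we had stacked.
low = reduction_percent(9, 5)    # starting from the low end of the range
high = reduction_percent(12, 5)  # starting from the high end of the range

print(f"{low:.0f}% to {high:.0f}% fewer servers")
```

Going from 9 servers to 5 gives the low end of the range and going from 12 to 5 gives the high end.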
Helping fill the gaps in our resource utilization footprint is where I think docker will be most useful long term. Certainly docker does some amazing things like providing a very portable container technology for your application. This is helping teams take their application and deploy to development, QA, and production with no changes. When I say no changes, I mean no changes. What you deployed in development is exactly what you deployed in QA or production. This is one of the great things about docker. It helps you manage your dependencies, both application and environmental dependencies. This is done either by abstracting away specific dependencies so you can deploy to different environments more easily or by ensuring that your application has access to those things it needs to run.
Docker as a portable container technology which helps you manage dependencies is a great start. This gets two thumbs up from me and I give it my “No Guessing” seal of approval. “No Guessing” is the term I use when I catch either myself or one of my colleagues trying to solve a problem without having enough information to understand the problem. In this case, docker takes much of the guesswork out of deployments and allows us to focus on being more productive. So as a productivity tool I highly recommend docker. Just know that you need to invest time in learning docker first. Once you do this you will start seeing increased productivity with your teams.
So we still haven’t addressed the issue of how docker will handle your resource utilization issues. That of course is what prompted this article. Let’s start off by talking about two sets of technologies, one specific to Linux and one specific to docker. Those technologies are cgroups in Linux and Docker Swarm.
cgroups on Linux
cgroups (aka control groups) is a Linux kernel-level feature that provides resource limitation, prioritization, accounting, and control. This allows us to manage what resources a process sees and how that process can use those resources. This becomes useful if you have various types of processes whose resource needs can be carved up and allocated in discrete chunks.
Docker is all about managing containers and the processes contained within them and because of this docker has first class support for cgroups. You can use cgroups to set limits on a docker container such as CPU (i.e. cpu shares), memory, and number of cores. Network and disk I/O limits are also possible but may require some additional steps. Setting these limits allows you to specify the maximum resources that a docker container can use.
To an infrastructure guy or your everyday developer, this sounds fantastic. You can build your applications as a set of docker containers. Then you can specify the maximum resources each container can use and start deploying them wherever it makes sense. I am here to tell you that it is not that simple.
The first thing to understand is that how applications use resources changes over time. New versions of an application are going to have changes to the code to support new features, bug fixes, refactoring, and much more. Each version needs to be treated almost independently. At a minimum, you need to revisit your limits as you deploy new versions of your application. For example, what if your application went from being single-threaded to being multi-threaded? Limiting the number of cores or CPU shares may have an adverse effect on your application such as making your application slower.
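To make this a bit more concrete, here is a minimal sketch in Python that builds a `docker run` command line with cgroup-backed limits. The helper name `docker_run_command`, the image name `example/auth-service`, and the specific values are all hypothetical. The flags `--cpu-shares`, `--memory`, and `--cpuset-cpus` are the standard `docker run` options for CPU shares, memory, and core pinning.

```python
import shlex
from typing import List, Optional


def docker_run_command(image: str,
                       cpu_shares: Optional[int] = None,
                       memory: Optional[str] = None,
                       cpuset_cpus: Optional[str] = None) -> List[str]:
    """Build a `docker run` argument list with optional resource limits."""
    cmd = ["docker", "run", "-d"]
    if cpu_shares is not None:
        # Relative CPU weight compared to other containers on the host.
        cmd += ["--cpu-shares", str(cpu_shares)]
    if memory is not None:
        # Memory limit, for example "256m" or "1g".
        cmd += ["--memory", memory]
    if cpuset_cpus is not None:
        # Cores the container may run on, for example "0,1" or "0-3".
        cmd += ["--cpuset-cpus", cpuset_cpus]
    cmd.append(image)
    return cmd


# Hypothetical image name and limits, for illustration only.
cmd = docker_run_command("example/auth-service",
                         cpu_shares=512, memory="256m", cpuset_cpus="0,1")
print(shlex.join(cmd))
```

One thing to keep in mind: `--cpu-shares` is a relative weight that only matters when containers are competing for CPU, while `--memory` and `--cpuset-cpus` are hard constraints.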
The second thing to understand is that applications may consume more resources at different times. I only mention this because I think this type of time-based view needs to be removed from your thinking. We are in the cloud and we should be able to scale on-demand. Unfortunately traditional applications (web, enterprise, other) have always had a temporal component to them that causes us to provision for peak usage. Great examples of this are the 1-800-Flowers and Domino’s websites. Each of these has peak usage during certain times of year. 1-800-Flowers gets peak demand on Mother’s Day, Valentine’s Day, etc. Domino’s gets peak demand on Super Bowl Sunday.
The third thing is that the act of limiting resources on a container has an overhead on performance (and scalability). As an example, limiting the CPU shares of a process should cause whatever operation is being performed to take longer. The assumption is that the operation is a CPU intensive operation. Then if you consider the overhead of limiting the CPU shares I question whether using a technology like cgroups to limit resources is a good thing. Why? The cost of performing the operation just increased because of the increase in duration and overhead of the resource manager.
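Here is a rough back-of-the-envelope model of the duration part of that argument, written in Python. It assumes a simple proportional-share model: every container is busy, they all compete for a single core, and each gets CPU time in proportion to its shares. The function names and share values are made up for illustration, and the model deliberately ignores the overhead of the resource manager itself, which would only make the numbers worse.

```python
from typing import List


def cpu_fraction(shares: int, all_shares: List[int]) -> float:
    """Fraction of CPU a container gets when every container is busy."""
    return shares / sum(all_shares)


def duration_seconds(cpu_seconds_needed: float, fraction: float) -> float:
    """Wall-clock time for a CPU-bound operation given a share of one core."""
    return cpu_seconds_needed / fraction


# Hypothetical share values for three busy containers on one core.
all_shares = [1024, 512, 512]

fraction = cpu_fraction(512, all_shares)   # the container we care about
limited = duration_seconds(10.0, fraction) # an operation needing 10 CPU-seconds
unlimited = duration_seconds(10.0, 1.0)    # the same operation with the core to itself

print(f"share of CPU: {fraction:.2f}")
print(f"duration with contention: {limited:.0f}s, alone: {unlimited:.0f}s")
```

In this toy scenario the operation takes four times as long, and that is before you add in any overhead from the resource manager.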
I have an extensive background in writing high performance and parallel computing code. Because of this I am keenly aware of various types of OS schedulers, HPC schedulers, task-based parallel frameworks, and more. There are many things I have learned over the years. One of the more interesting lessons is that resource managers add overhead and can, and often do, hinder applications. In many cases, the overhead of the resource manager and the impact on the applications can be worse than not using the resource manager at all.
Of course infrastructure people are probably cursing me right now. Don’t use resource managers? WAT? That is not what I said. There are perfectly justifiable scenarios where resource managers make sense. One of the biggest is to limit exposure to bad applications. Let’s face it. Not everyone writes great code all the time. Another area for using resource managers is ensuring quality-of-service and consistency of service. I am sure there are other scenarios as well.
All of this comes down to one question. What is a better way? We will continue this discussion in the next blog post. In the meantime, check out this article on gathering container metrics.