A DevOps approach for redesigning an aging product
When a brand-new project is started from scratch, you are free to choose between any and all technologies, architecture, infrastructure and so on. But do you consider whether these decisions will still be viable a year, let alone five years, down the line after the product evolves? We often call such outdated systems “legacy systems”, and the term has become an excuse for any issue: “we cannot do anything, it’s a legacy system”.
To this excuse, we say "no!" You can change a legacy system, especially when you hit a wall with system performance or hosting costs. In this case you have two choices: exit the market or engage in a product redesign.
Recently, I faced exactly this situation with the project I am currently working on. It is a complex web and mobile solution hosted on the Windows Azure Cloud, and it has already been live for three years. The project is successful and growing in all directions; the customer count is constantly increasing and new features are continually being developed to meet customer needs. While new features were always priority number one, the complexity that naturally comes with growth was slowing the system down and even causing outages.
These outages were the main reason to reprioritize tasks, pushing product redesign tasks to the top of the backlog. The system is complicated, consisting of multiple MVC websites, a Web API layer, Android and iOS mobile apps, several Microsoft SQL databases and an ElasticSearch cluster. The first question was, “where is the main bottleneck?”
Enable system visibility and analysis
When we initiated the redesign, our system was like a black box. We were utterly unable to tell which parts were performing and where the bottlenecks were. We gained visibility with a very handy tool called NewRelic, which let us see what was going on inside the system. NewRelic logs each transaction, allowing us to analyze it in enough detail to identify the component that was killing our system performance.
Also with NewRelic, we defined key transactions that are critical to our business. We keep an eye on these transactions via SLA Reports based on key transaction response time, error rate and computed measurements, as follows:
- Percent satisfied: the percentage of monitor results that are completed in a "satisfying" time. A satisfying time is defined as a monitor result that is completed in Apdex T or less.
- Percent tolerated: the percentage of monitor results that are completed in a "tolerable" time. A tolerable time is greater than Apdex T, but less than Apdex F (four times Apdex T).
- Percent frustrated: the percentage of monitor results that are completed in a "frustrating" time. A frustrating time is greater than Apdex F (four times Apdex T).
Apdex T is a customizable variable (in milliseconds), and it can be set individually on each key transaction.
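In code, the three buckets reduce to two comparisons against Apdex T. A minimal sketch of the classification (the thresholds used in the test values are illustrative, not NewRelic defaults):

```python
def apdex_bucket(response_ms, apdex_t_ms):
    """Classify one monitor result against the Apdex threshold T.
    Satisfied: <= T; Tolerated: <= 4*T (Apdex F); Frustrated: > 4*T."""
    if response_ms <= apdex_t_ms:
        return "satisfied"
    if response_ms <= 4 * apdex_t_ms:
        return "tolerated"
    return "frustrated"

def apdex_percentages(samples_ms, apdex_t_ms):
    """Percentage of satisfied/tolerated/frustrated results in a sample set."""
    buckets = [apdex_bucket(s, apdex_t_ms) for s in samples_ms]
    n = len(buckets)
    return {k: 100.0 * buckets.count(k) / n
            for k in ("satisfied", "tolerated", "frustrated")}
```

Because Apdex T is set per key transaction, the same response time can count as satisfying for a slow report endpoint and frustrating for a fast lookup endpoint.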
Analyzing these live measurements made it easier to understand which direction the key transactions were moving each week. For slow transactions, NewRelic creates traces with breakdown tables in a more detailed view (the picture below); it’s all laid out right in front of your eyes. In our case, the main bottleneck was Microsoft SQL Server, which had to cope with a heavy load from the web applications and the API.
We put in a couple of weeks’ worth of effort optimizing the load, but no breakthrough improvement was achieved. We then started scaling up the Azure SQL Database from a P2 to a P11 instance size, and as a result, the hosting costs grew steeply.
Set goals and choose an approach
Despite scaling up our Azure SQL Database instances, the system was still too slow and didn’t meet our expectations. Moreover, hosting expenses exceeded five percent of revenue. It had become clear that it was no longer worth investing in the same infrastructure. We had to change our view and find a different kind of solution to the problem.
We decided to reorganize the infrastructure with a set of goals, as follows:
- Decrease hosting costs
- Increase scalability and flexibility
- Adopt infrastructure that fits the current system load
- Provide more freedom for the team to control software and hardware configurations
We saw that the Windows Azure Cloud no longer fit our needs or goals. We began to consider alternative hosting providers: Rackspace and the Amazon AWS cloud. After cost calculations, we chose Amazon AWS.
Execute the plan
We categorized our tasks into four main areas where we saw the system should be improved. The main focus was on the infrastructure area, but the others were equally important, because they strengthen reliability and security.
We reviewed the cloud services infrastructure that we had on the Windows Azure cloud, and then measured the required as well as the maximum throughput of the system. With that information, we could decide what type of infrastructure and what components would be used for the redesigned product. The main components we chose are listed below. All components communicate via a 10Gbit network and use provisioned IOPS (SSD) where possible.
- 2 x EC2 Web servers (main), (c4.xlarge, 4 vCPU, 7.5 GB Memory)
- 2 x EC2 API servers (c4.xlarge, 4 vCPU, 7.5 GB Memory)
- 2 x EC2 Web servers (Customer portal, public website, admin), (c4.large, 2 vCPU, 3.75 GB Memory)
Our web servers are load-balanced, which makes it complicated to keep EC2 instance configurations synchronized. All EC2 instances are joined to the domain via AWS Simple AD, enabling us to use DFS: changes made on any EC2 instance are replicated across the board, so all instances stay up to date.
RDS with Microsoft SQL Server
- 2 x (r3.xlarge, 4 vCPU, 30.5 GB Memory)
- 1 x (r3.2xlarge, 8 vCPU, 61 GB Memory)
We ran performance tests on both alternatives: Microsoft SQL Server on EC2 and Microsoft SQL Server on RDS. The performance-to-cost analysis showed that RDS was the better option. Additionally, Amazon RDS provides several database engines to choose from, including the well-known Microsoft SQL Server, PostgreSQL, Oracle and MySQL. In particular, PostgreSQL attracted our attention: a PostgreSQL RDS implementation supports mirroring, with the ability to read from the mirror replica. We liked the idea of balancing the load with this setup by pointing read-only queries (e.g. reports) at the mirror.
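The routing idea itself is simple to sketch: read-only work goes to the replica endpoint, everything else to the writer. Both endpoint names below are hypothetical stand-ins, not our real RDS endpoints:

```python
# Hypothetical endpoint names, stand-ins for the real RDS endpoints.
WRITER_ENDPOINT = "appdb.example.eu-west-1.rds.amazonaws.com"
READER_ENDPOINT = "appdb-replica.example.eu-west-1.rds.amazonaws.com"

def endpoint_for(read_only: bool) -> str:
    """Route read-only queries (e.g. reports) to the mirror replica,
    and anything that writes to the primary endpoint."""
    return READER_ENDPOINT if read_only else WRITER_ENDPOINT
```

The application would then open its reporting connections against `endpoint_for(True)` and its transactional connections against `endpoint_for(False)`.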
- 3 x EC2 ElasticSearch instances (for search), (r3.large, 2 vCPU, 15.25 GB Memory)
- 2 x EC2 ElasticSearch instances (for logs), (r3.large, 2 vCPU, 15.25 GB Memory)
Before the product redesign, ElasticSearch was running on Windows VMs. Even though as a team we are Windows guys, we decided to switch to Linux. Doing so allowed us to lower costs and achieve better performance.
- 2 x EC2 NAT Servers (m3.medium, 1 vCPU, 3.75 GB Memory)
Our system requires a static IP address for outbound traffic, as some customers have configured their firewall rules to allow inbound traffic from our API servers. The system is scalable, so the API server count may vary with demand, and therefore the API servers’ IPs change. For this purpose, we used NAT servers, which redirect all outbound traffic. To be more accurate, we use a couple of NAT servers, each in a different availability zone; if one NAT server goes down, the other takes over its traffic. We went with Linux for the NAT servers’ operating system for cost-saving purposes.
- 1 x Redis Cache cluster with 2 nodes (m3.medium, 1 vCPU, 2.78 GB Memory)
SQL Server suffered from a huge load, so caching is vital for our product - it makes life easier for SQL Server. Before the product redesign we used an AppFabric cache; on AWS it was replaced with the AWS Redis cache service. Redis is very efficient for storing lookup data – each hit to the cache saves a round trip to SQL Server.
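The pattern is plain cache-aside. A sketch with a dict-backed stand-in for the Redis client (the real code talks to Redis, and the SQL loader here is a hypothetical placeholder):

```python
class DictCache:
    """In-memory stand-in for the Redis client (same get/set shape)."""
    def __init__(self):
        self._data = {}
    def get(self, key):
        return self._data.get(key)
    def set(self, key, value):
        self._data[key] = value

def get_lookup(cache, key, load_from_sql):
    """Cache-aside: try the cache first; on a miss, load from SQL Server
    and store the result so subsequent hits skip the round trip."""
    value = cache.get(key)
    if value is None:
        value = load_from_sql(key)
        cache.set(key, value)
    return value
```

Every request after the first for the same key is served from the cache, which is exactly where the SQL Server load relief comes from.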
- 3 x Public load balancers for Web and API servers
- 2 x Internal load balancers for ElasticSearch (search and log servers)
Both the database and ElasticSearch backups are stored on S3. You can manage the lifecycle of these backups by using bucket lifecycle rules.
Amazon AWS Cloud has all the necessary services for security and communication. It's actually very similar to hosting an infrastructure on-premises.
AWS Simple AD simplifies the administration of all EC2 instances in the VPC – a single login for a single user. It strengthens control and security, because it’s easier to audit which actions users performed on each EC2 instance. Simple AD also supports commonly used features such as user accounts, group memberships, and domain-joining Amazon EC2 instances running Linux and Microsoft Windows.
Private and public subnets
Infrastructure resources were logically allocated into two subnet pairs per availability zone:
- Public subnet – all internet-facing EC2 instances (Web servers, API servers)
- Private subnet – all resources accessible within VPC (RDS, ElasticSearch servers, Redis cache)
A VPN was set up to the AWS Cloud so that developers can access private subnet resources when needed. For example, RDS, ElasticSearch and Redis are accessible only within the VPC, but sometimes developers have to connect to the production environment to investigate support issues.
We’d had plenty of experience with system outages, even before we migrated, so we had already set up system monitoring with Amazon CloudWatch. CloudWatch lets you set up alerts for various measures such as CPU, memory, available disk space, and many others. It’s better to get an email from Amazon AWS than from an angry customer.
We had set up these alerts before the migration, but the alert thresholds have already been adjusted several times; in the testing stage we could only guess at the likely resource utilization.
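An alert is a set of metric parameters that CloudWatch evaluates for you. A sketch with boto3 of a CPU alarm for one web server; the alarm name, instance ID, threshold and SNS topic ARN are all made-up examples:

```python
# Illustrative CPU alarm definition; every value below is an example,
# not our production configuration.
CPU_ALARM = {
    "AlarmName": "web1-high-cpu",
    "Namespace": "AWS/EC2",
    "MetricName": "CPUUtilization",
    "Dimensions": [{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],
    "Statistic": "Average",
    "Period": 300,                # evaluate 5-minute averages
    "EvaluationPeriods": 2,       # two breaches in a row trigger the alarm
    "Threshold": 80.0,
    "ComparisonOperator": "GreaterThanThreshold",
    "AlarmActions": ["arn:aws:sns:eu-west-1:123456789012:ops-alerts"],
}

def create_alarm(alarm: dict) -> None:
    """Register the alarm with CloudWatch (requires boto3 and credentials)."""
    import boto3  # deferred import: the module loads without AWS installed
    boto3.client("cloudwatch").put_metric_alarm(**alarm)
```

Requiring two consecutive evaluation periods over the threshold is one way to avoid paging on short CPU spikes.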
If you have a load-balanced system, you need to figure out how to deploy new changes to all VMs. To do so, we integrated the AWS CodeDeploy service, which coordinates application deployments to Amazon EC2 instances or on-premises instances, independently or simultaneously. There are several deployment strategies: for example, you can deploy to all EC2 instances at once, one at a time, or half at a time. We deploy one instance at a time because it allows us to roll out a solution without downtime.
This is what our workflow looked like:
- The TeamCity build server triggers a new build, which uploads the new deployment package to an S3 bucket.
- The AWS Lambda event-driven compute service watches the S3 bucket for changes. Once a new deployment package is uploaded, it initializes a CodeDeploy deployment.
- CodeDeploy runs three main steps: pre-deployment, deployment and post-deployment. Each step is configurable; it’s up to you what happens in each of them.
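The Lambda piece of that workflow fits in a few lines. A sketch; the application name, deployment group and bucket contents are illustrative, and `CodeDeployDefault.OneAtATime` is the built-in one-instance-at-a-time strategy mentioned above:

```python
def s3_object_from_event(event):
    """Pull the uploaded package's bucket and key out of the S3 event
    record that triggers the Lambda function."""
    record = event["Records"][0]["s3"]
    return record["bucket"]["name"], record["object"]["key"]

def lambda_handler(event, context):
    """Kick off a CodeDeploy deployment when a new package lands in S3.
    The application and deployment group names are examples."""
    import boto3  # available by default in the AWS Lambda Python runtime
    bucket, key = s3_object_from_event(event)
    boto3.client("codedeploy").create_deployment(
        applicationName="web-app",
        deploymentGroupName="production",
        deploymentConfigName="CodeDeployDefault.OneAtATime",
        revision={
            "revisionType": "S3",
            "s3Location": {"bucket": bucket, "key": key,
                           "bundleType": "zip"},
        },
    )
```

The pre- and post-deployment steps themselves live in the package's appspec file, not in the Lambda function.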
Two months have passed since we migrated our infrastructure to Amazon AWS. Here’s what we’ve seen:
- System is performing as expected
- Uptime is 100%
- Deployments are smooth without any end user interruption
- Infrastructure cost has decreased
- Customer complaints regarding never-ending loading screens have ceased
It took our project three years to reach the point where it needed a redesign. We know this won’t be the last one, but for now, it was one big step forward for a live product. The project is moving forward, and the infrastructure should not fall behind.