Create a flexible architecture
The application architecture can have long-term, critical impacts on performance and growth. It must be flexible enough to let you deploy components on dedicated servers when needed. But flexibility has a development cost and a performance cost, and it is not always necessary. Carefully identifying the components is the key: for each component, you need to know how it is used and what impact it has on the performance of the overall application. If the architecture is not designed or studied correctly, it can be impossible to reorganize the deployment when issues arise.
Planzone uses a traditional multi-tier J2EE architecture. I organized it into 5 web applications (WARs) that can be deployed on the same server or on different servers. The web applications have different roles: the core application, API access, batch processing, ... They are deployed on every server, and we can easily activate them where necessary.
A new application or service should be deployed early, even when only a few users will use it. By going into production sooner rather than later, you get the opportunity to see problems while traffic is still low. You can watch and learn how your users are using the service. Last but not least, you are in a real situation and you are forced to identify and solve real problems immediately.
For our service, we launched the beta version of Planzone in December 2007 and opened it to our initial beta users (300 users). At this stage we had no performance issues, but we collected good feedback on the product, identified missing features and got ideas to improve our service.
Monitor the application from the beginning
Monitoring is the key when the user growth rate is unknown (and even after!). It must be put in place at the same time the service is deployed. A careful monitoring solution helps you identify early whether the application has performance issues or whether the infrastructure has to be changed to keep up with user growth.
We put in place a simple monitoring solution based on Cacti and Nagios. But this was not enough, because these tools only provide a coarse view of the application. I added request monitoring within the application to identify the bottlenecks early (I'll describe it in another post).
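To give an idea of what in-application request monitoring can look like, here is a minimal sketch in plain Java. It is not the Planzone implementation: the class name `RequestMonitor`, its methods and the request names are hypothetical and chosen only for illustration. It keeps a call count and a cumulative duration per request name, which is enough to see which requests cost the most.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

/** Hypothetical request monitor: call count and total time per request name. */
class RequestMonitor {
    /** Counters for one request name; LongAdder stays cheap under concurrent updates. */
    static final class Stats {
        final LongAdder count = new LongAdder();
        final LongAdder totalNanos = new LongAdder();

        long count() { return count.sum(); }
        long totalNanos() { return totalNanos.sum(); }

        /** Average duration in milliseconds, or 0 when nothing was recorded. */
        double averageMillis() {
            long n = count.sum();
            return n == 0 ? 0.0 : totalNanos.sum() / 1_000_000.0 / n;
        }
    }

    private final Map<String, Stats> stats = new ConcurrentHashMap<>();

    /** Records one execution of the named request. */
    void record(String name, long elapsedNanos) {
        Stats s = stats.computeIfAbsent(name, k -> new Stats());
        s.count.increment();
        s.totalNanos.add(elapsedNanos);
    }

    /** Runs the work and records its duration, even if the work throws. */
    void time(String name, Runnable work) {
        long start = System.nanoTime();
        try {
            work.run();
        } finally {
            record(name, System.nanoTime() - start);
        }
    }

    /** Returns the counters for a request name, or null if it was never recorded. */
    Stats statsFor(String name) {
        return stats.get(name);
    }
}
```

In a web application, a call such as `monitor.time("/project/list", () -> handle(request))` would typically sit in one central place (for example a request filter), so that every page and database query is measured without touching the business code. The `handle` call and the URL here are placeholders.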
Optimize when the monitoring says so
The Pareto principle states that roughly 80% of effects come from 20% of causes. For software optimization, this 80-20 rule means that by optimizing 20% of the code, we solve 80% of the performance problems. The monitoring solution must be used to identify the 20% of pages, the 20% of database requests, etc. that are the most used and are potentially causing a bottleneck. Because the system is in production, the monitoring data is real and not simulated, so you know where to act.
As far as Planzone is concerned, I decided to optimize only one or two pages (out of more than 200) and two or three database queries (out of more than 180). The monitoring results determined which pages had to be optimized and when. With the team, we kept an eye on the monitoring data and fixed performance issues when they seemed to appear (once or twice every 6 months).
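The 80-20 selection described above can be sketched in a few lines of Java. This is only an illustration under assumptions: the class `HotSpots`, the method `topContributors` and the input map (total time spent per page or query, as a monitoring tool might export it) are hypothetical names, not part of Planzone. The method sorts entries by total cost and keeps the most expensive ones until they account for the requested share of the overall time.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical helper that picks the few pages or queries worth optimizing. */
class HotSpots {
    /**
     * Returns the most expensive entries, in descending order of total time,
     * stopping once their cumulative time reaches the given fraction
     * (for example 0.8) of the overall total.
     */
    static List<String> topContributors(Map<String, Long> totalMillisByName, double fraction) {
        long total = 0;
        for (long v : totalMillisByName.values()) {
            total += v;
        }

        // Most expensive entries first.
        List<Map.Entry<String, Long>> entries = new ArrayList<>(totalMillisByName.entrySet());
        entries.sort(Map.Entry.<String, Long>comparingByValue().reversed());

        List<String> result = new ArrayList<>();
        long running = 0;
        for (Map.Entry<String, Long> e : entries) {
            if (total == 0 || running >= fraction * total) {
                break; // the entries selected so far already cover the target share
            }
            result.add(e.getKey());
            running += e.getValue();
        }
        return result;
    }
}
```

Calling `HotSpots.topContributors(totals, 0.8)` on real production data gives the short list of candidates. Everything outside that list can be left alone until the monitoring says otherwise.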
Update as soon as possible
Optimization solves the problems detected by the monitoring. As soon as a solution is found and functionally validated, update production. Do not wait! Waiting at this stage can aggravate the situation, because more users may be on the platform and the database will keep growing anyway.
With Planzone, we decided to update the service on a regular basis: every two months in 2008 and 2009, and every month since the beginning of 2010 (without service interruption!). This helped us a lot in maintaining the quality of the service, both on the performance side and on the functional side. Each update contains new (small) features, bug fixes and the performance improvements that are necessary (and no more).
Plan for load spikes
Careful monitoring of the application tells you how the infrastructure is used in terms of CPU, memory and disk load. Most of the time you will see that the infrastructure is not used at its maximum capacity. Users don't all use the service at the same time, but since you don't control them, you may observe intensive use during some periods. If the infrastructure is already at its maximum during normal usage, you have no headroom for these periods of intensive use.
For Planzone, we have seen that we often get a load spike every Tuesday and at different hours during the week. Indeed, the load spikes correspond to users who need the service during their business hours. Even during these spikes, the service remains very responsive for users. The load is below 20% in these cases, which gives us room for growth.
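The headroom reasoning can be made concrete with a small calculation. The sketch below is an assumption-laden illustration, not something taken from our tooling: the class `Headroom`, the representation of load samples as fractions between 0 and 1, and the idea of a comfort "ceiling" (for instance 0.7) are all hypothetical. It takes the highest observed sample and estimates by what factor the load could grow before the peak reaches that ceiling.

```java
/** Hypothetical headroom estimate from monitored load samples. */
class Headroom {
    /** Highest sample in the series; samples are utilization fractions in [0, 1]. */
    static double peak(double[] samples) {
        double max = 0.0;
        for (double s : samples) {
            max = Math.max(max, s);
        }
        return max;
    }

    /**
     * Factor by which the load could grow before the observed peak reaches
     * the ceiling. Assumes load scales linearly with usage, which is a
     * simplification.
     */
    static double growthFactor(double[] samples, double ceiling) {
        double p = peak(samples);
        return p == 0.0 ? Double.POSITIVE_INFINITY : ceiling / p;
    }
}
```

With spikes staying below 20% and an assumed ceiling of 70%, such an estimate suggests the load could roughly triple before the spikes become a concern. The linear-scaling assumption rarely holds exactly, so the result is a rough guide rather than a capacity plan.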
From a technical point of view, the architecture, the early deployment, the monitoring, the late optimization and the continuous service updates were the keys to Planzone's success.
At the beginning of the project we also put in place an internal benchlab infrastructure for stress and performance measurements. It turned out that production monitoring results were more interesting and valuable than simulating high loads. Our benchlab is now used only for functional validation.