Java 2 Ada

Rest API Benchmark comparison between Ada and Java

By stephane.carrez 3 comments

Arcadius Ahouansou from Menelic.com made an interesting benchmark to compare several Java Web servers: Java REST API Benchmark: Tomcat vs Jetty vs Grizzly vs Undertow, Round 3. His benchmark is not as broad as the TechEmpower Benchmark, but it has the merit of being simple to understand and easy for anyone to run. I decided to make a similar benchmark for Ada Web servers with the same REST API so that the Ada and Java implementations can be compared.

The goal is to benchmark the following servers and get an idea of how they compare with each other:

  • Ada Web Server (AWS)
  • Ada Server Faces (ASF)
  • Embedded Web Server (EWS)
  • Java Grizzly

The first three are implemented in Ada and the last one in Java.

REST Server Implementation

The implementation is different for each server but they all implement the same REST GET operation accessible from the /api base URL. They return the same JSON content:

{"greeting":"Hello World!"}

Below is an extract of the server implementation for each server.

AWS Rest API Server

function Get_Api (Request : in AWS.Status.Data) return AWS.Response.Data is
begin
   return AWS.Response.Build ("application/json", "{""greeting"":""Hello World!""}");
end Get_Api;

ASF Rest API Server

procedure Get (Req    : in out ASF.Rest.Request'Class;
               Reply  : in out ASF.Rest.Response'Class;
               Stream : in out ASF.Rest.Output_Stream'Class) is
begin
   Stream.Start_Document;
   Stream.Write_Entity ("greeting", "Hello World!");
   Stream.End_Document;
end Get;

EWS Rest API Server

function Get (Request : EWS.HTTP.Request_P) return EWS.Dynamic.Dynamic_Response'Class is
   Result : EWS.Dynamic.Dynamic_Response (Request);
begin
   EWS.Dynamic.Set_Content_Type (Result, To => EWS.Types.JSON);
   EWS.Dynamic.Set_Content (Result, "{""greeting"":""Hello World!""}");
   return Result;
end Get;

Java Rest API Server

@Produces(APPLICATION_JSON_UTF8_VALUE)
@Path("/api")
@Component
public class ApiResource {
  public static final String RESPONSE = "{\"greeting\":\"Hello World!\"}";
  
  @GET
  public Response test() {
      return ok(RESPONSE).build();
  }
}
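
To try the same endpoint without the Jersey and Grizzly stack used above, a JDK-only server can return the same JSON. This is only a sketch for experimentation: the class GreetingServer, its method names and the choice of the JDK's built-in com.sun.net.httpserver package are assumptions, and this server was not part of the benchmark.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in for the benchmarked servers (not measured in this article).
final class GreetingServer {
    static final String RESPONSE = "{\"greeting\":\"Hello World!\"}";

    // Starts a server on the loopback interface; port 0 picks any free port.
    static HttpServer start(int port) throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress("127.0.0.1", port), 0);
        server.createContext("/api", exchange -> {
            byte[] body = RESPONSE.getBytes(StandardCharsets.UTF_8);
            exchange.getResponseHeaders().set("Content-Type", "application/json");
            exchange.sendResponseHeaders(200, body.length);
            try (OutputStream out = exchange.getResponseBody()) {
                out.write(body);
            }
        });
        server.start();
        return server;
    }

    // Performs one GET request and returns the response body as text.
    static String get(String url) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        try (InputStream in = conn.getInputStream()) {
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) throws IOException {
        HttpServer server = start(0);
        int port = server.getAddress().getPort();
        System.out.println(get("http://127.0.0.1:" + port + "/api"));
        server.stop(0);
    }
}
```

The handler answers every request under /api in the same way, which is enough for a load test that only issues GET requests.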

Benchmark Strategy and Results

The Ada and Java servers are started on the same host (one at a time), running Ubuntu Linux 14.04 64-bit on an Intel i7-3770S CPU @ 3.10 GHz with 8 cores. The load is generated with Siege from a second computer running Ubuntu Linux 15.04 64-bit on an Intel i7-4720HQ CPU @ 2.60 GHz with 8 cores. Client and server hosts are connected through a Gigabit Ethernet link.

Siege makes intensive use of network connections, which exhausts the TCP/IP ports available to connect to the server. This is due to the TCP TIME_WAIT state, which prevents a port from being reused for new connections for some time. To avoid such exhaustion, the network stack is tuned on both the server and the client hosts with the sysctl commands:

sudo sysctl -w net.ipv4.tcp_tw_recycle=1
sudo sysctl -w net.ipv4.tcp_tw_reuse=1

The benchmark tests are executed by running the run-load-test.sh script, and the graphs are then produced with GNUplot using the plot-perf.gpi script. The benchmark gives the number of REST requests served per second for different levels of concurrency.

  • The Embedded Web Server targets embedded platforms and uses only one task to serve requests. Despite this simple configuration, it gets honorable results, reaching 8000 requests per second.
  • Ada Server Faces provides an Ada implementation of Java Server Faces. It runs on top of the Ada Web Server. The benchmark shows a small overhead (around 4%).
  • The Ada Web Server is the fastest server in this configuration. As with Ada Server Faces, it is configured with only 8 tasks to serve requests. Increasing the number of tasks does not bring better performance.
  • The Java Grizzly server is the fastest Java server reported by Arcadius's benchmark. It uses 62 threads. It appears to serve 7% fewer requests than the Ada Web Server.

ada-rest-api-benchmark.png

On the memory side, the process Resident Set Size (RSS) is measured once the benchmark test ends and graphed below. The Java Grizzly server uses around 580 MB, followed by Ada Server Faces with 5.6 MB, the Ada Web Server with 3.6 MB and EWS with only 1 MB.

ada-rest-api-memory.png

Conclusion and References

The Ada Web Server has performance comparable to the Java Grizzly server (it is even a little faster). But as far as memory is concerned, Ada has a serious advantage since it cuts the memory size by a factor of 100. Ada has other advantages that make it an alternative choice for web development (safety, security, realtime capabilities, ...).

Sources of the benchmarks are available in the following two GitHub repositories:


Optimization with Valgrind Massif and Cachegrind

By stephane.carrez

Memory optimization sometimes reveals nice surprises. I wanted to analyze the memory used by the Ada Server Faces framework. For this I profiled the unit test program, which includes 130 tests that cover almost all the features of the framework.

Memory analysis with Valgrind Massif

Massif is a Valgrind tool that is used for heap analysis. It does not require the application to be re-compiled and can be used easily. The application is executed by using Valgrind and its tool Massif. The command that I've used was:

valgrind --tool=massif --threshold=0.1 \
   --detailed-freq=1 --alloc-fn=__gnat_malloc \
   bin/asf_harness -config test.properties

The valgrind tool creates a file massif.out.NNN which contains the analysis. The massif-visualizer is a graphical tool that reads the file and allows you to analyze the results. It is launched as follows:

massif-visualizer massif.out.19813

(the number is the pid of the process that was running, replace it accordingly).

The tool provides a graphical representation of the memory used over time. It lets you highlight a given memory snapshot and understand roughly where the memory is used.

Memory consumption with Massif [before]

While looking at the result, I was intrigued by a 1 MB allocation that was made several times and then released (it creates the visual spikes and corresponds to the big red horizontal bar). It was within the sax-utils.adb file that is part of the XML/Ada library. Looking at the implementation, it turns out that it allocates a hash table with 65536 entries. This allocation is done each time the SAX parser is created. I reduced the size of this hash table to 1024 entries. If you want to do the same, change the following line in sax/sax-symbols.ads (line 99):

   Hash_Num : constant := 2**16;

by:

   Hash_Num : constant := 2**10;
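
As a rough check of what this change saves, the table size can be estimated from the figures above. The sketch below assumes 16 bytes per entry, a value inferred from "1 MB for 65536 entries" and not read from the XML/Ada sources; the class and constant names are hypothetical.

```java
// Back-of-the-envelope sizing. The 16-byte entry size is an assumption
// inferred from the figures in the text, not taken from XML/Ada.
final class HashTableSize {
    static final int ASSUMED_BYTES_PER_ENTRY = 16;

    // Memory needed for a table with the given number of entries.
    static long tableBytes(int entries) {
        return (long) entries * ASSUMED_BYTES_PER_ENTRY;
    }

    public static void main(String[] args) {
        System.out.println("2**16 entries: " + tableBytes(1 << 16) + " bytes");
        System.out.println("2**10 entries: " + tableBytes(1 << 10) + " bytes");
    }
}
```

Under that assumption the table shrinks from 1,048,576 bytes to 16,384 bytes each time a parser is created.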

After rebuilding and checking that there is no regression (yes, it works), I re-ran the Massif tool and here are the results.

Memory consumption with Massif [after]

The peak memory was reduced from 2.7 MB to 2.0 MB. The memory usage is now easier to understand and analyse because the 1 MB allocation is gone. Other memory allocations have more importance now. But wait. There is more! My program is now faster!

Cache analysis with cachegrind

To understand why the program is now faster, I've used Cachegrind that measures processor cache performance. Cachegrind is a cache and branch-prediction profiler provided by Valgrind as another tool. I've executed the tool with the following command:

valgrind --tool=cachegrind \
    bin/asf_harness -config test.properties

I ran it once before the hash table fix and once after. Like Massif, Cachegrind generates a file, cachegrind.out.NNN, that contains the analysis. You analyze the result with either cg_annotate or kcachegrind. Having two Cachegrind files, I used cg_diff to get the difference between the two executions.

cg_diff cachegrind.out.24198 cachegrind.out.23286 > cg.out.1
cg_annotate cg.out.1

Before the fix, the Cachegrind report shows that the most intensive memory operations are performed by the Sax.Htable.Reset operation and by the GNAT operation that initializes the Sax.Symbols.Symbol_Table_Record type, which contains the big hash table. Dr is the number of data reads, D1mr the number of L1 cache read misses, Dw the number of data writes and D1mw the number of L1 cache write misses. Having a lot of cache misses slows down the execution: an L1 cache access requires a few cycles while a main memory access can cost several hundred.

--------------------------------------------------------------------------------
         Dr      D1mr          Dw      D1mw 
--------------------------------------------------------------------------------
212,746,571 2,787,355 144,880,212 2,469,782  PROGRAM TOTALS

--------------------------------------------------------------------------------
        Dr      D1mr         Dw      D1mw  file:function
--------------------------------------------------------------------------------
25,000,929 2,081,943     27,672       244  sax/sax-htable.adb:sax__symbols__string_htable__reset
       508       127 33,293,050 2,080,768  sax/sax-htable.adb:sax__symbols__symbol_table_recordIP
43,894,931   129,786  7,532,775     8,677  ???:???
15,021,128     4,140  5,632,923         0  pthread_getspecific
 7,510,564     2,995  7,510,564    10,673  ???:system__task_primitives__operations__specific__selfXnn
 6,134,652    41,357  4,320,817    49,207  _int_malloc
 4,774,547    22,969  1,956,568     4,392  _int_free
 3,753,930         0  5,630,895     5,039  ???:system__task_primitives__operations(short,...)(long, float)

With a smaller hash table, the Cachegrind report indicates a reduction of 24,543,482 data reads and 32,765,323 data writes. The cache read misses were reduced by 2,086,579 (about 75%) and the cache write misses by 2,056,247 (an 83% reduction!).

With a small hash table, the Sax.Symbols.Symbol_Table_Record gets initialized more quickly and cleaning it needs fewer memory accesses, hence fewer CPU cycles. A smaller hash table also causes fewer cache misses: using a 1 MB hash table flushes a big part of the data cache.

--------------------------------------------------------------------------------
         Dr    D1mr          Dw    D1mw 
--------------------------------------------------------------------------------
188,203,089 700,776 112,114,889 413,535  PROGRAM TOTALS

--------------------------------------------------------------------------------
        Dr    D1mr        Dw   D1mw  file:function
--------------------------------------------------------------------------------
43,904,760 120,883 7,532,577  8,407  ???:???
15,028,328      98 5,635,623      0  pthread_getspecific
 7,514,164     288 7,514,164  9,929  ???:system__task_primitives__operations__specific__selfXnn
 6,129,019  39,636 4,305,043 48,446  _int_malloc
 4,784,026  18,626 1,959,387  3,261  _int_free
 3,755,730       0 5,633,595  4,390  ???:system__task_primitives__operations(short,...)(long, float)
 2,418,778      65 2,705,140     14  ???:system__tasking__initialization__abort_undefer
 3,839,603   2,605 1,283,289      0  malloc
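
The reductions quoted above can be recomputed from the two PROGRAM TOTALS rows. The small sketch below does the subtraction and the percentage for each column; the class and method names are hypothetical.

```java
import java.util.Locale;

// Recomputes the savings from the two PROGRAM TOTALS rows of the reports.
final class CachegrindDelta {
    // Absolute number of events saved by the fix.
    static long saved(long before, long after) {
        return before - after;
    }

    // Savings relative to the value measured before the fix, in percent.
    static double savedPercent(long before, long after) {
        return 100.0 * (before - after) / before;
    }

    private static void report(String name, long before, long after) {
        System.out.printf(Locale.ROOT, "%-5s saved %d (%.1f%%)%n",
                          name, saved(before, after), savedPercent(before, after));
    }

    public static void main(String[] args) {
        report("Dr",   212_746_571L, 188_203_089L);
        report("D1mr",   2_787_355L,     700_776L);
        report("Dw",   144_880_212L, 112_114_889L);
        report("D1mw",   2_469_782L,     413_535L);
    }
}
```

The data reads and writes drop by roughly 12% and 23%, while the L1 misses drop far more sharply, which is consistent with the large table evicting useful data from the cache.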

Conclusion

Running massif and cachegrind is very easy but it may take some time to figure out how to understand and use the results. A big hash table is not always a good thing for an application. By creating cache misses it may in fact slow down the application. To learn more about this subject, I recommend the excellent document What Every Programmer Should Know About Memory written by Ulrich Drepper.


World IPv6 Day

By stephane.carrez

Today, June 8th 2011, is World IPv6 Day. Major organisations such as Google, Facebook and Yahoo! will offer native IPv6 connectivity.

To check your IPv6 connectivity, you can run a test from your browser: Test your IPv6 connectivity.

If you install the ShowIP Firefox plugin, you will know the IP address of web sites while you browse and therefore quickly know whether you navigate using IPv4 or IPv6.
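
The same check can be done outside the browser. The sketch below resolves a host name with the standard java.net.InetAddress class and reports the address family of each result; the class name IpFamily and the default host are assumptions chosen for illustration.

```java
import java.net.Inet4Address;
import java.net.Inet6Address;
import java.net.InetAddress;
import java.net.UnknownHostException;

// Lists the addresses of a host and tells whether each one is IPv4 or IPv6.
final class IpFamily {
    // Classifies an already resolved address.
    static String family(InetAddress address) {
        if (address instanceof Inet6Address) {
            return "IPv6";
        }
        if (address instanceof Inet4Address) {
            return "IPv4";
        }
        return "unknown";
    }

    public static void main(String[] args) throws UnknownHostException {
        String host = args.length > 0 ? args[0] : "www.google.com";
        for (InetAddress a : InetAddress.getAllByName(host)) {
            System.out.println(family(a) + " " + a.getHostAddress());
        }
    }
}
```

Which addresses are returned, and in which order, depends on the resolver and on the JVM network settings of the machine running it.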

Below are some basic performance results between IPv4 and IPv6. Since most routers are tuned for IPv4, the IPv6 flow path is not yet as fast as IPv4. The (small) performance degradation has nothing to do with the IPv6 protocol.

Google IPv4 vs IPv6 ping

$ ping -n www.google.com
PING www.l.google.com (209.85.146.103) 56(84) bytes of data.
64 bytes from 209.85.146.103: icmp_seq=1 ttl=55 time=9.63 ms
$ ping6 -n www.google.com
PING www.google.com(2a00:1450:400c:c00::67) 56 data bytes
64 bytes from 2a00:1450:400c:c00::67: icmp_seq=1 ttl=56 time=11.6 ms

Yahoo IPv4 vs IPv6 ping

$ ping -n www.yahoo.com
PING fpfd.wa1.b.yahoo.com (87.248.122.122) 56(84) bytes of data.
64 bytes from 87.248.122.122: icmp_seq=1 ttl=58 time=25.7 ms
$ ping6 -n www.yahoo.com
PING www.yahoo.com(2a00:1288:f00e:1fe::3000) 56 data bytes
64 bytes from 2a00:1288:f00e:1fe::3000: icmp_seq=1 ttl=60 time=31.3 ms

Facebook IPv4 vs IPv6 ping

$ ping -n www.facebook.com
PING www.facebook.com (66.220.156.25) 56(84) bytes of data.
64 bytes from 66.220.156.25: icmp_seq=1 ttl=247 time=80.6 ms
$ ping6 -n www.facebook.com
PING www.facebook.com(2620:0:1c18:0:face:b00c:0:1) 56 data bytes
64 bytes from 2620:0:1c18:0:face:b00c:0:1: icmp_seq=1 ttl=38 time=98.6 ms

Thread safe cache updates in Java and Ada

By stephane.carrez 2 comments

Problem Description

The problem is to update a cache that is almost never modified and only read in multi-threaded context. The read performance is critical and the goal is to reduce the thread contention as much as possible to obtain a fast and non-blocking path when reading the cache.

Cache Declaration

Java Implementation

Let's define the cache using the HashMap class.

public class Cache {
   private HashMap<String,String> map = new HashMap<String, String>();
}

Ada Implementation

In Ada, let's instantiate the Indefinite_Hashed_Maps package for the cache.

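with Ada.Strings.Hash;
with Ada.Containers.Indefinite_Hashed_Maps;
...
  --  Equivalent_Keys has no default and must be given explicitly;
  --  the element "=" defaults to the String equality.
  package Hash_Map is
    new Ada.Containers.Indefinite_Hashed_Maps (Key_Type        => String,
                                               Element_Type    => String,
                                               Hash            => Ada.Strings.Hash,
                                               Equivalent_Keys => "=");

  Map : Hash_Map.Map;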

Solution 1: safe and concurrent implementation

This solution is a straightforward one that uses the language's thread-safe constructs. In Java, this solution does not allow several threads to look at the cache at the same time: cache access is serialized. This is not a problem with Ada, where multiple concurrent readers are allowed; only writing locks the cache object.

Java Implementation

The thread safe implementation is protected by the synchronized keyword. It guarantees mutual exclusions of threads invoking the getCache and addCache methods.

   public synchronized String getCache(String key) {
      return map.get(key);
   }
   public synchronized void addCache(String key, String value) {
      map.put(key, value);
   }

Ada Implementation

In Ada, we can use the protected type. The cache could be declared as follows:

  protected type Cache is
    function Get(Key : in String) return String;
    procedure Put(Key, Value: in String);
  private
    Map : Hash_Map.Map;
  end Cache;

and the implementation is straightforward:

  protected body Cache is
    function Get(Key : in String) return String is
    begin
       return Map.Element (Key);
    end Get;
    procedure Put(Key, Value: in String) is
    begin
       Map.Insert (Key, Value);
    end Put;
  end Cache;
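
Java can get closer to the behavior of the Ada protected type (many readers, one exclusive writer) with a read-write lock. The following is a sketch of that variant and is not part of the two implementations compared above; the class name ReadWriteCache is hypothetical, while ReentrantReadWriteLock comes from java.util.concurrent.locks.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical variant of the Cache class using a read-write lock.
final class ReadWriteCache {
    private final Map<String, String> map = new HashMap<String, String>();
    private final ReadWriteLock lock = new ReentrantReadWriteLock();

    // Several readers may hold the read lock at the same time.
    public String getCache(String key) {
        lock.readLock().lock();
        try {
            return map.get(key);
        } finally {
            lock.readLock().unlock();
        }
    }

    // The write lock is exclusive: it waits for current readers and blocks new ones.
    public void addCache(String key, String value) {
        lock.writeLock().lock();
        try {
            map.put(key, value);
        } finally {
            lock.writeLock().unlock();
        }
    }
}
```

Readers still pay for acquiring and releasing the lock, so this reduces contention between readers without removing the synchronization cost.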

Pros and Cons

+: This implementation is thread safe.

-: In Java, thread contention is high as only one thread can look in the cache at a time.

-: In Ada, thread contention occurs only if another thread updates the cache (which is far better than Java but could be annoying for realtime performance if the Put operation takes time).

Solution 2: weak but efficient implementation

Solution 1 does not allow multiple threads to access the cache at the same time (at least in Java), which creates a contention point. The second solution proposed here removes this contention point by relaxing some thread-safety conditions at the expense of cache behavior.

In this second solution, several threads can read the cache at the same time. The cache can be updated by one or several threads, but the update does not guarantee that all added entries will be present in the cache. In other words, if two threads update the cache at the same time, the updated cache may contain only one of the new entries. This behavior can be acceptable in some cases but does not fit all uses. It must be used with great care.

Java Implementation

A cache entry can be added in a thread-safe manner using the following code:

   private volatile HashMap<String, String> map = new HashMap<String, String>();
   public String getCache(String key) {
      return map.get(key);
   }
   public void addCache(String key, String value) {
      // Copy the current map, modify the copy, then publish it.
      HashMap<String, String> newMap = new HashMap<String, String>(map);

      newMap.put(key, value);
      map = newMap;
   }

This implementation is thread safe because the hash map is never modified. If a modification is made, it is done on a separate hash map object. The new hash map is then installed by the map = newMap assignment operation which is atomic. Again this code extract does not guarantee that all the cache entries added will be part of the cache.
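
When losing an entry is not acceptable, the same copy-then-publish idea can be combined with a compare-and-set retry loop. The sketch below is a variant and not the code discussed above: the class name CasCache is hypothetical and it relies on java.util.concurrent.atomic.AtomicReference. A writer that detects that another thread published a new map in the meantime simply copies again and retries.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical copy-on-write cache that does not lose concurrent updates.
final class CasCache {
    private final AtomicReference<Map<String, String>> ref =
        new AtomicReference<Map<String, String>>(new HashMap<String, String>());

    // Readers never block: they read whichever map is currently published.
    public String getCache(String key) {
        return ref.get().get(key);
    }

    public void addCache(String key, String value) {
        while (true) {
            Map<String, String> current = ref.get();
            Map<String, String> next = new HashMap<String, String>(current);
            next.put(key, value);
            // Publish only if nobody replaced the map since we copied it.
            if (ref.compareAndSet(current, next)) {
                return;
            }
        }
    }

    public int size() {
        return ref.get().size();
    }
}
```

Readers keep the lock-free path; the price is that a writer may copy the map more than once when updates collide.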

Ada Implementation

The Ada implementation is slightly more complex basically because there is no garbage collector. If we allocate a new hash map and update the access pointer, we still have to free the old hash map when no other thread is accessing it.

The first step is to use a reference counter to automatically release the hash table when the last thread finishes its work. The reference counter will handle memory management issues for us. An implementation of a thread-safe reference counter is provided by Ada Util. In this implementation, counters are updated using a specific instruction (see Showing multiprocessor issue when updating a shared counter).

with Util.Refs;
...
   type Cache is new Util.Refs.Ref_Entity with record
      Map : Hash_Map.Map;
   end record;
   type Cache_Access is access all Cache;

   package Cache_Ref is new Util.Refs.References (Element_Type => Cache,
                Element_Access => Cache_Access);

  C : Cache_Ref.Atomic_Ref;

Source: Util.Refs.ads, Util.Refs.adb

The References package defines a Ref type representing the reference to a Cache instance. To be able to replace one reference with another atomically, it is necessary to use the Atomic_Ref type. This is necessary because the Ada assignment of a Ref type is not atomic (the assignment copy and the call to the Adjust operation to update the reference counter are not atomic). The Atomic_Ref type is a protected type that provides a getter and a setter. Their use guarantees the atomicity.

    function Get(Key : in String) return String is
       R : constant Cache_Ref.Ref := C.Get;
    begin
       return R.Value.Map.Element (Key); -- concurrent access
    end Get;
    procedure Put(Key, Value: in String) is
       R : constant Cache_Ref.Ref := C.Get;
       N : constant Cache_Ref.Ref := Cache_Ref.Create;
    begin
       N.Value.all.Map := R.Value.Map;       -- copy the current map
       N.Value.all.Map.Insert (Key, Value);  -- modify the private copy
       C.Set (N); -- install the new map atomically
    end Put;

Pros and Cons

+: high performance in SMP environments

+: no thread contention in Java

-: cache update can lose some entries

-: still some thread contention in Ada but limited to copying a reference (C.Set)


Showing multiprocessor issue when updating a shared counter

By stephane.carrez

When working on several concurrent counter implementations in Ada, I wanted to point out the concurrency issue that exists in multi-processor environments. This article explains why you really have to take this issue seriously in multi-tasking applications, especially because multi-core processors are now quite common.

What's the issue

Let's say we have a simple integer shared by several tasks:

Counter : Integer;

And several tasks will use the following statement to increment the counter:

  Counter := Counter + 1;

We will see that this implementation is wrong (even if a single instruction is used).

Multi task increment sample

To expose the issue, let's define two counters: one unprotected and another protected from concurrent accesses by a specific data structure provided by the Ada Util library.

with Util.Concurrent.Counters;
..
  Unsafe  : Integer := 0;
  Counter : Util.Concurrent.Counters.Counter;

In our testing procedure, let's declare a task type that will increment both versions of our counters. Several tasks will run concurrently so that the shared counter variables will experience a lot of concurrent accesses. The task type is declared in a declare block inside our procedure so that we will benefit from task synchronisation at the end of the block (See RM 7.6, and RM 9.3).

Each task will increment both counters in a loop. We should expect the two counters to get the same value at the end. We will see this is not the case in multi-processor environments.

declare
  task type Worker is
    entry Start (Count : in Natural);
  end Worker;

  task body Worker is
    Cnt : Natural;
  begin
      accept Start (Count : in Natural) do
        Cnt := Count;
      end;
      for I in 1 .. Cnt loop
        Util.Concurrent.Counters.Increment (Counter);
        Unsafe := Unsafe + 1;
      end loop;
  end Worker;

Now, in the same declaration block, we will define an array of tasks to show up the concurrency.

   type Worker_Array is array (1 .. Task_Count) of Worker;
   Tasks : Worker_Array;

Our tasks are activated and they are waiting to get the counter. Let's make our tasks count 10 million times.

begin
  for I in Tasks'Range loop
    Tasks (I).Start (10_000_000);
  end loop;
end;

Before leaving the declare scope, Ada will wait until the tasks have finished (yes, there is no need to write any pthread_join code). After this block, we can just print out the value stored in the two counters and compare them:

Log.Info ("Counter value at the end       : " & Integer'Image (Value (Counter)));
Log.Info ("Unprotected counter at the end : " & Integer'Image (Unsafe));

The complete source is available in the Ada Util project in multipro.adb.

The Results

With one task, everything is Ok (Indeed!):

Starting  1 tasks
Expected value at the end      :  10000000
Counter value at the end       :  10000000
Unprotected counter at the end :  10000000

With two tasks, the problem appears:

Starting  2 tasks
Expected value at the end      :  10000000
Counter value at the end       :  10000000
Unprotected counter at the end :  8033821

And it aggravates as the number of tasks increases.

Starting  16 tasks
Expected value at the end      :  10000000
Counter value at the end       :  10000000
Unprotected counter at the end :  2496811

(The above results have been produced on an Intel Core Quad; Similar problems show up on Atom processors as well)

Explanation

On x86 processors, the compiler can use an incl instruction for the unsafe counter increment. So, one instruction for our increment. You thought it was thread safe. Big mistake!

  incl (%eax)

This instruction is atomic in a mono-processor environment, meaning that it cannot be interrupted. However, in a multi-processor environment, each processor has its own memory cache (L1 cache) and will read and increment the value in its own cache. Caches are synchronized, but this is almost always too late. Indeed, two processors can read their L1 cache, increment the value and save it at the same time (thus losing one increment). This is what is happening with the unprotected counter.

Let's see how to do the protection.

Protection with specific assembly instruction

To avoid this, it is necessary to use special instructions that will force the memory location to be synchronized and locked until the instruction completes. On x86, this is achieved by the lock instruction prefix. The following is guaranteed to be atomic on multi-processors:

  lock
  incl (%eax)

The lock instruction prefix introduces a delay to the execution of the instruction it protects. This delay increases slightly when concurrency occurs but it remains acceptable (up to 10 times slower).

For SPARC, MIPS and other processors, the implementation requires looping until either a lock is obtained (spinlock) or it is guaranteed that no other processor has modified the counter at the same time.
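
The "loop until no other processor has modified the counter" approach can be illustrated in Java with a compare-and-set retry loop. This is a sketch for comparison with the Ada test and not the Ada Util implementation: the class CounterDemo and its methods are hypothetical, and AtomicInteger comes from java.util.concurrent.atomic.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical Java counterpart of the Ada test: one safe and one unsafe counter.
final class CounterDemo {
    private int unsafe;                                     // deliberately unprotected
    private final AtomicInteger safe = new AtomicInteger();

    // Retry until no other thread changed the value between the read and the write.
    static void casIncrement(AtomicInteger counter) {
        int seen;
        do {
            seen = counter.get();
        } while (!counter.compareAndSet(seen, seen + 1));
    }

    // Runs the given number of threads; each increments both counters perTask times.
    // Returns {safe value, unsafe value}.
    static int[] run(int tasks, int perTask) {
        CounterDemo d = new CounterDemo();
        Thread[] workers = new Thread[tasks];
        for (int t = 0; t < tasks; t++) {
            workers[t] = new Thread(() -> {
                for (int i = 0; i < perTask; i++) {
                    casIncrement(d.safe);
                    d.unsafe = d.unsafe + 1;   // racy read-modify-write
                }
            });
            workers[t].start();
        }
        for (Thread w : workers) {
            try {
                w.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return new int[] { d.safe.get(), d.unsafe };
    }

    public static void main(String[] args) {
        int[] r = run(4, 1_000_000);
        System.out.println("CAS counter value at the end   : " + r[0]);
        System.out.println("Unprotected counter at the end : " + r[1]);
    }
}
```

The first value always equals the number of threads times the number of iterations; the second one varies from run to run and is usually lower on a multi-core machine.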

Source: Util.Concurrent.Counters.ads, Util.Concurrent.Counters.adb

Protection with an Ada protected type

A safe and portable counter implementation can be made by using Ada protected types. A protected type lets us define a protected procedure Increment, which provides exclusive read-write access to the data (RM 9.5.1). The protected function Value offers concurrent read-only access to the data.

package Util.Concurrent.Counters is
   type Counter is limited private;
   procedure Increment (C : in out Counter);
   function Value (C : in Counter) return Integer;
private
   protected type Cnt is
      procedure Increment;
      function Get return Integer;
   private
      N : Integer := 0;
   end Cnt;
   type Counter is limited record
      Value : Cnt;
   end record;
end Util.Concurrent.Counters;

Source: Util.Concurrent.Counters.ads, Util.Concurrent.Counters.adb


Installing an SSD device on Ubuntu

By stephane.carrez

This article explains the steps for the installation of an SSD device on an existing Ubuntu desktop PC.

Disk Performances

First of all, let's have a look at the disk read performance with the hdparm utility. The desktop PC has three disks, /dev/sda being the new SSD device (an OCZ Vertex 2 SATA II 3.5" SSD).

$ sudo -i hdparm -t /dev/sda /dev/sdb /dev/sdc

The three disks have the following performance:

sda: OCZ-VERTEX2 3.5        229.47 MB/sec
sdb: WDC WD3000GLFS-01F8U0  122.29 MB/sec
sdc: ST3200822A             59.23 MB/sec

The SSD device appears to be almost 2 times faster than the 10000 rpm disk.

Plan for the move

The first step is to plan for the move and define what files should be located on the SSD device.

Identify files used frequently

To benefit from the high read performance, files used frequently could be moved to the SSD device. To identify them, you can use the find command and the -amin option. This option will not work if the file system is mounted with noatime. The -amin option indicates a number of minutes. To find the files that were accessed during the last 24 hours, you may use the following command:

$ sudo find /home -amin -1440

In most cases, files accessed frequently are the system files (in /bin, /etc, /lib, ..., /usr/bin, /usr/lib, /usr/share, ...) and users' files located in /home.
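
The same selection can be scripted when the list has to be post-processed. The sketch below walks a directory tree with java.nio.file and keeps the regular files whose last access time falls within a given period; the class name RecentlyAccessed is hypothetical. Unlike find, it only reports regular files, and it depends on access times in the same way, so it is also defeated by a noatime mount.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.attribute.BasicFileAttributes;
import java.time.Duration;
import java.time.Instant;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper that mimics "find <root> -amin -<minutes>" for regular files.
final class RecentlyAccessed {
    // Lists regular files under root that were accessed less than maxAge ago.
    static List<Path> find(Path root, Duration maxAge) throws IOException {
        Instant limit = Instant.now().minus(maxAge);
        try (Stream<Path> paths = Files.walk(root)) {
            return paths.filter(Files::isRegularFile)
                        .filter(p -> accessedAfter(p, limit))
                        .collect(Collectors.toList());
        }
    }

    private static boolean accessedAfter(Path p, Instant limit) {
        try {
            BasicFileAttributes attrs = Files.readAttributes(p, BasicFileAttributes.class);
            return attrs.lastAccessTime().toInstant().isAfter(limit);
        } catch (IOException e) {
            return false; // entry vanished or cannot be read: skip it
        }
    }

    public static void main(String[] args) throws IOException {
        Path root = Paths.get(args.length > 0 ? args[0] : ".");
        for (Path p : find(root, Duration.ofMinutes(1440))) {
            System.out.println(p);
        }
    }
}
```

Files.walk can fail on directories that the current user cannot read, so running it on /home may require the same privileges as the find command.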

Identify Files that change frequently

Some people argue that files modified frequently should not be located on an SSD device (write endurance and write performance).

On a Linux system, the system files that change on a regular basis are generally grouped together in the /var directory. Some configuration files are modified by system daemons while they are running. The list of system directories that change can be limited to:

/etc    (cups/printers.conf.0, mtab,  lvm/cache, resolv.conf, ...)
/var    (log/*, cache/*, tmp/*, lib/*...)
/boot   (grub/grubenv modified after booting)

Temporary Files

On Linux temporary files are stored in one of the following directories. Several KDE applications are saving temporary files in the .kde/tmp-host directory for each user. These temporary files could be moved to a ram file system.

/tmp
/var/tmp
/home/$user/.kde/tmp-$host

Move plan

The final plan was to create one partition for the root file system and three LVM partitions for /usr, /var and /home directories.

Partition the drive

The drive must be partitioned with fdisk. I created one bootable partition and a second partition with what remains.

$ sudo fdisk -l /dev/sda
Disk /dev/sda: 120.0 GB, 120034123776 bytes
255 heads, 63 sectors/track, 14593 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x00070355

   Device Boot      Start         End      Blocks   Id  System
/dev/sda1   *           1        1295    10402056   83  Linux
/dev/sda2            1296       14593   106816185   83  Linux

To ease future management of partitions, it is useful to use LVM and create a volume group.

$ sudo vgcreate vg01 /dev/sda2
  Volume group "vg01" successfully created

The partitions are then created by using lvcreate. More space can be allocated on them by using the lvextend utility.

$ sudo lvcreate -L 10G -n sys vg01
  Logical volume "sys" created
$ sudo lvcreate -L 10G -n var vg01
  Logical volume "var" created
$ sudo lvcreate -L 4G -n swap vg01
  Logical volume "swap" created
$ sudo lvcreate -L 60G -n home vg01
  Logical volume "home" created

The LVM partitions are available through the device mapper and they can be accessed by their name:

$ ls -l /dev/vg01/
total 0
lrwxrwxrwx 1 root root 19 2011-02-20 14:03 home -> ../mapper/vg01-home
lrwxrwxrwx 1 root root 19 2011-02-20 14:03 swap -> ../mapper/vg01-swap
lrwxrwxrwx 1 root root 18 2011-02-20 14:03 sys -> ../mapper/vg01-sys
lrwxrwxrwx 1 root root 18 2011-02-20 14:03 var -> ../mapper/vg01-var

Format the partition

Format the file system with ext4 as it integrates various improvements which are useful for the SSD storage (Extents, Delayed allocation). Other file systems will work very well too.

$ sudo mkfs -t ext4 /dev/vg01/sys

Move the files

To move files from one system to another place, it is safer to use the tar command instead of a simple cp. Indeed, the tar command is able to copy special files without problems while not all cp commands support the copy of special files.

$ sudo mount /dev/vg01/sys /target
$ sudo -i
# cd /usr
# tar --one-file-system -cf - . | (cd /target; tar xf -)

If the file system to move is located on another LVM partition, it is easier and safer to use the pvmove utility to move physical extents from one physical volume to another one.

Change the mount point

Edit the /etc/fstab file and change the old mount point to the new one. The noatime mount option tells the kernel to avoid updating the file access time when it is read.

/dev/vg01/sys  /usr  ext4 noatime  0 2
/dev/vg01/home /home ext4 noatime  0 2
/dev/vg01/var  /var  ext4 noatime  0 2

Tune the IO Scheduler

For the SSD drive, it is best to disable the Linux IO scheduler. For this, we will activate the noop IO scheduler. Other disks will use the default IO scheduler or another one. Add the following lines in /etc/rc.local file:

test -f /sys/block/sda/queue/scheduler &&
  echo noop > /sys/block/sda/queue/scheduler

References

LVM

ext4

http://www.ocztechnologyforum.com/forum/showthread.php?54379-Linux-Tips-tweaks-and-alignment

http://www.storagesearch.com/ssdmyths-endurance.html


Boost your php web site by installing eAccelerator

By stephane.carrez 1 comment

This article explains how to boost the performance of a PHP site by installing a PHP accelerator software.

Why is PHP slow

PHP is an interpreted language: the PHP files must be parsed for each request received by the server. With a compiled language such as Java or Ada, this long and error-prone process is done beforehand. Even if the PHP interpreter is optimized, this parsing step can be long. The situation is worse when you use a framework (Symfony, CakePHP, ...) that requires many PHP files to be scanned.

eAccelerator to the rescue

eAccelerator is a module that reduces this performance issue by introducing a shared cache for the pre-compiled PHP files. The module compiles the PHP files into an internal compiled form and makes it available to the apache2 processes through a shared memory segment.

Installing eAccelerator

First get eAccelerator sources at http://eaccelerator.net/

Then extract the tar.bz2 file on your server:

$ tar xvjf eaccelerator-0.9.6.1.tar.bz2
eaccelerator-0.9.6.1/
eaccelerator-0.9.6.1/COPYING
...

Build eAccelerator module

Before building the module, you must first run the phpize command to prepare it for compilation:

$ cd eaccelerator-0.9.6.1/
$ phpize

Then, launch the configure script:

$ ./configure --enable-eaccelerator=shared \
    --with-php-config=/usr/bin/php-config

Finally build the module:

$ make

Install eAccelerator

Install the module with the following command:

$ sudo make install

Don't forget to copy the configuration file (have a look at its content but in most cases it works as is):

$ sudo cp eaccelerator.ini  /etc/php5/conf.d/

Restart Apache server

To make the module available, you have to restart the Apache server:

$ sudo /etc/init.d/apache2 restart

Performance improvements

What performance gain can you expect? That depends on the PHP software and the page, but it is easy to get an idea.

To measure the performance improvement, you can use the Apache benchmarking tool. Do a performance measurement on the web site before the installation and another one after. Be sure to benchmark the same page.

The following command will benchmark the http://mysite.mydomain.com/index.php page 100 times with only one connection.

$ ab -n 100 http://mysite.mydomain.com/index.php

Below is an extract of the percentage of the requests served within a certain time (ms) for one of my web pages served by Dotclear:

         Without        with
        eAccelerator  eAccelerator
 50%       383           236
 66%       384           237
 75%       387           238
 80%       388           239
 90%       393           258
 95%       425           265
 98%       536           295
 99%       796           307
100%       796           307 (longest request)

The gain ranges from about 38% for the median request to about 60% for the slowest ones, so it is quite interesting. The other benefit is that the variance is also smaller, meaning that requests are served in a more uniform time.
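As a rough check of these figures, the relative gain for each row can be computed as (without - with) / without. The sketch below does this in Python with the values copied from the table above; the function and variable names are only illustrative.

```python
# Response times (ms) copied from the table above, one entry per percentile row.
PERCENTILES = [50, 66, 75, 80, 90, 95, 98, 99, 100]
WITHOUT_MS = [383, 384, 387, 388, 393, 425, 536, 796, 796]
WITH_MS = [236, 237, 238, 239, 258, 265, 295, 307, 307]


def reduction_percent(before_ms, after_ms):
    """Return the response-time reduction as a percentage of the original time."""
    return 100.0 * (before_ms - after_ms) / before_ms


def gains():
    """Map each percentile row to its response-time reduction in percent."""
    return {
        percentile: reduction_percent(before, after)
        for percentile, before, after in zip(PERCENTILES, WITHOUT_MS, WITH_MS)
    }


if __name__ == "__main__":
    for percentile, gain in gains().items():
        print(f"{percentile:3d}%  {gain:5.1f}% less time")
```

The same function can be reused with the output of your own ab runs to compare a page before and after the installation.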

Solving Linux system lock up when intensive disk I/O are performed

By stephane.carrez

When a system lock up occurs, we often blame applications, but when you look carefully you may see that despite your multi-core CPU, your applications are sleeping! No CPU activity! So what happens then? Check the I/Os, they could be the root cause!

With Ubuntu 10.04, my desktop computer was freezing when the ReadyNAS Bacula backup was running. Indeed, the Bacula daemon was performing intensive disk operations (on a fast SATA hard disk). The situation was such that it was impossible to use the system: the interface was freezing for several seconds, then working for a few seconds and freezing again.

Linux I/O Scheduler

The I/O scheduler is responsible for organizing the order in which disk operations are performed. Some algorithms minimize the disk head movements, other algorithms try to anticipate read operations.

When I/O operations are not scheduled correctly, an interactive application such as a desktop or a browser can be blocked until its I/O operations are scheduled and executed (the situation can be even worse for those applications that use the O_SYNC writing mode).

By default, the Linux kernel is configured to use the Completely Fair Queuing scheduler. This I/O scheduler does not provide any time guarantee but it generally gives good performance. Linux provides other I/O schedulers such as the Noop scheduler, the Anticipatory scheduler and the Deadline scheduler.

The deadline scheduler sets an expiration time on each request to make sure the I/O operation is executed before that time. Typically, a read operation will wait at most 500 ms. This is the I/O scheduler we need to avoid the system lock up.

Checking the I/O Scheduler

To check which I/O scheduler you are using, you can use the following command:

$ cat /sys/block/sda/queue/scheduler
noop anticipatory deadline [cfq]

where sda is the device name of your hard disk (or try hda).

The result lists the supported I/O schedulers; the one between brackets is the scheduler currently in use (here the Completely Fair Queuing).
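If you want to check the scheduler from a script, the active entry is the one between square brackets. The following is a minimal Python sketch of that parsing, assuming the file has the format shown above; the function names are illustrative and do not belong to any existing tool.

```python
import re
from pathlib import Path


def active_scheduler(line):
    """Return the scheduler name enclosed in brackets, or None if there is none."""
    match = re.search(r"\[([^\]]+)\]", line)
    return match.group(1) if match else None


def available_schedulers(line):
    """Return every scheduler name listed on the line, brackets removed."""
    return [name.strip("[]") for name in line.split()]


def read_active_scheduler(device="sda"):
    """Read the sysfs file of a block device; None if the file does not exist."""
    path = Path(f"/sys/block/{device}/queue/scheduler")
    return active_scheduler(path.read_text()) if path.is_file() else None


if __name__ == "__main__":
    sample = "noop anticipatory deadline [cfq]"
    print(available_schedulers(sample), "->", active_scheduler(sample))
```

Reading the file does not require root access; only changing the scheduler does.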

Changing the I/O Scheduler

To change the scheduler, you can echo the desired scheduler name to activate it (you must be root):

# echo deadline >  /sys/block/sda/queue/scheduler

To make sure the I/O scheduler is configured after each system startup, you can add the following lines to your /etc/rc.local startup script:

test -f /sys/block/sda/queue/scheduler &&
  echo deadline > /sys/block/sda/queue/scheduler

test -f /sys/block/sdb/queue/scheduler &&
   echo deadline > /sys/block/sdb/queue/scheduler

test -f /sys/block/hda/queue/scheduler &&
   echo deadline > /sys/block/hda/queue/scheduler

You may have to change the sda and sdb into hda and hdb if you have an IDE hard disk.

Conclusion

After changing the I/O scheduler to use the Deadline scheduler, the desktop no longer froze while backups were running.

Experience feedback in running a SaaS application

By stephane.carrez

Create a Flexible Architecture

The application architecture can have long-term and critical impacts on performance and growth. It must be flexible enough to deploy components on dedicated servers when needed. But flexibility has a development cost and a performance cost, and it is not always necessary. Carefully identifying the components is the key. For each component it is necessary to know how it is used and what its impact is on the performance of the overall application. If the architecture is not designed or studied correctly, it can be impossible to reorganize the deployment when issues arise.

Planzone is using a traditional multi-tier J2EE architecture. I have organized the architecture in 5 web applications (WAR) that can be deployed on the same server or on different servers. The web applications have different roles: the core application, API access, batch processing, ... These web applications are deployed on every server and we can activate them easily when necessary.

Deploy Early

Deploying a new application or service should be done early, even when only a few users will use the application. By going into production sooner rather than later you get the opportunity to see problems while you have less traffic. You can learn and watch how your users are using the service. Last but not least, you are in a real situation and you are forced to identify and solve real problems immediately.

For our service, we launched the beta version of Planzone in December 2007 and opened it to our initial beta users (300 users). At this stage, we had no performance issue but we could collect good feedback on the product, identify missing features and get ideas to improve our service.

Monitor the application from the beginning

Monitoring is the key when the user growth rate is unknown (and even after!). It must be put in place at the same time the service is deployed. A careful monitoring solution will help to identify early whether the application has performance issues or whether the infrastructure has to be changed because user growth requires it.

We put in place a simple monitoring solution based on Cacti and Nagios. But this was not enough because these tools only provide a coarse monitoring view of the application. I put in place request monitoring within the application to identify the bottlenecks early (I'll describe it in another post).

Optimize when the monitoring says so

The Pareto principle states that 80% of the effects come from 20% of the causes. For software optimization, this 80-20 rule means that by optimizing 20% of the code, we solve 80% of performance problems. The monitoring solution must be used to identify the 20% of pages, or the 20% of database requests, etc., which are the most used and are potentially causing a bottleneck. Because the system is in production, the monitoring data is real and not simulated. Therefore you know what to do.
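To make the selection concrete, here is a minimal Python sketch that picks the pages accounting for most of the time spent. It is not the monitoring code used in Planzone: the page names, the timings and the 80% threshold are hypothetical values chosen for illustration.

```python
def hot_spots(total_time_by_page, share=0.8):
    """Return the most expensive pages, in decreasing order, whose cumulative
    time is the first to reach `share` of the total time spent."""
    total = sum(total_time_by_page.values())
    ranked = sorted(total_time_by_page.items(), key=lambda item: item[1], reverse=True)
    selected, cumulative = [], 0.0
    for page, spent in ranked:
        if cumulative >= share * total:
            break
        selected.append(page)
        cumulative += spent
    return selected


if __name__ == "__main__":
    # Hypothetical monitoring data: total milliseconds spent per page.
    monitoring = {
        "/dashboard": 5200,
        "/tasks": 2900,
        "/login": 700,
        "/profile": 500,
        "/help": 400,
        "/about": 300,
    }
    print(hot_spots(monitoring))
```

With this sample data, two pages out of six are selected, and these are the candidates to look at first. The same function works for database queries if the dictionary is keyed by query instead of page.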

As far as Planzone is concerned, I decided to optimize only one or two pages (out of more than 200) and two or three database queries (out of more than 180). The choice of which page to optimize, and when, was driven by the monitoring results. With the team we kept an eye on the monitoring data and we decided to fix performance issues when they seemed to appear (once or twice every 6 months).

Update as soon as possible

Optimization solves the problems detected by the monitoring. As soon as a solution is found and is functionally validated, the production system must be updated. Do not wait! Waiting at this stage can aggravate the situation because more users can use the platform and the database will grow (anyway).

With Planzone, we decided to update the service on a regular basis, basically every two months in 2008 and 2009 and each month since the beginning of 2010 (without service interruption!). This helped us a lot in keeping a good quality for the service both on the performance side and on the functional side. Each update contains new (small) features, bug fixes and the performance improvements that are necessary (and no more).

Plan for load spikes

Careful monitoring of the application shows the infrastructure usage in terms of CPU, memory and disk load. Most of the time you will see that the infrastructure is not used at its maximum capacity. Users don't all use the service at the same time, but since you don't control them you may observe intensive use during some periods. If the infrastructure is already at its maximum during normal usage, you have no headroom for these periods of intensive use.

For Planzone we have seen that we often get a load spike every Tuesday and at different hours during the week. Indeed, the load spikes correspond to users who need the service during their business hours. Even during these spikes, the service remains very responsive for users. The load is below 20% in these cases and this gives us room for growth.

Conclusion

From a technical point of view, the architecture, the early deployment, the monitoring, the late optimization and the continuous service updates were the keys to Planzone's success.

At the beginning of the project we also put in place an internal benchlab infrastructure to make stress and performance measurements. It turns out that production monitoring results were more interesting and valuable than simulating high loads. Our benchlab is now used only for functional validation.

How google analytics can alter your web performance

By stephane.carrez

When the Cookie Crumbles

Yahoo!'s famous Exceptional Performance team studied and defined the best practices for designing fast web pages. Steve Souders turned this study into the famous High Performance Web Sites book (I strongly recommend this book!). The team continued its work and studied the impact of cookies on a web site. Tenni Theurer explains in Performance Research, Part 3: When the Cookie Crumbles that cookies impact the performance of requests. Indeed, cookies have to be sent by the browser with each request. Having lots of cookies, or large ones, will slow down each request.

Google Analytics

To track visits, GA sets several cookies on the domain (see From __utma to __utmz (Google Analytics Cookies)). If you look at those cookies, you'll be amazed by their number and their size.

Case study on Planzone

For Planzone, after browsing the marketing site and entering the application, I found that my browser was sending 583 bytes of cookies, 76% for Google Analytics and 23% for the application itself.

The Planzone application cookies have been optimized. The number of cookies is reduced to the minimum and their size is small. Each time an AJAX request is made, we must send around 135 bytes for the application cookies (which is reasonably small).

Application Cookies: 135 bytes (23%)

(46 bytes) AID=XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX;
(43 bytes) JSESSIONID=XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX;
(46 bytes) SPID=bb8001.121aee.p5/na8fcrGLz0YsfM04nSHET9w8;

The Google Analytics cookies are amazingly big: 8 cookies and a total size of 448 bytes.

GA Cookies: 448 bytes (76%)

(34 bytes) __qca=1171833534-90015634-27218739;
(63 bytes) __utma=68692688.1831257980.1215462364.1238622130.1238700076.158;
(103 bytes) __utmz=68692688.1238622130.157.47.utmccn=(organic)|utmcsr=google|utmctr=planzone+referal|utmcmd=organic;
(60 bytes) __utma=23092397.781561637.1237016736.1237016736.1238443952.2;
(123 bytes) __utmz=23092397.1238443952.2.2.utmccn=(referral)|utmcsr=planzone.com|utmcct=/planzone/f10-team.planzone.com|utmcmd=referral;
(15 bytes) __utmc=68692688;
(15 bytes) __qcb=295201554;
(15 bytes) __utmb=68692688; 

This cookie overhead is small and not visible when you display a page. For an AJAX request, however, you expect some interaction and you expect the request to be fast. In many cases, the AJAX request has been optimized to have a short request and a short response, so the cookies can represent a significant part of what is sent.
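To make a similar measurement on another site, you can copy the Cookie request header from your browser's developer tools and compute the size of each cookie. The Python sketch below assumes the usual "name=value; name=value" header format; the function names and the sample header are illustrative only.

```python
def cookie_sizes(cookie_header):
    """Return a list of (name, size in bytes of 'name=value') for each cookie.

    A list is used rather than a dict because the same cookie name can
    appear more than once in a header.
    """
    sizes = []
    for pair in cookie_header.split(";"):
        pair = pair.strip()
        if not pair:
            continue
        # Only the first '=' separates the name from the value.
        name, _, _value = pair.partition("=")
        sizes.append((name, len(pair.encode("utf-8"))))
    return sizes


def share_percent(part_bytes, total_bytes):
    """Return the share of `part_bytes` in `total_bytes`, rounded to 0.1%."""
    return round(100.0 * part_bytes / total_bytes, 1)


if __name__ == "__main__":
    # Shortened sample header, not a real capture.
    header = "AID=abc; __utmc=68692688; __utmz=1.2.utmccn=(organic)|utmcsr=google"
    for name, size in cookie_sizes(header):
        print(f"({size} bytes) {name}")
    # Shares of the totals reported in this article.
    print(share_percent(135, 583), share_percent(448, 583))
```

The sizes count the name, the equal sign and the value of each cookie, without the separator that follows it.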

Explanation

The reason why Google Analytics pollutes the Planzone service is that it sets the qca, utma and utmz-like cookies on the domain. Since the marketing site and the application share the same domain, the browser also sends these cookies with every request made to the application.

Conclusion

  • Do not add Google Analytics within your AJAX application.
  • If your marketing site uses GA, consider removing those cookies once the user is logged in.
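
One possible way to remove the cookies is to have the application answer the login request with Set-Cookie headers that expire them. This is only a sketch using Python's standard http.cookies module, not what Planzone does: the domain .example.com is a placeholder, and a browser only drops a cookie when the domain and path match those it was set with.

```python
from http.cookies import SimpleCookie

# Cookie names taken from the list in this article.
GA_COOKIE_NAMES = ["__utma", "__utmb", "__utmc", "__utmz"]


def expire_cookie_headers(names, domain, path="/"):
    """Build one Set-Cookie header value per cookie, each with Max-Age=0
    so that the browser discards the cookie."""
    headers = []
    for name in names:
        cookie = SimpleCookie()
        cookie[name] = "deleted"          # placeholder value, ignored by the browser
        cookie[name]["domain"] = domain   # assumption: must match the original domain
        cookie[name]["path"] = path       # assumption: must match the original path
        cookie[name]["max-age"] = 0
        headers.append(cookie[name].OutputString())
    return headers


if __name__ == "__main__":
    for value in expire_cookie_headers(GA_COOKIE_NAMES, ".example.com"):
        print("Set-Cookie:", value)
```

Each returned string is meant to be sent as the value of a separate Set-Cookie response header.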