Blog posts - Tag performance


Ada, Java and Python database access

By Stephane Carrez, 2018-11-17 14:02:00

How do Ada, Java and Python compare with each other when they are used to connect to a database? This was my main motivation for writing the SQL Benchmark and this article.


Rest API Benchmark comparison between Ada and Java

By Stephane Carrez, 2017-03-21 22:55:00 (3 comments)

Arcadius Ahouansou from Menelic.com made an interesting benchmark to compare several Java Web servers: Java REST API Benchmark: Tomcat vs Jetty vs Grizzly vs Undertow, Round 3. His benchmark is not as broad as the TechEmpower Benchmark but it has the merit of being simple to understand and it can be executed very easily by everyone. I decided to make a similar benchmark for Ada Web servers with the same REST API so that it would be possible to compare Ada and Java implementations.


World IPv6 Day

By Stephane Carrez, 2013-12-31 14:31:15

Today, June 8th 2011, is the World IPv6 day. Major organisations such as Google, Facebook and Yahoo! will offer native IPv6 connectivity.

To check your IPv6 connectivity, you can run a test from your browser: Test your IPv6 connectivity.

If you install the ShowIP Firefox plugin, you will know the IP address of web sites while you browse and therefore quickly know whether you navigate using IPv4 or IPv6.

Below are some basic performance results between IPv4 and IPv6. Since most routers are tuned for IPv4, the IPv6 flow path is not yet as fast as IPv4. The (small) performance degradation has nothing to do with the IPv6 protocol.

Google IPv4 vs IPv6 ping

$ ping -n www.google.com
PING www.l.google.com (209.85.146.103) 56(84) bytes of data.
64 bytes from 209.85.146.103: icmp_seq=1 ttl=55 time=9.63 ms

$ ping6 -n www.google.com
PING www.google.com(2a00:1450:400c:c00::67) 56 data bytes
64 bytes from 2a00:1450:400c:c00::67: icmp_seq=1 ttl=56 time=11.6 ms

Yahoo IPv4 vs IPv6 ping

$ ping -n www.yahoo.com
PING fpfd.wa1.b.yahoo.com (87.248.122.122) 56(84) bytes of data.
64 bytes from 87.248.122.122: icmp_seq=1 ttl=58 time=25.7 ms

$ ping6 -n www.yahoo.com
PING www.yahoo.com(2a00:1288:f00e:1fe::3000) 56 data bytes
64 bytes from 2a00:1288:f00e:1fe::3000: icmp_seq=1 ttl=60 time=31.3 ms

Facebook IPv4 vs IPv6 ping

$ ping -n www.facebook.com
PING www.facebook.com (66.220.156.25) 56(84) bytes of data.
64 bytes from 66.220.156.25: icmp_seq=1 ttl=247 time=80.6 ms

$ ping6 -n www.facebook.com
PING www.facebook.com(2620:0:1c18:0:face:b00c:0:1) 56 data bytes
64 bytes from 2620:0:1c18:0:face:b00c:0:1: icmp_seq=1 ttl=38 time=98.6 ms
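
The ping outputs above mix IPv4 addresses (such as 209.85.146.103) and IPv6 addresses (such as 2a00:1450:400c:c00::67). As a minimal sketch, the same distinction can be made from a program with the standard java.net.InetAddress class. The class name IpFamily and its methods are hypothetical names chosen for this illustration, and the main method needs a working DNS resolver.

```java
import java.net.Inet6Address;
import java.net.InetAddress;
import java.net.UnknownHostException;

final class IpFamily {
    /** Returns "IPv6" or "IPv4" depending on the concrete address class. */
    static String of(InetAddress addr) {
        return (addr instanceof Inet6Address) ? "IPv6" : "IPv4";
    }

    /** Same check for a literal address; no DNS lookup is made for literals. */
    static String ofLiteral(String literal) {
        try {
            return of(InetAddress.getByName(literal));
        } catch (UnknownHostException e) {
            throw new IllegalArgumentException("not an IP literal: " + literal, e);
        }
    }

    public static void main(String[] args) throws UnknownHostException {
        // With a host name, getAllByName performs a real DNS lookup.
        String host = args.length > 0 ? args[0] : "www.google.com";
        for (InetAddress a : InetAddress.getAllByName(host)) {
            System.out.println(a.getHostAddress() + " -> " + of(a));
        }
    }
}
```

A host that only lists IPv4 addresses here cannot be reached over IPv6, whatever the local connectivity is.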


Optimization with Valgrind Massif and Cachegrind

By Stephane Carrez, 2013-03-02 22:51:00

Memory optimization sometimes reveals nice surprises. I was interested in analyzing the memory used by the Ada Server Faces framework. For this I've profiled the unit test program. This includes 130 tests that cover almost all the features of the framework.

Memory analysis with Valgrind Massif

Massif is a Valgrind tool that is used for heap analysis. It does not require the application to be re-compiled and can be used easily. The application is executed by using Valgrind and its tool Massif. The command that I've used was:

valgrind --tool=massif --threshold=0.1 \
   --detailed-freq=1 --alloc-fn=__gnat_malloc \
   bin/asf_harness -config test.properties

The valgrind tool creates a file massif.out.NNN which contains the analysis. The massif-visualizer is a graphical tool that reads the file and allows you to analyze the results. It is launched as follows:

massif-visualizer massif.out.19813

(the number is the pid of the process that was running, replace it accordingly).

The tool provides a graphical representation of the memory used over time. It lets you highlight a given memory snapshot and understand roughly where the memory is used.

While looking at the result, I was intrigued by a 1 MB allocation that was made several times and then released (it creates these visual spikes and it corresponds to the big red horizontal bar that appears visually). It was within the sax-utils.adb file that is part of the XML/Ada library. Looking at the implementation, it turns out that it allocates a hash table with 65536 entries. This allocation is done each time the sax parser is created. I've reduced the size of this hash table to 1024 entries. If you want to do it, change the following line in sax/sax-symbols.ads (line 99):

   Hash_Num : constant := 2**16;

by:

   Hash_Num : constant := 2**10;

After rebuilding and checking that there is no regression (yes, it works), I re-ran the Massif tool and here are the results.

The peak memory was reduced from 2.7 MB to 2.0 MB. The memory usage is now easier to understand and analyse because the 1 MB allocation is gone. Other memory allocations have more importance now. But wait. There is more! My program is now faster!

Cache analysis with cachegrind

To understand why the program is now faster, I've used Cachegrind that measures processor cache performance. Cachegrind is a cache and branch-prediction profiler provided by Valgrind as another tool. I've executed the tool with the following command:

valgrind --tool=cachegrind \
    bin/asf_harness -config test.properties

I've launched it once before the hash table correction and once after. Similar to Massif, Cachegrind generates a file cachegrind.out.NNN that contains the analysis. You analyze the result by using either cg_annotate or kcachegrind. Having two Cachegrind files, I've used cg_diff to get the difference between the two executions.

cg_diff cachegrind.out.24198 cachegrind.out.23286 > cg.out.1
cg_annotate cg.out.1

Before the fix, we can see in the Cachegrind report that the most intensive memory operations are performed by the Sax.Htable.Reset operation and by the GNAT operation that initializes the Sax.Symbols.Symbol_Table_Record type which contains the big hash table. Dr is the number of data reads, D1mr the number of L1 cache read misses, Dw the number of data writes and D1mw the number of L1 cache write misses. A lot of cache misses will slow down the execution: an L1 cache access requires a few cycles while a main memory access could cost several hundred of them.

--------------------------------------------------------------------------------
         Dr      D1mr          Dw      D1mw 
--------------------------------------------------------------------------------
212,746,571 2,787,355 144,880,212 2,469,782  PROGRAM TOTALS

--------------------------------------------------------------------------------
        Dr      D1mr         Dw      D1mw  file:function
--------------------------------------------------------------------------------
25,000,929 2,081,943     27,672       244  sax/sax-htable.adb:sax__symbols__string_htable__reset
       508       127 33,293,050 2,080,768  sax/sax-htable.adb:sax__symbols__symbol_table_recordIP
43,894,931   129,786  7,532,775     8,677  ???:???
15,021,128     4,140  5,632,923         0  pthread_getspecific
 7,510,564     2,995  7,510,564    10,673  ???:system__task_primitives__operations__specific__selfXnn
 6,134,652    41,357  4,320,817    49,207  _int_malloc
 4,774,547    22,969  1,956,568     4,392  _int_free
 3,753,930         0  5,630,895     5,039  ???:system__task_primitives__operations(short,...)(long, float)

With a smaller hash table, the Cachegrind report indicates a reduction of 24,543,482 data reads and 32,765,323 data writes. The L1 cache read misses were reduced by 2,086,579 (about 75%) and the L1 cache write misses by 2,056,247 (an 83% reduction!).

With a small hash table, the Sax.Symbols.Symbol_Table_Record gets initialized quicker and its cleaning needs fewer memory accesses, hence fewer CPU cycles. By having a smaller hash table, we also benefit from fewer cache misses: using a 1 MB hash table flushes a big part of the data cache.

--------------------------------------------------------------------------------
         Dr    D1mr          Dw    D1mw 
--------------------------------------------------------------------------------
188,203,089 700,776 112,114,889 413,535  PROGRAM TOTALS

--------------------------------------------------------------------------------
        Dr    D1mr        Dw   D1mw  file:function
--------------------------------------------------------------------------------
43,904,760 120,883 7,532,577  8,407  ???:???
15,028,328      98 5,635,623      0  pthread_getspecific
 7,514,164     288 7,514,164  9,929  ???:system__task_primitives__operations__specific__selfXnn
 6,129,019  39,636 4,305,043 48,446  _int_malloc
 4,784,026  18,626 1,959,387  3,261  _int_free
 3,755,730       0 5,633,595  4,390  ???:system__task_primitives__operations(short,...)(long, float)
 2,418,778      65 2,705,140     14  ???:system__tasking__initialization__abort_undefer
 3,839,603   2,605 1,283,289      0  malloc
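
The reductions quoted above can be recomputed from the two PROGRAM TOTALS lines. The following is only a small sketch of that arithmetic: the class name CachegrindDelta is hypothetical, and the numbers are copied by hand from the two reports.

```java
// Recompute the Cachegrind deltas from the two "PROGRAM TOTALS" lines.
final class CachegrindDelta {
    /** Absolute reduction between two counter values. */
    static long reduction(long before, long after) {
        return before - after;
    }

    /** Reduction expressed as a percentage of the "before" value. */
    static double reductionPercent(long before, long after) {
        return 100.0 * (before - after) / before;
    }

    public static void main(String[] args) {
        // Columns: Dr, D1mr, Dw, D1mw (before and after the hash table change).
        String[] names = {"Dr", "D1mr", "Dw", "D1mw"};
        long[] before = {212_746_571L, 2_787_355L, 144_880_212L, 2_469_782L};
        long[] after = {188_203_089L, 700_776L, 112_114_889L, 413_535L};
        for (int i = 0; i < names.length; i++) {
            System.out.printf("%-5s reduced by %,d (%.1f%%)%n", names[i],
                    reduction(before[i], after[i]),
                    reductionPercent(before[i], after[i]));
        }
    }
}
```

The total number of reads and writes drops only moderately, while the L1 miss counters drop by roughly three quarters or more, which is consistent with the speed-up coming mostly from the cache.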

Conclusion

Running massif and cachegrind is very easy but it may take some time to figure out how to understand and use the results. A big hash table is not always a good thing for an application. By creating cache misses it may in fact slow down the application. To learn more about this subject, I recommend the excellent document What Every Programmer Should Know About Memory written by Ulrich Drepper.


Thread safe cache updates in Java and Ada

By Stephane Carrez, 2011-04-28 22:01:14 (2 comments)

Problem Description

The problem is to update a cache that is almost never modified and only read in multi-threaded context. The read performance is critical and the goal is to reduce the thread contention as much as possible to obtain a fast and non-blocking path when reading the cache.

Cache Declaration

Java Implementation

Let's define the cache using the HashMap class.

public class Cache {
   private HashMap<String,String> map = new HashMap<String, String>();
}

Ada Implementation

In Ada, let's instantiate the Indefinite_Hashed_Maps package for the cache.

with Ada.Strings.Hash;
with Ada.Containers.Indefinite_Hashed_Maps;
...
  package Hash_Map is
    new Ada.Containers.Indefinite_Hashed_Maps (Key_Type        => String,
                                               Element_Type    => String,
                                               Hash            => Ada.Strings.Hash,
                                               Equivalent_Keys => "=");

  Map : Hash_Map.Map;

Solution 1: safe and concurrent implementation

This solution is a straightforward solution using the language thread safe constructs. In Java this solution does not allow several threads to look at the cache at the same time: the cache accesses are serialized. This is not a problem with Ada, where multiple concurrent readers are allowed. Only writing locks the cache object.

Java Implementation

The thread safe implementation is protected by the synchronized keyword. It guarantees mutual exclusion of threads invoking the getCache and addCache methods.

   public synchronized String getCache(String key) {
      return map.get(key);
   }
   public synchronized void addCache(String key, String value) {
      map.put(key, value);
   }

Ada Implementation

In Ada, we can use the protected type. The cache could be declared as follows:

  protected type Cache is
    function Get(Key : in String) return String;
    procedure Put(Key, Value: in String);
  private
    Map : Hash_Map.Map;
  end Cache;

and the implementation is straightforward:

  protected body Cache is
    function Get(Key : in String) return String is
    begin
       return Map.Element (Key);
    end Get;
    procedure Put(Key, Value: in String) is
    begin
       Map.Insert (Key, Value);
    end Put;
  end Cache;

Pros and Cons

+: This implementation is thread safe.

-: In Java, thread contention is high as only one thread can look in the cache at a time.

-: In Ada, thread contention occurs only if another thread updates the cache (which is far better than Java but could be annoying for realtime performance if the Put operation takes time).

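
For comparison, Java can get closer to the Ada protected type behaviour (concurrent readers, exclusive writer) by swapping the synchronized keyword for a java.util.concurrent.locks.ReentrantReadWriteLock. This is a different technique from the one shown above, given here only as a sketch; the class name ReadWriteCache is a hypothetical name.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

final class ReadWriteCache {
    private final Map<String, String> map = new HashMap<String, String>();
    private final ReadWriteLock lock = new ReentrantReadWriteLock();

    /** Several threads may hold the read lock at the same time. */
    String getCache(String key) {
        lock.readLock().lock();
        try {
            return map.get(key);
        } finally {
            lock.readLock().unlock();
        }
    }

    /** The write lock is exclusive: it waits for readers and blocks new ones. */
    void addCache(String key, String value) {
        lock.writeLock().lock();
        try {
            map.put(key, value);
        } finally {
            lock.writeLock().unlock();
        }
    }
}
```

Readers still pay for acquiring and releasing the read lock, so this reduces contention between readers but does not give a lock-free read path.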

Solution 2: weak but efficient implementation

Solution 1 does not allow multiple threads to access the cache at the same time (in Java, not even for reading), which creates a contention point. The second solution proposed here removes this contention point by relaxing some thread safety conditions at the expense of cache behavior.

In this second solution, several threads can read the cache at the same time. The cache can be updated by one or several threads but the update does not guarantee that all entries added will be present in the cache. In other words, if two threads update the cache at the same time, the updated cache may contain only one of the new entries. This behavior can be acceptable in some cases but it does not fit all uses. It must be used with great care.

Java Implementation

A cache entry can be added in a thread-safe manner using the following code:

   private volatile HashMap<String, String> map = new HashMap<String, String>();
   public String getCache(String key) {
      return map.get(key);
   }
   public void addCache(String key, String value) {
      // Work on a private copy: the published map is never modified.
      HashMap<String, String> newMap = new HashMap<String, String>(map);

      newMap.put(key, value);
      map = newMap; // publish the new map with an atomic reference assignment
   }

This implementation is thread safe because the hash map is never modified. If a modification is made, it is done on a separate hash map object. The new hash map is then installed by the map = newMap assignment operation which is atomic. Again this code extract does not guarantee that all the cache entries added will be part of the cache.
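
If losing entries is not acceptable, one possible variant keeps the same copy-then-publish idea but installs the new map with a compare-and-set loop on a java.util.concurrent.atomic.AtomicReference. This is not the implementation described above, only a sketch of an alternative; the class name CopyOnWriteCache and the demo method are hypothetical names.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicReference;

final class CopyOnWriteCache {
    // The published map is never mutated once it has been installed.
    private final AtomicReference<Map<String, String>> ref =
            new AtomicReference<Map<String, String>>(new HashMap<String, String>());

    String getCache(String key) {
        return ref.get().get(key); // lock-free read of the current snapshot
    }

    void addCache(String key, String value) {
        while (true) {
            Map<String, String> current = ref.get();
            Map<String, String> copy = new HashMap<String, String>(current);
            copy.put(key, value);
            // Install the copy only if no other writer replaced the map in
            // the meantime; otherwise retry on top of the newer snapshot.
            if (ref.compareAndSet(current, copy)) {
                return;
            }
        }
    }

    int size() {
        return ref.get().size();
    }

    /** Several writer threads add distinct keys; returns the final size. */
    static int demo(int threads, int perThread) {
        final CopyOnWriteCache cache = new CopyOnWriteCache();
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            final int id = t;
            workers[t] = new Thread(() -> {
                for (int i = 0; i < perThread; i++) {
                    cache.addCache("k" + id + "-" + i, "v");
                }
            });
            workers[t].start();
        }
        for (Thread w : workers) {
            try {
                w.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return cache.size();
    }
}
```

Readers remain non-blocking. Writers may copy the map more than once under contention, which is acceptable only because the cache is almost never modified.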

Ada Implementation

The Ada implementation is slightly more complex, basically because there is no garbage collector. If we allocate a new hash map and update the access pointer, we still have to free the old hash map when no other thread is accessing it.

The first step is to use a reference counter to automatically release the hash table when the last thread finishes its work. The reference counter will handle memory management issues for us. An implementation of a thread-safe reference counter is provided by Ada Util. In this implementation, counters are updated using specific instructions (See Showing multiprocessor issue when updating a shared counter).

with Util.Refs;
...
   type Cache is new Util.Refs.Ref_Entity with record
      Map : Hash_Map.Map;
   end record;
   type Cache_Access is access all Cache;

   package Cache_Ref is new Util.Refs.References (Element_Type => Cache,
                Element_Access => Cache_Access);

  C : Cache_Ref.Atomic_Ref;

Source: Util.Refs.ads, Util.Refs.adb

The References package defines a Ref type representing the reference to a Cache instance. To be able to replace a reference by another one in an atomic manner, it is necessary to use the Atomic_Ref type. This is necessary because the Ada assignment of a Ref type is not atomic (the assignment copy and the call to the Adjust operation to update the reference counter are not atomic). The Atomic_Ref type is a protected type that provides a getter and a setter. Their use guarantees the atomicity.

    function Get(Key : in String) return String is
      R : constant Cache_Ref.Ref := C.Get;
    begin
       return R.Value.Map.Element (Key); -- concurrent access
    end Get;
    procedure Put(Key, Value: in String) is
       R : constant Cache_Ref.Ref := C.Get;
       N : constant Cache_Ref.Ref := Cache_Ref.Create;
    begin
       N.Value.all.Map := R.Value.Map;      -- copy the current map
       N.Value.all.Map.Insert (Key, Value); -- update the private copy
       C.Set (N); -- install the new map atomically
    end Put;

Pros and Cons

+: high performance in SMP environments
+: no thread contention in Java
-: cache updates can lose some entries
-: still some thread contention in Ada but limited to copying a reference (C.Set)


Showing multiprocessor issue when updating a shared counter

By Stephane Carrez, 2011-03-06 09:52:43

When working on several Ada concurrent counter implementations, I was interested in pointing out the concurrency issue that exists in multi-processor environments. This article explains why you really have to take this issue seriously in multi-task applications, especially because multi-core processors are now quite common.

What's the issue

Let's say we have a simple integer shared by several tasks:

Counter : Integer;

And several tasks will use the following statement to increment the counter:

  Counter := Counter + 1;

We will see that this implementation is wrong (even if a single instruction is used).

Multi task increment sample

To show up the issue, let's define two counters. One not protected and another protected from concurrent accesses by using a specific data structure provided by the Ada Util library.

with Util.Concurrent.Counters;
...
  Unsafe  : Integer := 0;
  Counter : Util.Concurrent.Counters.Counter;

In our testing procedure, let's declare a task type that will increment both versions of our counters. Several tasks will run concurrently so that the shared counter variables will experience a lot of concurrent accesses. The task type is declared in a declare block inside our procedure so that we will benefit from task synchronisation at the end of the block (See RM 7.6, and RM 9.3).

Each task will increment both counters in a loop. We should expect the two counters to get the same value at the end. We will see this is not the case in multi-processor environments.

declare
  task type Worker is
    entry Start (Count : in Natural);
  end Worker;

  task body Worker is
    Cnt : Natural;
  begin
      accept Start (Count : in Natural) do
        Cnt := Count;
      end;
      for I in 1 .. Cnt loop
        Util.Concurrent.Counters.Increment (Counter);
        Unsafe := Unsafe + 1;
      end loop;
  end Worker;

Now, in the same declaration block, we will define an array of tasks to show up the concurrency.

   type Worker_Array is array (1 .. Task_Count) of Worker;
   Tasks : Worker_Array;

Our tasks are activated and they are waiting to get the counter. Let's make our tasks count 10 million times.

begin
  for I in Tasks'Range loop
    Tasks (I).Start (10_000_000);
  end loop;
end;

Before leaving the declare scope, Ada will wait until the tasks have finished (yes, there is no need to write any pthread_join code). After this block, we can just print out the value stored in the two counters and compare them:

Log.Info ("Counter value at the end       : " & Integer'Image (Value (Counter)));
Log.Info ("Unprotected counter at the end : " & Integer'Image (Unsafe));

The complete source is available in the Ada Util project in multipro.adb.

The Results

With one task, everything is Ok (Indeed!):

Starting  1 tasks
Expected value at the end      :  10000000
Counter value at the end       :  10000000
Unprotected counter at the end :  10000000

With two tasks, the problem appears:

Starting  2 tasks
Expected value at the end      :  10000000
Counter value at the end       :  10000000
Unprotected counter at the end :  8033821

And it gets worse as the number of tasks increases.

Starting  16 tasks
Expected value at the end      :  10000000
Counter value at the end       :  10000000
Unprotected counter at the end :  2496811

(The above results have been produced on an Intel Core Quad; similar problems show up on Atom processors as well.)

Explanation

On x86 processors, the compiler can use an incl instruction for the unsafe counter increment. So, one instruction for our increment. You thought it was thread safe. Big mistake!

  incl (%eax)

This instruction is atomic in a mono-processor environment, meaning that it cannot be interrupted. However, in a multi-processor environment, each processor has its own memory cache (L1 cache) and will read and increment the value in its own cache. Caches are synchronized but this is almost always too late. Indeed, two processors can read their L1 cache, increment the value and save it at the same time (thus losing one increment). This is what is happening with the unprotected counter.

Let's see how to do the protection.

Protection with specific assembly instruction

To avoid this, it is necessary to use special instructions that force the memory location to be synchronized and locked until the instruction completes. On x86, this is achieved by the lock instruction prefix. The following is guaranteed to be atomic on multi-processors:

  lock
  incl (%eax)

The lock instruction prefix introduces a delay to the execution of the instruction it protects. This delay increases slightly when concurrency occurs but it remains acceptable (up to 10 times slower).

For Sparc, Mips and other processors, the implementation requires looping until either a lock is acquired (spinlock) or it is guaranteed that no other processor has modified the counter at the same time.
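
The same lost-update effect can be reproduced outside Ada. The sketch below is a Java analogue of the experiment, not the Ada Util code: java.util.concurrent.atomic.AtomicInteger plays the role of the protected counter and a plain static int plays the role of Unsafe. The class name CounterRace and its run method are hypothetical names.

```java
import java.util.concurrent.atomic.AtomicInteger;

final class CounterRace {
    static int unsafe = 0;                                 // plain shared counter
    static final AtomicInteger safe = new AtomicInteger(); // atomic counter

    /** Runs the workers and returns {atomic value, unprotected value}. */
    static int[] run(int tasks, int perTask) {
        unsafe = 0;
        safe.set(0);
        Thread[] workers = new Thread[tasks];
        for (int t = 0; t < tasks; t++) {
            workers[t] = new Thread(() -> {
                for (int i = 0; i < perTask; i++) {
                    safe.incrementAndGet(); // atomic read-modify-write
                    unsafe = unsafe + 1;    // separate read and write: updates can be lost
                }
            });
            workers[t].start();
        }
        for (Thread w : workers) {
            try {
                w.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return new int[] {safe.get(), unsafe};
    }

    public static void main(String[] args) {
        int tasks = 4;
        int perTask = 1_000_000;
        int[] result = run(tasks, perTask);
        System.out.println("Expected value at the end      : " + tasks * perTask);
        System.out.println("Counter value at the end       : " + result[0]);
        System.out.println("Unprotected counter at the end : " + result[1]);
    }
}
```

The atomic counter always reaches the expected total. The unprotected value varies from run to run and, on a multi-core machine, is usually lower.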

Source: Util.Concurrent.Counters.ads, Util.Concurrent.Counters.adb

Protection with an Ada protected type

A safe and portable counter implementation can be made by using Ada protected types. The protected type makes it possible to define a protected procedure Increment which provides an exclusive read-write access to the data (RM 9.5.1). The protected function Get offers a concurrent read-only access to the data.

package Util.Concurrent.Counters is
   type Counter is limited private;
   procedure Increment (C : in out Counter);
   function Value (C : in Counter) return Integer;
private
   protected type Cnt is
      procedure Increment;
      function Get return Integer;
   private
      N : Integer := 0;
   end Cnt;
   type Counter is limited record
      Value : Cnt;
   end record;
end Util.Concurrent.Counters;

Source: Util.Concurrent.Counters.ads, Util.Concurrent.Counters.adb


Installing an SSD device on Ubuntu

By Stephane Carrez, 2011-02-20 17:29:17

This article explains the steps for the installation of an SSD device on an existing Ubuntu desktop PC.

Disk Performances

First of all, let's have a look at the disk read performance with the hdparm utility. The desktop PC has three disks, /dev/sda being the new SSD device (an OCZ Vertex 2 SATA II 3.5" SSD).

$ sudo -i hdparm -t /dev/sda /dev/sdb /dev/sdc

The three disks have the following performance:

sda: OCZ-VERTEX2 3.5        229.47 MB/sec
sdb: WDC WD3000GLFS-01F8U0  122.29 MB/sec
sdc: ST3200822A             59.23 MB/sec

The SSD device appears to be almost 2 times faster than the 10000 rpm disk (229.47 MB/sec against 122.29 MB/sec).

Plan for the move

The first step is to plan for the move and define what files should be located on the SSD device.

Identify files used frequently

To benefit from the high read performance, files used frequently could be moved to the SSD device. To identify them, you can use the find command and the -amin option. This option will not work if the file system is mounted with noatime. The -amin option indicates a number of minutes. To find the files that were accessed during the last 24 hours, you may use the following command:

$ sudo find /home -amin -1440

In most cases, files accessed frequently are the system files (in /bin, /etc, /lib, ..., /usr/bin, /usr/lib, /usr/share, ...) and users' files located in /home.

Identify Files that change frequently

Some people argue that files modified frequently should not be located on an SSD device (write endurance and write performance).

On a Linux system, the system files that are changed on a regular basis are in general grouped together in the /var directory. Some configuration files are modified by system daemons while they are running. The list of system directories that change can be limited to:

/etc    (cups/printers.conf.0, mtab,  lvm/cache, resolv.conf, ...)
/var    (log/*, cache/*, tmp/*, lib/*...)
/boot   (grub/grubenv modified after booting)

Temporary Files

On Linux temporary files are stored in one of the following directories. Several KDE applications are saving temporary files in the .kde/tmp-host directory for each user. These temporary files could be moved to a ram file system.

/tmp
/var/tmp
/home/$user/.kde/tmp-$host

Move plan

The final plan was to create one partition for the root file system and three LVM partitions for /usr, /var and /home directories.

Partition the drive

The drive must be partitioned with fdisk. I created one bootable partition and a second partition with what remains.

$ sudo fdisk -l /dev/sda
Disk /dev/sda: 120.0 GB, 120034123776 bytes
255 heads, 63 sectors/track, 14593 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x00070355

   Device Boot      Start         End      Blocks   Id  System
/dev/sda1   *           1        1295    10402056   83  Linux
/dev/sda2            1296       14593   106816185   83  Linux

To ease future management of partitions, it is useful to use LVM and create a volume group.

$ sudo vgcreate vg01 /dev/sda2
Volume group "vg01" successfully created

The partitions are then created by using lvcreate. More space can be allocated on them by using the lvextend utility.

$ sudo lvcreate -L 10G -n sys vg01
Logical volume "sys" created
$ sudo lvcreate -L 10G -n var vg01
Logical volume "var" created
$ sudo lvcreate -L 4G -n swap vg01
Logical volume "swap" created
$ sudo lvcreate -L 60G -n home vg01
Logical volume "home" created

The LVM partitions are available through the device mapper and they can be accessed by their name:

$ ls -l /dev/vg01/
total 0
lrwxrwxrwx 1 root root 19 2011-02-20 14:03 home -> ../mapper/vg01-home
lrwxrwxrwx 1 root root 19 2011-02-20 14:03 swap -> ../mapper/vg01-swap
lrwxrwxrwx 1 root root 18 2011-02-20 14:03 sys -> ../mapper/vg01-sys
lrwxrwxrwx 1 root root 18 2011-02-20 14:03 var -> ../mapper/vg01-var

Format the partition

Format the file system with ext4 as it integrates various improvements which are useful for the SSD storage (Extents, Delayed allocation). Other file systems will work very well too.

$ sudo mkfs -t ext4 /dev/vg01/sys

Move the files

To move files from one system to another place, it is safer to use the tar command instead of a simple cp. Indeed, the tar command is able to copy special files without problems while not all cp commands support the copy of special files.

$ sudo mount /dev/vg01/sys /target
$ sudo -i
# cd /usr
# tar --one-file-system -cf - . | (cd /target; tar xf -)

If the file system to move is located on another LVM partition, it is easier and safer to use the pvmove utility to move physical extents from one physical volume to another one.

Change the mount point

Edit the /etc/fstab file and change the old mount point to the new one. The noatime mount option tells the kernel to avoid updating the file access time when it is read.

/dev/vg01/sys  /usr  ext4 noatime  0 2
/dev/vg01/home /home ext4 noatime  0 2
/dev/vg01/var  /var  ext4 noatime  0 2

Tune the IO Scheduler

For the SSD drive, it is best to disable the Linux IO scheduler. For this, we will activate the noop IO scheduler. Other disks will use the default IO scheduler or another one. Add the following lines in /etc/rc.local file:

test -f /sys/block/sda/queue/scheduler &&
  echo noop > /sys/block/sda/queue/scheduler

References

LVM

ext4

http://www.ocztechnologyforum.com/forum/showthread.php?54379-Linux-Tips-tweaks-and-alignment

http://www.storagesearch.com/ssdmyths-endurance.html


Boost your php web site by installing eAccelerator

By Stephane Carrez, 2010-10-23 06:29:00 (1 comment)

This article explains how to boost the performance of a PHP site by installing a PHP accelerator software.

Why is PHP slow

PHP is an interpreted language: the server must parse the PHP files for each request it receives. With a compiled language such as Java or Ada, this long and error prone process is done beforehand. Even if the PHP interpreter is optimized, this parsing step can be long. The situation is worse when you use a framework (Symfony, CakePHP,...) that requires many PHP files to be scanned.

eAccelerator to the rescue

eAccelerator is a module that reduces this performance issue by introducing a shared cache for the pre-compiled PHP files. The module compiles the PHP files into an internal compiled form and makes it available to the apache2 processes through a shared memory segment.

Installing eAccelerator

First get eAccelerator sources at http://eaccelerator.net/

Then extract the tar.bz2 file on your server:

$ tar xvjf eaccelerator-0.9.6.1.tar.bz2
eaccelerator-0.9.6.1/
eaccelerator-0.9.6.1/COPYING
...

Build eAccelerator module

Before building the module you must first run the phpize command to prepare the module before compilation:

$ cd eaccelerator-0.9.6.1/
$ phpize

Then, launch the configure script:

$ ./configure --enable-eaccelerator=shared \
    --with-php-config=/usr/bin/php-config

Finally build the module:

$ make

Install eAccelerator

Installation is done with the following steps:

$ sudo make install

Don't forget to copy the configuration file (have a look at its content but in most cases it works as is):

$ sudo cp eaccelerator.ini  /etc/php5/conf.d/

Restart Apache server

To make the module available, you have to restart the Apache server:

$ sudo /etc/init.d/apache2 restart

Performance improvements

What performance gain can you expect? That will depend on the PHP software and the page, but it is easy to get an idea.

To measure the performance improvement, you can use the Apache benchmarking tool. Do a performance measurement on the web site before the installation and another one after. Be sure to benchmark the same page.

The following command will benchmark the http://mysite.mydomain.com/index.php page 100 times with only one connection.

$ ab -n 100 http://mysite.mydomain.com/index.php

Below is an extract of the percentage of the requests served within a certain time (ms) for one of my web pages served by Dotclear:

         Without        with
        eAccelerator  eAccelerator
 50%       383           236
 66%       384           237
 75%       387           238
 80%       388           239
 90%       393           258
 95%       425           265
 98%       536           295
 99%       796           307
100%       796           307 (longest request)

Depending on the percentile, the response time is reduced by roughly 34% to 61%, so it is quite interesting. The other benefit is that the variance is also smaller, meaning that requests are served in a more uniform time.
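
The gain at each percentile can be recomputed from the table above. The following is a small sketch of that arithmetic only; the class name AbGain is hypothetical and the timings are copied by hand from the table.

```java
// Recompute the response time reduction for each percentile of the ab report.
final class AbGain {
    /** Reduction of the response time, as a percentage of the time without the cache. */
    static double gainPercent(int withoutMs, int withMs) {
        return 100.0 * (withoutMs - withMs) / withoutMs;
    }

    public static void main(String[] args) {
        int[] percentile = {50, 66, 75, 80, 90, 95, 98, 99, 100};
        int[] without = {383, 384, 387, 388, 393, 425, 536, 796, 796};
        int[] with = {236, 237, 238, 239, 258, 265, 295, 307, 307};
        for (int i = 0; i < percentile.length; i++) {
            System.out.printf("%3d%%  %4d ms -> %4d ms  gain %.1f%%%n",
                    percentile[i], without[i], with[i],
                    gainPercent(without[i], with[i]));
        }
    }
}
```

The largest gains are on the slowest requests (98% and above), which is why the spread between the median and the longest request shrinks.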


Solving Linux system lock up when intensive disk I/O are performed

By Stephane Carrez, 2010-08-28 08:02:43

When a system lock up occurs, we often blame applications, but when you look carefully you may see that despite your multi-core CPU, your applications are sleeping! No CPU activity! So what is happening? Check the I/Os, they could be the root cause!

With Ubuntu 10.04, my desktop computer was freezing when the ReadyNAS Bacula backup was running. Indeed, the Bacula daemon was performing intensive disk operations (on a fast SATA hard disk). The situation was such that it was impossible to use the system: the interface was freezing for several seconds, then working for a few seconds and freezing again.

Linux I/O Scheduler

The I/O scheduler is responsible for organizing the order in which disk operations are performed. Some algorithms minimize the disk head moves, other algorithms tend to anticipate read operations.

When I/O operations are not scheduled correctly, an interactive application such as a desktop or a browser can be blocked until its I/O operations are scheduled and executed (the situation can be even worse for those applications that use the O_SYNC writing mode).

By default, the Linux kernel is configured to use the Completely Fair Queuing scheduler. This I/O scheduler does not provide any time guarantee but in general it gives good performance. Linux provides other I/O schedulers such as the Noop scheduler, the Anticipatory scheduler and the Deadline scheduler.

The deadline scheduler puts an execution time limit to requests to make sure the I/O operation is executed before an expiration time. Typically, a read operation will wait at most 500 ms. This is the I/O scheduler we need to avoid the system lock up.

Checking the I/O Scheduler

To check which I/O scheduler you are using, you can use the following command:

$ cat /sys/block/sda/queue/scheduler
noop anticipatory deadline [cfq]

where sda is the device name of your hard disk (or try hda).

The result indicates the list of supported I/O scheduler as well as the current scheduler used (here the Completely Fair Queuing).
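
The scheduler file has a simple format: every supported scheduler is listed and the active one is enclosed in square brackets. As an illustration, the sketch below reads that file and extracts the active name. The class name IoScheduler and its active method are hypothetical, and the main method assumes a Linux system that exposes /sys/block/<device>/queue/scheduler.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class IoScheduler {
    // Matches the name enclosed in square brackets.
    private static final Pattern ACTIVE = Pattern.compile("\\[([^\\]]+)\\]");

    /** Returns the bracketed scheduler name, or null when no name is bracketed. */
    static String active(String line) {
        Matcher m = ACTIVE.matcher(line);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) throws IOException {
        String device = args.length > 0 ? args[0] : "sda";
        Path path = Paths.get("/sys/block", device, "queue", "scheduler");
        String line = new String(Files.readAllBytes(path), StandardCharsets.UTF_8).trim();
        System.out.println(device + ": " + active(line));
    }
}
```

Applied to the line shown above, the active method returns cfq.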

Changing the I/O Scheduler

To change the scheduler, you can echo the desired scheduler name to activate it (you must be root):

# echo deadline >  /sys/block/sda/queue/scheduler

To make sure the I/O scheduler is configured after each system startup, you can add the following lines to your /etc/rc.local startup script:

test -f /sys/block/sda/queue/scheduler &&
  echo deadline > /sys/block/sda/queue/scheduler

test -f /sys/block/sdb/queue/scheduler &&
   echo deadline > /sys/block/sdb/queue/scheduler

test -f /sys/block/hda/queue/scheduler &&
   echo deadline > /sys/block/hda/queue/scheduler

You may have to change the sda and sdb into hda and hdb if you have an IDE hard disk.

Conclusion

After changing the I/O scheduler to the Deadline scheduler, the desktop no longer freezes when backups are running.


Experience feedback in running a SaaS application

By Stephane Carrez, 2010-07-14 16:02:10

When you go in production for a new service you may not know whether your application will have the necessary performance to serve your customer. Can the application support the growth? Should you deploy early? What do you do if you reach performance pr


How google analytics can alter your web performance

By Stephane Carrez, 2009-04-28 19:55:27

Google analytics is often used by Marketing teams to have a feedback of the web site usage, track visits, entry and leave points. Google analytics is easy to use but it has some drawbacks that you don't see at the beginning. Altering the performance of yo

