Restoring a complete system after a hard disk failure: Bacula to the rescue!!!

By Stephane Carrez

The other day the main disk of my computer stopped working. My Western Digital 150 GB Raptor hard disk was no longer recognized by the system: it was simply dead after one year of work. The 10000 rpm di

Step 1: Boot from your Ubuntu 8.04 CD

Since the disk that crashed contained the system, my computer was not even able to boot. The first step for me was to boot from the Ubuntu CD-ROM without installing Ubuntu again. After booting, I was able to check my other disks and look at the kernel logs, which confirmed that the disk was completely dead, without any hope of recovering anything. By looking at my second hard disk, I was able to evaluate what was lost and needed to be recovered. If you have no other disk, you have to set up a new disk to proceed. Booting from the CD also helped me find some room on my second disk where I could install a new system.

Step 2: Install the system

If the system is gone, you may have to reinstall it from scratch. This is what I had to do. Having found an old Debian partition on my second hard disk, I decided to install Ubuntu 8.04 Desktop on it. After 15 minutes, my computer was working again, running Ubuntu 8.04 as before. Still, my data was lost.

Step 3: Restore with Bacula

Bacula is a great network backup solution that I put in place 2 years ago. Every night my Bacula server creates an incremental, differential or full backup of my computer (zebulon). This was the first time, though, that I had to recover its full content. For the recovery, you start the Bacula Console and use its restore command.

ciceron $ bconsole

Every action performed in Bacula creates a job that is recorded in the database. The first step is to identify the jobs that made the full, differential and incremental backups.

* __list jobs__
 | JobId | Name      | StartTime           | Type | Level | JobFiles  | JobBytes       | JobStatus |
 |   877 | Zebulon   | 2007-12-02 02:22:27 | B    | F     | 1,245,258 | 31,026,036,274 | T         |
 | 1,067 | Zebulon   | 2008-02-03 00:52:18 | B    | F     |         0 |              0 | f         |
 | 1,319 | Zebulon   | 2008-04-26 22:28:29 | B    | D     |   207,801 |  6,048,511,830 | T         |
 | 1,328 | Zebulon   | 2008-04-29 22:17:04 | B    | I     |         0 |              0 | E         |
 | 1,331 | Zebulon   | 2008-04-30 22:17:04 | B    | I     |     1,025 |    761,323,545 | T         |
 | 1,511 | Zebulon   | 2008-06-29 22:47:57 | B    | I     |    77,997 |  9,050,108,256 | T         |
 | 1,514 | Zebulon   | 2008-06-30 22:16:40 | B    | I     |       968 |    613,957,318 | T         |
 | 1,517 | Zebulon   | 2008-07-01 22:16:38 | B    | I     |    16,710 |    866,232,575 | T         |
 | 1,520 | Zebulon   | 2008-07-02 22:17:00 | B    | I     |    11,530 |    887,021,057 | T         |

The result above is just an extract of the list command output. Job 877 is a full backup (level F) and I had no more recent full backup than this one, so it must be restored first. Since Bacula had pruned the file records, it had lost all the information about the job's contents (my backup configuration could have been better). Still, it is possible to restore this full backup completely. Jobs 1067 and 1328 cannot be used because they ended in error (I had many of those, because the computer was off when the daily backup started, or for some other reason). This is not a problem: Bacula simply ignores those jobs for the restore. To restore the full backup, use the restore command:

  * __restore__
 
  First you select one or more JobIds that contain files
  to be restored. You will be presented several methods
  of specifying the JobIds. Then you will be allowed to
  select which files from those JobIds are to be restored.

After this, the Bacula restore command prompts for a restore method. You can restore files selectively, find files, or restore a complete job or a complete client. I had to restore the full backup (job 877), so I chose the "Enter list of comma separated JobIds to select" method and entered the id of my full backup job:

 To select the JobIds, you have the following choices:
   1: List last 20 Jobs run
   2: List Jobs where a given File is saved
   3: Enter list of comma separated JobIds to select
   4: Enter SQL list command
   5: Select the most recent backup for a client
   6: Select backup for a client before a specified time
   7: Enter a list of files to restore
   8: Enter a list of files to restore before a specified time
   9: Find the JobIds of the most recent backup for a client
  10: Find the JobIds for a backup for a client before a specified time
  11: Enter a list of directories to restore for found JobIds
  12: Cancel
 Select item:  (1-12): __3__
   Enter JobId(s), comma separated, to restore: __877__
   You have selected the following JobId: 877
   
   Building directory tree for JobId 877 ...
   There were no files inserted into the tree, so file selection
   is not possible.Most likely your retention policy pruned the files
   
   Do you want to restore all the files? (yes|no):     __yes__

After this step, Bacula works out which volumes (backup files, DVDs, tapes) contain the backup:

   Bootstrap records written to /var/lib/bacula/janus-dir.restore.12.bsr
   
   The job will require the following
     Volume(s)            Storage(s)                SD Device(s)
   ===================================================
   
     Full-0013            File                      FileStorage
     Full-0014            File                      FileStorage
     Full-0015            File                      FileStorage
     Full-0016            File                      FileStorage
     Full-0017            File                      FileStorage
     Full-0035            File                      FileStorage
     Full-0036            File                      FileStorage
     Full-0037            File                      FileStorage
   
   1,245,258 files selected to be restored.

Next, I had to choose the client for the restore. For some reason, I had to choose my crashed computer (zebulon):

   Defined Clients:
       1: janus-fd
       2: zebulon-fd
   Select the Client (1-2):    __2__

Bacula then describes the restore job and gives you a chance to change some parameters. In general, the restore is performed by the Bacula daemon on the computer that you want to restore (i.e., the client). This is natural: your computer X crashed and you want to recover the files on it. In my case, I wanted to recover onto the Bacula server (called janus), so I changed the restore client.

    Run Restore job
    JobName:         RestoreFiles
    Bootstrap:       /var/lib/bacula/janus-dir.restore.13.bsr
    Where:           /tmp/bacula-restores
    Replace:         always
    FileSet:         Janus Files
    Backup Client:   zebulon-fd
    Restore Client:  zebulon-fd
    Storage:         File
    When:            2008-07-05 14:16:28
    Catalog:         MyCatalog
    Priority:        10
    OK to run? (yes/mod/no): __mod__
    Parameters to modify:
     1: Level
     2: Storage
     3: Job
     4: FileSet
     5: Restore Client
     6: When
     7: Priority
     8: Bootstrap
     9: Where
    10: File Relocation
    11: Replace
    12: JobId
    Select parameter to modify (1-12): __5__
    The defined Client resources are:
     1: janus-fd
     2: zebulon-fd
    Select Client (File daemon) resource (1-2): __1__
    Run Restore job
    JobName:         RestoreFiles 
    Bootstrap:       /var/lib/bacula/janus-dir.restore.13.bsr
    Where:           /tmp/bacula-restores
    Replace:         always
    FileSet:         Janus Files
    Backup Client:   zebulon-fd
    Restore Client:  janus-fd
    Storage:         File
    When:            2008-07-05 14:16:28
    Catalog:         MyCatalog
    Priority:        10
    OK to run? (yes/mod/no):  __yes__

The restore process runs in the background, and a message and an email are sent after the restore job has finished. In my case, the files were restored on my Bacula server in the /tmp/bacula-restores directory. When the restore process finished, that directory contained all my files... as they were back in December 2007. The differential backup was restored in the same way because its files had been pruned too. The other jobs were restored as follows, using the same restore command:

    * __restore__
   
    First you select one or more JobIds that contain files
    to be restored. You will be presented several methods
    of specifying the JobIds. Then you will be allowed to
    select which files from those JobIds are to be restored.
   
    To select the JobIds, you have the following choices:
     1: List last 20 Jobs run
     2: List Jobs where a given File is saved
     3: Enter list of comma separated JobIds to select
     4: Enter SQL list command
     5: Select the most recent backup for a client
     6: Select backup for a client before a specified time
     7: Enter a list of files to restore
     8: Enter a list of files to restore before a specified time
     9: Find the JobIds of the most recent backup for a client
    10: Find the JobIds for a backup for a client before a specified time
    11: Enter a list of directories to restore for found JobIds
    12: Cancel
    Select item:  (1-12): __3__
    Enter JobId(s), comma separated, to restore: __1331,1511,1514,1517,1520__
    You have selected the following JobIds: 1331,1511,1514,1517,1520
   
    Building directory tree for JobId 1331 ...
    Building directory tree for JobId 1511 ...  +++++++++++++++++++++++++++++++++
    Building directory tree for JobId 1517 ...  +++++++++++++++++++++++++ 
    Building directory tree for JobId 1520 ...  +++++++++++++++++++++++++++++
    5 Jobs, 75,552 files inserted into the tree.
   
    You are now entering file selection mode where you add (mark) and
    remove (unmark) files to be restored. No files are initially added, unless
    you used the "all" keyword on the command line.
    Enter "done" to leave this mode.
   
    cwd is: /
    $ __mark *__
    79,536 files marked.
    $ __done__
    Bootstrap records written to /var/lib/bacula/janus-dir.restore.14.bsr
   
    The job will require the following
      Volume(s)            Storage(s)                SD Device(s)
    ======================================================
    
      Incr-0002            File                      FileStorage
      Incr-0005            File                      FileStorage
      Incr-0001            File                      FileStorage
      Incr-0006            File                      FileStorage
    
    79,536 files selected to be restored.

After the restore jobs finished, all my files were restored to their state of July 2nd, 2008.
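
The order of the restores in this step (full backup 877, then differential 1319, then the incrementals that follow it, skipping the failed jobs) can be sketched in a few lines of Python. This is only an illustration of the reasoning, not something Bacula provides: the names `Job`, `parse_list_jobs` and `restore_chain` are made up for this example, the parser assumes the table layout shown in the `list jobs` extract above, and it assumes that a larger JobId means a later job.

```python
from dataclasses import dataclass

# Extract of the `list jobs` output shown earlier in this article.
LISTING = """
 | JobId | Name      | StartTime           | Type | Level | JobFiles  | JobBytes       | JobStatus |
 |   877 | Zebulon   | 2007-12-02 02:22:27 | B    | F     | 1,245,258 | 31,026,036,274 | T         |
 | 1,067 | Zebulon   | 2008-02-03 00:52:18 | B    | F     |         0 |              0 | f         |
 | 1,319 | Zebulon   | 2008-04-26 22:28:29 | B    | D     |   207,801 |  6,048,511,830 | T         |
 | 1,328 | Zebulon   | 2008-04-29 22:17:04 | B    | I     |         0 |              0 | E         |
 | 1,331 | Zebulon   | 2008-04-30 22:17:04 | B    | I     |     1,025 |    761,323,545 | T         |
 | 1,511 | Zebulon   | 2008-06-29 22:47:57 | B    | I     |    77,997 |  9,050,108,256 | T         |
 | 1,514 | Zebulon   | 2008-06-30 22:16:40 | B    | I     |       968 |    613,957,318 | T         |
 | 1,517 | Zebulon   | 2008-07-01 22:16:38 | B    | I     |    16,710 |    866,232,575 | T         |
 | 1,520 | Zebulon   | 2008-07-02 22:17:00 | B    | I     |    11,530 |    887,021,057 | T         |
"""


@dataclass
class Job:
    job_id: int
    level: str   # "F" full, "D" differential, "I" incremental
    status: str  # "T" marks a job that terminated normally


def parse_list_jobs(text):
    """Parse the table printed by `list jobs` into Job records."""
    jobs = []
    for line in text.splitlines():
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        # Skip the header row and anything that is not a job row.
        if len(cells) != 8 or not cells[0].replace(",", "").isdigit():
            continue
        jobs.append(Job(int(cells[0].replace(",", "")), cells[4], cells[7]))
    return jobs


def restore_chain(jobs):
    """Return (full id, differential id or None, incremental ids) in restore order."""
    # Only successful jobs are usable; assume JobIds grow with time.
    ok = sorted((j for j in jobs if j.status == "T"), key=lambda j: j.job_id)
    fulls = [j for j in ok if j.level == "F"]
    if not fulls:
        raise ValueError("no successful full backup found")
    full = fulls[-1]
    diffs = [j for j in ok if j.level == "D" and j.job_id > full.job_id]
    diff = diffs[-1] if diffs else None
    base = diff.job_id if diff else full.job_id
    incrementals = [j.job_id for j in ok if j.level == "I" and j.job_id > base]
    return full.job_id, (diff.job_id if diff else None), incrementals


print(restore_chain(parse_list_jobs(LISTING)))
# prints (877, 1319, [1331, 1511, 1514, 1517, 1520])
```

On the extract above, this gives the same three restores that I ran by hand: job 877 first, then job 1319, then jobs 1331, 1511, 1514, 1517 and 1520 together.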

Lessons learned and conclusion

  1. Backups are vital in the computer world. You don't want to lose your photos, emails and documents. When you lose one of them, you just cry. When you lose everything, you... die.
  2. My Bacula configuration is not perfect. In particular, it should do a full backup every 3 or 6 months. In the past I had only recovered a few files and had never tested a full recovery. This was a mistake (fortunately without bad consequences). Every change in the Bacula configuration must be followed by a full recovery test.
  3. The system partitions (/ and /usr) were not backed up. Even if they can be restored with a fresh installation, this may not be a good idea: you lose the configuration files and the list of all the packages you had installed. Losing this is not a big deal, but recreating it takes time.
  4. It is necessary to test on a regular basis that we can recover from the backup. The problem is absolutely not the software itself. The problem is the backup configuration and the backup needs, which change over time.
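
One possible way to run the recovery test mentioned in lessons 2 and 4 is to restore into a scratch directory and compare the result with the live files. The sketch below is not part of Bacula; `file_digest` and `compare_trees` are hypothetical helpers written for this example. It assumes both trees are readable from the same machine and only compares the content of regular files (not ownership, permissions or timestamps).

```python
import hashlib
import os


def file_digest(path):
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def compare_trees(original, restored):
    """Compare files under `original` with the same relative paths under `restored`.

    Returns (missing, different): relative paths absent from the restore,
    and relative paths whose content differs.
    """
    missing, different = [], []
    for root, _dirs, files in os.walk(original):
        for name in files:
            src = os.path.join(root, name)
            if not os.path.isfile(src):
                continue  # skip broken symlinks and special files
            rel = os.path.relpath(src, original)
            dst = os.path.join(restored, rel)
            if not os.path.isfile(dst):
                missing.append(rel)
            elif file_digest(src) != file_digest(dst):
                different.append(rel)
    return sorted(missing), sorted(different)
```

For example, after a test restore into /tmp/bacula-restores, something like `compare_trees("/home", "/tmp/bacula-restores/home")` (both paths are only an illustration) would list the files that did not come back. Files modified since the last backup will naturally show up as different, so the check is most meaningful right after a backup job.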

I am very thankful to the Bacula development team for their software. It is really a professional backup solution. I already knew that, but now I can say I have tested it in a real situation. The hard disk failure only cost me time: time to install, time to recover the backup and time to write this story...