Preserving Your Data
Presentation to MWVLUG 12/2/03
Backup philosophy / How does data loss occur
Here are the ways in which computers typically lose data, listed with the most common causes first:
- User Error (accidental file deletion or modification)
- Software Bugs (a program bug corrupts its own data files or, worse, other files on the system)
- Hardware Failure. NOTE: though this is among the least common causes, it is generally the most catastrophic, since it often causes the loss of all data.
- Hardware Data Corruption -- memory / processor bit errors (this may be more common than hardware failure, particularly with memory systems that lack ECC; without ECC this is very difficult to prevent or detect on standard PCs)
Possible Solutions:
- Mirror drives and some other forms of RAID -- can be local or network
Advantages
- Up-to-the-moment backup
- Minimal/zero down time (for the types of loss this protects against)
Disadvantages
- Only protects against one of the least common causes of data loss
- Hardware failures where drive destruction is secondary (such as power supply failures) will usually wipe out the mirror drive as well, unless it is elsewhere on the network, and may destroy it even then (for example, through spikes on a shared power line or through the network cabling).
- Archival backups on a periodic basis to another mass storage device -- can be local or network
Advantages
- Better handles the most common types of data loss
Disadvantages
- Longer time to recover from loss
- Virtually no chance it will be completely up to date at the time of data loss
Tapes
- Very slow or very expensive
- Sequential access only
- Difficult/time consuming to recover individual files
- Incremental backups make separate updated copies of changed data
Hard Disks
- Faster (possibly much faster) than tape for full backup
- Random access
- Very quick/easy to recover individual files
- Incremental backups update the master copy of changed data (faster recovery)
- Possible choices:
- FireWire - 50 MBps (100 MBps for FireWire 800)
- USB 2.0 - 60 MBps nominal; measured with hdparm on a 2.4 kernel: 10 MBps (the same drive gives 27 MBps on an ATA-100 interface). A 2.6 kernel may increase this significantly, though the design of the USB enclosure hardware also affects throughput.
- SATA - ??
- Actual data rates probably much lower
CD/DVD
- Some advantages of both hard disks and tapes: random access, inexpensive media.
- Not suited to large backups without expensive disc-changing hardware or a lot of user intervention
Choosing an optimal backup solution
- Archival backups - protects you against the most likely source of data loss
- Mirror/RAID - useful supplemental solution, but inadequate on its own
- Both - best approach
Approaches / Issues
- Rotate media - maintain multiple backup copies. This is important because if a failure occurs during your backup and you maintain only one backup copy, it's possible you have just destroyed your only backup copy at the time the data is lost. It is also important because it may take some time before you realize that the data has been corrupted or lost, and if you overwrite the last good backup of the data with the corrupt version, there is nothing to recover.
- Rotate media and periodically permanently archive the current backup. This is similar to the preceding approach, except that on a periodic basis one backup is removed from the rotation and permanently archived. This is particularly useful for recovery when it takes weeks or months before corruption or data loss is discovered.
- Full backup w/incrementals
Dangers:
1. If using time tags as the basis of incrementals, full and incremental backups must be tagged with the date stamp of the START time of the backup, and each following incremental must include everything changed since the start of the previous backup (a VERY common scripting error, even in commercial products).
2. New files added to the system after the last backup often carry old date stamps (newly installed software often preserves the creation dates of the files in its package). This means these files WILL NOT be backed up by a date/time based incremental!
Because of item 2, incremental backup systems may not correctly restore a system to its state as of the last backup.
- Full backup with regular updates
- Any backup of a live system may result in corruption of data being modified during backup unless specifically addressed (databases, open documents being worked on, etc.).
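The time-stamp danger described under "Full backup w/incrementals" can be sketched in shell. This is only an illustration under stated assumptions: the function name incremental_list and the stamp-file layout are hypothetical, and the stamp file is assumed to live outside the directory being backed up. The point it shows is that the new stamp is created before the scan begins and replaces the old one only after the scan succeeds, so anything modified while the backup runs is picked up by the next incremental.

```shell
#!/bin/sh
# Hypothetical helper (not part of any real tool): print the files under SRC
# that changed since the START of the previous backup.
# STAMP is a marker file whose modification time records that start time;
# it is assumed to live outside SRC.
incremental_list() {
    src=$1
    stamp=$2

    # Record the start time BEFORE scanning, so files modified while this
    # backup is running are included in the next incremental.
    touch "$stamp.new" || return 1

    if [ -f "$stamp" ]; then
        # Incremental: only files modified after the previous start time.
        find "$src" -type f -newer "$stamp" -print || return 1
    else
        # No previous stamp: treat this as a full backup and list everything.
        find "$src" -type f -print || return 1
    fi

    # Promote the new stamp only after the listing succeeded.
    mv "$stamp.new" "$stamp"
}

# Example call (paths are illustrative only):
#   incremental_list /home /var/backups/home.stamp > /tmp/changed.list
```

Note that this sketch still suffers from the other danger: a file installed with an old date stamp is never listed, which is one reason a comparison-based tool may be preferable.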
Possible Backup Requirements
- Network computers (local net)
- Network computers (internet possibly through low speed link)
- Local computer
- Support backup through secure connections
- Needs to support ALL file types that may need to be recovered including: links (soft/hard), pipes, devices, etc.
- Needs to support ALL file permission/ownership information
- Ability to freeze the file system during the backup -- stop programs, or use Logical Volume Manager (LVM) snapshots
rsync
Rsync combined with hard disks addresses the requirements and problems above:
- Eliminates the problems caused by incremental backups
- Can be used to maintain a mirror disk that could be dropped directly into a machine that has had a drive failure, minimizing down time
- Supports all major modes of backup: local network computers, remote computers on low speed links, and the local computer.
- Supports backup through secure connections
- Supports all Linux/UNIX file types, ownership and permissions (that I am aware of)
Rsync Notes:
- How it works -- comparison based on (your choice of): size, date stamp, specialized checksum algorithm
- Some Features of Interest
- Transfer of only the file differences
- Can compress the data transferred for low speed links
- Supports sparse files (sloooowwwww)
- Supports use of alternate "rsh" compatible shell such as "ssh"
- Supports all the standard Linux/UNIX file types
- An rsync server can be set up to provide (potentially) efficient mirroring of an archive by many machines
- Problems/warnings
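- As a source, directory "/joe/bob" does not mean the same thing as "/joe/bob/", which can be confusing: without the trailing slash the directory itself is copied, with it only the directory's contents are copied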
- Different versions are often incompatible and fail without any meaningful/helpful error
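The trailing-slash warning above is easiest to see by trying it in a scratch directory. The following is a minimal sketch; the directory names under the temporary work area are made up for illustration, and it assumes rsync is installed on the local machine.

```shell
#!/bin/sh
# Scratch directories, so nothing real is touched (names are illustrative).
work=$(mktemp -d)
mkdir -p "$work/joe/bob" "$work/dest1" "$work/dest2"
echo hello > "$work/joe/bob/file.txt"

# No trailing slash on the source: the directory "bob" itself is copied,
# so the file ends up at dest1/bob/file.txt.
rsync -a "$work/joe/bob" "$work/dest1/"

# Trailing slash on the source: only the CONTENTS of "bob" are copied,
# so the file ends up at dest2/file.txt.
rsync -a "$work/joe/bob/" "$work/dest2/"

# Show the two resulting layouts.
find "$work/dest1" "$work/dest2" -type f
```

Getting this wrong in combination with "--delete" is one way to wipe out a destination, which is why a dry run first is a good habit.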
Some options of interest (there are many others):
- "-c, --checksum" - always checksum. This is very slow, but useful if you are concerned that there may be corruption in a backup or in the original copy of a file.
- "-a, --archive" - archive mode. This turns on most of the options you are likely to want in a full backup; the main one it does not turn on is "-H" (preserve hard links, which slows the backup).
- "-b, --backup" - make backups (default ~ suffix). Files at the destination that would be changed are kept as backup copies, giving you a primitive form of file versioning as an integral part of your backups. May want to use with: "--backup-dir" (directory to store backup files in) and "--suffix=SUFFIX" (custom suffix for each backup copy).
- "-u, --update" - update only (don't overwrite newer files), useful for updating between two computers where files may have been modified on either one (like the "briefcase" provided by a certain nameless evil corporation :-)
- "-H, --hard-links" - preserve hard links. You need this if you wish to preserve hard links in your backups.
- "-S, --sparse" - handle sparse files efficiently. rsync appears to create sparse files under some circumstances without this option, but not always. It is very slow, but useful if you have large files that are mostly empty (such as disk images for virtual machines). It is probably a good idea to use it on only a limited subset of your total backup.
- "-n, --dry-run" - show what would have been transferred. This is the MOST IMPORTANT option to use, particularly when first learning rsync, since you can easily wipe out all the files on your source or destination if you make a mistake.
- "-W, --whole-file" - copy whole files, no incremental checks, useful for transfers on a local network as it can be much faster if you have the bandwidth
- "-e, --rsh=COMMAND" - specify rsh replacement, allows you to specify an alternate shell to use for the link (such as ssh). NOTE: don't use ssh unless you really need it, since encryption / decryption can use a tremendous amount of CPU and significantly slow down a transfer (particularly with older hardware). May want to use with: "--rsync-path=PATH" - specify path to rsync on the remote machine.
- "--delete" - delete files on the receiving side that don't exist on the sending side. Necessary for creating an exact backup, but can be very dangerous; use with "--dry-run" until you are sure of what you are doing.
- "--delete-excluded" - also delete excluded files on the receiving side. Same dangers as "--delete".
- "-z, --compress" - compress file data, useful for backups across the internet or other lower speed links.
- "--exclude=PATTERN" - exclude files matching PATTERN.
- "--exclude-from=FILE" - exclude patterns listed in FILE.
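To make the "--dry-run" advice concrete, here is a small sketch that previews a "--delete" run before doing it for real. Everything happens in a temporary directory; the file names and the "*.tmp" exclude pattern are invented for the example, and it assumes rsync is installed locally.

```shell
#!/bin/sh
# Scratch source and destination (names are illustrative).
work=$(mktemp -d)
mkdir -p "$work/src" "$work/dst"
echo keep  > "$work/src/keep.txt"
echo cache > "$work/src/scratch.tmp"   # matches the exclude pattern below
echo stale > "$work/dst/stale.txt"     # exists only at the destination

# 1. Preview: -n (--dry-run) with -v reports what WOULD be copied and
#    deleted, but changes nothing.
rsync -a -v -n --delete --exclude='*.tmp' "$work/src/" "$work/dst/"
after_dry_run=$(ls "$work/dst")        # still only stale.txt

# 2. Real run: same command without -n. keep.txt is copied, stale.txt is
#    deleted, and scratch.tmp is skipped because it is excluded.
rsync -a --delete --exclude='*.tmp' "$work/src/" "$work/dst/"
after_real_run=$(ls "$work/dst")       # now only keep.txt

echo "after dry run:  $after_dry_run"
echo "after real run: $after_real_run"
```

Keeping the two command lines identical apart from "-n" is the design choice that matters here: the preview is only trustworthy if it is the same command you are about to run.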
Methods of recovery:
- Use rsync to mirror the data from a backup disk, back to the drive that is missing data or contains corrupted data.
- Back up to a duplicate "mirror" drive; in the event of a failure, drop in the mirror drive and reboot (depending on how it was created, you may need to boot with a rescue floppy and run LILO to make the hard drive bootable).
- Boot with a rescue disk set that has rsync and its required libraries on it, or boot with a stock rescue disk set and then load a floppy containing rsync and its required libraries. Then initialize the network or local disk drivers (depending on where the data is stored) and transfer the data back to the appropriate locations.
Sample rsync command line to back up (or restore) data:
rsync --rsh ssh -aH --sparse --delete --delete-excluded \
      --exclude-from=/tmp/exclude_me \
      root@vogon:/ /home/backups/vogon/
Sample exclude file:
*/swapfile
**/.gimp/gimpswap*
**/.netscape_cache/*
**/.netscape/cache/*
**/.opera/cache4/*
/tmp/*
/var/tmp/*
/dev/pts/*
/proc/*
/mnt/*
/cdrom/*
/floppy/*
/crypt/*
/var/spool/squid/*
/home/public/
/home/nobackup/