Sam Doran

My little corner of the Internet

Time Machine-like Backups Using rsync

Time Machine is one of the most significant features to be added to any desktop operating system in recent years. Before Time Machine, the percentage of users who regularly backed up their data was alarmingly small. Today, I would argue that a significant number of Mac users back up their data regularly. Windows users still have no built-in backup solution that matches the elegance and simplicity of Time Machine.

The most wonderful things about Time Machine are that it uses hard links to save disk space and that backups are simply folders and files on disk; Time Machine does not imprison your data in a proprietary container that can only be read by the program that created it (here’s looking at you, Acronis). The drive can be mounted on any Mac and you can browse the snapshots using the Finder. The snapshot folders are named YYYY-MM-DD-HHMMSS, making it very easy to see when each snapshot was made.
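To make the hard link idea concrete, here is a minimal Python sketch. It is not how Time Machine is implemented; it only shows that two names created with a hard link share one copy of the data on disk, and how a YYYY-MM-DD-HHMMSS folder name can be generated. The function names `snapshot_name` and `demo_hard_link` are made up for this illustration.

```python
import os
import tempfile
import time


def snapshot_name(when=None):
    """Return a snapshot folder name in the form YYYY-MM-DD-HHMMSS."""
    return time.strftime("%Y-%m-%d-%H%M%S", when or time.localtime())


def demo_hard_link():
    """Create a file and a hard link to it in a scratch directory.

    Returns (same_inode, link_count): both names should point at the
    same inode, and that inode should report two links.
    """
    with tempfile.TemporaryDirectory() as root:
        original = os.path.join(root, "original.txt")
        linked = os.path.join(root, "linked.txt")
        with open(original, "w") as handle:
            handle.write("unchanged between snapshots\n")
        os.link(original, linked)  # second name, no second copy of the data
        first, second = os.stat(original), os.stat(linked)
        return first.st_ino == second.st_ino, second.st_nlink


if __name__ == "__main__":
    print(snapshot_name())
    print(demo_hard_link())
```

Because an unchanged file in a new snapshot is just another name for the same data, each snapshot looks like a full copy while only changed files consume additional space.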

Why not just use rsnapshot?

Actually, I do. I currently use rsnapshot to back up the files on a Time Capsule that is used as a NAS for sharing files. Here is the backup script I wrote that backs up the Time Capsule:

While it works quite well, my main complaint with rsnapshot is that it does not name the snapshots using the date. I understand this makes things easier from a snapshot rotation point of view, but it makes browsing the snapshots less intuitive. To work around this, I found datestamp_backups.py written by Terry Hancock. I had to make a few modifications to get it working properly:

datestamp_backups.py
#!/usr/bin/env python
# Copyright (C) 2009 by Terry Hancock
#---------------------------------------------------------------------------------------
# This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation; either version 2 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program; if not, write to the Free Software
# Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
#---------------------------------------------------------------------------------------

#-- Datestamp Index for Backups --

from datetime import date
import os, glob, time

RSNAP_ROOT = '[backup directory]'
RSNAP_INTS = 'hourly', 'daily', 'weekly', 'monthly'

date_dir = os.path.join(RSNAP_ROOT, 'date')

# get the date
today = date.today()
stamp = today.isoformat()

# put date in the top level of the current backup (daily.0)
#os.remove(os.path.join(RSNAP_ROOT, 'daily.0', 'DATE'))
#open(os.path.join(RSNAP_ROOT, 'daily.0', 'DATE'), 'wt').write(stamp+'\n')

# delete the old index entries
os.system('rm -r %s' % os.path.join(RSNAP_ROOT, 'date'))
os.mkdir(os.path.join(RSNAP_ROOT, 'date'))

#for symlink in os.listdir(date_dir):
#    os.remove(os.path.join(date_dir, symlink))

# read dates in all backups and write out symlinks for them as new index
append = 'bcdefghijklmnopqrstuvwxyz'
for interval in RSNAP_INTS:
    for dirname in glob.glob(os.path.join(RSNAP_ROOT, '%s.*'%interval)):
        directory = os.path.join(RSNAP_ROOT, dirname)
        datestamp = time.strftime('%Y-%m-%d-%H%M%S', time.localtime(os.path.getmtime(directory)))
#        datestamp = open(os.path.join(directory, 'DATE'),'rt').read().strip()
        target = os.path.join(date_dir, datestamp)
        i=0
        while os.path.exists(target):
            target = os.path.join(date_dir, datestamp + append[i])
            i += 1
        os.symlink(directory, target)

While rsnapshot does accomplish the main goal of making snapshots using hard links to save space, it wasn’t quite what I was looking for. So I wrote my own backup script that I hope is helpful to others.

What works for me

I created a project on GitHub called rsync-time-machine.sh. I have tested and used this on OS X 10.8 and RHEL 5. It is not as robust as rsnapshot and it does not have any disk free space checking (maybe I’ll add that later), but it is very simple to get up and running: define the SOURCE and DESTINATION variables and run it. That’s it.
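The script itself lives on GitHub and is not reproduced here. As a rough illustration of the general technique only, the Python sketch below builds the kind of rsync command a hard-linked snapshot relies on: copy SOURCE into a new dated folder under DESTINATION and pass the previous snapshot to `--link-dest` so unchanged files become hard links. The function name `build_rsync_command` and the example paths are hypothetical, and the real script may use different options.

```python
import os
import time


def build_rsync_command(source, destination, previous=None, now=None):
    """Build an rsync argument list for one dated snapshot.

    source      -- directory to back up
    destination -- directory that holds the dated snapshot folders
    previous    -- path of the most recent snapshot, or None for the first run
    now         -- optional time tuple, mainly useful for testing
    """
    stamp = time.strftime("%Y-%m-%d-%H%M%S", now or time.localtime())
    target = os.path.join(destination, stamp)
    command = ["rsync", "--archive"]
    if previous is not None:
        # A relative --link-dest is resolved against the destination
        # directory, so an absolute path avoids surprises.
        command.append("--link-dest=" + os.path.abspath(previous))
    # The trailing slash tells rsync to copy the contents of source.
    command += [source.rstrip("/") + "/", target]
    return command


if __name__ == "__main__":
    # Hypothetical paths; print the command rather than running it.
    print(build_rsync_command("/Users/example", "/Volumes/Backup",
                              "/Volumes/Backup/2013-06-09-020000"))
```

The resulting list can be handed to `subprocess.run(command, check=True)` once the paths are real.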

I have used this on production RHEL servers to back up the entire system disk, with a few exclusions, to another internal drive. I plan to use it on my Mac at home for rotating off-site backups once I get around to buying more hard drives.

While Time Machine in Mountain Lion does have the ability to use multiple disks, I don’t back up everything using Time Machine. Specifically, my Aperture Vaults and Virtual Machines are excluded from Time Machine backups. Plus, I like being able to manually create snapshots or schedule them using cron or launchd.

By default, rsync-time-machine.sh will keep all backups from the current month and monthly backups for a year. You can comment out this functionality if you would rather manage backup removal manually.
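The retention rule can be sketched in Python as follows. This is an illustration, not the script's actual logic: it assumes snapshot folders are named YYYY-MM-DD-HHMMSS, that "monthly" means the earliest snapshot of each month, and that "a year" means the previous twelve calendar months. The function name `snapshots_to_delete` is hypothetical.

```python
from datetime import datetime

STAMP_FORMAT = "%Y-%m-%d-%H%M%S"


def snapshots_to_delete(names, today):
    """Return the snapshot names that fall outside the retention policy.

    Keeps every snapshot from the current month, plus the earliest
    snapshot of each of the previous twelve months (an assumption).
    """
    keep = set()
    first_of_month = {}
    for name in sorted(names):  # oldest first, so setdefault keeps the earliest
        taken = datetime.strptime(name, STAMP_FORMAT)
        months_old = (today.year - taken.year) * 12 + (today.month - taken.month)
        if months_old <= 0:
            keep.add(name)  # current month (or a clock-skewed future date)
        elif months_old <= 12:
            first_of_month.setdefault((taken.year, taken.month), name)
    keep.update(first_of_month.values())
    return sorted(set(names) - keep)


if __name__ == "__main__":
    example = ["2013-05-01-020000", "2013-05-15-020000", "2013-06-09-020000"]
    print(snapshots_to_delete(example, datetime(2013, 6, 10)))
```

Returning the list of candidates, rather than deleting inside the function, makes it easy to review what would be removed before anything is touched.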

I’m sure there is a lot of room for improvement, which is why I created a project on GitHub. Please feel free to contribute.


Update 2013-06-10: I finally got around to buying two 3TB hard drives for rotating off-site backups. While this script worked great when I used it for RHEL 5 server backups to separate internal drives, things are not quite as rosy on OS X with a drive toaster.

The major problem I’m running into is that it takes about twelve hours to take a snapshot. I have about 900GB of data spread across approximately 600,000 files. Based on my research, the number of files is what is really killing me because rsync has to build a massive index before it does any copying. The version of rsync that comes with OS X is 2.6.9, which has some known performance issues with large numbers of files. In my experience, my computer is unusably slow during the snapshot.

The 3.x branch specifically addressed performance issues with large numbers of files. I compiled version 3.0.9 following these excellent instructions and will say that it no longer brings the system to a crawl when it is running, but the snapshot still takes about twelve hours to complete even if nothing substantial has changed.

It does not appear that rsync is memory or CPU constrained, so the bottleneck must be I/O or RAM.[1] I will admit, I am running this on a 2007 iMac Core 2 Duo with a measly 4GB of RAM connected to an external drive via FireWire 800, so this is far from an ideal setup. I was expecting the initial snapshot to take forever and subsequent snapshots to be relatively fast as long as I didn’t add a bunch of data. It seems that it’s all just very slow. I’m not sure how much of this has to do with the fact that I am using --link-dest to hard link files that have not changed, or that I am transferring the files locally and not over a network. It could also just be a limit to what rsync can do and there may be no way to do fast transfers of hundreds of thousands of files even if only one has changed.

The advantage Time Machine and utilities like ChronoSync have over rsync is that they keep track of what has changed since the last backup; rsync has to figure this out each time it is run. I am not sure why this worked so well on my RHEL servers (I don’t work there anymore) and performs so poorly on my OS X box. If anyone has any ideas on how to make this perform better, please let me know.
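To show why the file count matters so much, here is a small Python sketch of the kind of scan a tool without a change log has to perform: visit every file and compare its size and modification time against the previous snapshot. This only approximates rsync's default quick check and is not rsync's actual code; the names `changed_files` and `demo` are hypothetical.

```python
import os
import shutil
import tempfile


def changed_files(source, previous):
    """Yield relative paths whose size or mtime differs from the previous snapshot.

    Every file under source is stat'ed, so the cost grows with the number
    of files even when almost nothing has changed.
    """
    for folder, _dirs, files in os.walk(source):
        for name in files:
            path = os.path.join(folder, name)
            relative = os.path.relpath(path, source)
            current = os.lstat(path)
            try:
                old = os.lstat(os.path.join(previous, relative))
            except FileNotFoundError:
                yield relative  # new file, not in the previous snapshot
                continue
            if (current.st_size != old.st_size
                    or int(current.st_mtime) != int(old.st_mtime)):
                yield relative


def demo():
    """Build two small trees in a scratch directory and list what changed."""
    with tempfile.TemporaryDirectory() as root:
        source = os.path.join(root, "source")
        previous = os.path.join(root, "previous")
        os.mkdir(source)
        for name in ("a.txt", "b.txt"):
            with open(os.path.join(source, name), "w") as handle:
                handle.write(name)
        shutil.copytree(source, previous)  # copies data and timestamps
        with open(os.path.join(source, "b.txt"), "w") as handle:
            handle.write("changed contents")
        return sorted(changed_files(source, previous))


if __name__ == "__main__":
    print(demo())
```

With 600,000 files, that is at least 600,000 stat calls on each side before a single byte is copied.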


Update 2013-06-25: I have figured out the cause of my problems and rsync-time-machine is working much better now. I used the following parameters to aid in my troubleshooting:

--stats   # Shows a summary at the end of the transfer
-i        # Itemize changes, showing what changed and why
-v        # Verbose output
-A        # Preserve ACLs
-X        # Preserve extended attributes
-E        # Preserve executability

With --stats enabled, I was able to see the reason snapshots were taking so long to complete even after the initial transfer (it should only copy new files and hard link to files that have not changed): all 900GB was being copied every single time.

Enabling the -i option showed me exactly why rsync felt it needed to copy all the files again. The owner and group were different between the source and destination, so rsync dutifully copied all the files over again. This reminded me that OS X defaults to ignoring ownership on external volumes, which makes for a better user experience 99% of the time. This is that 1%.[2]
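The -i output can be dense, so a small decoder helps when scanning a long log. The Python sketch below assumes the 11-character `YXcstpoguax` change string described in the rsync 3.x man page (rsync 2.6.9 prints a shorter string) and reports which attributes rsync flagged as different. The function name `changed_attributes` and the sample lines are made up for illustration, not taken from my logs.

```python
# Positions within the rsync 3.x itemized string "YXcstpoguax".
ATTRIBUTES = {
    2: "checksum",
    3: "size",
    4: "mtime",
    5: "permissions",
    6: "owner",
    7: "group",
    9: "acl",
    10: "xattr",
}


def changed_attributes(line):
    """Return the attributes flagged as changed in one line of -i output.

    A "." or a space means no change; "+" marks a newly created item,
    which is reported as ["new"].
    """
    flags = line[:11]
    if "+" in flags[2:]:
        return ["new"]
    return [label for position, label in ATTRIBUTES.items()
            if position < len(flags) and flags[position] not in ". "]


if __name__ == "__main__":
    print(changed_attributes(".f....og... Documents/example.txt"))
    print(changed_attributes(">f+++++++++ Documents/new.txt"))
```

Lines that report only owner and group are the signature of the ownership mismatch described above.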

In order to keep the owner and group permissions intact, run the following:

sudo diskutil list  # take note of the external disk number
sudo diskutil enableOwnership [disk]

One thing to note about this configuration change is that the ownership settings are not stored on the external disk, but rather on the OS X system drive in /var/db/volinfo.database. If you connect the drive to another OS X system, it will ignore ownership on the drive unless you run the above commands again.[3] If you want more background on how this works, look at man diskutil and man vsdbutil.

I left the above parameters enabled once I got everything working properly and decided to save the output to a log file. With OS X, I believe the -X and -E options are important for preserving some of the special behavior of things like iPhoto libraries and application bundles.

/Library/LaunchDaemons/rsync-time-machine.plist
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple Computer//DTD PLIST 1.0//EN"
"http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
    <key>Disabled</key>
    <false/>
    <key>Label</key>
    <string>rsync-time-machine</string>
    <key>UserName</key>
    <string>root</string>
    <key>GroupName</key>
    <string>admin</string>
    <key>ProgramArguments</key>
    <array>
        <string>/usr/local/bin/rsync-time-machine.sh</string>
    </array>
    <key>StartCalendarInterval</key>
        <dict>
            <key>Hour</key>
            <integer>2</integer>
            <key>Minute</key>
            <integer>0</integer>
        </dict>
    <key>StandardOutPath</key>
    <string>/var/log/rsync-time-machine.log</string>
    <key>StandardErrorPath</key>
    <string>/var/log/rsync-time-machine.log</string>
    <key>Debug</key>
    <true/>
    <key>AbandonProcessGroup</key>
    <true/>
</dict>
</plist>

# Load this job with
sudo launchctl load /Library/LaunchDaemons/[jobname].plist
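Hand-editing plist XML is error-prone, so one option is to generate the file instead. The Python sketch below uses the standard library's plistlib to render the same keys as the job above. It is only an alternative way to produce the file, not something launchd requires, and the names `JOB` and `render_job` are hypothetical.

```python
import plistlib

# Same keys and values as the LaunchDaemon job shown above.
JOB = {
    "Disabled": False,
    "Label": "rsync-time-machine",
    "UserName": "root",
    "GroupName": "admin",
    "ProgramArguments": ["/usr/local/bin/rsync-time-machine.sh"],
    "StartCalendarInterval": {"Hour": 2, "Minute": 0},
    "StandardOutPath": "/var/log/rsync-time-machine.log",
    "StandardErrorPath": "/var/log/rsync-time-machine.log",
    "Debug": True,
    "AbandonProcessGroup": True,
}


def render_job(job):
    """Return the job as XML plist bytes, keeping the key order given."""
    return plistlib.dumps(job, sort_keys=False)


if __name__ == "__main__":
    # Print the XML; redirect it to a file and install it with sudo.
    print(render_job(JOB).decode("utf-8"))
```

Reading the output back with `plistlib.loads` is a quick way to confirm the file parses before loading it with launchctl.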

If you would like to rotate the log files and specify permissions, use this config file:

/etc/newsyslog.d/rsyslog-time-machine.conf
# logfilename                      [owner:group]    mode count size when  flags [/pid_file] [sig_num]
/var/log/rsync-time-machine.log      root:admin       640  7    *    $D0   J

Well this has been a long journey and this post is now about ten kilometers long! I now have a solid offsite backup program in place and sleep better at night. I hope this helps someone else in their backup endeavors.

  1. I ran iostat -w 1 but was unable to monitor the drive toaster. I must be doing something wrong. The internal disk doesn’t look like it’s anywhere near max output, though.

  2. It’s not easy being a nerd.

  3. You can also enable ownership by right-clicking the drive, clicking Get Info, and unchecking “Ignore ownership on this volume”, but where’s the fun in that? The command line is much more fun!