I have looked at this question and this question, but they do not seem to address the symptoms I am seeing.

I have a large log file (around 600 MB) that I am trying to transfer across a cellular network. Because it is a log file, it is only ever appended to (it is actually an SQLite database with only INSERTs being performed, so it isn't quite that simple, but with the exception of the last 4k page, or maybe a few, the file is identical each time). It is important that only the changes (and whatever checksums need to be transmitted) actually get sent, because the data connection is metered.

Yet when I perform a test across an unmetered connection (e.g. a free WiFi hotspot) I see no speed-up and no reduction in data transferred, either observed or reported. Over a slow WiFi connection I see on the order of 1 MB/s or less, with rsync reporting that the transfer is going to take nearly 20 minutes. Over a fast WiFi connection I see a uniformly faster speed, but still no report of a speedup, and a second attempt to transfer (which should now be fast because the two files are identical) does not show any difference.

The (sanitized to remove sensitive info) command I am using is:

rsync -e 'ssh -p 9999' --progress LogFile michael@my.host.zzz:/home/michael/logs/LogFile

The output I get at the end looks like this:

LogFile
    640,856,064 100%   21.25MB/s   0:00:28 (xfr#1, to-chk=0/1)

There is no mention of any kind of speedup.

I suspect the problem may be one of the following:

  • I am missing some command line option. However, re-reading the man page seems to suggest that delta transfers are enabled by default: I only see options for disabling them.
  • I'm using rsync over ssh (on a non-standard port even) due to the server being behind a firewall that only allows ssh. I haven't seen anything explicitly saying delta transfers won't work if the rsync daemon isn't running, though. I tried using the "::" notation instead of ":" but the man page isn't very clear about what a "module" is, and my command is rejected for specifying an invalid module.
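
As far as I can tell, the "::" form expects an rsync daemon to be listening on the server, with the "module" being a share defined in its rsyncd.conf. A hypothetical setup (the module name logs is purely illustrative) would look something like this:

# on the server, in /etc/rsyncd.conf
[logs]
    path = /home/michael/logs
    read only = no

# on the client
rsync --progress LogFile michael@my.host.zzz::logs/LogFile

Since the daemon listens on its own TCP port (873 by default) rather than going over ssh, I presume that form wouldn't get through this firewall anyway.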

I have ruled out the following:

  • Delta transfers not being performed on a local network. Ruled out because I am performing the transfer across the internet.
  • Overhead due to checksum calculation. I have seen this behavior on both a fast and a slow WiFi connection, and the transfer rate doesn't seem to be compute bound.

  • "but with the exception of the last 4k page (or maybe a few) the file is identical each time." Did you actually verify that with cmp? Or better, with xdelta or something? If you really want to minimize transfer size, keep the old and new versions locally, so you can compute a minimal binary diff locally (with something other than rsync) and just send that without having to send checksums over the metered connection. Doing this at the database-record level instead of the binary-file level is probably even better, like derobert suggests. – Peter Cordes Jul 26 '16 at 5:02
  • Also, you could have used rsync --stats, and also -v -v to get even more verbose stats. Rsync will tell you how much matched vs. unmatched data there was. – Peter Cordes Jul 26 '16 at 5:03
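
A sketch of the local-diff idea from the first comment, using xdelta3 and assuming the sending side keeps both the previously transferred copy and the current one (the file names are illustrative):

xdelta3 -e -s LogFile.prev LogFile LogFile.vcdiff        # encode a binary delta locally
scp -P 9999 LogFile.vcdiff michael@my.host.zzz:/home/michael/logs/
xdelta3 -d -s LogFile.prev LogFile.vcdiff LogFile        # rebuild the new file on the server from its old copy

Only the (hopefully small) .vcdiff file has to cross the metered link.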

Summary

Databases tend to keep a lot of metadata, organizational data, etc. An insert is very unlikely to be a simple append, like it would be with a text file. Testing SQLite confirms this: adding rows touches pages throughout the file, in both WAL and non-WAL modes. This leads to rsync having to sync a lot more data than you'd expect. You can reduce this overhead somewhat by using a low --block-size (at the cost of more overhead computing and transferring checksums).

A better approach is probably to dump out new records as a SQL dump, compress it, and transfer that. Alternatively, there appear to be several replication solutions for SQLite, you could use one of those.

roaima suggests at a bare minimum you could probably do a full SQL dump, compress it using gzip --rsyncable, and then rsync that. Worth a test, I suppose, to see if that's a small enough delta.
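
A sketch of that, assuming your gzip build supports --rsyncable (the Debian and Ubuntu packages do):

sqlite3 LogFile .dump | gzip --rsyncable > LogFile.sql.gz
rsync -e 'ssh -p 9999' --progress LogFile.sql.gz michael@my.host.zzz:/home/michael/logs/

Because --rsyncable periodically resets the compressor, a change near the end of the dump only perturbs the tail of the compressed file, so rsync's delta transfer can still match most of the earlier blocks.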

Details

What you're trying should work. I'd personally add --partial to your rsync options, just in case it's somehow detecting the growing file as a partial transfer. You can also get better transfer stats with --stats.
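
For example (this just extends your original command; both options are standard rsync flags):

rsync -e 'ssh -p 9999' --partial --stats --progress LogFile michael@my.host.zzz:/home/michael/logs/LogFile

The "Literal data" and "Matched data" lines in the --stats output tell you how much rsync actually had to send versus how much it could reuse from the copy already on the server.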

The second thing to check is if SQLite is really only touching a few pages—honestly, I wouldn't be surprised if it's writing pages all over the file. One quick way to check would be to use cmp -l on two versions—see if there are changes to pages other than the final few. Remember that rsync's idea of a "page"/block is different than SQLite's; you can change rsync's via --block-size. Reducing it might help.
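
For example, to drop rsync's block size down to SQLite's default 4 KiB page size (the exact value is something to experiment with, not a recommendation):

rsync -e 'ssh -p 9999' --block-size=4096 --stats --progress LogFile michael@my.host.zzz:/home/michael/logs/LogFile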

Edit: I did a quick test with SQLite. Even with a 32k page size, adding a bunch of log entries scribbled on every page. Details below.

Edit 2: It appears to be better in WAL mode, though you still take a huge amount of overhead, probably from the checkpoint.
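
If you want to try WAL mode on your own database, it is a one-off, persistent setting (sqlite3 is the standard SQLite command-line shell):

sqlite3 test.db 'PRAGMA journal_mode=WAL;'

New rows then land in the -wal side file until a checkpoint copies them back into the main database file.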

Edit 3: It is also better the more data you're adding per transfer—I guess it probably scribbles certain blocks over and over again. So you're transferring that same set of blocks regardless of whether it wrote to them once or a hundred times.

BTW: For minimizing transfer, you can probably do much better than rsync. For example, a SQL dump of new records since the last transfer run through xz --best (or even gzip) would probably be a fair bit smaller.
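
A rough sketch of that idea, assuming the table looks like the test schema below (an increasing integer primary key id) and that you keep track of the last id already sent; the file names and the placeholder value are illustrative:

LAST_SENT=1234567   # highest id transferred last time
sqlite3 LogFile <<SQL | xz --best > new-records.sql.xz
.mode insert log
SELECT * FROM log WHERE id > $LAST_SENT;
SQL
scp -P 9999 new-records.sql.xz michael@my.host.zzz:/home/michael/logs/

The .mode insert log line makes the sqlite3 shell print the selected rows as INSERT statements against the log table, so the other side can replay them into its own copy.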

Quick SQLite Test

Schema:

CREATE TABLE log (id integer primary key not null, ts integer not null, app text not null, message text not null);
CREATE INDEX log_ts_idx on log(ts);
CREATE INDEX log_app_idx on log(app);

Perl program:

use 5.022;
use DBI;

my $DBH = DBI->connect('dbi:SQLite:test.db', '', '', {RaiseError => 1, AutoCommit => 0})
    or die "connect...";

my @apps = (
    '[kthreadd]',        '[ksoftirqd/0]',
     # there were 191 of these
    '[kworker/5:0H]',
);

my @messages = <DATA>;

# Resume from the newest timestamp already in the table, falling back to
# 2001-01-01 00:00:00 UTC (978307200) when the table is empty.
(my $curr_time) = $DBH->selectrow_array(<<QUERY);
    SELECT COALESCE(MAX(ts),978307200) FROM log
QUERY

my $n_apps = @apps;
my $n_msgs = @messages;
say "Apps: $n_apps";
say "Messages: $n_msgs";
say 'Start time: ', scalar gmtime($curr_time), ' UTC';

my $sth = $DBH->prepare(<<QUERY);
    INSERT INTO log(ts, app, message) VALUES (?, ?, ?)
QUERY

# Insert 10,000 log rows, advancing the timestamp by a random 0-0.1 s per row.
for (my $i = 0; $i < 10_000; ++$i) {
    $sth->execute(int($curr_time), $apps[int rand $n_apps], $messages[int rand $n_msgs]);
    $curr_time += rand 0.1;
}
$DBH->commit;

__DATA__
microcode: CPU0 microcode updated early to revision 0x19, date = 2013-06-21
Linux version 4.5.0-2-amd64 (debian-kernel@lists.debian.org) (gcc version 5.3.1 20160528 (Debian 5.3.1-21) ) #1 SMP Debian 4.5.5-1 (2016-05-29)

There were a lot more example log messages (2076).

Checking for which pages changed:

cp test.db test.db.old
perl test.pl
cmp -l test.db.old test.db | perl -n -E '/^\s*(\d+) / or die "wtf"; $bucket{int $1/32768} = 1; END { say join "\n", sort( { $a <=> $b } keys %bucket) }'
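
This groups every differing byte offset reported by cmp -l into 32 KiB buckets (matching the page size used for this test) and prints the index of each page that contains at least one changed byte, which shows how widely the writes are scattered across the file.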