Note that a copy instruction never copies bytes from the (partial) target file itself. So a diff from an empty reference file will not compress the target file as some other tools do.
The diff file is produced by a greedy scan of the target file in
sequential order. Starting at the first byte of the target try to find
a matching subsequence of the reference file. The subsequence should
be as long as possible but not smaller than some lower bound. This
lower bound is specified by the variable minimal_match in
the script, which is 48 by default. If a match is found a
copy instruction is added to the diff file and the matching sequence in
the target file is skipped. If no match is found we skip a single byte
in the target file and search for a match again. Once a match is found
or the end of the target file is reached we add an append instruction
for the subsequence in the target file for which no match was found.
This whole process repeats until the end of the target file is reached.
Note that any two append instructions in the resulting diff file are
always separated by at least one copy instruction.
The index algorithm in ldiff is O(N) in the length of the reference
file to index. Instead of indexing the complete suffixes for each file
offset ldiff indexes “fingerprints” at only those offsets
that are a multiple of some fixed stride. This stride is specified by
the index_step variable in ldiff.lua and is 8
by default. Here “fingerprint” denotes the following
concept.
If S is the complete text of the reference file and
p is a file position then the fingerprint of
S at offset p is the sequence S[p +
fi] for i from 1..19 with
f the Fibonacci (like) sequence 0, 1, 2, 4, 7, .., 6764.
This sequence is chosen rather arbitrarily but has the property that it
becomes sparser at higher indices. Since each fingerprint covers a
subsequence of length 6765 by examining only 19 positions we can still
expect to find large subsequence matches when they exist but at a much
lower price than indexing full suffixes.
Although it is theoretically possible to build a full suffix trie of the reference file in O(N) (by Ukkonen's algorithm) the implementation will be much more involved than the current fingerprint approach that also performs quite well.