We've got a bunch of files to back up. These files change over time. And we need an efficient method to store them and their changes.
The obvious solution comes first: identify a file by its path+name and modification time, then compress it and back it up. To know whether it has changed, we compare modification times; if the file to preserve is newer, we repeat the procedure. It works. It is also about as far from optimal as it can get.
A first upgrade comes with a proper identification of the file. If we compute a hash of its contents and use that to identify the file, we won't have to back it up again every time we rename or move it. This helps with deduplication too: if the file exists more than once in the filesystem, we can now get by with just one copy of it. It may seem unimportant when we are backing up just one person's files, but we are dealing with an entire network of backups. Chances are that, if people from the same family share the program, they will share files too (photographs, for example). That doesn't mean they know what the others have on their computers; they don't, and I will explain why and how that works in the next post. If this procedure is applied to a company it's even more important: different employees tend to have a huge list of duplicated files.
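Here is a minimal sketch of that idea in Python: identify a file by a hash of its contents instead of its path. I'm assuming SHA-256 and a 1 MiB read size here; the post doesn't commit to a particular hash function or chunking strategy.

```python
import hashlib

def content_id(path: str) -> str:
    """Identify a file by the hash of its contents, not by path+name.

    Renamed, moved, or duplicated files all map to the same identifier,
    so they only need to be stored once in the backup.
    """
    h = hashlib.sha256()
    with open(path, "rb") as f:
        # Read in chunks so large files don't have to fit in memory.
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()
```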
Another optimization is blocks. When a file changes, usually only a part of it is modified. By splitting it into pieces (blocks) we can store only the parts of the file that have changed, saving lots of space. Also, users sometimes store several only slightly different versions of a file. Blocks save space and bandwidth here too.
So what we are identifying by their hash codes and storing are blocks, not files. A file is then composed of its metadata (path+name, permissions...) and a list of blocks.
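The sketch below shows what that could look like: files are split into blocks, each block is stored once under its hash, and the file itself is just metadata plus an ordered list of block hashes. The fixed 4 MiB block size and the in-memory dict standing in for the block store are assumptions for illustration, not the actual design.

```python
import hashlib
import os
from dataclasses import dataclass, field

BLOCK_SIZE = 4 * 1024 * 1024  # assumed block size; real systems often use smaller or variable-size blocks

@dataclass
class FileEntry:
    path: str                                         # path + name
    mode: int                                         # permissions
    blocks: list[str] = field(default_factory=list)   # ordered list of block hashes

def backup_file(path: str, store: dict[str, bytes]) -> FileEntry:
    """Split a file into blocks and store each block under its content hash.

    Blocks already present in the store (from this file or any other)
    are not stored again: that is the deduplication.
    """
    entry = FileEntry(path=path, mode=os.stat(path).st_mode)
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(BLOCK_SIZE), b""):
            digest = hashlib.sha256(block).hexdigest()
            store.setdefault(digest, block)   # keep a single copy per hash
            entry.blocks.append(digest)
    return entry
```

When only part of a file changes, only the blocks covering that part get new hashes; the unchanged blocks are already in the store and cost nothing extra.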
EDIT (12/02/2012): Last night a friend asked me how deduplication is possible when it means sharing data between users whose backups come from different nodes (with different cipher keys). One of the reasons for encrypting data is to keep it private. Privacy vs. deduplication has been discussed in the Dropbox case.
Of course, he's right. I've re-read the pStore paper, and it achieves deduplication by encrypting blocks with a key derived from their hash code (convergent encryption). While it's unlikely that a user would receive from another node a private or shared block that doesn't belong to him (even SHA-1 is 160 bits long), it seems quite plausible that a user could decrypt the whole content of his own hard drive, which consists of other people's backups. So for now "convergent encryption" is out, and so is deduplication outside a single user's scope.
This only affects backups made from different nodes. If a company has a unified backup, it can still benefit from deduplication. In this case the user (with a single cipher key) is the company, not its employees.
EDIT 2 (12/02/2012): I'm wrong again. Convergent encryption seems safe enough (though see an attack here). I just didn't know how it's used. You can read about it here and here. In a nutshell: you calculate the hash of a file. Then you generate a symmetric encryption key from it and encrypt the file. Then you re-calculate the hash (this time of the encrypted file) and use it as the identifier or locator for the file. You must keep the first hash in order to decrypt the file. The only way to calculate the decryption key is to have the original file.
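As a sketch of that scheme: the key comes from the plaintext hash, the locator from the ciphertext hash, and only someone who already has the original content can derive the key. I'm using AES-GCM from the third-party cryptography package and a nonce derived from the key; those particular choices are mine, not something the linked articles prescribe.

```python
import hashlib
from cryptography.hazmat.primitives.ciphers.aead import AESGCM  # third-party 'cryptography' package

def convergent_encrypt(plaintext: bytes) -> tuple[str, bytes, bytes]:
    """Convergent encryption sketch:
      1. key     = hash(plaintext)   -> anyone with the same content derives the same key
      2. cipher  = Enc(key, plaintext)
      3. locator = hash(cipher)      -> identifier used to store and deduplicate the block
    Returns (locator, ciphertext, key); the key must be kept to decrypt later.
    """
    key = hashlib.sha256(plaintext).digest()   # 32-byte key derived from the content
    nonce = key[:12]                           # deterministic nonce so identical content encrypts identically (simplification)
    ciphertext = AESGCM(key).encrypt(nonce, plaintext, None)
    locator = hashlib.sha256(ciphertext).hexdigest()
    return locator, ciphertext, key

def convergent_decrypt(ciphertext: bytes, key: bytes) -> bytes:
    """Decrypt a block; only possible if you kept the content-derived key."""
    return AESGCM(key).decrypt(key[:12], ciphertext, None)
```

Two users who back up the same file end up with the same ciphertext and the same locator, so the block is stored once, yet the storage node never sees a key for content it doesn't already have.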