Saturday, December 22, 2012

Typing time

Despite having almost zero time for peerbackup, I refuse to let its development stall. My exams finish in mid-February, but there are some things I can do in the meantime.
I need to test the concepts I've been learning and explaining in previous posts, so I think it's time to write some actual code. 
The plan is: pick a file and split it into blocks. For each block, calculate its adler32 (rolling checksum) and hash code, encrypt it and copy the pieces to another directory. Once this is done, try to restore the file. Then modify it and update the backup. 
What I want is to develop and test as much functionality as possible locally. I expect the networking part of peerbackup to be the most difficult, so it will have to wait. I will be updating the repository as new functionality is added.
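Something along these lines is what I have in mind for that first test. A rough sketch in Python (block size, file layout and the choice of SHA-256 as the strong hash are arbitrary assumptions for now, and encryption is left out of this first pass):

    import hashlib
    import os
    import zlib

    BLOCK_SIZE = 64 * 1024  # arbitrary block size for the experiment

    def backup_file(path, store_dir):
        """Split a file into blocks, checksum them and copy them to store_dir.
        Returns the manifest (ordered list of checksums) needed to restore."""
        os.makedirs(store_dir, exist_ok=True)
        manifest = []
        with open(path, "rb") as f:
            while True:
                block = f.read(BLOCK_SIZE)
                if not block:
                    break
                weak = zlib.adler32(block)                  # rolling checksum
                strong = hashlib.sha256(block).hexdigest()  # strong hash / block id
                manifest.append((weak, strong))
                # Encryption is skipped in this first pass; each block is
                # stored under its strong hash, which already deduplicates.
                with open(os.path.join(store_dir, strong), "wb") as out:
                    out.write(block)
        return manifest

    def restore_file(manifest, store_dir, dest_path):
        """Rebuild the original file by concatenating its blocks in order."""
        with open(dest_path, "wb") as out:
            for _weak, strong in manifest:
                with open(os.path.join(store_dir, strong), "rb") as blk:
                    out.write(blk.read())
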

Thursday, December 6, 2012

Rolling checksums and a pinch of salt

There's a problem with the idea of blocks explained in the last post. Let's say we've got a very large text file. We split it into blocks and store them. Suppose that the next day we add a line of text at the beginning of this file. What happens then? We split it again, but every block is different from the previous ones: the start of each block has been displaced. That's not what we want; we should be able to detect that some bytes have been inserted and store only that change. 
Here's where rolling checksums come in handy. We can compute a rolling checksum of a block as if it were a hash code, but then slide it along the data. By adding the next byte's contribution and subtracting the first byte's contribution, we get the rolling checksum of the block displaced by one byte. In other words, if we have the rolling checksum of bytes X1...Xn, it's cheap and fast to calculate the checksum of bytes X2...Xn+1. Here's a better explanation of its use in rsync.
This type of checksum is called "weak": the probability of a false match is higher than with cryptographic methods. That's why, when a match is found, we compute a cryptographic checksum (more time- and resource-consuming) to confirm it.
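As a minimal sketch of how this could look in Python (the rolling update formulas are the standard Adler-32 ones; using SHA-256 as the strong confirmation hash is an assumption on my part):

    import hashlib
    import zlib

    MOD = 65521  # Adler-32 modulus

    class RollingAdler32:
        """Adler-32 over a fixed-size window that can slide one byte at a time."""

        def __init__(self, window):
            self.n = len(window)
            self.a = (1 + sum(window)) % MOD
            self.b = (self.n + sum((self.n - i) * x for i, x in enumerate(window))) % MOD

        @property
        def value(self):
            return (self.b << 16) | self.a

        def roll(self, out_byte, in_byte):
            """Drop the first byte of the window and append a new one."""
            self.a = (self.a - out_byte + in_byte) % MOD
            self.b = (self.b - self.n * out_byte + self.a - 1) % MOD
            return self.value

    data = b"the quick brown fox jumps over the lazy dog"
    r = RollingAdler32(data[0:16])
    r.roll(data[0], data[16])                   # window is now data[1:17]
    assert r.value == zlib.adler32(data[1:17])  # matches a full recomputation

    def confirm_match(window, stored_block_sha256):
        """Weak matches are cheap but can be false; confirm with a strong hash."""
        return hashlib.sha256(window).digest() == stored_block_sha256
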
Let's go now for some salt. I'm talking about an idea the Tahoe filesystem team had about convergent encryption. Convergent encryption has a well-known privacy problem: it lets you check whether someone else has a file, as long as you have it too. It's simple: you calculate its hash (once encrypted) and compare it with the hashes stored on your hard drive. I expect this not to affect Peerbackup, as blocks will hopefully be located only by hash, not by owner. But, just in case, how can the effects of this attack be avoided? The Tahoe team came up with a good solution: they allow adding salt to the hash of the block. In cryptography, a salt is a group of bits added to a plaintext so that the salted ciphertext differs from the unsalted one; depending on the salt used, the encrypted result will vary. To use this idea, the salt text would be defined by each user, so deduplication would happen among all users sharing the same salt text. Let's say the default behaviour is no salt text. Only if you define a salt phrase are you protected from the attack and excluded from deduplication with other people's files. If you trust someone, or a group of people, you could all use the same salt phrase and benefit from mutual deduplication.
I like the idea and plan to borrow it for Peerbackup.
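A tiny sketch of how I imagine the salt fitting in (just the hashing step, with SHA-256 assumed as the hash; the salt phrase is whatever the user, or group of users, agrees on):

    import hashlib

    def block_id(block, salt=b""):
        # No salt (the default): identical blocks from any user get the same id,
        # so they deduplicate across the whole network. With a user- or
        # group-defined salt phrase, only people sharing that phrase deduplicate
        # with each other, and the hash-comparison attack no longer works on them.
        return hashlib.sha256(salt + block).hexdigest()

    assert block_id(b"some block") == block_id(b"some block")
    assert block_id(b"some block") != block_id(b"some block", salt=b"family phrase")
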

Saturday, December 1, 2012

Files, blocks and deduplication

We've got a bunch of files to back up. These files change over time. And we need an efficient method to store them and their changes.
The obvious solution comes first: identify a file by path+name and its modification time, then compress and back it up. To know whether it has changed we compare modification times; if the file to preserve is newer, we repeat the procedure. It works. It's also about as far from optimal as it can get.
A first upgrade comes with a proper identification of the file. If we compute the hash code of its contents and use it to identify the file, we won't have to back it up again every time we rename or move it. This helps with deduplication too: if the file exists more than once in the filesystem, we can now make do with just one copy of it. It may seem unimportant when we are backing up just one person's files, but we are dealing with an entire network of backups. Chances are that, if people from the same family share the program, they will share files too (photographs, for example). That doesn't mean they know what the others have on their computers. They don't, and I will explain why and how in the next post. If this procedure is applied to a company, it matters even more: different employees tend to have a huge list of duplicated files.
Another optimization is blocks. When a file is changed, usually only a part of it is modified. By splitting it into pieces (blocks) we can store only the parts of the file that changed, saving lots of space. Also, users sometimes store different versions of a file that are only slightly modified; blocks save space and bandwidth here too.
So what we identify by hash code and store are blocks, not files. A file is then composed of its metadata (path+name, permissions...) and a list of blocks.
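In code, a file could then be something as simple as this (a sketch; the field names are my own, not a final format):

    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class BlockRef:
        sha256: str   # strong hash that identifies (and locates) the block
        size: int

    @dataclass
    class FileEntry:
        path: str      # path + name
        mode: int      # permissions
        mtime: float   # modification time
        blocks: List[BlockRef] = field(default_factory=list)
        # Restoring the file means fetching each block by its hash
        # and concatenating them in order.
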

EDIT (12/02/2012): Last night a friend asked me how deduplication is possible when it means sharing data between users whose backups come from different nodes (with different cipher keys). One of the reasons for encrypting data is to keep it private. Privacy vs. deduplication has been discussed in the Dropbox case.
Of course, he's right. I've re-read the pStore paper: it achieves deduplication by encrypting blocks with a key derived from their hash code (convergent encryption). While it's unlikely that a user would receive from another node a private or shared block that doesn't belong to him (even SHA-1 is 160 bits long), it seems quite plausible that a user could decrypt the whole content of his own hard drive, which consists of other people's backups. So for now "convergent encryption" is out, and so is deduplication outside each user's own scope.
This only affects backups made from different nodes. If a company has a unified backup, it can still benefit from deduplication; in this case the user (with a single cipher key) is the company, not its employees.
EDIT 2 (12/02/2012): I'm wrong again. Convergent encryption seems safe enough (though see the attack here). I just didn't know how it's used. You can read about it here and here. In a nutshell: you calculate the hash of a file, then generate a symmetric encryption key from it and encrypt the file. Then you re-calculate the hash (this time of the encrypted file) and use it as the identifier or locator for the file. You must keep the first hash in order to decrypt the file; the only way to calculate the decryption key is to have the original file.
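For my own reference, the whole procedure in a few lines of Python (a sketch assuming the third-party cryptography package and AES-GCM; the papers don't mandate a particular cipher). Since the key is derived from the content, the same key is never used for two different plaintexts, which is why a fixed nonce is acceptable here:

    import hashlib
    from cryptography.hazmat.primitives.ciphers.aead import AESGCM

    NONCE = b"\x00" * 12  # fixed nonce: each content-derived key encrypts one plaintext

    def convergent_encrypt(plaintext):
        key = hashlib.sha256(plaintext).digest()      # 1. hash -> symmetric key (keep it!)
        ciphertext = AESGCM(key).encrypt(NONCE, plaintext, None)  # 2. encrypt with that key
        locator = hashlib.sha256(ciphertext).hexdigest()          # 3. hash of ciphertext = id
        return key, locator, ciphertext

    def convergent_decrypt(key, ciphertext):
        return AESGCM(key).decrypt(NONCE, ciphertext, None)

    # Identical files always produce the same locator, which is what makes
    # deduplication possible without revealing the decryption key to other nodes.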