Thursday, December 6, 2012

Rolling checksums and a pinch of salt

There's a problem with the idea of blocks explained on the last post. Let's say that we've got a very large text file. We split it in blocks and store them. Supose that the next day we add a line of text at the begining of this file. What happens then? We split it again, but every block is different from previous ones. The beginning of each block has been displaced. That's not what we want, because we should be able to find that some bytes have been inserted and store only these changes. 
Here's when rolling checksums become handy. We can compute a rolling checksum of a block as if it were a hash code. But it's possible to displace it afterwards. Adding next byte's contribution to the rolling checksum and substracting first block's byte contribution, we calculate the rolling checksum of a block displaced one byte from the previous one. In other words, if we have the rolling checksum of bytes X1...Xn, it's cheap to calculate the checksum of bytes X2...Xn+1. Even more, it's done fast and efficiently. Here's a better explanation of its use in rsync.
This type of checksums are called "weak". It means that there're more probabilities of a false match than with cryptographic methods. This is why when a match is found we compute a cryptographic checksum (more time and resource consuming)  to confirm the match.
Let's go now for some salt. I'm talking about an idea the Tahoe filesystem team had about convergent encryption. Convergent encryption has a well known privacy problem which allows to check if someone has a file if you have it too. It's simple: you calculate its hash (once encrypted) and compare it with hashes that you store in your harddrive. I expect that it doesn't affect Peerbackup, as blocks will hopefully be located only by hash, not by propietary. But, just in case, how can the effects of this attack be avoided? Tahoe team has come to a good solution. They allow to add salt to the hash of the block. In cryptography,  the word salt is used for a group of bits that are added to a plain text in order to make salted ciphertext different from unsalted one. Depending on the salt used, encrypted result will vary. To use this idea, salt text should be defined by each user, so deduplication would happen among every user with the same salt text. Let's say that the default behaviour is no salt text. Only if you define a salt phrase you will be protected from the attack and excluded from deduplication with other's files. If you trust someone, or a group of people, you could all use the same salt phrase and benefit from mutual deduplication.
I like the idea and plan to borrow it for Peerbackup.

No comments:

Post a Comment