Saturday, December 22, 2012

Typing time

Despite having almost zero time for Peerbackup, I refuse to let its development stall. My exams finish in mid-February, but there are some things I can do in the meantime.
I need to test the concepts I've been learning and explaining in previous posts, so I think it's time to write some actual code.
The plan is: pick a file and split it into blocks. For each block, calculate its adler32 (rolling checksum) and hash codes, encrypt it and copy the pieces to another directory. Once this is done, try to restore the file. Then modify it and update the backup.
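Here's a minimal sketch of the first half of that plan, just to fix ideas: split a file into fixed-size blocks and compute a weak (adler32) and a strong (SHA-256) checksum for each. The 4 KiB block size, the test file name and the choice of SHA-256 are placeholders, not decisions; the encryption and restore steps will come later.

// Split a file into fixed-size blocks and print a weak (adler32) and a
// strong (SHA-256) checksum for each one.
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"hash/adler32"
	"io"
	"log"
	"os"
)

const blockSize = 4096 // assumption: fixed 4 KiB blocks for now

func main() {
	f, err := os.Open("testfile.dat") // hypothetical input file
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	buf := make([]byte, blockSize)
	for i := 0; ; i++ {
		n, err := io.ReadFull(f, buf)
		if n > 0 {
			block := buf[:n]
			weak := adler32.Checksum(block)
			strong := sha256.Sum256(block)
			fmt.Printf("block %d: adler32=%08x sha256=%s\n",
				i, weak, hex.EncodeToString(strong[:]))
			// Next steps (not shown): encrypt the block and write it to the
			// backup directory under a name derived from its strong hash.
		}
		if err == io.EOF || err == io.ErrUnexpectedEOF {
			break // end of file (the last block may be shorter)
		}
		if err != nil {
			log.Fatal(err)
		}
	}
}
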
What I want is to develop and test locally as much functionality as possible. I expect the networking part of Peerbackup to be the most difficult one, so it will have to wait. I will be updating the repository as new functionality is added.

Thursday, December 6, 2012

Rolling checksums and a pinch of salt

There's a problem with the idea of blocks explained in the last post. Let's say we've got a very large text file. We split it into blocks and store them. Suppose that the next day we add a line of text at the beginning of this file. What happens then? We split it again, but every block is different from the previous ones: the beginning of each block has been displaced. That's not what we want, because we should be able to detect that some bytes have been inserted and store only that change.
This is where rolling checksums come in handy. We can compute the rolling checksum of a block as if it were a hash code, but, unlike a hash, it can be displaced afterwards. By adding the next byte's contribution and subtracting the first byte's contribution, we obtain the rolling checksum of the block displaced one byte from the previous one. In other words, if we have the rolling checksum of bytes X1...Xn, it's cheap to calculate the checksum of bytes X2...Xn+1, and it's done fast and efficiently. Here's a better explanation of its use in rsync.
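To make the update step concrete, here's a sketch of a rolling Adler-32: it keeps the two internal sums and slides the window one byte with a couple of additions and subtractions, checking each result against the standard library's from-scratch computation. The window size and test data are arbitrary, and a real implementation would follow the rsync paper more closely.

// A rolling Adler-32: computing the checksum of the next window from the
// previous one in constant time.
package main

import (
	"fmt"
	"hash/adler32"
)

const mod = 65521 // largest prime below 2^16, as in Adler-32

// rolling keeps the two Adler-32 sums of a fixed-size window.
type rolling struct {
	s1, s2 uint32
	window uint32
}

// newRolling computes the checksum of the initial window from scratch.
func newRolling(block []byte) *rolling {
	r := &rolling{s1: 1, window: uint32(len(block))}
	for _, b := range block {
		r.s1 = (r.s1 + uint32(b)) % mod
		r.s2 = (r.s2 + r.s1) % mod
	}
	return r
}

// Sum32 packs the sums in the usual Adler-32 layout.
func (r *rolling) Sum32() uint32 { return r.s2<<16 | r.s1 }

// Roll slides the window one byte: out leaves on the left, in enters on the
// right. The cost is constant, no matter how big the window is. (Assumes the
// window is smaller than 64 KiB so the intermediate values fit in a uint32.)
func (r *rolling) Roll(out, in byte) {
	r.s1 = (r.s1 + mod - uint32(out) + uint32(in)) % mod
	r.s2 = (r.s2 + r.window*(mod-uint32(out)) + r.s1 + mod - 1) % mod
}

func main() {
	data := []byte("hello rolling checksums!")
	const n = 8 // arbitrary window size

	r := newRolling(data[:n])
	for i := n; i < len(data); i++ {
		r.Roll(data[i-n], data[i])
		// Should always print true: rolling matches recomputing from scratch.
		fmt.Println(r.Sum32() == adler32.Checksum(data[i-n+1:i+1]))
	}
}
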
This type of checksum is called "weak": there's a higher probability of a false match than with cryptographic methods. That's why, when a match is found, we compute a cryptographic checksum (more time and resource consuming) to confirm it.
Now for some salt. I'm talking about an idea the Tahoe filesystem team had about convergent encryption. Convergent encryption has a well-known privacy problem: it allows anyone to check whether someone else has a file, as long as they have it too. It's simple: you calculate its hash (once encrypted) and compare it with the hashes stored on your hard drive. I expect this doesn't affect Peerbackup, as blocks will hopefully be located only by hash, not by owner. But, just in case, how can the effects of this attack be avoided?
The Tahoe team has come up with a good solution: they allow adding salt to the hash of the block. In cryptography, the word salt refers to a group of bits added to a plaintext so that the salted ciphertext differs from the unsalted one; depending on the salt used, the encrypted result will vary. To use this idea, the salt should be defined by each user, so deduplication would happen among all users sharing the same salt. Let's say the default behaviour is no salt: only if you define a salt phrase are you protected from the attack and excluded from deduplication with other people's files. If you trust someone, or a group of people, you could all use the same salt phrase and benefit from mutual deduplication.
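To illustrate how a per-user salt could fit in, here's a sketch where the block key is derived from the salt plus the block's content instead of the content alone: no salt means global deduplication, while a shared salt phrase means deduplication only within the group that knows it. The use of SHA-256 and plain concatenation are my assumptions for illustration, not Tahoe's exact construction.

// Derive a block's convergent key from an optional user salt plus the block
// contents. Different salts give different keys, hence different ciphertexts.
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

func blockKey(salt, block []byte) []byte {
	h := sha256.New()
	h.Write(salt) // empty salt: every user gets the same key for the same block
	h.Write(block)
	return h.Sum(nil) // 32 bytes, usable as a symmetric key
}

func main() {
	block := []byte("some block of file data")

	noSalt := blockKey(nil, block)
	family := blockKey([]byte("our family salt phrase"), block) // hypothetical salt

	fmt.Println(hex.EncodeToString(noSalt))
	fmt.Println(hex.EncodeToString(family)) // different key, so different ciphertext
}
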
I like the idea and plan to borrow it for Peerbackup.

Saturday, December 1, 2012

Files, blocks and deduplication

We've got a bunch of files to backup. We've got changes to these files through time. And we need an efficient method to store them and their changes.
The obvious solution comes first: identify a file by path+name and its modification time, then compress it and back it up. To know whether it has changed, we compare modification times; if the file to preserve is newer, we repeat the procedure. It works. It's also about as far from optimal as it can get.
A first improvement is a proper identification of the file. If we compute the hash code of its contents and use that to identify the file, we won't have to back it up again every time we rename or move it. This helps with deduplication too: if the file exists more than once in the filesystem, we can now make do with a single copy of it. It may seem unimportant when we are backing up just one person's files, but we are dealing with an entire network of backups. Chances are that, if people from the same family share the program, they will share files too (photographs, for instance). That doesn't mean they know what the others keep on their computers; they don't, and I will explain why and how in the next post. Applied to a company, this is even more important: different employees tend to have a huge list of duplicated files.
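To make that concrete, here's a small sketch that identifies a file by the hash of its contents, so renames, moves and duplicate copies all map to the same identifier. The paths and the choice of SHA-256 are placeholders.

// Identify a file by the SHA-256 of its contents instead of its path.
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"io"
	"log"
	"os"
)

func fileID(path string) (string, error) {
	f, err := os.Open(path)
	if err != nil {
		return "", err
	}
	defer f.Close()

	h := sha256.New()
	if _, err := io.Copy(h, f); err != nil {
		return "", err
	}
	return hex.EncodeToString(h.Sum(nil)), nil
}

func main() {
	// Two hypothetical paths pointing at identical content yield the same ID,
	// so only one copy needs to be stored.
	for _, p := range []string{"photos/trip.jpg", "copy_of_trip.jpg"} {
		id, err := fileID(p)
		if err != nil {
			log.Fatal(err)
		}
		fmt.Println(p, id)
	}
}
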
Another optimization is blocks. When a file changes, usually only a part of it is modified. By splitting it into pieces (blocks) we can store only the parts of the file that have changed, saving lots of space. Also, users sometimes keep different versions of a file that are only slightly modified; blocks save space and bandwidth there too.
So what we are identifying by their hash codes and storing are blocks, not files. A file is then composed of its metadata (path+name, permissions...) and a list of blocks.
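In Go terms, the data model would look something like this sketch; the package and field names are placeholders, not a final schema.

// Blocks are identified (and stored) by their hash; files are reduced to
// metadata plus an ordered list of block hashes.
package peerbackup

import "time"

// Block is the unit that actually gets stored and, later on, encrypted and
// shipped to other nodes.
type Block struct {
	Hash []byte // strong hash of Data, also the block's identifier
	Data []byte
}

// File carries no content at all: restoring it means fetching its blocks in
// order and concatenating them.
type File struct {
	Path    string // path+name
	Mode    uint32 // permissions
	ModTime time.Time
	Blocks  [][]byte // ordered list of block hashes
}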

EDIT (12/02/2012): Last night a friend asked me how deduplication is possible when it means sharing data between users whose backups come from different nodes (with different cipher keys). One of the reasons for encrypting data is to keep it private. Privacy vs deduplication has already been discussed in the Dropbox case.
Of course, he's right. I've re-read the pStore paper and it achieves deduplication by encrypting blocks with a key derived from their hash code (convergent encryption). While it's unlikely that a user would receive from another node a private or shared block that doesn't belong to him (even SHA-1 is 160 bits long), it seems quite plausible that a user could decrypt the content stored on his own hard drive, which consists of other people's backups. So for now "convergent encryption" is out, and so is deduplication beyond each user's own scope.
This only affects backups made from different nodes. If a company has a unified backup, it can still benefit from deduplication; in that case the user (with a single cipher key) is the company, not its employees.
EDIT 2 (12/02/2012): I'm wrong again. Convergent encryption seems safe enough (though see the attack here). I just didn't know how it's used. You can read about it here and here. In a nutshell: you calculate the hash of a file, then generate a symmetric encryption key from it and encrypt the file. Then you re-calculate the hash (this time of the encrypted file) and use it as the identifier or locator of the file. You must keep the first hash in order to decrypt the file: the only way of calculating the decryption key is having the original content.
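Here's a sketch of that flow in Go, assuming SHA-256 for both hashes and AES-CTR with a fixed IV so that equal plaintexts always produce equal ciphertexts (which is what makes deduplication work). The cipher choice and the fixed IV are my assumptions for illustration, not a vetted design.

// Convergent encryption: key = hash(plaintext), locator = hash(ciphertext).
package main

import (
	"crypto/aes"
	"crypto/cipher"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// convergentEncrypt returns the encryption key (needed later to decrypt),
// the ciphertext, and the locator used to store and find the block.
func convergentEncrypt(plain []byte) (key, ciphertext []byte, locator string) {
	// 1. The key is the hash of the plaintext.
	k := sha256.Sum256(plain)
	key = k[:]

	// 2. Encrypt deterministically with that key (zero IV on purpose, so that
	//    identical plaintexts encrypt to identical ciphertexts).
	block, err := aes.NewCipher(key)
	if err != nil {
		panic(err) // can't happen: the key is always 32 bytes
	}
	iv := make([]byte, aes.BlockSize)
	ciphertext = make([]byte, len(plain))
	cipher.NewCTR(block, iv).XORKeyStream(ciphertext, plain)

	// 3. The locator is the hash of the ciphertext. Anyone can store or fetch
	//    the block by this name, but only someone who already had the original
	//    content (and therefore the key) can decrypt it.
	l := sha256.Sum256(ciphertext)
	locator = hex.EncodeToString(l[:])
	return key, ciphertext, locator
}

func main() {
	_, _, loc1 := convergentEncrypt([]byte("the same file contents"))
	_, _, loc2 := convergentEncrypt([]byte("the same file contents"))
	fmt.Println(loc1 == loc2) // true: two users with the same file store one block
}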

Thursday, November 29, 2012

Deathstar and reinventing the wheel

I've spent most of the week reading papers about techniques for distributed backups. There seems to be a lot of theory about it, but no free software application in active development.
The only existing code I've found is DIBS (Distributed Internet Backup System), but its last commit seems to date back to 2009. I'll study its code to learn about the problem I'm trying to solve.
After this little bit of research I've come to the conclusion that the program I'm writing can be compared to building a Death Star, not the kennel I first thought it was. So, even sticking to the initial idea of "evolutionary development", I need to gather quite a lot of information before producing any actual code.
Another issue that keeps bothering me is the thought that I may be reinventing the wheel. The rsync protocol seems well suited to the problem of transmitting data between peers. If that's the case, I will try to use it. Even better, I will try to find an existing implementation or library in Go and use that. I need to keep reminding myself: it's free software. Reuse when possible. This mantra should extend to the whole development process: rely on existing standards and working implementations as much as possible.
So, for a while, expect activity on this blog, but not much in the repository. It's time for reading and summarizing information.

Saturday, November 24, 2012

Directory syncing

Directory syncing, as well as finding time, has proven to be harder than I expected three weeks ago. Short version: I failed my first two-week goal.
But the project is alive and I have learned quite a lot of things. Right now the program can compare two folders and report which files should be copied, which ones are fine and which ones should be erased. This functionality isn't finished, because it doesn't check modification dates yet.
I'm using filepath.Walk to traverse the directory trees. Two independent goroutines are launched and the main function reads information from both of them. Since filepath.Walk returns files in alphabetical order, the main function can tell which action should be taken for each file.
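A simplified sketch of that comparison loop, for the record: one goroutine walks each tree and sends relative paths over a channel, and the main function merges the two alphabetically ordered streams to decide what to do with each entry. The directory names are placeholders and, as in the real code, modification times aren't checked yet.

// Compare two directory trees by merging two sorted streams of paths.
package main

import (
	"fmt"
	"log"
	"os"
	"path/filepath"
)

// walk sends every relative path under root, in the alphabetical order that
// filepath.Walk guarantees, then closes the channel.
func walk(root string, out chan<- string) {
	defer close(out)
	err := filepath.Walk(root, func(path string, info os.FileInfo, err error) error {
		if err != nil {
			return err
		}
		rel, err := filepath.Rel(root, path)
		if err != nil {
			return err
		}
		if rel != "." {
			out <- rel
		}
		return nil
	})
	if err != nil {
		log.Println(err)
	}
}

func main() {
	src := make(chan string)
	dst := make(chan string)
	go walk("original", src) // hypothetical directories
	go walk("backup", dst)

	s, sok := <-src
	d, dok := <-dst
	// Merge the two sorted streams: names on both sides are fine, source-only
	// names must be copied, destination-only names must be erased.
	for sok || dok {
		switch {
		case !dok || (sok && s < d):
			fmt.Println("copy:", s)
			s, sok = <-src
		case !sok || d < s:
			fmt.Println("erase:", d)
			d, dok = <-dst
		default:
			fmt.Println("ok:", s) // a later version will compare modification times here
			s, sok = <-src
			d, dok = <-dst
		}
	}
}
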
But this implementation isn't good enough, for several reasons.
First of all, filepath.Walk's behaviour is perfect for copying and creating directories and files, but not appropriate for deleting them: files must be deleted first and their parent directories afterwards. This can be fixed by keeping a list of paths to be deleted and performing the deletion at the end but, honestly, I don't think that would be a good solution. So a more customized version of filepath.Walk is needed.
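For reference, that workaround would look something like this sketch: record the paths during the walk, then delete them in reverse lexical order, so every file and subdirectory goes before its parent directory. The paths are hypothetical and error handling is left out.

// Defer deletions and perform them children-first.
package main

import (
	"fmt"
	"os"
	"sort"
)

func main() {
	// In the real program this list would be filled during the walk.
	toDelete := []string{
		"backup/old",
		"backup/old/notes.txt",
		"backup/old/photos",
		"backup/old/photos/a.jpg",
	}

	// Walking order is parents-first; deleting must be children-first.
	// Reverse lexical order puts every path before any of its prefixes.
	sort.Sort(sort.Reverse(sort.StringSlice(toDelete)))

	for _, p := range toDelete {
		fmt.Println("removing", p)
		if err := os.Remove(p); err != nil {
			fmt.Println(err) // a real version would handle this (here the paths don't exist)
		}
	}
}
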
Secondly, once the program has evolved and the remote encrypted copy is made, it will be inefficient, if not impossible, to check the remote tree as we do now. A local database with information about the last backup must be kept, and the local program should verify files against that database. Looking after the remote files will be the remote daemon's task.
I'm finishing this implementation in order to detect more pitfalls and learn file manipulation in Go. When it's done I will start working on the database.

Friday, November 2, 2012

About Peerbackup

Almost forgot about it. What's this blog about?
I'm writing a program to back up files remotely between peers. The idea is simple: I want to back up my files automatically to some remote place, and I don't want to rely on private companies to store them.
Here's how it should work: a group of people install the program and agree to dedicate an amount of hard drive space to storing each other's encrypted files. The program runs in the background, as a daemon, detecting changes in the original files and synchronizing them to the remote backup.
So, I need to learn about communications, cryptography and filesystems.
The project is free software and lives on GitHub (https://github.com/libercv/peerbackup).
Don't expect it to be useful in the near future. It's a learning project, so I'm sure I will make lots of silly mistakes.
Of course, help and guidance would be much appreciated.

Thursday, November 1, 2012

Two week goal

A friend of mine told me last week that his new job uses a two-week planning system: they set goals to be achieved in two weeks and, when those two weeks have passed, they repeat the process.
I think it's a good way to start, so I'll set the first two-week goal.
Deadline: Nov 18th 2012.
Achievement: Synchronize files between two directories

 Let's go!