Wednesday, January 7, 2015

Defunct

This project is officially dead. I haven't got the time to work on it and it doesn't look like this will change in the foreseeable future.

Sunday, March 2, 2014

Show me your data structures

I've added two data structures to the project, located in the "doc" directory. The first of them, "settings.json", contains general settings for the program: folders and files to back up.
The second, "network.json", describes each group of peers.
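I won't copy the real files here, but as a rough illustration of the shape they take, something like these Go structs could map onto them (all field names below are illustrative guesses of mine, not the actual contents of the files in "doc"):

```go
package config

// Settings sketches what "settings.json" holds: what to back up.
// Hypothetical field names, not the real file format.
type Settings struct {
	Folders []string `json:"folders"` // directories to back up
	Files   []string `json:"files"`   // individual files to back up
}

// Peer is a single node in a backup group.
type Peer struct {
	Name    string `json:"name"`
	Address string `json:"address"` // how to reach the peer's daemon
}

// Group sketches one entry of "network.json": a group of peers.
// The salt phrase comes from the requirements post below.
type Group struct {
	Name  string `json:"name"`
	Salt  string `json:"salt"`
	Peers []Peer `json:"peers"`
}
```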
The idea now is to add some more data structures: file metadata, blobs, snapshots...
Which data am I going to need? How am I going to structure the program?

Saturday, February 8, 2014

Storage

Storing a file in its current state is easy. Take small portions of the file, calculate each portion's hash, compress and encrypt it, and save every resulting piece using the hash as its name.
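A minimal sketch of that loop in Go, assuming fixed-size chunks and leaving encryption out (it would sit between compression and the final write); the chunk size is an arbitrary placeholder:

```go
package main

import (
	"bytes"
	"compress/gzip"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"io"
	"os"
	"path/filepath"
)

const chunkSize = 64 * 1024 // arbitrary; a real tool would tune this

// storeChunks splits a file into fixed-size pieces, hashes each piece,
// compresses it with gzip, and writes it into dir under its hash name.
// Encryption is omitted; it would go between compression and the write.
func storeChunks(path, dir string) ([]string, error) {
	f, err := os.Open(path)
	if err != nil {
		return nil, err
	}
	defer f.Close()

	var hashes []string
	buf := make([]byte, chunkSize)
	for {
		n, err := io.ReadFull(f, buf)
		if n > 0 {
			sum := sha256.Sum256(buf[:n])
			name := hex.EncodeToString(sum[:])

			var out bytes.Buffer
			zw := gzip.NewWriter(&out)
			zw.Write(buf[:n])
			zw.Close()

			if werr := os.WriteFile(filepath.Join(dir, name), out.Bytes(), 0600); werr != nil {
				return nil, werr
			}
			hashes = append(hashes, name)
		}
		if err == io.EOF || err == io.ErrUnexpectedEOF {
			break // last (possibly partial) chunk handled above
		}
		if err != nil {
			return nil, err
		}
	}
	return hashes, nil
}

func main() {
	hashes, err := storeChunks(os.Args[0], ".")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println(len(hashes), "chunks stored")
}
```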
What about its metadata? You also want to store two other structures: an ordered list of chunks (or blocks, or blobs) needed to restore the contents, and the metadata itself (name, creation time, permissions...).
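In Go terms, those two structures could be as small as this (the names are mine, not the project's):

```go
package backup

import "time"

// Manifest is the ordered list of chunk hashes needed to rebuild a file.
type Manifest struct {
	Chunks []string // chunk hashes, in file order
}

// FileMeta is the filesystem metadata stored alongside the manifest.
type FileMeta struct {
	Name    string
	Created time.Time
	Mode    uint32 // permissions
}
```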
This leads to at least three types of object, in two categories.
First category: contents. We store them remotely and don't use them again until we have to restore the file.
Second category: metadata. In addition to storing them remotely we need to keep metadata locally in order to detect changes to the filesystem and keep track of these changes.
It seems that a clear architecture separation can be made. Backup logic doesn't need to know about storage and vice versa.
Groups, replication and communication with peers should be done by the storage subsystem.
So... first decision made. Create a storage subsystem. It could be a program on its own. It should only know about blobs and locations. You should be able to configure groups of peers in it and then tell it to store a blob in a group. Should restoration be done through it too? Not really sure, but it seems just fine: the same subsystem could be asked to get a blob and would know where to find it. Not sure yet if the name of the blob should be enough or if the group that stores it should be specified. Shouldn't all that group stuff be invisible to the backup subsystem? I think so, but it'll require further consideration.
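Either way, the interface could stay tiny. A hypothetical sketch (whether the group belongs in it is exactly the open question above, so here it appears as an explicit parameter):

```go
package storage

// BlobStore knows only about blobs and locations. Method names are
// invented for illustration; dropping the group parameter would make
// groups invisible to the backup subsystem.
type BlobStore interface {
	Put(group, hash string, data []byte) error
	Get(group, hash string) ([]byte, error)
}
```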
It would be nice to explore the storage subsystem used by Camlistore. Could I develop a storage server as a backend that could be plugged into Camlistore? A local storage subsystem could be developed too. Or maybe I could just use Camlistore's storage subsystem?
Maybe many instances of the storage subsystem could exist, each of them associated with a group of peers. The outcome is that the storage subsystem wouldn't know anything about groups; the backup logic would have to handle that.
As I said, I need to think about it a little bit longer.

Friday, February 7, 2014

Requirements

The Wheel of Time turns, and Ages come and pass, leaving memories that become legend. Legend fades to myth, and even myth is long forgotten when I have time to work on peerbackup again.


Enough said.


Let's suppose that software capable of backing your documents up in a distributed and encrypted manner existed. What would the requirements for such a program be?

Backups and snapshots
You want your data copied. And you want to be able to return to any point in the past. You should be presented with a list of changes made through time and be able to choose any point in that history, whether for the whole of your data, a branch of your directory tree, or a single file.
If your whole disk crashes you want to recover all the data and the history of changes.
You must be able to restore data in an alternative location so you won't overwrite existing versions.

Peer to peer and encryption
Data and metadata must be encrypted, accessible only to the person who owns them. They must be distributed in order to achieve greater protection against hardware failures, just like RAID does.
Also, you want to choose your peers: only people you really trust. Not because they would try to access your data (remember, it's encrypted), but because you can phone them if their node is down. You also want to benefit from deduplication.
You could create one group with your family (lots of photographs would benefit from deduplication) and join a different one with your friends. Every group can define a salt phrase and every directory can be assigned to a group.
You must be able to add people to a group in a similar way to BitTorrent Sync.
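I haven't pinned down here what the salt phrase is for; my reading is that it scopes deduplication to the group, along the lines of this sketch, where identical chunks get the same name inside a group but not across groups:

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// chunkName derives a chunk's storage name from the group's salt phrase:
// identical chunks deduplicate inside a group, but their names are not
// comparable across groups (or guessable by outsiders with known content).
func chunkName(salt string, chunk []byte) string {
	h := sha256.New()
	h.Write([]byte(salt))
	h.Write(chunk)
	return hex.EncodeToString(h.Sum(nil))
}

func main() {
	photo := []byte("...the same photo bytes...")
	fmt.Println(chunkName("family salt phrase", photo))
	fmt.Println(chunkName("friends salt phrase", photo)) // a different name
}
```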

Interface and communication
The program must run as a daemon in order to maintain communication with other nodes, and it must be able to talk to different interfaces: web, command line, or a native program. A simple protocol must be defined and used.
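That protocol isn't defined yet; one way it could start out is newline-delimited JSON over a local socket, with messages along these (invented) lines:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Request is a message from an interface (web, CLI, native) to the daemon.
type Request struct {
	Command string   `json:"command"` // e.g. "backup", "restore", "status"
	Args    []string `json:"args"`
}

// Response is the daemon's reply.
type Response struct {
	OK    bool   `json:"ok"`
	Error string `json:"error,omitempty"`
}

func main() {
	req, _ := json.Marshal(Request{Command: "status"})
	fmt.Println(string(req)) // {"command":"status","args":null}
}
```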

At least now I know what I want to create. The following posts will describe each topic of this post in greater detail. No coding until the key requirements are set.

Wednesday, October 2, 2013

Hell of a year

It's been a hell of a year. Personal and professional circumstances have converged and prevented me from working on peerbackup.
What can I say? It's still vaporware.
I've recently discovered a great project from which I intend to borrow code and ideas. Its name is Camlistore, and some of the concepts it uses are going to be extremely useful in peerbackup.
Next step: store metadata using sqlite, then try to make a backup and restore it.
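What "metadata in sqlite" could look like, as a rough sketch using the common mattn/go-sqlite3 driver (the schema below is a guess, not the actual plan):

```go
package main

import (
	"database/sql"
	"log"

	_ "github.com/mattn/go-sqlite3" // sqlite driver, registered as "sqlite3"
)

func main() {
	db, err := sql.Open("sqlite3", "peerbackup.db")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// One row per backed-up file: enough to detect changes by size/mtime
	// and to find the manifest (chunk list) by hash.
	_, err = db.Exec(`CREATE TABLE IF NOT EXISTS files (
		path     TEXT PRIMARY KEY,
		size     INTEGER,
		mtime    INTEGER,
		manifest TEXT  -- hash of the chunk-list blob
	)`)
	if err != nil {
		log.Fatal(err)
	}
}
```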
Lots of ideas and so little time.

Sunday, May 26, 2013

Quick and dirty tests

I'm starting to test hash (sha256 and adler32) and compression (gzip and lzma) algorithms. Right now the program breaks its own executable file into pieces and writes them compressed with gzip, using their sha256 hash as a name.
Not much, but a beginning.
Now I'm thinking about how metadata should be stored, and how to detect new files and modifications inside files already backed up.
Next, I'll try to back a file up, storing its metadata, and restore it in another directory.
In order to detect changes in a file already backed up I need a rolling checksum algorithm. I've found an implementation and asked for permission to use it, giving proper credit, of course.
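I won't reproduce that implementation here, but for reference, the general technique (a rolling variant of Adler-32, in the spirit of rsync's weak checksum) looks roughly like this sketch of mine:

```go
package main

import "fmt"

const mod = 65521 // largest prime below 2^16, as in Adler-32

// rolling keeps the two Adler-32 halves plus the window size, so the
// checksum can slide along a file one byte at a time.
type rolling struct {
	a, b uint32
	n    uint32 // window size in bytes
}

// newRolling computes the checksum of the initial window.
func newRolling(window []byte) *rolling {
	r := &rolling{a: 1, n: uint32(len(window))}
	for _, c := range window {
		r.a = (r.a + uint32(c)) % mod
		r.b = (r.b + r.a) % mod
	}
	return r
}

// roll slides the window one byte: drop out, take in in. Adding mod
// before each subtraction keeps the unsigned arithmetic from underflowing.
func (r *rolling) roll(out, in byte) {
	r.a = (r.a + mod - uint32(out) + uint32(in)) % mod
	r.b = (r.b + mod - (r.n*uint32(out))%mod + r.a + mod - 1) % mod
}

func (r *rolling) sum() uint32 { return r.b<<16 | r.a }

func main() {
	data := []byte("abcdefgh")
	r := newRolling(data[:4])
	r.roll(data[0], data[4]) // window is now data[1:5]
	fmt.Println(r.sum() == newRolling(data[1:5]).sum()) // true
}
```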

Sunday, May 19, 2013

Lowering expectations

It's time to admit that peerbackup will never be anything other than vaporware blogware if I don't lower my expectations about it (and start producing some actual code).
The best way of getting a first version done might be simplifying its network requirements. In theory it would be great to have a truly decentralized and distributed backup system, but in real life a reduced version of it might work just as well.
To put it bluntly: node management won't be automatic. A node will add another peer manually. The number of nodes will remain low and known. I'm thinking about a scenario where a group of people (friends, family, co-workers) agree to set up a backup network.
It makes the project infinitely more boring. No real p2p, no real anonymity. 
What remains then? Once the network is set up, a distributed encrypted backup. It will be more similar to an array of disks (or a RAID) than to a bittorrent network.
It may have some benefits, though. First of all, you should be able to tell when a node is down and warn its owner. It also solves the problem of trusting unknown nodes: you know where your data is.
But truth be told, despite some potential benefits, the decision is to sacrifice functionality in order to get anything at all done. If it works and I get it done, there's always the possibility of a better second version.