Sunday, March 2, 2014

Show me your data structures

I've added two data structures to the project, located in the "doc" directory. The first of them, "settings.json", contains general settings for the program: the folders and files to back up.
The second, "network.json", describes information about each group of peers.
The idea now is to add some more data structures: file metadata, blobs, snapshots...
Which data am I going to need? How am I going to structure the program?

Saturday, February 8, 2014

Storage

Storing a file in its current state is easy. Split the file into small chunks, calculate each chunk's hash, compress and encrypt the chunk, and save every resulting piece using the hash as its name.
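A minimal sketch of that chunking step in Python. The function name, the fixed chunk size and the choice of SHA-256 and zlib are my assumptions, not decisions already made for peerbackup. Encryption is passed in as a callable because the standard library has no cipher; the sketch only shows where it would plug in.

```python
import hashlib
import zlib
from pathlib import Path
from typing import Callable, List

CHUNK_SIZE = 64 * 1024  # hypothetical chunk size, not a project decision


def store_file(source: Path, blob_dir: Path,
               encrypt: Callable[[bytes], bytes],
               chunk_size: int = CHUNK_SIZE) -> List[str]:
    """Split `source` into chunks and save each one under its hash.

    Returns the ordered list of chunk names needed to rebuild the file.
    """
    blob_dir.mkdir(parents=True, exist_ok=True)
    names: List[str] = []
    with source.open("rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            # The name is the hash of the plain chunk, so identical
            # chunks map to the same blob and are written only once.
            name = hashlib.sha256(chunk).hexdigest()
            target = blob_dir / name
            if not target.exists():
                target.write_bytes(encrypt(zlib.compress(chunk)))
            names.append(name)
    return names
```

The returned list is the "ordered list of chunks" needed later to restore the contents.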
What about its metadata? You also want to store two other structures: an ordered list of chunks (or blocks, or blobs) in order to restore the contents, and the metadata itself (name, creation time, permissions...).
This leads to, at least, three types of object and two categories.
First category: contents. We store them remotely and don't use them again until we have to restore the file.
Second category: metadata. In addition to storing them remotely we need to keep metadata locally in order to detect changes to the filesystem and keep track of these changes.
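To make the "detect changes" part concrete, here is a rough Python sketch that compares a locally kept metadata index with a fresh scan of the filesystem. The FileMeta fields and the scan and diff helpers are hypothetical; which attributes peerbackup will really track is still open.

```python
import os
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class FileMeta:
    """Hypothetical subset of the metadata kept locally for each file."""
    size: int
    mtime_ns: int
    mode: int


def scan(root: str) -> Dict[str, FileMeta]:
    """Walk `root` and collect metadata keyed by path relative to it."""
    result: Dict[str, FileMeta] = {}
    for dirpath, _dirs, files in os.walk(root):
        for fname in files:
            path = os.path.join(dirpath, fname)
            st = os.stat(path)
            rel = os.path.relpath(path, root)
            result[rel] = FileMeta(st.st_size, st.st_mtime_ns, st.st_mode)
    return result


def diff(old: Dict[str, FileMeta],
         new: Dict[str, FileMeta]) -> Dict[str, List[str]]:
    """Compare two scans and report what changed between them."""
    return {
        "added": sorted(new.keys() - old.keys()),
        "removed": sorted(old.keys() - new.keys()),
        "modified": sorted(p for p in new.keys() & old.keys()
                           if new[p] != old[p]),
    }
```

Only the files reported as added or modified would need to be chunked and sent to storage again.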
It seems that a clear architecture separation can be made. Backup logic doesn't need to know about storage and vice versa.
Groups, replication and communication with peers should be done by the storage subsystem.
So... first decision made: create a storage subsystem. It could be a program on its own. It should only know about blobs and locations. You should be able to configure groups of peers in it and then tell it to store a blob in a group. Should restoration be done through it too? Not really sure, but it seems fine. The same subsystem could be asked to get a blob and would know where to find it. I'm not sure yet whether the name of the blob should be enough or whether the group that stores it should be specified. Shouldn't all that group stuff be invisible to the backup subsystem? I think so, but it'll require further consideration.
It would be nice to explore the storage subsystem used by camlistore. Could I develop a storage server as a backend that could be plugged into camlistore? A local storage subsystem could be developed too. Or maybe I could use the camlistore storage subsystem?
Maybe many instances of the storage subsystem could exist, each one associated with a group of peers. In that case the storage subsystem wouldn't know anything about groups, and the backup subsystem would have to handle them.
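A small Python sketch of that last variant: a storage interface that only knows about blobs, with one instance per group chosen by the backup side. BlobStore, MemoryBlobStore and the group names are made up for illustration; a real store would talk to peers instead of keeping a dictionary in memory.

```python
import hashlib
from typing import Dict, Protocol


class BlobStore(Protocol):
    """Hypothetical interface: the store only knows blobs and their names."""

    def put(self, data: bytes) -> str: ...

    def get(self, name: str) -> bytes: ...


class MemoryBlobStore:
    """Stand-in for a store that would replicate blobs to a group of peers."""

    def __init__(self) -> None:
        self._blobs: Dict[str, bytes] = {}

    def put(self, data: bytes) -> str:
        # The blob is named after its own hash.
        name = hashlib.sha256(data).hexdigest()
        self._blobs[name] = data
        return name

    def get(self, name: str) -> bytes:
        return self._blobs[name]


# One store instance per group; the backup side picks the instance,
# so the store itself never sees the concept of a group.
stores: Dict[str, BlobStore] = {
    "family": MemoryBlobStore(),
    "friends": MemoryBlobStore(),
}
```

With this layout the blob name alone is enough inside a store, but the backup subsystem has to remember which group holds each blob.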
As I said, I need to think about it a little bit longer.

Friday, February 7, 2014

Requirements

The Wheel of Time turns, and Ages come and pass, leaving memories that become legend. Legend fades to myth, and even myth is long forgotten when I have time to work on peerbackup again.


Enough said.


Let's suppose that software capable of backing up your documents in a distributed and encrypted manner existed. What would be the requirements for such a program?

Backups and snapshots
You want your data copied. And you want to be able to return to any point in the past. You should be presented with a list of changes made over time and be able to choose any point in that history, either for the whole of your data, a branch in your directory tree or a single file.
If your whole disk crashes you want to recover all the data and the history of changes.
You must be able to restore data in an alternative location so you won't overwrite existing versions.

Peer to peer and encryption
Data and metadata must be encrypted, accessible only to the person who owns them. They must be distributed in order to achieve greater protection against hardware failures, just like RAID does.
Also, you want to choose your peers. You want to rely only on people you really trust. Not because they would try to access your data (remember it's encrypted), but because you can phone them if their node is down. Also, you want to benefit from deduplication.
You could create one group with your family (lots of photographs would benefit from deduplication) and join a different one with your friends. Every group can define a salt phrase and every directory can be assigned to a group. 
You must be able to add people to a group in a way similar to BitTorrent Sync.
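One possible reading of the per-group salt phrase, sketched in Python: blob names are a keyed hash of the chunk, so identical chunks inside a group get the same name (and deduplicate), while the same chunk in another group gets an unrelated name. Using HMAC-SHA256 and the blob_name helper are my assumptions; how the salt will really be applied is not decided.

```python
import hashlib
import hmac


def blob_name(group_salt: str, chunk: bytes) -> str:
    """Name a chunk with a hash keyed by the group's salt phrase."""
    key = group_salt.encode("utf-8")
    return hmac.new(key, chunk, hashlib.sha256).hexdigest()


photo = b"the same holiday photo"

# Two family members backing up the same photo produce the same name,
# so the group only needs to store the blob once.
same_group = blob_name("family salt", photo) == blob_name("family salt", photo)

# In another group the name differs, so nothing can be inferred across groups.
other_group = blob_name("family salt", photo) == blob_name("friends salt", photo)
```

Here same_group is True and other_group is False.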

Interface and communication
The program must run as a daemon in order to maintain communication with other nodes, and it must be able to communicate with different interfaces: web, command line or a native program. A simple protocol must be defined and used.

At least now I know what I want to create. The following posts will describe every topic of this post in greater detail. No coding until the key requirements are set.