Hydranode Bittorrent Module Design
Hydranode Bittorrent Module

 May 2005
Alo Sarv

Preamble

The Bittorrent protocol includes several rather unique features that go beyond the way other networks handle downloading. The most obvious, and the most complex to handle in a multi-network application, is the way Bittorrent handles files. At the protocol level there are no files; there is only one big “virtual” file called a torrent. The torrent data is the data of all files laid out end to end. Each piece of size X has its own SHA-1 hash, but since the data of all files forms a single sequence, a piece may span file boundaries. This introduces some difficulties in handling the files in a generic, multi-network client such as Hydranode.
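To illustrate how a piece can span file boundaries, the following minimal sketch maps one piece onto the files it touches. It assumes the files are known only by their sizes, listed in torrent order; the names FileSpan and pieceToFileSpans are hypothetical and are not part of Hydranode.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// One contiguous slice of a piece that lies inside a single file.
// (Hypothetical helper type, for illustration only.)
struct FileSpan {
    std::size_t fileIndex;  // position of the file in the torrent's file order
    uint64_t    fileOffset; // offset of the slice inside that file
    uint64_t    length;     // number of bytes in the slice
};

// Splits piece number `piece` into per-file slices.
// `fileSizes` lists the file sizes in torrent order.
std::vector<FileSpan> pieceToFileSpans(
    const std::vector<uint64_t> &fileSizes, uint64_t pieceSize, uint64_t piece
) {
    std::vector<FileSpan> spans;
    const uint64_t begin = piece * pieceSize; // first byte of the piece
    const uint64_t end = begin + pieceSize;   // one past its last byte
    uint64_t fileStart = 0;                   // torrent-wide start of file i
    for (std::size_t i = 0; i < fileSizes.size(); ++i) {
        const uint64_t fileEnd = fileStart + fileSizes[i];
        // Intersect [begin, end) with [fileStart, fileEnd).
        const uint64_t lo = std::max(begin, fileStart);
        const uint64_t hi = std::min(end, fileEnd);
        if (lo < hi) {
            spans.push_back(FileSpan{i, lo - fileStart, hi - lo});
        }
        fileStart = fileEnd;
    }
    return spans;
}
```

With files of 100, 50 and 200 bytes and a piece size of 64, piece 1 (bytes 64 to 127) covers the last 36 bytes of the first file and the first 28 bytes of the second, so hashing that piece needs data from both files.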


Part I - Overview



1. Hashing

The Hydranode hashing engine is designed to handle single files. PartData submits a request to hash a range in a file, or a full file; the hasher performs the work and sends back the results. However, this is not sufficient for the Bittorrent module, since a piece may span more than one file.
            Thanks to the extensible interfaces of both the PartData and Hasher APIs, the Bittorrent module can override the default hashing system and implement the required hashing methods within the module.


2. Multi-file handling, subdirs handling

Since a torrent may contain more than one file, possibly in a number of subdirectories, the implementation should make it possible for User Interfaces to display the torrent contents hierarchically. For that, we have the Object Hierarchy system (sometimes also called the “Hydranode Virtual Filesystem”). The Bittorrent module can create the required number of PartData objects, one for each file, but override the default parent (FilesList) with a custom Object-derived class, which then represents “a folder”. This has no effect on other modules that may be downloading the same files, since FilesList internal structures do not depend on the Object Hierarchy system.
            Folders in the torrent hierarchy can implement additional features, such as displaying the overall speed and completeness of the PartData objects under them.
            The top-level “folder” in the hierarchy needs to be a PartData-derived object as well, since that is where Bittorrent will handle its pieces, hashes and so on. As mentioned earlier, PartData internals rely on the fact that one download is one file. The top-level “virtual” file will therefore have a size equal to the sum of the sizes of all files in the torrent, and will override the default flushBuffer() method, forwarding the data to the actual, non-virtual files.
            Additional care must be taken to keep ChunkMaps in sync in virtual files (the top-level file, as well as sub-folders), to avoid multiple modules downloading the same data chunk. To that end, an additional signaling system should be implemented in PartData to indicate that new data has arrived, along with the location of that data. The existing solution, where EVT_PD_DATAADDED is emitted, is insufficient, since the event cannot carry reference data with it.
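A signal that carries the written range with it could look roughly like the sketch below. It is only an illustration built on the standard library: DataAddedSignal, ChildFile and VirtualFile are hypothetical names, and Hydranode's real event system and PartData interface are not shown here.

```cpp
#include <cstdint>
#include <functional>
#include <utility>
#include <vector>

// Minimal stand-in for a signal that carries the written range with it.
// (Hypothetical; not the actual Hydranode event mechanism.)
class DataAddedSignal {
public:
    using Handler = std::function<void(uint64_t offset, uint64_t length)>;
    void connect(Handler h) { m_handlers.push_back(std::move(h)); }
    void emit(uint64_t offset, uint64_t length) const {
        for (const Handler &h : m_handlers) h(offset, length);
    }
private:
    std::vector<Handler> m_handlers;
};

// Stands in for a physical file that announces where data was written.
struct ChildFile {
    DataAddedSignal dataAdded;
    void write(uint64_t offset, uint64_t length) {
        // ... the actual disk write would happen here ...
        dataAdded.emit(offset, length);
    }
};

// Stands in for the virtual parent; it records every written range,
// translated from child-relative to torrent-wide offsets.
struct VirtualFile {
    struct Range { uint64_t begin, end; };
    std::vector<Range> written;

    // `childStart` is the torrent-wide offset at which the child begins.
    void attach(ChildFile &child, uint64_t childStart) {
        child.dataAdded.connect(
            [this, childStart](uint64_t off, uint64_t len) {
                written.push_back(
                    Range{childStart + off, childStart + off + len});
            });
    }
};
```

Because the handler receives the offset and length directly, the parent can update its own map without re-scanning the child, which is what a payload-less event would force it to do.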


3. Saving and resuming

While a torrent is started by reading a .torrent file, the format of that file is insufficient for saving the overall state of the torrent. PartData has its own format for saving and loading its state, which the top-level “virtual” file can use. Since each PartData object is accompanied by a reference MetaData object, additional custom fields should be implemented in the MetaData object to store information about the torrent: most importantly, reference IDs of the PartData objects belonging to this torrent, as well as the directory structure.
            Since there is no ReferenceID in the PartData API (there was one in the early design phases, but it was removed), some other unique identifier must be used. The most obvious solution is to use the randomly generated PartData file name, which is unique (it is also used for the .tmp / .tmp.dat file names). For the directory structure, additional custom fields in the MetaData structure shall be used.
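As a rough illustration of what such a custom field might hold, the sketch below packs the list of children into a single string and reads it back. The tab-separated line format, the ChildEntry type and both function names are assumptions made for this example; the actual MetaData custom-field interface is not shown. The encoding also assumes that neither the temporary name nor the path contains a tab or a newline.

```cpp
#include <sstream>
#include <string>
#include <vector>

// One file of the torrent, as it might be remembered across restarts.
// (Hypothetical type, for illustration only.)
struct ChildEntry {
    std::string tempName;     // unique, randomly generated PartData file name
    std::string relativePath; // location of the file inside the torrent
};

// Encodes the children as one "tempName<TAB>relativePath" line each.
// The order of entries is preserved, since file order matters to Bittorrent.
std::string encodeChildren(const std::vector<ChildEntry> &children) {
    std::string out;
    for (const ChildEntry &c : children) {
        out += c.tempName + '\t' + c.relativePath + '\n';
    }
    return out;
}

// Parses a field written by encodeChildren(); malformed lines are skipped.
std::vector<ChildEntry> decodeChildren(const std::string &field) {
    std::vector<ChildEntry> children;
    std::istringstream in(field);
    std::string line;
    while (std::getline(in, line)) {
        const std::string::size_type tab = line.find('\t');
        if (tab == std::string::npos) continue;
        children.push_back(
            ChildEntry{line.substr(0, tab), line.substr(tab + 1)});
    }
    return children;
}
```

Storing the relative path next to each temporary name covers both requirements at once: the child objects can be found again, and the directory structure can be rebuilt from the paths.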


Part II - Design


1. Torrent structure and layout in memory

  Torrent
+--- File1
+--- File2
+--- SubFolder1
| +------ File3
| +------ File4
+--- File5

As is known, for each PartData object there must be a corresponding SharedFile object. This is required for partial downloads to be uploaded back to the network (if appropriate hashes are available). It does, however, introduce an additional, somewhat complex variable into the design.

Following that design principle, two objects need to be created for a torrent - TorrentFile and PartialTorrent.

TorrentFile represents a torrent, which has one or more files in it, arranged in a directory structure. Whether the torrent is complete (i.e. in seeding state) is irrelevant, just as SharedFile does not care whether it is partial. The TorrentFile object is derived from SharedFile, has its parent object set to FilesList, and overrides the virtual function read(). TorrentFile keeps a reference to the corresponding PartialTorrent object, as well as a list of the SharedFile objects contained within the torrent.

PartialTorrent keeps track of the information related to torrents in the downloading state. It is derived from PartData, and implements the virtual functions write(), doWrite() and flushBuffer(). PartialTorrent is itself a "virtual" file: it does not represent a physical file, but rather the sum of all files in the torrent. It maintains a list of child PartData objects, redirecting the implemented virtual functions to the corresponding PartData objects and scheduling chunk hashing as necessary.

The files contained in the torrent can be implemented using normal SharedFile and PartData objects, without overriding any virtual functions (thus, derivation is not needed there). It is, however, necessary to override the default parent Object for those, pointing it to the parent object in the torrent structure. This is required to allow User Interfaces to display the torrent in a properly structured hierarchy.

Since the actual location of a file in the hierarchy is not relevant at the Bittorrent protocol level (only the order of files is relevant), TorrentFile and PartialTorrent can simply keep a vector of files, without needing to implement any additional hierarchy handling; this is all taken care of by the Object hierarchy already.

An additional object, called TorrentFolder, is required for each sub-folder inside the torrent. This helper object could be derived from TorrentFile (or PartialTorrent, at the PartData level), but it does not need to keep any hashes or perform any data forwarding, since that is already handled by the top-level folder. Its only purpose is to serve as the parent for lower-level SharedFile / PartData objects. If deriving from TorrentFile proves to be too expensive in terms of memory, or other similar issues arise during implementation, any other custom Object-derived class is sufficient to fill the place. However, using TorrentFile/PartialTorrent as the parent would likely reduce the amount of duplicate code needed to keep track of the children's overall download rate, size, completed chunks, and so on, since those are already implemented by the SharedFile/PartData classes.

With all this in place, the above structure can be visualized like this:
TorrentFile
|
+--- SharedFile "File1"
|    +---- PartData "File1"
+--- SharedFile "File2"
|    +---- PartData "File2"
+--- TorrentFolder
|    +---- SharedFile "File3"
|    |     +----- PartData "File3"
|    +---- SharedFile "File4"
|          +----- PartData "File4"
+--- SharedFile "File5"
     +---- PartData "File5"
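
The aggregation that a folder object could perform over such a hierarchy can be sketched as follows. This is a simplified model, not the Hydranode Object class: Node, FileNode and FolderNode are hypothetical stand-ins, and only size and completed bytes are tracked.

```cpp
#include <cstdint>
#include <memory>
#include <utility>
#include <vector>

// Common interface for anything that can appear in the torrent hierarchy.
// (Hypothetical; stands in for the Object-derived classes.)
struct Node {
    virtual ~Node() = default;
    virtual uint64_t size() const = 0;      // total bytes
    virtual uint64_t completed() const = 0; // bytes downloaded so far
};

// Leaf: stands in for a SharedFile / PartData pair.
struct FileNode : Node {
    FileNode(uint64_t sz, uint64_t done) : m_size(sz), m_completed(done) {}
    uint64_t size() const override { return m_size; }
    uint64_t completed() const override { return m_completed; }
    uint64_t m_size;
    uint64_t m_completed;
};

// Folder: stands in for TorrentFolder; reports the sums over its children,
// which may themselves be folders.
struct FolderNode : Node {
    void add(std::unique_ptr<Node> child) {
        m_children.push_back(std::move(child));
    }
    uint64_t size() const override {
        uint64_t sum = 0;
        for (const auto &c : m_children) sum += c->size();
        return sum;
    }
    uint64_t completed() const override {
        uint64_t sum = 0;
        for (const auto &c : m_children) sum += c->completed();
        return sum;
    }
    std::vector<std::unique_ptr<Node>> m_children;
};
```

Since a folder only sums what its children report, nested sub-folders need no special handling; a User Interface can ask any level of the tree for its totals.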

Message passing:

» Reading
TorrentFile::read()
|
V
Look up correct SharedFile (based on read start offset)
|
V
SharedFile::read() (reads actual data from disk, returning it)
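
The lookup step in the read path could be implemented as a binary search over the start offsets of the children. The sketch below assumes the torrent keeps a sorted vector of torrent-wide start offsets, one per file in torrent order; ChildPos and locate are hypothetical names.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Result of a lookup: which child holds the offset, and where inside it.
// (Hypothetical helper type, for illustration only.)
struct ChildPos {
    std::size_t index;  // position of the child in torrent order
    uint64_t    offset; // offset relative to the start of that child
};

// `starts[i]` is the torrent-wide offset at which child i begins; the vector
// is sorted and starts[0] == 0. `total` is the size of the whole torrent.
ChildPos locate(
    const std::vector<uint64_t> &starts, uint64_t total, uint64_t offset
) {
    if (starts.empty() || offset >= total) {
        throw std::out_of_range("offset outside torrent");
    }
    // Find the first child that starts strictly after `offset`; the child
    // just before it is the one containing `offset`.
    auto it = std::upper_bound(starts.begin(), starts.end(), offset);
    const std::size_t index =
        static_cast<std::size_t>(it - starts.begin()) - 1;
    return ChildPos{index, offset - starts[index]};
}
```

A read that extends past the end of the located child would have to continue in the next child, so a full read() implementation would repeat this lookup, or simply walk forward, until the requested range is exhausted.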

» Writing
PartialTorrent::write() (called by driver code)
|
V
Look up correct PartData (based on write start offset)
|
V
PartData::write() (writes actual data, possibly calling flushBuffer())
|
V
signal(dataWritten) (signals back up that data has been added)
|
V
PartialTorrent::onChildDataWritten() (updates internal chunkmaps)
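
The last step of the write path, updating the chunk maps, might be sketched like this. The class below is a deliberately simplified stand-in for the real ChunkMap: it only counts bytes per chunk and assumes that written ranges never overlap. ChunkProgress and its members are hypothetical names.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Tracks how many bytes of each chunk have arrived.
// (Hypothetical; a simplified stand-in for the real chunk map.)
class ChunkProgress {
public:
    ChunkProgress(uint64_t totalSize, uint64_t chunkSize)
        : m_total(totalSize), m_chunkSize(chunkSize),
          m_filled((totalSize + chunkSize - 1) / chunkSize, 0) {}

    // Records the range [begin, end), given in torrent-wide offsets.
    // Assumes ranges passed in never overlap each other.
    void addRange(uint64_t begin, uint64_t end) {
        end = std::min(end, m_total);
        while (begin < end) {
            const uint64_t chunk = begin / m_chunkSize;
            const uint64_t chunkEnd =
                std::min((chunk + 1) * m_chunkSize, m_total);
            const uint64_t upTo = std::min(end, chunkEnd);
            m_filled[chunk] += upTo - begin;
            begin = upTo;
        }
    }

    // True once every byte of the chunk has been recorded, i.e. the chunk
    // could now be scheduled for hashing. The last chunk may be shorter.
    bool isComplete(uint64_t chunk) const {
        const uint64_t start = chunk * m_chunkSize;
        const uint64_t len = std::min(m_chunkSize, m_total - start);
        return m_filled[chunk] == len;
    }

private:
    uint64_t m_total;
    uint64_t m_chunkSize;
    std::vector<uint64_t> m_filled;
};
```

The handler would translate the child-relative offset into a torrent-wide one before calling addRange(), then check isComplete() for the affected chunks to decide whether hashing can be scheduled.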


2. Synopsis

// All Bittorrent module classes reside in this namespace
namespace BT {

/**
* Base class implements module initialization/destruction,
* configuration storage, as well as additional features required
* by implementation.
*/
class BTBase : public ModuleBase {
public:
/**
* Called on module initialization
*
* @returns True on successful startup, false otherwise
*/
virtual bool onInit();

/**
* Called when module is unloaded and/or app is exiting
*
* @returns 0 on successful exit, nonzero exit code otherwise
*/
virtual int onExit();

/**
* Creates a new torrent out of the files
*
* @param files Files to create the torrent of
* @returns The newly-created torrent object
*
* \remarks The files are hashed asynchronously, after this function
* returns, thus the returned object may not be fully usable
* before hashing is finished.
*/
TorrentFile* createTorrent(
const std::vector<boost::filesystem::path> &files
);

/**
* Starts new torrent download
*
* @param refData The contents of .torrent file, containing reference
* data for this torrent
* @returns The newly-created torrent download
*
* \note This method also takes care of registering the created objects
* at FilesList and/or additional locations.
*/
TorrentFile* downloadTorrent(const std::string &refData);
};

/**
* TorrentFile represents one single torrent, which may be either "seeded" or
* "leeched", i.e. complete or partial in Hydranode terms.
*
* Upon construction, this object should be registered with the FilesList class.
* Although it acts as a "virtual" file (there is no single underlying file, there
* may be more than one), various listings (in User Interfaces) may still want
* to see this object.
*
* Since this is a "virtual" file, all I/O calls are forwarded to the actual
* implementation object of type SharedFile.
*/
class TorrentFile : public SharedFile {
public:
/**
* Creates a "seeded" torrent, out of the files listed
*
* @param files Files to create the torrent of
*
* \throws std::runtime_error if any of the files isn't "readable"
* \note All files will be hashed (asynchronously) before this torrent
* will be available for publishing
*/
TorrentFile(const std::vector<boost::filesystem::path> &files);

/**
* Overrides default "read" method, since this is a "virtual" file.
* Forwards the call to the corresponding SharedFile object (listed in
* m_children member), based on the begin offset
*
* @param begin Offset at which to begin reading
* @param end Offset at which to stop reading
* @returns The data read
*
* \throws Utils::ReadError if reading fails for any reason
*/
virtual std::string read(uint64_t begin, uint64_t end);
private:
//! Shall contain pointers to all child objects
implementation_defined<SharedFile*> m_children;
};

/**
* PartialTorrent object handles everything related to downloading a torrent.
* PartialTorrent may never exist without a corresponding TorrentFile parent
* object. Since this is a "virtual" file, all I/O calls are forwarded to the
* actual implementation objects of type PartData.
*/
class PartialTorrent : public PartData {
public:
/**
* Create a "leeched" torrent to be downloaded. The torrent information
* is read from the passed parent pointer.
*
* @param parent Parent torrent object
*/
PartialTorrent(TorrentFile *parent);
protected:
/**
* Since this is a "virtual" file, and the physical files are found in
* m_children member, this method forwards the call to the correct child
* object(s), based on the start offset.
*
* @param offset Start offset for data
* @param data Data to be written
*
* \throws std::runtime_error if something goes wrong
*/
virtual void write(uint64_t offset, const std::string &data);

private:
//! Shall contain pointers to all child objects
implementation_defined<PartData*> m_children;

/**
* Handles dataAdded() signals from child objects, and updates internal
* maps accordingly.
*
* @param obj PartData to which data was added
* @param offset Begin offset where data was written
* @param len Length of data that was written
*/
void onDataAdded(PartData *obj, uint64_t offset, uint64_t len);

/**
* Handler for addSourceMask() signals from child objects; updates
* m_chunks member in this object accordingly.
*
* @param obj PartData that emitted the signal
* @param chunkSize Size of a chunk
* @param chunks Boolean vector containing "true" value for each
* chunk the client has, and "false" otherwise.
*/
void onSourceMaskAdded(
PartData *obj, uint32_t chunkSize,
const std::vector<bool> &chunks
);
};

} // end namespace BT

Appendix A - Required changes to the Engine