Alo Sarv
lead developer

Developer's Diary

Wednesday, October 12, 2005

Introducing BT chunk-cacher engine

"You blow up one sun, and everyone expects you to walk on water." - Samantha Carter, "StarGate SG-1"

Well, boys and girls, I'm glad to announce that I solved all of the problems mentioned in last night's blog post. Here are the gory details:

The fundamental problem was that we need access to all data in chunks crossing file borders, even if the files themselves are no longer present. Hence, the basic idea is that we simply duplicate all downloaded data into a so-called "cache", from where we can access the data later on if we need it. Now, I thought long and hard about what exactly the cache file format should be, as well as how much data we are going to cache at all.

As it turns out, the best solution for the file format is to create one (or two - see below) physical files for each chunk that crosses file boundaries. In reality, it's slightly more complex. The simple case is when a chunk starts in file1 and ends in file2 - in that case, we need to cache the last bytes of file1 (into one cachefile), and the first bytes of file2 (into a second cachefile). It gets more interesting, however, when the chunk crosses multiple files - in that case, we need to cache the ending of the file where the chunk began, all intermediate files, and the beginning of the file where the chunk ends.
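The splitting rule above can be sketched in a few lines. This is a minimal illustration, not the actual HydraNode code: all names (`CacheRange`, `cacheRanges`) are mine, and it only computes *which* per-file byte ranges would need caching for one chunk, given the file sizes and the chunk's byte range within the torrent.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// One byte range, inside one file, that must be duplicated into the cache.
struct CacheRange {
    std::size_t file;          // index of the file within the torrent
    std::uint64_t begin, end;  // byte range [begin, end) inside that file
};

// For the chunk [chunkBegin, chunkEnd) in the torrent's contiguous byte
// space, return the per-file ranges to cache. A chunk that lies entirely
// inside a single file crosses no boundary and needs no caching at all.
std::vector<CacheRange> cacheRanges(const std::vector<std::uint64_t>& fileSizes,
                                    std::uint64_t chunkBegin,
                                    std::uint64_t chunkEnd) {
    std::vector<CacheRange> touched;
    std::uint64_t fileBegin = 0;
    for (std::size_t i = 0; i < fileSizes.size(); ++i) {
        std::uint64_t fileEnd = fileBegin + fileSizes[i];
        std::uint64_t lo = std::max(chunkBegin, fileBegin);
        std::uint64_t hi = std::min(chunkEnd, fileEnd);
        if (lo < hi)  // the chunk overlaps this file
            touched.push_back({i, lo - fileBegin, hi - fileBegin});
        fileBegin = fileEnd;
    }
    if (touched.size() < 2)  // no boundary crossed - nothing to cache
        touched.clear();
    return touched;
}
```

Note how the multi-file case falls out for free: the tail of the first file, every intermediate file in full, and the head of the last file are just the successive overlaps.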

Why this specific format? TorrentHasher (a custom hasher object for hashing data across file boundaries) accepts a list of paths to files as input. With this style of cache we can feed TorrentHasher a mixed set of real files and cache files, and it works seamlessly.
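The reason mixed input "just works" is that a streaming hasher consumes its inputs as one contiguous byte sequence, so it cannot tell whether a given range came from a real file or a cache file. A toy illustration, using in-memory buffers instead of paths and FNV-1a as a stand-in for the real hashing (the actual TorrentHasher and its hash algorithm live in hncore/bt/):

```cpp
#include <cstdint>
#include <vector>

// Digest a list of buffers as one contiguous stream (FNV-1a, 64-bit).
// Splitting the same bytes differently across buffers - e.g. one real file
// versus a cache-tail plus cache-head pair - yields the same digest.
std::uint64_t digestPieces(
        const std::vector<std::vector<unsigned char>>& pieces) {
    std::uint64_t h = 14695981039346656037ULL;  // FNV-1a offset basis
    for (const auto& p : pieces)
        for (unsigned char c : p) {
            h ^= c;
            h *= 1099511628211ULL;              // FNV-1a prime
        }
    return h;
}
```

So `digestPieces({{1,2,3,4}})` equals `digestPieces({{1,2},{3,4}})` - exactly the property that lets real files and cache files be interchanged in the hasher's input list.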

It doesn't end here, of course - the data that was written to the cache must also be checksummed, each chunk against the corresponding hash. That was also kind of tricky to get right, but it works correctly now: if the real-data hash fails, the cache hash also fails, and if the real-data hash succeeds, the cache hash also succeeds. The middle cases aren't handled right now; however, if one succeeds and the other fails, that can only mean a code error, a hardware error, or running out of disk space, so I have mixed feelings on how (if at all) to handle this scenario.

How much disk space does this cache system take? Well, the worst-case scenario is the same amount as the entire torrent, but that is an extreme case (it happens when all files in the torrent are smaller than chunksize * 2, so all chunks cross file boundaries and need to be cached). The average case is much, much lower though - preliminary testing showed cache sizes ranging from 0% to 2% of the total size of the torrent, depending on the chunksize vs average-filesize ratio.

On an unrelated side note, I also fixed a bug introduced by the delayed disk-allocation mechanism added a few weeks ago. Namely, if the chunk size was smaller than PartData's internal file buffer size (500kb by default), the first chunkhash verification was performed after the allocation (correct), but before PartData was able to flush the data to disk (the first flush is done after allocation is finished). This didn't affect ed2k (where the chunksize is 9500kb), but it caused the first chunk of each file to be corrupted in BT when the chunksize was smaller than 500kb. This no longer happens - verification jobs are delayed until both the allocation and the flushing are finished. Internally in BT, this introduced a somewhat more complex situation, where the job had to wait until ALL the files that the hash job affected were finished flushing, but that works properly now as well.
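The "wait until ALL affected files have flushed" part boils down to a pending counter per hash job. A minimal sketch, with all names hypothetical (the real logic lives in the BT module's files code): each affected file decrements the counter when its flush completes, and the verification only runs when the last one reports in.

```cpp
// Hypothetical sketch of a hash job deferred until every file it touches
// has finished allocation and flushing.
class HashJob {
public:
    explicit HashJob(int filesFlushing) : m_pending(filesFlushing) {}

    // Called by each affected file once its flush completes.
    void onFlushDone() {
        if (--m_pending == 0)
            m_verified = runVerification();
    }

    bool verified() const { return m_verified; }

private:
    // Placeholder for the real chunk-hash check against the torrent hashes.
    bool runVerification() { return true; }

    int m_pending;               // affected files still flushing
    bool m_verified = false;
};
```

With a chunk spanning two files, the job constructed with `HashJob(2)` stays unverified after the first `onFlushDone()` and only verifies after the second - which is exactly what prevents hashing half-flushed data.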

For those who want to read the code to see what I'm talking about, hncore/bt/files.cpp and hncore/bt/files.h contain all the nifty stuff.

Madcat, ZzZz

"Next step - parting the Red Sea." - Samantha Carter, "StarGate SG-1"

It seems a good idea to me :)

I'm impressed by your code... very, very clean - a real exception from what I usually see in OS projects!

I'm amazed you do it all alone!

Glad you sorted it out.

Now, what I don't understand is why we cache the whole first/last chunk of a file that is surrounded by 'gaps'. Couldn't we just cache the data we need to complete the chunks in question?

I hope this funky ASCII art clears up what I mean :D
Imagine this is the first chunk of this file:
+ - - - - - +-----------------------------+
| Cached | Written to file 2 on HD |
+ - - - - - +-----------------------------+

The reason is that this guarantees we can do the "right thing" under all possible configurations. Consider what happens if the user cancels one file of a torrent download: we'd have to copy the bordering chunks to the cache. What if the user deletes a (completed) file of a torrent from the disk (possibly while we're no longer running)? We'd have to re-download the missing pieces in order to complete the bordering files. And so on.

This solution avoids a whole set of problems that could be even more complex and error-prone to handle, and the cost of this solution is still very low.

So... Madcat can make pigs fly?
Madcat can do a lot more than that: he can make an open source "Modular MultiPlatform MultiNetwork P2P Client Framework" run correctly, and really well coded. I thought that was not possible before discovering HN...

Madcat, thanks a lot for your project, it will be the best P2P app ever! ;)