"You blow up one sun, and everyone expects you to walk on water." - Samantha Carter, "StarGate SG-1"
Well, boys and girls, I'm glad to announce that I solved all of the problems mentioned in last night's blog post. Here are the gory details:
The fundamental problem was that we need access to all data in chunks that cross file borders, even if the files themselves are no longer present. Hence, the basic idea is that we simply duplicate the downloaded data for those chunks into a so-called "cache", from where we can access it later on if we need it. Now, I thought long and hard about what exactly the cache file format should be, as well as how much data we are going to cache at all.
As it turns out, the best solution for the file format is to create one (or two - see below) physical files for each chunk that crosses file boundaries. In reality, it's slightly more complex. The simple case is when a chunk starts in file1 and ends in file2 - in that case, we need to cache the last bytes of file1 (into one cache file), and the first bytes of file2 (into a second cache file). It gets more interesting, however, when the chunk crosses multiple files - in that case, we need to cache the ending of the file where the chunk begins, all intermediate files in full, and the beginning of the file where the chunk ends.
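To make the layout concrete, here is a minimal Python sketch (not the actual code from files.cpp) that works out which byte ranges would need caching. The function name `chunk_pieces` and the `(file_index, offset_in_file, length)` tuple format are made up for illustration; the only inputs assumed are the list of file sizes in torrent order and the chunk size.

```python
from itertools import accumulate

def chunk_pieces(file_sizes, chunk_size):
    """For every chunk that crosses a file boundary, list the byte ranges
    (file_index, offset_in_file, length) that would need to be cached."""
    # Absolute start offset of each file in the torrent's byte stream.
    starts = [0] + list(accumulate(file_sizes))
    total = starts[-1]
    result = {}
    for chunk_no, chunk_start in enumerate(range(0, total, chunk_size)):
        chunk_end = min(chunk_start + chunk_size, total)
        pieces = []
        for idx, size in enumerate(file_sizes):
            lo = max(chunk_start, starts[idx])
            hi = min(chunk_end, starts[idx] + size)
            if lo < hi:  # this file overlaps the chunk
                pieces.append((idx, lo - starts[idx], hi - lo))
        if len(pieces) > 1:  # chunk touches more than one file
            result[chunk_no] = pieces
    return result
```

With files of 10, 3 and 12 bytes and a chunk size of 8, only chunk 1 crosses boundaries: it needs the last 2 bytes of the first file, the whole 3-byte middle file, and the first 3 bytes of the last file.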
Why this specific format? TorrentHasher (a custom hasher object for hashing data across file boundaries) accepts a list of file paths as input. With this style of cache we can feed TorrentHasher a mixed set of real files and cache files, and it works seamlessly.
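The idea of hashing a mixed list of paths can be sketched roughly like this. It is a simplified Python stand-in, not TorrentHasher itself: `hash_across_files` is a hypothetical name, the file names in the demo are invented, and the sketch hashes each file in full, whereas a real hasher would also need offsets and lengths for the real files. The one thing taken as given is that BitTorrent chunk hashes are SHA-1.

```python
import hashlib
import os
import tempfile

def hash_across_files(paths, bufsize=64 * 1024):
    """SHA-1 over the concatenated contents of `paths`, in order."""
    h = hashlib.sha1()
    for path in paths:
        with open(path, "rb") as f:
            while block := f.read(bufsize):
                h.update(block)
    return h.digest()

# Demo: one cache file (tail of a file that is gone) plus one real file.
with tempfile.TemporaryDirectory() as tmp:
    cached = os.path.join(tmp, "chunk1-file1.cache")  # hypothetical name
    real = os.path.join(tmp, "file2.part")            # hypothetical name
    with open(cached, "wb") as f:
        f.write(b"tail of file1")
    with open(real, "wb") as f:
        f.write(b"head of file2")
    mixed_digest = hash_across_files([cached, real])

# The hasher does not care which inputs are cache files and which are real.
expected_digest = hashlib.sha1(b"tail of file1" + b"head of file2").digest()
```

Since the hasher only ever sees a byte stream, a cache file can stand in for any file that is no longer on disk.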
It doesn't end here, of course - the data that was written to cache must also be checksummed, each chunk against the corresponding hash. That was also kinda tricky to get right, but it works correctly now - if the real-data hash fails, the cache also fails, and if the real-data hash succeeds, the cache also succeeds. Mixed cases aren't handled right now; however, if one succeeds and the other fails, this can only mean a code error or a hardware error (or running out of disk space), so I have mixed feelings on how (if at all) to handle this scenario.
How much disk space does this cache system take? Well, the worst-case scenario is the same amount as the entire torrent; however, this is an extreme case (it happens when all files in the torrent are smaller than chunksize * 2, thus all chunks cross file boundaries and need to be cached). The average-case scenario is much, much lower though - preliminary testing showed cache sizes ranging from 0% to 2% of the total size of the torrent, depending on the ratio of chunksize to average filesize.
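For a rough feel of the numbers, the cache size can be estimated from the file sizes and chunk size alone. The following Python sketch assumes that a chunk is cached in full whenever a file boundary falls strictly inside it, and not at all otherwise; `cache_fraction` is a made-up helper, not something from the Hydranode code, and it expects a non-empty list of file sizes.

```python
from bisect import bisect_right
from itertools import accumulate

def cache_fraction(file_sizes, chunk_size):
    """Fraction of the torrent's bytes that sit in boundary-crossing chunks."""
    ends = list(accumulate(file_sizes))
    total = ends[-1]
    # Interior file boundaries only; the start and end of the torrent
    # never split a chunk between two files.
    boundaries = sorted({e for e in ends if 0 < e < total})
    cached = 0
    for start in range(0, total, chunk_size):
        end = min(start + chunk_size, total)
        i = bisect_right(boundaries, start)  # first boundary after start
        if i < len(boundaries) and boundaries[i] < end:
            cached += end - start  # the whole chunk goes into the cache
    return cached / total
```

Four 3-byte files with a chunk size of 4 give a fraction of 1.0 (everything is cached), while two 1000-byte files with a chunk size of 16 give 16/2000, i.e. 0.8%.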
On an unrelated side-note, I also fixed a bug caused by the delayed disk-allocation mechanism introduced a few weeks ago - namely, if the chunk size was smaller than PartData's internal file buffer size (500kb by default), the first chunkhash verification was performed after the allocation (correct), but before PartData was able to flush the data to disk (the first flush is done after allocation is finished). This didn't affect ed2k (where chunksize is 9500kb), but caused the first chunk of each file to be reported as corrupt in BT when chunksize was smaller than 500kb. This no longer happens - the verification jobs are delayed until both the allocation and the flushing are finished. Internally in BT, this introduced a somewhat more complex situation, where the job had to wait until ALL the files that the hashjob affected had finished flushing, but that works properly now as well.
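The "wait for ALL files" part boils down to something like the following Python sketch. The class `DelayedHashJob` and its methods are invented for illustration and have nothing to do with the real class names in hncore; the sketch only shows the bookkeeping, assuming each file reports once when it is both allocated and flushed.

```python
class DelayedHashJob:
    """Runs `run` only after every affected file has reported ready."""

    def __init__(self, files, run):
        self._waiting = set(files)  # files not yet allocated and flushed
        self._run = run
        self.started = False
        self._maybe_start()         # nothing to wait for? start right away

    def file_ready(self, name):
        """Call when `name` has finished both allocation and flushing."""
        self._waiting.discard(name)
        self._maybe_start()

    def _maybe_start(self):
        if not self._waiting and not self.started:
            self.started = True     # guard against running twice
            self._run()

# Demo: a hash job spanning two files starts only after both are ready.
log = []
job = DelayedHashJob(["file1", "file2"], lambda: log.append("hash"))
job.file_ready("file1")
after_first = list(log)
job.file_ready("file2")
after_second = list(log)
job.file_ready("file2")  # a repeated notification must not re-run the job
```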
For those who want to read the code to see what I'm talking about, hncore/bt/files.cpp and hncore/bt/files.h contain all the nifty stuff.
Madcat, ZzZz
"Next step - parting the Red Sea." - Samantha Carter, "StarGate SG-1"