Alo Sarv
lead developer

Developer's Diary
irc.hydranode.com/#hydranode

Thursday, December 01, 2005

Smart Data Structures - Revisited

You all know how much I love smart data structures [and dumb code], and how much I love using the Boost MultiIndex library to solve problems. ClientManager and ChunkSelector are just a few of the many places MultiIndex is used in Hydranode. Tonight, I've managed to take the concept even further, by restructuring the ChunkSelector (which, as the name says, selects chunks for downloading) to be even dumber. Bear with me as I walk through the concept.

The first thing about the ChunkSelector is to define exactly what we want it to do. The logical ordering of things would be: prefer chunks that are already partially downloaded, prefer chunks with a low use-count (ones not already being downloaded from other peers), and prefer rare chunks.
In addition to that, whatever the ChunkSelector chooses must be compared against the peer's available-chunks listing, since we cannot request chunks the peer doesn't have.

Formerly this logic was split into two functions, and the container had three different indexes: one based on availability, one on partial status, and one on use-count. The first function would get the list of partial chunks (partial == true) from the partial-status index and iterate over it, looking at use-counts and availabilities, selecting the lowest use-count/availability combination. If that failed, the second function would do a similar operation, this time walking the availability index (which, by the way, also contained completed chunks), skipping past completed chunks and looking at use-counts, selecting the least-available chunk with use-count 0, or a non-zero one if no zero-use-count chunk was found. Originally all of this occupied three functions and over 200 lines of code; a few weeks ago it was reduced to about 120 lines in two functions.

But that data structure logic was actually written about a year ago, when we were just starting out with MultiIndex and weren't aware of the library's additional features. Now, with the knowledge I've gathered during that time, I managed to come up with a better solution.

The first logical step is to add support for selecting all chunks that aren't completed:
boost::multi_index_container<
    Chunk,
    indexed_by<
        ordered_non_unique<
            composite_key<
                Chunk,
                member<Chunk, bool, &Chunk::m_complete>
            >
        >
    >
> container;
// returns all chunks with m_complete=false
container.equal_range(boost::make_tuple(false));
Now, we also don't want to select any chunks that have zero availability, since they aren't available from any client at all. However, the m_avail member is an integer, so we need to convert it to bool, using a bool Chunk::hasAvail() const { return m_avail; } helper method (it must be a const member function, since const_mem_fun is used to extract it). Thus, the container becomes:
boost::multi_index_container<
    Chunk,
    indexed_by<
        ordered_non_unique<
            composite_key<
                Chunk,
                member<Chunk, bool, &Chunk::m_complete>,
                const_mem_fun<Chunk, bool, &Chunk::hasAvail>
            >
        >
    >
> container;
container.equal_range(boost::make_tuple(false, true));
Now that we have a listing of chunks that are incomplete and available, we want to make sure that chunks with a lower use-count are selected over chunks that are already being used (i.e. downloaded) by other peers. We also want to select partial chunks over not-at-all-downloaded chunks, AND we want to select rare chunks first, so the next three key extractors logically become:
member<Chunk, uint32_t, &Chunk::m_useCnt>,
member<Chunk, bool, &Chunk::m_partial>,
member<Chunk, uint32_t, &Chunk::m_avail>
However, since the default ordering of STL (and thus MI) containers is std::less, false values sort BEFORE true values, which is not what we want for m_partial. Thankfully, all STL containers (and thus also MI) allow overriding the comparison predicate, but with composite_key we must then specify comparison predicates for ALL sub-keys. The final container thus becomes:
boost::multi_index_container<
    Chunk,
    indexed_by<
        ordered_non_unique<
            composite_key<
                Chunk,
                member<Chunk, bool, &Chunk::m_complete>,
                const_mem_fun<Chunk, bool, &Chunk::hasAvail>,
                member<Chunk, uint32_t, &Chunk::m_useCnt>,
                member<Chunk, bool, &Chunk::m_partial>,
                member<Chunk, uint32_t, &Chunk::m_avail>
            >,
            composite_key_compare<
                std::less<bool>,
                std::less<bool>,
                std::less<uint32_t>,
                std::greater<bool>,
                std::less<uint32_t>
            >
        >
    >
> container;
The result is that when we do:
container.equal_range(boost::make_tuple(false, true));
we get a listing of chunks that match all required criteria (incomplete, with nonzero availability), and they are already sorted by all the predicates we need:

First come partial chunks with zero use-count, ordered by availability. Then come non-partial chunks with zero use-count, ordered by availability. Then come partial chunks with use-count 1, ordered by availability, then non-partial chunks with use-count 1, ordered by availability, and so on. The old two functions, totaling 120+ lines of code and doing very clumsy looping, now simply become:
CMSelectIndex &idx = m_chunks->get<ID_Selector>();
std::pair<SIter, SIter> r = idx.equal_range(
    boost::make_tuple(false, true)
);
for (SIter i = r.first; i != r.second; ++i) {
    if (pred(*i)) {
        UsedRangePtr ret(new UsedRange(this, i));
        if (ret->getLock(1)) {
            return ret;
        }
    }
}
return UsedRangePtr();
On average, that loop runs fewer than 10 times (on a file with 1500 chunks); very often the chunk is selected within the first 1-2 iterations. Dumb code and smart data structures DO work a lot better than the other way around :)

An interesting side-effect of this is that the very first and very last chunks are always selected first (all other factors being equal). I'm unsure why this happens; while I could explain the first chunk being selected by the additional positional index in the actual container, which keeps the chunks in order, I have no explanation for why the very last chunk is selected as the second choice. The final code is in hncore/partdata.cpp, if anyone wants to take a look.

Special thanks to Joaquín M López Muñoz for writing the wonderful library in the first place, and for the idea of using composite_key in the ChunkSelector :)

Madcat, ZzZz



Comments:
> An interesting side-effect of this is that the very first and very last chunks are always selected first

This is actually a very good thing if it happens for all the files in a torrent (Azureus even has an option to prioritize the download of the first and last chunks of the files in a torrent). This helps you preview the files you are downloading: you can start viewing video files sooner (at least what you have already downloaded), and you can even list the contents of an archive (since they usually have the index stored at the end).
 
It may be helpful for file preview.
But it interferes with the concept of evenly/randomly distributed chunk selection to ensure file availability.
 
It might not work so well for ed2k.
I think for a short time, prioritized first-and-last chunk selection was standard in some rather early version of eMule... the effect was that files spread far slower than normal.

Everyone then requested the first or last chunk, including from somebody with the full file, regardless of whether some hundred other people already had them. Progress was finally made when everyone had the first and last chunk and "normal" random selection began. Normal distribution was slowed, exponentially with the number of connected peers (the first & last chunks got requested again and again; no one had or requested other chunks).

Might be different though, when the first chunk is only requested from the first peer and the others then recognize that the first chunk is not rare anymore and request the next least available chunk which is not the first or last one.
 
The first/last chunks are requested using this engine ONLY when all other settings are equal (e.g. all chunks have equal availability and usecount, most probably from the very first client which we contact). As soon as we start downloading the chunk from one peer, OR there are two or more peers for the file already, other variables kick in, since all chunks are no longer equal.

As for prioritizing the first/last chunks of each file in a torrent, that doesn't happen right now. This (unintended) prioritizing of first/last chunks affects the very first and very last chunks of the file, and in the case of torrents, this means the first/last pieces of the TORRENT (since a Torrent is a virtual file wrapper around the real files).

Madcat.
 
And to add to the above, the first and last pieces of a torrent are:
1. Too small for a preview - they can be as small as 128k.
2. The first file can be some nfo file rather than the actual video/audio.
All in all, it doesn't sound like a "feature" to ignore, so it's worth finding out why it acts this way. One can suppose why for the first chunk (natural order in the container), but why the last right after that?!?
GC
 