tl;dr:
Either you walk the data in memory order and switch on type, or you
group the data by type and process each type separately.
In the first case, you get branch mispredictions, which can be
expensive, and you also get higher I-cache pressure (and even D-cache
pressure) because you have to keep a bunch of different "subsystems"
(type handlers) resident in memory.
In the second case, you have to map xyz->object (and object->xyz) in
some way other than the three-dimensional array, such as a sparse array
implemented as a hash table. That will cause more cache misses when
locating the item for any particular element, but it will let you walk
elements of the same type using linear memory access, with less I- and
D-cache pressure and better branch prediction.
Which one is faster depends on whether you do mostly "update all of type X"
or do mostly "find element by xyz" operations per frame. You'll simply have
to come up with realistic workloads and profile it both ways.
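
For concreteness, here is a minimal sketch of the two options (the
tags, Cell struct, and update functions are made up for illustration,
not anyone's actual types):

    #include <vector>

    enum Tag : unsigned short { TAG_EMPTY, TAG_WATER, TAG_LAVA, TAG_COUNT };
    struct Cell { Tag tag; /* inline payload would live here */ };

    const int N = 8;
    Cell grid[N][N][N];

    void updateWater(Cell &) { /* ... */ }
    void updateLava(Cell &)  { /* ... */ }

    // Option 1: one linear sweep, one data-dependent branch per cell;
    // every type handler has to stay hot in the I-cache at once.
    void tickSwitch() {
        for (int x = 0; x < N; ++x)
            for (int y = 0; y < N; ++y)
                for (int z = 0; z < N; ++z) {
                    Cell &c = grid[x][y][z];
                    switch (c.tag) {
                        case TAG_WATER: updateWater(c); break;
                        case TAG_LAVA:  updateLava(c);  break;
                        default: break;
                    }
                }
    }

    // Option 2: one tight loop per type, no per-cell branch; the cost
    // moves into keeping the xyz <-> per-type-slot mapping up to date.
    std::vector<Cell *> byTag[TAG_COUNT];

    void tickGrouped() {
        for (Cell *c : byTag[TAG_WATER]) updateWater(*c);
        for (Cell *c : byTag[TAG_LAVA])  updateLava(*c);
    }
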
Sincerely,
jw
--
Americans might object: there is no way we would sacrifice our living
standards for the benefit of people in the rest of the world. Nevertheless,
whether we get there willingly or not, we shall soon have lower consumption
rates, because our present rates are unsustainable.
On Fri, Oct 15, 2010 at 7:11 PM, OvermindDL1 <***@gmail.com> wrote:
> On Mon, Oct 11, 2010 at 12:40 PM, Veikko Eeva <***@iki.fi> wrote:
> >
> > Jon Watte wrote:
> >>
> >> There are a few differing points of view here.
> >> Clearly, you can't super-optimize all code you write up front, because
> >> you don't know that that's going to be the part that matters.
> >> [...]
> >
> > This isn't about games, but I couldn't resist providing a link to
> > the Velocity 2010 conference presentation notes by Theo Schlossnagle
> > of Scalable Internet Architectures fame about, well, Scalable
> > Internet Architectures:
> > http://assets.en.oreilly.com/1/event/44/Scalable%20Internet%20Architectures%20Presentation%202.pdf
> >
> > After the initial hand-waving, the presentation is mostly about
> > understanding the problem in the context of performance and
> > scalability trade-offs.
> >
> >
> > Cheers,
> > Veikko E.
> >
> >
> > P.s.
> > Velocity conference 2010 Slides/Notes
> > http://www.royans.net/arch/all-velocity-conference-2010-slidesnotes/
> >
> > Continuous deployments may not be for everyone: Culture
> > http://www.royans.net/arch/continuous-deployments-may-not-be-for-everyone-culture/
>
>
> Continuing this line of questioning, this is actually my use-case.
>
> I have a multi-dimensional array, say
> typedef myType myTypeArr[SIZE][SIZE][SIZE];
> SIZE is fixed right now, but I can change it to whatever allows the
> best access. Right now, to fit properly in the L2 cache on most CPUs,
> I have SIZE set to 8, which as I recall makes it about a 24 KB block.
>
> myType itself is, in essence, a discriminated union; you can think of
> it as a tag plus its data, like this:
> struct myType {
>     unsigned short tag;
>     void *data;
> };
> (In reality it reserves enough internal space to hold the data of the
> tag for most types; others have a void* at the end of the allocated
> area for additional data, so that the most-accessed data is in the
> main area, not behind the pointer. I think this was 31 bytes in size,
> maybe 63, I would have to look, but that can be changed as well, to a
> completely different layout if necessary.)
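>
> In picture form, the layout is roughly this (sizes illustrative, not
> exact):
>
> struct myType {
>     unsigned short tag;     // built-in type id, or 0 for scripted
>     unsigned char hot[29];  // most-accessed data, held inline
>     // types whose data does not fit store a void* to the extra part
>     // at the end of the inline area instead of growing the struct
> };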
>
> The tag either references a built-in type or, if 0, a scripted type
> (which I usually plan to later make into internal types for speed
> reasons, but the scripted type will remain for third-party
> extensions). There are two 'major' passes which are called the most
> often, each of which accesses two different kinds of data inside the
> data element, so I could easily create a function to work over all of
> these in quick succession. However, there is a lot more sporadic
> access: say tag type 5 needs to have a certain function run over it
> when a certain event happens, or, more simply, every whole second. It
> is not just tag type 5; a lot of tag types will need to perform
> actions every once in a while, and usually on *all* elements of a
> certain tag type within bounds in the overall structure. I figured
> the brute-force method would be to keep a table for each possible tag
> type, and as a tag is allocated have the main structure keep its
> 'coords' along with an index into that tag type's table in another
> allocated structure, something like this:
> struct myTypeArr {
>     myType arr[SIZE][SIZE][SIZE]; // each myType holds a tag and an idx into:
>     void **internalTags[TAGCOUNT];
> };
>
> and access it like:
> myTypeArr a;
> a.internalTags[a.arr[3][6][1].tag][a.arr[3][6][1].idx]
>     // yields that element's data pointer (or location, or whatever)
> Or something nasty like that. The problem is that, while this only
> uses memory for an item when it exists (and is marked zero otherwise),
> it will still eat a *LOT* more memory than otherwise due to the
> multiple void* indirections, and of course each pointer is going to
> use 4/8 bytes of memory. Although, since the size of each tag's data
> is 'mostly' known (very few have variable sizes), that could simplify
> memory allocation somewhat.
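>
> One way to cut the pointer overhead, assuming a fixed payload size per
> tag: store a 4-byte slot index instead of a pointer, and keep one flat
> byte pool per tag (sketch, made-up names; SIZE/TAGCOUNT as above):
>
> #include <vector>
>
> enum { SIZE = 8, TAGCOUNT = 256 };  // same constants as above
>
> struct cellRef {
>     unsigned short tag;  // which pool the element lives in
>     unsigned int   idx;  // slot in that pool; 4 bytes, not a void*
> };
>
> struct tagPool {
>     unsigned elemSize;                 // fixed per tag
>     std::vector<unsigned char> bytes;  // slot i lives at [i * elemSize]
> };
>
> tagPool pools[TAGCOUNT];
> cellRef arr[SIZE][SIZE][SIZE];
>
> unsigned char *dataAt(int x, int y, int z) {
>     cellRef r = arr[x][y][z];
>     return &pools[r.tag].bytes[r.idx * pools[r.tag].elemSize];
> }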
>
>
> The overall structure as-is (although it can be pretty easily changed;
> the back-end is pretty well abstracted from the front) is an octree.
> The octree can currently handle a max size of 32 bits x 32 bits x
> 32 bits, although reduced a little by the above SIZE constant (if SIZE
> is 8 then the 32 bits is reduced to 29 bits or so). Basically, I need
> to index into a *massive*, not-always-loaded array that is 32 bits x
> 32 bits x 32 bits in size *just* for indexing, not including data; the
> data per cell may be variable size, but in general nearby chunks tend
> to be the same. So it is abstracted as an octree: if all blocks at one
> level of the octree are the same, the lower levels are deleted from
> memory and replaced by a single block representing all of them at the
> upper level, and so on.
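>
> The collapse rule is basically this (sketch; sameValue is an assumed
> helper that compares tag + payload):
>
> struct octNode {
>     octNode *child[8];  // all null for a leaf or a collapsed node
>     myType   value;     // meaningful when there are no children
> };
>
> bool isLeaf(const octNode *n) { return n->child[0] == 0; }
> bool sameValue(const myType &a, const myType &b);  // assumed helper
>
> // If all 8 children are leaves holding the same value, delete them
> // and keep that single value one level up:
> bool tryCollapse(octNode *n) {
>     for (int i = 0; i < 8; ++i)
>         if (!n->child[i] || !isLeaf(n->child[i]) ||
>             !sameValue(n->child[i]->value, n->child[0]->value))
>             return false;
>     n->value = n->child[0]->value;
>     for (int i = 0; i < 8; ++i) { delete n->child[i]; n->child[i] = 0; }
>     return true;
> }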
>
> The way the data is generated comes from a generator function. Upon
> first access of a data block, its 'chunk' is loaded, and the chunk is
> loaded by giving the location and dimensions of the chunk to a
> generator function that fills it up with data (dynamically generated,
> scripted, who knows; the function is a black box). When a data chunk
> is not needed for some period of time, it is unloaded from memory by
> being serialized out to disk (whether by just writing a file with a
> special filename based on location and block size, by being written to
> a DB, or, as it is done now, by being written to a distributed
> filesystem using Sector/Sphere; all subject to change if a better
> method is found). Some chunks may almost always stay loaded in memory
> due to rapid access on them, potentially growing large due to a large
> number of chunks being loaded. Currently I am writing it into
> Sector/Sphere so I can distribute it, but thus far I have only
> distributed the file reading/writing of chunks, not the processing.
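>
> As a pseudo-interface, the lifecycle is something like this
> (cacheLookup/loadFromStore/writeToStore/generate are stand-ins for
> whatever backend is plugged in):
>
> struct Chunk { myType cells[SIZE][SIZE][SIZE]; };
>
> Chunk *cacheLookup(int x, int y, int z);    // resident chunks
> Chunk *loadFromStore(int x, int y, int z);  // file/DB/Sector-Sphere
> void writeToStore(Chunk *c);                // keyed by location + size
> void generate(int x, int y, int z, int dim, Chunk *out);  // black box
>
> Chunk *acquire(int x, int y, int z) {
>     if (Chunk *c = cacheLookup(x, y, z))   return c;
>     if (Chunk *c = loadFromStore(x, y, z)) return c;
>     Chunk *c = new Chunk;
>     generate(x, y, z, SIZE, c);  // first-ever access: generate it
>     return c;
> }
>
> void evict(Chunk *c) {  // idle past some threshold
>     writeToStore(c);
>     delete c;
> }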
>
> I am guessing that I could distribute the processing and let each
> Sector/Sphere slave work on entire chunks at a time, just doing a
> giant switch on the tag and working on each block inside the chunks.
> There are a couple of problems with that, though, the primary one
> being that blocks can affect other nearby blocks (including nearby
> ones in nearby chunks). I could probably work around this by
> overlapping chunks and having 'out of range' blocks be read-only but
> still able to affect the ones internal to the chunk; that would
> duplicate processing, though, and I would have to make sure that data
> access was only affected by chunks within a very limited range, while
> also ensuring that effects only propagate outwards and not inwards.
> Alternatively, I could have all effects propagate outwards in another
> pass to handle such things, slowing it down by causing another access
> pass, although that could be mitigated by running it concurrently with
> the next 'tick' pass; that could potentially double memory usage
> (which I guess is easier to add than faster CPU usage...).
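>
> The overlap idea, in sketch form (a one-cell read-only halo; wider
> effects would need a wider border):
>
> struct workUnit {
>     Chunk *mine;                           // read-write, owned by task
>     myType halo[SIZE+2][SIZE+2][SIZE+2];   // last tick's state, incl. a
>                                            // one-cell neighbor border
> };
>
> void tickChunk(workUnit *w) {
>     // Read only from w->halo, write only into w->mine. A neighbor's
>     // cells get written in the neighbor's own pass, so effects can
>     // only propagate outwards, never inwards, within one tick.
> }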
>
> So, I want to minimize latency in the tick processing, including
> minimizing reads from the drive. Should I try to figure out a way to
> maximize concurrent processing on this 6-core server, or should I go
> ahead and scale out to Sector/Sphere and let slave computers do a lot
> of the processing?
>
> Hmm, an idea: I could do all of the processing of the nodes nearest
> the players on the 6-core server itself, and do the further nodes on
> the slaves. I do not need it to be deterministic (although it would be
> *VERY* nice if it were)...
>
> I could just use Sector/Sphere for storage and create multiple servers
> to work on the data in tandem, say each covering a specific in-game
> geographical area, and have clients connect to the server that handles
> the area closest to them, perhaps staying connected to the multiple
> nearest servers at the same time for quick access...
>
> This might be the best bet as it will not be handling just 'one' world
> either, hmm...
>
> I still need to figure out how to optimize the primary server work,
> though. Is the best method just to work on a block-by-block basis
> while switching based on the tag, or should that be optimized somehow
> to minimize branches in the functions? Perhaps just a master array
> list per tag and iterating over those? But I do not want to update
> every block in every chunk on every tick. For example, most things
> will only need to update nearly every tick in the chunks that are near
> players; further-away ones can update more slowly if necessary.
> Optimally everything would update every tick, but that will no doubt
> be infeasible once it gets to a larger dataset. Thoughts? It seems
> like a problem that should be very well suited to such a data-oriented
> pattern, except that there are going to be a rather large number of
> tag types (a short should handle it well, but it is not altogether
> impossible that I will need to increase it to a uint32 later), each of
> which may operate differently.
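>
> One way to stagger the updates (sketch; distance rings with
> power-of-two intervals are just one possible policy, and
> loadedChunks/ringToNearestPlayer/tickChunk are assumed):
>
> #include <vector>
>
> std::vector<Chunk *> loadedChunks;       // whatever is resident now
> int ringToNearestPlayer(const Chunk *);  // assumed: 0 = adjacent
> void tickChunk(Chunk *);                 // the per-chunk update pass
>
> void tickWorld(unsigned long long frame) {
>     for (Chunk *c : loadedChunks) {
>         unsigned interval = 1u << ringToNearestPlayer(c); // 1,2,4,8...
>         if (frame % interval == 0)
>             tickChunk(c);
>     }
> }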
>
> Hmm, another idea: most blocks will require no internal state if I
> split their small amount of state into new block types, but that will
> quickly balloon the block-type count, and there will still be a
> comparatively small number of blocks that do require state, very few
> with rather large state.
> _______________________________________________
> Sweng-Gamedev mailing list
> Sweng-***@lists.midnightryder.com
> http://lists.midnightryder.com/listinfo.cgi/sweng-gamedev-midnightryder.com
>