Beautiful Pixels: Multi-Platform Multi-Core Architecture Comparison (PC, Wii, Xbox 360, PS3, CUDA, Larrabee)

Saturday, August 23, 2008

I just gave a presentation at the Game Connection Developers Conference in Leipzig. It dealt with Multi-Platform support for Multi-Core development... which we've solved at Emergent with Floodgate.

I've presented on this before, but what I added this time was a series of architecture block diagrams to illustrate the wide range of systems out there. They specifically focus on the memory topology relevant for code.

Some quick notes:

Sizes and distances between boxes have no meaning in these diagrams; only the topology matters.
There are simplifications (e.g. I haven't added EDRAM on the 360). However, the high-level structure of the systems is valuable to contrast, and I've focused on what general processing typically accesses. If I've goofed something, let me know, though I may have omitted it on purpose to keep things simpler.
R stands for Registers, L1 and L2 for caches, Mem for Memory, and GMem for graphics memory.

We start with simple PCs and Multi-Core PCs. Memory is cached, but even with multi-core systems the programmer doesn't have to worry about consistency. As long as synchronization primitives are used to avoid race conditions, the systems take care of getting the right data when you fetch it. (This takes some work, since invalid data could be in an L1 cache that should be replaced by data currently in a write queue from another CPU.)
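
As a minimal sketch of that point, assuming standard C++ threads (this is not Floodgate code, and `SharedCounter`, `add_many`, and `run_workers` are hypothetical names made up for illustration): several threads update one shared value, and a mutex is the only thing the programmer adds. The hardware's cache coherency does the rest.

```cpp
#include <cstdint>
#include <functional>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical shared state: one counter guarded by one mutex.
struct SharedCounter {
    std::mutex lock;
    std::uint64_t value = 0;
};

// Each worker increments the counter `iterations` times.
// The lock_guard serializes the read-modify-write, which removes the race.
void add_many(SharedCounter& counter, int iterations) {
    for (int i = 0; i < iterations; ++i) {
        std::lock_guard<std::mutex> guard(counter.lock);
        ++counter.value;
    }
}

// Starts `thread_count` workers, waits for all of them, and returns the total.
std::uint64_t run_workers(int thread_count, int iterations) {
    SharedCounter counter;
    std::vector<std::thread> workers;
    for (int t = 0; t < thread_count; ++t) {
        workers.emplace_back(add_many, std::ref(counter), iterations);
    }
    for (auto& worker : workers) {
        worker.join();
    }
    return counter.value;
}
```

Calling `run_workers(4, 10000)` should always return 40000. Without the lock, increments from different cores could be lost, but no explicit cache flush or memory copy is ever needed.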

Getting into consoles, we start with the Wii. There are two types of memory, both accessible by CPU and GPU. However, what's really interesting is the ability to lock a portion of the L1 cache and explicitly manage it with DMA transfers. In one test case, we saw a 2.5x performance improvement by explicitly managing Floodgate transfers with the locked cache!

The Xbox 360 looks quite a bit like a multi-core PC, with multiple hardware threads per core. The main thing to note is the single memory used for "system" and graphical resources. Also, the GPU happens to be the memory controller and has access to L2, but programmers needn't concern themselves with this, and only a few developers take advantage of GPU L2 access.

The PlayStation 3 (CELL processor) is the earliest architecture that really rocked the boat. A series of co-processors named SPUs have dedicated memory for instructions and data called Local Stores. These must be managed explicitly by DMA transfers. PlayStation 3 is why we built Floodgate, but as you'll see, it's not the only system that can benefit.
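
To make "managed explicitly by DMA transfers" concrete, here is a rough sketch of the usual double-buffering pattern. It is a simulation under loud assumptions: it uses neither the Cell SDK nor Floodgate's API, `fake_dma_get` and `fake_dma_put` are hypothetical stand-ins built on `memcpy`, and `kChunk` is an arbitrary size. On real hardware the transfers are asynchronous and the code must wait for each one to complete before touching the buffer.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical chunk size (in floats) that fits in local storage.
constexpr std::size_t kChunk = 256;

// Stand-in for a DMA "get": main memory -> local buffer (synchronous here).
void fake_dma_get(float* local, const float* main_mem, std::size_t count) {
    std::memcpy(local, main_mem, count * sizeof(float));
}

// Stand-in for a DMA "put": local buffer -> main memory (synchronous here).
void fake_dma_put(float* main_mem, const float* local, std::size_t count) {
    std::memcpy(main_mem, local, count * sizeof(float));
}

// Scales `data` in place, touching main memory only through the two helpers.
// Two local buffers alternate: one is processed while the other is filled.
void scale_streamed(std::vector<float>& data, float factor) {
    const std::size_t total = data.size();
    if (total == 0) {
        return;
    }
    std::array<std::array<float, kChunk>, 2> local{};
    std::size_t current = 0;

    // Prime the pipeline with the first chunk.
    fake_dma_get(local[current].data(), data.data(), std::min(kChunk, total));

    for (std::size_t offset = 0; offset < total; offset += kChunk) {
        const std::size_t count = std::min(kChunk, total - offset);
        const std::size_t next_offset = offset + kChunk;

        // Request the next chunk into the other buffer before working.
        if (next_offset < total) {
            const std::size_t next_count = std::min(kChunk, total - next_offset);
            fake_dma_get(local[current ^ 1].data(), data.data() + next_offset, next_count);
        }

        // Compute only on data that is already in the local buffer.
        for (std::size_t i = 0; i < count; ++i) {
            local[current][i] *= factor;
        }

        // Write the finished chunk back and swap buffers.
        fake_dma_put(data.data() + offset, local[current].data(), count);
        current ^= 1;
    }
}
```

With asynchronous transfers, the fetch of the next chunk overlaps the compute loop, which is where the benefit comes from. The same shape applies to the Wii's locked cache.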

nVidia's CUDA is certainly an interesting architecture. It differs significantly from other systems, being a large collection of fairly small microprocessors. Each microprocessor block has a shared register file and a large number of threads that are very efficiently switched by a hardware-implemented scheduler. Each block also has a shared memory cache that must be explicitly managed by code.

The left side of the diagram is the CPU of the system; I left it as a dual-core just as an example.

Intel's Larrabee looks like a many-core system in many ways. Again, I left a generic dual-core CPU on the left side. The architecture feature to note is that the L2 cache has been broken up, with a portion dedicated to each core, and each core runs 4 hardware threads. However, there is a high-speed ring bus that provides access to any L2 from any core. The caches maintain coherency, so programmers need only worry about race conditions, not about data barriers, write queues, or caches. However, high-performance code will take advantage of the faster access of "local L2 cache".

Some things to summarize:

There is a wide variety of machine types currently on the market, or about to be here.
Some architectures have non-uniform memory, and many require explicit memory management.
Systems that don't require explicit memory management still benefit from it. e.g.:

Wii with Locked Cache
CUDA with Shared Memory
360 with prefetching
Larrabee with "right sized" "local L2 cache" data

Large numbers of computing elements are coming. CUDA already exposes a very high count, and so will Larrabee. These systems will require efficient blends of both functional decomposition and data decomposition.
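
For readers less familiar with the two terms in the last point, here is a small sketch in standard C++, assuming `std::async` as the task mechanism (this is not how Floodgate expresses work, and every name below is hypothetical). Data decomposition runs the same operation on different slices of the data; functional decomposition runs different operations as separate tasks.

```cpp
#include <algorithm>
#include <cstddef>
#include <future>
#include <numeric>
#include <vector>

// Data decomposition: the same operation (a sum) on different slices.
long long sum_data_decomposed(const std::vector<int>& values, std::size_t workers) {
    if (workers == 0) {
        workers = 1;
    }
    const std::size_t slice = (values.size() + workers - 1) / workers;
    std::vector<std::future<long long>> parts;
    for (std::size_t w = 0; w < workers; ++w) {
        const std::size_t begin = std::min(values.size(), w * slice);
        const std::size_t end = std::min(values.size(), begin + slice);
        parts.push_back(std::async(std::launch::async, [&values, begin, end] {
            const auto first = values.begin() + static_cast<std::ptrdiff_t>(begin);
            const auto last = values.begin() + static_cast<std::ptrdiff_t>(end);
            return std::accumulate(first, last, 0LL);
        }));
    }
    long long total = 0;
    for (auto& part : parts) {
        total += part.get();  // combine the per-slice results
    }
    return total;
}

struct Range {
    int lowest;
    int highest;
};

// Functional decomposition: different operations (min and max) run as
// separate tasks over the same data. Precondition: `values` is not empty.
Range range_function_decomposed(const std::vector<int>& values) {
    auto lowest = std::async(std::launch::async, [&values] {
        return *std::min_element(values.begin(), values.end());
    });
    auto highest = std::async(std::launch::async, [&values] {
        return *std::max_element(values.begin(), values.end());
    });
    return Range{lowest.get(), highest.get()};
}
```

Data decomposition scales with the number of slices, so it suits machines with very high core counts. Functional decomposition is limited by how many distinct tasks exist, which is why a blend of the two is usually needed.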

Ed Holzwarth and I designed Floodgate in 2005/2006 to deal with many of these issues on PS3 & Xbox 360. I'm pleased to find our approach has positioned us well for upcoming hardware architectures we didn't know about then (CUDA, Larrabee). If you'd like more info on Floodgate, for now I'll just send you to some marketing material and a white paper. Also, much credit to those who actually implemented and maintain the system: David Asbell, Stephen Chenney, Michael Noland, Dan Amerson, & Joel Bartley (sorry if I missed someone).

9 comments:

Nico, August 23, 2008 at 10:33 PM
Nice overview, very educating!
moradin, August 25, 2008 at 12:10 AM
I was there, thank you for the presentation. It was one of the bests in the development track.
Unknown, August 25, 2008 at 5:26 AM
WOw, nice post, the simplicity teaches easily.
Anonymous, August 25, 2008 at 5:36 AM
Thanks for the info. Quick and to the point. With CUDA/PS3 etc the market for HPC is going to be wonderful as well the graphics which come out of Nvidia & AMD chips.
Anonymous, February 27, 2009 at 10:28 AM
Thank you very much for the overview! Very interesting and informative. However, seeing all these different parallel architectures at once, I am even less convinced that the graphics / game / HPC community will be able to single-handedly solve the Parallel Programming Problem, which academic computer science has not managed to solve over decades. Cross-platform libraries providing synchronization primitives etc. may facilitate the development process, but they do not remove the necessity to hand-optimize algorithms to achieve decent parallelized speed-ups. And this requires, above all, a lot of time and know-how both in the specific algorithms and in parallelization.
namar0x0309, July 28, 2009 at 3:41 AM
Nice!
Anonymous, April 27, 2010 at 1:03 PM
Wow great article thanks!
shikha, September 14, 2010 at 2:05 AM
details r realy fine...i think its going to help me in project...
Anonymous, December 18, 2010 at 8:32 AM
Notice that the SPUs can DMA Video ram too.