I've presented on this before, but what I added this time was a series of architecture block diagrams to illustrate the wide range of systems out there. They specifically focus on the memory topology relevant for code.
Some quick notes:
- Sizes and distances between boxes don't have meaning in these diagrams, just the topology.
- There are simplifications (e.g. I haven't added the EDRAM on the 360). Still, the high-level structure of the systems is valuable to contrast, and I've focused on what general processing typically accesses. If I've goofed something, let me know, but keep in mind I may have omitted it deliberately to keep things simple.
- R stands for Registers, L1 and L2 for caches, Mem for Memory, and GMem for graphics memory.
We start with simple PCs and multi-core PCs. Memory is cached, but even on multi-core systems the programmer doesn't have to worry about coherency. As long as synchronization primitives are used to avoid race conditions, the hardware takes care of delivering the right data when you fetch it. (This takes some work under the hood, since stale data could be sitting in one core's L1 cache while the up-to-date value is still in another CPU's write queue.)
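Here's a minimal sketch of what that looks like in practice: plain C++ threads plus a mutex, with the coherency hardware doing the rest underneath.

```cpp
#include <mutex>
#include <thread>
#include <vector>

// On a cache-coherent multi-core PC the hardware keeps the caches
// consistent; the programmer only has to serialize conflicting access.
std::mutex gLock;
long gTotal = 0;

void AddChunk(const std::vector<long>& data)
{
    long local = 0;
    for (long v : data)       // private work needs no synchronization
        local += v;

    std::lock_guard<std::mutex> guard(gLock);  // race-free publish
    gTotal += local;          // coherency delivers the right value to readers
}

int main()
{
    std::vector<long> a(1000, 1), b(1000, 2);
    std::thread t1(AddChunk, std::cref(a));
    std::thread t2(AddChunk, std::cref(b));
    t1.join();
    t2.join();
    return gTotal == 3000 ? 0 : 1;
}
```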
Getting into consoles, we start with the Wii. There are two types of memory, both accessible by the CPU and GPU. What's really interesting, though, is the ability to lock a portion of the L1 cache and manage it explicitly with DMA transfers. In one test case, we saw a 2.5x performance improvement by explicitly managing Floodgate transfers through the locked cache!
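A hedged sketch of the pattern (LockedCacheAlloc, DmaToLockedCache, DmaFromLockedCache, and DmaWait are hypothetical stand-ins for the platform's actual locked-cache and DMA calls, which I won't reproduce here):

```cpp
// Hypothetical sketch of the locked-L1-cache streaming pattern.
// LockedCacheAlloc / DmaToLockedCache / DmaFromLockedCache / DmaWait
// are invented names standing in for the platform's real API.
const int kChunk = 4 * 1024;          // portion of the locked cache we use

void TransformStream(const float* src, float* dst, int count)
{
    float* scratch = (float*)LockedCacheAlloc(kChunk);
    int perChunk = kChunk / sizeof(float);

    for (int base = 0; base < count; base += perChunk)
    {
        int n = (count - base < perChunk) ? count - base : perChunk;
        DmaToLockedCache(scratch, src + base, n * sizeof(float));
        DmaWait();                    // block until the transfer lands

        for (int i = 0; i < n; ++i)   // work entirely out of locked L1
            scratch[i] *= 2.0f;

        DmaFromLockedCache(dst + base, scratch, n * sizeof(float));
        DmaWait();
    }
}
```

In practice you'd double-buffer so the DMA overlaps the compute; that overlap is where most of the win comes from.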
The Xbox 360 looks quite a bit like a multi-core PC, with multiple hardware threads per core. The main thing to note is the single pool of memory used for both "system" and graphical resources. Also, the GPU happens to be the memory controller and has access to the L2 cache, but programmers needn't concern themselves with this, and only a few developers take advantage of GPU L2 access.
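Prefetching is the kind of explicit management that still pays off here. A generic sketch using GCC/Clang's __builtin_prefetch (the 360 toolchain exposes its own dcbt-style intrinsic under a different name):

```cpp
// Software prefetching: issue a fetch for data we'll need in a few
// iterations so the memory latency is hidden behind useful work.
float SumWithPrefetch(const float* data, int count)
{
    const int kAhead = 16;            // tune: how far ahead to fetch
    float sum = 0.0f;
    for (int i = 0; i < count; ++i)
    {
        if (i + kAhead < count)
            __builtin_prefetch(&data[i + kAhead]);
        sum += data[i];
    }
    return sum;
}
```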
The PlayStation 3 (with its CELL processor) is the earliest architecture here that really rocked the boat. A set of co-processors called SPUs each have dedicated memory for instructions and data, called a Local Store, which must be managed explicitly via DMA transfers. The PlayStation 3 is why we built Floodgate, but as you'll see, it's not the only system that can benefit.
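To make the explicit management concrete, here's a hedged SPU-side sketch using the MFC DMA intrinsics from the Cell SDK's spu_mfcio.h (tag management and error handling simplified):

```cpp
// SPU-side sketch: data must be staged into the 256 KB Local Store by
// explicit DMA before the SPU can touch it.
#include <spu_mfcio.h>

#define CHUNK 4096
static volatile float ls_buf[CHUNK / sizeof(float)]
    __attribute__((aligned(128)));   // DMA-friendly alignment

void process_chunk(unsigned long long ea /* main-memory address */)
{
    const unsigned int tag = 1;

    mfc_get(ls_buf, ea, CHUNK, tag, 0, 0);   // main memory -> Local Store
    mfc_write_tag_mask(1 << tag);
    mfc_read_tag_status_all();               // wait for the DMA to finish

    for (unsigned int i = 0; i < CHUNK / sizeof(float); ++i)
        ls_buf[i] *= 2.0f;                   // compute out of Local Store

    mfc_put(ls_buf, ea, CHUNK, tag, 0, 0);   // Local Store -> main memory
    mfc_write_tag_mask(1 << tag);
    mfc_read_tag_status_all();
}
```

As with the Wii's locked cache, real code double-buffers so transfers and compute overlap.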
NVIDIA's CUDA is certainly an interesting architecture. It differs significantly from the other systems, being a large collection of fairly small multiprocessors. Each multiprocessor has a shared register file and a large number of threads that are switched very efficiently by a hardware scheduler. Each multiprocessor also has a block of shared memory, a software-managed scratchpad that must be explicitly managed by code.
The left side of the diagram is the host CPU of the system; I left it as a dual-core just as an example.
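To illustrate the explicitly managed shared memory, here's a minimal CUDA sketch of a per-block reduction staged through that scratchpad:

```cpp
// Each block stages its slice of the input into shared memory, then
// reduces it cooperatively; one partial sum is written per block.
__global__ void BlockSum(const float* in, float* blockOut, int n)
{
    __shared__ float tile[256];               // per-block scratchpad

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;
    __syncthreads();                          // all loads visible to the block

    for (int stride = blockDim.x / 2; stride > 0; stride /= 2)
    {
        if (threadIdx.x < stride)
            tile[threadIdx.x] += tile[threadIdx.x + stride];
        __syncthreads();
    }
    if (threadIdx.x == 0)
        blockOut[blockIdx.x] = tile[0];
}
// Host-side launch with a block size of 256 to match the tile:
// BlockSum<<<numBlocks, 256>>>(dIn, dOut, n);
```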
Intel's Larrabee looks like a many-core system in many ways. Again, I left a generic dual-core CPU on the left side. The architectural feature to note is that the L2 cache has been broken up, with a slice dedicated to each core (each of which runs 4 hardware threads). However, a high-speed ring bus provides access to any L2 slice from any core. The caches maintain coherency, so programmers need only worry about race conditions, not data barriers, write queues, or caches. High-performance code, though, will take advantage of the faster access to the "local" L2 slice.
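There's no special API for this since the caches are coherent; you just size your working set to the local slice. A generic cache-blocking sketch (Intel's published Larrabee material gives 256 KB of L2 per core; treat the tile size as a tunable):

```cpp
// Cache blocking: process data in tiles sized to fit the fast local
// cache, so repeated passes over a tile hit the local L2 slice.
const int kTileBytes  = 192 * 1024;           // headroom below 256 KB
const int kTileFloats = kTileBytes / sizeof(float);

void Smooth(float* data, int count)
{
    for (int base = 0; base < count; base += kTileFloats)
    {
        int n = (count - base < kTileFloats) ? count - base : kTileFloats;
        // After the first pass the tile is resident in the local L2,
        // so the remaining passes are cheap.
        for (int pass = 0; pass < 4; ++pass)
            for (int i = 1; i < n - 1; ++i)
                data[base + i] = 0.5f * data[base + i] +
                                 0.25f * (data[base + i - 1] +
                                          data[base + i + 1]);
    }
}
```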
Some things to summarize:
- There is a wide variety of machine types currently on the market or about to arrive.
- Some architectures have non-uniform memory, and many require explicit memory management.
- Systems that don't require explicit memory management can still benefit from it, e.g.:
- Wii with Locked Cache
- CUDA with Shared Memory
- 360 with prefetching
- Larrabee with data "right sized" for the local L2 cache
- Large numbers of computing elements are coming. CUDA already exposes a very high count, and so will Larrabee. These systems will require efficient blends of both functional decomposition and data decomposition (see the sketch below).
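As a closing sketch, here's what a blend of the two can look like with plain C++ threads: one independent stage runs as its own task (functional decomposition) while another stage's data is split across a pool of workers (data decomposition). The stage names and workloads are placeholders.

```cpp
#include <thread>
#include <vector>

// Data decomposition: one stage's elements split across several workers.
void AnimateRange(std::vector<float>& bones, size_t begin, size_t end)
{
    for (size_t i = begin; i < end; ++i)
        bones[i] += 0.016f;                   // placeholder per-element work
}

// Functional decomposition: an independent stage running as its own task.
void MixAudio(std::vector<float>& samples)
{
    for (float& s : samples) s *= 0.5f;
}

int main()
{
    std::vector<float> bones(1 << 16), samples(1 << 16);

    std::thread audio(MixAudio, std::ref(samples));    // functional split

    const size_t workers = 4;                 // data split of the bone work
    std::vector<std::thread> pool;
    size_t per = bones.size() / workers;
    for (size_t w = 0; w < workers; ++w)
        pool.emplace_back(AnimateRange, std::ref(bones),
                          w * per, (w + 1) * per);

    for (auto& t : pool) t.join();
    audio.join();
    return 0;
}
```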
Nice overview, very educational!
I was there, thank you for the presentation. It was one of the best in the development track.
Wow, nice post, the simplicity teaches easily.
Thanks for the info. Quick and to the point. With CUDA, PS3, etc., the market for HPC is going to be wonderful, as will the graphics that come out of NVIDIA & AMD chips.
Thank you very much for the overview! Very interesting and informative. However, seeing all these different parallel architectures at once, I am even less convinced that the graphics / game / HPC community will be able to single-handedly solve the Parallel Programming Problem, which academic computer science has not managed to solve over decades. Cross-platform libraries providing synchronization primitives etc. may facilitate the development process, but they do not remove the necessity to hand-optimize algorithms to achieve decent parallelized speed-ups. And this requires, above all, a lot of time and know-how both in the specific algorithms and in parallelization.
Pretty!
Wow, great article, thanks!
The details are really fine... I think it's going to help me in my project...
Notice that the SPUs can DMA video RAM too.