I've presented on this before, but what I added this time was a series of architecture block diagrams to illustrate the wide range of systems out there. They specifically focus on the memory topology relevant for code.
Some quick notes:
- Sizes and distances between boxes don't have meaning in these diagrams, just the topology.
- There are simplifications (e.g. I haven't added the eDRAM on the 360). However, the high-level structure of the systems is valuable to contrast, and I've focused on what general processing typically accesses. If I've goofed something, let me know, though I may also have omitted it deliberately to keep things simpler.
- R stands for Registers, L1 and L2 for caches, Mem for Memory, and GMem for Graphics Memory.
We start with simple PCs and multi-core PCs. Memory is cached, but even on multi-core systems the programmer doesn't have to worry about coherency. As long as synchronization primitives are used to avoid race conditions, the hardware takes care of getting the right data when you fetch it. (This takes some work under the hood: stale data can be sitting in one core's L1 cache while the current value is still in another CPU's write queue.)
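Here's a minimal sketch of that division of labor, using std::thread and std::mutex as stand-ins for whatever threading API you actually use: you guard the shared update, and the coherency hardware guarantees every core sees the committed value.

```
#include <cstdio>
#include <mutex>
#include <thread>

static int        g_counter = 0;
static std::mutex g_lock;

static void worker()
{
    for (int i = 0; i < 100000; ++i)
    {
        std::lock_guard<std::mutex> guard(g_lock); // the synchronization primitive
        ++g_counter;                               // coherency hardware does the rest
    }
}

int main()
{
    std::thread a(worker), b(worker);
    a.join();
    b.join();
    std::printf("%d\n", g_counter); // always 200000, on any number of cores
    return 0;
}
```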
Getting into consoles, we start with the Wii. There are two types of memory, both accessible by the CPU and GPU. What's really interesting, though, is the ability to lock a portion of the L1 cache and explicitly manage it with DMA transfers. In one test case, we saw a 2.5x performance improvement by explicitly managing Floodgate transfers through the locked cache!
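I can't reproduce the actual SDK calls here, so the lc_* names below are hypothetical stand-ins, but they show the shape of the technique: carve out locked L1, DMA a chunk in, work on it at cache speed, DMA the result out.

```
// Hypothetical stand-ins for the locked-cache API (the real SDK names differ):
extern "C" float* lc_alloc(unsigned bytes);                              // space in locked L1
extern "C" void   lc_dma_read(void* l1_dst, const void* mem_src, unsigned bytes);
extern "C" void   lc_dma_write(void* mem_dst, const void* l1_src, unsigned bytes);
extern "C" void   lc_dma_wait();

void process_stream(const float* src, float* dst, int count)
{
    const int CHUNK = 1024;
    float* scratch = lc_alloc(CHUNK * sizeof(float));

    for (int base = 0; base < count; base += CHUNK)
    {
        const int n = (count - base < CHUNK) ? count - base : CHUNK;
        lc_dma_read(scratch, src + base, n * sizeof(float));  // main mem -> locked L1
        lc_dma_wait();
        for (int i = 0; i < n; ++i)
            scratch[i] *= 2.0f;                               // work at L1 speed
        lc_dma_write(dst + base, scratch, n * sizeof(float)); // locked L1 -> main mem
        lc_dma_wait();
    }
}
```

The 2.5x win above came from routing Floodgate's transfers through exactly this kind of locked-cache staging.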
The Xbox 360 looks quite a bit like a multi-core PC, with multiple hardware threads per core. The main thing to note is the single unified memory used for both "system" and graphical resources. Also, the GPU happens to be the memory controller and has access to the L2, but programmers needn't concern themselves with this, and only a few developers take advantage of the GPU's L2 access.
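Even though the 360's caches are coherent, its in-order cores reward explicit prefetching (more on that in the summary below). A sketch, using GCC's __builtin_prefetch as a portable stand-in for the PowerPC dcbt instruction you'd issue on the 360:

```
// Software prefetching: touch cache lines a few iterations ahead so the
// in-order core isn't left stalling on memory when it gets there.
float sum_array(const float* data, int count)
{
    const int AHEAD = 32; // one 128-byte line (32 floats) ahead; tune against latency
    float sum = 0.0f;
    for (int i = 0; i < count; ++i)
    {
        if (i + AHEAD < count)
            __builtin_prefetch(&data[i + AHEAD]); // warm the line before it's needed
        sum += data[i];
    }
    return sum;
}
```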
The PlayStation 3 (CELL processor) is the earliest architecture that really rocked the boat. A set of co-processors called SPUs have dedicated memories for instructions and data called Local Stores, which must be managed explicitly via DMA transfers. The PlayStation 3 is why we built Floodgate, but as you'll see, it's not the only system that can benefit.
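Here's an SPU-side sketch of the pattern, written from memory of the Cell SDK's spu_mfcio.h (double-check the exact signatures against the SDK docs). The SPU pulls a buffer from main memory into its Local Store, works on it, and pushes the result back, all via DMA, since SPUs can't address main memory directly:

```
#include <spu_mfcio.h>

#define TAG 1

static float ls_buffer[1024] __attribute__((aligned(128))); // DMA wants 128-byte alignment

void process(unsigned long long ea_src, unsigned long long ea_dst)
{
    // DMA main memory -> Local Store, then block until the transfer completes.
    mfc_get(ls_buffer, ea_src, sizeof(ls_buffer), TAG, 0, 0);
    mfc_write_tag_mask(1 << TAG);
    mfc_read_tag_status_all();

    for (int i = 0; i < 1024; ++i)
        ls_buffer[i] *= 2.0f; // compute at Local Store speed

    // DMA Local Store -> main memory and wait again.
    mfc_put(ls_buffer, ea_dst, sizeof(ls_buffer), TAG, 0, 0);
    mfc_write_tag_mask(1 << TAG);
    mfc_read_tag_status_all();
}
```

In practice you'd double-buffer so the next mfc_get overlaps the current compute; that bookkeeping is exactly the sort of thing Floodgate exists to hide.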
nVidia's CUDA is certainly an interesting architecture. It differs significantly from the other systems, being a large collection of fairly small microprocessors. Each microprocessor block has a shared register file and a large number of threads that are very efficiently switched by a hardware scheduler. Each block also has a shared memory, a small software-managed cache that must be explicitly controlled by code.
The left side of the diagram is the CPU of the system; I left it as a dual-core just as an example.
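A minimal CUDA kernel showing that explicit management: __shared__ and __syncthreads() are the real CUDA constructs, and the kernel itself is just an illustrative block-level reversal. Each block stages a tile into the fast on-chip shared memory, synchronizes, then reads out of the staged copy instead of global memory.

```
__global__ void reverse_tile(const float* in, float* out)
{
    __shared__ float tile[256];               // per-block software-managed storage

    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = in[i];                // stage global -> shared, one element per thread
    __syncthreads();                          // wait until the whole tile is staged

    // Reading another thread's element is only safe because of the barrier above.
    out[i] = tile[blockDim.x - 1 - threadIdx.x];
}
// Host-side launch, assuming count is a multiple of 256:
//   reverse_tile<<<count / 256, 256>>>(d_in, d_out);
```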
Intel's Larrabee looks like a many-core system in many ways. Again, I left a generic dual-core CPU on the left side. The architectural feature to note is that the L2 cache has been broken up, with a portion dedicated to each core (each core running 4 hardware threads). However, a high-speed ring bus provides access to any L2 slice from any core. The caches maintain coherency, so programmers need only worry about race conditions, not memory barriers, write queues, and caches. That said, high-performance code will take advantage of the faster access to the "local L2 cache".
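Here's a generic cache-blocking sketch of the "right-sized" idea (the published Larrabee paper quotes 256 KB of L2 per core; the tile size below is my own assumption, picked to fit comfortably in one slice):

```
const int TILE = 32 * 1024; // floats: 128 KB, comfortably inside a 256 KB L2 slice

void multi_pass(float* data, int count)
{
    for (int base = 0; base < count; base += TILE)
    {
        const int n = (count - base < TILE) ? count - base : TILE;
        // Several passes over one resident tile, rather than several passes
        // over the whole array; every pass after the first hits the local L2
        // instead of crossing the ring bus.
        for (int pass = 0; pass < 4; ++pass)
            for (int i = 0; i < n; ++i)
                data[base + i] = data[base + i] * 0.5f + 1.0f;
    }
}
```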
Some things to summarize:
- There is a wide variety of machine types currently on the market, or about to arrive.
- Some architectures have non-uniform memory, and many require explicit memory management.
- Systems that don't require explicit memory management can still benefit from it, e.g.:
- Wii with Locked Cache
- CUDA with Shared Memory
- 360 with prefetching
- Larrabee with data "right-sized" to fit the "local L2 cache"
- Large numbers of computing elements are coming. CUDA already exposes a very high count, and Larrabee will as well. These systems will require efficient blends of both functional decomposition and data decomposition; see the sketch below.
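Here's an illustrative sketch of what that blend looks like (none of these names are a real API; the stage kernels are trivial placeholders). The pipeline is split into stages, which is the functional decomposition, and each stage fans its elements out across workers, which is the data decomposition:

```
typedef void (*StageFn)(float*, int);

// Placeholder stage kernels, purely for illustration.
static void animate(float* d, int n)  { for (int i = 0; i < n; ++i) d[i] += 1.0f; }
static void simulate(float* d, int n) { for (int i = 0; i < n; ++i) d[i] *= 0.99f; }

void run_stage(StageFn stage, float* data, int count, int workers)
{
    // Data decomposition: carve the stage's elements into one chunk per worker.
    const int chunk = (count + workers - 1) / workers;
    for (int w = 0; w < workers; ++w) // in reality, each chunk goes to a core/SPU/CUDA block
    {
        const int base = w * chunk;
        const int n = (count - base < chunk) ? count - base : chunk;
        if (n > 0)
            stage(data + base, n);
    }
}

void frame(float* data, int count, int workers)
{
    // Functional decomposition: distinct stages, run in sequence (or overlapped).
    run_stage(animate,  data, count, workers);
    run_stage(simulate, data, count, workers);
}
```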