Beautiful Pixels: August 2008

Saturday, August 23, 2008

Multi-Platform Multi-Core Architecture Comparison (PC, Wii, Xbox 360, PS3, CUDA, Larrabee)

I just gave a presentation at the Game Connection Developers Conference in Leipzig. It dealt with Multi-Platform support for Multi-Core development... which we've solved at Emergent with Floodgate.

I've presented on this before, but what I added this time was a series of architecture block diagrams to illustrate the wide range of systems out there. They specifically focus on the memory topology relevant for code.

Some quick notes:

Sizes and distances between boxes don't have meaning in these diagrams, just the topology.
There are simplifications (e.g. I haven't added EDRAM on the 360). However, the high level structure of the systems is valuable to contrast, and I've focused on what general processing typically accesses. If I've goofed something, let me know, though I may have omitted it on purpose to keep things simpler.
R stands for Registers, L1 and L2 for caches, Mem for Memory, and GMem for graphics memory.

We start with simple PCs and Multi-Core PCs. Memory is cached, but even with multi-core systems the programmer doesn't have to worry about consistency. As long as synchronization primitives are used to avoid race conditions, the systems take care of getting the right data when you fetch it. (This takes some work, since invalid data could be in an L1 cache that should be replaced by data currently in a write queue from another CPU.)
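Here is a minimal C++ sketch of that point, assuming standard C++11 threads. The function name and counts are made up for illustration. The only thing the code does to stay correct is take a lock around the shared counter; it never flushes or invalidates a cache, because the hardware handles coherency.

```cpp
#include <cassert>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical example: several threads bump one shared counter.
// The mutex removes the race condition; cache coherency is left to the hardware.
long countWithThreads(int threadCount, int incrementsPerThread) {
    assert(threadCount > 0 && incrementsPerThread >= 0);
    long counter = 0;
    std::mutex counterLock;
    std::vector<std::thread> workers;
    for (int t = 0; t < threadCount; ++t) {
        workers.emplace_back([&counter, &counterLock, incrementsPerThread] {
            for (int i = 0; i < incrementsPerThread; ++i) {
                std::lock_guard<std::mutex> guard(counterLock);
                ++counter;  // safe: only one thread is in here at a time
            }
        });
    }
    for (std::thread& worker : workers) {
        worker.join();
    }
    return counter;
}
```

Calling countWithThreads(4, 10000) should return 40000 every run. Remove the lock_guard and the result becomes unpredictable, which is the race condition, not a coherency problem.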

Getting into consoles, we start with the Wii. There are two types of memory, both accessible by CPU and GPU. However, what's really interesting is the ability to lock a portion of the L1 cache and explicitly manage it with DMA transfers. In one test case, we saw a 2.5x performance improvement by explicitly managing Floodgate transfers with the locked cache!

The Xbox 360 looks quite a bit like a multi-core PC, with multiple hardware threads per core. The main thing to note is the single memory used for "system" and graphical resources. Also, the GPU happens to be the memory controller and has access to L2, but programmers needn't concern themselves with this, and only a few developers take advantage of GPU L2 access.

The PlayStation 3 (CELL processor) is the earliest architecture that really rocked the boat. A series of co-processors named SPUs have dedicated memory for instructions and data called Local Stores. These must be managed explicitly by DMA transfers. PlayStation 3 is why we built Floodgate, but as you'll see, it's not the only system that can benefit.
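To make "managed explicitly by DMA transfers" concrete, here is a rough, portable C++ sketch of the pattern. It is not Floodgate's API and not real SPU code: the buffer size kLocalFloats is an arbitrary assumption, the function name is hypothetical, and std::memcpy stands in for the DMA transfers a real platform would issue.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical stand-in for a small, explicitly managed memory
// (an SPU Local Store or a locked cache region). The size is arbitrary.
constexpr std::size_t kLocalFloats = 256;

// Scales every element of `data`, touching main memory only through
// bulk copies into and out of the local buffer.
void scaleThroughLocalStore(std::vector<float>& data, float scale) {
    float local[kLocalFloats];
    for (std::size_t base = 0; base < data.size(); base += kLocalFloats) {
        const std::size_t count = std::min(kLocalFloats, data.size() - base);
        assert(count <= kLocalFloats);  // the chunk must fit in the local buffer
        std::memcpy(local, data.data() + base, count * sizeof(float));  // "DMA in"
        for (std::size_t i = 0; i < count; ++i) {
            local[i] *= scale;  // compute only on local data
        }
        std::memcpy(data.data() + base, local, count * sizeof(float));  // "DMA out"
    }
}
```

This version waits for each copy before computing. On real hardware you would typically double-buffer so the next chunk is in flight while the current one is being processed.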

nVidia's CUDA is certainly an interesting architecture. It differs significantly from other systems, being a large collection of fairly small microprocessors. Each microprocessor block has a shared register file, and a large number of threads that are very efficiently switched by a hardware-implemented scheduler. Each block also has a shared memory cache that must be explicitly managed by code.

The left side of the diagram is the CPU of the system; I left it as a dual-core just as an example.

Intel's Larrabee looks like a many core system in many ways. Again, I left a generic dual-core CPU on the left side. The architecture feature to note is that the L2 cache has been broken up and a portion dedicated to each core of 4 hardware threads. However, there is a high speed ring bus that provides access to any L2 from any core. The caches maintain coherency so programmers need only worry about race conditions, but not data barriers, write queues, and caches. However, high performance code will take advantage of the faster access of "local L2 cache".

Some things to summarize:

There is a wide variety of machine types currently on the market, or about to be here.
Some architectures have non-uniform memory, and many require explicit memory management.
Systems that don't require explicit memory management still benefit from it. e.g.:

Wii with Locked Cache
CUDA with Shared Memory
360 with prefetching
Larrabee with "right sized" "local L2 cache" data

Large numbers of computing elements are coming. CUDA already exposes a very high count, and so does Larrabee. These systems will require efficient blends of both functional decomposition and data decomposition.
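For readers new to the two terms, here is a small C++ sketch using std::async. The function names and the slice count are assumptions for illustration only. parallelSum shows data decomposition (the same work on disjoint slices), and sumAndMax shows functional decomposition (different work running as separate tasks) blended with it.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <future>
#include <numeric>
#include <vector>

// Data decomposition: the same operation runs on disjoint slices of the data.
long long parallelSum(const std::vector<int>& values, std::size_t sliceCount) {
    assert(sliceCount > 0);
    const std::size_t sliceSize = (values.size() + sliceCount - 1) / sliceCount;
    std::vector<std::future<long long>> partials;
    for (std::size_t begin = 0; begin < values.size(); begin += sliceSize) {
        const std::size_t end = std::min(begin + sliceSize, values.size());
        partials.push_back(std::async(std::launch::async, [&values, begin, end] {
            const auto first = values.begin() + static_cast<std::ptrdiff_t>(begin);
            const auto last = values.begin() + static_cast<std::ptrdiff_t>(end);
            return std::accumulate(first, last, 0LL);
        }));
    }
    long long total = 0;
    for (auto& partial : partials) {
        total += partial.get();
    }
    return total;
}

struct Stats {
    long long sum;
    int maxValue;
};

// Functional decomposition: two different operations run as separate tasks.
Stats sumAndMax(const std::vector<int>& values) {
    assert(!values.empty());
    auto maxTask = std::async(std::launch::async, [&values] {
        return *std::max_element(values.begin(), values.end());
    });
    const long long sum = parallelSum(values, 4);  // overlaps with maxTask
    return Stats{sum, maxTask.get()};
}
```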

Ed Holzwarth and I designed Floodgate in 2005/2006 to deal with many of these issues on PS3 & Xbox 360. I'm pleased to find our approach has positioned us well for upcoming hardware architectures we didn't know about then (CUDA, Larrabee). If you'd like more info on Floodgate, for now I'll just send you to some marketing material and a white paper. Also, much credit to those who actually implemented and maintain the system: David Asbell, Stephen Chenney, Michael Noland, Dan Amerson, & Joel Bartley (sorry if I missed someone).

Thursday, August 14, 2008

Siggraph 2008: Top Picks to Follow Along

Siggraph is still underway, and it’s a great year to be a game developer at Siggraph.

I’ve already got my pick for the three best items for a game developer to look at, and you can do so now:

Advances in Real-Time Rendering in 3D Graphics and Games Course

Halo 3 lighting
Virtual Textures
Simulation and Rendering Massive Crowds on GPU
Wavelets with Current and Future Hardware
Starcraft Effects & Techniques

Beyond Programmable Shading Course

GPU architecture: history, current cards, future direction
Larrabee architecture and programming techniques
Writing data & task parallel algorithms for tightly coupled computation and graphics
Compute API discussions
Examples of parallel applications in games and graphics

Larrabee Paper

If you weren’t able to make it to a presentation, I highly recommend you check out Siggraph Encore, where you can purchase videos of the presentations.

Friday, August 8, 2008

Heading to Siggraph 2008

Siggraph 2008 is here! I'm not ready!

Monday, I'll be presenting. If you missed my Gamefest presentation on Parallel Rendering with DirectX 9, now's your chance. ;)

Advances in Real-Time Rendering in 3D Graphics and Games: Part 1 & Part 2
- Loads of excellent presentations for game developers
- I'll be presenting at ~11:30
- Monday, 8:30 am - 12:15 pm, 3:45 pm - 5:30 pm

I'll be focusing a lot on the future of game systems.... Larrabee is making waves there, but so are others. Here are some events I'm prioritizing:

Parallelism Papers
- Larrabee paper is there
- Tuesday, 10:30 am - 12:15 pm
Beyond Programmable Shading class: Fundamentals & In Action
- Excellent topics relevant for future systems such as Larrabee ... (and other unannounced systems)
- Thursday, 8:30 am - 12:15 pm, 1:45 - 5:30 pm

I wanted to review plenty of papers before going, but I just didn't get time. So, of course I'll hit

Fast-Forward Technical Papers Preview
- Monday, 6 - 8 pm

What are your must see sessions?

Terminology Rant

Siggraph is coming up, and it’s a great intro to this post! Siggraph is pronounced “sig-raph”, not “see-graph”. (The “si” is pronounced the same as in “sigma”).

Yes, that’s right, this is a terminology rant!

Some mis-used terms just get to me… but posting on just BiNormal and Zoom wasn’t enough, so I invited some friends to add to the pile. ;) I got more responses than I anticipated. Several overlapped from multiple people, so here’s the list:

BiTangent vs BiNormal

A BiNormal is defined in calculus as the cross product of a curve's Tangent and Normal. BiNormal is frequently misused in graphics when people need a basis to use on a 2D manifold surface. In that case, there is only one normal, but an infinite set of tangents. Normal mapping typically uses the tangents oriented by the u or v parameterization on the surface. NBTs are really “Normal, BiTangent, Tangent” sets.
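As an illustration, here is a common way to derive the two surface tangents for one triangle from its UVs, sketched in C++. The type and function names are mine, and the math is the usual textbook solve of e1 = T*du1 + B*dv1, e2 = T*du2 + B*dv2. Note that both results lie in the surface, which is why the second one is a bitangent.

```cpp
#include <cassert>
#include <cmath>

struct Vec2 { float u, v; };
struct Vec3 { float x, y, z; };
struct TangentFrame { Vec3 tangent; Vec3 bitangent; };

static Vec3 sub(Vec3 a, Vec3 b) { return {a.x - b.x, a.y - b.y, a.z - b.z}; }

static Vec3 normalized(Vec3 a) {
    const float len = std::sqrt(a.x * a.x + a.y * a.y + a.z * a.z);
    assert(len > 0.0f);
    return {a.x / len, a.y / len, a.z / len};
}

// Tangent follows increasing u, bitangent follows increasing v.
TangentFrame computeTangentFrame(Vec3 p0, Vec3 p1, Vec3 p2,
                                 Vec2 uv0, Vec2 uv1, Vec2 uv2) {
    const Vec3 e1 = sub(p1, p0);
    const Vec3 e2 = sub(p2, p0);
    const float du1 = uv1.u - uv0.u, dv1 = uv1.v - uv0.v;
    const float du2 = uv2.u - uv0.u, dv2 = uv2.v - uv0.v;
    const float det = du1 * dv2 - du2 * dv1;
    assert(det != 0.0f);  // degenerate UVs have no usable tangent frame
    const Vec3 t = {(e1.x * dv2 - e2.x * dv1) / det,
                    (e1.y * dv2 - e2.y * dv1) / det,
                    (e1.z * dv2 - e2.z * dv1) / det};
    const Vec3 b = {(e2.x * du1 - e1.x * du2) / det,
                    (e2.y * du1 - e1.y * du2) / det,
                    (e2.z * du1 - e1.z * du2) / det};
    return {normalized(t), normalized(b)};
}
```

For a triangle lying in the XY plane with u along X and v along Y, this gives a tangent along X and a bitangent along Y, while the surface normal is along Z.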

Zoom vs Dolly

“Zoom” is misused when people actually mean dolly in-out. They assume that “making something bigger on screen” means “zooming”. Zoom is a change in the field of view of a camera (by changing the focal length). This is definitely different from dolly… and we’d have no Alfred Hitchcock’s Vertigo without them both!
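The difference is easy to see in code. This C++ sketch assumes a simple pinhole camera with a horizontal field of view in radians; the function names are hypothetical. Zoom changes the FOV argument, dolly changes the distance argument, and the Vertigo shot changes both so that the subject stays the same size.

```cpp
#include <cassert>
#include <cmath>

// Width of the view volume at `distance` from the camera.
// Zooming changes fovRadians; dollying changes distance.
double visibleWidth(double distance, double fovRadians) {
    return 2.0 * distance * std::tan(fovRadians * 0.5);
}

// Dolly zoom: the camera distance at which a subject plane of `width`
// exactly fills the frame for the given field of view.
double dollyDistanceForWidth(double width, double fovRadians) {
    assert(fovRadians > 0.0 && fovRadians < 3.14159265358979323846);
    return width / (2.0 * std::tan(fovRadians * 0.5));
}
```

Narrow the FOV (zoom in) and dollyDistanceForWidth returns a larger distance (dolly out): the subject is framed the same, but the background perspective changes.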

Hardware vs. Software (Eric Haines)

I hear this all the time, and it does bug me: "we should run that on the hardware, using a pixel shader". So the GPU is "hardware"? What's the CPU, then? Corrupting the term is kind of pointless, so let's call it the GPU or the graphics accelerator or the graphics card or whatever, but save "hardware" to mean CPU or GPU (or all those other random electronic bits inside the box). The flip side is calling the CPU "software", as in, "well, we can't run it on a pixel shader, so we'll need to run it in software". The CPU and GPU are both controlled by software.

Clipping (Kevin Christensen)

The most annoying one for me. Reviewers like to use it and so does production or upper management. What they really mean is geometry penetrations between characters and other characters or characters and world geometry/objects. It doesn't affect gameplay at all but Indy got docked major points for it by IGN and other reviewers.

Maybe they are referring to the graphic engine clipping the character by the geometry? Not sure, they never really explain it. They usually say the character clips into the world or something lame.

Orthonormal Matrix (Eric Haines)

In the "don't taunt the mathematicians" category, "orthonormal matrix" is not a term most mathematicians use. A matrix composed from mutually perpendicular vectors, with all vectors normalized, in mathematics is normally called an "orthogonal matrix" - there is no term "orthonormal matrix". Well, there are a few rebellious mathematicians and their engineer lackeys who will daringly use "orthonormal matrix", especially after having a little too much sugar in their tea, but this is not a generally accepted term. It's illogical to me that such a matrix is "orthogonal" and not "orthonormal", since "orthonormal axes" and "orthonormal basis" is perfectly fine usage, but that's how it is in mathematics.
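Whatever you call it, the property is simple to test. Here is a small C++ sketch, with names of my own choosing, that checks whether the transpose times the matrix is the identity, i.e. whether the columns are mutually perpendicular and unit length.

```cpp
#include <array>
#include <cassert>
#include <cmath>

using Mat3 = std::array<std::array<double, 3>, 3>;  // row-major, m[row][col]

// True when transpose(m) * m equals the identity within `tolerance`.
bool isOrthogonal(const Mat3& m, double tolerance = 1e-9) {
    assert(tolerance >= 0.0);
    for (int i = 0; i < 3; ++i) {
        for (int j = 0; j < 3; ++j) {
            double dot = 0.0;  // dot product of column i and column j
            for (int k = 0; k < 3; ++k) {
                dot += m[k][i] * m[k][j];
            }
            const double expected = (i == j) ? 1.0 : 0.0;
            if (std::fabs(dot - expected) > tolerance) {
                return false;
            }
        }
    }
    return true;
}
```

A pure rotation passes this test. A matrix with perpendicular but unnormalized columns, such as a non-uniform scale, fails it.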

Frustum vs Frustrum (Eric Haines & Kevin Christensen)

It’s Frustum… not Frustrum

Bezier segments and B-spline (Bill Baxter)

[Some] seem to refer to a sequence of Bezier segments as a "B-spline". You can convert one to the other, but that doesn't mean they're the same thing!
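To show that converting is not the same as being identical, here is a C++ sketch for one segment of a uniform cubic B-spline. The names are hypothetical and the weights are the standard uniform cubic basis. The same curve needs different control points in each representation.

```cpp
#include <array>
#include <cassert>

struct Point2 { double x, y; };
using Segment = std::array<Point2, 4>;  // four control points

static Point2 blend(const Segment& p, double w0, double w1, double w2, double w3) {
    return {w0 * p[0].x + w1 * p[1].x + w2 * p[2].x + w3 * p[3].x,
            w0 * p[0].y + w1 * p[1].y + w2 * p[2].y + w3 * p[3].y};
}

// Uniform cubic B-spline segment: the curve generally misses its control points.
Point2 evalBSpline(const Segment& p, double t) {
    assert(t >= 0.0 && t <= 1.0);
    const double u = 1.0 - t, t2 = t * t, t3 = t2 * t;
    return blend(p, u * u * u / 6.0,
                 (3.0 * t3 - 6.0 * t2 + 4.0) / 6.0,
                 (-3.0 * t3 + 3.0 * t2 + 3.0 * t + 1.0) / 6.0,
                 t3 / 6.0);
}

// Cubic Bezier segment: the curve passes through the first and last points.
Point2 evalBezier(const Segment& b, double t) {
    assert(t >= 0.0 && t <= 1.0);
    const double u = 1.0 - t;
    return blend(b, u * u * u, 3.0 * u * u * t, 3.0 * u * t * t, t * t * t);
}

// Bezier control points that trace the same curve as the B-spline segment.
Segment bsplineToBezier(const Segment& p) {
    return Segment{{blend(p, 1.0 / 6.0, 4.0 / 6.0, 1.0 / 6.0, 0.0),
                    blend(p, 0.0, 4.0 / 6.0, 2.0 / 6.0, 0.0),
                    blend(p, 0.0, 2.0 / 6.0, 4.0 / 6.0, 0.0),
                    blend(p, 0.0, 1.0 / 6.0, 4.0 / 6.0, 1.0 / 6.0)}};
}
```

Feeding the B-spline control points straight into evalBezier gives a different curve; only the converted points reproduce the original.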

Phong shading (Eric Haines)

Phong shading: this term means two very different things. One usage is synonymous with "Phong interpolation", or per-pixel lighting. This used to be the main meaning of "Phong shading", vs. "Gouraud shading" (vertex interpolation). The other usage really means "Phong lighting" or "Phong illumination", and this is generally what is meant by "Phong shading" nowadays, as in "Blinn-Phong shading model". "Shading model" has come to mean "lighting model", vs. its ancient meaning of "type of interpolation". We still cope by using context: "Phong shading" usually means the specular-highlight cosine-lobe lighting model, but if we see the word "Gouraud" nearby we know it means interpolation instead. Must confuse newcomers, however, so it's probably better to say "Phong interpolation" if you have to say it at all, and best is probably "per-pixel lighting" and let Phong's association with interpolation die out.
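For the "lighting model" meaning, here is a short C++ sketch of the specular terms usually called Phong and Blinn-Phong. The function names are mine, and all input vectors are assumed to be unit length and to point away from the surface. Nothing in this code says anything about interpolation; where you evaluate it (per vertex or per pixel) is the separate question.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

struct Vec3 { double x, y, z; };

static double dot(Vec3 a, Vec3 b) { return a.x * b.x + a.y * b.y + a.z * b.z; }

static Vec3 normalize(Vec3 v) {
    const double len = std::sqrt(dot(v, v));
    assert(len > 0.0);
    return {v.x / len, v.y / len, v.z / len};
}

// Phong specular: reflect the light direction about the normal,
// then compare the reflection with the view direction.
double phongSpecular(Vec3 n, Vec3 l, Vec3 v, double shininess) {
    const double nDotL = dot(n, l);
    const Vec3 r = {2.0 * nDotL * n.x - l.x,
                    2.0 * nDotL * n.y - l.y,
                    2.0 * nDotL * n.z - l.z};
    return std::pow(std::max(dot(r, v), 0.0), shininess);
}

// Blinn-Phong specular: compare the normal with the half vector
// between the light and view directions.
double blinnPhongSpecular(Vec3 n, Vec3 l, Vec3 v, double shininess) {
    const Vec3 h = normalize({l.x + v.x, l.y + v.y, l.z + v.z});
    return std::pow(std::max(dot(n, h), 0.0), shininess);
}
```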

Dot3 Bump Mapping (Dan Amerson)

It's not bump mapping, it's normal mapping.

Texture vs Texture Map (Eric Haines)

[People will say,] "Let's apply a texture map of a brick wall here", when what really should be said is simply "texture" instead of "texture map". The "map" part of "texture map" refers to the function used to transform a surface location in space to a location on the texture.

Overload vs. override (Dan Amerson)

People generally mean override when referring to virtual functions in a subclass, but I routinely hear people use overload for that situation.
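A small C++ example makes the distinction hard to forget. The class and function names here are invented for illustration.

```cpp
#include <cassert>
#include <string>

struct Shape {
    virtual ~Shape() = default;
    virtual std::string name() const { return "shape"; }

    // Overloads: same function name, different parameter lists.
    // The compiler picks one at compile time from the argument types.
    std::string describe(int count) const {
        return std::to_string(count) + " x " + name();
    }
    std::string describe(const std::string& label) const {
        return label + ": " + name();
    }
};

struct Circle : Shape {
    // Override: same signature as the virtual function in the base class.
    // The version that runs is chosen at run time from the object's type.
    std::string name() const override { return "circle"; }
};

// Calls through a base reference still reach the override.
std::string nameThroughBase(const Shape& shape) {
    return shape.name();
}

// Small usage check for the example above.
void checkDispatch() {
    const Circle circle;
    assert(nameThroughBase(circle) == "circle");
}
```

The override keyword is optional in C++11 and later, but it makes the compiler complain if the signature doesn't actually match a virtual function in the base class.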

Thanks to contributors:
Dan Amerson
Bill Baxter
Kevin Christensen
Eric Haines (perhaps because he's an author of a book, he had a lot to offer. ;) more than I could use)

If you have your own favorites… comment away. ;) I’d love to hear them.