I disagree with comparing game behavior. Known good is a viable method of testing and is good for testing general issues offhand.
Games are not created equal: they do not utilise the available APIs or hardware equally, and they are not a valid method of testing.

I also use benchmarks like 3DMark which are even better (and a favorite of overclockers like myself).
Benchmarks are performance tests; they do not exercise every aspect of the hardware at the same time.

Even burn-in utilities only test individual aspects of the core at a given time. If the power relay is faulty, for example, most stability tools would not pick it up because they don't put full load on the system.
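
To make the "one aspect at a time" versus "everything at once" distinction concrete, here is a rough Python sketch that loads CPU, memory and disk together and checks each result for corruption. It is only an illustration: the function names are made up for this example, it does not touch the GPU or power delivery, and it is nowhere near what a real burn-in utility does.

```python
import hashlib
import os
import tempfile
import threading
import time


def cpu_worker(stop, errors):
    # Repeat a deterministic computation; a differing result hints at instability.
    block = b"x" * 1_000_000
    expected = hashlib.sha256(block).hexdigest()
    while not stop.is_set():
        if hashlib.sha256(block).hexdigest() != expected:
            errors.append("cpu: hash mismatch")


def memory_worker(stop, errors):
    # Copy a known 1 MiB pattern and compare the copy against the original.
    pattern = bytes(range(256)) * 4096
    while not stop.is_set():
        if bytes(bytearray(pattern)) != pattern:
            errors.append("memory: pattern mismatch")


def disk_worker(stop, errors):
    # Write random data to a scratch file and read it back.
    data = os.urandom(1 << 20)
    with tempfile.TemporaryDirectory() as scratch_dir:
        path = os.path.join(scratch_dir, "scratch.bin")
        while not stop.is_set():
            with open(path, "wb") as f:
                f.write(data)
            with open(path, "rb") as f:
                if f.read() != data:
                    errors.append("disk: readback mismatch")


def run_combined_load(seconds=1.0):
    """Run all workers at the same time (hypothetical helper for this sketch)."""
    stop = threading.Event()
    errors = []
    workers = [
        threading.Thread(target=worker, args=(stop, errors))
        for worker in (cpu_worker, memory_worker, disk_worker)
    ]
    for t in workers:
        t.start()
    time.sleep(seconds)
    stop.set()
    for t in workers:
        t.join()
    return errors  # an empty list means no mismatch was observed


if __name__ == "__main__":
    print(run_combined_load(5.0) or "no errors observed")
```

An empty result only means nothing failed during that short run under that particular mix of load; it says nothing about parts of the system the sketch never exercises, which is the same limitation described above.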

If even one game hangs a computer, it's not the game causing the hang (as user-mode applications cannot put the processor in such a state); the computer is just not perfectly stable.