ATI
introduced Stream Computing on Friday at an event
in San Francisco. Stream Computing is not a product
per se, but a class of applications that will
run on a GPU instead of a CPU, and luckily for
ATI, they will do it many times faster than the
latest and greatest CPUs out there.
Dave
Orton kicked things off with an overview of the
whole concept, and he put forward quite a realistic
outlook on what it can and cannot do. Far from
being the best thing for every task out there,
Orton started out saying how it may or may not
apply to your workload. If the problem you have
will map to a GPU, you can see speedups from 10x
to 40x, a massive increase. Doing the same math
purely on a CPU would take a very expensive computer.
A
modern GPU is built from clusters of shaders,
both pixel and vertex for DX9, adding in geometry
for DX10. Some of the problems map well to pixel
shaders, others to vertex shaders, but since neither
does exactly the same type of math, you can't
use both efficiently. The R600 class of GPUs will
bring unified shaders to the mix, which means each
shader can do it all.
If you look at it from a strict system architecture
point of view, a modern GPU is a cluster of very
fast micro cores connected with a frighteningly
high speed interconnect and supported with hugely
high speed memory. This is how Stream Computing
sees a GPU: as many mini math engines. A current
R580 class GPU has 375 Gigaflops of theoretical
performance available with 64GB/s of memory bandwidth.
The next generation will have over 500 Gigaflops
on tap, but for some reason, ATI would not go
into detail.
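One way to read those two numbers together is as a ratio: how much math the chip can do for every byte it pulls from memory. Here is a rough sketch, assuming the quoted bandwidth is meant as 64 gigabytes per second and treating both figures as theoretical peaks. The function name is made up for illustration.

```python
def flops_per_byte(peak_gflops, bandwidth_gb_per_s):
    """Peak floating-point operations available per byte of memory traffic."""
    return peak_gflops / bandwidth_gb_per_s


# Figures quoted for an R580 class GPU; the GB/s unit is an assumption.
r580_ratio = flops_per_byte(375.0, 64.0)
print(f"{r580_ratio:.1f} flops per byte")
```

A workload that does much less math per byte than this ratio will likely spend its time waiting on memory rather than on the math engines.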
The name Stream refers to the concept of pulling
a stream of data in, processing it and streaming
it out as one flow. If your problem supports it,
and you program things right, you will end up
with a nonstop flow of answers to the data you
feed in. If you do it wrong, you get data shuttled
back and forth, memory thrashing, and all sorts
of inefficient use of compute power. GPUs, while
many times faster than a CPU at these tasks, are
not in competition with it. You still need a CPU as
a controller and to do the many things a GPU simply
cannot accomplish. It really is a synergistic
relationship. Stream may lessen the need for more
CPUs, but won't eliminate it.
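The shape of that flow can be sketched in a few lines. This is only an illustration of the data flow, with plain Python generators standing in for the GPU. The stage names and the math in the kernel are invented for the example, and nothing here says anything about real performance.

```python
def stream_in(values):
    """Source stage: hand items over one at a time."""
    for v in values:
        yield v


def kernel(stream):
    """Processing stage: the same small piece of math applied to every item."""
    for v in stream:
        yield v * v + 1.0


def stream_out(stream):
    """Sink stage: collect the answers as they arrive."""
    return list(stream)


# Data goes in one end, gets processed, and comes out the other end
# without ever being handed back to an earlier stage.
results = stream_out(kernel(stream_in([1.0, 2.0, 3.0])))
print(results)
```

The point of the structure is that no stage ever sends data backwards, which is the property a well-written Stream program is trying to keep.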
The concept of Stream Computing is a cute technical
paradigm, but without real world applications,
what is the point? Well, as it turns out, Stream
has several sets of problems that work well on
it now, and will show immediate benefits as soon
as you put them on the GPU.
Dave Orton listed the main classes of problems
as Scientific Computing, Climate Research, Homeland
Security, Wall Street Risk Assessment, Seismic
Modeling and Search for the enterprise side. This
is where Stream is aimed, but it also maps very
well to several consumer applications, game physics
being the major one.
Scientific Computing is one of the big targets,
and it can reach the top end of the 10-40x speed
increase range. If coded right, HPC is a huge
win for the whole Stream concept. In one of the
later demos at the launch, ATI showed a real world
30x speedup on the first release of code.
Then comes climate research. You may have noticed
all the talk about hurricanes, tornadoes, wind
shear, and the difficulty of forecasting the weather
much farther out than 48 hours. The whole climate
field is a combination of a lot of science, huge
compute resources and a little black magic. We
are good at the macro things, we can tell you
that there is a hurricane at position XYZ, but
what is a little more nebulous is where it will
land, how it will turn, and other things that
may kill you or level your neighborhood. This
calls for an almost bottomless well of CPU power,
and luckily maps to Stream Computing well with
about a 20x speedup.
Homeland security is another obvious one. Things
like fingerprint and facial recognition are touted
by politicians as the next big thing even if many
of them cannot tell you how it will actually
make you more secure. Searching a database of
millions or billions of fingerprints takes horsepower,
and that translates into banks of computers or
long waits. If you are sitting in a security line
at an airport, the simple phrase 'please wait
sir, our computers are quite slow today' can be
akin to torture. Facial recognition takes orders
of magnitude more power than fingerprint recognition,
but luckily Stream brings massive speedups there
too.
Wall Street is another major target, and let me
say this bluntly, stock traders are crazy. They
have gone from basing value on how a company does,
to how it may do, to how people think that others
may think it will do. It is a huge series of what
if questions based on monumentally large data
sets.
If you spot a gap between the price a stock is at and
where it will move, you may have seconds to act
with millions of dollars at stake. If you are
a percentage point faster than the next guy at
figuring that trade off, your payouts can be bigger
than hitting the lottery. This crowd will throw
cubic dollars at a few percent speedup; 10x is
unheard of. Can you ask for more than rich clients
willing to spend almost anything for an advantage?
Sure you can, and another one of these is the
oil and gas industry. They are another group in
a cutthroat competition with others in the same
industry to find the next big oil field. With
oil in the $70 a barrel range, finding a large field
is worth more than many medium sized countries.
If it takes compute power to model the earth,
they will buy as much of it as they can get their
hands on.
Last come things like search: not pressing F3
in Windows, but Google. Search is a nearly perfect
Stream Computing application, you pull lots of
data in one side, pattern match it, and send it
out the other side. Google is currently compute
bound, or so many people say, and is delaying
new services until they can build out more rack
space. A 2x increase in speed would be invaluable
to Google; 20x is manna from heaven.
Dave Orton also had another example of how search
could benefit from Stream. Imagine you have all
your digital pictures on your PC, maybe 10 years
worth. You know you have this picture of yourself
with Bob near a boat, but was it in the summer
of '98 or '99, and what was the file name again?
With Stream, the compute power needed to answer
the query 'find me all pictures of Bob' becomes
available. You could do this on a CPU, but if
it takes a minute with Stream, it may be half
an hour on a CPU, and you are not going to use
it much with that kind of lag.
The
science of Stream Computing has two major components
that change how we look at much of computing.
The first is whether the problem fits the GPU, i.e.
can you phrase it in such a way that it works
with the accelerated math at your fingertips?
It may involve looking at the problem from a new
angle, or more likely using different algorithms.
If there is an algorithm which is 5x slower than
the best in class, but you can map that one to
a GPU for a 30x speedup, it is still a clear win.
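The arithmetic behind that trade is simple enough to write down. This is a back-of-the-envelope sketch using the 5x and 30x figures from the example above. The function name is hypothetical, and it assumes the two factors simply divide, with no transfer or setup costs.

```python
def net_speedup(algorithm_slowdown, gpu_speedup):
    """Overall gain when a slower algorithm is the one that maps to the GPU."""
    return gpu_speedup / algorithm_slowdown


# A 5x slower algorithm that gets a 30x boost on the GPU.
net = net_speedup(5.0, 30.0)
print(f"net speedup: {net:.0f}x")
```

Anything above 1x means the GPU-friendly algorithm wins despite being the slower one on paper.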
The
other hurdle is how you split the problem up into
GPU and non-GPU chunks. As ATI keeps pointing
out, if you do it badly, you can end up wasting
any advantage you might gain by going to the GPU.
Luckily there are tools that are finally hitting
the market which will ease a lot of the burden
there. Stream is in its infancy, but several
companies are putting it to very good use right
now.
With that in mind Dave Orton turned things over
to four partners who are using or are about to
use Stream Computing functionality. They are Folding@Home,
Peakstream, Microsoft and Havok. They are all
doing radically different things in the same way
with the same silicon.
Vijay Pande, Associate Professor of Chemistry
at Stanford University was the first, you may
recognize him from the Folding@Home project. You
probably have seen F@H in action at one time or
other; it models how proteins fold up in 3D space.
Proteins are long chains of amino acids, and they
interact with each other on an atomic level to
fold into various twisted knots. If it folds one
way, you have one set of functionality, folds
another way, you get a completely different effect
from the same protein. Imagine there are millions
of ways to fold a given protein, and orders of
magnitude more ways in which each of those folds
can happen.
This level of compute power is completely unattainable
by modern clusters, but Folding@Home makes a virtual
machine far larger than any you could hope to
buy now. About 2 million PCs participate in the
program, and it is about the equivalent of a 200,000
CPU supercomputer. One slide they put up showed the
geographic locations of people running F@H, and it
mapped almost perfectly to areas of the world
where there was electricity.
Luckily this problem maps very well to Stream
computing, and Vijay Pande is seeing a 20-40x
speedup with a copy of F@H ported to the X1900.
Problems that used to take 30 years to solve can now
be done in one year, giving hope to sufferers of
diseases like Alzheimer's. In a demo, the copy
running on the GPU was clicking off many frames
a second while the non GPU version was having
problems getting a single FPS.
The beta of Folding@Home for ATI GPUs will be
released on October 2, and they are aiming to
turn the project into a Petaflop computer by the
end of the day. If enough people join in, they
think a 10 Petaflop computer may be possible to
hit. This is all much more than theoretical, however.
Vijay said to expect some results from the program
in a few months, so all of you who have been contributing
will have something worthwhile to show for your
efforts.
The stage was then turned over to Peakstream and
its VP of marketing, Michael Mullaney. Peakstream
makes tools so you can write Stream Computing
code, debug and optimize it. There are several
parts to the Peakstream toolset, the two most
important are the virtual machine and the profiler.
Code written for the Peakstream VM can run either
on the CPU or the GPU, it is more of a tool to
write an application for and forget, the underlying
code hopefully does it all for you. With any luck,
the compilers can make all the hard choices and
you just fire and forget. If not, there is always
the profiler to lean on.
In addition to all the things a normal profiler
would do, spot slow and inefficient code points,
the Peakstream profiler does one thing of critical
importance, it can spot thrashing of data between
the host and the GPU. Earlier, I pointed out that
this was one of the big “no-nos” for getting
performance out of Stream Computing code. When you
are using cycles to shuttle things
back and forth or worse yet sitting around waiting,
you are not computing. I can see how anyone writing
GPU code would need this tool.
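To see why thrashing hurts so much, a toy cost model helps. This is not how the Peakstream profiler works, which the presentation did not detail. It is only a sketch with made-up costs in arbitrary time units, assuming each host-to-GPU transfer carries a fixed overhead on top of the copying and computing.

```python
def total_time(items, transfers, per_transfer_overhead,
               per_item_copy, per_item_compute):
    """Toy model: fixed overhead per transfer plus copy and compute per item."""
    return (transfers * per_transfer_overhead
            + items * per_item_copy
            + items * per_item_compute)


ITEMS = 10_000
# Made-up costs, chosen only to illustrate the effect.
OVERHEAD, COPY, COMPUTE = 50.0, 0.1, 1.0

# Thrashing: every item makes its own round trip (one transfer in, one out).
thrashing = total_time(ITEMS, 2 * ITEMS, OVERHEAD, COPY, COMPUTE)
# Streaming: one bulk transfer in and one bulk transfer out.
streaming = total_time(ITEMS, 2, OVERHEAD, COPY, COMPUTE)

print(f"thrashing costs about {thrashing / streaming:.0f}x more time")
```

The compute work is identical in both cases. All of the difference comes from how often data crosses between the host and the GPU.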
One example Michael Mullaney talked about
was oil and gas exploration using
seismic waves. The concept is simple: you set
off an explosive on the surface, and it sends
out shock waves. The rate at which they propagate
out depends on the density of the material they
are moving through. With strategically placed
listening devices, you can literally map out the
subsurface structures with striking detail.
The
data isn't all that hard to collect, but turning
several pings into an accurate 3D model of the
world beneath your feet takes a huge amount of
number crunching. The data supplied is from the
company Hess, a large oil and gas exploration
outfit. In the results shown, the code running on the
GPU with Peakstream tools ran 16x faster than
on the CPU. If you have a cluster of 1024 CPUs
crunching away day and night on this code, you
can knock that down to 64 X1950s and save yourself
an immense amount of money, space and electricity.
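The cluster math works out as follows. This sketch assumes each GPU stands in for 16 CPUs, per the quoted speedup, and that the work scales linearly across cards, which a real cluster may not manage. The function name is made up for the example.

```python
import math


def gpus_to_match(cpu_count, speedup_per_gpu):
    """Whole GPUs needed to match a CPU cluster, rounding up."""
    return math.ceil(cpu_count / speedup_per_gpu)


# 1024 CPUs, each GPU doing the work of 16 of them.
gpus = gpus_to_match(1024, 16)
print(gpus)
```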
Many
of these apps are ravenous in their appetite for
flops. Mullaney said that ExaFLOPS were not nearly
enough, and that ZettaFLOPS could be put to good use. Because
of this, it probably won't make Hess's data center
any smaller, it will just increase the work it
can do. I don't think you will hear them complain
though.
There
were a few other examples mentioned in passing.
One is a homeland security application where they
listen to conversations in public places and pick
out keywords from the stream. Scary big brother
stuff, but probably a lot closer than you think.
Another
topic was about as far from the data center as
you can imagine, an undisclosed mobile military
signal processing application. Instead of using
a huge bulky laptop and having it crunch the numbers
with the speed of a sloth, the military can do
it on a much smaller laptop in less time. When
you are in a foxhole in a far off land, bullets
whizzing over your head, speed is more than a
theoretical problem.
HPC
is about a $9 billion market, and Peakstream is
uniquely positioned to provide a huge increase
in performance to the sector. When people are
floored by the 30-40% speed increase of Conroe
over an A64, imagine if you could show them a
10x boost? The fanboys would not know how to express
their joy on the forums.
On
that upbeat note, he turned things over to Chas
Boyd of Microsoft who works in the Graphics Platform
Unit. There were no big announcements from MS
this time around, just a few hints of things to
come. MS is actively coding for the GPU, and this
will show up more and more in Vista. The UI, Aero
Glass, is a good example of how it will work: you
don't know that it runs on the GPU, nor should
you really care. It just works.
Another
example is an upcoming photo editing program from
MS. It doesn't do anything that the older versions
did not, and certainly does not threaten Photoshop,
but it brings ease of use to the genre. If you
have ever applied a complex filter to a picture,
it happens pretty quickly in the preview window,
and takes a little longer to apply to the full
picture. With Stream Computing, MS can put slider
bars up on the right and those filters happen
in real time as you drag the slider. This would
be unheard of if you had a five second lag at
each step.
The other interesting demo involves the ever popular
speeding up of sorting algorithms. They had a
demo of a tree on a grassy hill, with each leaf
and blade of grass rendered correctly. To do this,
and to prevent polygons from passing through each
other like when you see the arm of a bad guy poking
through the door you are about to open, you need
to sort all the polygons in real time.
With
a humanoid character in a building, this isn't
much of a trick, even if many developers can't
seem to get it right. Doing it with all the blades
of grass in a field is a trick, or at least puts
an unacceptable burden on the CPU if it can be
done at all. Stream Computing can harness the
GPU to do the repetitive heavy lifting here, and
it makes animations of a tree rotating on grass
possible.
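For a feel of what sorting the polygons means, here is a minimal back-to-front depth sort, the idea behind the classic painter's approach. Microsoft did not say which algorithm the demo used, so treat this as an assumption. It runs on the CPU in plain Python, and the polygon layout, with larger z meaning farther from the camera, is invented for the example.

```python
def depth(polygon):
    """Average z of a polygon's vertices; larger z is assumed to be farther away."""
    return sum(z for _, _, z in polygon) / len(polygon)


def back_to_front(polygons):
    """Order polygons so the farthest is drawn first and the nearest last."""
    return sorted(polygons, key=depth, reverse=True)


# Three triangles at different distances, each a list of (x, y, z) vertices.
near = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0), (0.0, 1.0, 1.0)]
mid = [(0.0, 0.0, 5.0), (1.0, 0.0, 5.0), (0.0, 1.0, 5.0)]
far = [(0.0, 0.0, 9.0), (1.0, 0.0, 9.0), (0.0, 1.0, 9.0)]

draw_order = back_to_front([near, far, mid])
print([depth(p) for p in draw_order])
```

Three triangles are trivial. The trouble in the demo is doing the same thing for every leaf and blade of grass on every frame, which is the repetitive work the GPU is being asked to take over.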
MS
is legitimizing the concept for mainstream use.
Don't expect to see miracles or a modern game
running on a G965; just look for smoother transitions,
perkier effects, and things that used to lag happening
instantly. Ease of use is the key here.
Last
up was the one you have been waiting for, physics,
the first killer app of the Stream genre, and
who better to show it off than Havok? They had
three demos, two repeats of their Computex demos,
and a new one based on shooting cannon balls into
a Lego fortress.
Jeff
Yates of Havok started off with a brief history
of physics in computing, starting with Pong, moving
on to simple objects that bounced around in a
semi-real fashion. From there, it is on to the
future immersive world of full physics simulation
and Lord of the Rings style group combat. The
hope is that Stream Computing can get you there
quicker than any other technique.
Without
an official announcement, he pointed out that
HavokFX would run on ATI cards, and they did quite
well at the physics game. Everything in a game
is starting to move toward using physics at the
core, and Havok is there to help. OK, this may
be a little biased as they are a physics middleware
company, but I can see their point.
A
CPU can handle many objects in a game, but not
a flow of boulders bouncing down a hill needing
tens of thousands of interactions here and there.
Stream on a GPU can simulate from 1000 to 10,000
objects, and Jeff Yates says a single GPU is worth
about 1000 CPUs in this regard. I can just see
the next Intel physics demo at Spring IDF: the
1001 CPU cluster for gaming, who needs a GPU anymore?
In
addition to the brick wars game, where each castle had
thousands of blocks, they showed off cloth simulations
in real time. This isn't particularly hard to
do, nor does it add much to gameplay, but as far
as immersion in a game world goes, it makes a
big difference.
Everyone
is getting into the physics on a GPU game, from
game developers to ATI and Nvidia, it is only
a matter of time and a bit of experimentation
before it becomes pervasive. The whole GPU physics
vs PPU card is still an open question with no
clear leader in sight.
From
Left to Right: Chas Boyd of Microsoft, Jeff Yates
of Havok, Vijay Pande of Stanford University,
Michael Mullaney of Peakstream, and Dave Orton
of ATI.
With that, all of the players came back on stage
for a Q&A session. Most of the questions focused
on two topics: convergence of GPUs and IEEE
compliant floating point ops, and the making of
a GPU with the graphics functions cut out to be
a Stream co-processor. The short answer to the
IEEE floating point question was no, they are
not IEEE compliant, nor will they be soon, but
are definitely moving in that direction. The answer
to the graphics-free GPU question was a more emphatic no wrapped
in a no comment. Basically, the whole appeal of
Stream Computing is you take something that is
already there and put it to wider use. A specialized
co-processor is not generalized, nor can it be
assumed to be there, so it probably won't happen.
Overall,
the day went well. ATI was not giving out specifics,
nor were they saying anything that was not already
out there in the press. What they did do, and
did it quite effectively, was to point out that
this whole Stream Computing concept is out there,
has serious momentum, and is only gaining ground.
There are multiple companies using it in currently
available products, and the list is growing longer
every day.
Stream
provides direct and measurable benefits to many
classes of users as long as your problem fits
the paradigm. It can give you an order of magnitude
speed boost in an era of incremental advances,
and it is hard to overstate the importance of that.
If your app doesn't fit, R600 is just around
the corner, and unified shaders with DX10 may
widen the applicability range more than many people
expect.