comparing GPUs to CPUs isn't fair

In my previous video, I talked about why CPUs cannot have thousands of cores. While this is true due to thermal, electrical, and memory limitations, a lot of the comments on that video pointed out that GPUs have thousands of cores. In this video, we discuss the subtle differences in GPU microarchitecture that make CUDA "cores" and CPU cores significantly different.

CPU cores are heavy computing machines that can process arbitrary input from users using arbitrary programs. Because of this, CPUs are more generalized. GPUs, on the other hand, are good at one thing: bulk processing of bulk data.

Comments

Don't forget that a CPU core also implements the entire x86/x64 instruction set while a shader core is only going to implement a much smaller and simpler instruction set. This is how they fit so many more cores on a GPU die in the first place.

CharlesVanNoland

I remember when NVIDIA did this Tegra presentation, and I had to cringe when they claimed they had the first 200-core (or something like that) mobile processor. They really just had a generic ARM design and a GPU and added those cores up like they were equivalent.

CjqNslXUcM

Basically, CPUs are optimized to minimize latency, while GPUs are optimized to maximize throughput (bandwidth).
At first glance those seem to imply the same thing, but they don't. You could get a result from the CPU in 1 ms but only process 10 items, while a GPU can process 10,000 items in 100 ms. You would expect that to mean 10,000/100 = 100 items per 1 ms, but that's not how GPUs work: you pay for the high throughput with latency.
It is nuanced, but once you understand it, the difference is actually night and day.

GPUs also aren't as flexible. The programs you write are inherently parallel; there is no std::thread kind of stuff. You write a scalar-like program that is "automagically" parallelized, so you have to think about parallel access from the get-go.
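
A minimal CUDA sketch of that "scalar-like program" idea (kernel name, sizes, and launch configuration are illustrative, not from the video): the kernel body reads like scalar code for a single element, and the launch configuration fans it out across thousands of threads.

```cuda
#include <cuda_runtime.h>

// The kernel body is written for one element; parallelism comes from the
// launch configuration, not from explicit std::thread-style code.
__global__ void scale(float *data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // this thread's element
    if (i < n)
        data[i] *= factor;                          // scalar-looking work
}

int main() {
    const int n = 1 << 20;                          // 1M elements (illustrative)
    float *d = nullptr;
    cudaMalloc((void **)&d, n * sizeof(float));     // (a real program would copy input in)
    scale<<<(n + 255) / 256, 256>>>(d, 2.0f, n);    // one thread per element
    cudaDeviceSynchronize();
    cudaFree(d);
    return 0;
}
```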

dexterman

The 4090 has 83 TFLOPS. It's the 4080 that has the 49.

Maxim_Espada

You should have shown the physical layout of a CPU core vs. a GPU core. The difference is clear that way: way more parts in the CPU core. They are very different and not even in the same realm.

BentonL

4:38 Former maintainer of Intel's OpenCL driver for Linux here: on Intel, the Y-branch threads would execute after the X branch has finished (reached the "else" statement) and block the X-branch threads until the end of the if/else. I'm not familiar with Nvidia, but I think they do the same.

Also, with AVX-512 the line seems to be blurring somewhat: AVX-512 has the same lane-masking capability as Intel's GPU ISA.
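
A toy CUDA kernel (identifiers invented for illustration) showing the divergence being described: when lanes of a warp disagree on the condition, the hardware executes one path with the disagreeing lanes masked off, then the other path, and the warp reconverges after the if/else.

```cuda
__global__ void divergent(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    if (in[i] > 0.0f) {
        out[i] = in[i] * 2.0f;  // "X" path: lanes with in[i] <= 0 sit masked off
    } else {
        out[i] = 0.0f;          // "Y" path: now the other lanes are masked off
    }
    // lanes reconverge here and the whole warp executes together again
}
```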

linnaea_lavia

This kind of parallelism is actually called Single Instruction, Multiple Threads (SIMT), as it is slightly different from Single Instruction, Multiple Data (SIMD). In fact, a warp can be SIMT, as explained in the video, and process multiple pixels at once (following every branch in unison), while each core can be SIMD and process a vec4 at a time, not just a float.
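
A rough illustration of those two levels (invented names; note that on NVIDIA hardware a float4 mainly buys a wider, vectorized memory access rather than a true SIMD ALU operation): the warp supplies the "multiple threads", and each thread carries a small vector of data.

```cuda
// Each thread owns one float4 (e.g. one RGBA pixel) while the warp as a
// whole still executes in SIMT lock-step across 32 such threads.
__global__ void scale_pixels(float4 *pixels, float gain, int count) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < count) {
        float4 p = pixels[i];                      // one 128-bit vectorized load
        p.x *= gain; p.y *= gain; p.z *= gain; p.w *= gain;
        pixels[i] = p;                             // one 128-bit vectorized store
    }
}
```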

naturallyinterested

A great explanation. Thank you!

As a basic analogy, you could put it like this: CPUs are like three architects/builders. They can do a large amount of complex work well, but they're limited in number, and therefore in efficiency when that complexity isn't required.

GPUs are like ant colonies: not smart enough to build wonders, but numerous enough to work quickly and efficiently on singular tasks.

aeureus

Good video. I'd like to add that programming GPUs in a way that approaches the advertised performance is rather difficult. You mentioned branches, but they also lack features (no call stack, no dynamic memory allocation), ideally need specific memory access patterns (look up memory coalescing and bank conflicts), have manually managed in-core caches, and important technical information is a trade secret (like their instruction sets).
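
A hedged sketch of two of those points (invented kernel, assumes a 256-thread block): adjacent threads read adjacent addresses so the loads coalesce, and the block stages data in shared memory, the manually managed in-core cache.

```cuda
__global__ void block_sum(const float *in, float *out, int n) {
    __shared__ float tile[256];                  // manually managed on-chip cache
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // Coalesced: thread k of a block reads element base+k, so a warp's
    // 32 loads fall into a few wide memory transactions.
    tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;
    __syncthreads();

    // Tree reduction entirely inside shared memory (no further DRAM traffic).
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride)
            tile[threadIdx.x] += tile[threadIdx.x + stride];
        __syncthreads();
    }
    if (threadIdx.x == 0)
        out[blockIdx.x] = tile[0];               // one partial sum per block
}
```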

soonts

Would really appreciate more videos in this style explaining these kinds of concepts.

danielray

You just shouldn't call GPU FP units cores; that's just a marketing term from NVIDIA. Shaders or FP32 units would be better names for what they are. The closest thing in an NVIDIA GPU to a CPU core would be something like an SM, and there are only about 128 SMs even in the highest-end GPUs.

lukas_ls

GPU cores also run at a lower clock speed, which allows packing more of them into a small chip.

aaron

Love the video. Really interesting and pretty simple to understand.

no-one

Great presentation!

Size / yield / energy: a big, clever CPU core is harder to manufacture. Scale a CPU out to the same number of cores as a GPU while keeping the complexity of the CPU, and you get insane power draw, low yield, and insane prices, like supercomputers.

One thing cut for time here: the memory architectures are very different because they are built for such different purposes. GPUs have specialized memory that shuffles a lot of data over very wide buses and reads a lot of closely aligned memory. CPUs have very narrow buses (with DDR5, 2x32 bits per stick). So a CPU can shuffle a lot of different data at the same time, while GPUs are good at shuffling a lot more of the same data. The GPU memory model is therefore bad for running multiple different programs at the same time. The literal hardware interfaces of the chips are built for extremely different purposes, an entirely different programming idea :)

randomgeocacher

I think it's useful to mention that GPUs will frequently and deliberately block on memory, as the memory subsystem is geared towards throughput, with little caching in the way of reducing access latency. Hence, an SM may theoretically switch context after every warp instruction.

michaelprantl

One thing to note: starting with Ampere, NVIDIA's CUDA cores come as an FPU plus a combined INT32/FP unit; they aren't completely split anymore. They did this to increase the max theoretical performance; however, there's almost never a case where no integer calculations are being run alongside, so that peak is rarely reached. It's really quite interesting actually. AMD has done the same thing with Navi 31, in the Radeon 7000 series. I guess it's a way to squeeze out some extra performance without increasing die size.

hufthenerd

Saying a GPU is faster than a CPU is like saying a rock is a much better car than a tennis racket.
Unless you have an explicit context and an exact specification of _what_ it is supposedly faster at, there is absolutely no point in even trying to reason about what such a statement means.

Finkelfunk

That branch blocking fits a GPU perfectly, because it short-circuits the computation path if the view of the object is blocked or the object is not in view.

patrickvolk

Nice explanation of warp scheduling and stuff, I used those ideas a lot in my path tracer

kylebowles

That's some deep level of knowledge.
Thank you, sir, for the info ❤️

oussamakhlif