Hardware trends: How many cores do we need?!



Sources and Credits:

- Videos from the AMD official YouTube channel were used for illustrative purposes; all copyrights belong to the respective owners, used here under Fair Use.
- Videos from the Intel official YouTube channel were used for illustrative purposes; all copyrights belong to the respective owners, used here under Fair Use.
- Videos from the NVidia official YouTube channel were used for illustrative purposes; all copyrights belong to the respective owners, used here under Fair Use.
- Videos from the Microsoft official YouTube channel were used for illustrative purposes; all copyrights belong to the respective owners, used here under Fair Use.
- Videos from the Samsung official YouTube channel were used for illustrative purposes; all copyrights belong to the respective owners, used here under Fair Use.

- Amdahl's Law thesis based on a talk by Sophie Wilson.

- A few seconds from several other sources on YouTube are used in a transformative manner, for educational purposes. If you haven't been credited, please CONTACT ME directly and I will credit your work. Thanks!!

#nvidia #amd #intel
Comments

Corporations may want the masses to use thin clients, but that is not going to happen. People do not want to be tied to the Internet and/or to a specific corporation for all of their needs. That system might work in the short term, but once all of the perpetual monthly subscriptions start adding up, it will turn more and more people off.


Personally, I enjoy having my own Home Theater system, my own security camera system, my own web server, my own compute farm, my own... whatever-I-want-to-build system.


As a 51-year-old musician and music producer, I also need larger displays, not bendable ones. I have never had a need for a display that bends, nor will I ever. I have always needed a larger portable display on which I can see musical notes and the details of my DAW.

BoDiddly

"glue two 9900k if they have too" had me rolling
Great video as always.

thanu

You're ignoring a very important part of this question. While you can't scale performance infinitely, you also don't have to. Take the ray-tracing question as an example. At peak performance the system will be running fast enough to let you apply this algorithm to an entire scene in real time, without sacrificing performance, while leaving resources for other operations that may be able to use this hardware. In other words, you don't need infinite scaling performance for all operations; you need enough compute resources to reach the top of the S-curve for every single workload you may want to run. That's your point of optimal efficiency.

The way this scales is by opening up the possibility of using those resources for additional workloads. So for instance, those extra ray-tracing cores could potentially be used for other physics simulations with minimal tweaking.

Next up, there's the fact that you're over-generalizing how programs are parallelized. It's somewhat pointless to talk about Chrome being 50% parallelizable, because Chrome is an interwoven mesh of systems, some of which don't have any need to operate in parallel. Instead it's more accurate to look at individual subsystems, how well they can be parallelized, and what sort of benefits that would yield. This means that while Chrome might only have 50% of code that is parallelizable, consider what happens if you break it down into 100 subsystems, of which the 20 most computationally intensive can be optimized to the 95% level. The result could be that compute-heavy operations that previously took a second could take 100ms, while sequential operations that previously took 100ms would stay at 100ms. On average the speed increase would be quite small, but the actual perceived performance impact would be huge, simply because analyzing averages means you're not fully capturing the impact of the outliers, which are much easier to perceive as a user of the system.
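
A rough sketch of that per-subsystem view, using Amdahl's formula on a single hypothetical hot subsystem (the 95% figure and the 16-core count below are just illustrative assumptions, not numbers from the video):

#include <stdio.h>

// Amdahl's law: speedup of one workload in which a fraction p of its
// runtime benefits from running on n cores.
static double amdahl(double p, double n) {
    return 1.0 / ((1.0 - p) + p / n);
}

int main(void) {
    double p = 0.95;         // assumed parallel fraction of the hot subsystem
    double n = 16.0;         // assumed core count
    double hot_ms = 1000.0;  // compute-heavy subsystem, before
    double seq_ms = 100.0;   // purely sequential subsystem, before

    printf("hot subsystem: %.0f ms -> %.0f ms\n", hot_ms, hot_ms / amdahl(p, n));
    printf("sequential subsystem stays at %.0f ms\n", seq_ms);
    return 0;
}

The whole-program average barely moves, but the worst-case interaction gets roughly ten times faster, which is the part users actually notice.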

In other words, applying these algorithms to modern multi-modular systems doesn't give you a very accurate picture of the benefits of large-scale parallel hardware. You can do a lot of math to show that the average benefits have a hard cap, but in the process you ignore the actual perceived differences that come from having hardware available to handle peak loads that would otherwise cause very noticeable delays. You also ignore the fact that, given more resources, programmers will be able to write more systems that utilize those resources, providing services that are currently not feasible because the average parallel compute power of most users is too low to support them.

Edit: Also, the math for ray-tracing doesn't really add up. Something like 95% is a very poor estimate of the parallelization factor. Consider this scenario: there are 4 million rays to trace. Each trace takes 1000 cycles. After all the rays are traced there is a combination step that takes 4 million cycles to combine the results together. Let's say that each trace can be run in parallel, while the final step for some reason can not. In this case the total work is 4,004,000,000 cycles, of which 4,000,000,000 can be run in parallel. The p value is then actually 99.9%, which is quite a ways off from your 95% estimate, while the speedup of the affected section is potentially capped at 4,000,000x, if all of the rays could be run in parallel. That really changes the results of your calculation to a very significant degree, to the point where the actual latency improvement of the whole program caps out at nearly 1000x.
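
For reference, a minimal sketch of that arithmetic, using the commenter's example numbers (4 million rays, 1000 cycles per trace, a 4-million-cycle serial combine step):

#include <stdio.h>

int main(void) {
    double rays    = 4e6;  // rays to trace
    double per_ray = 1e3;  // cycles per trace (parallelizable)
    double combine = 4e6;  // cycles for the final combine step (serial)

    double parallel = rays * per_ray;      // 4,000,000,000 cycles
    double total    = parallel + combine;  // 4,004,000,000 cycles
    double p        = parallel / total;    // ~0.999

    // Amdahl's limit: even with infinite cores, the serial part remains.
    double max_speedup = 1.0 / (1.0 - p);  // ~1001x
    printf("p = %.4f, whole-program speedup caps at ~%.0fx\n", p, max_speedup);
    return 0;
}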

The disconnect comes from the fact that the p value is not really "how much of the program can be run in parallel"; it's more accurate to describe it as "how much of a program's execution time can benefit from parallel execution." That's an important distinction, because for many programs the vast, vast majority of the runtime is spent inside a very small amount of code. That means that even though most of the program may not be parallelizable, being able to run a small set of loops in parallel could yield huge boosts to the p value.

TikiTDO

Nobody will ever need more than -640k RAM- 32 cores

piotrfila

There's something hypnotic about hearing you talk about technology... Love your videos dude. Keep 'em coming!

stolz_ar

@ 6:22 I don't follow your logic. How does C=A+B translate into "MUL2W R4/R5, R0/R1, R2/R3 : LDL2 R0/R1, [X]"? I don't really know what instruction set you're using, but my guess is that the code snippet is intended to do a doublewide multiply from the register pairs R0/R1 and R2/R3 into R4/R5 simultaneously with a doublewide load of R0/R1 from the memory location given by [X]. The trick, of course, is that the clock cycle starts with R0/R1 holding the first operand of the multiply, and once the multiply has read the register pair it can safely be replaced by the contents of memory location [X]. Whatever, that's not the same at all as C=A+B!
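
If that reading is right, the snippet is a software-pipelining trick: operate on the values loaded in the previous iteration while the next load is issued in parallel. A loose C analogue of that overlap (the function name and the scaling operation are just my guesses at the intent, not anything from the video):

// Work on the previously loaded element while the next load is
// (conceptually) issued in the same cycle.
void scale_all(const long long *x, long long *out, int n, long long k) {
    if (n <= 0) return;                  // nothing to do
    long long cur = x[0];                // prologue: first load
    for (int i = 0; i < n - 1; i++) {
        long long next = x[i + 1];       // the "LDL2 R0/R1, [X]" half
        out[i] = cur * k;                // the "MUL2W ..." half, on last iteration's load
        cur = next;                      // recycle the register pair
    }
    out[n - 1] = cur * k;                // epilogue: last multiply
}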

The instructions:
LDL R0, [AddressA] : LDL R1, [AddressB]   ; the two loads are independent, so they can issue together
ADDW R2, R0, R1                           ; must wait for both loads
STL R2, [AddressC]                        ; must wait for the add
are data dependent and really can't be parallelized any further using the classical Von Neumann (CPU+memory) machine architecture. The best the programmer or compiler can do is find another independent expression to execute in parallel with the ADDW and STL.

Now if the instruction was coded as:
ADD [AddressC], [AddressA], [AddressB]
or something similar, then the three steps can be hidden by the architect, even to the point of executing the instruction, not in the CPU, but in the memory unit holding the three addresses. But still, no matter what, you can't add two numbers until they have first been fetched and fed into the input gates of an adder somewhere, and the result has to be computed before it can be stored, even if the architect cheats and makes each "clock cycle" include all those steps.

If every memory unit had its own CPU capabilities attached we could maybe get a lot more fine-grained parallelism going. Right now our computers are limited by the bottleneck between the CPU and main memory, which is why we keep caching more and more memory on the CPU chips. But still, caching just speeds up fetching and storing, it doesn't eliminate them.

It turns out that most of what a computer does is copy data conditionally from one location to another based on the values of that or other data. There are very few complex expressions involving multiplies, divides, square roots and other time consuming operations to be executed in most programs. Address indexing computations are actually the most common complex operation. You see a lot of "X[I] := A[J] >= B ? C[K] : D" or similar equivalents spread out over multiple statements.

dlwatib

Love your vids m8, always stoked to see another notification from Coreteks !

squirt

This is quality content. Really like your channel.

ianboltron

The problem that this video didn't touch on is that adding cores to CPUs is also bottlenecked by the buses connecting the cores to memory and other peripherals. The bus bottleneck is not a problem for computations that don't use a lot of data.

But graphics, for example, uses tons of data, so placing the CPU and GPU in the same package will create a huge bottleneck when they both fight for bandwidth on the same buses. A lot of GPU advances have been in using wider and wider buses and more ports on memory, allowing the same memory chips to be accessed over multiple buses.
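
A back-of-the-envelope sketch of that bandwidth pressure (every number below is a made-up, round figure for illustration, not a measurement of any real part):

#include <stdio.h>

int main(void) {
    // Hypothetical shared-bus budget for an integrated CPU+GPU package.
    double bus_gbs  = 50.0;   // assumed total memory bandwidth, GB/s
    double cpu_gbs  = 20.0;   // assumed CPU-side traffic, GB/s
    double frame_gb = 0.25;   // assumed GPU traffic per frame, GB

    double gpu_budget = bus_gbs - cpu_gbs;  // bandwidth left for the GPU
    printf("bandwidth-limited ceiling: ~%.0f fps\n", gpu_budget / frame_gb);
    return 0;
}

However fast the cores are, the frame rate can't exceed what the shared bus can feed them.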

steamsuhonen

I don't see how the parallelized code example shown should work. You cannot store the result without completing the computation. You cannot compute the calculation without reading the values from the registers. It cannot run simultaneously because these operations are dependent on each other. And I don't think raytracing follows Amdahl's Law; instead I think it scales nearly perfectly with cores. More cores = more rays per second = more FPS in pure raytracing games... You just have to increase memory bandwidth accordingly.

perschistence

This channel is almost too intelligent for YouToob.

Silicon & copper are taking their last gasps. The only way we're gonna see per-watt, Moore's law type performance increases again is when semiconductors are ditched (as well as Von Neumann architecture). I assume one of the National Laboratories already has prototypes working. We'll probably find out about it when the first DOD hardware using it gets shot down.

As for how many cores do we need? Well obviously, as many as we can afford... ;-)

andrewmcfarland

Just a clarification: Amdahl's law only applies to individual programs that have both sequential and parallel components. It does not apply to running entirely separate programs on separate cores, which is perfectly parallel, so multi-tasking can scale indefinitely. I know that the majority of users will never use their hardware to its full potential, but having more potential is still a very good thing when it comes to putting powerful tools in the hands of curious individuals. With more capable hardware comes more talent looking for ways to (ab)use it.

Anyway, I don't think there is any need for the CPU/GPU manufacturers to halt progress on parallelizing their hardware because they don't think many people can use it... that's the Intel bullshit of sitting on quad-core CPUs when there was a clear (though perhaps somewhat irrational) demand for more cores.

The software developers should always push the limits of their hardware, and the hardware manufacturers should always be looking to remove those limitations. Neither should ever assume that any number of cores will be enough... just let the market forces find the limits (if they even exist). I can easily imagine a future where we have 256 CPU cores, with 128 of them dedicated to a proper AI assistant that is far better at multi-tasking than we humans are.

nichogenius

I’m subscribed to 120 YouTubers and you are the only one for which I turn on the notifications bell.

Feelfroow

The one thing you left off is having multiple programs running all at once, even for common users: watching/streaming while gaming, listening to music, and compressing our pRon library to take up less space.

stolenlaptop

As an expert in the field of computers and parallel programming, I'd say your presentation is pretty simple. And the solutions are not going to be simple. Just as GPUs work with many "cores", so will future applications be written to use many cores/threads. And no, we programmers are actually taught in many places to do this type of programming. Sure it will take a while, but nobody would bother making the changes to games to use 16 or more cores until such processors appear. The games and other software will follow. Given how power scales with clock speed and thermals, this is obviously the only way we can go. It's not possible to make 10GHz processors.

TheWindyweather

Amdahl's law is pessimistic because it does not account for the fact that, as cores are added, developers will expand the scope of what the system can solve; Gustafson's law reflects this. For example, adding more cores may not increase the framerate in RT, but it could render the scene at a higher resolution at the same speed.
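
A quick sketch of the contrast between the two laws (the 95% parallel fraction and the core counts are just example values):

#include <stdio.h>

// Amdahl: fixed problem size, speedup = 1 / ((1-p) + p/n)
// Gustafson: problem size grows with n, scaled speedup = (1-p) + p*n
int main(void) {
    double p = 0.95;  // assumed parallel fraction
    for (int n = 8; n <= 128; n *= 4) {
        double amdahl    = 1.0 / ((1.0 - p) + p / n);
        double gustafson = (1.0 - p) + p * n;
        printf("n=%3d  Amdahl %.1fx  Gustafson %.1fx\n", n, amdahl, gustafson);
    }
    return 0;
}

Amdahl caps the speedup of a fixed workload, while Gustafson assumes the extra cores go into a bigger workload, e.g. more rays or a higher resolution in the same frame time.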

tylovset

Yet another great video. I have only found your channel recently but man, your content is absolute gold. And even in these few short months, I can see from video to video that you are constantly working on improving production quality too, on top of delivering interesting, unique and insightful content. Recipe for greatness! I will be here to celebrate your one millionth subscriber in a few years. Keep it up!

Noobsaucer

The ray tracing algorithm itself is 100% parallel. I don't know where the 95% figure comes from. In practice the non-parallel part is access to scene data and filters on the final image. Actually it's almost identical to rasterization in terms of parallelization, but you have things like de-noising instead of anti-aliasing. Rasterization scales almost linearly to thousands of cores; there's no reason to think ray tracing won't.

senoctar

Excellent discussion of trends & practical benefits. Thank you.

robertkopp

Gaming: for GPUs we need more. For CPUs, we can't tell for sure, as it seems most games' recommended specs don't exceed Sandy Bridge SKUs.

TheRealBleach