Build Your Own GPU Accelerated Supercomputer - NVIDIA Jetson Cluster


#garyexplains
Comments

Fun fact: with 128 CUDA cores in a Nano, how many cores actually perform the square root operations in the program? Answer: zero. Yep, with the Nano being based on Nvidia's Maxwell architecture, not one of those 128 cores is capable of computing a square root directly. Instead the Nano's single Maxwell SM (streaming multiprocessor) comes with 32 SFUs (special function units) which are used to compute the square root. But even quirkier, these SFUs only know how to compute the reciprocal square root, as well as the regular reciprocal operation. So to get a square root the SFU will actually execute two instructions: a reciprocal square root, followed by a reciprocal. Strange but true! But actually documented in Nvidia's "CUDA C Programming Guide" in the section on "Performance Guidelines: Maximize Instruction Throughput".
Ah yes, the joys of having a day job as a CUDA programmer. You get to be gobsmacked every day by the weird ways you need to go about trying to optimize your programs to scrimp and save on every precious clock cycle :P
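For anyone who wants to see that two-step dance in code, here is a minimal, hypothetical CUDA sketch (kernel and variable names are invented, not from the video): the kernel computes a square root as the reciprocal of `rsqrtf()`, mirroring the rsqrt-then-reciprocal sequence the SFUs execute, and the host compares the result against plain `sqrtf()`.

```cuda
#include <cstdio>
#include <cmath>

// Hypothetical kernel, not from the video: computes sqrt(x) as 1 / rsqrt(x),
// mirroring the SFU's rsqrt-then-reciprocal instruction sequence.
__global__ void sqrtTwoStep(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = 1.0f / rsqrtf(in[i]);  // rsqrt on the SFU, then a reciprocal
}

int main() {
    const int n = 8;
    float hIn[n], hOut[n];
    for (int i = 0; i < n; ++i) hIn[i] = (float)(i + 1);

    float *dIn, *dOut;
    cudaMalloc(&dIn, n * sizeof(float));
    cudaMalloc(&dOut, n * sizeof(float));
    cudaMemcpy(dIn, hIn, n * sizeof(float), cudaMemcpyHostToDevice);

    sqrtTwoStep<<<1, 32>>>(dIn, dOut, n);   // one warp is plenty for 8 values
    cudaMemcpy(hOut, dOut, n * sizeof(float), cudaMemcpyDeviceToHost);

    for (int i = 0; i < n; ++i)
        printf("x=%g  two-step=%g  sqrtf=%g\n", hIn[i], hOut[i], sqrtf(hIn[i]));

    cudaFree(dIn);
    cudaFree(dOut);
    return 0;
}
```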

xenoaltrax

Greetings from near Albuquerque, New Mexico, USA. Thanks for all you do to bring various computing concepts, hardware, and software to your viewers. I want to leave a few comments about this video on Build Your Own GPU Accelerated Supercomputer.

When you take your square root problem and divide it into smaller and smaller but more numerous parts, that is called 'strong scaling' of a numerical problem. It implies that the problem size on each compute node keeps shrinking. Eventually, if the problem continues to be broken into smaller and smaller pieces, the node-to-node communication time imposed by the message passing interface (MPI) becomes dominant over the compute time on each node. When this happens, the efficiency of parallel computing can be really low. My point here is that your video shows that doubling the compute nodes halves the compute time. That scaling holds at first but cannot continue ad infinitum.

Another approach to parallel computing is to take a small, fixed-size problem on one compute node, then keep adding the same-size problem (expanding the compute domain) on additional compute nodes, all working on the same but now bigger problem. This is called 'weak scaling.' As one might guess, the performance and efficiency curves for strong and weak scaling are quite different.
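A minimal sketch of the sizing difference, assuming a generic MPI program (all names invented, not the video's simpleMPI): under strong scaling the per-rank share shrinks as ranks are added, while under weak scaling it stays fixed and the global problem grows.

```cpp
#include <mpi.h>
#include <cmath>
#include <cstdio>
#include <vector>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    const long TOTAL = 1L << 24;          // strong scaling: fixed global size...
    long strongLocal = TOTAL / nprocs;    // ...so the per-rank share shrinks
    long weakLocal   = 1L << 20;          // weak scaling: fixed per-rank size,
                                          // so the global problem grows with ranks

    // Work on the strong-scaling share (the video's toy workload: square roots).
    std::vector<float> data(strongLocal);
    for (long i = 0; i < strongLocal; ++i)
        data[i] = sqrtf((float)(rank * strongLocal + i));

    double localSum = 0.0, globalSum = 0.0;
    for (float v : data) localSum += v;
    // Communication like this reduction is what eventually dominates under
    // strong scaling, once each rank's share becomes too small.
    MPI_Reduce(&localSum, &globalSum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("ranks=%d  strong per-rank=%ld  weak per-rank=%ld  sum=%g\n",
               nprocs, strongLocal, weakLocal, globalSum);
    MPI_Finalize();
    return 0;
}
```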

As you know, but perhaps some viewers do not, getting the most out of Nvidia GPUs requires learning CUDA, which takes non-trivial effort. Its programming model is quite different from ordinary Python, Fortran, or C++. This is why Intel built its Xeon Phi co-processors around x86 cores instead of GPU cores, so that programmers could stay with their familiar languages, and AMD's high-core-count Threadripper CPUs appeal for the same reason. Software development time is much reduced when you don't have to learn CUDA to program the extra compute units. Adding CUDA on top of a typical codebase can significantly extend the time between the start of a software project and when the software actually runs correctly on a given platform.
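To make that model gap concrete, here is a hypothetical side-by-side (invented names) of the same element-wise square root written as an ordinary C++ loop and as a CUDA kernel with its launch-and-copy choreography:

```cuda
#include <cmath>
#include <cstdio>

// Ordinary C++: one loop, runs on the CPU.
void sqrtCpu(const float *in, float *out, int n) {
    for (int i = 0; i < n; ++i) out[i] = std::sqrt(in[i]);
}

// CUDA: the loop disappears; each thread handles one element, and the caller
// must manage device memory, data transfers, and the launch configuration.
__global__ void sqrtGpu(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = sqrtf(in[i]);
}

int main() {
    const int n = 4;
    float in[n] = {1, 4, 9, 16}, out[n] = {0};

    sqrtCpu(in, out, n);                        // CPU path: just call it

    float *dIn, *dOut;                          // GPU path: allocate, copy,
    cudaMalloc(&dIn, n * sizeof(float));        // launch, copy back
    cudaMalloc(&dOut, n * sizeof(float));
    cudaMemcpy(dIn, in, n * sizeof(float), cudaMemcpyHostToDevice);
    sqrtGpu<<<1, 32>>>(dIn, dOut, n);
    cudaMemcpy(out, dOut, n * sizeof(float), cudaMemcpyDeviceToHost);
    printf("sqrt(16) via GPU = %g\n", out[3]);

    cudaFree(dIn);
    cudaFree(dOut);
    return 0;
}
```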

In a nutshell, the plus side of all this is that GPUs are super fast for numerical computing; for data-parallel work they are hands-down faster than any x86 processor. The downside is the difficulty of structuring a problem to make proper use of them.

One more comment. For viewers interested in parallel computing, I highly recommend Open MPI as the Message Passing Interface implementation to use, as it is open source, actively developed, and easy to set up.
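If anyone wants to smoke-test an Open MPI install before trying the video's simpleMPI demo, a standard MPI hello-world is enough. The build and run lines in the comments are typical, not taken from the video:

```cpp
// Build and run (typical Open MPI usage; hostfile name is an example):
//   mpicxx hello_mpi.cpp -o hello_mpi
//   mpiexec -n 4 --hostfile hosts ./hello_mpi
#include <mpi.h>
#include <cstdio>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, nprocs, len;
    char host[MPI_MAX_PROCESSOR_NAME];
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    MPI_Get_processor_name(host, &len);
    printf("rank %d of %d on %s\n", rank, nprocs, host);  // one line per process
    MPI_Finalize();
    return 0;
}
```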

d.barnette

"We just take square roots. We're simple folks here."

**builds a supercomputer cluster with GPU acceleration 😎**

visiongt

Man's greatest achievement was working out how to do math faster than his mind would let him!!!

OperationsAndSmoothProductions

You're a very good teacher, because I'm a noob and I understood everything and learned a lot. I went from not knowing what a Jetson Nano was to learning about parallel computing and building supercomputers.
Thank you 👍

krazykillar

I really would like to build one of these. I followed an HPC course at uni and it fascinated me; being able to build a CUDA cluster for like 250€ is awesome!

fdx

You are the first one whose explanations I actually understand.

JuanReyes-ucmc

So fascinating. Wow. Thank you all, and the producer.

yelectric

Gary, can you make the GPUs and CPUs work together? And by the way, that was awesome.

dfbess

Fast Transform fixed-filter-bank neural nets don't need that much compute; moving the training data around is the main problem, and total system DRAM bandwidth is the main factor. That makes clusters of cheap compute boards potentially a better deal than one expensive GPU. For training you can use Continuous Gray Code Optimization. Each device holds the full neural model and part of the training set. Each device is sent the same short list of sparse mutations and returns the cost for its part of the training data. The costs are summed, and the same accept-or-reject message is sent to each device.
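Here is a rough, hypothetical MPI sketch of that communication pattern (not an implementation of Continuous Gray Code Optimization itself, and every name is invented): broadcast a sparse mutation, sum the per-shard costs with a reduction, and let every rank apply the same accept-or-reject decision.

```cpp
#include <mpi.h>
#include <cstdio>
#include <cstdlib>
#include <vector>

// Placeholder objective: cost of the current model on this rank's data shard.
// A real setup would evaluate the net on the local training examples.
double localCost(const std::vector<float> &model) {
    double c = 0.0;
    for (float w : model) c += w * w;
    return c;
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    std::vector<float> model(1024, 0.5f);   // every rank holds the full model
    double c0 = localCost(model), best;
    MPI_Allreduce(&c0, &best, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    for (int step = 0; step < 100; ++step) {
        // Rank 0 picks a sparse mutation; everyone applies the same one.
        int idx = 0;
        float delta = 0.0f;
        if (rank == 0) {
            idx = rand() % (int)model.size();
            delta = (rand() % 2) ? 0.01f : -0.01f;
        }
        MPI_Bcast(&idx, 1, MPI_INT, 0, MPI_COMM_WORLD);
        MPI_Bcast(&delta, 1, MPI_FLOAT, 0, MPI_COMM_WORLD);
        model[idx] += delta;

        // Sum the per-shard costs; every rank sees the same total.
        double local = localCost(model), total;
        MPI_Allreduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

        if (total < best) best = total;     // accept the mutation
        else model[idx] -= delta;           // reject: undo it on every rank
    }
    if (rank == 0) printf("final summed cost: %g\n", best);
    MPI_Finalize();
    return 0;
}
```

Because MPI_Allreduce hands every rank the same summed cost, the accept-or-reject "message" becomes an identical local decision rather than an extra broadcast, saving one round of communication per mutation.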

notgabby

9:53 OK, so if I understand correctly: time returns the number of seconds the program has run, mpiexec is the utility responsible for cluster management, and ./simpleMPI refers to a local binary which is then run across the cluster? 12:03 Also, when you say the Xavier GPU is more powerful, you mean the number of cores it has, right? Also, I would like to see a video from Professor Gary on Amdahl's law :)
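In the meantime, Amdahl's law fits in a comment: if only a fraction p of a program's runtime can be parallelized, N processors can never speed it up past 1/(1-p). A worked example follows, with an assumed p of 0.95 chosen just for illustration:

```latex
% Amdahl's law: if a fraction p of the runtime parallelizes across N processors,
%   S(N) = 1 / ((1 - p) + p/N)
% Worked example with p = 0.95 and N = 4 Jetson boards:
%   S(4) = 1 / (0.05 + 0.95/4) = 1 / 0.2875 ~= 3.48x
% Even as N -> infinity, the speedup is capped at 1 / (1 - p) = 20x.
\[
  S(N) = \frac{1}{(1 - p) + \frac{p}{N}},
  \qquad
  \lim_{N \to \infty} S(N) = \frac{1}{1 - p}
\]
```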

Flankymanga

*GARY!!!*
*GOOD MORNING PROFESSOR!*
*GOOD MORNING FELLOW CLASSMATES!*
Stay safe out there everyone!

MarkKeller

Hey Gary, thanks for this video. Awesome!

miladini

Hi @GaryExplains - fantastic video. Thank you for sharing your knowledge with the community.

I have a quick question. Given that the Jetson Nano used in this video is discontinued, what Jetson module would you recommend instead? Could this work with 4 Jetson Orin Nano modules (and would the Dev Kit be needed or could we just go with the module)? Thanks!

b

Yes, a video on Amdahl's law, please!

JoelJosephReji

Very cool! You forgot to mention that it takes about ~18W of power, right? Gary, can you please explain exactly how the Xavier NX unit can be used for video encoding? I know it runs Ubuntu Linux, so my question is: can it be booted directly off an SSD and used as a regular desktop PC, running one of the open-source editors such as Kdenlive, which, by the way, supports parallel video rendering?

naturepi

I will watch and study all your videos. I want to do more than just study; there's something I'd like to create, if possible. I'll try to reach out when I've finished studying all your videos. Would it be possible to ask a few questions, just to gain some knowledge? Great video. I don't know much about this yet, but I understood you. There's a lot to it, and I need help with my project.

christopherZisa

Could you make this into a render farm? That is separate from the question of whether it would be a good idea or even efficient.

audiblevideo

It would be great if you did videos covering all the details of setting up such a cluster in a Linux-based environment: what software, how to cable it all up, and so on.

KipIngram

What's the hardware rack you are using?

kovlabs