Build Your Own GPU Accelerated Supercomputer - NVIDIA Jetson Cluster


#garyexplains
Comments

Fun fact: with 128 CUDA cores in a Nano, how many cores actually perform the square root operations in the program? Answer: zero. Yep, with the Nano being based on Nvidia's Maxwell architecture, not one of those 128 cores is capable of computing a square root directly. Instead the Nano's single Maxwell SM (streaming multiprocessor) comes with 32 SFUs (special function units) which are used to compute the square root. But even quirkier, these SFUs only know how to compute the reciprocal square root, as well as the regular reciprocal operation. So to get a square root the SFU will actually execute two instructions: a reciprocal square root, followed by a reciprocal. Strange but true! But actually documented in Nvidia's "CUDA C Programming Guide" in the section on "Performance Guidelines: Maximize Instruction Throughput".
Ah yes, the joys of having a day job as a CUDA programmer. You get to be gobsmacked every day by the weird ways you need to go about trying to optimize your programs to scrimp and save on every precious clock cycle :P
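For anyone who wants to see that two-step dance in code, here is a minimal, hypothetical CUDA sketch (kernel and variable names are invented, not from the video): the kernel computes a square root as the reciprocal of `rsqrtf()`, mirroring the rsqrt-then-reciprocal sequence the SFUs execute, and the host compares the result against plain `sqrtf()`.

```cuda
#include <cstdio>
#include <cmath>

// Hypothetical kernel, not from the video: computes sqrt(x) as 1 / rsqrt(x),
// mirroring the SFU's rsqrt-then-reciprocal instruction sequence.
__global__ void sqrtTwoStep(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = 1.0f / rsqrtf(in[i]);  // rsqrt on the SFU, then a reciprocal
}

int main() {
    const int n = 8;
    float hIn[n], hOut[n];
    for (int i = 0; i < n; ++i) hIn[i] = (float)(i + 1);

    float *dIn, *dOut;
    cudaMalloc(&dIn, n * sizeof(float));
    cudaMalloc(&dOut, n * sizeof(float));
    cudaMemcpy(dIn, hIn, n * sizeof(float), cudaMemcpyHostToDevice);

    sqrtTwoStep<<<1, 32>>>(dIn, dOut, n);   // one warp is plenty for 8 values
    cudaMemcpy(hOut, dOut, n * sizeof(float), cudaMemcpyDeviceToHost);

    for (int i = 0; i < n; ++i)
        printf("x=%g  two-step=%g  sqrtf=%g\n", hIn[i], hOut[i], sqrtf(hIn[i]));

    cudaFree(dIn);
    cudaFree(dOut);
    return 0;
}
```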

xenoaltrax

Greetings from near Albuquerque, New Mexico, USA. Thanks for all you do to bring various computing concepts, hardware, and software to your viewers. I want to leave a few comments about this video on Build Your Own GPU Accelerated Supercomputer.

When you take your square root problem and divide it into smaller and smaller but more numerous parts, that is called 'strong scaling' of a numerical problem. It implies that the problem size on each compute node keeps shrinking. Eventually, if the problem continues to be broken into smaller and smaller pieces, the node-to-node communication time imposed by the message passing interface (MPI) becomes dominant over the compute time on each node. When this happens, the efficiency of parallel computing can be really low. My point here is that your video shows that doubling the compute nodes halves the compute time. That scaling holds at first but cannot continue ad infinitum.

Another approach to parallel computing is to take a small, fixed-size problem on one compute node, then keep adding the same-size problem (expanding the compute domain) on additional compute nodes, all working on the same but now bigger problem. This is called 'weak scaling.' As one might guess, the performance and efficiency curves for strong and weak scaling are quite different.
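A minimal sketch of the sizing difference, assuming a generic MPI program (all names invented, not the video's simpleMPI): under strong scaling the per-rank share shrinks as ranks are added, while under weak scaling it stays fixed and the global problem grows.

```cpp
#include <mpi.h>
#include <cmath>
#include <cstdio>
#include <vector>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    const long TOTAL = 1L << 24;          // strong scaling: fixed global size...
    long strongLocal = TOTAL / nprocs;    // ...so the per-rank share shrinks
    long weakLocal   = 1L << 20;          // weak scaling: fixed per-rank size,
                                          // so the global problem grows with ranks

    // Work on the strong-scaling share (the video's toy workload: square roots).
    std::vector<float> data(strongLocal);
    for (long i = 0; i < strongLocal; ++i)
        data[i] = sqrtf((float)(rank * strongLocal + i));

    double localSum = 0.0, globalSum = 0.0;
    for (float v : data) localSum += v;
    // Communication like this reduction is what eventually dominates under
    // strong scaling, once each rank's share becomes too small.
    MPI_Reduce(&localSum, &globalSum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("ranks=%d  strong per-rank=%ld  weak per-rank=%ld  sum=%g\n",
               nprocs, strongLocal, weakLocal, globalSum);
    MPI_Finalize();
    return 0;
}
```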

As you know, but perhaps some viewers do not, getting the most out of Nvidia GPUs requires learning CUDA, which takes non-trivial effort. Its programming model is quite different from ordinary Python, Fortran, or C++. This is why Intel built its Xeon Phi co-processors around x86 cores instead of GPU cores, so that programmers could stay with their familiar languages, and AMD's high-core-count Threadripper CPUs appeal for the same reason. Software development time is much reduced when you don't have to learn CUDA to program the extra compute units. Adding CUDA on top of a typical codebase can significantly extend the time between the start of a software project and when the software actually runs correctly on a given platform.
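To make that model gap concrete, here is a hypothetical side-by-side (invented names) of the same element-wise square root written as an ordinary C++ loop and as a CUDA kernel with its launch-and-copy choreography:

```cuda
#include <cmath>
#include <cstdio>

// Ordinary C++: one loop, runs on the CPU.
void sqrtCpu(const float *in, float *out, int n) {
    for (int i = 0; i < n; ++i) out[i] = std::sqrt(in[i]);
}

// CUDA: the loop disappears; each thread handles one element, and the caller
// must manage device memory, data transfers, and the launch configuration.
__global__ void sqrtGpu(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = sqrtf(in[i]);
}

int main() {
    const int n = 4;
    float in[n] = {1, 4, 9, 16}, out[n] = {0};

    sqrtCpu(in, out, n);                        // CPU path: just call it

    float *dIn, *dOut;                          // GPU path: allocate, copy,
    cudaMalloc(&dIn, n * sizeof(float));        // launch, copy back
    cudaMalloc(&dOut, n * sizeof(float));
    cudaMemcpy(dIn, in, n * sizeof(float), cudaMemcpyHostToDevice);
    sqrtGpu<<<1, 32>>>(dIn, dOut, n);
    cudaMemcpy(out, dOut, n * sizeof(float), cudaMemcpyDeviceToHost);
    printf("sqrt(16) via GPU = %g\n", out[3]);

    cudaFree(dIn);
    cudaFree(dOut);
    return 0;
}
```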

In a nutshell, the plus side of all this is that GPUs are super fast for numerical computing; for data-parallel work they are hands-down faster than any x86 processor. The downside is the difficulty of structuring a problem to make proper use of them.

One more comment. For viewers interested in parallel computing, I highly recommend Open MPI as the Message Passing Interface implementation to use, as it is open source, actively developed, and easy to set up.
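If anyone wants to smoke-test an Open MPI install before trying the video's simpleMPI demo, a standard MPI hello-world is enough. The build and run lines in the comments are typical, not taken from the video:

```cpp
// Build and run (typical Open MPI usage; hostfile name is an example):
//   mpicxx hello_mpi.cpp -o hello_mpi
//   mpiexec -n 4 --hostfile hosts ./hello_mpi
#include <mpi.h>
#include <cstdio>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, nprocs, len;
    char host[MPI_MAX_PROCESSOR_NAME];
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    MPI_Get_processor_name(host, &len);
    printf("rank %d of %d on %s\n", rank, nprocs, host);  // one line per process
    MPI_Finalize();
    return 0;
}
```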

d.barnette

"We just take square roots. We're simple folks here."

**builds a supercomputer cluster with GPU acceleration 😎**

visiongt

Man's greatest achievement was working out how to do math faster than his mind would let him!!!

OperationsAndSmoothProductions

You're a very good teacher, because I'm a noob and I understood everything and learned a lot. I went from not knowing what a Jetson Nano was to learning about parallel computing and building supercomputers.
Thank you 👍

krazykillar

I really would like to build one of these. I followed an HPC course at uni and it fascinated me; being able to build a CUDA cluster for like 250€ is awesome!

fdx

You are the first one whose explanations I actually understand.

JuanReyes-ucmc

So fascinating. Wow. Thank you all, and the producer.

yelectric

Gary, can you make the GPUs and CPUs work together? And by the way, that was awesome.

dfbess

Fast Transform fixed-filter-bank neural nets don't need that much compute; moving the training data around is the main problem, and total system DRAM bandwidth is the main factor. That makes clusters of cheap compute boards potentially a better deal than one expensive GPU. For training you can use Continuous Gray Code Optimization. Each device holds the full neural model and part of the training set. Each device is sent the same short list of sparse mutations and returns the cost for its part of the training data. The costs are summed, and the same accept-or-reject message is sent to each device.
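Here is a rough, hypothetical MPI sketch of that communication pattern (not an implementation of Continuous Gray Code Optimization itself, and every name is invented): broadcast a sparse mutation, sum the per-shard costs with a reduction, and let every rank apply the same accept-or-reject decision.

```cpp
#include <mpi.h>
#include <cstdio>
#include <cstdlib>
#include <vector>

// Placeholder objective: cost of the current model on this rank's data shard.
// A real setup would evaluate the net on the local training examples.
double localCost(const std::vector<float> &model) {
    double c = 0.0;
    for (float w : model) c += w * w;
    return c;
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    std::vector<float> model(1024, 0.5f);   // every rank holds the full model
    double c0 = localCost(model), best;
    MPI_Allreduce(&c0, &best, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    for (int step = 0; step < 100; ++step) {
        // Rank 0 picks a sparse mutation; everyone applies the same one.
        int idx = 0;
        float delta = 0.0f;
        if (rank == 0) {
            idx = rand() % (int)model.size();
            delta = (rand() % 2) ? 0.01f : -0.01f;
        }
        MPI_Bcast(&idx, 1, MPI_INT, 0, MPI_COMM_WORLD);
        MPI_Bcast(&delta, 1, MPI_FLOAT, 0, MPI_COMM_WORLD);
        model[idx] += delta;

        // Sum the per-shard costs; every rank sees the same total.
        double local = localCost(model), total;
        MPI_Allreduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

        if (total < best) best = total;     // accept the mutation
        else model[idx] -= delta;           // reject: undo it on every rank
    }
    if (rank == 0) printf("final summed cost: %g\n", best);
    MPI_Finalize();
    return 0;
}
```

Because MPI_Allreduce hands every rank the same summed cost, the accept-or-reject "message" becomes an identical local decision rather than an extra broadcast, saving one round of communication per mutation.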

notgabby

9:53 OK, so if I understand correctly: time returns the number of seconds the program has run, mpiexec is the utility responsible for cluster management, and ./simpleMPI refers to a local binary which is then run across the cluster? 12:03 Also, when you say the Xavier GPU is more powerful, you mean the number of cores it has, right? Also, I would like to see a video from Professor Gary on Amdahl's law :)
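In the meantime, Amdahl's law fits in a comment: if only a fraction p of a program's runtime can be parallelized, N processors can never speed it up past 1/(1-p). A worked example follows, with an assumed p of 0.95 chosen just for illustration:

```latex
% Amdahl's law: if a fraction p of the runtime parallelizes across N processors,
%   S(N) = 1 / ((1 - p) + p/N)
% Worked example with p = 0.95 and N = 4 Jetson boards:
%   S(4) = 1 / (0.05 + 0.95/4) = 1 / 0.2875 ~= 3.48x
% Even as N -> infinity, the speedup is capped at 1 / (1 - p) = 20x.
\[
  S(N) = \frac{1}{(1 - p) + \frac{p}{N}},
  \qquad
  \lim_{N \to \infty} S(N) = \frac{1}{1 - p}
\]
```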

Flankymanga

*GARY!!!*
*GOOD MORNING PROFESSOR!*
*GOOD MORNING FELLOW CLASSMATES!*
Stay safe out there everyone!

MarkKeller

Hey Gary, thanks for this video. Awesome!

miladini

Hi @GaryExplains - fantastic video. Thank you for sharing your knowledge with the community.

I have a quick question. Given that the Jetson Nano used in this video is discontinued, what Jetson module would you recommend instead? Could this work with 4 Jetson Orin Nano modules (and would the Dev Kit be needed or could we just go with the module)? Thanks!

b

Yes, a video on Amdahl's law, please!

JoelJosephReji

Very cool! You forgot to mention that it takes about ~18W of power, right? Gary, can you please explain exactly how the Xavier NX unit can be used for video encoding? I know it runs Ubuntu Linux, so my question is: can it be booted directly off an SSD and used as a regular desktop PC, running one of the open-source editors such as Kdenlive, which, by the way, supports parallel video rendering?

naturepi

I will watch and study all your videos. I want to do more than just study; there's something I'd like to create, if possible. I'll try to reach out when I've finished studying all your videos. Would it be possible to ask a few questions, just to gain some knowledge? Great video. I don't know much about this yet, but I understood you. There's a lot to it, and I need help with my project.

christopherZisa

Could you make this into a render farm? That is separate from the question of whether it would be a good idea or even efficient.

audiblevideo

It would be great if you did videos covering all the details of setting up such a cluster in a Linux-based environment: what software, how to cable it all up, and so on.

KipIngram

What's the hardware rack you are using?

kovlabs