AVX512 (3 of 3): Deep Dive into AVX512 Mechanisms



In this video we will explore the AVX512 mechanisms.

Links:

CPUID Wikipedia page (contains a list of all the instruction set feature flags):

Software used to make this vid:

Comments

How come you were using vmovupd instead of vmovups in the rounding example to load the single-precision floats? Is there any difference between the two, for example does vmovups require 4-byte alignment? Or do the vmovu* instructions remove all alignment requirements and expand to the same micro-ops?

nayjames
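A rough way to picture the question above: as far as the architecture is concerned, both vmovups and vmovupd are unaligned moves that copy bits without interpreting them, so an unmasked load gives the same register contents either way. Where the two visibly differ under AVX512 is masking, because the mask bits apply to 32-bit lanes for the "ps" form and 64-bit lanes for the "pd" form. The sketch below is only a byte-level model of that behaviour; `masked_move` is a hypothetical helper, not an intrinsic, and it says nothing about micro-ops.

```python
import struct


def masked_move(dst: bytes, src: bytes, mask: int, elem_size: int) -> bytes:
    """Model of a merge-masked vector move.

    Lane i is copied from src when bit i of mask is set; otherwise the
    old dst bytes are kept. elem_size is 4 for a "ps" move, 8 for "pd".
    """
    if len(dst) != len(src) or len(src) % elem_size != 0:
        raise ValueError("dst and src must be equal-sized whole vectors")
    out = bytearray(dst)
    for lane in range(len(src) // elem_size):
        if (mask >> lane) & 1:
            lo = lane * elem_size
            out[lo:lo + elem_size] = src[lo:lo + elem_size]
    return bytes(out)


# Four single-precision floats (16 bytes), values chosen for illustration.
src = struct.pack("<4f", 1.5, 2.5, 3.5, 4.5)
dst = bytes(16)

# All lanes enabled: the lane width does not matter, the bits are identical.
full_ps = masked_move(dst, src, 0b1111, 4)
full_pd = masked_move(dst, src, 0b11, 8)

# Only mask bit 0 set: "ps" moves one float, "pd" moves two floats' worth.
part_ps = struct.unpack("<4f", masked_move(dst, src, 0b1, 4))
part_pd = struct.unpack("<4f", masked_move(dst, src, 0b1, 8))
```

In this model `full_ps` and `full_pd` are the same bytes as `src`, while `part_ps` and `part_pd` differ in how many floats were written.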

Oh man, I can't wait to see some AVX1024 registers 😆

PunmasterSTP

Watching a video like this makes me understand how CPUs keep gaining millions upon millions of transistors. The muxing, control lines, registers and logic in general to implement all of these instructions, things like broadcasting etc would just keep piling on the transistors..!

And the detail in that koala drawing ... 🤯

TomStorey

Oh, wasn't expecting the art montage at the end. Appreciate it all the same with the series 🤭

_lapys

Your dad is an absolute legend indeed!

stijnkuipers

Interesting stuff. Those masks are quite fascinating. Also love your dad's artwork. Very talented.

NeilRoy

8:21 - The only thing special about k0 is that you can't use it in a lot of instructions. The *encoding* "000" is used to mean "no mask". It doesn't *read* k0 when you do that, it just is hardwired to "no mask". That's why changing the contents of k0 doesn't change that behaviour - it never actually reads the register. But because the encoding is reserved, it also means you can't use k0 for most instructions. It's a perfectly good normal register, it's just the *encoding* for most vector instructions is reserved, so you can only really use it in the other mask-register instructions - as a temporary or things like that.

Annoyingly, some assemblers allow you to use the "{k0}" syntax, which is technically illegal. Because again - the instruction doesn't read k0! They should produce an error, but they don't.

tom_forsyth
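The point about k0 can be sketched as a toy model: the 3-bit mask field of the encoding selects k1..k7, and the value 000 is treated as "no mask" without ever reading k0. The function and variable names here (`masked_add`, `kregs`, `kfield`) are made up for illustration, and the model assumes merge-masking on a 4-lane vector.

```python
def masked_add(dst, a, b, kregs, kfield):
    """Toy model of a merge-masked vector add.

    kfield is the 3-bit mask field from the instruction encoding (0..7).
    A field of 0 means "no mask": every lane is written and kregs[0]
    is never consulted. Fields 1..7 read the matching mask register.
    """
    if not 0 <= kfield <= 7:
        raise ValueError("mask field is only 3 bits wide")
    if kfield == 0:
        mask = (1 << len(dst)) - 1      # hardwired: all lanes enabled
    else:
        mask = kregs[kfield]            # normal case: read k1..k7
    return [
        x + y if (mask >> lane) & 1 else old
        for lane, (old, x, y) in enumerate(zip(dst, a, b))
    ]


kregs = [0] * 8
kregs[1] = 0b0101                       # enable lanes 0 and 2 only

dst = [0, 0, 0, 0]
a = [1, 2, 3, 4]
b = [10, 20, 30, 40]

unmasked_before = masked_add(dst, a, b, kregs, 0)
kregs[0] = 0b0001                       # writing k0 is allowed...
unmasked_after = masked_add(dst, a, b, kregs, 0)   # ...but changes nothing here
with_k1 = masked_add(dst, a, b, kregs, 1)
```

Changing `kregs[0]` has no effect on the unmasked result in this model, which matches the behaviour described in the comment: k0 holds whatever you put in it, but the 000 encoding does not look at it.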

We ain't in Kansas anymore, Toto. Loved this trilogy, thank you. The k-masks, the compressed displacement, broadcasting, the register files, all of it is exciting. I have experimented with SIMD since you first introduced us to SSE. I suspect that the power of these instructions will only really be experienced after a paradigm shift in the way we structure data. The classic vision of data in records (structs, classes, etc.) has served us well with the classic architectures. These revolved around pointers and pointer arithmetic (shock! horror! bare naked pointers are at the foundation of it ALL). The new architecture is less friendly to the mixing of numerical, textual and bit-field data. It thrives on sequential lists of data all of the same type. So, data currently stored thus: Name: Michael, Age: 69, Salary: beyond your wildest dreams; Name: Creel, Age: ... etc. will need to be stored as Name: Michael, Creel; Age: 69... etc. I, perhaps, need a lot more examples of data for clarity, but the idea is that each numerical field can be accessed as one long array. Why? When one isn't number crunching enterprise amounts of data, the overhead of gathering the numerical data from classic records can erase the advantage of these powerful instruction sets. It is not an obstacle, just a different view of your data.

I really like your Dad's pictures. I can see why you are so proud of him. Stay healthy.

willofirony
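The record-versus-column layout described above is usually called array-of-structs versus struct-of-arrays. A minimal sketch, assuming made-up values wherever the comment leaves them out (Creel's age and both salaries are placeholders, and `to_columns` is a hypothetical helper):

```python
from array import array

# Array-of-structs: one record per person, fields of mixed type interleaved.
records = [
    {"name": "Michael", "age": 69, "salary": 1000.0},   # salary is a placeholder
    {"name": "Creel", "age": 40, "salary": 2000.0},     # age and salary are placeholders
]


def to_columns(rows):
    """Convert records into one contiguous, single-typed array per field."""
    return {
        "name": [row["name"] for row in rows],
        "age": array("i", (row["age"] for row in rows)),        # packed C ints
        "salary": array("d", (row["salary"] for row in rows)),  # packed C doubles
    }


columns = to_columns(records)

# Record layout: every access hops through a record to reach one number.
total_from_records = sum(row["salary"] for row in records)

# Column layout: the numbers sit next to each other in memory, which is
# the shape a vector load wants to consume.
total_from_columns = sum(columns["salary"])
```

Both totals are the same; the difference is that `columns["salary"]` is one contiguous block of doubles, so no gathering step is needed before the arithmetic starts.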

This CPU is capabale of ZVX521 Fondation instruction set!

AlFasGD

A really good picture of the koala. By the way, a great tutorial series on Intel CPU AVX512, all three parts.

imrank

Great intro to this instruction set! I'm a database guy, so not quite sure when I'll ever write my first assembly code, but your teaching style is so good that I can't help watching!

oresteszoupanos

Great video - it's interesting as a developer to learn some asm/intrinsics.

matsedv

Awesome, I received 11 points! The compressed displacement explanation and example was brilliant. And thank you for sharing your Dad's artwork.

robertzavala

Great video, great lesson about AVX512 mechanisms

KristianDjukic

man I love your videos so much :'D

lordadamson

Thanks! I was able to understand most of the stuff.

mikkoyliharsila

Really cool and awesome videos covering some of AVX512! Can you explore more of AVX512 in the future, especially the FMA instructions? CUDA with tensor cores boosts GEMM throughput to such a degree that the Ampere A100 now basically has more silicon for tensor cores than for the common CUDA cores. The GA100 actually has far fewer FP32 CUDA cores than the GA102 gaming/content-creation lineup (like the RTX A6000 or 3090). It would be interesting to see how an AVX512 FMA implementation of BLAS improves speed/throughput compared with AVX2 and with no AVX at all.

fdc

@Creel
20:24
Did you notice that it rounded myFloats[0] = 1.5 to 2, but myFloats[4] = 0.5 to 0?
I would consider that strange. If a value is x.5, I would always expect it to be rounded upwards.

OpenGLever
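The behaviour noticed above is what round-to-nearest with ties to even looks like: an exact x.5 goes to whichever neighbour is even, so 1.5 becomes 2 and 0.5 becomes 0. That is the default IEEE 754 rounding mode, and Python's built-in `round` happens to use the same tie rule, so it can stand in as a sketch. The values 2.5 and 3.5 are extra examples that are not from the video.

```python
from decimal import Decimal, ROUND_HALF_UP

values = [1.5, 0.5, 2.5, 3.5]

# Ties go to the even neighbour: this matches what the comment observed.
ties_to_even = [round(v) for v in values]

# "Always round x.5 upwards", shown for comparison. Decimal is used
# because the built-in round() has no half-up mode.
ties_up = [
    int(Decimal(str(v)).quantize(Decimal("1"), rounding=ROUND_HALF_UP))
    for v in values
]

print(ties_to_even)  # [2, 0, 2, 4]
print(ties_up)       # [2, 1, 3, 4]
```

Ties-to-even is generally preferred as a default because always rounding halves upwards pushes sums of rounded values upwards on average.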

AVX2 also has a broadcast instruction. Cool instruction.

lukehanscom

Hello mate. Will you make some videos on OpenCL and GPU programming? It would be a nice, interesting addition to your high-performance software computing guide.

kadiyamsrikar