MicroOps in the Pentium MMX

A short video about how the Pentium P5 and Pentium MMX use micro-operations compared with more traditional processors like the AMD K6, and the implications for the Pentium MMX's length pre-decoder.
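The length pre-decoding mentioned above can be illustrated with a toy sketch: because x86 instructions are variable-length, a pre-decoder must find instruction boundaries in the fetched byte stream before parallel decode can begin. The table below is a hypothetical, drastically simplified subset (real pre-decoders also handle prefixes, ModRM, SIB, and displacement bytes):

```python
# Toy x86 length pre-decoder: marks instruction boundaries in a byte
# stream. Only a handful of fixed-length encodings are recognised here;
# this is an illustrative simplification, not the actual PMMX circuit.

FIXED_LENGTHS = {
    0x90: 1,   # nop
    0x40: 1,   # inc eax
    0x48: 1,   # dec eax
    0x05: 5,   # add eax, imm32
    0xE9: 5,   # jmp rel32
}
for op in range(0xB8, 0xC0):
    FIXED_LENGTHS[op] = 5   # mov r32, imm32

def predecode(code: bytes):
    """Return (start offset, length) for each instruction in the stream."""
    boundaries = []
    i = 0
    while i < len(code):
        op = code[i]
        if op not in FIXED_LENGTHS:
            raise ValueError(f"opcode {op:#x} not in toy table")
        n = FIXED_LENGTHS[op]
        boundaries.append((i, n))
        i += n
    return boundaries

# mov eax, 1 ; add eax, 2 ; nop
print(predecode(bytes([0xB8, 1, 0, 0, 0, 0x05, 2, 0, 0, 0, 0x90])))
# → [(0, 5), (5, 5), (10, 1)]
```

Once the boundaries are known, two adjacent instructions can be steered to the U and V pipes in the same cycle, which is why the pre-decode step matters for dual issue.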

Chapters:
0:00 Intro and PMMX Overview
1:51 Comparison with K6
4:50 PMMX vs K6 uOP Throughput
6:50 Instruction Queue
7:37 uOp Decoding Analysis
9:46 Parameter Elimination
11:02 Opcode table and Simplification
Comments:

MicroOps? More like "Magnificent lectures with quality that's tops!" 👍

PunmasterSTP

This reminds me of Jim Keller's comment that x86 decode "isn't all that bad if you're building a big chip".

qfytidw

I really appreciate you taking the time to make these videos; they're very insightful and understandable even with limited design knowledge. Do you plan to do any videos on how SIMD is implemented on these processors? Your mention of microcode emulation piqued my interest.

billylaws

Excellent as always. I really like your idea of pages; they look like bitplanes for microcode.
* Some instructions can be fused together; could the decoder handle that with the help of a dedicated page?
* Some thoughts about hardware multiplication: using the FPU is tempting (the 80-bit extended-precision format has a 64-bit mantissa), but wouldn't it be funky register-wise? I read between 10 and 20 cycles for a 32x32 multiply on the Pentium, which seems pretty quick without dedicated hardware. I thought a shifter and an adder would take at least 32 cycles in the worst case.
* You point out that the U and V pipelines each have both a load and a store unit on the Pentium and MMX, while all the others (Pentium Pro included) have a load unit and a store unit, each with its own pipeline. I suppose that's due to the out-of-order architecture? Or the widening of the address space to 35-36 bits? About the implementation: depending on the peripherals (like PCIe or SATA on an Artix through LiteX), is it viable to keep a 32-bit address space, push directly to 48 bits to be future-proof, or something in between?
* In a pipelined architecture, doesn't the worst case depend largely on the addressing mode? For two instructions with immediate addressing it's one fetch for both instructions; direct mode adds two more reads, and indirect mode two more again. Cache or not, the CPU still has one memory port with limited bandwidth and latency. Where and how do you arbitrate all these memory accesses?

vincentvoillot