The 8087 emulation did not work exactly as described in footnote 1. Instead of writing 8087 instructions, the compiler wrote INT (software interrupt, often used for system or runtime calls) instructions. The first opcode byte of an 8087 instruction has only eight possible values, so you could, for example, use eight software interrupt vectors and encode the opcode in the second byte of the INT instruction.
If an 8087 was present, the interrupt handler simply patched the INT instruction in place, replacing it with an 8087 instruction. If the math coprocessor was absent, the interrupt handler instead decoded the subsequent instruction bytes from the instruction stream and performed software emulation. All this was needed because the 8088 and 8086 didn't have undefined opcode exceptions!
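To make the patching trick concrete, here is a minimal Python sketch that models the code bytes as a `bytearray`. The vector range 0x34 to 0x3B and the choice to overwrite the two INT bytes with WAIT plus the ESC opcode byte are my assumptions about how such a scheme could be laid out, not something stated above; the function names are made up for illustration.

```python
# Sketch of the INT-to-8087 fixup idea (assumed layout, not a verified ABI).
INT_OPCODE = 0xCD      # x86 "INT imm8"
WAIT_OPCODE = 0x9B     # x86 "WAIT"
FIRST_VECTOR = 0x34    # assumption: first of eight emulation vectors
FIRST_ESC = 0xD8       # the eight 8087 opcode bytes are 0xD8..0xDF


def encode_as_int(esc_opcode):
    """Compiler side: emit INT n in place of the 8087 opcode byte."""
    if not FIRST_ESC <= esc_opcode <= FIRST_ESC + 7:
        raise ValueError("not an 8087 opcode byte")
    return bytes([INT_OPCODE, FIRST_VECTOR + (esc_opcode - FIRST_ESC)])


def patch_in_place(code, offset):
    """Handler side, 8087 present: overwrite INT n with WAIT + opcode byte."""
    if code[offset] != INT_OPCODE:
        raise ValueError("no INT instruction at this offset")
    vector = code[offset + 1]
    if not FIRST_VECTOR <= vector < FIRST_VECTOR + 8:
        raise ValueError("not an emulation vector")
    code[offset] = WAIT_OPCODE
    code[offset + 1] = FIRST_ESC + (vector - FIRST_VECTOR)


# FLD1 is encoded as D9 E8; the compiler would emit CD 35 E8 instead.
program = bytearray(encode_as_int(0xD9) + bytes([0xE8]))
patch_in_place(program, 0)
```

After the patch the buffer holds `9B D9 E8`, so the next pass through this code runs the real coprocessor instruction with no interrupt overhead.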
The 8087's microcode ROM is very unusual because it stores two bits per transistor for higher density. The ROM uses four transistor sizes so each position outputs one of four voltages, which are converted to two bits.
(This is separate from the constant ROM, which is a normal one-bit-per-transistor ROM.)
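As a rough illustration of the two-bits-per-transistor idea, the sketch below maps one of four output levels to a pair of bits. The normalized voltages, the thresholds, and the level-to-bits assignment are all made up for the example; the real chip's sense circuitry may well use a different mapping.

```python
import bisect

# Hypothetical comparator thresholds separating four normalized voltage levels.
THRESHOLDS = [0.25, 0.5, 0.75]


def decode_cell(voltage):
    """Convert one ROM position's output level into two bits (high, low)."""
    level = bisect.bisect_right(THRESHOLDS, voltage)  # 0, 1, 2 or 3
    return (level >> 1) & 1, level & 1


# One position per transistor size, from lowest to highest assumed voltage.
decoded = [decode_cell(v) for v in (0.1, 0.3, 0.6, 0.9)]
```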
Thank you for this tremendous piece of silicon detective work. Really hope that there are further instalments to come, for example covering the microcode.
Just to highlight too that footnote 1, describing the lengths that the Intel engineers had to go to ensure that interaction between the 8086 and the 8087 worked, is fascinating.
Ken, could the two 10^18 constants have two different signs? The signs must be stored somewhere like the exponents. Since most bits are zero for both signs and exponents, does it make sense for them to be coded as Boolean functions instead of being stored in a ROM?
I actually wrote that idea (two different signs) in the post but took it out :-) My thinking is that the hardware must support negation, so it would save space to have one constant and negate it instead of two constants with different signs.
I'm still investigating the chip, so I hope to find the exponent ROM (or Boolean logic as you suggest) and solve this puzzle.
Thanks for the article, Ken. I'd like to single out the small refresher on how transistors work; it makes for a really pleasant read. The footnotes are also genuinely useful for decluttering the article while staying approachable and clear.
Back in EE school in 1991, I installed an 8087 in my PC. The speed-up running Micro-CAP 3 was amazing, as a simple BJT CE amp sim only took a few minutes. Of course then I moved to a 486 and that same sim was finished before the mouse button lifted.
There has been an amazing amount of progress since then, but my PC is still too slow, as the simulations only get larger.
I couldn't figure out any explanation for that constant. If I had to guess, maybe some optimization for the base-2 log/exp CORDIC algorithm, like a polynomial approximation of base-2 log for the small remainder from CORDIC. The dividing by 3 might come from the Taylor series.
There was an instruction to load ln(2) onto the x87 stack. It didn't do anything else; it just pushed ln(2) onto the stack. The constant is there for computing natural logs: the x87 log instruction takes an additional scale factor, and if you give it ln(2) as the scale factor, it computes the natural log instead of the base-2 log.
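For anyone who wants to check the arithmetic, here is a small Python model. The instructions in question are FLDLN2 (push ln 2) and FYL2X (compute y · log2(x)); the Python function below is only a stand-in for the latter and says nothing about how the chip computes it internally.

```python
import math


def fyl2x(x, y):
    """Model of the x87 log instruction: y * log2(x)."""
    return y * math.log2(x)


LN2 = math.log(2.0)              # the constant that FLDLN2 pushes
natural_log = fyl2x(10.0, LN2)   # ln(2) * log2(10) = ln(10)
base2_log = fyl2x(10.0, 1.0)     # scale factor 1 gives plain log2
```

The identity behind it is just the change of base: ln(x) = ln(2) · log2(x).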
I don't know what ln(3) would be used for. My guess is it's used somewhere in the algorithm to compute log2(x). Probably as a fixed point in CORDIC, or possibly as a boundary condition: if the intermediate value is above ln(3), do a bitshift, and add a constant to the result. But I don't know.
The fact that the x87 is a separate processor is a key part of what made Quake what it was. Because math on the FPU ran concurrently with the main CPU, you could do integer math physics calculations on the x86 and floating point math graphics calculations on the x87. So you'd have a significant speedup by interleaving everything together. Must have been a nightmare to write that code. Some x86 clones were clever, and used the same circuitry for everything. I think Cyrix chips for instance didn't separate the circuitry for x86 and x87 instructions. So x86 heavy code ran similarly between Cyrix and Intel, and x87 heavy code did also, but Quake (which combined x86 and x87 instructions relatively equally) ran like garbage on Cyrix chips. At the time, this gave Cyrix chips a terrible reputation. These days, Intel and AMD put fairly little effort into making x87 instructions fast, because code that cares about floating point performance uses AVX instructions.
>These days, Intel and AMD put fairly little effort into making x87 instructions fast, because code that cares about floating point performance uses AVX instructions.
AFAIK x87 instructions were removed/deprecated in x86_64
> I'm a bit puzzled why the 8087 doesn't need the constant log2(1 + 2^-1), which is used by that algorithm.
Not necessarily. For example, the algorithm for 2^x with x between 0 and 1 just has to subtract the largest value in the table from x and perform the corresponding shift-and-add on the result. So for large enough x, they may have just done shift-by-2-and-add multiple times on the result, subtracting the corresponding log multiple times from the operand.
Though that still leaves the question of why they left out this particular table value and not any others. It may have been a balancing act of overall accuracy and performance vs table size vs microcode complexity.
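Here is a minimal fixed-point sketch of that idea in Python. The table holds log2(1 + 2^-k) but deliberately starts at k = 2, and each entry may be applied more than once, so the k = 2 step covers what the missing k = 1 entry would have done. The 40-bit precision and all names are assumptions for illustration, not the 8087's actual microcode.

```python
import math

FRAC_BITS = 40            # assumed fixed-point precision
ONE = 1 << FRAC_BITS


def to_fixed(value):
    return int(round(value * ONE))


# Table of log2(1 + 2^-k), with the k = 1 entry left out on purpose.
TABLE = [(k, to_fixed(math.log2(1 + 2.0 ** -k))) for k in range(2, FRAC_BITS)]


def exp2_fraction(x):
    """Approximate 2**x for 0 <= x < 1 using only subtracts, shifts and adds."""
    remainder = to_fixed(x)
    result = ONE
    for k, log_const in TABLE:
        # Repeat the same step while it still fits (up to 3 times for k = 2).
        while remainder >= log_const:
            remainder -= log_const
            result += result >> k      # multiply result by (1 + 2^-k)
    return result / ONE


approx = exp2_fraction(0.75)
```

Whether the real chip loops on the first entry like this is exactly the open question; the sketch only shows that the arithmetic works without the log2(1 + 2^-1) constant.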
“After more thought, I determined that the rows do not alternate but are arranged in a repeating "ABBA" pattern.” ... which differs from Konami's developers, who used a “BA” pattern in their ROMs.
Seeing this reminded me of how cool it felt when I got an Intel coupon for a free 80287 with the Intel AboveBoard memory expansion board I bought to add extended memory (DIPs!) to my 80286.
That would explain it, but unfortunately I mixed up that constant. I meant ln(2)/3. log2(3) isn't a constant in the 8087. Apologies for misleading you.
In its external representation, the 8087 omitted the leading one for the reason you suggest; it is redundant.
Your suggestion could apply to the ROM; they could have hard-coded the first bit to 1, saving a row in the ROM. I think they could have avoided a few transistors by doing this.
>"The die photo above shows the "engine" that ran the microcode program; it is basically a simple CPU. Next to it is the large ROM that holds the microcode."
[...]
The chip's data path consists of 67 horizontal rows, so it seemed pretty clear that the 134 rows in the ROM corresponded to two sets of 67-bit constants. I extracted one set of constants for the odd rows and one for the even rows, but the values didn't make any sense. After more thought, I determined that the rows do not alternate but are arranged in a repeating "ABBA" pattern.[7] Using this pattern yielded a bunch of recognizable constants, including pi and 1. Bits from those constants are shown in the diagram below. (In this photo, a 1 bit appears as a green stripe, while a 0 bit appears as a red stripe.) In binary, pi is 11.001001... and this value is visible in the upper labeled bits.
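A quick Python sketch of the ABBA de-interleaving described in the excerpt, plus a check of the binary expansion of pi. Which physical row counts as row 0 and which set is called "A" are assumptions here; the row values are just indices standing in for real bits.

```python
import math


def split_abba(rows):
    """Split rows laid out as A B B A A B B A ... into the A set and the B set."""
    set_a = [row for i, row in enumerate(rows) if i % 4 in (0, 3)]
    set_b = [row for i, row in enumerate(rows) if i % 4 in (1, 2)]
    return set_a, set_b


rows = list(range(134))          # placeholder: one entry per ROM row
set_a, set_b = split_abba(rows)  # 67 rows each

# First eight bits of pi: two integer bits followed by six fraction bits.
pi_bits = format(int(math.pi * 2 ** 6), "b")
```

With 134 rows the pattern ends partway through a group, and each set still comes out at 67 rows, matching the width of the data path.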
[...]
"The basic idea of CORDIC is to compute tangent and arctangent by breaking down an angle into smaller angles, and rotating a vector by these angles. The trick is that by carefully choosing the smaller angles, each rotation can be computed with efficient shifts and adds instead of trig functions. Specifically, suppose we want to find tan(z). We can break z into a sum of smaller angles: z ≈ {atan(2^-1) or 0} + {atan(2^-2) or 0} + {atan(2^-3) or 0} + ... + {atan(2^-16) or 0}. Now, rotating a vector by, say, atan(2^-2), can be done by multiplying by 2^-2 and adding. The key thing is that multiplying by 2^-2 is just a fast bit shift. Putting this all together, computing tan(z) can be done by comparing z with the atan constants, and then doing 16 cycles of additions and shifts, which are fast to perform in hardware.[13] To make the algorithm work, the atan constants are precomputed and stored in the constant ROM.[14]
[...]
Some of the constants (such as pi) are expected, while others (such as log2(3)) are more puzzling."
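The quoted description of CORDIC can be sketched in a few lines of Python. This is a simplified model under my own assumptions: 40-bit fixed-point integers, a table built with `math.atan` standing in for the constant ROM, and the leftover angle after 16 steps simply dropped. It is not a reconstruction of the 8087's microcode.

```python
import math

STEPS = 16
FRAC_BITS = 40  # assumed fixed-point precision

# Stand-in for the constant ROM: atan(2^-1) ... atan(2^-16).
ATAN_TABLE = [math.atan(2.0 ** -k) for k in range(1, STEPS + 1)]


def cordic_tan(z):
    """Approximate tan(z) for 0 <= z <= pi/4 with compares, shifts and adds."""
    x, y = 1 << FRAC_BITS, 0
    for k, angle in enumerate(ATAN_TABLE, start=1):
        if z >= angle:                 # use this angle, or 0 if it does not fit
            z -= angle
            # Rotate (x, y) by atan(2^-k); the common scale factor cancels in y/x.
            x, y = x - (y >> k), y + (x >> k)
    return y / x                       # the only division, done once at the end


approx = cordic_tan(0.5)
```

Because only the ratio y/x matters, skipping a rotation (the "or 0" case) needs no scale correction, which is part of what makes the tangent form so cheap.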