IEEE 754 specifies four formats for representing floating-point values: single-precision (32-bit), double-precision (64-bit), single-extended precision (>= 43-bit, not commonly used) and double-extended precision (>= 79-bit, usually implemented with 80 bits). Only 32-bit values are required by the standard; the others are optional. Many languages specify that IEEE formats and arithmetic be implemented, although sometimes this is optional. For example, the C programming language, which pre-dated IEEE 754, now allows but does not require IEEE arithmetic (the C float is typically used for IEEE single-precision and double for IEEE double-precision).
The full title of the standard is IEEE Standard for Binary Floating-Point Arithmetic (ANSI/IEEE Std 754-1985), and it is also known as IEC 60559:1989, Binary floating-point arithmetic for microprocessor systems (originally the reference number was IEC 559:1989).[1]
The following is a description of the standard's format for floating-point numbers.
A single-precision binary floating-point number is stored in a 32-bit word:
     1     8               23              width in bits
    +-+--------+-----------------------+
    |S|  Exp   |       Fraction        |
    +-+--------+-----------------------+
    31 30    23 22                    0    bit index (0 on right)
        bias +127

Where S is the sign bit and Exp is the Exponent field.
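The layout above can be explored with a short Python sketch. The helper name `split_fields` is an illustrative assumption, not part of the standard; it relies on Python's `struct` module to round a value to single precision and reread the same 32 bits as an unsigned integer.

```python
import struct

def split_fields(x):
    """Return (S, Exp, Fraction) for x rounded to IEEE single precision.

    split_fields is a hypothetical helper name used for illustration only.
    """
    # Pack as a big-endian 32-bit float, then reread the same bytes as an unsigned int.
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31              # bit 31
    exp = (bits >> 23) & 0xFF      # bits 30..23 (stored, i.e. biased, exponent)
    fraction = bits & 0x7FFFFF     # bits 22..0
    return sign, exp, fraction

print(split_fields(1.0))    # (0, 127, 0)
print(split_fields(-2.5))   # (1, 128, 2097152)
```

For 1.0 the stored exponent is 127 because the actual exponent is 0; for −2.5 (−1.25 × 2^1) the stored exponent is 128 and the fraction bits encode the .25.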
The exponent is biased in the engineering sense of the word: the value stored is offset (by 127 in this case) from the actual value. Biasing is done because exponents have to be signed values in order to be able to represent both tiny and huge values, but two's complement, the usual representation for signed values, would make comparison harder. To solve this, the exponent is biased before being stored, by adjusting its value to put it within an unsigned range suitable for comparison. So, for a single-precision number, an exponent in the range −126 .. +127 is biased by adding 127 to get a value in the range 1 .. 254 (0 and 255 have special meanings described below). When interpreting the floating-point number, the bias is subtracted to retrieve the actual exponent.
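As a minimal sketch of the biasing just described, the following Python maps actual exponents to stored values and back. The names `BIAS`, `bias_exponent` and `unbias_exponent` are assumptions chosen for illustration.

```python
BIAS = 127  # single-precision exponent bias

def bias_exponent(e):
    """Map an actual exponent in -126..127 to the stored value 1..254."""
    if not -126 <= e <= 127:
        raise ValueError("exponent outside the normalised single-precision range")
    return e + BIAS

def unbias_exponent(stored):
    """Recover the actual exponent from a stored value in 1..254."""
    if not 1 <= stored <= 254:
        raise ValueError("0 and 255 are reserved for special values")
    return stored - BIAS

for e in (-126, 0, 127):
    print(e, "->", bias_exponent(e))   # -126 -> 1, 0 -> 127, 127 -> 254
```

Because the stored values are all non-negative and increase with the actual exponent, they can be compared as plain unsigned integers, which is the motivation given above.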
The set of possible data values can be divided into the following classes: zeroes, denormalised numbers, normalised numbers, infinities and NaNs (Not a Number).
(NaNs are used to represent undefined or invalid results, such as the square root of a negative number.)
The classes are primarily distinguished by the value of the Exp field, modified by the fraction. Consider the Exp and Fraction fields as unsigned binary integers (Exp will be in the range 0–255):
    Class                   Exp     Fraction
    Zeroes                  0       0
    Denormalised numbers    0       non zero
    Normalised numbers      1-254   any
    Infinities              255     0
    NaN (Not a Number)      255     non zero

For normalised numbers, the most common, Exp is the biased exponent and Fraction is the fractional part of the significand. The number has value v:
v = s × 2^e × m
Where
s = +1 (positive numbers) when S is 0
s = −1 (negative numbers) when S is 1
e = Exp − 127 (in other words the exponent is stored with 127 added to it, also called "biased with 127")
m = 1.Fraction in binary (that is, the significand is the binary number 1 followed by the radix point followed by the binary bits of Fraction). Therefore, 1 <= m < 2.
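The class table and the value formula can be combined in a short Python sketch. The function names `classify` and `normalised_value` are hypothetical, and the code assumes the three fields have already been extracted as unsigned integers; it evaluates only the normalised case defined above.

```python
def classify(exp, fraction):
    """Name the class of a single-precision value from its Exp and Fraction fields."""
    if exp == 0:
        return "zero" if fraction == 0 else "denormalised"
    if exp == 255:
        return "infinity" if fraction == 0 else "NaN"
    return "normalised"

def normalised_value(sign, exp, fraction):
    """Evaluate v = s * 2**e * m for a normalised number (Exp in 1..254)."""
    if classify(exp, fraction) != "normalised":
        raise ValueError("formula applies to normalised numbers only")
    s = -1.0 if sign else 1.0
    e = exp - 127                  # remove the bias
    m = 1.0 + fraction / 2**23     # 1.Fraction in binary, so 1 <= m < 2
    return s * 2.0**e * m

print(classify(0, 0))                        # zero
print(classify(255, 1))                      # NaN
print(normalised_value(1, 128, 0x200000))    # -2.5
```

The last call uses S = 1, Exp = 128 and Fraction = 0x200000, giving s = −1, e = 1 and m = 1.25.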
Note: