FLOATING POINT NUMBERS

Most high level programming languages allow the programmer to do arithmetic using floating point numbers. They may be used when the programmer wants to do arithmetic on numbers containing a fractional part, or on integers which are greater than the maximum value that can be held internally as an integer.

Floating point numbers can be input or printed as <integer>.<fraction> or 0.<fraction>E<exponent>. The first representation is the representation of decimal fractions that most people are used to. Most of you will probably have used a variation of the second representation, which is particularly useful for representing very large or very small numbers. This variation is usually called scientific notation.

Examples:

    c   = 2.997 x 10**10 cm/s      (speed of light in a vacuum)
    m_p = 1.673 x 10**(-24) g      (proton mass)

These numbers can be represented in floating point notation as 0.2997E11 and 0.1673E-23 respectively. The number to the right of the 'E' is the power of ten by which the number to the left of the 'E' must be multiplied to give the number's value.

Floating point numbers are represented internally as an exponent and a mantissa (the significant digits). Both the exponent and the mantissa can have either a positive or negative sign. The number of bits in the exponent and mantissa varies from machine to machine, and from language to language on the same machine. In fact, a few languages such as FORTRAN allow the programmer to choose between two different floating point representations with different exponent and mantissa sizes (single precision and double precision).

The number of bits the floating point number occupies sets limits on the accuracy and range of numbers which can be represented. For a fixed number of bits, the more accurately the number is represented, the smaller the range of possible values that number may have, because the range of the number is determined by the size of the exponent, and the accuracy is determined by the size of the mantissa.
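The two points above (E notation, and the loss of accuracy when fewer bits are available) can be illustrated with a minimal sketch. It assumes Python and its standard struct module, neither of which appears in the original notes, and the names c_light, m_proton, to_single and single_error are chosen here purely for illustration.

```python
import struct

# The E-notation forms from the text denote the same values as the
# scientific-notation originals.
c_light = float("0.2997E11")     # 2.997 x 10**10 cm/s
m_proton = float("0.1673E-23")   # 1.673 x 10**(-24) g


def to_single(x):
    """Round a Python float (double precision) to single precision and back."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


# 0.1 has no exact binary representation. The single precision copy has
# fewer mantissa bits than the double precision value, so it differs from it.
single_error = abs(to_single(0.1) - 0.1)

# 0.5 is an exact binary fraction, so nothing is lost in either precision.
half_is_exact = to_single(0.5) == 0.5
```

Here to_single(0.1) agrees with 0.1 only to about 7 significant decimal digits, which is the accuracy quoted for a 23 bit mantissa.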
A typical representation might allow 7 bits for the exponent, 23 bits for the mantissa, and one bit each for the signs of the exponent and mantissa. The largest number is then about 2**127 or 10**38, and the smallest fraction is about 10**(-38). The accuracy is 23 bits, or about 7 decimal digits.

A REPRESENTATION FOR FLOATING POINT NUMBERS

This representation uses:

    1 bit for the sign of the number (bit 31).
    8 bits for the exponent (bits 23 to 30).
    23 bits for the mantissa (bits 0 to 22).

0 is used for positive sign, and 1 for negative sign. The exponent is held in excess-128 notation, i.e. 128 is added to the true exponent before the number is stored. The number is always normalised before it is stored, so that the mantissa actually represents a binary number of the form 1.<fraction>. All numbers except zero can be stored in this way, so the leading 1 is implied and only the binary fraction part needs to be stored explicitly. Zero is stored as 00000000 hex, i.e. sign, exponent and mantissa are 0.

The following table gives examples of numbers stored in this way.
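Before the table, a minimal sketch of the layout just described may help in checking its entries. It assumes Python; the names BIAS, decode and encode are hypothetical, and encode assumes its argument fits the 23 bit fraction exactly and simply truncates otherwise.

```python
import math

BIAS = 128  # excess-128 exponent, as described in the text


def decode(word):
    """Interpret a 32 bit integer in the layout described above."""
    if word == 0:
        return 0.0                               # zero is stored as all zero bits
    sign = -1.0 if (word >> 31) & 1 else 1.0     # bit 31
    exponent = ((word >> 23) & 0xFF) - BIAS      # bits 23 to 30, bias removed
    fraction = word & 0x7FFFFF                   # bits 0 to 22
    mantissa = 1.0 + fraction / float(1 << 23)   # restore the implied leading 1
    return sign * mantissa * 2.0 ** exponent


def encode(value):
    """Pack a number into the same layout (illustrative, no rounding)."""
    if value == 0:
        return 0
    sign = 1 if value < 0 else 0
    m, e = math.frexp(abs(value))                # abs(value) = m * 2**e, 0.5 <= m < 1
    exponent = e - 1                             # renormalise so the mantissa is 1.xxx
    biased = exponent + BIAS
    if not 0 <= biased <= 255:
        raise ValueError("exponent out of range for this format")
    fraction = int((2 * m - 1) * (1 << 23))      # drop the leading 1, truncate
    return (sign << 31) | (biased << 23) | fraction
```

For instance, encode(1.0) gives 40000000 hex and encode(-5.0) gives C1200000 hex, matching the bit patterns listed for 1.0 and -5.0. This is not the IEEE 754 single precision layout, which uses a bias of 127, so the patterns differ from what most current hardware produces.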
Decimal Number   Floating Point Binary Representation
--------------   ------------------------------------
                 S<--exp--><-------mantissa-------->
 0.0             00000000 00000000 00000000 00000000
 0.25            00111111 00000000 00000000 00000000
 0.5             00111111 10000000 00000000 00000000
 0.75            00111111 11000000 00000000 00000000
 1.0             01000000 00000000 00000000 00000000
 1.125           01000000 00010000 00000000 00000000
 1.25            01000000 00100000 00000000 00000000
 1.375           01000000 00110000 00000000 00000000
 1.5             01000000 01000000 00000000 00000000
 2.0             01000000 10000000 00000000 00000000
 3.0             01000000 11000000 00000000 00000000
 4.0             01000001 00000000 00000000 00000000
 5.0             01000001 00100000 00000000 00000000
 6.0             01000001 01000000 00000000 00000000
-1.0             11000000 00000000 00000000 00000000
-5.0             11000001 00100000 00000000 00000000

MULTIPLICATION AND DIVISION OF FLOATING POINT NUMBERS

The following algorithm is used in floating point multiplication:

Product of X and Y:

If (X = 0) or (Y = 0) then
    product := 0
else
    sign of product := (sign of X) xor (sign of Y);
    for both X and Y, do
        Shift the exponent into an 8 bit register.
        Subtract 128 from the exponent.
        Set bit 23 of the mantissa to 1 (* the implied leading 1 *).
    end for;
    (* true exponent; the excess of 128 is restored before it is stored *)
    exponent of product := (exponent of X) + (exponent of Y).
    If (exponent of product) > 127 then
        Error: numeric overflow. Exit.
    end if;
    Multiply the two mantissas. The result will be held in 48 bits.
    If bit 47 = 1 then
        Shift the result 23 bits to the right.
        Add 1 to the exponent of the product.
    else
        Shift the result 22 bits to the right.
    end if;
    If the least significant bit = 1 then
        Add 1 to the result (* rounding *).
    end if;
    If bit 25 = 1 (* possible because of rounding *) then
        Shift the number two bits to the right.
        Add 1 to the exponent of the product.
    else
        Shift the number one more bit to the right.
    end if;
    If (exponent of product) > 127 then
        Error: numeric overflow. Exit.
    else if (exponent of product) < -128 then
        product := 0
    end if;
    Clear bit 23 of product.
    Add 128 to the exponent of the product.
    Insert sign and exponent.
end if;

To summarise, xor the signs, add the exponents and multiply the mantissas, rounding upwards if necessary. Division is performed in an analogous way, by xoring the signs, subtracting the exponents and dividing the mantissas.

ADDITION AND SUBTRACTION OF FLOATING POINT NUMBERS

The following algorithm performs addition of two floating point numbers (X and Y) of the same sign:

sign of sum := sign of X.
If (magnitude of Y) > (magnitude of X) then
    Swap X and Y.
end if;
exponent of sum := exponent of X.
exponent difference := (exponent of X) - (exponent of Y).
Set bit 23 of each mantissa to 1.
Shift the (mantissa of Y) (exponent difference) places to the right.
If mantissa of Y = 0 then
    sum := X.
else
    Add the two mantissas.
    If bit 24 of the sum of the two mantissas = 1 then
        If bit 0 = 1 then
            Add 1 to the sum (* rounding *).
        end if;
        Shift the sum 1 bit to the right.
        Add 1 to exponent of sum.
        If the exponent of sum > 127 then
            Error: numeric overflow. Exit.
        end if;
    end if;
end if;

Subtraction is performed analogously, with one minor difference: if the two numbers are equal, the result is zero. This must be handled separately from the other cases.

FIXED POINT DECIMAL NUMBERS

Sometimes, non-integers are held in fixed point decimal notation. This is less efficient for handling very large numbers, but preserves accuracy in cases where floating point does not, because many decimal fractions cannot be converted exactly to binary fractions. Each digit is held in 4 bit BCD format, and the position of the decimal point is stored separately from the number. Numbers can be an arbitrary number of digits long.

For example, 137.002 is stored as (137002:-3), 3.14159 is stored as (314159:-1), 12 is stored as (12:-2), and 0.000456 is stored as (456:+3). In these examples the pair (digits:n) appears to stand for 0.<digits> x 10**(-n).

Addition of fixed point decimal numbers:

    3.14159 + 0.000456 = (314159:-1) + (456:+3)
                       = ((3141590 + 0000456):-1)
                       = (3142046:-1)
                       = 3.142046

    5.1 + 6.3 = (51:-1) + (63:-1)
              = (1,14:-1)
              = (114:-2)
              = 11.4

    3.14159 + 0.00456 = (314159:+5) + (456:-5)