What is the function of floating point?
To represent non-integral numbers, including very small and very large numbers.
5.98 × 10^7 = significant digits × base^exponent
Example
Actual | Floating Point
0.0000000478 | 4.78 × 10^-8
0.00000001 | 0.1 × 10^-7
-1000000000 | -1.0 × 10^9
-0.00111_2 | -1.11_2 × 2^-3
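As a quick check of the decimal rows in the table, here is a minimal Python sketch that rewrites a number as significant digits × 10^exponent. The helper name to_scientific is made up for this illustration, and it always normalizes to one digit before the decimal point, so 0.00000001 comes out as 1.0 × 10^-8 rather than 0.1 × 10^-7.

```python
def to_scientific(value: float, digits: int = 2) -> str:
    """Format value as 'significand × 10^exponent' with `digits` decimals."""
    # Python's "e" format already yields a normalized significand and exponent.
    mantissa, exponent = f"{value:.{digits}e}".split("e")
    return f"{mantissa} × 10^{int(exponent)}"

print(to_scientific(0.0000000478))     # 4.78 × 10^-8
print(to_scientific(-1000000000, 1))   # -1.0 × 10^9
```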
IEEE 754-1985 was an industry standard for representing floating-point numbers in computers, officially adopted in 1985 and superseded in 2008 by IEEE 754-2008. During its 23 years, it was the most widely used format for floating-point computation. It was implemented in software, in the form of floating-point libraries, and in hardware, in the instructions of many CPUs and FPUs. The first integrated circuit to implement the draft of what was to become IEEE 754-1985 was the Intel 8087.
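IEEE Floating-Point Format
Single Precision: 1 sign bit, 8 exponent bits, 23 fraction bits (bias = 127)
Double Precision: 1 sign bit, 11 exponent bits, 52 fraction bits (bias = 1023)
x = (-1)^S × (1 + Fraction) × 2^(Exponent - Bias)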
sign = 0 when the number is positive (1 indicates negative).
biased exponent = actual exponent + bias
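The formula can be tried out directly. The following is only a sketch, assuming a normalized single-precision number (bias 127); it does not handle zero, subnormals, infinity, or NaN, and the function name decode_single is invented for the example.

```python
def decode_single(sign: int, biased_exponent: int, fraction_bits: str) -> float:
    """Apply x = (-1)^S × (1 + Fraction) × 2^(Exponent - Bias) to the three fields."""
    BIAS = 127  # single-precision bias
    # Read the fraction bits as a binary fraction 0.b1b2b3...
    fraction = int(fraction_bits, 2) / 2 ** len(fraction_bits)
    return (-1) ** sign * (1 + fraction) * 2.0 ** (biased_exponent - BIAS)

print(decode_single(0, 127, "1" + "0" * 22))  # 1.5
print(decode_single(1, 128, "0" * 23))        # -2.0
```

Note that a biased exponent of 127 stands for an actual exponent of 0, which is why the first call returns the significand 1.5 unchanged.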
Floating Point Example
From decimal to floating point
- Convert -1313.3125 to IEEE 32-bit floating-point format.
a. The integral part is 1313_10 = 10100100001_2.
b. The fractional part:
0.3125 × 2 = 0.625 | 0 | Generate 0 and continue.
0.625 × 2 = 1.25 | 1 | Generate 1 and continue with the rest.
0.25 × 2 = 0.5 | 0 | Generate 0 and continue.
0.5 × 2 = 1.0 | 1 | Generate 1 and nothing remains.
c. So 1313.3125_10 = 10100100001.0101_2.
d. Normalize: 10100100001.0101_2 = 1.01001000010101_2 × 2^10.
e. The fraction is 01001000010101000000000, the exponent is 10 + 127 = 137 = 10001001_2, and the sign bit is 1 because the number is negative.
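Steps a to e can be sketched in Python as follows. This is an illustration only: the names fraction_to_binary and encode_single are invented here, the sketch assumes the magnitude is at least 1, and it truncates extra fraction bits instead of rounding them as real hardware would.

```python
from fractions import Fraction

def fraction_to_binary(frac: Fraction, max_bits: int = 52) -> str:
    """Step b: repeatedly double the fraction and collect the integer parts."""
    bits = ""
    while frac and len(bits) < max_bits:
        frac *= 2
        bit = int(frac)      # the integer part is the next binary digit
        bits += str(bit)
        frac -= bit          # continue with what remains
    return bits

def encode_single(text: str):
    """Return (sign, exponent, fraction) bit strings for a decimal string."""
    value = Fraction(text)
    sign = "1" if value < 0 else "0"
    value = abs(value)
    whole = int(value)
    integer_bits = bin(whole)[2:]                       # step a
    fraction_bits = fraction_to_binary(value - whole)   # step b
    exponent = len(integer_bits) - 1                    # step d: shift the point left
    # Step e: drop the leading 1, keep 23 bits, pad with zeros (truncation, not rounding).
    mantissa = (integer_bits[1:] + fraction_bits)[:23].ljust(23, "0")
    return sign, f"{exponent + 127:08b}", mantissa

print(encode_single("-1313.3125"))
# ('1', '10001001', '01001000010101000000000')
```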
Then you will get the following answers:
Binary 32 bits
Sign [1 bit] | Exponent [8 bits] | Fraction [23 bits]
1 (-) | 10001001 | 01001000010101000000000
Binary 64 bits
Sign [1 bit] | Exponent [11 bits] | Fraction [52 bits]
1 (-) | 10000001001 | 01001000010101 0000000000000000 0000000000000000 000000
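Both tables can be cross-checked against what the machine itself stores. The sketch below uses Python's standard struct module to pack the value in big-endian order and split the bits into the three fields; the helper name float_fields is just for this example.

```python
import struct

def float_fields(value: float, double: bool = False) -> tuple:
    """Return the (sign, exponent, fraction) bit strings of an IEEE 754 value."""
    fmt, exp_bits, total = (">d", 11, 64) if double else (">f", 8, 32)
    raw = int.from_bytes(struct.pack(fmt, value), "big")
    bits = f"{raw:0{total}b}"          # zero-padded 32- or 64-bit string
    return bits[0], bits[1:1 + exp_bits], bits[1 + exp_bits:]

print(float_fields(-1313.3125))
# ('1', '10001001', '01001000010101000000000')
print(float_fields(-1313.3125, double=True)[1])
# 10000001001  (10 + 1023 = 1033)
```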
From floating point to decimal
a. Separate: 01000100001101100001000000000000_2
Sign [1 bit] | Exponent [8 bits] | Fraction [23 bits]
0 (+) | 10001000 | 01101100001000000000000
b. Exponent: 10001000_2 = 136_10; 136 - 127 = 9.
c. Denormalize: 1.01101100001_2 × 2^9 = 1011011000.01_2.
d. Convert:
Exponents | 2^9 | 2^8 | 2^7 | 2^6 | 2^5 | 2^4 | 2^3 | 2^2 | 2^1 | 2^0 | 2^-1 | 2^-2
Place Values | 512 | 256 | 128 | 64 | 32 | 16 | 8 | 4 | 2 | 1 | 0.5 | 0.25
Bits | 1 | 0 | 1 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 1
Value | 512 + 128 + 64 + 16 + 8 + 0.25 = 728.25
e. Sign: positive, because the sign bit is 0.
Result: 01000100001101100001000000000000_2 is 728.25.
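The same decoding can be followed step by step in code. This is a rough sketch, assuming a normalized 32-bit pattern; decode_steps is a made-up name, and the last line uses the standard struct module only as an independent check.

```python
import struct

def decode_steps(bits: str) -> float:
    """Decode a normalized 32-bit pattern by summing place values (steps a to e)."""
    sign, exponent_bits, fraction_bits = bits[0], bits[1:9], bits[9:]  # a. separate
    exponent = int(exponent_bits, 2) - 127                             # b. remove the bias
    significand = "1" + fraction_bits                                  # c. restore the leading 1
    # d. bit i of the significand has place value 2^(exponent - i)
    magnitude = sum(2.0 ** (exponent - i)
                    for i, bit in enumerate(significand) if bit == "1")
    return -magnitude if sign == "1" else magnitude                    # e. apply the sign

pattern = "01000100001101100001000000000000"
print(decode_steps(pattern))  # 728.25
# Cross-check against the machine's own decoding of the same 4 bytes.
print(struct.unpack(">f", int(pattern, 2).to_bytes(4, "big"))[0])  # 728.25
```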
Floating point addition is analogous to addition using scientific notation. For example, to add 2.25 × 10^0 to 1.340625 × 10^2:
1. Shift the decimal point of the smaller number to the left until the exponents are equal. Thus, the first number becomes 0.0225 × 10^2.
2. Add the numbers with decimal points aligned:
1.340625 × 10^2 + 0.0225 × 10^2 = 1.363125 × 10^2
3. Normalize the result. Here it is already normalized:
= 1.363125 × 10^2
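The three steps can be sketched with Python's decimal module, which keeps the decimal digits exact. This is only an illustration in base 10, not how binary hardware adds; the function name add_scientific and the (significand, exponent) pair layout are assumptions made for the example.

```python
from decimal import Decimal

def add_scientific(a, b):
    """Add two (significand, exponent) pairs by aligning their exponents."""
    (ma, ea), (mb, eb) = a, b
    if ea < eb:                          # make `a` the number with the larger exponent
        (ma, ea), (mb, eb) = (mb, eb), (ma, ea)
    mb = mb.scaleb(eb - ea)              # 1. shift the smaller number's decimal point left
    total, exp = ma + mb, ea             # 2. add with decimal points aligned
    while abs(total) >= 10:              # 3. normalize to 1 <= |significand| < 10
        total, exp = total.scaleb(-1), exp + 1
    while total != 0 and abs(total) < 1:
        total, exp = total.scaleb(1), exp - 1
    return total, exp

print(add_scientific((Decimal("2.25"), 0), (Decimal("1.340625"), 2)))
# (Decimal('1.363125'), 2)
```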
Floating Point Multiplication
Multiply the following two numbers in scientific notation by hand:
(1.110 × 10^10) × (9.200 × 10^-5)
1. Add the exponents to find the new exponent:
New Exponent = 10 + (-5) = 5
If we add biased exponents, the bias is added twice, so we need to subtract it once to compensate:
(10 + 127) + (-5 + 127) = 259
259 - 127 = 132, which is (5 + 127), the biased new exponent.
2. Multiply the significands:
1.110 × 9.200 = 10.212000
Only three digits can be kept to the right of the decimal point, so the result is
10.212 × 10^5
3. Normalise the result:
1.0212 × 10^6
4. Round it:
1.021 × 10^6
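The four steps above can be sketched the same way. This is a decimal illustration, with invented helper names (biased_product_exponent, multiply_scientific), and it rounds half to even, which is an assumption since the worked example does not say which rounding rule it uses.

```python
from decimal import Decimal, ROUND_HALF_EVEN

def biased_product_exponent(biased_a: int, biased_b: int, bias: int = 127) -> int:
    """Adding two biased exponents counts the bias twice, so subtract it once."""
    return biased_a + biased_b - bias

def multiply_scientific(a, b, digits: int = 3):
    """Multiply two (significand, exponent) pairs, keeping `digits` decimals."""
    (ma, ea), (mb, eb) = a, b
    product, exp = ma * mb, ea + eb              # 1. add exponents, 2. multiply significands
    while abs(product) >= 10:                    # 3. normalise
        product, exp = product.scaleb(-1), exp + 1
    quantum = Decimal(1).scaleb(-digits)         # e.g. 0.001 for three digits
    product = product.quantize(quantum, rounding=ROUND_HALF_EVEN)  # 4. round
    if abs(product) >= 10:                       # rounding carried over, e.g. 9.9996 -> 10.000
        product, exp = product.scaleb(-1).quantize(quantum), exp + 1
    return product, exp

print(biased_product_exponent(10 + 127, -5 + 127))                       # 132
print(multiply_scientific((Decimal("1.110"), 10), (Decimal("9.200"), -5)))
# (Decimal('1.021'), 6)
```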
Published by
SITI NURHASTINI BINTI ROSALI ( B031310320 )