Friday, 2 September 2011

Range of Numbers and Precision

          The range of numbers that can be represented in any machine depends upon the number of bits in the
exponent, while the fractional accuracy or precision is ultimately determined by the number of bits
in the mantissa. The higher the number of bits in the exponent, the larger is the range of numbers
that can be represented. For example, the range of numbers possible in a floating-point binary number
format using six bits to represent the magnitude of the exponent would be from $2^{-64}$ to $2^{+64}$, which
is equivalent to a range of $10^{-19}$ to $10^{+19}$. The precision is determined by the number of bits used to
represent the mantissa. It is usually represented as decimal digits of precision. The concept of precision
as defined with respect to floating-point notation can be explained in simple terms as follows. If the
mantissa is stored in n number of bits, it can represent a decimal number between 0 and $2^{n} - 1$ as the
mantissa is stored as an unsigned integer. If M is the largest number such that $10^{M} - 1$ is less than or
equal to $2^{n} - 1$, then M is the precision expressed as decimal digits of precision. For example, if the
mantissa is expressed in 20 bits, then decimal digits of precision can be found to be about 6, as $2^{20} - 1$
equals 1 048 575, which is a little over $10^{6} - 1$. We will briefly describe the commonly used formats
for binary floating-point number representation.
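
The definition of decimal digits of precision given above can be checked with a few lines of Python. This is only a minimal sketch; the function name `decimal_digits_of_precision` is an illustrative choice and not something defined in the text.

```python
def decimal_digits_of_precision(n_bits: int) -> int:
    """Return the largest M such that 10**M - 1 <= 2**n_bits - 1."""
    largest = 2 ** n_bits - 1      # largest unsigned integer an n-bit mantissa can hold
    m = 0
    # Keep increasing M while the next power of ten still fits.
    while 10 ** (m + 1) - 1 <= largest:
        m += 1
    return m


print(decimal_digits_of_precision(20))   # 6, since 2**20 - 1 = 1048575 >= 999999
print(decimal_digits_of_precision(52))   # 15
```

Python integers have arbitrary size, so the comparison is exact even for wide mantissas.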

Floating Point Number Formats :

           The most commonly used format for representing floating-point numbers is the IEEE-754 standard.
The full title of the standard is IEEE Standard for Binary Floating-point Arithmetic (ANSI/IEEE STD
754-1985). It is also known as Binary Floating-point Arithmetic for Microprocessor Systems, IEC
60559:1989. An ongoing revision to IEEE-754 is IEEE-754r. Another related standard IEEE 854-
1987 generalizes IEEE-754 to cover both binary and decimal arithmetic. A brief description of salient
features of the IEEE-754 standard, along with an introduction to other related standards, is given below.

ANSI/IEEE-754 Format

The IEEE-754 floating point is the most commonly used representation for real numbers on
computers including Intel-based personal computers, Macintoshes and most of the UNIX platforms.
It specifies four formats for representing floating-point numbers. These include single-precision,
double-precision, single-extended precision and double-extended precision formats. Table 1.1 lists
characteristic parameters of the four formats contained in the IEEE-754 standard. Of the four formats
mentioned, the single-precision and double-precision formats are the most commonly used ones. The
single-extended and double-extended precision formats are not common.
Figure 1.1 shows the basic constituent parts of the single- and double-precision formats. As shown in
the figure, the floating-point numbers, as represented using these formats, have three basic components:
the sign, the exponent and the mantissa. A sign bit of ‘0’ denotes a positive number and a ‘1’ denotes
a negative number. The n-bit exponent field needs to represent both positive and negative exponent
values. To achieve this, a bias equal to $2^{n-1} - 1$ is added to the actual exponent in order to obtain the
stored exponent. This equals 127 for an eight-bit exponent of the single-precision format and 1023 for
an 11-bit exponent of the double-precision format. The addition of bias allows the use of an exponent
in the range from −127 to +128, corresponding to a range of 0–255 in the first case, and in the range
from −1023 to +1024, corresponding to a range of 0–2047 in the second case. The stored exponent is
therefore never negative; a negative actual exponent appears in 2’s complement form only when the bias
is subtracted from the stored exponent. The single-precision format offers a range from $2^{-127}$
to $2^{+127}$, which is equivalent to $10^{-38}$ to $10^{+38}$. The corresponding figures for the
double-precision format are $2^{-1023}$ to $2^{+1023}$, which is equivalent to $10^{-308}$ to $10^{+308}$.
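
As a rough illustration of the bias arithmetic described above, the following Python sketch computes the bias for a given exponent width and converts an actual exponent into the stored bit pattern. The function names are illustrative assumptions, not terms from the standard.

```python
def exponent_bias(n_bits: int) -> int:
    """Bias added to the actual exponent: 2**(n_bits - 1) - 1."""
    return 2 ** (n_bits - 1) - 1


def stored_exponent(actual: int, n_bits: int) -> str:
    """Return the biased exponent as an n_bits-wide binary string."""
    biased = actual + exponent_bias(n_bits)
    if not 0 <= biased < 2 ** n_bits:
        raise ValueError("exponent does not fit in the exponent field")
    return format(biased, f"0{n_bits}b")


print(exponent_bias(8), exponent_bias(11))   # 127 1023
print(stored_exponent(-127, 8))              # 00000000
print(stored_exponent(128, 8))               # 11111111
print(stored_exponent(4, 8))                 # 10000011
```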

           The extreme exponent values are reserved for representing special values. For example, in the case
of the single-precision format, for an exponent value of −127, the biased exponent value is zero,
represented by an all 0s exponent field. In the case of a biased exponent of zero, if the mantissa is zero
as well, the value of the floating-point number is exactly zero. If the mantissa is nonzero, it represents
a denormalized number that does not have an assumed leading bit of ‘1’. A biased exponent of +255,
corresponding to an actual exponent of +128, is represented by an all 1s exponent field. If the mantissa
is zero, the number represents infinity. The sign bit is used to distinguish between positive and negative
infinity. If the mantissa is nonzero, the number represents a ‘NaN’ (Not a Number). The value NaN is
used to represent a result that is not a real number. Because the all 0s and all 1s exponent fields are
reserved, an eight-bit exponent can represent actual exponent values between −126 and +127 for
normalized numbers. Referring to Fig. 1.1(a), the MSB of byte 1
indicates the sign of the mantissa. The remaining seven bits of byte 1 and the MSB of byte 2 represent
an eight-bit exponent. The remaining seven bits of byte 2 and the 16 bits of byte 3 and byte 4 give a
23-bit mantissa. The mantissa m is normalized. The left-hand bit of the normalized mantissa is always
‘1’. This ‘1’ is not included but is always implied. A similar explanation can be given in the case of
the double-precision format shown in Fig. 1.1(b).
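
The field layout and the reserved exponent values described above can be explored with a short Python sketch. It assumes the single-precision layout of one sign bit, eight exponent bits and 23 mantissa bits; the helper names `split_fields`, `classify` and `float_to_bits` are hypothetical and chosen only for this illustration.

```python
import struct


def float_to_bits(x: float) -> int:
    """Return the 32-bit single-precision pattern of x as an integer."""
    return struct.unpack(">I", struct.pack(">f", x))[0]


def split_fields(bits: int):
    """Split a 32-bit pattern into (sign, biased exponent, mantissa)."""
    sign = bits >> 31
    exponent = (bits >> 23) & 0xFF
    mantissa = bits & 0x7FFFFF
    return sign, exponent, mantissa


def classify(bits: int) -> str:
    """Name the kind of value encoded by a single-precision pattern."""
    _, exponent, mantissa = split_fields(bits)
    if exponent == 0:                      # all 0s exponent field
        return "zero" if mantissa == 0 else "denormalized"
    if exponent == 0xFF:                   # all 1s exponent field
        return "infinity" if mantissa == 0 else "NaN"
    return "normalized"


print(split_fields(float_to_bits(23.0)))      # (0, 131, 3670016)
print(classify(float_to_bits(float("inf"))))  # infinity
print(classify(0x00000001))                   # denormalized
```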
Step-by-step transformation of $(23)_{10}$ into an equivalent floating-point number in single-precision
IEEE format is as follows:
• $(23)_{10} = (10111)_2 = 1.0111 \times 2^{4}$ = 1.0111e + 0100.
• The mantissa = 0111000 00000000 00000000.
• The exponent = 00000100.
• The biased exponent = 00000100 + 01111111 = 10000011.
• The sign of the mantissa = 0.
• $(+23)_{10}$ = 01000001 10111000 00000000 00000000.
• Also, $(-23)_{10}$ = 11000001 10111000 00000000 00000000.
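
The steps above can be sketched in Python as follows. This is a simplified illustration that assumes a nonzero integer whose binary form fits in 24 bits; the function name `encode_single` is hypothetical, and a general encoder would also need to handle fractions, rounding and the special values.

```python
def encode_single(value: int) -> str:
    """Encode a nonzero integer (at most 24 significant bits) step by step."""
    if value == 0 or abs(value) >= 2 ** 24:
        raise ValueError("sketch only handles nonzero integers below 2**24")
    sign = "1" if value < 0 else "0"
    binary = format(abs(value), "b")          # e.g. 23 -> '10111'
    exponent = len(binary) - 1                # position of the leading 1, e.g. 4
    mantissa = binary[1:].ljust(23, "0")      # drop the implied leading 1, pad to 23 bits
    biased = format(exponent + 127, "08b")    # add the single-precision bias
    bits = sign + biased + mantissa
    # Group the 32 bits into four bytes for readability.
    return " ".join(bits[i:i + 8] for i in range(0, 32, 8))


print(encode_single(23))    # 01000001 10111000 00000000 00000000
print(encode_single(-23))   # 11000001 10111000 00000000 00000000
```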

IEEE-754r Format
As mentioned earlier, IEEE-754r is an ongoing revision to the IEEE-754 standard. The main objective of
the revision is to extend the standard wherever it has become necessary, the most obvious enhancement
to the standard being the addition of the 128-bit format and decimal format. Extension of the standard
to include decimal floating-point representation has become necessary as most commercial data are
held in decimal form and the binary floating point cannot represent decimal fractions exactly. If the
binary floating point is used to represent decimal data, it is likely that the results will not be the same as
those obtained by using decimal arithmetic.
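
The difference between binary and decimal arithmetic can be seen in a small Python experiment. Python's built-in `float` is a binary double-precision type, while the standard-library `decimal` module provides decimal arithmetic; it is used here only as a convenient stand-in and is not claimed to be an implementation of the formats in the revision.

```python
from decimal import Decimal

# Binary floating point: 0.1 and 0.2 have no exact binary representation.
binary_sum = 0.1 + 0.2
print(binary_sum == 0.3)                  # False

# Decimal arithmetic: the same fractions are held exactly.
decimal_sum = Decimal("0.1") + Decimal("0.2")
print(decimal_sum == Decimal("0.3"))      # True
```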
In the revision process, many of the definitions have been rewritten for clarification and consistency.
In terms of the addition of new formats, a new addition to the existing binary formats is the 128-bit
‘quad-precision’ format. Also, three new decimal formats, matching the lengths of binary formats,
have been described. These include decimal formats with a seven-, 16- and 34-digit mantissa, which
may be normalized or denormalized. In order to achieve maximum range (decided by the number of
exponent bits) and precision (decided by the number of mantissa bits), the formats merge part of the
exponent and mantissa into a combination field and compress the remainder of the mantissa using
densely packed decimal encoding. Detailed description of the revision, however, is beyond the scope
of this book.

IEEE-854 Standard
The main objective of the IEEE-854 standard was to define a standard for floating-point arithmetic
without the radix and word length dependencies of the better-known IEEE-754 standard. That is why
IEEE-854 is called the IEEE standard for radix-independent floating-point arithmetic. Although the
standard specifies only the binary and decimal floating-point arithmetic, it provides sufficient guidelines
for those contemplating the implementation of the floating point using any other radix value such
as 16 of the hexadecimal number system. This standard, too, specifies four formats including single,
single-extended, double and double-extended precision formats.

Example 1.11
Determine the floating-point representation of $(-142)_{10}$ using the IEEE single-precision format.
Solution
• As a first step, we will determine the binary equivalent of $(142)_{10}$. Following the procedure outlined
in an earlier part of the chapter, the binary equivalent can be written as $(142)_{10} = (10001110)_2$.
• $(10001110)_2 = 1.0001110 \times 2^{7}$ = 1.0001110e + 0111.
• The mantissa = 0001110 00000000 00000000.
• The exponent = 00000111.
• The biased exponent = 00000111 + 01111111 = 10000110.
• The sign of the mantissa = 1.
• Therefore, $(-142)_{10}$ = 11000011 00001110 00000000 00000000.
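
A hand-worked result such as this one can be cross-checked against the machine's own encoding. The sketch below uses Python's standard `struct` module to pack a value as a big-endian single-precision number; the helper name `single_precision_bits` is an illustrative assumption.

```python
import struct


def single_precision_bits(x: float) -> str:
    """Return the four bytes of x in single precision as binary strings."""
    raw = struct.pack(">f", x)             # big-endian IEEE-754 single precision
    return " ".join(format(byte, "08b") for byte in raw)


print(single_precision_bits(-142.0))   # 11000011 00001110 00000000 00000000
```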

Example 1.12
Determine the equivalent decimal numbers for the following floating-point numbers:
(a) 00111111 01000000 00000000 00000000 (IEEE-754 single-precision format);
(b) 11000000 00101001 01100 followed by 43 0s (IEEE-754 double-precision format, 64 bits in all).
Solution
(a) From an examination of the given number:
The sign of the mantissa is positive, as indicated by the ‘0’ bit in the designated position.
The biased exponent = 01111110.
The unbiased exponent = 01111110−01111111 = 11111111.
It is clear from the eight bits of unbiased exponent that the exponent is negative, as the 2’s
complement representation of a number gives ‘1’ in place of MSB.
The magnitude of the exponent is given by the 2’s complement of $(11111111)_2$, which is
$(00000001)_2 = 1$.
Therefore, the exponent = −1.
The mantissa bits = 11000000 00000000 00000000 (the leading ‘1’ is the implied bit).
The normalized mantissa = 1.1000000 00000000 00000000.
The magnitude of the number can be determined by shifting the binary point one position to the left.
That is, the magnitude = $(0.11)_2 = (0.75)_{10}$.
Therefore, the equivalent decimal number = +0.75.
(b) The sign of the mantissa is negative, indicated by the ‘1’ bit in the designated position.
The biased exponent = 10000000010.
The unbiased exponent = 10000000010−01111111111 = 00000000011.
It is clear from the 11 bits of unbiased exponent that the exponent is positive owing to the ‘0’ in
place of MSB. The magnitude of the exponent is 3. Therefore, the exponent = +3.
The mantissa bits = 1100101100 followed by 43 0s (the leading ‘1’ is the implied bit).
The normalized mantissa = 1.100101100 followed by 43 0s.
The magnitude of the number can be determined by shifting the binary point three positions to
the right.
That is, the magnitude = $(1100.1011)_2 = (12.6875)_{10}$.
Therefore, the equivalent decimal number = −12.6875.
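
Decoding can also be checked mechanically. The following Python sketch converts a string of bits into bytes and lets the standard `struct` module interpret them as a single- or double-precision number. The function name `decode` is hypothetical, and part (b) is padded with zeros to a full 64 bits, which is an assumption about the intended pattern.

```python
import struct


def decode(bit_string: str) -> float:
    """Interpret a 32- or 64-bit pattern as an IEEE-754 number."""
    bits = bit_string.replace(" ", "")
    fmt = {32: ">f", 64: ">d"}[len(bits)]            # single or double precision
    raw = int(bits, 2).to_bytes(len(bits) // 8, "big")
    return struct.unpack(fmt, raw)[0]


part_a = decode("00111111 01000000 00000000 00000000")
part_b = decode("11000000 00101001 01100" + "0" * 43)  # padded to 64 bits
print(part_a, part_b)
```

Any other length raises a `KeyError`, which is a convenient way to catch a miscounted run of zeros.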
