## Saturday, September 11, 2021

### Single Precision and Double Precision (Floating Point Representation)

Floating Point Representation

To handle both very large integers and very small fractions, a computer must be able to represent numbers and operate on them in such a way that the position of the binary point is variable and is adjusted automatically as computation proceeds. In this case the binary point is said to float, and the numbers are called floating-point numbers.

The floating-point representation has three fields: sign, significant digits, and exponent. Consider the number 111101.1000110 represented in floating-point format. To represent it, the binary point is first shifted to just right of the first significant bit, and the number is then multiplied by the scaling factor that restores the original value. The base of the scaling factor is fixed at 2. The string of significant digits is commonly known as the mantissa.

The number should be in normalized form, which for this example is

111101.1000110 = 1.111011000110 × 2^5

In the above example, we can say that,

Sign = 0 (this number is positive)

Mantissa = 111011000110 (Significant Digits)

Exponent = 5
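The normalization step can be sketched in a few lines of Python. This is only an illustration for positive, non-zero binary strings; the function name `normalize_binary` is made up for this example and is not part of any library.

```python
def normalize_binary(bits):
    """Normalize a positive binary string such as '111101.1000110'.

    Returns (mantissa, exponent) so that the value equals
    1.<mantissa> x 2**exponent. Assumes at least one '1' is present.
    """
    int_part, _, frac_part = bits.partition(".")
    digits = int_part + frac_part
    first_one = digits.index("1")             # position of the leading 1
    exponent = len(int_part) - 1 - first_one  # places the binary point moved
    mantissa = digits[first_one + 1:]         # bits after the leading 1
    return mantissa, exponent


mantissa, exponent = normalize_binary("111101.1000110")
print(mantissa, exponent)
```

For the number used above, this returns the mantissa 111011000110 and the exponent 5.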

The IEEE standard defines two commonly used formats:

1.    32-bit standard (Single-precision representation)

2.    64-bit standard (Double-precision representation)

Single Precision Representation

Field 1 Sign = 1-bit

Field 2 Exponent = 8-bit

Field 3 Mantissa = 23-bit

Instead of the signed exponent, E, the value actually stored in the exponent field is

E’ = E (Scaling factor) + Bias (127)

Here Bias is 127, so it is known as excess-127 format.
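As a rough illustration of the bias arithmetic, the sketch below computes E’ for a given value. It relies on Python's `math.frexp`, which normalizes to 0.5 ≤ m < 1 rather than 1 ≤ m < 2, so one is subtracted from its exponent. The helper name `stored_exponent` is an assumption for this example, and zero, infinities, and denormalized numbers are not handled.

```python
import math


def stored_exponent(x, bias=127):
    """Return the biased exponent E' = E + bias for a normal, non-zero x."""
    _, e = math.frexp(x)   # x = m * 2**e with 0.5 <= |m| < 1
    true_exp = e - 1       # shift to the 1.xxx normalization used in the text
    return true_exp + bias


print(stored_exponent(1.0))     # E = 0, so E' equals the bias
print(stored_exponent(0.125))   # E = -3
```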

Double Precision Representation

Field 1 Sign = 1-bit

Field 2 Exponent = 11-bit

Field 3 Mantissa = 52-bit

Instead of the signed exponent, E, the value actually stored in the exponent field is

E’ = E (Scaling factor) + Bias (1023)

Here Bias is 1023, so it is known as excess-1023 format.
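The two layouts can be inspected on a real machine with Python's standard `struct` module. The following is a minimal sketch; the function name `split_fields` is hypothetical, and the field widths are the ones listed above.

```python
import struct


def split_fields(x, double=False):
    """Split x into (sign, stored exponent E', mantissa) as integers."""
    if double:
        (bits,) = struct.unpack(">Q", struct.pack(">d", x))  # 64-bit pattern
        exp_width, man_width = 11, 52
    else:
        (bits,) = struct.unpack(">I", struct.pack(">f", x))  # 32-bit pattern
        exp_width, man_width = 8, 23
    sign = bits >> (exp_width + man_width)
    stored_exp = (bits >> man_width) & ((1 << exp_width) - 1)
    mantissa = bits & ((1 << man_width) - 1)
    return sign, stored_exp, mantissa


print(split_fields(1.0))               # single precision
print(split_fields(1.0, double=True))  # double precision
```

For 1.0 the true exponent is 0, so the stored exponent is just the bias of the chosen format.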

Example:

Represent (1259.125)_10 in single-precision and double-precision formats.

Solution:

Step 1: Convert the decimal number to binary

1259 = 100 1110 1011

0.125 = 0.001

Binary number = 1 0 0 1 1 1 0 1 0 1 1 + 0. 0 0 1 = 1 0 0 1 1 1 0 1 0 1 1. 0 0 1
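Step 1 can be reproduced with the usual hand methods: repeated division by 2 for the integer part and repeated multiplication by 2 for the fraction. The sketch below assumes a non-negative input; `to_binary` and its `max_frac_bits` cut-off are names invented for this illustration.

```python
def to_binary(integer_part, fraction, max_frac_bits=23):
    """Convert a non-negative integer and a fraction in [0, 1) to bit strings."""
    int_bits = ""
    n = integer_part
    while n > 0:
        int_bits = str(n % 2) + int_bits   # remainders, read bottom-up
        n //= 2
    int_bits = int_bits or "0"

    frac_bits = ""
    f = fraction
    while f > 0 and len(frac_bits) < max_frac_bits:
        f *= 2
        bit = int(f)                       # carry into the units place
        frac_bits += str(bit)
        f -= bit
    return int_bits, frac_bits


int_bits, frac_bits = to_binary(1259, 0.125)
print(int_bits + "." + frac_bits)
```

The cut-off matters for fractions such as 0.1 whose binary expansion never terminates; 0.125 terminates after three bits.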

Step 2: Normalize the number

1 0 0 1 1 1 0 1 0 1 1. 0 0 1 = 1. 0 0 1 1 1 0 1 0 1 1 0 0 1 × 2^10

Step 3: Single precision representation

Here S=0 (because given number is positive)

E=10 (exponent)

M = 0 0 1 1 1 0 1 0 1 1 0 0 1

Bias for a single precision format = 127

E’ = E + 127 = 10 + 127 = (137)_10 = (1 0 0 0 1 0 0 1)_2

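The number in single-precision format is therefore (sign | 8-bit E’ | 23-bit mantissa, with M padded by trailing zeros):

0 | 1 0 0 0 1 0 0 1 | 0 0 1 1 1 0 1 0 1 1 0 0 1 0 0 0 0 0 0 0 0 0 0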
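The single-precision result can be cross-checked by assembling the three fields and comparing them with the pattern produced by Python's `struct` module. This is a sketch, and `assemble_single` is a hypothetical helper name.

```python
import struct


def assemble_single(sign, true_exp, mantissa_bits):
    """Pack sign, true exponent E, and mantissa bit string into 32 bits."""
    stored_exp = true_exp + 127                       # excess-127 exponent E'
    fraction = int(mantissa_bits.ljust(23, "0"), 2)   # pad M to 23 bits
    return (sign << 31) | (stored_exp << 23) | fraction


word = assemble_single(0, 10, "0011101011001")
(reference,) = struct.unpack(">I", struct.pack(">f", 1259.125))

print(format(word, "032b"))
print(word == reference)
```

Because 1259.125 needs only 13 mantissa bits, it is stored exactly and the comparison holds without any rounding.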

Step 4:  Double precision representation

Here S=0 (because given number is positive)

E=10 (exponent)

M = 0 0 1 1 1 0 1 0 1 1 0 0 1

Bias for a double precision format = 1023

E’ = E + 1023 = 10 + 1023 = (1033)_10 = (1 0 0 0 0 0 0 1 0 0 1)_2

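The number in double-precision format is therefore (sign | 11-bit E’ | 52-bit mantissa):

0 | 1 0 0 0 0 0 0 1 0 0 1 | 0 0 1 1 1 0 1 0 1 1 0 0 1 followed by 39 zeros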
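Going the other way, the double-precision fields can be decoded back to a value with (-1)^S × 1.M × 2^(E’ - 1023). The sketch below covers only normalized numbers, and `decode_double` is a name chosen for this example.

```python
import struct


def decode_double(sign, stored_exp, mantissa_bits):
    """Rebuild the value from S, E' and the mantissa bit string (normal case only)."""
    fraction = int(mantissa_bits.ljust(52, "0"), 2) / 2**52   # the .M part
    return (-1) ** sign * (1 + fraction) * 2.0 ** (stored_exp - 1023)


value = decode_double(0, 1033, "0011101011001")
print(value)

# Cross-check against the machine's own 64-bit encoding of the same number.
print(struct.pack(">d", value).hex())
```

The decoded value is 1259.125, matching the number the example started with.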