What is floating point?

The processing units of a computer work with several types of numbers. Some are integers, others are floating point numbers. In this article I will focus on the latter, since they are less widely understood than the former. By the end you will know everything you need about this type of number: what they are used for, their formats, and more.

What is floating point?

Floating point is a format used to represent real numbers in computer systems. Real numbers are those that include an integer part and a fractional part, such as 3.14 or -0.75.

In a floating point number, the representation is divided into three fields. The base itself (the number to which the exponent is applied) is not stored, since it is fixed by the format: 2 in binary formats.

  • Sign: a single bit that indicates whether the number is positive or negative. One indicates negative and zero indicates positive.
  • Exponent: a set of bits representing the exponent to which the base must be raised to obtain the real value of the number. The exponent makes it possible to represent very large or very small numbers using a fixed number of bits.
  • Mantissa: also called the fraction or significand, it is the binary representation of the significant digits of the number.

The floating point format is based on scientific notation: a number is represented as the product of a mantissa and base 2 raised to the exponent.
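To make the three fields concrete, here is a short Python sketch (the function name `float_fields` is my own, not part of any standard library) that unpacks a 64-bit double into its sign, exponent, and mantissa bits:

```python
import struct

def float_fields(x: float) -> tuple[int, int, int]:
    """Split a 64-bit IEEE 754 double into its sign, exponent and mantissa bits."""
    # Reinterpret the 8 bytes of the double as one unsigned 64-bit integer.
    bits = struct.unpack(">Q", struct.pack(">d", x))[0]
    sign = bits >> 63                  # 1 bit
    exponent = (bits >> 52) & 0x7FF    # 11 bits, stored with a bias of 1023
    mantissa = bits & ((1 << 52) - 1)  # 52 fraction bits (leading 1 is implicit)
    return sign, exponent, mantissa

# -0.75 = (-1)^1 * 1.5 * 2^-1, so sign=1 and the biased exponent is -1+1023=1022
print(float_fields(-0.75))  # (1, 1022, 2251799813685248)
```

The mantissa value above is `1 << 51`, i.e. the single fraction bit that encodes the 0.5 in 1.5.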

On the other hand, there is some additional terminology worth knowing to better understand what a floating point number is:

  • Length: the number of bits used for the representation. With a 4-bit length, representations range from 0000 to 1111; with 8 bits, from 00000000 to 11111111, which in decimal covers 0 to 255 (256 possible values).
  • Precision: a floating point number cannot always be represented exactly; sometimes it must be rounded to some number of decimal places.
  • Overflow: occurs when trying to store a number larger than can be represented in the length being used, so not all bits fit and an error (or a special value such as infinity) results.
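The precision and overflow points above can be seen directly in Python, whose `float` is a 64-bit double:

```python
import math
import sys

# Precision: 0.1 and 0.2 have no exact binary representation,
# so the sum picks up a tiny rounding error.
print(0.1 + 0.2)                      # 0.30000000000000004
print(0.1 + 0.2 == 0.3)               # False
print(math.isclose(0.1 + 0.2, 0.3))   # True: compare with a tolerance instead

# Overflow: exceeding the largest representable double yields infinity.
big = sys.float_info.max
print(big * 2)                        # inf
```

This is why floating point comparisons are usually done with a tolerance rather than strict equality.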

IEEE 754 standard

The IEEE 754 standard is a widely used specification for the representation and manipulation of floating point numbers in computer systems. This standard defines formats and rules for the binary representation of real numbers, as well as for performing arithmetic and other related operations.

It was created in 1985 by the Institute of Electrical and Electronics Engineers (IEEE). The standard addressed many problems found in the various floating-point implementations that made it difficult to use them reliably.

The IEEE 754 standard establishes two main formats for the representation of floating point numbers: single precision (32 bits) and double precision (64 bits).
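The practical difference between the two formats is how many significant digits survive. As a sketch, Python's `struct` module can round a value through a 32-bit single and compare it with the 64-bit double:

```python
import struct

pi = 3.141592653589793

# Round-trip through a 32-bit single (">f"): only about 7 decimal digits survive.
single = struct.unpack(">f", struct.pack(">f", pi))[0]
print(single)   # 3.1415927410125732

# A 64-bit double (">d") keeps about 15-16 decimal digits.
double = struct.unpack(">d", struct.pack(">d", pi))[0]
print(double)   # 3.141592653589793

print(struct.calcsize(">f"), struct.calcsize(">d"))  # 4 8 (bytes)
```

Single precision uses 1 sign bit, 8 exponent bits, and 23 mantissa bits; double precision uses 1, 11, and 52 respectively.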

Comparison: Integer vs. floating point speed

You may be wondering whether it is better to have a processor with stronger performance in integer operations or in floating point operations. This used to be a common question; although it has now faded into the background, it is still interesting, since software performance can depend on this choice.

In general, many conventional applications make more intensive use of integer operations. However, in applications such as scientific software, video games, and multimedia, floating point performance becomes more important. So it will depend on the software you use most often.

Currently, we use the FLOPS unit of measure (floating point operations per second) to describe the performance of a machine in terms of the floating point calculations it can perform. However, this unit is more relevant in supercomputing or High Performance Computing (HPC) environments than in the realm of home computing.
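As a very rough illustration of what a FLOPS figure means, here is a naive Python sketch that times a loop of multiply-adds. Note this is an assumption-laden toy: interpreter overhead dominates in pure Python, so real benchmarks (such as LINPACK) use optimized native code and report vastly higher numbers.

```python
import time

n = 1_000_000
x = 1.0000001
acc = 0.0

start = time.perf_counter()
for _ in range(n):
    acc = acc * x + 1.0   # one multiply + one add = 2 floating point operations
elapsed = time.perf_counter() - start

flops = 2 * n / elapsed   # operations performed divided by seconds taken
print(f"~{flops:.2e} FLOPS (naive Python loop)")
```

The same idea, scaled up to huge matrix workloads, is how supercomputer FLOPS ratings are measured.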

Also, it is important to note that integer instructions tend to execute faster, while floating point operations may require more clock cycles to complete. For this reason, it is not easy to directly compare instructions per cycle between integer and floating point operations.