3.2: Floating Point Numbers


Many fields in scientific computing rely on decimal numbers, and the standard way to store these in a computer is with floating-point numbers. Details on floating-point numbers are in Appendix XXXX. Julia has 16-, 32- and 64-bit floating-point types called Float16, Float32 and Float64; the default for a literal such as 1.5 is Float64.

There are two limitations to any floating-point number: first, the number of digits it stores and second, the maximum and minimum values it can hold. Each built-in type splits its bits between the two, so there is a trade-off between precision and range. A rule of thumb is that

• Float16 stores 3 to 4 decimal digits and the max is about 65,000.

• Float32 stores 7 to 8 decimal digits and the max is about \(10^{38}\).

• Float64 stores about 16 decimal digits and the max is about \(10^{308}\).
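
The limits in this list can be queried directly. The following is a minimal sketch using the standard functions precision, eps and floatmax; the conversion from bits to decimal digits (multiplying by log10(2)) is only an approximation used here for illustration.

```julia
# Query the precision and range of each built-in floating-point type.
for T in (Float16, Float32, Float64)
    bits = precision(T)              # number of significand bits (including the implicit leading bit)
    digits = bits * log10(2)         # rough equivalent in decimal digits
    println(T, ": ", bits, " significand bits (about ", round(digits, digits = 1),
            " decimal digits), eps = ", eps(T), ", floatmax = ", floatmax(T))
end
```

Here eps(T) is the gap between 1 and the next larger number of type T, so it gives a sense of the relative precision, while floatmax(T) is the largest finite value.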

We can use the bitstring function in Julia to find the binary representation of a floating-point number. For example,

bitstring(Float16(8.625))
"0100100001010000"

Again, details are in Appendix XXXXX but, in short, a floating-point number is stored in a binary form of scientific notation, with the sign, the exponent and the mantissa (also called the significand) all combined together.
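
To make the three parts concrete, the sketch below splits the bit string of Float16(8.625) by hand. It assumes the usual IEEE 754 half-precision layout (1 sign bit, 5 exponent bits with a bias of 15, and 10 fraction bits) and only handles ordinary (normal) numbers; the variable names are chosen for illustration.

```julia
bits = bitstring(Float16(8.625))    # "0100100001010000"

sign_bit      = bits[1]             # 1 sign bit
exponent_bits = bits[2:6]           # 5 exponent bits
fraction_bits = bits[7:16]          # 10 fraction bits

# Remove the exponent bias of 15 (assumed IEEE 754 half-precision layout).
exponent = parse(Int, exponent_bits; base = 2) - 15

# The fraction bits follow an implicit leading 1 in the significand.
fraction = parse(Int, fraction_bits; base = 2) / 2^10

value = (sign_bit == '1' ? -1 : 1) * (1 + fraction) * 2.0^exponent
println(value)                      # 8.625
```

In other words, 8.625 is stored as 1.078125 times 2 to the power 3.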

Unlike integers, most numbers cannot be stored exactly as a floating-point number. For example, 1/3 divides 1 by 3 and results in a floating-point number close to the fraction 1/3. In Julia this is displayed as 0.3333333333333333, and its binary representation is

bitstring(1/3)
"0011111111010101010101010101010101010101010101010101010101010101"

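Notice that there are nonzero bits throughout the number in this case, which did not occur with 8.625. This is because, as a fraction, 8.625 = 69/8 has a denominator of 8, which is a power of 2. If a fraction can be written with such a denominator (and the numerator fits in the available bits), its binary expansion terminates, so the stored number has 0s that pad the right end. The fraction 1/3 has no such form, so its binary expansion repeats forever and must be cut off.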
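
One way to see the difference is to convert each stored value back to an exact fraction. The sketch below relies on Rational applied to a Float64 returning the exact value that is stored, which always has a power-of-2 denominator.

```julia
# 8.625 is stored exactly: the fraction comes back as 69//8.
println(Rational(8.625))

# 1/3 is not stored exactly: the result has a power-of-2 denominator, not 3.
stored_third = Rational(1/3)
println(stored_third)
println(stored_third == 1//3)       # false

# The trailing bits of 8.625 are all zero padding.
println(bitstring(8.625))
```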

Why does this matter? Consider the following:

  1/9+1/9+1/9+1/9+1/9+1/9+1/9+1/9+1/9
1.0000000000000002

The result is not 1, as we would expect. This is an example of the limitations of floating-point numbers, and we can either 1) accept and manage the error or 2) use a different data type. In this case a Rational gives the exact answer, and a BigFloat reduces the error but is still rounded.
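
As a rough sketch of the second option, the same sum can be carried out with Rational and BigFloat values. The tolerance 1e-70 below is an assumption chosen for illustration, based on the default BigFloat precision of 256 bits.

```julia
# Rational arithmetic is exact, so nine copies of 1//9 add up to exactly 1.
r = sum(fill(1//9, 9))
println(r)                          # 1//1
println(r == 1)                     # true

# BigFloat is still a rounded floating-point type, just with many more digits.
b = foldl(+, fill(big(1) / 9, 9))   # add left to right
println(abs(b - 1) < 1e-70)         # true: any remaining error is far below Float64 round-off
```

Rational numbers avoid round-off entirely but can grow large numerators and denominators, so they are usually reserved for cases where exactness matters.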

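Note: The closest Float64 to the fraction 1/9 is actually slightly below 1/9. The extra amount comes from round-off in the additions themselves: the result of every intermediate sum is rounded to the nearest Float64, and in this calculation several of those roundings go upward, leaving the total slightly above 1.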
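
Which side of 1/9 the stored value falls on, and how the order of operations affects the result, can be checked directly. This is a small sketch; it uses BigFloat for the comparison and foldl to force strict left-to-right addition.

```julia
x = 1/9

# Compare the stored Float64 with a much more precise value of 1/9.
println(big(x) < big(1) / 9)

# A single multiplication is rounded once.
println(9 * x == 1.0)

# Nine additions are rounded after every step.
println(foldl(+, fill(x, 9)))
```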

Unless you have a specific reason to choose otherwise, use Float64 for floating-point numbers. Underflow and overflow errors can still occur with it but, as we will see in Chapter XXXXX, the round-off error associated with floating-point numbers is generally more detrimental to calculations.

 


This page titled 3.2: Floating Point Numbers is shared under a CC BY-NC-SA license and was authored, remixed, and/or curated by Peter Staab.
