Mathematics LibreTexts

3.2: Floating Point Numbers


    Many fields in scientific computing rely on decimal numbers, and the standard way to store these in a computer is with floating-point numbers. Details on floating-point numbers are in Appendix XXXX. Julia has 16-, 32- and 64-bit floating-point types called Float16, Float32 and Float64, and the default is Float64.
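
    As a small illustration of these three types, the sketch below checks the type of a literal and converts a value to the smaller types. The variable names are only for illustration.

```julia
x = 0.1                 # a literal with a decimal point is a Float64
y = Float32(0.1)        # explicit conversion to a 32-bit float
z = 0.1f0               # the f0 suffix writes a Float32 literal directly
w = Float16(0.1)        # explicit conversion to a 16-bit float

# sizeof reports the storage size in bytes: 2, 4 and 8 respectively
sizes = (sizeof(Float16), sizeof(Float32), sizeof(Float64))
println(typeof(x), " ", typeof(y), " ", typeof(z), " ", typeof(w), " ", sizes)
```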

    Every floating-point type has two limitations: first, the number of significant digits it can store (its precision) and second, the largest and smallest magnitudes it can represent (its range). Each built-in type splits its bits between the two, so there is a trade-off between them. A rule of thumb is that

    • Float16 stores about 3 decimal digits and the max is about 65,000.

    • Float32 stores about 7 decimal digits and the max is about \(10^{38}\).

    • Float64 stores about 16 decimal digits and the max is about \(10^{308}\).
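
    These rules of thumb can be checked directly in Julia. The following is a minimal sketch using the built-in functions precision, eps and floatmax; the estimate of decimal digits as (significand bits) × log10(2) is an approximation we introduce here, not an exact count.

```julia
# Report precision and range for each built-in floating-point type.
for T in (Float16, Float32, Float64)
    bits = precision(T)                          # significand bits, including the implicit leading 1
    digits = round(bits * log10(2); digits = 1)  # rough number of decimal digits
    println(T, ": ", bits, " bits (about ", digits, " decimal digits), ",
            "eps = ", eps(T), ", max = ", floatmax(T))
end
```

    Here eps(T) is the gap between 1 and the next larger number of type T, so it measures the relative precision, while floatmax(T) is the largest finite value of that type.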

    We can use the bitstring function in Julia to find the binary representation. Notice that

    bitstring(Float16(8.625))
    "0100100001010000"

    Again, details are in Appendix XXXXX but, in short, a floating-point number is stored in binary scientific notation, with the sign, the exponent and the mantissa (also called the significand) all packed together.
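
    To make this layout concrete, the sketch below splits the bit string for 8.625 into its three fields and rebuilds the value. It assumes the IEEE 754 half-precision layout (1 sign bit, 5 exponent bits stored with a bias of 15, and 10 mantissa bits) and only handles ordinary (normalized) numbers; the variable names are our own.

```julia
bits = bitstring(Float16(8.625))

sign_bit      = bits[1:1]     # 1 bit: 0 for positive, 1 for negative
exponent_bits = bits[2:6]     # 5 bits, stored with a bias of 15
mantissa_bits = bits[7:16]    # 10 bits: fractional part of the significand

# Convert each field from binary and undo the exponent bias.
exponent = parse(Int, exponent_bits; base = 2) - 15
fraction = parse(Int, mantissa_bits; base = 2) / 2^10

# Rebuild the number as sign * (1 + fraction) * 2^exponent.
value = (-1)^parse(Int, sign_bit) * (1 + fraction) * 2.0^exponent
println(sign_bit, " ", exponent_bits, " ", mantissa_bits, " -> ", value)
```

    In other words, 8.625 is stored as \(1.078125 \times 2^{3}\).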

    Unlike integers, most numbers cannot be stored exactly as a floating-point number. For example, 1/3 divides 1 by 3 and results in a floating-point number close to, but not equal to, the fraction \(\frac{1}{3}\). In Julia this is 0.3333333333333333 and also note that

    bitstring(1/3)
    "0011111111010101010101010101010101010101010101010101010101010101"

    Notice that there are non-zero bits throughout the number in this case, which didn’t occur with 8.625. This is because, as a fraction, 8.625 = 69/8 has a denominator of 8, which is a power of 2. If a fraction can be written with such a denominator, the number in binary has 0s that pad the right end of the number.
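
    One way to see which fractions are stored exactly is to convert a float back to a Rational, which gives the exact value held in memory. This is a minimal sketch; the variable names are only for illustration.

```julia
stored_a = Rational(8.625)   # exact value stored for 8.625
stored_b = Rational(1/3)     # exact value stored for 1/3

println(stored_a == 69//8)          # true: 69/8 is stored exactly
println(stored_b == 1//3)           # false: only a nearby fraction is stored
println(denominator(stored_b))      # the stored denominator is a power of 2
```

    The stored value of 1/3 is a fraction whose denominator is a power of 2, which is why it can only approximate \(\frac{1}{3}\).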

    Why does this matter? Consider the following:

    1/9+1/9+1/9+1/9+1/9+1/9+1/9+1/9+1/9
    1.0000000000000002

    The result is not 1, the expected value. This is an example of the limitations of floating-point numbers: either 1) we deal with it or 2) we use a different data type (in this case either a BigFloat or a Rational would be better).

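    Note: This occurred because neither 1/9 nor the intermediate sums can be stored exactly. The closest Float64 to 1/9 is in fact slightly below 1/9, but each addition rounds its result to the nearest floating-point number, and in this case those roundings accumulate into the extra amount.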
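
    Both options can be sketched in a few lines. To deal with the error, we can compare with ≈ (the isapprox function), which tests equality within a small relative tolerance; to avoid it, we can do the arithmetic with Rational numbers. The variable names below are our own.

```julia
float_sum = foldl(+, fill(1/9, 9))     # nine Float64 additions, left to right
exact_sum = foldl(+, fill(1//9, 9))    # the same sum in exact Rational arithmetic

println(float_sum == 1.0)    # false: round-off error is present
println(float_sum ≈ 1.0)     # true: equal within the default tolerance
println(exact_sum == 1)      # true: no rounding occurs
```

    Rational arithmetic is exact but slower and its numerators and denominators can grow quickly, so the approximate comparison is usually the more practical choice in numerical code.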

    Unless you have a specific reason to choose otherwise, use Float64 for floating-point numbers. Underflow and overflow errors can still occur with it but, as we will see in Chapter XXXXX, the round-off error associated with floating-point numbers is generally more detrimental to calculations.
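
    The three kinds of error mentioned here can each be seen with Float64. The values below are chosen only for illustration.

```julia
overflowed  = floatmax(Float64) * 2      # too large to represent: becomes Inf
underflowed = floatmin(Float64) / 2^60   # too small to represent: becomes 0.0
rounded     = 1e16 + 1.0                 # the added 1 is lost to round-off

println(overflowed, " ", underflowed, " ", rounded == 1e16)
```

    Overflow and underflow only appear at extreme magnitudes, whereas round-off like the last line can occur in everyday calculations.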

     


    This page titled 3.2: Floating Point Numbers is shared under a CC BY-NC-SA license and was authored, remixed, and/or curated by Peter Staab.
