3.2: Floating Point Numbers
Many fields in scientific computing rely on decimal numbers, and the standard way to store these in a computer is with floating-point numbers. Details on floating-point numbers are in Appendix XXXX. Julia has 16-, 32- and 64-bit floating-point numbers called Float16, Float32 and Float64, and the default on most systems is Float64.
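As a quick check of the default type, the following minimal sketch asks Julia for the type and storage size of a few values. The variable names x, y and z are only for illustration.

```julia
# A literal with a decimal point is stored as a Float64.
x = 1.5
println(typeof(x))

# The smaller types are requested explicitly.
y = Float32(1.5)
z = Float16(1.5)
println(typeof(y), " ", typeof(z))

# Storage size in bytes: 8, 4 and 2.
println(sizeof(x), " ", sizeof(y), " ", sizeof(z))
```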
There are two limitations to any floating-point number: first, the number of digits it can store and, second, the maximum and minimum values it can represent. Each built-in type divides its bits between the two, so there is a trade-off between precision and range. A rule of thumb is that
• Float16 stores about 3 decimal digits and the max is about 65,000.
• Float32 stores about 7 decimal digits and the max is about 10^38.
• Float64 stores about 16 decimal digits and the max is about 10^308.
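These rules of thumb can be checked with Julia's built-in precision, floatmax and eps functions. The sketch below is one way to do so: precision reports the number of significand bits, and multiplying by log10(2) converts that to an approximate number of decimal digits. The names nbits and dec are illustrative.

```julia
for T in (Float16, Float32, Float64)
    nbits = precision(T)                        # significand bits: 11, 24 and 53
    dec = round(nbits * log10(2), digits=1)     # approximate decimal digits
    println(T, ": about ", dec, " digits, max = ", floatmax(T), ", eps = ", eps(T))
end
```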
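We can use the bitstring function in Julia to find the binary representation of a floating-point number. Notice that bitstring(8.625) returns a string of 64 bits that ends in a long run of 0s.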
Again, details are in Appendix XXXXX but, in short, a floating-point number is stored in scientific notation with the mantissa, the exponent and the sign all packed together.
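To make this layout concrete, here is a small sketch that splits the bit string of 8.625 into its three fields. It assumes the standard 64-bit layout (1 sign bit, 11 exponent bits stored with an offset of 1023, and 52 mantissa bits); the variable names are illustrative.

```julia
b = bitstring(8.625)           # a String of 64 characters, each '0' or '1'

sign_bit  = b[1:1]             # 1 bit: "0" means positive
expo_bits = b[2:12]            # 11 bits, stored with an offset (bias) of 1023
frac_bits = b[13:end]          # 52 bits of the mantissa after the implicit leading 1

println(b)
println("sign:     ", sign_bit)
println("exponent: ", expo_bits, " = ", parse(Int, expo_bits, base=2) - 1023)
println("fraction: ", frac_bits)
```

Since 8.625 = 1.078125 × 2^3, the decoded exponent is 3 and the fraction field holds the binary digits of 0.078125.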
Unlike integers, most numbers cannot be stored exactly as a floating-point number. For example, 1/3 divides 1 by 3 and results in a floating-point number close to the fraction 1/3. In Julia this is 0.3333333333333333, and also note that bitstring(1/3) returns a string of 64 bits that alternate between 0 and 1 all the way to the right end.
Notice that, in this case, there are nonzero bits throughout the number, which did not occur with 8.625. This is because, as a fraction, 8.625 = 69/8 has a denominator of 8, which is a power of 2. If a fraction can be written with such a denominator, its binary expansion terminates, and 0s pad the right end of the stored number.
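The difference between the two numbers can be seen side by side. The sketch below prints both bit strings, counts the trailing zero bits of each stored pattern, and converts each value to an exact Rational to show what is actually stored.

```julia
println(bitstring(8.625))   # ends in a long run of 0s
println(bitstring(1/3))     # nonzero bits all the way to the right end

# Count the zero bits at the right end of each stored pattern.
println(trailing_zeros(reinterpret(UInt64, 8.625)))
println(trailing_zeros(reinterpret(UInt64, 1/3)))

# Converting to an exact Rational reveals the stored value.
println(Rational(8.625))    # 69//8
println(Rational(1/3))      # close to 1//3, but with a power-of-2 denominator
```

Every Float64 is a fraction whose denominator is a power of 2, so 1/3 can only be approximated.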
Why does this matter? Consider the sum 1/9+1/9+1/9+1/9+1/9+1/9+1/9+1/9+1/9, which in Julia evaluates to 1.0000000000000002. This is not 1, the expected result. It is an example of the limitations of floating-point numbers, and either 1) we deal with it or 2) we use a different data type (in this case either a BigFloat or Rational would be better).
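Note: This occurred because 1/9 cannot be stored exactly. The closest floating-point number to 1/9 is slightly different from 1/9 (it is in fact slightly smaller), and each addition rounds its result to the nearest floating-point number. The accumulated rounding results in the extra amount.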
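The following sketch repeats the sum of nine copies of 1/9 in three number types. It uses foldl so that the additions are done strictly left to right; the variable names are illustrative.

```julia
# Add 1/9 to itself nine times, left to right, in Float64.
s = foldl(+, fill(1/9, 9))
println(s)                  # 1.0000000000000002
println(s == 1)             # false
println(s - 1 == eps())     # true: off by one unit in the last place

# The same sum with exact rational arithmetic.
r = foldl(+, fill(1//9, 9))
println(r)                  # 1//1
println(r == 1)             # true

# BigFloat carries many more digits, so the error is far smaller
# (though not necessarily zero).
bf = foldl(+, fill(big(1)/9, 9))
println(abs(bf - 1) < 1e-60)
```

Rational removes the error entirely for this calculation, while BigFloat only pushes it to much smaller digits.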
Unless you know you have some reason to choose otherwise, choose Float64 for most floating-point numbers. There are still underflow and overflow errors associated with it but, as we will see in Chapter XXXXX, the round-off error associated with floating-point numbers is generally more detrimental to calculations.
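As a brief illustration of the three kinds of error mentioned here, the sketch below triggers an overflow, an underflow and a round-off error in Float64. The specific values are chosen only for demonstration.

```julia
# Overflow: a result larger than the maximum becomes Inf.
largest = floatmax(Float64)
println(largest * 2)

# Underflow: a result too close to zero becomes 0.0.
smallest = floatmin(Float64)     # smallest positive normal Float64
println(smallest / 2^60)

# Round-off: neither 0.1, 0.2 nor 0.3 is stored exactly.
println(0.1 + 0.2 == 0.3)        # false
```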