3.2: Floating Point Numbers
Many fields in scientific computing rely on decimal numbers, and the standard way to store these in a computer is as floating-point numbers. Details on floating-point numbers are in Appendix XXXX. Julia has 16-, 32- and 64-bit floating-point types called Float16, Float32 and Float64, and the default is Float64.
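As a quick check of the default, the following sketch asks Julia for the type and storage size of a few values. The literal 1.5 and the variable names are chosen only for illustration.

```julia
# A decimal literal is stored as a Float64 unless told otherwise.
x = 1.5
default_type = typeof(x)

# The smaller types are requested by calling their constructors.
y = Float32(1.5)
z = Float16(1.5)

# sizeof reports the storage in bytes for each value.
sizes = (sizeof(x), sizeof(y), sizeof(z))
println(default_type, " ", sizes)
```

The sizes in bytes correspond to the 64, 32 and 16 bits of the three types.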
There are two limitations to any floating-point number: first, the number of digits that can be stored and, second, the maximum and minimum values. Each built-in type divides its bits between the two, so there is a trade-off between precision and range. A rule of thumb is that
• Float16 stores 3 to 4 decimal digits and the max is 65,504.
• Float32 stores 7 to 8 decimal digits and the max is about \(3.4 \times 10^{38}\).
• Float64 stores 15 to 16 decimal digits and the max is about \(1.8 \times 10^{308}\).
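The rule of thumb can be checked directly. The sketch below uses the built-in functions floatmax, eps and precision; the loop and the printed layout are only one way to display the values.

```julia
# For each built-in type, print the largest finite value, the machine
# epsilon (the gap between 1 and the next larger number) and the number
# of bits in the significand.
for T in (Float16, Float32, Float64)
    println(T, ": max = ", floatmax(T),
            ", eps = ", eps(T),
            ", significand bits = ", precision(T))
end
```

The number of decimal digits is roughly the number of significand bits multiplied by \(\log_{10} 2 \approx 0.301\).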
We can use the bitstring function in Julia to find the binary representation of a floating-point number. For example, bitstring(8.625) returns the 64 bits that store 8.625 as a Float64.
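A minimal sketch of calling bitstring, using the value 8.625 that the text discusses further on; the variable name is illustrative.

```julia
# bitstring returns the bits of a value as a string of 0s and 1s.
bits = bitstring(8.625)
println(bits)

# A Float64 occupies 64 bits, so the string has 64 characters.
println(length(bits))
```

The printed string ends in a long run of zeros.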
Again, details are in Appendix XXXXX but, in short, a floating-point number is stored in binary scientific notation, with the sign, the exponent and the mantissa (also called the significand) packed together into the bits.
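To make the three parts concrete, the sketch below slices the bit string of a Float64 into its fields. It assumes the standard 64-bit layout (1 sign bit, 11 exponent bits, 52 fraction bits, with the stored exponent offset by 1023); the variable names are illustrative.

```julia
x = 8.625
bits = bitstring(x)

# Float64 layout: 1 sign bit, 11 exponent bits, 52 fraction bits.
sign_bit      = bits[1:1]
exponent_bits = bits[2:12]
fraction_bits = bits[13:64]
println(sign_bit, " | ", exponent_bits, " | ", fraction_bits)

# The stored exponent is offset by 1023; exponent(x) gives the true power of 2.
stored_exponent = parse(Int, exponent_bits; base = 2)
println(stored_exponent - 1023 == exponent(x))

# significand(x) is the factor in [1, 2) that multiplies the power of 2.
println(significand(x) * 2.0^exponent(x) == x)
```

Here 8.625 is stored as 1.078125 times \(2^3\), so both comparisons print true.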
Unlike integers, most numbers cannot be stored exactly as a floating-point number. For example, 1/3 divides 1 by 3 and results in a floating-point number close to, but not equal to, the fraction \(\frac{1}{3}\). In Julia this is displayed as 0.3333333333333333, and bitstring(1/3) shows the bits that are stored.
The binary representation of 1/3 has non-zero bits throughout, which did not occur with 8.625. This is because, as a fraction, 8.625 = 69/8 has a denominator of 8, which is a power of 2. If a fraction can be written with such a denominator (and the numerator fits in the available bits), the number is stored exactly and its binary form has 0s that pad the right end.
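The difference between the two numbers can be seen by printing both bit strings and counting the zeros at the right end. The helper name padding_zeros is hypothetical and not part of Julia.

```julia
third = bitstring(1/3)
exact = bitstring(8.625)
println(third)
println(exact)

# Count the zero bits that pad the right end of a bit string.
padding_zeros(s) = length(s) - length(rstrip(s, '0'))

println(padding_zeros(third))   # 0: the last bit of 1/3 is a 1
println(padding_zeros(exact))   # 46: only the first 6 fraction bits are used
```

The repeating pattern in 1/3 is cut off at the last available bit, which is where the round-off comes from.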
Why does this matter? Consider the sum 1/9+1/9+1/9+1/9+1/9+1/9+1/9+1/9+1/9. In Julia the result is 1.0000000000000002, not 1, the expected result. This is an example of the limitations of floating-point numbers, and we must either 1) deal with it or 2) use a different data type (in this case either a BigFloat or a Rational would be better).
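The sketch below adds 1/9 to itself nine times with three different number types. The function add_nine is a hypothetical helper written for this comparison; note that BigFloat is still a floating-point type, so it reduces the error rather than removing it, while Rational arithmetic is exact.

```julia
# Add x to itself nine times, one addition at a time.
function add_nine(x)
    total = zero(x)
    for _ in 1:9
        total += x
    end
    return total
end

float_sum    = add_nine(1/9)              # Float64 arithmetic
rational_sum = add_nine(1//9)             # exact Rational arithmetic
big_sum      = add_nine(BigFloat(1)/9)    # higher-precision floating point

println(float_sum == 1)       # false
println(float_sum - 1)        # the extra amount
println(rational_sum == 1)    # true
println(abs(big_sum - 1) < abs(float_sum - 1))   # true: the error is far smaller
```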
Note: This occurred because 1/9 cannot be stored exactly. The closest Float64 differs slightly from 1/9 (it is in fact just below it), and each of the additions rounds its result again; the accumulated round-off produces the extra amount.
Unless you have a specific reason to choose otherwise, choose Float64 for most floating-point numbers. Underflow and overflow errors can still occur with it but, as we will see in Chapter XXXXX, the round-off error associated with floating-point numbers is generally more detrimental to calculations.
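For reference, the three kinds of error mentioned here can each be triggered in a line or two. This is a minimal sketch; the particular multipliers and the value 1e16 are arbitrary choices for illustration.

```julia
big_value  = floatmax(Float64)   # largest finite Float64
tiny_value = floatmin(Float64)   # smallest positive normal Float64

# Overflow: going past the largest finite value gives Inf.
println(big_value * 2)

# Underflow: a result too small to represent collapses to zero.
println(tiny_value * 1e-20)

# Round-off: near 1e16 the spacing between Float64 values is 2,
# so adding 1 changes nothing.
println(1e16 + 1 == 1e16)
```

Overflow and underflow need extreme magnitudes, whereas round-off of the last kind appears with quite ordinary values.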