4-4  FLOATING POINT NUMBERS ON DIFFERENT MACHINES 
 *************************************************

 The information in this chapter is due mainly to Arne Vajhoej; without
 him it would have been impossible. The remaining data were gathered with
 the routine MACHAR written by Cody and Malcolm.

 More sophisticated routines are SPARA and DPARA from the PARANOIA
 package, which is available from Netlib. These routines check single
 precision (SPARA) and double precision (DPARA) arithmetic in finer
 numerical detail.

 The tables below are useful when you need to supply routines with
 a 'required accuracy parameter', or when you have to convert routines
 written for some REAL type to another REAL type.
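
 As a rough illustration of both uses, the sketch below expresses a
 tolerance as a multiple of the precision of a float type instead of
 hard-coding it, so that moving a routine to another REAL type only
 means swapping the precision. The precision values are copied from
 the DEC table in this chapter. Python is used for illustration only,
 and the names PRECISION and tolerance_for, as well as the safety
 factor of 100, are hypothetical.

```python
# Precision values copied from the DEC table in this chapter.
PRECISION = {
    "F_FLOAT": 6e-8,     # REAL*4
    "D_FLOAT": 14e-18,   # REAL*8 on VAX
    "G_FLOAT": 11e-17,   # REAL*8
}

def tolerance_for(real_type, safety=100.0):
    """Return a relative tolerance: a safety factor times the type's precision."""
    return safety * PRECISION[real_type]

# A routine tuned for REAL*4 (F_FLOAT) that is moved to REAL*8 (G_FLOAT)
# keeps the same safety factor and only changes the underlying precision.
tol_single = tolerance_for("F_FLOAT")
tol_double = tolerance_for("G_FLOAT")
print(tol_single, tol_double)
```

 Asking for a tolerance below the precision of the type in use cannot
 succeed, which is why the parameter should follow the table and not
 the other way round.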

 We first take the more complicated VMS case, and then add a typical
 UNIX machine.


 VMS case
 --------

   Floating-point related compiler switches on VMS
   ===============================================

 Architecture |  Compiler switch  |  REAL*4    |  REAL*8    |  REAL*16   |
 =============|===================|============|============|============|
     VAX      | /G_FLOATING       | F_FLOATING | G_FLOATING | H_FLOATING |
 -------------|-------------------|------------|------------|------------|
     VAX      | /NOG_FLOATING     | F_FLOATING | D_FLOATING | H_FLOATING |
 -------------|-------------------|------------|------------|------------|
    ALPHA     | /G_FLOATING       | F_FLOATING | G_FLOATING | X_FLOATING |
 -------------|-------------------|------------|------------|------------|
    ALPHA     | /NOG_FLOATING     | F_FLOATING | D_FLOATING | X_FLOATING |
 -------------|-------------------|------------|------------|------------|
    ALPHA     | /FLOAT=G_FLOAT    | F_FLOATING | G_FLOATING | X_FLOATING |
 -------------|-------------------|------------|------------|------------|
    ALPHA     | /FLOAT=D_FLOAT    | F_FLOATING | D_FLOATING | X_FLOATING |
 -------------|-------------------|------------|------------|------------|
    ALPHA     | /FLOAT=IEEE_FLOAT | S_FLOATING | T_FLOATING | X_FLOATING |
 -------------|-------------------|------------|------------|------------|


   Floating-point types on DEC computers
   =====================================

   Name     | Size | Standard | VAX status   | ALPHA status  | Comments
 ===========|======|==========|==============|===============|===============
 F_FLOATING |  4   |   DEC    | #            | #             |
 -----------|------|----------|--------------|---------------|---------------
 S_Floating |  4   |  IEEE    |  ========    | #             |
 -----------|------|----------|--------------|---------------|---------------
 D_FLOATING |  8   |   DEC    | # Default    | #             | Less Precision
            |      |          |              |               | on ALPHA!
 -----------|------|----------|--------------|---------------|---------------
 G_FLOATING |  8   |   DEC    | #            | # Default     | 
 -----------|------|----------|--------------|---------------|---------------
 T_Floating |  8   |  IEEE    |  ========    | #             |
 -----------|------|----------|--------------|---------------|---------------
 H_Floating |  16  |   DEC    | # Older VAXs |  ==========   |  
            |      |          | */# Newer    |               |
 -----------|------|----------|--------------|---------------|---------------
 X_Floating |  16  |  IEEE    |  ========    | * Only in     |  
            |      |          |              |   Fortran     |
 -----------|------|----------|--------------|---------------|---------------
    #  Implemented in hardware       *  Implemented in software
    ====  Not available              Size is given in bytes


 Remark on table above
 ---------------------
    D_FLOAT calculations on ALPHA are done by converting to
    G_FLOAT, computing, and converting back to D_FLOAT; see
    remark 3 on the next table.


   Numerical properties of floating-points on DEC computers
   ========================================================

   Name  | Size | Mant | Expo |  Minimum   |  Maximum   | Precision | Roun
         |      | issa | nent |            |            | (1-) (1+) | ding
 ========|======|======|======|============|============|===========|======
 F_FLOAT |  32  |  23  |  8   |  0.29E-38  |  0.17E+39  |   6E-8    | DEC
 --------|------|------|------|------------|------------|-----------|------
 S_Float |  32  |  23  |  8   |  0.12E-37  |  0.34E+39  |  6,12E-8  | IEEE
 --------|------|------|------|------------|------------|-----------|------
 D_FLOAT |  64  |  55  |  8   |  0.29E-38  |  0.17E+39  |  14E-18   | DEC
 (ALPHA) |  **  |  52  |  8   |  0.29E-38  |  0.17E+39  |  11E-17   | DEC
 --------|------|------|------|------------|------------|-----------|------
 G_FLOAT |  64  |  52  |  11  | 0.56E-308  |  0.9E+308  |  11E-17   | DEC
 --------|------|------|------|------------|------------|-----------|------
 T_Float |  64  |  52  |  11  | 0.22E-307  | 0.18E+309  | 11,22E-17 | IEEE
 --------|------|------|------|------------|------------|-----------|------
 H_Float | 128  | 112  |  15  | 0.84E-4932 | 0.59E+4932 | 0.96E-34  | DEC
 --------|------|------|------|------------|------------|-----------|------
 X_Float | 128  | 112  |  15  | 0.34E-4931 | 0.12E+4933 |  1,2E-34  | IEEE
 --------|------|------|------|------------|------------|-----------|------

 Remarks on the table above
 --------------------------
   1) Size, mantissa and exponent are given in bits. The mantissa
      size doesn't include the hidden bit.

   2) The precision actually has two values:

        (1+) the smallest positive number satisfying: 1.0 + X .NE. 1.0
        (1-) the smallest positive number satisfying: 1.0 - X .NE. 1.0

      In the DEC float types the two values coincide and one number
      is given. In the IEEE float types they differ, and both are
      given as '(1-),(1+)', so '6,12E-8' stands for 6E-8 and 12E-8.

   3) D_FLOAT on ALPHA loses 3 mantissa bits; it has the low precision
      of G_FLOAT combined with the small range of D_FLOAT.
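
 The two precision values of remark 2 can be found by a simple
 halving probe. The sketch below is a simplified probe in the spirit
 of MACHAR, not its actual code: it tries powers of two only, and it
 assumes that the Python float is an IEEE 64-bit number (comparable
 to T_FLOAT). The name probe_precision is hypothetical. The last
 lines put remark 3 into numbers.

```python
def probe_precision(sign):
    """Halve x until 1.0 + sign*(x/2) can no longer be told apart from 1.0.

    sign = +1.0 gives the (1+) value, sign = -1.0 gives the (1-) value.
    """
    x = 1.0
    while 1.0 + sign * (x / 2.0) != 1.0:
        x /= 2.0
    return x

eps_plus = probe_precision(+1.0)    # the (1+) value
eps_minus = probe_precision(-1.0)   # the (1-) value
print(eps_plus, eps_minus)

# Remark 3: three fewer mantissa bits make the precision 2**3 = 8 times coarser.
d_float_vax = 2.0 ** -(55 + 1)      # 55 stored bits plus the hidden bit
d_float_alpha = 2.0 ** -(52 + 1)    # 52 stored bits plus the hidden bit
print(d_float_alpha / d_float_vax)
```

 On an IEEE machine the probe gives values differing by a factor of
 two, which is why the IEEE rows of the table carry two numbers.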


 X_FLOAT always underflows using denormalized numbers (also called
 graceful underflow). All other float types underflow by default
 with the assign-zero method.

 You can change the underflow behaviour of the IEEE floating-point types
 with the switch  /IEEE_MODE=DENORM_RESULTS. The underflow trapping is
 then done in software and not in the Floating Point Unit, and is slow.
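
 The difference between the two underflow methods can be seen in a
 small sketch. It assumes a Python float that is an IEEE 64-bit
 number with denormalized numbers enabled, which is the usual case.
 The function flush_to_zero is hypothetical and only imitates the
 assign-zero method; it is not what any compiler switch does
 internally.

```python
import sys

TINY = sys.float_info.min            # smallest normalized double, about 2.2E-308

def flush_to_zero(x):
    """Imitate the assign-zero method: results below the normalized range become 0."""
    return 0.0 if abs(x) < TINY else x

graceful = TINY / 4.0                # graceful underflow keeps a denormalized number
abrupt = flush_to_zero(TINY / 4.0)   # the assign-zero method loses the value

print(graceful > 0.0, abrupt == 0.0)
print(graceful * 4.0 == TINY)        # this denormalized result can still be scaled back
```

 With graceful underflow a result just below the minimum keeps part
 of its precision; with the assign-zero method a later division by
 that result fails.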


 Sun IEEE floats (SPARCsystem 600MP) 
 -----------------------------------

  REAL*4 Characteristics 
  ======================
  Representation radix      2
  Mantissa size             24  (including the hidden bit)
  Exponent size             8
 
  Rounding:  IEEE type 
  Underflow: Graceful 
 
  Numerical Precision (+)     1.19209E-07
  Numerical Precision (-)     5.96046E-08
  Minimal Usable number       1.17549E-38
  Maximal Usable number       3.40282E+38
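
  The REAL*4 numbers above follow from the mantissa and exponent
  sizes. The sketch below recomputes them and checks each one by a
  round trip through 4 bytes. It assumes IEEE single precision with
  round-to-nearest, and the helper to_real4 is hypothetical.

```python
import struct

def to_real4(x):
    """Round a Python float (IEEE double) to the nearest IEEE single."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

MANT = 24                                # mantissa size, hidden bit included
eps_plus = 2.0 ** (1 - MANT)             # spacing of singles just above 1.0
eps_minus = 2.0 ** (-MANT)               # spacing of singles just below 1.0
tiny = 2.0 ** -126                       # smallest normalized single
huge = (2.0 - eps_plus) * 2.0 ** 127     # largest finite single

print(f"Numerical Precision (+)     {eps_plus:.5E}")
print(f"Numerical Precision (-)     {eps_minus:.5E}")
print(f"Minimal Usable number       {tiny:.5E}")
print(f"Maximal Usable number       {huge:.5E}")

# Each value survives the round trip, so it is exactly representable in 4 bytes.
print(all(to_real4(v) == v for v in (eps_plus, eps_minus, tiny, huge)))
# Adding eps_minus to 1.0 is lost in single precision, subtracting it is not.
print(to_real4(1.0 + eps_minus) == 1.0, to_real4(1.0 - eps_minus) != 1.0)
```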
 
 
  REAL*8 Characteristics 
  ======================
  Representation radix      2
  Mantissa size             53  (including the hidden bit)
  Exponent size             11
 
  Rounding:  IEEE type 
  Underflow: Graceful 
 
  Numerical Precision (+)     2.2204460492503D-16
  Numerical Precision (-)     1.1102230246252D-16
  Minimal Usable number       2.2250738585072D-308
  Maximal Usable number       1.7976931348623D+308
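
  For comparison, the same REAL*8 characteristics can be read from
  any environment that uses IEEE 64-bit numbers. The sketch below
  uses Python's sys.float_info and assumes that the Python float is
  such a number stored in 64 bits; the exponent size is derived from
  that assumption and not queried directly.

```python
import sys

info = sys.float_info
stored_bits = info.mant_dig - 1          # mantissa without the hidden bit
exponent_bits = 64 - stored_bits - 1     # what remains after mantissa and sign

print("Representation radix     ", info.radix)
print("Mantissa size            ", info.mant_dig)
print("Exponent size            ", exponent_bits)
print(f"Numerical Precision (+)   {info.epsilon:.13E}")
print(f"Numerical Precision (-)   {info.epsilon / 2:.13E}")
print(f"Minimal Usable number     {info.min:.13E}")
print(f"Maximal Usable number     {info.max:.13E}")
```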
 
 
  REAL*16 Characteristics 
  =======================
  Representation radix      2
  Mantissa size             113  (including the hidden bit)
  Exponent size             15
 
  Rounding:  IEEE type 
  Underflow: Graceful 
 
  Numerical Precision (+)     1.9259299443872358530559779425849273Q-034
  Numerical Precision (-)     9.6296497219361792652798897129246366Q-035
  Minimal Usable number       3.3621031431120935062626778173217526Q-4932
  Maximal Usable number       1.1897314953572317650857593266280070Q+4932
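
  Few environments offer a 16-byte float type, but the REAL*16 values
  follow from the mantissa and exponent sizes alone. The sketch below
  recomputes them in decimal arithmetic with guard digits. The
  exponent limits -16382 and 16383 are the IEEE values for a 15-bit
  exponent and are an assumption here, as they are not printed in the
  table; the name quad is hypothetical.

```python
from decimal import Decimal, localcontext

MANT = 113          # mantissa size, hidden bit included
EXP_MIN = -16382    # smallest normalized binary exponent (15-bit field)
EXP_MAX = 16383     # largest binary exponent

with localcontext() as ctx:
    ctx.prec = 50                        # work with guard digits
    two = Decimal(2)
    eps_plus = two ** (1 - MANT)
    exact = {
        "Numerical Precision (+)": eps_plus,
        "Numerical Precision (-)": two ** (-MANT),
        "Minimal Usable number  ": two ** EXP_MIN,
        "Maximal Usable number  ": (2 - eps_plus) * two ** EXP_MAX,
    }

with localcontext() as ctx:
    ctx.prec = 35                        # digits printed in the table
    # Unary plus rounds each value to the precision of the current context.
    quad = {label: +value for label, value in exact.items()}

for label, value in quad.items():
    print(label, "   ", value)
```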
 
 
  +---------------------------------------------------------------------+
  |     SUMMARY OF FLOATING POINT TYPES                                 |
  |     ===============================                                 |
  |     REAL*4  precision                       6-12 X (10 ** -8)       |
  |     REAL*4  smallest useable number         3-12 X (10 ** -39)      |
  |     REAL*4  largest useable number          1-3  X (10 ** +38)      |
  |                                                                     |
  |     REAL*8  precision                       1-22 X (10 ** -17)      |
  |     REAL*8  smallest useable number         3    X (10 ** -39) -    |
  |                                             6    X (10 ** -309)     |
  |     REAL*8  largest useable number          1    X (10 ** +38) -    |
  |                                             1    X (10 ** +308)     |
  |                                                                     |
  |     REAL*16 precision                       1-2  X (10 ** -34)      |
  |     REAL*16 smallest useable number         9-34 X (10 ** -4933)    |
  |     REAL*16 largest useable number          6-12 X (10 ** +4931)    |
  +---------------------------------------------------------------------+

