
The Vectorizing Compiler

This section describes the basic rules that govern whether or not the Fujitsu FORTRAN90/VP compiler is able to produce instructions that utilize the vector unit of the VPP300. This process is often called vectorization. It should not be confused with efforts the programmer makes to choose algorithms suitable for the VPP300, nor efforts made to cast the code in a way that allows the compiler to generate vector, rather than scalar, instructions.

This section also outlines the types of problems that may arise as a result of subtle differences between the scalar and vector execution of certain FORTRAN constructs.



What Can be Vectorized by the Compiler?

The basic unit of code that can be vectorized is the DO loop. Only innermost loops can be vectorized, and there must not be any passing of control outside the loop. DO WHILE and DO UNTIL loops cannot be vectorized.

A single DO loop can contain both vectorizable and non-vectorizable statements. Recall that vector instructions are single instructions that operate on groups of similar data (arrays).

The FORTRAN90/VP compiler is very sophisticated, and so it is simpler to list the things it cannot vectorize than those it can.

It cannot vectorize loops containing:

The compiler cannot vectorize the following data types:

or the following intrinsic functions:

Vector operations for single-precision (32 bits) floating-point numbers (e.g. REAL or COMPLEX) are actually performed in double precision (64 bits). There is therefore little extra cost in CPU time in using double precision. Indeed, in some cases the CPU time is reduced substantially by using double precision. However, since memory used will be doubled, single precision should be used unless the extra precision is actually required or the code runs faster in double precision.



Order of Execution of Statements in Vectorized DO Loops

In a DO loop that is vectorized, the order of execution of the statements is modified: the loop is replaced by a series of vector instructions, one group for each statement in the loop. It is as if each statement were executed as an independent DO loop. The order in which variables are defined (called data definition) and referenced is preserved, so that the original intention of the code is retained. For example,

                DO 8 I = 1, 100
                A(I) = B(I) + C(I)
                E(I) = A(I) * D(I)
        8       F(I) = A(I) - D(I)

becomes:

                DO 8A I = 1, 100           Loop 1
        8A      A(I) = B(I) + C(I)

                DO 8B I = 1, 100           Loop 2
        8B      E(I) = A(I) * D(I)

                DO 8C I = 1, 100           Loop 3
        8C      F(I) = A(I) - D(I)

Since updated values of A are needed to define E and F, the arithmetic of loop 1 must be executed first. The order of execution of loops 2 and 3 is of no consequence. Note that the splitting of the loops shown above is illustrative only and does not happen in practice.
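The statement-at-a-time ordering can be mimicked with Fortran 90 array assignments, each of which behaves like one of the three illustrative loops. The following is a minimal sketch, not compiler output. It is written in free-form Fortran 90 rather than the fixed-form layout used elsewhere in this section, and the program name, the test data and the final comparison are assumptions made for illustration.

```fortran
program order_demo
  implicit none
  integer, parameter :: n = 100
  real :: a(n), b(n), c(n), d(n), e(n), f(n)
  real :: a2(n), e2(n), f2(n)
  integer :: i

  ! Arbitrary test data (assumed for illustration)
  do i = 1, n
     b(i) = real(i)
     c(i) = 2.0 * real(i)
     d(i) = 0.5
  end do

  ! Original form: one loop, three statements per iteration
  do i = 1, n
     a(i) = b(i) + c(i)
     e(i) = a(i) * d(i)
     f(i) = a(i) - d(i)
  end do

  ! Statement-at-a-time form: A is complete before E and F are formed
  a2 = b + c
  e2 = a2 * d
  f2 = a2 - d

  ! Compare the two orderings element by element
  if (all(a == a2) .and. all(e == e2) .and. all(f == f2)) then
     print *, 'both orderings agree'
  else
     print *, 'orderings differ'
  end if
end program order_demo
```

The two orderings agree here because every reference to A(I) follows its definition in both forms, which is the condition the compiler checks before reordering.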

If a loop contains statements that cannot be vectorized, then the compiler will construct separate scalar DO loops for those parts and vector instructions for the rest.



Recursive References

Data whose definition and reference order would be altered by vectorization are called recursive data and the use of such data is called a recursive reference. Executable statements containing a recursive reference are not vectorized by the compiler by default. For example, there is a recursive reference to A in the second loop below:

                DO 1 J = 1, 4
                    A(J) = 0.0        loop to define initial values
        1           B(J) = 10.0

                A(1) = 1.0
                DO 2 J = 2, 4         recursive loop
        2           A(J) = A(J-1) * B(J)

The second loop should result in the elements of A containing (1.,10.,100.,1000.). If the compiler were foolhardy enough to execute this loop in the vector unit of the VPP300, the results would be (1.,10.,0.,0.)! In the vector unit the values of A(1), A(2) and A(3) on the RHS are fed into the vector pipelines in their unaltered state (i.e. (1,0,0)) and multiplied by (10,10,10), giving (10,0,0) for A(2), A(3), A(4) and hence the wrong result. To see why this would happen, it is necessary to understand what happens in the vector unit.

In general, the multiplication of two numbers can be broken up into a number of steps: fetch the operands from memory and put them in working registers, do the appropriate shifting of bits involved in the multiplication, and store the results back in memory. The second part in fact consists of a number of steps.

The concept of vector pipelining is that the second, third, etc. operands used in arithmetic in a DO loop can be fed into the start of this multistep process before the result of the first pair of operands is delivered back into memory. This is what gives the VP its speed compared with a scalar processor in which the instruction steps act serially upon each pair of operands, not starting the second pair until the first is finished. The vector pipeline can thus be considered as analogous to a car assembly line and contrasted with the building of complete cars, one at a time, in a pre-Ford 'scalar' factory.

The fact that the first result is not available when the second pair of operands enter the multiplication process is the cause of the problem with the recursive loop above. The intention of the loop is that the updated A(2) from the first pass through the loop should be available to calculate A(3). In a VP, the result for A(2) would not be available when work on evaluating A(3) commences. The compiler is cautious and refuses to generate vector instructions for this loop. There are cases when it is over-cautious. As detailed in the vector tuning section, you can force the compiler to vectorize such loops if you are confident that there is no recursive reference.
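The difference between the two behaviours can be reproduced on any machine with a Fortran 90 array assignment, which evaluates its whole right-hand side from the old values before storing anything. The sketch below is an illustration of the semantics only, not of what the FORTRAN90/VP compiler generates; the program name and the second array V are assumptions, and the code is in free-form Fortran 90.

```fortran
program recursion_demo
  implicit none
  real :: a(4), b(4), v(4)
  integer :: j

  b = 10.0

  ! Scalar semantics: each iteration sees the A(J-1) just computed
  a = 0.0
  a(1) = 1.0
  do j = 2, 4
     a(j) = a(j-1) * b(j)
  end do

  ! Vector-like semantics: the right-hand side is taken from the
  ! unaltered values (1,0,0) before any element is stored
  v = 0.0
  v(1) = 1.0
  v(2:4) = v(1:3) * b(2:4)

  print *, 'scalar loop      :', a
  print *, 'array assignment :', v
end program recursion_demo
```

The scalar loop yields 1, 10, 100, 1000, while the array assignment yields 1, 10, 0, 0, matching the two results quoted for the recursive loop above.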

Errors Caused by the Vectorization of 'Innocent' Scalar Code

Because of the way DO loops are vectorized, there are a number of cases where apparently normal scalar code causes problems. Most often, such problems arise from 'tricky' code which conforms neither to the spirit nor the letter of FORTRAN, but has escaped with, at most, warning messages from other compilers. Some are briefly listed here:

1. The following causes a storage protection error if N > 10 because elements of A beyond A(10) are referenced by the vectorization of the entire statement marked **. On a scalar machine, the assignment in this statement would never be executed.

                SUBROUTINE MYSUB (B,N)
                DIMENSION A(10), B(N)

                DO 10 I = 1, N
                        IF (I.LE.10) B(I) = A(I)       **
                        IF (I.GT.10) B(I) = 0.0
        10        CONTINUE
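For case 1, one way to keep every reference to A in bounds, however the loop is executed, is to move the condition into the loop limits. This is a sketch under stated assumptions: the names BOUNDS_DEMO and MYSUB_SAFE are hypothetical, A is given placeholder values because the original leaves it undefined, and the code is in free-form Fortran 90.

```fortran
program bounds_demo
  implicit none
  real :: b(15)

  call mysub_safe(b, 15)
  print *, b

contains

  subroutine mysub_safe(b, n)
    integer, intent(in) :: n
    real, intent(out) :: b(n)
    real :: a(10)
    integer :: i

    a = 1.0   ! placeholder values (assumption)

    ! A is referenced only for I = 1..MIN(N,10), so no element
    ! beyond A(10) can be touched
    do i = 1, min(n, 10)
       b(i) = a(i)
    end do

    ! Remaining elements; this loop is zero-trip when N <= 10
    do i = 11, n
       b(i) = 0.0
    end do
  end subroutine mysub_safe

end program bounds_demo
```

Neither loop contains an IF, so each is a simple candidate for vectorization.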

2. Loops with an IF statement that causes a jump out of the loop can lead to problems if in the scalar code the jump implicitly avoids an error condition. For example:

                DIMENSION X(100)
                . . .
                DO 10  I = 1, N
                        IF (AA.GT.SQRT(X(I))) GO TO 20
                . . .
        10      CONTINUE
        20      CONTINUE

If, in a scalar execution of the above, the values of AA and X(I) were such that the jump to 20 occurs when I=50, and say, X(80) is negative, then the code would run normally. On the VP, all the SQRTs would be evaluated 'simultaneously' leading to an error. Note that no error would occur if there were an explicit test for a negative X(I) even though the vector SQRT function would still be used.
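The explicit test mentioned for case 2 might look like the sketch below. The data values, the variable IEXIT and the choice to skip negative elements are all assumptions for illustration; what should happen to a negative X(I) depends on the application. The code is free-form Fortran 90 and uses EXIT in place of the GO TO.

```fortran
program sqrt_guard_demo
  implicit none
  integer, parameter :: n = 100
  real :: x(n), aa
  integer :: i, iexit

  x = 4.0
  x(50) = 0.25     ! chosen so that the exit test succeeds at I = 50
  x(80) = -1.0     ! an invalid SQRT argument beyond the exit point
  aa = 1.0
  iexit = 0

  do i = 1, n
     ! Explicit guard: SQRT is only requested for non-negative X(I)
     if (x(i) >= 0.0) then
        if (aa > sqrt(x(i))) then
           iexit = i
           exit
        end if
     end if
  end do

  print *, 'loop left at I =', iexit
end program sqrt_guard_demo
```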

3. Compound logical expressions are evaluated serially on a scalar machine and so the second part is not evaluated if the first is false. On the VP, the simple expressions may be evaluated independently before a final true/false decision is made. Thus

                DO 10 I=1,N
        10         IF ((IB(I).NE.0).AND.(MOD(IA(I),IB(I)).EQ.0))
               *        X(I) = 0.

will cause a divide-by-zero error if any IB(I)=0. (Recall MOD(K,J) involves a division by J.) Note that this may happen on the VP even if the loop is not vectorized.
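Standard Fortran does not promise that the second operand of .AND. is skipped when the first is false, so code that relies on it is fragile on any compiler. Nesting the tests states the dependence explicitly, since the inner condition is evaluated only when the outer block is entered. The sketch below uses assumed data and a hypothetical program name, and is written in free-form Fortran 90; how a particular vectorizing compiler treats the guarded MOD is not something this example can demonstrate.

```fortran
program mod_guard_demo
  implicit none
  integer, parameter :: n = 4
  integer :: ia(n), ib(n), i
  real :: x(n)

  ! Illustrative data (assumed); IB(2) is zero
  ia = (/ 6, 7, 8, 9 /)
  ib = (/ 3, 0, 4, 2 /)
  x = 1.0

  do i = 1, n
     ! MOD is reached only when the divisor is known to be non-zero
     if (ib(i) /= 0) then
        if (mod(ia(i), ib(i)) == 0) x(i) = 0.0
     end if
  end do

  print *, x
end program mod_guard_demo
```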

4. Double precision arrays must be on correct boundaries for the vector unit to give correct results (that is, they must have addresses that are multiples of 8). Arrays known to the compiler to be not on correct boundaries will not be vectorized, so there is no problem in that case. However, the compiler cannot recognize such cases in dummy arguments in subroutines. They will be vectorized and cause an error with code jwe0019i. Disagreements between argument types or EQUIVALENCE statements may cause this to occur. This error may also occur if a reference is made to an element of a multiply-dimensioned array that is within the bounds of the entire address range of the array but beyond the dimensions of an index. For example:

                REAL X(100), Y(5,20), Z(5,20)
                READ (5,*) N
                DO 1 J = 1, N
        1               X(J) = Y(J,1) + Z(J,1)

will fail for N larger than 32 (the minimum size of a vector register on the VPP300) even though the loop is not referencing elements outside the bounds of any array. The SUBCHK debugging options would pick up any such references. Code of this type may also simply give wrong results rather than an abend. For example:

                REAL X(3,4)
                DO 10 I=1, 6                Note I>3 as dimensioned
        10                X(I,2) = X(I,1)

5. Floating-point vector arithmetic must be performed on normalized floating-point numbers. Use of EQUIVALENCE or COMMON that leads to non-normalized numbers will cause an error with system code jwe0019i.

6. The order of operations within a vectorized statement may differ from that generated by a scalar compiler. This may lead to underflows or overflows which were avoided on the scalar machine by cancellation of exponents. Brackets can be used to explicitly group terms to prevent this.
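As an illustration of case 6, the sketch below groups the same three factors in two ways. The values are assumptions chosen to sit near the top of the single-precision range, the program and variable names are hypothetical, and the code is free-form Fortran 90. What the second expression produces depends on the machine's arithmetic, so no particular output is claimed for it.

```fortran
program grouping_demo
  implicit none
  real :: a, b, c, r_grouped, r_risky

  ! Illustrative values near the top of the single-precision range (assumed)
  a = 1.0e30
  b = 1.0e30
  c = 1.0e30

  ! B/C is formed first (about 1.0), so no intermediate result
  ! leaves the range of REAL
  r_grouped = a * (b / c)

  ! A*B is about 1.0E60, beyond the single-precision range, so this
  ! intermediate may overflow before the division is applied
  r_risky = (a * b) / c

  print *, 'a * (b / c) =', r_grouped
  print *, '(a * b) / c =', r_risky
end program grouping_demo
```

Because a Fortran compiler must respect parentheses, the first form fixes the grouping regardless of whether the statement is executed in scalar or vector mode.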

7. It is possible to get disagreement with scalar results because the rounding errors that arise as a result of vector arithmetic may differ from those arising from scalar arithmetic. Recall that all vector partial results are in double-precision, and that inner products are evaluated in quite a different order in the scalar and vector modes. See the Academic Consultants for more details.
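The effect of summation order in case 7 can be seen without a vector machine. The sketch below accumulates the same inner product serially and then as interleaved partial sums that are combined at the end. The number of partial sums (NLANES = 8), the data and all names are assumptions for illustration only; the sketch does not model the double-precision partial results mentioned above, nor the actual order used by the VPP300. The code is free-form Fortran 90.

```fortran
program sum_order_demo
  implicit none
  integer, parameter :: n = 100000
  integer, parameter :: nlanes = 8      ! hypothetical number of partial sums
  real :: x(n), y(n), partial(nlanes)
  real :: s_serial, s_lanes
  integer :: i, k

  ! Illustrative data (assumed)
  do i = 1, n
     x(i) = 1.0 / real(i)
     y(i) = 1.0
  end do

  ! Serial accumulation, in the order a scalar loop would use
  s_serial = 0.0
  do i = 1, n
     s_serial = s_serial + x(i) * y(i)
  end do

  ! Interleaved partial sums, combined only at the end
  partial = 0.0
  do i = 1, n
     k = mod(i - 1, nlanes) + 1
     partial(k) = partial(k) + x(i) * y(i)
  end do
  s_lanes = sum(partial)

  print *, 'serial      :', s_serial
  print *, 'interleaved :', s_lanes
  print *, 'difference  :', s_serial - s_lanes
end program sum_order_demo
```

Any difference printed comes purely from rounding in a different order; neither value is more 'correct' than the other.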

8. There are a number of more obscure errors that may arise in vectorizing apparently innocuous scalar code. Most involve code that is syntactically correct but of dubious logical validity, e.g.

                DIMENSION A(10)
                CALL MYSUB (A,A)

                SUBROUTINE MYSUB (B,A)
                DO 1 I = 1, 10
                        A(I) = . . .
                        B(I) = . . .

The results of the above are not guaranteed by standard FORTRAN scalar compilers.


ANU Supercomputer Facility, The Australian National University