This section describes the basic rules that govern whether or not the Fujitsu FORTRAN90/VP compiler is able to produce instructions that utilize the vector unit of the VPP300. This process is often called vectorization. It should not be confused with efforts the programmer makes to choose algorithms suitable for the VPP300, nor efforts made to cast the code in a way that allows the compiler to generate vector, rather than scalar, instructions.
This section also outlines the types of problems that may arise as a result of subtle differences between the scalar and vector execution of certain FORTRAN constructs.
A single DO loop can contain both vectorizable and non-vectorizable statements. Recall that vector instructions are single instructions that operate on groups of similar data (arrays).
The FORTRAN90/VP compiler is very sophisticated, and so it is simpler to list the things it cannot vectorize than those it can.
It cannot vectorize loops containing:
The compiler cannot vectorize the following data types:
or the following intrinsic functions:
Vector operations for single-precision (32 bits) floating-point numbers (e.g. REAL or COMPLEX) are actually performed in double precision (64 bits). There is therefore little extra cost in CPU time in using double precision. Indeed, in some cases the CPU time is reduced substantially by using double precision. However, since memory used will be doubled, single precision should be used unless the extra precision is actually required or the code runs faster in double precision.
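Since the choice between single and double precision may have to be settled by timing both, it can help to select the precision in one place. The following is a minimal sketch, not taken from the original guide: the program name and the names sp, dp, wp, work, xs and xd are hypothetical, and STORAGE_SIZE is a Fortran 2008 intrinsic, so a newer compiler than FORTRAN90/VP is assumed.

```fortran
program precision_kinds
  implicit none
  ! Kind parameters: sp gives at least 6 decimal digits, dp at least 15.
  integer, parameter :: sp = selected_real_kind(6)
  integer, parameter :: dp = selected_real_kind(15)
  ! Change wp to dp to rebuild the whole program in double precision.
  integer, parameter :: wp = sp
  integer, parameter :: n = 1000
  real(wp) :: work(n)
  real(sp) :: xs
  real(dp) :: xd

  work = 1.0_wp

  ! STORAGE_SIZE (Fortran 2008, an assumption here) reports the size in bits.
  print *, 'bits per single-precision element:', storage_size(xs)
  print *, 'bits per double-precision element:', storage_size(xd)
  print *, 'bytes used by the work array:     ', n * storage_size(work) / 8
  print *, 'sum of work:', sum(work)
end program precision_kinds
```

Switching wp between sp and dp changes the memory used by work without touching the rest of the code, which makes it straightforward to time both variants.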
In a DO loop which is vectorized, the order of execution of the statements is modified and the loop is replaced by a series of vector instructions for each statement in the loop. It is as if each statement were executed as an independent DO loop. The order in which variables are defined (called data definition) and referenced is preserved, so that the original intention of the code is maintained. For example,
DO 8 I = 1, 100
A(I) = B(I) + C(I)
E(I) = A(I) * D(I)
8 F(I) = A(I) - D(I)
becomes:
DO 8A I = 1, 100          ! Loop 1
8A A(I) = B(I) + C(I)
DO 8B I = 1, 100          ! Loop 2
8B E(I) = A(I) * D(I)
DO 8C I = 1, 100          ! Loop 3
8C F(I) = A(I) - D(I)
Since updated values of A are needed to define E and F, the arithmetic of loop 1 must be executed first. The order of execution of loops 2 and 3 is of no consequence. Note that the splitting of the loops shown above is illustrative only and does not happen in practice.
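The statement-by-statement view can be tried out with Fortran 90 whole-array assignments, which also finish one statement for all elements before starting the next. This is a minimal sketch with made-up data. The program name and the array names a1, a2 and so on are hypothetical, and the sketch says nothing about the instructions the compiler actually generates.

```fortran
program loop_splitting
  implicit none
  integer, parameter :: n = 100
  real :: b(n), c(n), d(n)
  real :: a1(n), e1(n), f1(n)   ! results from the original loop
  real :: a2(n), e2(n), f2(n)   ! results from whole-array statements
  integer :: i

  ! Arbitrary test data (illustrative values only).
  do i = 1, n
     b(i) = real(i)
     c(i) = 2.0 * real(i)
     d(i) = 0.5
  end do

  ! Original form: all three statements are executed for each I in turn.
  do i = 1, n
     a1(i) = b(i) + c(i)
     e1(i) = a1(i) * d(i)
     f1(i) = a1(i) - d(i)
  end do

  ! Vector view: each statement handles all elements before the next starts.
  ! A must be complete before E and F; the order of E and F does not matter.
  a2 = b + c
  e2 = a2 * d
  f2 = a2 - d

  if (all(a1 == a2) .and. all(e1 == e2) .and. all(f1 == f2)) then
     print *, 'loop and statement-by-statement results agree'
  else
     print *, 'results differ'
  end if
end program loop_splitting
```

The two forms agree here because no statement reads a value that a later iteration of another statement would have changed.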
If a loop contains statements that cannot be vectorized, then the compiler will construct separate scalar DO loops for those parts and vector instructions for the rest.
Data whose definition and reference order would be altered by vectorization are called recursive data and the use of such data is called a recursive reference. Executable statements containing a recursive reference are not vectorized by the compiler by default. For example, there is a recursive reference to A in the second loop below:
DO 1 J = 1, 4
A(J) = 0.0          ! loop to define initial values
1 B(J) = 10.0
A(1) = 1.0
DO 2 J = 2, 4          ! recursive loop
2 A(J) = A(J-1) * B(J)
The second loop should leave the elements of A containing (1., 10., 100., 1000.). If the compiler were foolhardy enough to execute this loop in the vector unit of the VPP300, the results would be (1., 10., 0., 0.)! In the vector unit, the values of A(1), A(2) and A(3) on the RHS are fed into the vector pipelines in their unaltered state, i.e. (1, 0, 0), and multiplied by (10, 10, 10), giving (10, 0, 0) for A(2), A(3), A(4) and hence the wrong result. To see why this would happen, it is necessary to understand what happens in the vector unit.
In general, the multiplication of two numbers can be broken up into a number of steps: fetch the operands from memory and put them in working registers, do the appropriate shifting of bits involved in the multiplication, and store the results back in memory. The second part in fact consists of a number of steps.
The concept of vector pipelining is that the second, third, etc. operands used in arithmetic in a DO loop can be fed into the start of this multistep process before the result of the first pair of operands is delivered back into memory. This is what gives the VP its speed compared with a scalar processor in which the instruction steps act serially upon each pair of operands, not starting the second pair until the first is finished. The vector pipeline can thus be considered as analogous to a car assembly line and contrasted with the building of complete cars, one at a time, in a pre-Ford 'scalar' factory.
The fact that the first result is not available when the second pair of operands enter the multiplication process is the cause of the problem with the recursive loop above. The intention of the loop is that the updated A(2) from the first pass through the loop should be available to calculate A(3). In a VP, the result for A(2) would not be available when work on evaluating A(3) commences. The compiler is cautious and refuses to generate vector instructions for this loop. There are cases when it is over-cautious. As detailed in the vector tuning section, you can force the compiler to vectorize such loops if you are confident that there is no recursive reference.
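The difference between the two execution orders for the recursive loop can be reproduced on any Fortran 90 compiler, because an array-section assignment evaluates its whole right-hand side from the old values before storing anything. This is a minimal sketch: the program name and the names a_scalar and a_vector are hypothetical, and the array assignment only imitates the data flow described above, not the actual VPP300 hardware.

```fortran
program recursive_reference
  implicit none
  real :: a_scalar(4), a_vector(4), b(4)
  integer :: j

  ! Same initial values as the example in the text.
  b = 10.0
  a_scalar = 0.0
  a_scalar(1) = 1.0
  a_vector = a_scalar

  ! Scalar order: each pass uses the A(J-1) updated by the previous pass.
  do j = 2, 4
     a_scalar(j) = a_scalar(j-1) * b(j)
  end do

  ! Vector-like order: A(1), A(2), A(3) are all read in their unaltered
  ! state before any of A(2), A(3), A(4) is stored.
  a_vector(2:4) = a_vector(1:3) * b(2:4)

  print *, 'scalar loop:      ', a_scalar   ! 1, 10, 100, 1000
  print *, 'array assignment: ', a_vector   ! 1, 10, 0, 0
end program recursive_reference
```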
Errors Caused by the Vectorization of 'Innocent' Scalar Code
Because of the way DO loops are vectorized, there are a number of cases where apparently normal scalar code causes problems. Most often, such problems arise from 'tricky' code which conforms neither to the spirit nor the letter of FORTRAN, but has escaped with warning messages at most from other compilers. Some are briefly listed here:
1. The following causes a storage protection error if N > 10 because elements of A beyond A(10) are referenced by the vectorization of the entire statement marked **. On a scalar machine, the assignment in this statement would never be executed for I > 10.
SUBROUTINE MYSUB (B,N)
DIMENSION A(10), B(N)
DO 10 I = 1, N
IF (I.LE.10) B(I) = A(I) **
IF (I.GT.10) B(I) = 0.0
10 CONTINUE
2. Loops with an IF statement that causes a jump out of the loop can lead to problems if in the scalar code the jump implicitly avoids an error condition. For example:
DIMENSION X(100)
. . .
DO 10 I = 1, N
IF (AA.GT.SQRT(X(I))) GO TO 20
. . .
10 CONTINUE
20 CONTINUE
If, in a scalar execution of the above, the values of AA and X(I) were such that the jump to 20 occurs when I=50 and, say, X(80) is negative, then the code would run normally. On the VP, all the SQRTs would be evaluated 'simultaneously', leading to an error. Note that no error would occur if there were an explicit test for a negative X(I), even though the vector SQRT function would still be used.
3. Compound logical expressions are evaluated serially on a scalar machine and so the second part is not evaluated if the first is false. On the VP, the simple expressions may be evaluated independently before a final true/false decision is made. Thus
DO 10 I=1,N
10 IF ((IB(I).NE.0).AND.(MOD(IA(I),IB(I)).EQ.0))
* X(I) = 0.
will cause a divide by zero error if any IB(I)=0. (Recall that MOD(K,J) involves a division by J.) Note that this may happen on the VP even if the loop is not vectorized.
4. Double precision arrays must be on correct boundaries for the vector unit to give correct results (that is, they must have addresses that are multiples of 8). Arrays known to the compiler to be not on correct boundaries will not be vectorized, so there is no problem in that case. However, the compiler cannot recognize such cases in dummy arguments in subroutines. They will be vectorized and cause an error with code jwe0019i. Disagreements between argument types or EQUIVALENCE statements may cause this to occur. This error may also occur if a reference is made to an element of a multiply-dimensioned array that is within the bounds of the entire address range of the array but beyond the dimensions of an index. For example:
REAL X(100), Y(5,20), Z(5,20)
READ (5,*) N
DO 1 J = 1, N
X(J) = Y(J,1) + Z(J,1)
1 CONTINUE
will fail for N larger than 32 (the minimum size of a vector register on the VPP300) even though the loop is not referencing elements outside the storage occupied by any array. The SUBCHK debugging options would pick up any such references. This type of code may also simply give wrong results rather than an abend. For example:
REAL X(3,4)
DO 10 I=1, 6          ! Note I>3 as dimensioned
10 X(I,2) = X(I,1)
5. Floating-point vector arithmetic must be performed on normalized floating-point numbers. Use of EQUIVALENCE or COMMON that leads to non-normalized numbers will cause an error with system code jwe0019i.
6. The order of operations within a vectorized statement may differ from that generated by a scalar compiler. This may lead to underflows or overflows which were avoided on the scalar machine by cancellation of exponents. Brackets can be used to explicitly group terms to prevent this.
7. It is possible to get disagreement with scalar results because the rounding errors that arise as a result of vector arithmetic may differ from those arising from scalar arithmetic. Recall that all vector partial results are in double precision, and that inner products are evaluated in quite a different order in the scalar and vector modes. See the Academic Consultants for more details.
8. There are a number of more obscure errors that may arise in vectorizing apparently innocuous scalar code. Most involve code that is syntactically correct but of dubious logical validity, e.g.
DIMENSION A(10)
CALL MYSUB (A,A)
SUBROUTINE MYSUB (B,A)
DO 1 I = 1, 10
A(I) = . . .
B(I) = . . .
1 CONTINUE
The results of the above are not guaranteed by standard FORTRAN scalar compilers.
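For the compound logical expression in item 3, one way to avoid relying on evaluation order is to nest the tests, so that MOD is only reached when the divisor is known to be nonzero. Standard Fortran does not guarantee that the second operand of .AND. is skipped when the first is false, so the nested form is the portable one. This is a minimal sketch with made-up data: the program name and values are hypothetical, and whether FORTRAN90/VP vectorizes the nested form is not stated in this guide.

```fortran
program guarded_mod
  implicit none
  integer, parameter :: n = 6
  integer :: ia(n), ib(n), i
  real :: x(n)

  ! Illustrative data; the zeros in IB would cause a division by zero
  ! if MOD were evaluated for those elements.
  ia = (/ 4, 5, 6, 7, 8, 9 /)
  ib = (/ 2, 0, 3, 0, 3, 3 /)
  x = 1.0

  do i = 1, n
     ! The outer test ensures IB(I) is nonzero before MOD is evaluated.
     if (ib(i) /= 0) then
        if (mod(ia(i), ib(i)) == 0) x(i) = 0.0
     end if
  end do

  print *, x   ! 0, 1, 0, 1, 1, 0
end program guarded_mod
```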