Comparison of Two-Dimensional FFT Methods on the Hypercube : The Choice Between Strips and Patches
Complex two-dimensional FFT's up to size 256 X 256 points were implemented on the Intel iPSC/System 286 hypercube with emphasis on comparing the effects of data mapping, data transposition, and distributed FFT's. Comparison is based on partitioning the data into either strips or patches and three distinct methods are discussed. Of the two strips methods, the Transpose-Split implementation involves local independent FFT computations in both directions with an intervening transpose, while the Local-Distributed method performs local FFT's in one direction followed by distributed FFT's in the other. The patch or Block method partitions the matrix in submatrices and maps the ($i,j$)th block into node $i \hat j$ ($\hat$ denoting concatenation of binary numbers). Both the row-wise FFT's and the column-wise FFT's require inter-node communication for the distributed butterfly computations. Timing results show that on the Intel iPSC/System 286, there is hardly a difference between the three methods. This is due primarily to the inadequacies of Intel communication handling and its inability to overlap communication and computation as well as a lack of vector boards. A model follows where several important factors such as vectorization and communication complexity are considered. While timing results show that on a vectorized hypercube with ideal synchronization of communication and computation the Block method is asymptomatically the fastest. The conclusion is that this problem is highly system dependent and therefore no sweeping recommendations can be made.
computer science; technical report
Previously Published As