IOOpt- Automatic Derivation of I/O complexity bounds for affine programs
Evaluating the complexity of an algorithm is an important step when developing applications as it highly impact both its time and energy performance. The arithmetic complexity which is the number of operations, regardless of the execution order, is, for neural networks (or more generally affine programs), easy to characterize. The data-movement (or, I/O) complexity characterization is way more complex as it refers to the minimum required number of I/O between a slow (e.g. main memory) and a fast (e.g. local scratchpad) storage location when considering all possible valid schedules.
This paper presents \IOOpt, a fully automated tool that automatically bounds the data movement of an affine (tilable) program. Given a tilable program described in a DSL, it automatically computes: 1.~a lower bound of the I/O complexity as a symbolic expression of the cache size and program parameters; 2.~an upper bound that allows to assess the tightness of the lower bound; 3.~a tiling recommendation (loop permutation and tile’s sizes) that matches the upper bound. While the associated algorithms for computing lower bound (and the associated implementations) that are described in this paper can be applied to any affine programs, a substantial effort has been made to provide bounds that are as tight as possible for neural networks. In particular, it extends the previous work of Olivry et al. so as to handle multi-dimensional reductions and expose the constraints associated to small dimensions that are both present in convolutions.
As for the upper bound algorithm, if the input can be a general tilable affine program (e.g. output of a polyhedral compiler such as PluTo), the involved algebraic computations are tuned to behave well on tensor computations such as direct tensor contractions or direct convolutions. As a bonus, the upper bound algorithm that has been extended to multi-level cache can provide a useful tiling recommendation to the programmer.
We demonstrate the effectiveness of our tool by deriving the symbolic lower and upper bounds for several tensor contraction and convolution kernels. Then, we evaluate numerically the tightness of our bound using Yolo9000 convolution layer sizes and representative tensor contractions from the TCCG benchmark suite. Finally, we show the pertinence of our I/O complexity model by reporting the run-time of the recommended tiled code for the convolution layers from Yolo9000.
Thu 21 OctDisplayed time zone: Central Time (US & Canada) change
15:40 - 17:00
|Automatically Enforcing Fresh and Consistent Inputs in Intermittent Systems|
|IOOpt- Automatic Derivation of I/O complexity bounds for affine programs|
|Integration Verification Across Software and Hardware for a Simple Embedded System|
|Execution reconstruction: Harnessing failure reoccurrences for failure reproduction|
|Discussion, Questions and Answers|