

# Modeling the Interplay between Loop Tiling and Fusion in Optimizing Compilers Using Affine Relations

JIE ZHAO, College of Computer Science and Electronic Engineering, Hunan University, China JINCHEN XU, Information Engineering University, China PENG DI, WANG NIE, JIAHUI HU, YANZHI YI, SIJIA YANG, ZHEN GENG, RENWEI ZHANG, BOJIE LI, ZHILIANG GAN, and XUEFENG JIN, Huawei Technologies Co. Ltd., China

Loop tiling and fusion are two essential transformations in optimizing compilers to enhance the data locality of programs. Existing heuristics either perform loop tiling and fusion in a particular order, missing some of their profitable compositions, or execute ad-hoc implementations for domain-specific applications, calling for a generalized and systematic solution in optimizing compilers.

In this article, we present a so-called *basteln* (an abbreviation for backward slicing of tiled loop nests) *strategy* in polyhedral compilation to better model the interplay between loop tiling and fusion. The basteln strategy first groups loop nests by preserving their parallelism/tilability and next performs rectangular/parallelogram tiling to the output groups that produce data consumed outside the considered program fragment. The memory footprints required by each tile are then computed, from which the upward exposed data are extracted to determine the tile shapes of the remaining fusion groups. Such a tiling mechanism can construct complex tile shapes imposed by the dependences between these groups, which are further merged by a post-tiling fusion algorithm for enhancing data locality without losing the parallelism/tilability of the output groups. The basteln strategy also takes into account the amount of redundant computations and the fusion of independent groups, exhibiting a general applicability.

We integrate the basteln strategy into two optimizing compilers, with one a general-purpose optimizer and the other a domain-specific compiler for deploying deep learning models. The experiments are conducted on CPU, GPU, and a deep learning accelerator to demonstrate the effectiveness of the approach for a wide class of application domains, including deep learning, image processing, sparse matrix computation, and linear algebra. In particular, the basteln strategy achieves a mean speedup of 1.8× over cuBLAS/cuDNN and 1.1× over TVM on GPU when used to optimize deep learning models; it also outperforms PPCG and TVM by 11% and 20%, respectively, when generating code for the deep learning accelerator.

This work was done when J. Zhao was with Information Engineering University, Zhengzhou, and Renmin University of China, Beijing. P. Di is now a senior staff engineer of software security at Ant Group, Hangzhou, China. Z. Geng is now a senior staff engineer of parallel computing software at Alibaba, Hangzhou, China.

The work of J. Zhao and J. Xu was partly supported by the National Natural Science Foundation of China under grant U20A20226. The views and conclusions in this article do not represent the official policies of the Chinese Government but only belong to the authors.

Authors' addresses: J. Zhao, College of Computer Science and Electronic Engineering, Hunan University, Changsha, 410082, China; e-mail: jiezhao@hnu.edu.cn; J. Xu, Information Engineering University, Zhengzhou, 450001, China; e-mail: atao728208@126.com; P. Di, W. Nie, J. Hu, Y. Yi, S. Yang, Z. Geng, R. Zhang, B. Li, Z. Gan, and X. Jin, Huawei Technologies Co. Ltd., Beijing, 100085, China; e-mails: dipeng1982@gmail.com, {peter.nie, hujiahui8, yiyanzhi, yangsijia2}@huawei.com, dylangeng@163.com, {zhangrenwei1, bojie.li, ganzhiliang, jinxuefeng}@huawei.com.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

© 2024 Copyright held by the owner/author(s). Publication rights licensed to ACM.

0734-2071/2024/01-ART5 \$15.00

https://doi.org/10.1145/3635305

5:2 J. Zhao et al.

CCS Concepts: • Software and its engineering → Translator writing systems and compiler generators; • Computer systems organization → Heterogeneous (hybrid) systems; Neural networks;

Additional Key Words and Phrases: Tiling, fusion, data locality, parallelism, redundant computation, memory hierarchy, polyhedral model, optimizing compilers

#### **ACM Reference format:**

Jie Zhao, Jinchen Xu, Peng Di, Wang Nie, Jiahui Hu, Yanzhi Yi, Sijia Yang, Zhen Geng, Renwei Zhang, Bojie Li, Zhiliang Gan, and Xuefeng Jin. 2024. Modeling the Interplay between Loop Tiling and Fusion in Optimizing Compilers Using Affine Relations. *ACM Trans. Comput. Syst.* 41, 1-4, Article 5 (January 2024), 45 pages. https://doi.org/10.1145/3635305

#### 1 INTRODUCTION

Optimizing compilers are evolving into essential software of computer systems. They deploy a program of loop nests to underlying hardware with increasingly complex memory hierarchy by composing various loop transformations, among which loop tiling and fusion have a great impact on others [51]. Loop tiling [33] is a transformation that groups iterations of loop nests into smaller blocks, maximizing data reuse along multiple loop dimensions when the block fits in registers or caches; loop fusion [37] is a technique that intertwines two or more loop nests without violating their producer-consumer relations, allowing more values to be allocated in faster memories and thereby enabling storage reduction.

Image processing and deep learning are two representative application domains where a program is written in the form of a loop nest pipeline. Thanks to their well-defined **Domain-Specific Languages (DSLs)** that allow users to first tile an output loop nest (using the *tile* primitive) and next fuse some input loop nests at a particular tiled loop level (using the *compute\_at* primitive) for these application domains, Halide [54] and TVM [14] can compose loop tiling and fusion that make full use of the possibly overlapped data reuse featured by the complex computation patterns like convolution and heterogeneous stencil pipelines in such programs, thereby achieving promising performance on various hardware platforms with different memory hierarchies. However, as these domain-specific frameworks are tailored to their target programs and only consider a limited set of computation patterns, their general applicability is also restricted. In the meantime, the implementations of these domain-specific frameworks still require the users to check the validity of the applied transformations.

Unlike domain-specific frameworks, optimizing compilers [12, 13, 24, 62] based on the polyhedral model [21]—a mathematical abstraction widely acknowledged for its ability to compose affine loop transformations—automatically implement loop tiling and fusion for more general programs of loop nests by expressing such transformations as affine functions. Such polyhedral compilation tools first perform loop fusion and next carry out loop tiling, defining a particular order on these two transformations. This specific ordering reconciles parallelism/tilability and locality by switching between a conservative fusion strategy (maximizing parallelism/tiling possibilities by transferring data through lower-level caches or off-chip communications more frequently) and an aggressive fusion strategy (mitigating data movements between hierarchical memories at the expense of losing tilability/parallelism). Expressing loop fusion and tiling as affine functions, however, cannot model the emerging tradeoff between parallelism/tilability, locality, and recomputation required by the computation patterns of convolution in deep learning and heterogeneous stencil pipelines in image processing.

Fortunately, recent efforts from the polyhedral community have already started to address this issue. For instance, the polyhedral compiler of Halide—PolyMage [44]—can automatically generate

overlapped tile shapes for heterogeneous stencils in image processing pipelines. The affine fusion pass of the MLIR infrastructure [40] also automates loop fusion that may introduce recomputation by implementing a backward slicing strategy [53], which our earlier polyhedral approach [65] also leverages to perform overlapped tiling [39] for deep learning applications. Unlike prior polyhedral work [13, 24, 62] that abstracts the schedules of a program as affine functions, these methods express the schedules of image processing or deep learning applications using affine relations [36], making it possible to produce complex program transformations like overlapped tiling. As defined in the work of Karr [36], an affine relation refers to a relation that maps one or multiple program variables to an image using their linear combinations plus a constant, and a relation is more general than a function in that it allows an element in its domain to be mapped to multiple images in its range.

Considering that the aforementioned progress still targets domain-specific applications, in this article we investigate the interplay between loop tiling and fusion, and address the full generality of this issue in optimizing compilers. In particular, we express loop tiling and fusion as affine relations and propose a so-called *basteln strategy* to better model their interplay in optimizing compilers. The word *basteln* is an abbreviation for *backward slicing* of *tiled loop nests*, meaning to perform handicrafts in German. The basteln strategy in polyhedral compilation can thus be inspected as implementing a particular skill that was originally performed by hand using a polyhedral approach. Specifically, the basteln strategy is implemented by Algorithm 4 (the compose engine for loop tiling and fusion), which takes as input the result of Algorithm 1 (the pre-tiling fusion algorithm) and recursively invokes Algorithm 2 (the tiling inference algorithm) and Algorithm 3 (the post-tiling fusion algorithm). Algorithm 1 implements a conservative fusion by detecting simple dependence patterns between producers and consumers, with each resulting fusion group associated with an iteration space.

Next, Algorithm 2 performs tiling to each output iteration space that produces live-out data and continues by computing the memory footprints required by each tile of the output iteration spaces. The upward exposed data (i.e., those data used within tiled output iteration spaces but defined by others) are extracted from the memory footprints. The tile shapes of those iteration spaces that produce intermediate values are then determined by combining such upward exposed data with the access relations, allowing for the construction of tile shapes imposed by the dependences. The output of Algorithm 2 is delivered to Algorithm 3, which executes post-tiling fusion without changing the parallelism/tilability of output iteration spaces. The combination of Algorithm 2 and Algorithm 3 is managed by Algorithm 4, with the cases of multiple output iteration spaces appropriately handled. Finally, the basteln strategy is complemented by Algorithm 5 to perform operator inlining when targeting GPU.

A partial version of this work was presented at MICRO-53 [65], with the preliminary algorithms implemented in PPCG [62] (a general-purpose optimizer) and AKG, a domain-specific compiler for deep learning applications developed by the authors of this work and presented at PLDI 2021 [68]. The implementation in PPCG was used to generate OpenMP/CUDA code for CPU/GPU, whereas AKG targets code generation for Huawei Ascend 910 [42], a dedicated accelerator for deep learning. AKG allows its users to rewrite a sub-graph (also known as a fused operator) produced by the graph engine [66] using a tensor expression language. The graph engine in turn can take as input a model expressed using TensorFlow [1], PyTorch [49], or MindSpore [31], and represent a neural network using a computational graph, which is then partitioned into individual sub-graphs. AKG executes automatic kernel generation for each of these sub-graphs.

While AKG [68] only supports code generation for a dedicated deep learning accelerator, the preliminary version [65] of the basteln strategy also did not compare the performance of its generated CUDA code with cuBLAS [46]/cuDNN [17] or the recent domain-specific compilation framework,

5:4 J. Zhao et al.

TVM [14]. Considering the great success of deep learning and the significant computing power of GPU, we devoted a great amount of time on developing a CUDA backend for AKG in the past 2 years, which exposes some issues of the initial idea and motivates us to extend the earlier publications [65, 68] from many angles, including simplifying the algorithmic flow, generalizing the approach for more scenarios, software engineering on the GPU implementation, and conducting more extensive experiments. Specifically, this work makes the following contributions beyond those presented in the earlier publications [65, 68]:

- We design Algorithm 1 to merge loop nests with simple dependence patterns before tiling output iteration spaces, rendering the approach independent of other fusion heuristics. The initial idea [65] leverages the fusion heuristic of *isl* [61] as the starting fusion strategy, based on which the tiling algorithm is developed. We introduce this pre-tiling algorithm in Section 4.1.
- We upgrade Algorithm 2 for intermediate iteration spaces composed of multiple statements, simplifying its algorithmic flow. Instead of constructing affine relations for each statement, we group multiple statements of an intermediate iteration space to upgrade Algorithm 2, as will be explained in Section 4.5.
- We refine Algorithm 3 for fusion with scattered tile shapes, ensuring the safety of post-tiling fusion in more general scenarios. It was originally assumed the output of Algorithm 2 can cover the full iteration space of a fused group, which is true for the simple computation patterns considered before. We generalize this post-tiling fusion by also considering scattered tile shapes that may be generated in more general cases, as will be described in Section 5.2.
- We optimize Algorithm 4 that was strictly constrained, reinforcing the approach by allowing for more optimization opportunities. In addition to checking the emptiness of the intersection between multiple output iteration spaces, Algorithm 4 is enhanced in Section 5.3 with another condition not introducing redundancy taken into consideration. In particular, the optimized algorithm can perfectly deal with the self-attention mechanism [60] adopted by deep neural networks for natural language processing.
- We offer solutions to the scenarios where a single fusion group is expected for, widening the general applicability of the approach. Some cases like a sub-graph of deep learning applications expect for a single GPU kernel, but the partial version [65] loses its effectiveness in some cases. We introduce an optimization strategy to deal with such cases in Section 5.4.
- We develop a CUDA backend for AKG [68], making it possible to automatically generate code for deep learning models on GPU. Following the basteln strategy to model loop tiling and fusion, the CUDA code generation strategy is presented in Section 6.2.
- We propose Algorithm 5 as an operator inlining algorithm, reducing the number of intermediate values that have to be allocated on faster memories after fusion. AKG [68] did not consider operator inlining since the programming model of its target platform requires the generated code to be written in the three-address form. However, such a restriction does not exist when targeting GPU, which calls for the implementation of an operator inlining algorithm. We describe Algorithm 5 in Section 6.3.
- We reinforce the evaluation and discussion, enriching the experimental results and the comparison with related work. An in-depth comparison of deep learning operators and models on GPU between our work, vendor libraries, and TVM is explained in Section 7.2, and the evaluation of more deep neural networks on Huawei Ascend 910 is conducted in Section 7.3.

In addition, the basteln strategy makes progress over the state of the art as follows. Unlike existing frameworks [14, 54] that only target image processing and deep learning applications, our approach also considers programs extracted from linear algebra and sparse matrix computation,

showing a more general applicability. Rather than requiring users to specify loop tiling and fusion by hand, our method automatically guarantees the validity of each transformation, going beyond existing polyhedral tools [13, 24, 62] by implementing a novel composition of loop tiling and fusion. Algorithm 2 can construct tighter tile shapes and is more generally applicable than PolyMage [44], a polyhedral implementation of the Halide scheduler. In particular, the widely used tiling-afterfusion order in the past [12, 13, 24, 62] can be considered a special case of Algorithm 4. Finally, the algorithms presented in this work also moderate compilation time without restricting to special cases [58] or relaxing scheduling constraints [2].

We conduct experiments on six real-life deep learning models used to address complex problems of image classification, neural language processing, and the recommendation system. The experimental results show that our approach achieves a mean speedup of 1.8× over the highly tuned cuBLAS/cuDNN libraries and 1.1× over TVM on GPU. When used to generate code for the dedicated accelerator, the basteln strategy outperforms a classical polyhedral fusion heuristic and TVM by 11% and 20%, respectively. We also evaluate the approach using 10 benchmarks covering application domains including image processing, sparse matrix computation, and linear algebra, demonstrating the effectiveness and general applicability of our approach CPU and GPU. Finally, we compare the compilation overhead of our approach with existing polyhedral fusion heuristics, validating that the approach can also sometimes obtain compile-time improvements.

The article is organized as follows. Section 2 introduces the background of this work and its motivation. Section 3 overviews the basteln strategy. Section 4 explains the technique for constructing tile shapes, and Section 5 describes the post-tiling fusion algorithm. Section 6 presents the code generation strategies for different architectures, followed by the experimental results reported in Section 7. Section 8 discusses related work, and Section 9 concludes the article.

## 2 BACKGROUND AND MOTIVATION

This section first introduces the polyhedral model and the interplay between loop tiling and fusion. Next, it presents the polyhedral representation that will be used in this work. Finally, the motivation of the basteln strategy will be discussed.

## 2.1 Tiling and Fusion in the Polyhedral Model

The polyhedral model is a mathematical abstraction for automatic parallelization and locality optimization. It represents a program using iteration domains, access relations, dependences, and schedules. It uses schedules to represent both the original lexicographical order of a program and one generated by a scheduling algorithm. A schedule is an affine function over all statement instances (i.e., iteration domains). A scheduling algorithm has to respect the dependences relating statement instances that depend on each other, which are in turn computed on the basis of access relations. An access relation is an affine map between statement instances and memory locations.

With the power of the polyhedral model, one can easily apply many combinations of affine transformations to exploit different optimizations, among which we mainly focus on the compositions of tiling and fusion in this article. Consider the example shown in Figure 1(a) composed of three deeply nested loops, each of which represents a primitive operator of deep neural networks. The first loop nest can be viewed as a quantization, and the second is a 2D convolution over an input image A using kernel B, with C representing the output image. The convolution is usually followed by an activation function, which is represented using the last loop nest in Figure 1(a). We suppose that the activation function is ReLU, and one can replace it using any other activation functions. Using the polyhedral model, the initial schedule can be expressed using a multi-dimensional affine

5:6 J. Zhao et al.



(b) The code and GPU mapping of a conservative fusion heuristic

Fig. 1. A 2D convolution with its fusion result.

schedule as

$$init\_schedule := \{S_1(h, w) \to (0, h, w); S_2(h, w) \to (1, h, w, 0); \\ S_3(h, w, kh, kw) \to (1, h, w, 1, kh, kw); S_4(h, w) \to (2, h, w)\},$$
 (1)

with  $S_1$  to  $S_4$  shown in Figure 1(a).

The scheduling algorithms of the polyhedral model can compute a new schedule amenable to loop tiling, which can be combined with different fusion heuristics. With a conservative fusion heuristic, the new schedule can be expressed as

$$new\_schedule := \{S_1(h, w) \to (0, h, w); S_2(h, w) \to (1, h, w, 0, 0, 0); S_3(h, w, kh, kw) \to (1, h, w, kh, kw, 1); S_4(h, w) \to (1, h, w, KH - 1, KW - 1, 2)\},$$
(2)

and we use  $(\{S_1\}, \{S_2, S_3, S_4\})$  to represent the fusion result. One can now apply rectangular tiling using tile sizes  $T_0 \times T_1$  to the first group  $\{S_1\}$  and  $T_2 \times T_3$  to the second group  $\{S_2, S_3, S_4\}$ , with the tiling schedule expressed as

$$new\_schedule := \{S_1(h, w) \to (0, h/T_0, w/T_1); S_2(h, w) \to (1, h/T_2, w/T_3, 0, 0, 0); S_3(h, w, kh, kw) \to (1, h/T_2, w/T_3, kh, kw, 1); S_4(h, w) \to (1, h/T_2, w/T_3, KH - 1, KW - 1, 2)\},$$
(3)

and the tiled code shown on the left of Figure 1(b). We use ht, wt to represent the tile loops (iterating among tiles) and hp, wp the point loops (iterating within a tile) after loop tiling. In practice, tile sizes should be replaced using fixed integer values; otherwise, affine relations (3) would be non-affine. We use symbolic tile sizes here to illustrate that our work can handle any integer tile sizes.

The cost model of such a conservative heuristic is to maximize fusion without sacrificing the parallelism of the fused loops. When targeting CPUs, the compiler can add OpenMP pragmas before each group as shown in Figure 1(b). While the tiled OpenMP code benefits from the maximal parallelism preserved by this fusion heuristic, tensor A cannot be allocated in small scratchpads but has to be stored as full buffers. When targeting GPUs, the polyhedral model generates CUDA code by mapping the parallel loops to the two-level hardware parallelism on GPUs. The GPU mapping strategy enabled by the model is shown on the right of Figure 1(b), with the tile loops ht, wt mapped to thread blocks (red arrows) and the point loops hp, wp to threads (blue arrows). However, tensor A cannot be allocated in the shared memory.

On the contrary, an aggressive heuristic like the *maxfuse* strategy of Pluto [13] maximizes data locality by fusing all statements through the combination of loop interchange, shifting, and skewing, with the generated code shown in Figure 2. This policy reduces the number of tilable dimensions and loses the outer parallelism; the interchange transformation also results in another problem: the dimensions of the fused loop nests mismatch with the tile sizes specified by users due to the permutation of dimensions, especially in the case of domain-specific frameworks for deep neural networks.

We therefore argue that the tiling-after-fusion strategy in many existing polyhedral optimizers is a suboptimal solution that cannot fully exploit the memory hierarchy, and intend to take another way to avoid the tradeoff between tilability/parallelism and locality by reordering the sequence of tiling and fusion. To put such a reordering into practice, we choose to leverage schedule trees [25].

```
/* (1) c0,c1 are not parallelizable, nor tilable.
   (2) c0,c1 cannot be mapped to GPU grid.(3) if conditionals expand the code size.
      and make tiling difficult. */
for(c0=0;c0<H;c0++)
  for(c1=0:c1<W:c1++){
   if(c0<=H-KW && c1<=W-KW)
      S_2(c0,c1);
    if(c1>KW){
      if(c0<KH-1)
        S_1(c0,c1);
      for(c2=max(KH-1,c0);c2<min(H,c0+KH);c2++){
        for(c3=0;c3<KW;c3++){
          S_3(c2-KH+1,c1-KW+1,c0-c2+KH-1,c3);
            if(c2==c0 && c3==KW-2)
             S_1(c0,c1);
        if(c0<H-1 && c1<W-1 && c2==c0)
          S_4 (c0-KH+1,c1-KW+1);
     }
      S_1(c0,c1);
```

Fig. 2. The code produced by an aggressive fusion heuristic.

#### 2.2 Schedule Trees

The polyhedral model uses multi-dimensional affine schedules for representing the lexicographical order of a program, but this representation cannot be easily extended to automate memory managements on GPUs (e.g., the automatic insertion of thread-level synchronizations for CUDA code). Affine schedules can be explicitly encoded using a tree structure [25], which can simplify the modeling of automatic memory managements in polyhedral compilers.

Building an initial schedule tree for a program on the basis of a multi-dimensional affine schedule is straightforward. A schedule tree starts with a so-called *domain* node containing all statement instances (i.e., iteration domains). A *sequence* node is introduced to explicitly express the scalar dimensions used in multi-dimensional affine schedules, defining a particular order on its children. Each child of a sequence node has to be a *filter* node, which collects a subset of statement instances introduced by its outer domain or filter node. A *band* node is used to encode the variable and/or constant dimensions of multi-dimensional affine schedules, in the form of a piecewise multi-dimensional affine function over the iteration domain. A band node can retain the permutability and/or parallelism properties.

Code generation in the polyhedral model is also referred to as *polyhedral scanning*. The polyhedral code generators like those found in other works [8, 25] take as input the iteration domain and the new schedule produced by an affine scheduling algorithm for generating imperative code. The schedule tree representation uses band nodes and sequence nodes for encoding the classical multi-dimensional affine schedules, but also introduces a domain node to represent the iteration domain. Encoding an iteration domain and schedules together makes it possible to generate code by only scanning schedule trees. In addition, one can introduce custom nodes in a schedule tree to facilitate non-polyhedral semantics, which simplifies the code generation for different architectures.

Still consider Figure 1(a) as an example. It is easy to obtain its initial schedule tree representation, depicted in Figure 3(a) from the multi-dimensional affine schedule representation shown in affine

5:8 J. Zhao et al.



Fig. 3. Schedule trees of the code in Figure 1(a).

relations (1). The domain node can be expressed as

$$domain := \{S_{1}(h, w) : 0 \leq h < H \land 0 \leq w < W; S_{2}(h, w) : 0 \leq h \leq H - KH \land 0 \leq w \leq W - KW;$$

$$S_{3}(h, w, kh, kw) : 0 \leq h \leq H - KH \land 0 \leq w \leq W - KW \land 0 \leq kh < KH \land 0 \leq kw < KW;$$

$$S_{4}(h, w) : 0 \leq h \leq H - KH \land 0 \leq w \leq W - KW\}.$$

$$(4)$$

An initial schedule tree can be automatically transformed into a new one, as shown in Figure 3(b). The information about the parallelism and tilability of the loop nest is attached in band nodes, which we do not show here for the sake of simplicity.

A band node is used to represent a loop nest, and the attached information is used to guide the compiler to apply loop transformations and determine the parallelism of each loop dimension. The content of the band node below  $\{S_1\}$  is

$$band_0 := \{S_1(h, w) \to (h, w)\},$$
 (5)

which is the same as that of Figure 3(a) but attached with two attributes, a Boolean value *permutable* and a vector *coincident*, where *permutable* is used to indicate whether the loop nest is tilable, and each component of *coincident* is used for expressing the parallelism of a single loop. A component of the *coincident* vector is also represented using a Boolean value: it indicates that the corresponding loop dimension is parallelizable using value 1; the corresponding loop dimension cannot be parallelized when it is 0. In Figure 3(b), *permutable* is 1 and *coincident* should be [1, 1] for affine relations (5).

The band node below  $\{S_2, S_3, S_4\}$  can be written as

$$band_1 := \{S_2(h, w) \to (h, w); S_3(h, w, kh, kw) \to (h, w); S_4(h, w) \to (h, w)\},$$
(6)

with the attributes represented as 1 and [1, 1]. It means that the two loop nests in Figure 3(b) are both tilable and possess 2D parallelism.

We only present the node types that will be used in this work but invite the readers to refer to the work of Grosser et al. [25] for a detailed description of all node types of a schedule tree. In particular, an extension node is designed to add additional statements not covered by the domain node of a schedule tree using an affine relation. It can be used to automate memory managements by retaining itself to a suitable position in schedule trees. We will leverage extension nodes to implement a post-tiling fusion algorithm.

#### 2.3 Motivation

Our initial intention is to address the dilemma of the polyhedral model, but how a reordering should be implemented using schedule trees is not clear. The recent domain-specific frameworks,

Halide [54] and TVM [14], adopt an ad-hoc composition of loop tiling and fusion that lies between the conservative and aggressive fusion heuristics, and have demonstrated its effectiveness for image processing pipelines and deep learning applications. Unfortunately, these domain-specific frameworks also face challenges as introduced in Section 1.

As such, in this article, we generalize the approach of these domain-specific frameworks in the polyhedral model by targeting a wider class of programs. An input program to our approach should satisfy the "static affine control" requirement [13, 61] of the polyhedral model and be at least composed of two loop nests; otherwise, talking about fusion does not make sense. We also expect that some of the loop nests are tilable such that the tradeoff between loop fusion and tiling can be exposed. This class of programs is a super set of function/operator pipelines of image processing/deep learning, which are considered by existing optimizing frameworks like Halide [54], TVM [14], and PolyMage [44]. In particular, dynamic counted loops can also be handled by our approach when the non-affine code generation strategy [67] is integrated into the polyhedral model, as will be demonstrated in the experiments.

Besides, our approach can take as input programs written in a general-purpose language or a DSL. Some DSLs of existing optimizing compilers (e.g., TVM) fall short in expressing an in-place update fashion, thus calling for two buffers to implement time-iterated stencils, but this is not an issue for programs written in a general-purpose language. With these more general scenarios in mind, we model the interplay of loop tiling and fusion in polyhedral compilation and further strengthen the power of this mathematical abstraction.

#### 3 OVERVIEW OF THE BASTELN STRATEGY

To implement the basteln strategy as a general approach in the polyhedral model, we do not need to build everything from scratch but can leverage existing off-the-shelf libraries to extract sufficient information from a program of loop nests, on top of which our algorithms can be implemented to better model the interplay between loop tiling and fusion. Among such polyhedral libraries, we choose *isl* [61] that can automatically express a program of loop nests using affine sets and relations as the basis of our approach. By expressing a loop program as a schedule tree [25], *isl* also offers scheduling algorithms that can automatically transform a program into a form that loop tiling can be applied, with the validity and tile shapes (parallelogram or rectangular) automatically determined. Users do not need to provide domain specific knowledge on how affine sets and relations should be extracted.

Our approach resorts to *isl* to automatically extract affine sets and relations of a given program. In addition, we also use its scheduling algorithms to determine the tilability and parallelism of each loop nest. Before proceeding into the details of our approach, we depict the overview of the basteln strategy in Figure 4. Denoted by the gray region, the basteln strategy resorts to Algorithm 4 (i.e., the compose engine of loop tiling and fusion) as its core algorithm to systematically implement a complex composition of loop tiling and fusion. Algorithm 4 takes as input a fusion plan produced by Algorithm 1, which performs pre-tiling fusion on the optimized schedule tree representation of an input program. The schedule tree is first optimized by a scheduling algorithm of the polyhedral model without fusion and tiling. We only use such scheduling algorithms to determine the parallelism/tilability attributes of each band node.

For each output iteration space, Algorithm 4 invokes Algorithm 2 to construct tiling schedules and infer the tile shapes of all intermediate iteration spaces. The output of Algorithm 2 is processed by Algorithm 3 to execute post-tiling fusion, with some corner cases handled by Algorithm 4 for evading redundancy. The output of Algorithm 4 is delivered to the pretty printer of a compiler to generate final code. In particular, Algorithm 5 is introduced before delivering the output of Algorithm 4 to the pretty printer if the target is GPU, which performs operator inlining.

5:10 J. Zhao et al.



Fig. 4. Overview of the approach.

#### 4 CONSTRUCTING TILE SHAPES

Constructing complex tile shapes using the traditional tiling-after-fusion manner is non-trivial, which encourages us to reorder these two loop transformations. Some loop nests with simple dependence patterns, however, do not require complex tiling techniques; they can be well modeled by performing loop fusion before tiling. A conservative pre-tiling fusion process can be implemented by the default fusion heuristic of *isl* [61] or other techniques [12], which we can leverage to achieve our purpose. Unfortunately, using *isl*'s fusion heuristic makes the basteln strategy not stand alone, although *isl* is an integer set library extensively employed by plenty of software. We realize that the conservative fusion heuristic can be easily modeled before constructing tile shapes, and thus design a pre-tiling fusion algorithm.

## 4.1 Pre-Tiling Fusion

Once the scheduling algorithms [63] of isl have determined lexicographical execution dates for statement instances, its conservative fusion heuristic tries to group as many loop nests as possible without losing the parallelism along a pre-defined number of loop dimensions and their tilability. Recall that the tilability of a loop nest is denoted using a Boolean variable permutable of its band node in schedule trees, as described in Section 2.2. The tilability of multiple loop nests still holds if permutable of the fusion result's band node is 1. The number of parallelizable loop dimensions d can be defined in accordance with the hardware parallelism of a target platform. We let d be equal to 1 when our code generators target CPU or Huawei Ascend 910 [42], the deep learning accelerator discussed in this article, or instantiate it using 2 when targeting for GPU. This is because 1D parallelism is sufficient for OpenMP code on CPU or parallel programs executable on Ascend 910, whereas 2D parallelism is used to guarantee parallel execution of CUDA code across both GPU threads and thread blocks.

The polyhedral model assumes that a loop is parallelizable when there exists no loop-carried or inter-iteration dependences along this dimension. In many programs, especially those image processing pipelines and deep learning applications, the dependences between some image functions or network operators are rather simpler, which makes it possible to decide whether some loop nests can be grouped without sacrificing anything before loop tiling.

Specifically, these rather simpler dependences establish *quasi-identity relations* between each pair of statement instances across different loop nests, which never requires overlapped accesses across the fused loop nests.

*Definition 4.1.* A quasi-identity relation  $\{R(v_1, \ldots, v_m) \to T(v'_1, \ldots, v'_n)\}$  is a function between two affine sets  $T(v_1, \ldots, v_m)$  and  $R(v'_1, \ldots, v'_n)$  such that

- (1)  $v_i' = v_i$  holds for each i that satisfies  $1 \le i \le \min(m, n)$ , and
- (2) each  $v_j$  (min $(m, n) < j \le m$ ) or  $v_j'$  (min $(m, n) < j \le n$ ) is a constant,

where v and v' are the index variables of the two affine sets.

For instance, the dependences between each pair of statement instances of  $\{S_3\}$  and  $\{S_4\}$  are a set of quasi-identity relations that can be expressed as

identity := 
$$\{S_3(h, w, kh = KH - 1, kw = KW - 1) \rightarrow S_4(h' = h, w' = w) : 0 \le h \le H - KH \land 0 \le w \le W - KW\}.$$
 (7)

On the contrary, the dependences between  $\{S_1\}$  and  $\{S_3\}$  are a set of trival affine relations, which can be written as

$$trival := \{S_1(h, w) \to S_3(h', w', kh = h - h', kw = w - w') : \\ 0 \le h' \land h - KH < h' \le h \land h' \le H - KH \land 0 \le w' \land w - KW < w' \le w \land w' \le W - KW\}.$$
 (8)

Quasi-identity relations like affine relations (7) indicate the absence of loop-carried dependences along the loop dimensions represented by the variables in the relations. We can safely merge the statement pair of a quasi-identity relation when they belong to different loop nests. Otherwise, we do not fuse the statement pair and proceed to Algorithm 2. This pre-tiling fusion can be easily implemented on top of a schedule tree similar to Figure 3(a). Algorithm 1 formally describes how pre-tiling fusion is performed, with its detailed explanation spelled out in Appendix A.1.

Algorithm 1 produces a fusion strategy ( $\{S_1\}$ ,  $\{S_2, S_3, S_4\}$ ) for the example in Figure 1(a), which is the same as that generated by the default fusion heuristic of *isl*. It also makes the basteln strategy independent of *isl*'s fusion heuristics. Indeed, Algorithm 1, as well as many fusion heuristics of existing polyhedral compilers, may miss some fusion possibilities even no overlapped accesses are required, but Algorithm 2 will capture these missed opportunities. Importantly, this pre-tiling fusion does not require manual efforts and guarantees validity due to the use of polyhedral scheduling algorithms, which can retain the *coincident* and *permutable* attributes to each band node.

# 4.2 Tiling Output Iteration Spaces

With the fusion configuration suggested by Algorithm 1, one can perform tiling to each iteration space, and this is what the tiling-after-fusion manner does. To better illustrate how this implementation results in an issue, we assume H = W = 6 and KH = KW = 3. A rectangular tiling can be applied using piecewise affine relations:

schedules := 
$$\{S_1(h, w) \to (h/T_0, w/T_1, h, w)\}, \{S_2(h, w) \to (h/T_2, w/T_3, h, w); S_3(h, w, kh, kw) \to (h/T_2, w/T_3, h, w, kh, kw); S_4(h, w) \to (h/T_2, w/T_3, h, w)\},$$

$$(9)$$

which tile the first group with tile sizes  $T_0 \times T_1$  and the second with  $T_2 \times T_3$ . Each piece of affine relations (9) represents a tiling schedule, expressed within a pair of braces. We call such piecewise affine relations *tiling schedules*.

We use the term *quantization space* to denote the first group and the term *reduction space* to denote the second. The tiled iteration spaces are shown at the bottom of Figure 5 when given tiles sizes  $T_0 = T_1 = 4$  and  $T_2 = T_3 = 2$ . The quantization space is separated into four blocks, with one full tile (in yellow) and three partial tiles. Full tiles are entirely covered by the iteration space; partial tiles are on the boundaries of an iteration space and interleave with the latter [38]. The reduction space is composed of four full tiles.

There exists dependences caused by tensor A between the two iteration spaces, which is written by  $S_1$  and read by  $S_3$ . We show the data space of tensor A on the top of Figure 5. Let us focus on the red tiles in both iteration spaces. The access relations between each red tile and tensor A are represented using dotted and dashed lines, respectively. The red tile of the



Fig. 5. Tiling iteration spaces individually.

5:12 J. Zhao et al.

# ALGORITHM 1: The pre-tiling fusion algorithm

```
Input: schedule tree—An initial schedule tree
 1 schedule tree ← Apply a polyhedral scheduling algorithm without fusion heuristics;
 n ← the number of the top sequence node's children;
 g \leftarrow 1;
 4 foreach i \in [1, n] do
        visited<sub>i</sub> \leftarrow false;
   foreach i \in [1, n] do
         if visited_i = true then
              continue;
 8
         group_q \leftarrow filter_i;
         num_q \leftarrow 1;
10
         visited_i \leftarrow true;
11
         foreach j \in [i+1, n] do
12
              if the dependences between group<sub>q</sub> and filter; are quasi-identity relations like affine relations (7)
13
                   group_q \leftarrow group_q \cup filter_i;
14
                   num_q \leftarrow num_q + 1;
15
                    visited_i \leftarrow \mathbf{true};
16
              else
17
                   q \leftarrow q + 1;
18
19
                   break;
    schedule tree \leftarrow Replace the subtree of the top sequence node using the q groups;
20
    foreach k \in [1, q] do
         if num_k = 1 then
22
              schedule tree \leftarrow Insert the original subtree of group<sub>k</sub> under filter<sub>k</sub>;
23
         else
24
              band_1 \leftarrow \text{Construct a band node like affine relations (6) under filter_k};
25
              band_1 \leftarrow \text{Insert a sequence node under } band_1;
26
              foreach stmt in filter, do
27
                   band_1 \leftarrow \text{Insert the original subtree of } stmt \text{ as a child of } band_1;
28
              schedule tree \leftarrow Insert band<sub>1</sub> under filter<sub>k</sub>;
29
    Output: schedule tree—A schedule tree after pre-tiling fusion
```

quantization space writes to four points of tensor *A*, whereas the red tile of the reduction space requires 16 points. The conflict between the memory footprints requires a gather-scatter communication of tenor *A* between two iteration spaces, which prevents the fusion of the two red tiles.

Such a conflict is due to the oversight of transformations on data spaces. The memory footprints of tensor A required by each tile of the reduction space are known when the reduction space is already tiled, and these memory footprints can be used to determine the tile shapes of the quantization space in conjunction with the access relation between  $S_1$  and tensor A. The tiles with the same color from different iteration spaces can now be fused. Constructing the tile shape of the quantization space this way can also reduce the magnitude of the tile size space, since users only need to specify the tile sizes for the reduction space and the search space of an auto-tuner can be compressed. Hence, we use

$$\{S_2(h, w) \to (h/T_2, w/T_3, h, w); S_3(h, w, kh, kw) \to (h/T_2, w/T_3, h, w, kh, kw); S_4(h, w) \to (h/T_2, w/T_3, h, w)\}$$
(10)

to apply rectangular/parallelogram tiling to the reduction space and also use it to infer the tile shapes of the quantization space.

# 4.3 Extracting Upward Exposed Data

The data accessed within the reduction space can be divided into read and write accesses. We use *upward exposed data* to refer to those data read by the reduction space but defined in the quantization space. One can easily construct the relation between the iteration tiles and the upward exposed data by assembling the dependence relations and the read access relation of the reduction space. The data space of tensor *A* is depicted in the middle of Figure 6, with each data tile represented using the same color as its corresponding iteration tile in the reduction space.

Let us go through the process in the example. We only discuss statement  $S_3$  that reads tensor A in the reduction space for the sake of simplicity. The tiling transformation applied to the reduction space can be expressed using affine relations (10). Meanwhile, one can extract the affine relations over statement  $S_3$  to tile dimensions from affine relations (10), which should be



Fig. 6. Constructing tile shapes via upward exposed data.

$$t := \{S_3(h, w, kh, kw) \to (o_0 = h/T_2, o_1 = w/T_3)\},\tag{11}$$

where  $o_0$  and  $o_1$  represent the tile dimensions. The access relations over  $S_3$  to the upward exposed data are written as

$$a := \{S_3(h, w, kh, kw) \to A(h+kh, w+kw) : 0 \le h \le H - KH \land 0 \le w \le W - KW \land 0 \le kh < KH \land 0 \le kw < KW\}.$$

$$(12)$$

The relation between the tile dimensions  $(o_0, o_1)$  and the upward exposed data tensor A can be constructed by composing the reverse of affine relations (11) with affine relations (12), producing

$$f := \{ (o_0, o_1) \to A(h', w') : 0 \le o_0 < \lceil (H - KH + 1)/T_2 \rceil \land 0 \le o_1 < \lceil (W - KW + 1)/T_3 \rceil \land T_2 \cdot o_0 \le h' < T_2 \cdot o_0 + KH + T_2 - 1 \land T_3 \cdot o_1 \le w' < T_3 \cdot o_1 + KW + T_3 - 1 \},$$

$$(13)$$

which represents the dashed lines between the reduction space and the data space in Figure 6. It allows the overlapped memory footprints between two consecutive iteration tiles.

We continue by focusing on the blue and red tiles of the reduction space. Without loss of generality, we assume  $T_2 = T_3 = 2$ , and the blue tile can be represented using a coordinate  $(o_0 = 1, o_1 = 0)$  in the space of tile dimensions while the red tile can be represented using  $(o_0 = 1, o_1 = 1)$ . One can apply affine relations (13) to these tiles and obtain their memory footprints, which can be written as  $\{A(h', w') : 2 \le h' \le 5 \land 0 \le w' \le 3\}$  and  $\{A(h', w') : 2 \le h' \le 5 \land 2 \le w' \le 5\}$ , respectively. In other words, their intersection is accessed by both tiles, which represents the interleaved region between the blue and red tiles in the data space.

#### 4.4 Tiling Intermediate Iteration Spaces

The memory footprints obtained by affine relations (13) can be used to construct the tile shape of the quantization space, which writes intermediate values to tensor A. With the polyhedral model, one can determine the tile shape of the quantization space using elementary operations on affine relations. Let us still use the example to illustrate the construction of the tile shape for the quantization space.

5:14 J. Zhao et al.

The polyhedral model can provide the write access relations of the quantization space over tensor A. Affine functions mapping tensor A to the statement  $S_1$  can be computed by reversing the write access relations, yielding

$$r := \{ A(h, w) \to S_1(h, w) : 0 \le h < H \land 0 \le w < W \}. \tag{14}$$

affine relations (14) is represented using the dotted lines between the data space and quantization space in Figure 6. Composing affine relations (13) with affine relations (14) generates another set of affine relations:

$$h := \{ (o_0, o_1) \to S_1(h, w) : 0 \le o_0 < \lceil (H - KH + 1)/T_2 \rceil \land 0 \le o_1 < \lceil (W - KW + 1)/T_3 \rceil$$

$$\land T_2 \cdot o_0 \le h < T_2 \cdot o_0 + KH + T_2 - 1 \land T_3 \cdot o_1 \le w < T_3 \cdot o_1 + KW + T_3 - 1 \},$$

$$(15)$$

each of which represents a map relating the tile dimensions  $(o_0, o_1)$  of the reduction space to a set of  $S_1$ 's instances. The tile dimensions of the reduction space divides the statement instances of  $S_1$  into multiple subsets/tiles, which implements the tiling of the quantization space without using the first piece of affine relations (9).

In more general cases, such as when such an intermediate iteration space is composed multiple statements, one can further compute the dependences between these statements and compute affine relations like affine relations (15) gradually. We believe, however, a safer handling strategy should consider an intermediate iteration space as a whole. Without loss of generality, consider Figure 7 as an example, where the quantization space is composed of two statements, one of which has been attached with affine relations like affine relations (15). We further construct a set of grouping relations

```
for(h=0;h<H;h++)
  for(w=0;w<W;w++){
    tmp=Quant(A[h][w]);    /* S<sub>0</sub> */
    A[h][w]=tmp;    /* S<sub>1</sub> */
}
```

Fig. 7. A loop nest composed of multiple statements connected by a scalar.

$$q := \{ G(h, w) \to S_x(h, w) : 0 \le x \le 1 \}$$
 (16)

that considers all statements of an intermediate iteration space as whole, and composing affine relations (15) with the reverse of affine relations (16) produces

$$e := \{ (o_0, o_1) \to G(h, w) : 0 \le o_0 < \lceil (H - KH + 1)/T_2 \rceil \land 0 \le o_1 < \lceil (W - KW + 1)/T_3 \rceil \land T_2 \cdot o_0 \le h < T_2 \cdot o_0 + KH + T_2 - 1 \land T_3 \cdot o_1 \le w < T_3 \cdot o_1 + KW + T_3 - 1 \}.$$

$$(17)$$

We use *extension schedules* to represent affine relations like affine relations (17), since they will be used by an extension node in Section 5. Constructing grouping relations like affine relations (16) is especially critical for cases like that shown in Figure 7, since these statements still stay in the same loop nest after post-tiling fusion. Note that statements grouped by relations like affine relations (16) do not have to have the same dimensions but only need to have some common loop dimensions. This allows for the grouping of statements with different dimensions (e.g.,  $S_2$  and  $S_3$  in Figure 1(a)) to be considered as a whole when propagating the tiling dimensions from output to intermediate iteration spaces.

Composing affine relations (13) with affine relations (14) can be interpreted as the conjunction of the dashed lines and dotted lines in Figure 6. All of the statement instances within the blue tile of the quantization space can thus be represented as  $\{S_1(h,w): 2 \le h \le 5 \land 0 \le w \le 3\}$ , and those of the red tile should be  $\{S_1(h,w): 2 \le h \le 5 \land 2 \le w \le 5\}$ . This means that the tile shape of the quantization space computed using affine relations (17) can overlap with each other. Performing loop tiling using an extension schedule is not possible in existing polyhedral compilation frameworks [6, 13, 24, 59, 62], but an extension schedule must be used together with the post-tiling fusion algorithm in Section 5.

# 4.5 The Tiling Algorithm

We assume that the result of Algorithm 1 contains only one output iteration space. The handling of multiple output iteration spaces will be discussed in Section 5.3 and Section 5.4. Algorithm 2 formally describes the construction of the tile shapes imposed by the dependences. As Algorithm 2 lays the foundation of the basteln strategy, we put its detailed explanation here.

## ALGORITHM 2: Construct tile shapes using upward exposed data

```
Input: spaces—Affine sets for iteration spaces
 1 output \leftarrow The output iteration space of spaces;
 2 spaces ← spaces \ output;
 3 untiled ← \emptyset;
 4 if output is tilable then
        mixed schedules ← A schedule performing rectangular/parallelogram tiling like affine relations (9)
        m \leftarrow Number of parallelizable loops of mixed schedules;
        t^{-1} \leftarrow A function associating a tile coordinate to its iteration space (i.e., the reverse of a function t
        like affine relations (11));
        a \leftarrow An access relation like affine relations (12) that maps output to its upward exposed data;
        f \leftarrow \text{Compute the relation between a tile coordinate and its memory footprint by composing } t^{-1}
        and a, like affine relations (13);
        foreach iteration space s in spaces do
10
             if s is not the topologically last one of spaces then
                 continue;
12
             n \leftarrow Number of parallelizable loops of s;
13
             if m > n then
14
                  untiled \leftarrow untiled \cup s;
15
                 continue;
16
             stmt \leftarrow The statement in s that defines data consumed by statements of output;
17
             r \leftarrow The reverse of the write accesses of stmt, like affine relations (14);
             h \leftarrow Compute the relation between a tile coordinate and its enclosed statement instances by
             composing f and r, like affine relations (15);
             g \leftarrow Construct relations grouping the merged dimension of every statement of s like affine
             relations (16);
             e \leftarrow Associate a tile coordinate to the group of intermediate iteration space by composing h
             and g^{-1}, like affine relations (17);
            f \leftarrow f \cup compute upward exposed data consumed by all statements in g;
             mixed\ schedules \leftarrow mixed\ schedules \cup \ e;
             spaces \leftarrow spaces \setminus s;
             output \leftarrow output \cup s;
25
        if untiled \neq \emptyset then
             mixed\ schedules \leftarrow mixed\ schedules \cup Apply\ Algorithm\ 2\ to\ untiled;
   else
28
        mixed schedules \leftarrow schedule of output \cup Apply Algorithm 2 to spaces;
   Output: mixed schedules
```

4.5.1 Finding a Tilable Output Iteration Space Backward. Algorithm 2 takes a group of affine sets spaces (i.e., the domain node of the schedule tree generated by Algorithm 1) and obtains the output iteration space output (line 1), which is subtracted from spaces (line 2) and the latter

5:16 J. Zhao et al.

becomes the set of intermediate iteration spaces. The *output* is unique as assumed at the beginning of Section 4.5. When *output* is tilable, Algorithm 2 first constructs a tiling schedule of *output* for simple rectangular/parallelogram tiling (line 5), which is used to initialize the output of this algorithm, *mixed schedules*, a union of tiling schedules and extension schedules. We use the *isl* scheduler [63] to construct rectangular/parallelogram tile shapes for *output*. The relation from upward exposed data to the tile dimensions of *output* is computed between lines 6 and 9, with an example discussed in Section 4.3. This process is similar to the idea of an earlier work on dataflow analysis [20] but different from the latter by computing in a reverse manner and performing on tiles rather than iteration spaces. Algorithm 2 is applied to *spaces* when *output* is not tilable (line 29).

4.5.2 Constructing Complex Tile Shapes. The construction of tile shapes for each intermediate iteration space s is implemented between lines 11 and 25. The conditional statement at line 11 is introduced because s may also define values that will be used by anther intermediate iteration space u. The u should be visited before s (line 12) because it also depends on s. This step visits the topological order of all intermediate iteration spaces in s paces in a reverse manner, thereby ensuring the correctness of the visited order.

For instance, suppose that there exists three intermediate iteration spaces  $I_1$ ,  $I_2$ , and  $I_3$ , and the output iteration space is O. The dependences between these iteration spaces are depicted in Figure 8, and their order in spaces is  $\{I_3, I_2, I_1\}$ . The s between lines 11 and 25 of Algorithm 2 is first instantiated by  $I_3$ , but its data are also consumed by  $I_2$ . According to the idea of the basteln strategy,  $I_2$  should be first evaluated, and we thus visit  $I_2$  before  $I_3$ , which implies an order  $\{I_2, I_3, I_1\}$ .



The comparison between m and n at line 14 is used to guarantee the correctness of Algorithm 2, which in turn restricts the number of fused loop dimension in Algorithm 3, as will be explained in Section 5.2. In particular, the condition m > n promises that an intermediate iteration space

Fig. 8. An example to illustrate the reordering of intermediate spaces.

with fewer parallel loops will not be fused with *output* that is of more parallel dimensions. Recall that we use two attributes, *permutable* and *coincident*, of a band node to represent its tilability and parallelism. A modern polyhedral scheduling algorithm [13] always prefers outer parallelism, which means that the n parallelizable loops always appear at the outermost n levels of a multi-dimensional loop nest after Algorithm 1.

In practice, the parallelizable dimensions of *output* may be greater than m. For example, *output* has a 3D parallelizable band. One can force m to be equal to 1 (i.e., only the outermost loop to be parallelizable) when targeting CPUs because OpenMP code only provides 1D parallelism. One can also let m be equal to 2 when generating CUDA code for GPUs, which allows for more aggressive fusion strategies without losing the two-level hardware parallelism. Comparing m and n preserves the parallelism of *output*, but it may lose the parallelism of a fused intermediate iteration space. In the worst case when n > m = 0, the parallelism of an intermediate computation is completely lost. In such cases, we only assume *output* is tilable if m is greater than 0 when targeting CPUs, or if m is greater than 1 when targeting GPUs.

4.5.3 Guaranteeing the Correctness and Effectiveness of the Basteln Strategy. The s should not be tiled using upward exposed data but be added to another set untiled that collects all affine sets like s (line 15) when m is greater than n, which means the output iteration space has more parallelizable dimensions than s. One may obtain an incorrect tiled version of s without this condition. Algorithm 2 is recursively applied to untiled when it is not empty (lines 26 and 27), which ensures that our basteln strategy will compute a tiling schedule/an extension schedule for each tilable

iteration space. Note that each s will either be considered to be fused with *output* (lines 17–25) or added to *untiled*.

Line 17 obtains the statement stmt that generates the dependences between output and intermediate iteration spaces. Lines 18 through 21 perform the elementary operations as mentioned in Section 4.4. The e is used to construct the possibly overlapped/scattered tile shape of an intermediate iteration space with multiple statements. The s should be tiled using e and added to mixed schedules. Line 22 combines the upward exposed data produced by all statements in the s with f, which ensures that those statements in s not directly depending on output will also be evaluated. Lines 17 through 22 guarantee that all statements in an intermediate iteration space are evaluated by Algorithm 2.

The union of s and output will be considered as output (line 25), which, together with the update of f (line 22), is computed for propagating tiling constraints. Considering multiple statements in one iteration space (lines 17–22) and the update of output greedily fuses the intermediate iteration spaces rather than only grouping the last one before output. This guarantees that each iteration space will be tiled and fused correctly.

Constructing tile shapes for intermediate iteration spaces greedily can proceed as further as possible. For a pre-defined group of image processing pipelines as considered by Halide [54]/ PolyMage [44] or a sub-graph produced by a graph compiler [66], we apply Algorithm 2 to all functions in a group or operators in a sub-graph, because these applications define a hard constraint on code generation. For instance, they expect for a single CUDA kernel for a pre-defined group/sub-graph.

However, we have to consider the negative effect of recomputation in more general scenarios, since an overlapped tile shape trades parallelism and locality using recomputation. We should also take into account the possibly lost parallelism of intermediate iteration spaces. We find that overlapped tile shapes are usually caused by convolution operations in deep learning, heterogeneous stencil operations in image processing, and matrix multiplications in general cases. Recomputation caused by overlapped tiling may be an issue when these operations appear consecutively. The amount of recomputation introduced by consecutive convolutions depends on the strides of the convolution. As an example, one can imagine the data space of Figure 6 as another intermediate iteration space of convolution. The size of an overlapped iteration space would increase according to the stride of this convolution. Consecutive heterogeneous stencils observe similar trends. It is nontrivial to conclude whether overlapped tiling on consecutive convolutions/heterogeneous stencils is beneficial. Fortunately, these domain-specific applications always come with well-defined autotuners [4, 15, 69], which can be used to address this issue.

Unfortunately, consecutive matrix multiplications may introduce too many recomputations due to its computation pattern. As an example, we consider the 2mm benchmark from PolyBench [52] to illustrate the effect this issue. Consider Figure 9 that computes D = D + A \* B \* C through two matrix multiplications. Our approach can perform rectangular tiling on the output iteration space of D = D + E \* C, which can be used to determine the tile shape of the intermediate iteration space E = E + A \* B. However, the amount of recomputations caused by this overlapped tiling is significant, and the parallelism of the loop iterating the column of E in the intermediate iteration space is also lost. As such, we do not encourage to use the basteln strategy in the case of consecutive matrix multiplications, although it can be implemented easily.

4.5.4 Another Example for Illustrating the General Applicability. To illustrate that Algorithm 2 can work in more general cases, we assume that the output of Algorithm 1 suggests a fusion strategy ( $\{S_1\}, \{S_2, S_3\}, \{S_4\}$ ) for the code shown in Figure 1(a). Algorithm 2 extracts the last loop nest as *output* and performs rectangular tiling on this loop nest. It first computes tile shapes for

5:18 J. Zhao et al.



Fig. 9. Redundant computations of consecutive matrix multiplications. A (dark)  $64 \times 64$  tile of matrix D requires a  $64 \times K$  tile of matrix E, which in turn requires all data of matrix E with size  $E \times K$ . The next (light) tile of matrix E0 still requires the full matrix of E1, leading to too many recomputations.

the second loop nest that executes  $S_2$  and  $S_3$ . The upward exposed data are produced by  $S_3$  and its grouping relations are constructed, based on which the extension schedule is computed. The self-dependences of  $S_3$  thus do not impact the greedy construction of tile sizes, since  $S_2$  and  $S_3$  are now considered as a whole. Next, Algorithm 2 considers the union of the second and last loop nests as *output* (line 19 of Algorithm 2), and the tile shape of the first loop nest executing  $S_1$  will be constructed, the process of which is the same as described in Section 4.3 and Section 4.4. Constructing tile shape for this second loop nest, together with the post-tiling fusion algorithm in Section 5.2, captures the missed fusion opportunities of Algorithm 1.

4.5.5 Comparison with Other Polyhedral Techniques. Algorithm 2 can construct overlapped tile shapes without knowing the number of functions as the PolyMage framework [44] requires. It constructs overlapped tile shapes of the intermediate iteration spaces through the tiles of upward exposed data, avoiding the introduction of complicated constraints to the affine relations implemented in previous work [64]. More importantly, unlike existing tiling techniques for constructing complex tile shapes [11, 23, 39, 64], Algorithm 2 provides the ability to construct complex tile shapes and the general applicability to more application domains due to the consideration of transformations in data spaces. The tile shapes are determined by the access manner of upward exposed data. For example, one can convert the example shown in Figure 1(a) into matrix multiplication code by fine-tuning the kh, kw loops and the corresponding subscripts. The readers will find that Algorithm 2 can still apply to the code by constructing rectangular tile shapes.

## 5 POST-TILING FUSION

A tiling schedule like affine relations (10) maps a fusion group to a higher-dimensional tiled space. While used to construct complex tile shapes, Algorithm 2 also implies a more aggressive fusion plan than Algorithm 1. The number of fusion groups suggested by Algorithm 2 is equal to its number of invocation times. The *mixed schedules* should be the union of the tiling schedule affine relations (10) and the extension schedule affine relations (15) for the example in Figure 1(a). The tiling schedule maps each statement instance of the reduction space to a lexicographical execution date; the extension schedule defines a mapping over the tile dimensions  $(o_0, o_1)$  to the statement instances of  $S_1$ , which has to be integrated with post-tiling fusion to apply loop tiling to the quantization space.

## 5.1 Facilitating Post-Tiling Fusion Using Schedule Trees

We leverage schedule trees [25] to implement post-tiling fusion by modifying the output of Algorithm 1 using the result of Algorithm 2. Let us first continue with the example for an illustration purpose. The schedule tree shown in Figure 3(b) is the result of applying Algorithm 1, with the fusion strategy  $(\{S_1\}, \{S_2, S_3, S_4\})$  represented using the children of the top sequence node.

According to Algorithm 2, we first apply rectangular tiling using affine relations (10) to the reduction space, which is represented as the second child of the top sequence node. affine relations (6)



Fig. 10. The schedule tree of post-tiling fusion.

in Figure 3(b) should be replaced by this tiling schedule, which is in turn split into two parts, with one represented as

$$tile\_band := \{S_2(h, w) \to (h/T_2, w/T_3); S_3(h, w, kh, kw) \to (h/T_2, w/T_3); S_4(h, w) \to (h/T_2, w/T_3)\},$$
(18)

and the other as

$$point\_band := \{S_2(h, w) \to (h, w); S_3(h, w, kh, kw) \to (h, w, kh, kw); S_4(h, w) \to (h, w)\}. \tag{19}$$

This isolates the tile dimensions, as shown in Figure 10, making it possible to implement tilewise fusion between iteration spaces or the functionality of the  $compute\_at$  primitive of Halide/TVM. affine relations (18) represents the dimensions iterating among tiles, and affine relations (19) denotes the dimensions iterating within tiles. Those nodes between affine relations (18) and affine relations (19) are introduced to implement the post-tiling fusion. Note that we have not yet applied tiling to the quantization space; the first filter node  $\{S_1(h, w)\}$  still uses affine relations (5) as its band node. However,  $S_1(h, w)$  is foreign to the subtree rooted under the second filter node. We mentioned in Section 2.2 that we would use extension nodes to facilitate post-tiling fusion. We extend the expressiveness of an expansion node to introduce additional statements under a filter node.

As shown on the left of Figure 10, the extension node is inserted underneath affine relations (18), with its content filled using affine relations (17). This simple manipulation on the schedule tree implements overlapped tiling of  $S_1$  and the tile-wise fusion of the original two iteration spaces. Such an extension to the expressiveness of the schedule tree representation makes it possible to implement post-tiling fusion in the polyhedral model, which is not possible in existing polyhedral compilation frameworks [59, 62] that also use schedule trees.

One can now schedule the statement instances of  $S_1$  under the extension node. As we expect to implement tilewise fusion, a sequence node has to be introduced underneath the extension node. The first child should be the filter node of the original quantization space (i.e.,  $\{S_1(h, w)\}$ ), whereas the second should be the original reduction space. The subtree rooted at  $point\_band$  is attached under the new introduced filter node  $\{S_2(h, w); S_3(h, w, kh, kw); S_4(h, w)\}$ . The original subtree rooted at affine relations (5) of the extended filter node  $\{S_1(h, w)\}$  is also introduced to instruct how the extended filter node is scheduled. Duplicating such subtrees guarantees that the multiple statements encompassed by the extended filter will be scheduled together. Introducing a sequence node under affine relations (18) also benefits for the intra-tile distribution transformation [70], which is used to exploit the spatial locality in small scratchpads or shared memories.

5:20 J. Zhao et al.

However, the schedule tree executes (at least part of) the statement instances of  $S_1$  twice with its current form, which may generate incorrect code. In particular, the iteration space of  $S_1$  will not be fully covered by the range of affine relations (17) when Algorithm 2 constructs a scattered tile shape for  $S_1$  and the extension node in Figure 10 only fuses part of  $S_1$ 's statement instances. In this case, the fused statement instances should not be executed by the original subtree of  $S_1$ , whereas those not covered by the range of Algorithm 2 should stay unchanged. To achieve this, we compute the following affine sets,

$$reserved := \{S_1(h, w) \cap affine sets (4)\} \setminus range affine relations (17),$$
 (20)

which subtracts the range of affine relations (17) from the iteration space of  $S_1$  that is computed by an intersection  $\{S_1(h, w) \cap \text{affine sets (4)}\}$ . affine sets (20) is then used to replace the original filter node of  $S_1$ , as shown in Figure 10.

An alternative to introducing affine sets (20) into the schedule tree is inserting a mark node below the original filter node of  $S_1$ . A mark node can attach a string to instruct the code generator to bypass the subtree rooted at this mark node. The benefit of introducing a mark node resides in the possible dead code elimination for image processing pipelines and deep learning applications. However, a safer handling should use affine sets (20) in case the statement instances not fused by an extension node possibly define live-out data in more general scenarios.

## 5.2 The Post-Tiling Fusion Algorithm

Algorithm 3 formally describes the post-tiling fusion strategy, whose detailed explanation is offered in Appendix A.2. It returns a fusion strategy of  $\{S_1, S_2, S_3, S_4\}$  for the illustrative example, and the tiled and fused code is shown on the right of Figure 10. The red and blue arrows represent the relations between band nodes with the loops they represent. Unlike the code shown in Figure 1(b), this code fuses all three loop nests into one group, allowing tensor A to be allocated in small scratchpads. As affine sets (20) is empty, the first filter node does not generate any instructions. In addition, the post-tiling fusion strategy does not lose the parallelism of the fused loop dimensions, and one can add an OpenMP pragma before the outermost loop when targeting CPUs. When generating CUDA code for GPUs, the entire loop nest can be executed by launching a single kernel, with ht, wt mapped to thread blocks and each pair of hp, wp mapped to threads, and tensor A can be declared in the share memory.

## 5.3 Evading Redundancy

So far, we have always assumed that there exists only one output iteration space. We now discuss the case of multiple output iteration spaces.

Without loss of generality, we assume that there exists two output iteration spaces:  $output_0$  and  $output_1$ . The intermediate iteration spaces can be divided into three categories: the first is composed of all intermediate iteration spaces of  $output_0$ , the second is a collection of those used by  $output_1$ , and the third consists of all intermediate iteration spaces that used by both. The difficulty is how to handle an intermediate iteration space of the third category.

Consider the scenario shown in Figure 11(a) where the values defined by  $op_0$  are used by both  $op_1$  and  $op_2$ . We use  $op_0'$  to represent the subset of  $op_0$  that computes the values used by  $op_1$ , and  $op_0''$  to represent the subset that writes to the values read by  $op_2$ . With Algorithm 3, one can still apply post-tiling fusion as shown in Figure 11(b) when  $op_0'$  and  $op_0''$  do not intersect with each other, which does not bring about redundant



Fig. 11. Fusion strategy for multiple uses.

## ALGORITHM 3: The post-tiling fusion algorithm

```
Input: (1) schedule tree—Output of Algorithm 1;
           (2) mixed schedules—Output of Algorithm 2
 1 foreach schedule in mixed schedules do
       if schedule is a tiling schedule then
 2
            band ← Replace the original band node using schedule;
 3
            m \leftarrow Number of parallelizable loops in schedule;
 4
            tile_band, point_band ← Split band into tile dimensions and point dimensions;
 5
            intermediates ← All intermediate iteration spaces that are to be fused;
            foreach i in intermediates do
                schedule \leftarrow Extract all extension schedules related to i like affine relations (15) from mixed
 8
                schedules;
                n \leftarrow Number of parallelizable loops in schedule;
                if m > n then
10
                     Replace the union of extension schedules related to i with a tiling schedule in mixed
11
                     schedules;
                     continue:
12
                Insert an extension node to schedule tree;
                Insert sequence and filter nodes to schedule tree;
14
                reserved \leftarrow Compute the affine sets that represent the statement instances that still need
15
                to be executed, like affine sets (20);
                Insert reserved to schedule tree;
16
   Output: schedule tree-A schedule tree after tiling and fusion
```

computations. Formally, we construct  $h_1$  and  $h_2$  like affine relations (15) to denote the relations between  $op_1$  and  $op'_0$ ,  $op_2$  and  $op''_0$ , respectively, and use

range 
$$h_1 \cap \text{range } h_2 = \emptyset$$
 (21)

as the condition to guide the fusion without redundancy in this case. One point to care about is the repeated application of affine sets (20), which subtracts the set of statement instances of  $op'_0$  and  $op''_0$  from op.

Another opportunity that does not introduce redundancy is when  $op = op'_0 = op''_0$  in Figure 11. This scenario is seen in machine translation applications that change a given sentence into another language using the self-attention mechanism [60]. The *attention mechanism* is widely used to find the words of importance when given a query word. It uses three input vectors to compute the relationships, which are used to translate the query word within in the context, between the query and each word of the sentence. This is done by performing multiple matrix multiplications, each of which takes as input the same data from the input vectors. A matrix multiplication in such tasks is usually followed by an activation function, and sometimes an input vector is processed by another elementwise operator, leading to a computation pattern similar to that shown in Figure 11(a).

To enable fusion in this scenario, we still build  $h_1$  and  $h_2$  as explained previously. The equality can be inspected by checking whether

$$range h_1 = range h_2 \tag{22}$$

holds, and one can establish the affine relations between tile coordinates of two output iteration spaces by composing  $h_1$  and the reverse of  $h_2$  or vice versa. The tile coordinates of  $op_1$  or  $op_2$  can also be mapped to a set of statement instances by reversing affine relations (11), which can be further composed with the affine relations between tile coordinates of the two output iteration spaces. As such, multiple output iteration spaces can also be fused in such cases.

5:22 J. Zhao et al.

We did not observe any other fusion opportunities that do not introduce redundant computations except the two scenarios discussed earlier. To avoid redundancy, we currently do not allow fusion when the intersection of  $op'_0$  and  $op''_0$  is not empty, which will produce redundancy. We also prevent fusion when  $op_0$  cannot be fused to either of its uses since the generated code cannot benefit from aggressive memory optimizations. In summary, the preceding decision lets our basteln strategy never introduce redundancy to the code when handling the cases of multiple output iteration spaces. Algorithm 4 describes the basteln strategy, and its description is provided in Appendix A.3.

## ALGORITHM 4: The basteln strategy

```
Input: schedule tree—Output of Algorithm 1
 1 domain ← Domain node of schedule tree;
 2 groupsset ← Ø;
 3 foreach output in domain do
        spaces ← output and its intermediate iteration spaces;
        groupsset \leftarrow groupsset \cup Apply Algorithm 2 to spaces;
   foreach sharedspace in groupsset do
        intersect \leftarrow Compute the intersection for shared space like affine sets (20);
 7
        if intersect = \emptyset then
 8
             continue;
        else if intersect = sharedspace then
10
             oset \leftarrow The set of output iteration spaces depend on sharedspace;
             o \leftarrow \text{Extract} one iteration space from oset;
12
             oset \leftarrow oset - o;
13
             h_1 \leftarrow \text{Compute affine relations between } o \text{ and } shared space like affine relations (15);}
14
             foreach s in oset do
15
                 h_2 \leftarrow \text{Compute affine relations between } s \text{ and } shared space like affine relations (15);}
16
                  t \leftarrow Compute affine relations over statement in s to its tile dimensions like affine
17
                  relations (11);
                  shape \leftarrow h_1 \cdot h_2^{-1} \cdot t;
18
                  Propagate shape to spaces with s as its output iteration space;
19
20
        else
             Replace each extension schedule of sharedspace with a tiling schedule;
21
   foreach groups and sharedspace in groupsset do
22
        schedule tree \leftarrow Apply Algorithm 3 to groups;
23
        if sharedspace cannot be fused with its uses then
24
             Remove the introduced nodes related to sharedspace in schedule tree;
25
   Output: schedule tree—After tiling and fusion
```

## 5.4 Generating Single Fusion Group

By evading redundancy, Algorithm 4 may still leave some output iteration spaces not merged with each other. The system design of optimizing compilers, however, sometimes allows for slight redundant computations, and a refinement here is to define a threshold for the percentage of affine sets (21) over the full iteration space of  $op_0$ . In particular, a sub-graph of a deep learning application is expected to be executed by a single GPU kernel, for which the threshold can be set to 100% to perform fusion. That is to say, we do not care about how many recomputations are introduced in such cases. When such a percentage is smaller than this threshold, post-tiling fusion like in Figure 11(b) is still possible.

Another scenario not considered by Section 5.3 is the case of *independent groups*. Specifically, Algorithm 4 performs both pre-tiling and post-tiling fusion algorithms by making use of the dependences between iteration spaces. In some cases, the result of Algorithm 4 may be composed of multiple independent fusion groups, where there exist no dependences between any pair of iteration spaces from two fusion groups. In other words, independent groups do not have a common data space as shown in Figure 6 and the basteln strategy cannot compute affine relations (15) by composing affine relations (13) and the reverse of affine relations (14). The typical example that leads to such a scenario is the batch normalization operation in deep learning. This scenario is also outside the scope of existing polyhedral fusion heuristics [3, 12], since they also exploit loop fusion based on data dependence graphs.

We can still construct relations like affine relations (15) for independent groups by specifying the same tile sizes to each of them. Without loss of generality, we suppose that there exists  $op_1$  and  $op_2$  and they do not depend on each other. The affine relations like affine relations (11), namely  $t_1$  and  $t_2$ , associate the statements in  $op_1$  and  $op_2$  to their  $d_1$ -and  $d_2$ -dimensional spaces of tile coordinates. As we use the same tile sizes, the first  $d_2$  ( $d_2 < d_1$  by assumption) dimensions of each tile coordinate space are identical. The extension schedule of  $op_2$  can be expressed using  $t_2^{-1}$ . To perform post-tiling fusion, the tile band of  $op_1$  is first split into two,  $b_1$  (with  $d_2$  members) and  $b_2$  (with  $d_1 - d_2$  members), and the extension node with  $t_2^{-1}$  as its content is introduced between  $b_1$  and  $b_2$ . The implementation can follow Algorithm 3 by considering  $op_1$  as the output iteration space and  $op_2$  as its intermediate iteration space.

# 5.5 Potentials and Limitations of the Approach

Now we discuss the potentials and limitations of the basteln strategy. In this article, we introduced Algorithm 1 as the starting fusion of the basteln strategy. Even though Algorithm 1 makes our approach independent of *isl's* fusion heuristics, its functionality can also be achieved by Algorithm 2. We design Algorithm 1 to handle the fusion with simple dependence patterns such that the algorithmic flow of the basteln strategy can be modularized and thus simplified.

The tile shapes constructed by Algorithm 2 can be rectangular/parallelogram, overlapped, and scattered. In the first case, the basteln strategy generates the same composition of tiling and fusion implemented in general-purpose compilers [13, 24, 62]. In the second case, our approach enables tilewise concurrent start [39] by minimizing the recomputations required by overlapped tiling. In the last case, the set of not merged statement instances in an intermediate iteration space is also reserved in case their data are accessed outside the considered program fragment.

As introduced in Section 2.3, the proposed basteln strategy requires the presence of producer-consumer relations across loop nests. Algorithm 4 thus cannot construct complex tile shapes for a single loop nest of stencils that was studied by existing techniques [11, 23, 39]. One can unroll the time dimension of a stencil kernel to transform it into multiple loop nests, to which our work is still applicable. The approach presented in this work is well suited for a wide set of applications with multiple loop nests, as will be demonstrated in the experiments.

#### **6 CODE GENERATION**

We implement the basteln strategy using the *isl* library [61] due to its ability to generate AST by scanning schedule trees. This allows us to generate code for different architectures by first generating AST and then converting the AST to imperative code or **Intermediate Representation (IR)** using a pretty-print scheme. The PPCG compiler [62] is a polyhedral code generator that wraps *isl* for manipulating integer sets/maps and generating AST; it finally converts the AST generated by *isl* to OpenMP C code or CUDA code.

5:24 J. Zhao et al.

Our algorithms are integrated into PPCG to generate OpenMP code for CPU and CUDA code for GPU. We also integrate the algorithms with AKG [68], an optimizing compiler for deploying deep neural networks, which originally targets a dedicated accelerator [42] but is added with a CUDA backend, as will be introduced in Section 6.2. Note that our approach can be implemented in other tools that use *isl*. For example, it can be integrated into Tensor Comprehensions [59] to generate CUDA code for deep neural networks.

PPCG can identify parallel regions in the C code which is converted from the AST generated by *isl*, and an *omp parallel* pragma can be added before each parallel region automatically. Figure 10 shows the OpenMP code generated by our approach. A weakness of PPCG's OpenMP backend is that it cannot exploit automatic vectorization. We identify the innermost parallel loop and add an *ivdep* directive for indicating the absence of loop-carried dependences. The compilers used to compile the generated OpenMP code like Intel *icc* can thus execute the innermost loop with SIMD instructions.

Generating CUDA code for GPUs requires the ability to map parallel loops to thread blocks and threads on GPUs. The CUDA backend of the PPCG compiler models the GPU mapping by leveraging mark nodes of schedule trees. The outermost parallel band node *tile\_band* in Figure 10 is marked using a *kernel* string, which instructs the code generator to map the *ht* and *wt* loops represented by *tile\_band* to GPU thread blocks. A *thread* mark is introduced before the *point\_band* node and the *band*<sub>0</sub> node, which tells the code generator to map the *hp* and *wp* loops to GPU threads.

## 6.1 Domain-Specific Code Generation

AKG borrows DSL of TVM [14] for expressing tensor computation to generate code for neural networks on domain-specific accelerators. AKG can take as input a deep learning model expressed using popular deep learning frameworks like TensorFlow [1] and PyTorch [49]; it can also work with MindSpore [31], a full-stack, all-scenario deep learning framework develop by Huawei. The DSL will then be transformed into HalideIR, which can be optimized and scheduled either by an expert that is familiar with the target architecture using the schedule primitives provided by the framework, or by the AKG compiler. For example, one can apply loop tiling using the *tile* primitive or loop fusion with the *fuse* primitive. However, an expert is responsible for guaranteeing the validity of the transformations he/she applies.

Previously, AKG only targeted a dedicated accelerator to boost neural networks—Huawei Ascend 910 chip, of which the DaVinci architecture [42] is depicted in Figure 12. Cube Unit is a specialized execution unit for performing tensor/matrix operations by taking as input the data in L0A and L0B, of which the output is stored into L0C. L0A/L0B can fetch data from L1 Buffer. The data in L0C can also be transferred to Vector Unit. Vector/Scalar Units are designed for executing vector/scalar operations. They are allowed to read/write data from/to Unified Buffer, which in turn can exchange data with L0C. L1 Buffer and Unified Buffer are serving as on-chip lower-level caches and used for exchanging data with external memory, which is not shown in Figure 12. Data exchange between L1 Buffer and Unified Buffer is permitted.

The programming model of the accelerator is designed by fully considering the domain-specific properties of applications and the underlying architecture. For example, a convolution operator can be implemented by emitting a single vector instruction using the programming model. The generated CCE code will be compiled with native compilers on the chip, with the same compilation options set for all the versions used in the experiments.



Fig. 12. Overview of the DaVinci architecture.

Unlike a manual scheduling approach, we introduce another pass in AKG by converting HalideIR into the schedule tree representation, which will be optimized using our approach. The output schedule tree will be transformed back to HalideIR for the follow-up code generation. The target imperative code of this accelerator, which we refer to as CCE code, is then generated from the automatically optimized HalideIR. Generating CCE code using *isl* can also be conducted by scanning AST without the involvement of HalideIR. We introduce the conversion pass of AST to HalideIR to provide the compatibility with possible low-level transformations on HalideIR that are outside the scope of the polyhedral model. For example, the memory latency hiding strategy was introduced by TVM to optimize the fine-grained synchronizations between pipelined executed statements. This optimization, which is implemented on top of the optimized HalideIR, is also integrated into AKG. The CCE code generation workflow of AKG is also composed of some preprocessing steps to make a program amenable to the polyhedral model, each of which is enabled by default.

## 6.2 The CUDA Backend of AKG

Deep neural networks are usually expressed in the domain-specific deep learning frameworks. We used PPCG to generate CUDA code for the benchmarks written in a general-purpose language, but PPCG does not provide the compatibility with deep learning frameworks, nor the ability to preserve the domain-specific knowledge lowered from the latter. It thus fails to generate CUDA code for deep neural networks. Meanwhile, AKG did not support code generation for GPU at the time of its publication [68]. These together make the experiments on deep learning models not available before, and the comparison between the polyhedral approach with highly tuned CUDA libraries and TVM not clear.

As GPU is still a mainstream platform to accelerate the training and inference tasks of deep neural networks, we implement a CUDA backend for the AKG compiler to address this issue. Following the workflow of CCE code generation for Huawei Ascend 910 chips, we reroute the lowering flow of HalideIR obtained from then high-level tensor expression language to the schedule tree representation, and progressively perform loop transformations like what Tensor Comprehensions [59] does for generating CUDA code.

Specifically, we first build an initial schedule tree from the HalideIR expression and leverage the scheduling algorithm of *isl* to generate a new schedule tree for each sub-graph, with Algorithm 4 used to implement the basteln strategy. Binding the parallel loops with the two-level hardware abstraction of GPU is implemented in a similar way to PPCG. The schedule tree is then converted into AST, which is again transformed back to HalideIR. The target CUDA code is finally generated from the optimized HalideIR. The reason we leverage the optimized HalideIR to generate final imperative code was explained in Section 6.1. The most significant difference between the CUDA backend of AKG and Tensor Comprehensions lies in the proposed basteln strategy.

#### 6.3 Inlining Elementwise Operators

Operator inlining is a technique used to substitute a tensor expression at the place it is used. It is an effective optimization to reduce the number of intermediate computations, but this technique was not considered by AKG when generating code for the deep learning accelerator it targets. This is because the programming model used by the dedicated accelerator requires the program expressions to be written in the form of three-address code. This restriction does not exist when generating CUDA code for GPU. Our approach results in more intermediate computations, the number of which can be decreased by operator inlining.

A large number of operator types are used in deep neural networks. Inlining each kind of operators is impractical. It is usually safe to inline an elementwise operator to its consumers. We thus mainly focus on the inlining of elementwise operators. However, excessive inlining may result in

5:26 J. Zhao et al.

redundant computations. We define that an elementwise operator can always be propagated to its user in the same fusion group if the operator has only one consumer. Propagating the expression of an elementwise operator followed by multiple consumers introduces redundant computation when the consumers read overlapped regions from the data space written by their common producer. We do not allow the inlining of an elementwise operator when such overlapped accesses from its multiple users happen. Inlining is still possible when multiple consumers of an elementwise operator require distinct data regions. The inlining algorithm is summarized in Algorithm 5 and explained in Appendix A.4. Note that operator inlining can be implemented within a generalized fusion pass like that of MLIR [40]. We isolate this functionality from our fusion algo-

**ALGORITHM 5:** The inlining algorithm for elementwise operators

```
Input: groups—The fusion configuration
 1 foreach g in groups do
        while g \neq \emptyset do
              op \leftarrow \text{Extract one operator from } g;
 3
4
              g \leftarrow g \setminus op;
 5
              if op is not an elementwise operator then
                  continue;
              num \leftarrow The number of consumers of op;
              if num = 0 then
8
                   continue:
              else if num = 1 then
                  Do inlining:
              else
                   if overlapped accesses then
13
                        continue:
14
                   else
                        Do inlining;
   Output: groups after inlining
```

rithms because the latter is performed on top of schedule trees while tensor substitution is carried out when this polyhedral representation is converted into another IR.

# 6.4 Aggressive Memory Optimizations

The basteln strategy maximizes data locality without hampering tilability/parallelism of output iteration spaces, but compilation optimizations may not be very effective without storage reductions for the memory hierarchy due to the streaming nature of applications.

The values produced by an intermediate iteration space are only used within a tile and can thus be discarded after the computation of a tile. We automatically allocate such values in small scratchpads when generating OpenMP code and the original memory allocation is eliminated, with the memory footprint of a tile used to compute the size of an allocated buffer. The indexing expressions are determined using the ranges of the affine relations generated by Algorithm 2.

The CUDA backend of PPCG provides us a software-controlled scheme to use the shared/private memory on GPU effectively. PPCG computes an over-approximated rectangular box for complex tile shapes that access non-rectangular blocks of data and therefore enables the allocation of the intermediate values in the shared/private memory. This strategy is also used by existing compilation techniques [23, 64] and has been integrated into the CUDA backend of AKG.

We also automate the memory promotion to higher-level caches of the DaVinci architecture using schedule trees. Explaining the details of memory promotion is outside the scope of this work. Generally speaking, we resort to mark nodes of schedule trees for managing the data flow between different computation units and extension nodes for the memory optimization and allocation, just like what Tensor Comprehensions [59] does for generating CUDA code. The integration of *isl* into AKG implements the deployment of a neural network on Ascend 910 by only manipulating schedule trees, facilitating the fully automatic compilation of deep neural networks.

#### 6.5 The Auto-Tuning Strategy

While enabling the construction of complex tile shapes, our approach does not model the selection of the best tile sizes in the polyhedral model. Tuning tile sizes is another essential step to achieve better performance, but it is outside the scope of polyhedral compilation. We use an automatic,

machine learning based tuning strategy to select the best tile sizes. The tuning space for a tensor computation is usually huge. Our approach can narrow the tuning space because the number of tile sizes to be tuned can be reduced. The auto-tuning strategy works as follows.

The code generator first constructs an initial tuning space using the tile sizes specified by the user using the DSL or the default values embedded in AKG. The auto-tuner uses a two-round sampling strategy of tile sizes to select the best ones. In the first round, samples are extracted randomly, and the performance of each sample is evaluated. The results are then used to train a machine learning model to guide the selection of the samples in the next round. In the second round, a sample is chosen in either of the following two ways. It is derived, with probability p, from one of the top N samples that perform best among all candidates in the first round, or it is selected randomly with probability 1-p from the complete set of the first-round samples. The performance of each second-round sample is also measured to update the machine learning model. The auto-tuner iterates this two-round sampling strategy until a pre-defined threshold is hit or performance is no longer improved.

The probability p is determined by a formula that varies during the sampling process. Another initialized parameter, which is set to 0.5 in our experiment, is used to define the formula of p. As a result, p is always ranging from 0 to the exponential constant e. The N is set to 64. The autotuner finally returns the tile sizes that can obtain the best performance in the tuning process. This guarantees that resulting tile sizes are much better than those specified by the user of default enabled in the compiler, but this tuning strategy is not meant to find the optimal sizes. In fact, determining the optimal tile sizes are still an open problem. We adopt an auto-tuning strategy because this is a practical and effective solution to this problem, which has been integrated into a variety of state-of-the-art tools. This auto-tuner is used by both backends of AKG.

#### 7 EXPERIMENTAL EVALUATION

We evaluate the effectiveness of our approach by conducting experiments on many applications extracted from the domains of deep learning and image processing. The deep neural networks we use in the experiments will be introduced in Section 7.2. The image processing pipelines are obtained from the repository of PolyMage benchmarks [10]. In addition, we apply our approach to the benchmarks from PolyBench and SPEC CPU2000 [18] to validate its general applicability.

The platform for evaluating OpenMP code is a 32-core, dual-socket workstation, of which each CPU is a 2.10-GHz 16-core Intel Xeon E5-2683 v4. The OpenMP code is compiled with Intel *icc* compiler 18.0.1 with options -*qopenmp* -*ipo* -*O*3 enabled. AKG is used to generate CUDA code for deep neural networks. The generated CUDA code is compiled by the NVIDIA CUDA toolkit version 11.4 with -*O*3 flag and the executable is run on an NVIDIA Tesla V100 GPU, with cuBLAS version 11.4 and cuDNN version 8.6.0 used. PPCG is used when experimenting the remaining benchmarks by targeting an NVIDIA Quadro P6000 GPU. Each benchmark is executed 11 times with the first run used as a warm-up execution and discarded. We report the average of the remaining 10 executions of each benchmark. By considering seven possible tiles sizes including 8, 16, 32, 64, 128, 256, and 512 for each dimension, the PolyMage framework uses an auto-tuning strategy for tile size selection. Such auto-tuned tile sizes are also listed in Table 1.

#### 7.1 Performance on CPU

The benchmarks we use in the experiments cover numerous application domains. We report the results in accordance with their application domains. As training deep neural networks is usually performed on GPU or dedicated accelerators, we have not yet implemented a code generator for CPU. The evaluation of deep learning models is thus not included in this section.

5:28 J. Zhao et al.

| ch.  | es     | tile            | GPU<br>grid<br>params  |          |            |            |            | GPU execution time |        | Compilation time |             |      | Speedup over |            |            |               |               |
|------|--------|-----------------|------------------------|----------|------------|------------|------------|--------------------|--------|------------------|-------------|------|--------------|------------|------------|---------------|---------------|
| benc | stages |                 |                        | naïve    | PolyMage   | Halide     | Our work   | PPCG               | Halide | Our              |             |      |              | Our        | PolyMage   | Halide        | Halide        |
| þe   | st     |                 |                        | (1 core) | (32 cores) | (32 cores) | (32 cores) | (min)              | папае  | work             | min smart m | шах  | work         | (32 cores) | (32 cores) | (GPU)         |               |
| BG   | 7      | $8 \times 128$  | $8 \times 64$          | 66.01    | 5.57       | 4.23       | 4.11       | 5.07               | 3.79   | 4.09             | 0.15        | 120  | > 24h        | 0.86       | 1.36×      | 1.03×         | 0.93×         |
| CP   | 32     | $64 \times 256$ | $16 \times 32$         | 116.32   | 4.68       | 4.76       | 4.40       | 3.51               | 2.47   | 2.38             | 0.15        | 120  | > 24h        | 4560       | 1.36×      | 1.03×         | $0.93 \times$ |
| HC   | 11     | $32 \times 256$ | $16 \times 32$         | 246.88   | 5.10       | 10.71      | 5.10       | 1.79               | 1.68   | 1.60             | 0.03        | 0.06 | 0.12         | 435        | 1.00×      | $2.10 \times$ | $1.05 \times$ |
| LF   | 99     | $8 \times 256$  | $8 \times 64$          | 480.48   | 35.35      | 29.12      | 27.08      | 16.73              | 12.53  | 11.12            | 6.94        | 90.8 | > 24h        | 89.3       | 1.31×      | 1.08×         | 1.13×         |
| MI   | 49     | $32 \times 128$ | $32 \times 16$         | 209.10   | 16.44      | 20.07      | 14.87      | 15.75              | 25.65  | 13.37            | 0.68        | 1.40 | > 24h        | 3.30       | 1.11×      | 1.35×         | 1.92×         |
| UM   | 4      | $8 \times 512$  | $8 \times 32 \times 3$ | 142.16   | 5.01       | 5.02       | 3.68       | 2.03               | 1.94   | 2.01             | 0.06        | 0.08 | 0.10         | 0.05       | 1.36×      | 1.36×         | $0.97 \times$ |

Table 1. Results of the PolyMage Benchmarks

minfuse, smartfuse, and maxfuse are denoted using min, smart, and max, respectively. Execution time is in milliseconds; compilation time is in seconds. BG, Bilateral Grid; CP, Camera Pipeline; HC, Harris Corner Detection; LF, Local Laplacian Filter; MI, Multiscale Interpolation; UM, Unsharp Mask.

7.1.1 Image Processing Pipelines. An image processing pipeline performs a given task on input images, using a variety of operations like stencils and complex reductions. We use six image processing pipelines extracted from the PolyMage benchmarks, which vary widely in structure and complexity. One of the PolyMage benchmarks is not considered in this work, as it is not found in the repository of Halide. Table 1 lists the PolyMage benchmarks considered in this work. We compare the performance with a domain-specific compiler, PolyMage [44], and the auto-tuned Halide schedules [26]. The OpenCV version is 2.4.9.1. The PolyMage compiler generates both naïve and optimized OpenMP codes by taking a DSL as input. The sequential code of a naïve version is used as the baseline and also the input of PPCG that implements our approach, as a naïve version is generated by PolyMage without applying tiling or fusion. On the contrary, an optimized version is generated by fully exploiting fusion and overlapped tiling opportunities.

The tile sizes, vector lengths, and unroll factors of both PolyMage and Halide have been tuned for our platform. We use the same auto-tuned tile sizes for the output iteration spaces when possible and keep the code generation parameters identical with PolyMage for a fair comparison. This isolates the effect of the fusion strategies and tile shapes. The tile sizes and the execution times of different versions are also reported. One can obtain the speedups of over PolyMage and Halide using such numbers, which are shown in Figure 13. The geometric means of improvements over PolyMage and Halide are 24% and 20%, respectively.

The performance improvements over these state-of-the-art approaches come from tighter over-lapped tile shapes and aggressive fusion strategies. We first explain why the basteln strategy can compute tighter tile shapes. In Halide's DSL, the bounds of an intermediate iteration space are specified as infinite; its compiler and auto-scheduler [4] infer bounds of these intermediate iteration spaces using interval analysis. As such, Halide's bound inference strategy cannot (and does not have to) compute affine relations like affine relations (15) that may result in non-rectangular loop nests. The result of interval analysis, however, may be an over-approximation of affine relations (15), thus probably introducing more recomputations than our approach. PPCG that has integrated the basteln strategy takes as the naïve code generated by PolyMage and written in C as input, not allowing the specification of a loop nest with no bounds. We have to consider bound inference of Halide from a more general perspective and use the polyhedral model, which can construct affine relations (15).

PolyMage also produces looser overlapped tile shapes. Its DSL considers an image processing pipeline as a directed acyclic graph, with each node representing an image processing function and every edge a dependence between a pair of nodes. Next, PolyMage determines the number of nodes or functions, say h, that should be fused. The tiling schedules of these h functions are expressed as affine expressions of h. To mitigate the effect of recomputations caused by overlapped tiling, PolyMage investigates all dependence vectors between each pair of consecutive functions. A lower bound of a producer iteration space is determined by the minimum non-negative vectors,



Fig. 13. Performance of PolyMage benchmarks on CPU (x-axis: no. of threads; y-axis: speedup over naïve code).



Fig. 14. Comparison of tile shape constructions.

and its upper bound is determined by the minimum non-positive vectors. Figure 14 depicts a shape constructed by PolyMage in green, an example used in the PolyMage paper [44].

However, a minimum non-negative/non-positive dependence vector sometimes is not the rightmost/leftmost one among all dependence vectors between the considered pair of functions, and the overlapped tile shapes constructed by PolyMage are still not the tightest ones. Consider the functions  $f_{out}$  and  $f_{\uparrow}$ . There exists two data dependences, one upward and the other upper right. PolyMage determines the lower bound of  $f_{\uparrow}$ 's overlapped tile shape by considering the dependence along the upper right direction, but the rightmost dependence is the upward one. The overlapped tile shape obtained by PolyMage is looser than the shape imposed by dependences, as represented by the cyan region in Figure 14. affine relations (15) avoids this issue by exactly expressing the dependences between the iteration spaces, which produces tighter overlapped tile shapes (i.e., the cyan region) than PolyMage, making the basteln strategy outperform PolyMage for Camera Pipeline even when the two approaches find the same fusion strategy. The fusion strategy of these two approaches in turn is better than that found by Halide.

Second, we explain how our approach suggests aggressive fusion strategies. PolyMage executes tilewise fusion once the tile sizes of the h functions have been determined, but it does not consider reduction operations during this process. Our basteln strategy, as well as Halide, takes this into

5:30 J. Zhao et al.

account and produces a more aggressive fusion strategy for Bilateral Grid [48] than PolyMage. A manual schedule specification enabling post-tiling fusion between image processing functions is possible in Halide and can be used as the search start of its auto-tuner [4, 54]. The auto-tuning heuristic of Halide next searches new schedules, as well as tile sizes and parallel/vectorize labels of each loop dimension. The search start has a great impact on the final result. However, the complexity of writing manual schedules as the search start of Halide increases with the growth of the number of functions. As a result, the tuned Halide schedules fail to generate fusion strategies found by PolyMage and our work for Multiscale Interpolation, Local Laplacian Filter [47], and Unsharp Mask.

PolyMage and our work can also automatically apply an inlining transformation to Harris Corner Detection [27], which was missed by the manual schedule of Halide. This inlining transformation results in two remaining functions, and our work generates the same code as PolyMage and thus obtains the same result, outperforming Halide's manual schedule by 2.1×. The result of this example can be considered as an ablation study of the operator inlining algorithm.

7.1.2 Finite Element Method. The benchmark equake [7] is extracted from SPEC CPU2000. It performs a finite element method using a 3D sparse matrix-vector (SpMV) computation. The 3D SpMV computation updates an unstructured mesh using a reduction array, which is followed by a group of affine loop nests performing elementary operations on the mesh. The imperfect loop nest of the 3D SpMV computation consists of three components, with the first one initializing the reduction array, the second performing reduction using a while loop, and the third gathering the reduction variables to update the global mesh. The reduction step involves a dynamic condition along the second dimension due to the use of a while loop, which is handled by PPCG as a black box. PPCG distributes each of these loop nests and parallelizes each of the outermost loop using the minfuse fusion strategy. This version is used as the baseline.

One can manually permute the while loop into the innermost dimension to create fusion opportunities for PPCG that provides three different fusion heuristics. The default heuristic that is represented as *smartfuse* tries to maximize fusion without hampering the parallelism or tilability. A more conservative strategy, *minfuse*, does not fuse any loop nests. The most aggressive heuristic maximizes fusion regardless of the parallelism or tilability, represented as *maxfuse*.

The *smartfuse* fuses the three components of the 3D SpMV computation together. On the contrary, *maxfuse* fuses the gathering component with the follow-up affine loop nests. The fusion strategy found by our approach is identical to that of *maxfuse*. The speedups over the baseline version of different fusion heuristics are shown in Figure 15, with the *x*-axis representing the problem sizes. As only the outermost loop is tilable, all versions did not apply loop tiling. Algorithm 2 returns an extension schedule with an empty domain, allowing the fusion without loop tiling. This also validates that our post-tiling fusion scheme is applicable without tiling.



Fig. 15. Performance of equake on CPU (32 cores).

Without the manual permutation of the while loop, PPCG cannot exploit loop fusion due to the dynamic condition introduced by the while loop. However, this permutation transformation is harmful to data locality, which makes the performance of PPCG's fusion heuristics fall behind our approach. Our approach does not require such manual permutation as a preprocessing step.

7.1.3 Linear Algebra and Data Mining. The benchmarks extracted from PolyBench are summarized in Table 2. These benchmarks are selected by considering the following criteria. First, a benchmark is composed of multiple loop nests such that loop fusion will make sense. Second, a polyhedral compiler may find different fusion strategies using an aggressive fusion heuristic; otherwise, our approach may generate the same code as the default fusion heuristic.

PolyBench is a collection of micro kernels for linear algebra, stencil computation, and physical simulation. Twenty out of the 30 PolyBench benchmarks are excluded since they do not require loop fusion due to the structure of perfectly nested loops. Our approach generates the same fusion results as *smartfuse* for 3 of the remaining 10, which are not considered in the evaluation. This validates that the basteln strategy falls back to *smartfuse* in the worst case. We choose three representative kernels that our approach generates different fusion results from *smartfuse*. The others perform similarly to those shown here.

Table 2. CPU Execution Time of the PolyBench Benchmarks

|            | 2mm  |     |     | gemver |      |       | covariance |     |     |
|------------|------|-----|-----|--------|------|-------|------------|-----|-----|
| Threads    | 1    | 8   | 32  | 1      | 8    | 32    | 1          | 8   | 32  |
| sequential | 4.9  | -   | -   | 0.07   | -    | -     | 8.1        | -   | _   |
| icc        | 2.8  | 2.5 | 2.4 | 0.06   | 0.07 | 0.21  | 8.4        | 8.5 | 8.9 |
| minfuse    | 15.3 | 2.6 | 1.1 | 0.08   | 0.03 | 0.03  | 10.3       | 2.7 | 1.1 |
| smartfuse  | 15.3 | 2.6 | 1.1 | 0.08   | 0.03 | 0.03  | 10.3       | 2.7 | 1.1 |
| maxfuse    | 15.4 | 2.5 | 1.0 | 0.46   | 8.86 | 16.88 | 10.7       | 3.6 | 4.6 |
| hybridfuse | 9.1  | 1.6 | 0.7 | 0.11   | 0.03 | 0.03  | ×          | ×   | ×   |
| Our work   | 15.3 | 2.5 | 1.1 | 0.08   | 0.03 | 0.03  | 10.5       | 2.7 | 1.1 |

We still compare the performance with different fusion heuristics of PPCG. Besides, we also compare with *hybridfuse*, the hybrid fusion heuristic used by the Pluto optimizer [13] which fuses outer loop dimensions using a conservative heuristic but maximizes the fusion at the inner level of a loop nest. We use the same tile size ( $32 \times 32$ ) default enabled by these compilers for each benchmark; tuning the tile sizes does not impact the execution time too much.

The kernel 2mm performs two matrix-matrix multiplications. We did not observe significant variations in its execution time when using different fusion heuristics of PPCG or our approach, since the parallelism/tilability is preserved by each fusion heuristic. The hybridfuse achieves the best performance, since maximizing fusion at the innermost level benefits for the automatic vectorization of the icc compiler. Integrating with a hybrid fusion heuristic may be an interesting direction for our approach to follow. Another example, 3mm, observes similar result because it has a similar computation pattern with 2mm, both composed of multiple reduction operations that the basteln strategy decides not to fuse any of them.

The *gemver* is composed of four loop nests performing vector multiplications, additions, and matrix-vector multiplications. The *covariance* is used to compute the covariance of data samples from different populations in data mining. One can observe that maxfuse suffers from significant performance degradation due to the lose of parallelism for these two benchmarks. Our approach enables rectangular/parallelogram tile shapes for these benchmarks. The fusion strategy found by our approach is more aggressive than that of smartfuse, but we did not lose parallelism or tilability. The hybridfuse generates a segmentation fault (represented as  $\times$ ) for covariance.

We did not apply aggressive storage optimizations for these micro kernels. This demonstrates that the composition of tiling and fusion exploited by our approach can also improve the performance of programs by only maximizing data locality.

## 7.2 Performance on GPU

We now evaluate the performance on GPU. The performance of those benchmarks extracted from the PolyBench benchmark suites follows the same trend with that of the CPU case; we thus do not discuss them here. The large-scale benchmarks are as follows.

7.2.1 Deep Neural Networks. The deployment of deep learning models on underlying hardware is usually composed of two parts. First, a deep learning model is expressed as a computational

5:32 J. Zhao et al.

| Types            | Names     | No. of ops. | Patterns                          | Shapes                                           | Baseline time (μs) |
|------------------|-----------|-------------|-----------------------------------|--------------------------------------------------|--------------------|
|                  | s2m       | 3           | sub+mul+mul                       | (32,12,128,128)                                  | 351.4              |
|                  | 2add      | 2           | add+add                           | (32,128,768)                                     | 101.56             |
|                  | 4madd     | 5           | mul+mul+mul+add                   | (1024)                                           | 10.32              |
|                  | 2m        | 2           | mul+mul                           | (32,12,128,128)                                  | 105.54             |
| multiple         | csm       | 3           | cast+sub+mul                      | (32,1,1,128)                                     | 8.43               |
| elementwise      | cadd      | 2           | cast+add                          | (640,21128)                                      | 155.2              |
|                  | maxmin    | 3           | max+min+cast                      | (32,128,768)                                     | 17.85              |
|                  | madd      | 2           | mul+add                           | (32,12,128,128)                                  | 108.43             |
|                  | rdadd     | 2           | realdiv+add                       | (1024)                                           | 5.53               |
|                  | indep     | 6           | reshape+cast+add,reshape+cast+mul | (256,7,7,2048)                                   | 1325.0             |
|                  | mmbias1   | 2           | matmul+bias                       | $(32,768) \times (768,2)$                        | 150.78             |
| elementwise      | mmbias2   | 2           | matmul+bias                       | $(32,2) \times (768,2)$                          | 89.58              |
| &                | mmbias3   | 2           | matmul+bias                       | $(640,21128) \times (21128,768)$                 | 3600.7             |
| matmul           | transbmm  | 2           | transpose+batchmatmul             | $(32,128,12,64) \times (32,12,128,64)$           | 141.76             |
| mannu            | bmmtrans  | 2           | batchmatmul+transpose             | $(32,12,128,128) \times (32,12,128,64)$          | 172.22             |
|                  | attention | 3           | mul+(batchmatmul,batchmatmul)     | $(32,12,128,128) \times (32,12,128,64)$          | 312.63             |
| .1               | conv1     | 3           | add+conv+relu                     | $(64,512,7^{(1)},7^{(1)}) \times (3,3)_{(1)}$    | 1518.3             |
| elementwise<br>& | conv2     | 3           | add+conv+relu                     | $(64,256,14^{(1)},14^{(1)}) \times (3,3)_{(1)}$  | 1316.5             |
| convolution      | conv3     | 3           | add+conv+relu                     | $(64,64,56^{(1)},56^{(1)}) \times (3,3)_{(1)}$   | 1027.4             |
| convolution      | conv4     | 3           | add+conv+relu                     | $(64,64,224^{(2)},224^{(2)}) \times (7,7)_{(3)}$ | 11653              |

Table 3. Summary of the Sub-Graphs

graph by a graph engine. Second, the computation graph is separated into sub-graphs, each of which is optimized using a tensor compiler. We first evaluate the performance of sub-graphs composed of multiple operators and then show the overall effect of our approach on end-to-end workloads.

Table 3 summarizes the information of the sub-graphs used in this experiment. We consider three types that happen frequently in deep neural networks. The first category is only composed of elementwise operators, the second category is the compositions of elementwise operators and (batched) matrix multiplications, and the third category comprises the patterns involving elementwise operators and convolutions.

The second column of Table 3 represents the name of each sub-graph, followed by the number of operators and the composition pattern of each sub-graph. Tensor operations that constitute the first category include addition (add), subtraction (sub), elementwise multiplication with a constant (mul), elementwise division using real numbers (realdiv), casting (cast), reshaping (reshape), and the elementwise maximum/minimum operators (max/min) between an element and a constant of a tensor. We consider matrix multiplication (matmul) or its batched form (batchmatmul) with auxiliary operators like matrix transpose (transpose) and elementwise bias addition/subtraction (bias) for the second category. The third category is composed of 2D convolutions and some auxiliary elementwise operators.

We also list the shape configurations we use for each sub-graph in the fifth column of Table 3. These shape configurations are typical values used in practice. In particular, we extract these configurations from the YOLO9000 model [56]. The padding factors and the strides of each NCHW convolution are denoted using superscripts and subscripts, respectively. For instance, conv1 represents an NCHW convolution over a 512-channel,  $7 \times 7$  input image using a kernel of size  $3 \times 3$ . The batch size, padding factor, and stride are 64, 1, and 1, respectively. We collect the shapes of the input tensors used by each sub-graph. One can conclude the shape of each output tensor using these configurations. We also examine other shape configurations and observe similar results.

Except matrix multiplications and their batched variants that use the cuBLAS routines, each operator considered in this experiment is backed by cuDNN. As vendor libraries developed by NVIDIA, cuBLAS, and cuDNN are highly optimized and tuned for GPU, but they cannot exploit fusion across operators. We execute each operator using its implementation of cuBLAS/cuDNN and calculate the total execution time of each sub-graph. This execution time is referred to as the



Fig. 16. Performance of sub-graphs on GPU.

baseline time and collected in the last column of Table 3. Matrix multiplications and convolutions can also be mapped to tensor cores using polyhedral compilation techniques [9]. We implement a strategy similar to that of Bhaskaracharya et al. [9] in AKG but did not make use of tensor cores in the experiments. This is because fusion in such cases has to be implemented at the fragment level, and the effect is the same as traditional tiling-after-fusion strategy. We believe that this effect should not be interpreted as the contributions of this work.

We compare the performance with two baselines: one is the implementation of each sub-graph using cuBLAS and cuDNN, and the other is the code generated by TVM. The code of TVM is fully optimized using its auto-tuner [69]. In addition, we provide two versions of the code generated by AKG. The first version applies loop tiling based on the *smartfuse* fusion heuristic of *isl*, which is used in AKG by default, and the second is integrated with the approach of this work. The *minfuse* is not considered because it always distributes operators into different loop nests. It thus cannot benefit from the shared/private memories of GPU. The *maxfuse* is usually not a good choice for GPU due to the sacrifice of multi-dimensional parallelism. A large number of optimizations outside the scope of this work have also been integrated into the AKG compiler, which also contribute to the overall performance of AKG. These optimizations are enabled for both versions of the AKG code to isolate the effect of loop ting and fusion.

The purpose of the comparison between our approach and the default fusion heuristic of AKG is to demonstrate that our approach can indeed bring about significant performance improvement or boil down to the default setting in the worst case, whereas comparing with cuBLAS/cuDNN and TVM is used to illustrate that generating code using the polyhedral model can achieve competitive or even better performance than manual approaches. Figure 16 shows the speedup of each version over the baseline time. We only discuss the overall performance in the following context.

We also introduce a preprocessing step that performs loop coalescing before polyhedral compilation, which is triggered when processing tensors with relatively smaller shape configurations using elementwise operators. Loop coalescing, also know as loop linearization, is a transformation that coalesces a deep loop nest into a single dimension, which is not favored by the scheduling algorithms [63]. However, loop coalescing is profitable when the iteration numbers of each loop dimension are smaller than the grid/block sizes of GPU.

Our approach boils down to the default fusion heuristic of AKG when used to fuse multiple elementwise operators, and thus obtains the same results as the latter for most of the first category of sub-graphs. They fuse all operators into a single group and thus create more intermediate values. These intermediate values defined by elementwise operators are propagated to their uses by our inlining algorithm, resulting in one computation expression for each sub-graph of the first category. As a preprocessing step, loop coalescing improves the performance of AKG by making full use of

5:34 J. Zhao et al.

| Workloads    | Domains                     | Batch | Baseline   | No of karnals  | Lines of CUDA code | Compilation time (s) |                |  |
|--------------|-----------------------------|-------|------------|----------------|--------------------|----------------------|----------------|--|
| Workloads    | Domanis                     | sizes | times (ms) | No. of Kerners | Lines of CODA code | AKG                  | AKG + Our work |  |
| MobileNet-v3 | image classification        | 150   | 151.41     | 95             | 2041               | 25.40                | 73.99          |  |
| LeNet-5      | image classification        | 32    | 1.56       | 6              | 78                 | 2.85                 | 2.81           |  |
| VGG16        | image classification        | 32    | 70.4       | 27             | 432                | 2.79                 | 2.73           |  |
| BERT         | natural language processing | 64    | 352.17     | 123            | 2424               | 61.59                | 77.92          |  |
| Wide&Deep    | recommendation system       | 16000 | 22.41      | 81             | 2098               | 12.91                | 26.33          |  |

Table 4. Results of the End-to-End Workloads on GPU

the available hardware resources on GPU. It is unnecessary to enable loop coalescing in the case of larger shape configurations, as the iteration numbers of loops are greater than the grid/block sizes on GPU. For example, it is turned off when experimenting on the *cadd* sub-graph. On the contrary, fusion across these operators is not exploited by cuBLAS or cuDNN. As such, highly tuned libraries cannot benefit from faster memories.

The exceptional case is the *indep* example, a sub-graph lowered from a batch normalization layer [32]. Unlike other sub-graphs of the first category, this example involves a multiplication that computes the square of the input tensor and another elementwise addition that takes as input the same tensor. These two operators are fused with their auxiliary elementwise operators (reshape and cast) but do not depend on each other. The optimization strategy proposed in Section 5.4 results in a speedup of 31.6× over cuDNN, and this optimization can also be implemented using the schedule primitives of TVM. In summary, AKG can obtain a mean speedup of 6.6× for the sub-graphs of the first category over the cuDNN implementation. TVM performs similar to AKG, as multiple elementwise operators can also be fused by TVM, with inlining fully considered.

Like fusion of multiple elementwise operators, a (batched) matrix multiplication can be fused with its auxiliary operators by our approach. Loop coalescing is disabled in these cases because matrix multiplication is not an elementwise operator. Our approach fuses all operators into a single group for these sub-graphs, increasing the number of intermediate values that can be allocated in the shared memory. The default fusion heuristic of AKG sometimes fails to perform fusion for the sub-graphs of the second category. On average, our approach can achieve a speedup of  $10.2\times$  over the compound of cuBLAS and cuDNN. Notably, our approach outperforms the NVIDIA libraries by  $41.7\times$  for mmbias2. The reason is because the cuBLAS implementation of matrix multiplication performs poorly when given a non-trivial input shape configuration. This demonstrates that vendor libraries cannot scale with different shape configurations. Our approach can still bring about a mean speedup of  $2.3\times$  for the sub-graphs of the second category if this exceptional case is excluded.

TVM requires a transposed matrix input when scheduling matrix multiplications. It can fuse the matrix multiplication routine with its follow-up elementwise operators but fails to perform fusion on the preceding transpose operators. As a result, our approach achieves an average improvement of 7.4% over TVM for the sub-graphs of the second category. In particular, the *attention* example represents a pattern of the self-attention mechanism. The output of the elementwise multiplication is used by two batched matrix multiplications. Without the second optimization introduced in Section 5.3, our method would not fuse these operators into a single group, just like what the default fusion heuristic of AKG does. Our approach can outperform the cuDNN/cuBLAS libraries by 1.4×. Note that this fusion strategy can also be specified using the schedule primitives of TVM, which obtains the same result as AKG.

The computation pattern of the third category is similar to the example shown in Figure 1(a). Our approach and TVM can maximize the parallelism and locality simultaneously for such a pattern, outperforming the highly tuned libraries by 1.8× on average. The end-to-end workloads used in this experiment are summarized in Table 4, where we also report the number of kernels produced by AKG and their lines of CUDA code.

We consider five popular large-scale deep neural networks including MobileNet-v3 [30], Lenet-5 [41], VGG16 [57], BERT [19], and Wide&Deep [16]. These models are widely used to solve difficult problems for image classification, natural language processing, and the recommendation system. Each model is expressed using the MindSpore framework, which leverages cuDNN and cuBLAS when generating code for GPU and is also capable of performing simple fusion patterns involving operators like elementwise, reductions, and so forth. We refer to the execution time of this version as the baseline time. The batch sizes are also listed in Table 4.

We compare the performance with TVM and the default fusion strategy of AKG. The overall execution time is the sum of each sub-graph. Figure 17 shows the speedup of each version over the baseline time. Loop fusion is employed by each approach except the highly tuned libraries. Comparison with TensorFlow and XLA has been reported in our another publication [66] that is orthogonal to this work by focusing on graph-level optimizations.

The result demonstrates that exploiting aggressive fusion strategies with storage optimizations is always beneficial to the performance. The default fusion heuristic used by AKG is already able to outperform the highly tuned libraries but falls behind that of TVM. Our approach obtains an average performance improvement of 41.5% over this default fusion strategy, and the overall performance of AKG with our approach outperforms the baseline version and TVM by 1.8× and 1.1× on average, respectively. This is in line with the result of sub-graphs shown in Figure 16, since matrix



Fig. 17. Performance of the end-to-end workloads on  $\ensuremath{\mathsf{GPU}}.$ 

multiplications and convolutions consume most of the execution time of a deep learning model. In particular, the self-attention mechanism is adopted by the BERT model, and the optimization strategy proposed in Section 5.3 improves the performance of cuBLAS/cuDNN libraries by  $3.2\times$  and that of the default fusion heuristic of AKG by  $2.3\times$ .

7.2.2 Image Processing Pipelines. As PolyMage does not target GPUs, we only compare the performance with Halide's manual schedule. The baseline version is generated by PPCG without our approach, which implements rectangular/parallelogram tiling and the minfuse heuristic. The auto-tuned tile sizes are identical with those shown in Table 1, with the auto-tuned GPU grid parameters listed in the fifth column. The CUDA code is generated with private/shared memory optimizations enabled. The results are shown in Figure 18. The numbers of smartfuse and maxfuse are missing for some of the benchmarks because they cannot terminate within a reasonable amount of time. We will explain the time complexity issue in Section 7.4.

The *minfuse* does not fuse any of the four stages of Unsharp Mask, failing to benefit from the shared/private memory. Whereas *smartfuse* exploits 3D parallelism by fusing the four stages into two groups, *maxfuse* groups all of the stages together but boils down to 2D parallelism and using  $128 \times 3$  as the GPU grid parameters. The *maxfuse* suffers from performance degradation due to the loss of parallelism. None of the fusion



Fig. 18. Performance of PolyMage benchmarks on GPU.

5:36 J. Zhao et al.

| Models       | Batch |      | Executio | on time (ms)   | Speedi | ıp over       | Compilation time (s) |                |  |
|--------------|-------|------|----------|----------------|--------|---------------|----------------------|----------------|--|
| Models       | sizes | AKG  | TVM      | AKG + our work | AKG    | TVM           | AKG                  | AKG + Our work |  |
| MobileNet-v2 | 150   | 181  | 175      | 172            | 1.05×  | 1.02×         | 44.8                 | 56.9           |  |
| VGG16        | 32    | 86   | 140      | 72             | 1.19×  | 1.94×         | 2.87                 | 2.71           |  |
| BERT-v1      | 64    | 151  | 147      | 140            | 1.08×  | $1.05 \times$ | 61.6                 | 77.9           |  |
| BERT-v2      | 64    | 413  | 457      | 380            | 1.09×  | 1.20×         | 61.6                 | 77.9           |  |
| ResNet-50    | 16    | 20.1 | 17.7     | 17.7           | 1.13×  | $1.00 \times$ | 736                  | 487            |  |

Table 5. Results of the End-to-End Workloads on the Huawei Ascend 910 Chip

heuristics of PPCG applies fusion to Harris Corner Detection, which is prevented by overlapped memory footprints. Without losing parallelism, our approach obtains superior performance because our CUDA code maximizes the utilization of the shared/private memory due to the aggressive fusion results and overlapped tile shapes.

Halide outperforms our approach slightly for Bilateral Grid and Unsharp Mask since it manually applies unrolling transformations to the channel dimension of the input images after tiling. This can benefit for the instruction-level pipelined parallelism of the benchmarks and is an interesting direction for our approach to follow in the future. Our approach provides a mean performance improvement of 17% over Halide.

7.2.3 Finite Element Method. PPCG cannot generate effective CUDA code for equake due to the presence of the while loop. Its enhancement [67] converts the while loop into a so-called dynamic counted loop using a preprocessing step, which allows the exploration of loop tiling and fusion in the polyhedral model. The code generation algorithm then introduces a goto statement to eliminate the over-approximated iterations caused by the preprocessing. This may achieve a speedup of 2.3× over the default setting of PPCG. However, the fusion strategy is exploited by hand in previous work [67]. Our approach achieves the same result as the manual approach but automates the composition of tiling and fusion, which benefits the performance by taking advantage of faster memory on GPU.

## 7.3 Performance on a Deep Learning Accelerator

The execution of deep neural networks demands for enormous computing power, resulting in the hardware race between tech giants to pursue dedicated chips. The Huawei Ascend 910 chip is one of such domain-specific accelerators designed for deep neural networks. AKG can also be used to generate CCE code for the Huawei Ascend 910 chips. We now evaluate the performance of deep neural networks on this deep learning accelerator.

Due to the complicated memory hierarchy, developing high-performance libraries by hand for neural processing units becomes more challenging. As such, the vendor developers of the Huawei Ascend 910 chips resort to the schedule primitives of TVM to exploit the parallelism and locality. However, deploying deep neural networks using TVM still requires a tremendous amount of human effort. Tens to hundreds of engineers have been involved in writing schedule templates for GPU or Ascend 910 chips independently. This makes the deep neural networks optimized by the engineers for the deep learning accelerator not totally the same as those used for GPU.

Four deep neural network models, including ResNet-50 [29], MobileNet-v2, VGG16, and BERT, are considered in this experiment. The BERT model is evaluated using two vocabulary sizes: 21,128 (BERT-v1) and 30,522 (BERT-v2). These models are fully tuned by the vendor developers using TVM's schedule primitives. Except for the ResNet-50 model, the remaining three deep networks have been introduced when experimenting on GPU. The ResNet-50 model is a 50-layer deep network used for image classification. The detailed information of these models and the experimental results are shown in Table 5. The execution time is reported for a single training epoch.

Similar to the code generation for GPU, these deep neural networks are also first split into sub-graphs, each of which is then compiled using AKG or optimized by TVM schedule primitives. Operator inlining is disabled since the programming model of the target accelerator requires that each instruction has to be written in the three-address form. The optimization that transforms a give program into the three-address form is also performed automatically by the code generator. Tile sizes are first specified by hand in the DSL, which will be optimized by the auto-tuner of AKG or TVM. AKG exploits the fusion opportunities between operators within each sub-graph using the *smartfuse* fusion heuristic of *isl*. Our approach enables overlapped tiling and thus results in more aggressive fusion strategies. As listed in Table 5, our approach can always outperform the default fusion heuristic of AKG and improves the performance of AKG by 11% on average.

The code generator of AKG exploits vectorization by emitting a single statement for a convolution operator. It always packs an initialization statement with its reduction statement when handling convolutions or matrix multiplications. The *minfuse* may prevent the vectorization of the CCE code since it isolates the initialization and reduction statements. We therefore do not compare with it. The numbers of *maxfuse* are missing due to the tedious compilation time of the heuristic. The dimensions of loop nests would also mismatch the tile sizes specified in the DSL and/or the tilability of loop nests would be lost even if the *maxfuse* heuristic can find a fusion strategy.

The execution time of the code generated by TVM schedules is also collected. The performance of manual scheduling approach relies heavily on the experience of developers. The engineers have devoted more than 2 months to develop good schedule templates for the ResNet-50 and MobileNet-v2 models using their domain-specific knowledge and experience. Their optimal schedule templates exploit similar tiling and fusion results like our technique does, at the expense of enormous human effort. However, their experience on the optimizations of VGG and BERT is rather limited, making the performance fall behind that of our approach. The mean speedup of our approach over TVM on the Huawei Ascend 910 chip is 1.2×.

Note that the impact of optimal tile sizes is essential to the overall performance of both AKG and TVM. The tile sizes used in the experiments are fully optimized by both approaches. The autotuner of AKG was introduced in Section 6.5. TVM uses a machine learning based model to find the optimal schedule configurations and best tile sizes. The manual scheduling approach may suffer from a heavier performance degradation without tuning strategies.

## 7.4 Comparison of Time Complexity

Our approach can also benefit for the compilation time of the polyhedral model. We mainly discuss the image processing pipelines and the deep neural networks that challenge the scalability of aggressive fusion heuristics. The compilation overheads for the remaining benchmarks are lightweight, and we thus do not talk about them here.

Table 1 collects the compilation overhead for each image processing pipeline. We report the compilation time for generating OpenMP code; the compilation overhead for generating CUDA code follows a similar trend with the CPU case. The *smartfuse* heuristic generates the exact same fusion result as *minfuse* for Bilateral Grid and Multiscale Interpolation; it fuses some of the functions for Camera Pipeline, Harris Corner Detection, and Unsharp Mask. However, it cannot finish within 1 day for Local Laplacian Filter.

The *maxfuse* cannot finish within 24 hours for most of the image processing pipelines, including Bilateral Grid, Camera Pipeline, Multiscale Interpolation, and Local Laplacian Filter. The *smartfuse* also suffers from the tedious compilation time for two of them. Our approach can always terminate within 8 minutes (except Camera Pipeline), and the compilation time is sometimes competitive with or even better than *smartfuse* but exhibits better locality. The reason is as follows.

5:38 J. Zhao et al.

First, unlike *maxfuse* that requires auxiliary loop transformations to enable fusion for the whole considered program, our approach uses the scheduling algorithms of *isl* for each individual iteration space before pre-tiling fusion. This leads to the improvement of scheduling time for the benchmarks including Bilateral Grid, Local Laplacian Filter, and Unsharp Mask. Second, our algorithms exploit better fusion strategies when compared with conservative fusion heuristics, generating fewer iteration spaces to be used for generating code. This reduces the time of code generation for Unsharp Mask.

The exceptional case is Harris Corner Detection, for which our approach takes longer time than the heuristics. This is because the complex access pattern presented in the benchmark will lead to a tedious computation of upward exposed data when using *isl* [61] that computes complex dependence transitive closures. The recent study [50] optimizing the simplex solver of *isl* may be a possible solution to address this problem.

The rightmost two columns of Table 4 (for generating CUDA code) and Table 5 (for generating CCE code) represent the compilation overheads in seconds of each deep neural network. By default, AKG uses the *smartfuse* heuristic of *isl*. One can find that our approach requires less compilation time than *smartfuse* for some of the workloads. Our approach enables a more aggressive fusion strategy than *smartfuse* when integrated into AKG, generating fewer iteration spaces that need to be scanned by the code generator. This reduces the time of code generation for LeNet-5, VGG16, and ResNet-50 moderating the compilation time of these workloads on GPU and our deep learning accelerator by alleviating the cost to compute extension schedules and perform manipulations on schedule trees. Note that the compilation time is collected with tuning overhead not taken into account.

#### 8 RELATED WORK

As presented in the previous sections, our approach exploits a novel combination of loop tiling and fusion in polyhedral compilation to maximize the utilization of the memory hierarchy. Loop fusion was revisited widely in the past few years for optimizing locality on modern domain-specific chips. The fusion heuristics were well studied for both general-purpose optimizers [3, 12, 55] and domain-specific frameworks [14, 54], but a well-defined cost model for an optimal, general solution to different architectures has not yet been found. The underlying principle is due to the conflict demands of parallelism and locality, as demonstrated in this work. Designing aggressive fusion heuristics cannot avoid this difficulty, and we thus implemented a post-tiling fusion strategy which well models the tradeoff between parallelism/tilability, locality, and recomputation.

Tiling [33] was unified into the polyhedral model using affine relations [13], followed by numerous publications on complex tile shapes [11, 23, 39] and parameterized tile sizes [28, 38, 43]. The tile size selection issue was partially addressed by auto-tuning tools [4, 5, 14, 59] that can be used as a complementary optimization for our approach, but complex tile shapes that benefit for aggressive storage optimizations like overlapped tiling were insufficiently integrated within polyhedral frameworks. PolyMage [44] is a polyhedral implementation of Halide's schedules [54]; it constructs overlapped tile shapes for a given number of image processing functions. The overlapped tile shapes may still be looser, as explained in our experiments. One of the PolyMage's enhancements optimizes the construction of overlapped tile shapes using the *expansion nodes* of schedule trees. While restricted to image processing and simple stencils, this expansion node based implementation cannot integrate itself with post-tiling fusion.

Some recent developments of PolyMage [44] either enhance its node grouping heuristic [34] or enable CUDA code generation using hybrid overlapped/parallelogram tiling [35]. The auto-tuner proposed in the work of Jangda and Bondhugula [34], which was also adopted in another work [35], is orthogonal with the topic studied in this work. It is likely to find better fusion strategies than the

current implementation of our approach in PPCG [62] due to the dynamic programming heuristic. Fortunately, this high-level node grouping heuristic can be used as a preceding optimization of PPCG, just like integrating a graph engine with a tensor compiler. Our extension to PPCG will still be applicable to each group suggested by this high-lever optimizer. One can expect for tighter overlapped tile shapes within each group. Integrating PPCG with high-level node grouping heuristics eligible for powerful auto-tuning methods in the future is compelling, and the hybrid tiling technique introduced in the work of Jangda and Guha [35] also provides an interesting direction to follow, which exploits the wrap-level optimizations of GPU not considered by many polyhedral approaches. As the extension to GPU of PolyMage [35] has not been made publicly accessible, we did not compare the performance with this approach.

The manual scheduling approach is an alternative to the fusion heuristics of polyhedral compilation. Halide [4, 54] was proposed to describe and optimize image processing pipelines in an easier way by isolating schedules from algorithms. Such an idea of isolation was also adapted by later deep learning compilers like TVM [14]. Some of these frameworks [14, 54] only provide users with schedule primitives for transforming computations; they are not able to compute extension schedules like affine relations (15) but use interval analysis to determine overlapped tile sizes. A similar problem also arises in XLA [22]. The basteln strategy described in this work is a generalization of the schedules manually specified in Halide [54] or TVM [14], with no domain-specific knowledge required to exploit more fusion opportunities [45]. In addition to fully automating the composition of tiling and fusion, we consider a larger set of programs and demonstrated the effectiveness of the approach in both general-purpose and domain-specific optimizing compilers. Our approach overcomes this weakness.

The polyhedral model was also integrated into the MLIR [40] infrastructure. Similar to our earlier publication [65], MLIR implements a backward slicing strategy by expressing dependences between computations as implicit affine relations, based on which the slices of intermediate iteration spaces are pulled into their output iteration space. To prevent aggressive fusion, a cost model evaluating the amount of introduced recomputation is defined and used by this fusion pass. Although MLIR does not directly compose loop fusion and tiling, one can offer it with a tiled output iteration space, and the tile shapes and sizes of intermediate iteration spaces can be automatically determined, with aggressive storage optimizations managed without human intervention. By tightly integrating with an SSA (static single assignment) compiler IR, the affine fusion pass of MLIR differs from our work in that it does not rely on heavyweight libraries like *isl* [61], and it can thus achieve the same quality of code in a much faster way. However, this also restricts the general applicability of MLIR to only deep learning and image processing applications domains, as pointed out in Section 1. As the MLIR infrastructure also leverages a simplified schedule tree representation, implementing our approach in MLIR to extend its generality will not cost too much effort.

#### 9 CONCLUSION

Fusion and tiling are two important loop transformations in optimizing compilers. In this article, we presented a new composition of fusion and tiling in the polyhedral model by constructing non-trivial tile shapes through the cooperation with data tiles and implementing a hierarchical fusion strategy. Starting with a pre-tiling fusion heuristic, our approach facilitates the construction of tile shapes without refining scheduling algorithms nor complicating code generation in the polyhedral model, followed by an implementation of post-tiling fusion on top of a general-purpose IR. The technique proposed in this work avoids the need of aggressive fusion heuristics for minimizing data transfers across memory hierarchies, mitigating the time complexity of polyhedral frameworks. The extension to the general-purpose IR for implementing post-tiling fusion

5:40 J. Zhao et al.

also strengthens the expressiveness and compatibility of the polyhedral model for modeling more compositions of loop transformations. The approach described in this work was implemented as an ad-hoc method for image processing [54] and deep learning [14]. This work reflects significant progress toward fully automating the basteln strategy for a much wider set of scenarios and both general-purpose and domain-specific optimizing compilers.

This study is the extended work of our previous publications [65, 68]. We enhanced the power and general applicability of the initial idea [65] through some significant optimizations including designing a pre-tiling fusion, better modeling the fusion of multiple output iteration spaces and enabling the fusion of independent groups. The contributions of this work are orthogonal with AKG [68] by developing a CUDA backend and an operator inlining algorithm, demonstrating the portability of AKG to GPU. The experimental results on GPU V100 are only presented in this work. Experiments results on Ascend 910 chips are reported in both papers, but the evaluated benchmarks are not identical. While the AKG paper demonstrated the compound effects of many optimizations, the experimental results in Section 7.3 isolate the effect of the enhanced basteln strategy from AKG.

#### **APPENDIX**

#### A EXPLANATION OF THE ALGORITHMS

To help the readers better understand our approach, we explain the proposed algorithms in detail.

## A.1 Description of Algorithm 1

Algorithm 1 takes an initial schedule tree like Figure 3(a) as input and produces a new schedule tree like Figure 3(b). It first applies a polyhedral scheduling algorithm (line 1) without fusion heuristics. We use the variant [63] of Pluto algorithm in isl, but it can also be replaced by the Pluto algorithm [13] or other enhancements [2]. The n at line 2 is the number of filters of the original schedule tree (e.g., n=3 in Figure 3(a)), and g at line 3 represents the number of fusion groups before tiling (e.g., g=2 in Figure 3(b)). The visited at line 5 is a set of flags for each filter node of the original schedule tree.

The second loop between lines 6 and 19 is used to check the fusion possibility of each filter node in the original schedule tree. Each  $filter_i$  is added to an individual group,  $group_g$ , if  $filter_i$  is the first visited filter node (line 9) or has quasi-identity dependence relations with  $group_g$  (line 14). A new group is created if no quasi-identity dependence relations hold (line 18). Once the g groups are obtained,  $schedule\ tree$  is updated using the filters of these groups (line 20), which is followed by the third loop (lines 21–29) that updates the subtree of each filter node. One can copy the original subtree of a  $filter_k$  if  $num_k = 1$  (line 23), which is used to record the number of filter nodes of the original schedule tree fused in  $group_k$ .

A band node like affine relations (6) is constructed (line 25) if  $num_k > 1$ . The number of loop dimensions is extracted from affine relations (7). A sequence node is next introduced underneath this band node (line 26), with each child instantiated using the inner loop at lines 27 and 28. The subtree rooted at this band node is finally inserted under each  $filter_k$  (line 29), and the output will be a subtree after performing pre-tiling fusion.

## A.2 Description of Algorithm 3

Algorithm 3 requires two inputs: one is the output of Algorithm 1, and the other is the output of Algorithm 2. We still assume that there exists only one output iteration space.

The number of tiling schedules in *mixed schedules* is exactly the number of fusion groups suggested by Algorithm 2. For each group, Algorithm 3 first replaces the original band node using

schedule (line 3) and splits it into two parts as described in Section 5.1 (line 5). The inner loop (lines 7–16) iterates over the intermediate iteration spaces to be fused with the current output iteration space. An extension schedule of i should not be fused when m > n (line 11), with m and n representing the numbers of parallelizable loops of an output iteration space and i, respectively. The purpose of comparing m and n has been explained in Section 4.5. Lines 13 through 16 perform the manipulations on schedule trees.

# A.3 Description of Algorithm 4

Algorithm 4 takes a schedule tree produced by Algorithm 1 as input and performs the basteln strategy in this work. It implements a novel combination of tiling and fusion in the polyhedral model using three steps.

First, each output iteration space and its intermediate iteration spaces are extracted from the iteration domain of the input schedule tree and saved in *spaces* (line 4), to which Algorithm 2 is then applied (line 5). This prevents the fusions between output iteration spaces, and it indeed makes sense because live-out values do not necessarily need to be allocated on small scratchpads or shared memories. The *groupsset* is a set of *groups*—that is, the output of Algorithm 2 for each *spaces*.

The second step is to handle each intermediate iteration space *sharedspace* that is used by multiple output iteration spaces. Line 7 is computing the intersection of all extension schedules with respect to *sharedspace*. If the intersection is empty, Algorithm 4 does nothing and continues to the next iteration (line 9). If the intersection is exactly equal to *sharedspace*, a tile shape inferred using o is identical with each s in oset. In this case, Algorithm 4 computes *shape* (i.e., the tile shape of s using the elementary operations of affine relations (lines 16–18)) and propagates *shape* to *spaces* with s as the single output iteration space (line 19). Algorithm 4 replaces all extension nodes of *sharedspace* using tiling schedules (line 21) in other cases. This means that *sharedspace* will not be fused.

The final step is the last loop in Algorithm 4. It first applies Algorithm 3 to *groups* (line 23) for constructing schedule trees. If *sharedspace* cannot be fused to any of its uses, the algorithm will remove the introduced nodes related to *sharedspace* in *schedule tree* (line 25), which prevents the possible fusion and avoids redundant computations.

## A.4 Description of Algorithm 5

Inlining can be performed during the code generation process, meaning that loop tiling and fusion have been finished. The code generator is allowed to alter the program expressions within each fusion group. Algorithm 5 takes as input the fusion configuration obtained by our approach and iterates over each fusion group. The while loop is used to perform inlining of consecutive elementwise operators. It does not consider other types of operators (line 6), nor an output elementwise operator (line 9). The case of a single consumer is handled at line 11. Lines 12 through 16 are used to deal with an elementwise operator with multiple consumers. Inlining should be performed when the consumers do not access overlapped data regions (line 16) or discarded otherwise (line 14).

The core inlining operation happens at lines 11 and 16. It can be carried out by first rewriting each occurrence of an elementwise operator's expression with its computation and then degenerating the code of the elementwise operator.

## **ACKNOWLEDGMENTS**

The authors are grateful to the anonymous reviewers of ACM TOCS for commenting on this article. Jinchen Xu, Peng Di, and Xuefeng Jin are the corresponding authors of this work. The authors would also like to acknowledge the input and discussions from the Huawei MindSpore team.

5:42 J. Zhao et al.

## **REFERENCES**

[1] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, Manjunath Kudlur, Josh Levenberg, Rajat Monga, Sherry Moore, Derek G. Murray, Benoit Steiner, Paul Tucker, Vijay Vasudevan, Pete Warden, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. 2016. TensorFlow: A system for large-scale machine learning. In *Proceedings of the 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI '16)*. 265–283. https://www.usenix.org/conference/osdi16/technical-sessions/presentation/abadi

- [2] Aravind Acharya, Uday Bondhugula, and Albert Cohen. 2018. Polyhedral auto-transformation with no integer linear programming. In Proceedings of the 39th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI '18). ACM, New York, NY, 529-542. https://doi.org/10.1145/3192366.3192401
- [3] Aravind Acharya, Uday Bondhugula, and Albert Cohen. 2020. Effective loop fusion in polyhedral compilation using fusion conflict graphs. *ACM Transactions on Architecture and Code Optimization* 17, 4 (Sept. 2020), Article 26, 26 pages. https://doi.org/10.1145/3416510
- [4] Andrew Adams, Karima Ma, Luke Anderson, Riyadh Baghdadi, Tzu-Mao Li, Michaël Gharbi, Benoit Steiner, Steven Johnson, Kayvon Fatahalian, Frédo Durand, and Jonathan Ragan-Kelley. 2019. Learning to optimize halide with tree search and random programs. ACM Transactions on Graphics 38, 4 (July 2019), Article 121, 12 pages. https://doi.org/10.1145/3306346.3322967
- [5] Jason Ansel, Shoaib Kamil, Kalyan Veeramachaneni, Jonathan Ragan-Kelley, Jeffrey Bosboom, Una-May O'Reilly, and Saman Amarasinghe. 2014. OpenTuner: An extensible framework for program autotuning. In Proceedings of the 23rd International Conference on Parallel Architectures and Compilation (PACT '14). ACM, New York, NY, 303–316. https://doi.org/10.1145/2628071.2628092
- [6] Riyadh Baghdadi, Jessica Ray, Malek Ben Romdhane, Emanuele Del Sozzo, Abdurrahman Akkas, Yunming Zhang, Patricia Suriana, Shoaib Kamil, and Saman Amarasinghe. 2019. Tiramisu: A polyhedral compiler for expressing fast and portable code. In Proceedings of the 2019 IEEE/ACM International Symposium on Code Generation and Optimization (CGO '19). 193–205. https://doi.org/10.1109/CGO.2019.8661197
- [7] Hesheng Bao, Jacobo Bielak, Omar Ghattas, Loukas F. Kallivokas, David R. O'Hallaron, Jonathan R. Shewchuk, and Jifeng Xu. 1998. Large-scale simulation of elastic wave propagation in heterogeneous media on parallel computers. Computer Methods in Applied Mechanics and Engineering 152, 1 (1998), 85–102. https://doi.org/10.1016/S0045-7825(97) 00183-7
- [8] Cédric Bastoul. 2004. Code generation in the polyhedral model is easier than you think. In Proceedings of the 13th International Conference on Parallel Architectures and Compilation Techniques (PACT '04). IEEE, Los Alamitos, CA, 7–16. https://doi.org/10.1109/PACT.2004.11
- [9] Somashekaracharya G. Bhaskaracharya, Julien Demouth, and Vinod Grover. 2020. Automatic kernel generation for Volta tensor cores. arXiv:cs.PL/2006.12645 (2020).
- [10] Uday Bondhugula. 2015. PolyMage Benchmarks. (commit d20264ef). Retrieved December 7, 2023 from https://github.com/bondhugula/polymage-benchmarks
- [11] Uday Bondhugula, Vinayaka Bandishti, and Irshad Pananilath. 2017. Diamond tiling: Tiling techniques to maximize parallelism for stencil computations. IEEE Transactions on Parallel and Distributed Systems 28, 5 (Oct. 2017), 1285–1298. https://doi.org/10.1109/TPDS.2016.2615094
- [12] Uday Bondhugula, Oktay Gunluk, Sanjeeb Dash, and Lakshminarayanan Renganarayanan. 2010. A model for fusion and code motion in an automatic parallelizing compiler. In Proceedings of the 19th International Conference on Parallel Architectures and Compilation Techniques (PACT '10). ACM, New York, NY, 343–352. https://doi.org/10.1145/1854273. 1854317
- [13] Uday Bondhugula, Albert Hartono, J. Ramanujam, and P. Sadayappan. 2008. A practical automatic polyhedral parallelizer and locality optimizer. In Proceedings of the 29th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI '08). ACM, New York, NY, 101–113. https://doi.org/10.1145/1375581.1375595
- [14] Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin Zheng, Eddie Yan, Haichen Shen, Meghan Cowan, Leyuan Wang, Yuwei Hu, Luis Ceze, Carlos Guestrin, and Arvind Krishnamurthy. 2018. TVM: An automated end-to-end optimizing compiler for deep learning. In Proceedings of the 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI '18). 578–594. https://www.usenix.org/conference/osdi18/presentation/chen
- [15] Tianqi Chen, Lianmin Zheng, Eddie Yan, Ziheng Jiang, Thierry Moreau, Luis Ceze, Carlos Guestrin, and Arvind Krishnamurthy. 2018. Learning to optimize tensor programs. In Advances in Neural Information Processing Systems. 3389–3400
- [16] Heng-Tze Cheng, Levent Koc, Jeremiah Harmsen, Tal Shaked, Tushar Chandra, Hrishi Aradhye, Glen Anderson, Greg Corrado, Wei Chai, Mustafa Ispir, Rohan Anil, Zakaria Haque, Lichan Hong, Vihan Jain, Xiaobing Liu, and Hemal Shah. 2016. Wide & deep learning for recommender systems. In Proceedings of the 1st Workshop on Deep Learning for Recommender Systems (DLRS '16). ACM, New York, NY, 7–10. https://doi.org/10.1145/2988450.2988454

- [17] Sharan Chetlur, Cliff Woolley, Philippe Vandermersch, Jonathan Cohen, John Tran, Bryan Catanzaro, and Evan Shelhamer. 2014. cuDNN: Efficient primitives for deep learning. arXiv:cs.NE/1410.0759 (2014).
- [18] Standard Performance Evaluation Corporation. 2007. SPEC CPU2000 V1.3. Retrieved December 7, 2023 from http://www.spec.org/cpu2000/
- [19] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). 4171–4186. https://doi.org/10.18653/v1/N19-1423
- [20] Paul Feautrier. 1991. Dataflow analysis of array and scalar references. *International Journal of Parallel Programming* 20, 1 (Feb. 1991), 23–53. https://doi.org/10.1007/BF01407931
- [21] Paul Feautrier and Christian Lengauer. 2011. Polyhedron model. In Encyclopedia of Parallel Computing. Springer US, Boston, MA, 1581–1592. https://doi.org/10.1007/978-0-387-09766-4\_502
- [22] Google. 2017. XLA: Optimizing Compiler for Machine Learning. Retrieved December 7, 2023 from https://www.tensorflow.org/xla
- [23] Tobias Grosser, Albert Cohen, Justin Holewinski, P. Sadayappan, and Sven Verdoolaege. 2014. Hybrid hexagonal/classical tiling for GPUs. In Proceedings of the Annual IEEE/ACM International Symposium on Code Generation and Optimization (CGO '14). ACM, New York, NY, Article 66, 10 pages. https://doi.org/10.1145/2544137.2544160
- [24] Tobias Grosser, Armin Groesslinger, and Christian Lengauer. 2012. Polly—Performing polyhedral optimizations on a low-level intermediate representation. *Parallel Processing Letters* 22, 4 (2012), 1250010.
- [25] Tobias Grosser, Sven Verdoolaege, and Albert Cohen. 2015. Polyhedral AST generation is more than scanning polyhedra. ACM Transactions on Programming Languages and Systems 37, 4, Article 12 (July 2015), 50 pages. https://doi.org/10.1145/2743016
- [26] Halide. 2013. Halide Benchmarks (commit 8c23a197). Retrieved December 7, 2023 from https://github.com/halide/ Halide
- [27] Chris Harris and Mike Stephens. 1988. A combined corner and edge detector. In Proceedings of the Alvey Vision Conference, Vol. 15. 147–151.
- [28] Albert Hartono, Muthu Manikandan Baskaran, Cédric Bastoul, Albert Cohen, Sriram Krishnamoorthy, Boyana Norris, J. Ramanujam, and P. Sadayappan. 2009. Parametric multi-level tiling of imperfectly nested loops. In *Proceedings of the 23rd International Conference on Supercomputing (ICS '09)*. ACM, New York, NY, 147–157. https://doi.org/10.1145/ 1542275.1542301
- [29] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR '16). 770–778. https://doi. org/10.1109/CVPR.2016.90
- [30] Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. 2017. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv:cs.CV/1704.04861 (2017).
- [31] Huawei. 2020. MindSpore. Retrieved December 7, 2023 from https://www.mindspore.cn/en
- [32] Sergey Ioffe and Christian Szegedy. 2015. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the 32nd International Conference on Machine Learning (ICML '15), Vol. 37. 448–456
- [33] François Irigoin and Rémi Triolet. 1988. Supernode partitioning. In Proceedings of the 15th ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages (POPL '88). ACM, New York, NY, 319–329. https://doi.org/10. 1145/73560.73588
- [34] Abhinav Jangda and Uday Bondhugula. 2020. An effective fusion and tile size model for PolyMage. ACM Transactions on Programming Languages and Systems 42, 3 (Nov. 2020), Article 12, 27 pages. https://doi.org/10.1145/3404846
- [35] Abhinav Jangda and Arjun Guha. 2020. Model-based warp overlapped tiling for image processing programs on GPUs. In Proceedings of the ACM International Conference on Parallel Architectures and Compilation Techniques (PACT '20). ACM, New York, NY, 317–328. https://doi.org/10.1145/3410463.3414649
- [36] Michael Karr. 1976. Affine relationships among variables of a program. Acta Informatica 6, 2 (June 1976), 133–151. https://doi.org/10.1007/BF00268497
- [37] Ken Kennedy and Kathryn S. McKinley. 1993. Maximizing loop parallelism and improving data locality via loop fusion and distribution. In Proceedings of the 6th International Workshop on Languages and Compilers for Parallel Computing. 301–320
- [38] DaeGon Kim, Lakshminarayanan Renganarayanan, Dave Rostron, Sanjay Rajopadhye, and Michelle Mills Strout. 2007. Multi-level tiling: M for the price of one. In Proceedings of the 2007 ACM/IEEE Conference on Supercomputing (SC '07). ACM, New York, NY, Article 51, 12 pages. https://doi.org/10.1145/1362622.1362691

5:44 J. Zhao et al.

[39] Sriram Krishnamoorthy, Muthu Baskaran, Uday Bondhugula, J. Ramanujam, Atanas Rountev, and P. Sadayappan. 2007. Effective automatic parallelization of stencil computations. In Proceedings of the 28th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI '07). ACM, New York, NY, 235–244. https://doi.org/10. 1145/1250734.1250761

- [40] Chris Lattner, Mehdi Amini, Uday Bondhugula, Albert Cohen, Andy Davis, Jacques Pienaar, River Riddle, Tatiana Shpeisman, Nicolas Vasilache, and Oleksandr Zinenko. 2021. MLIR: Scaling compiler infrastructure for domain specific computation. In Proceedings of the 2021 IEEE/ACM International Symposium on Code Generation and Optimization (CGO '21). 2–14. https://doi.org/10.1109/CGO51591.2021.9370308
- [41] Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner. 1998. Gradient-based learning applied to document recognition. *Proceedings of the IEEE* 86, 11 (1998), 2278–2324. https://doi.org/10.1109/5.726791
- [42] Heng Liao, Jiajin Tu, Jing Xia, Hu Liu, Xiping Zhou, Honghui Yuan, and Yuxing Hu. 2021. Ascend: A scalable and unified architecture for ubiquitous deep neural network computing: Industry track paper. In Proceedings of the 2021 IEEE International Symposium on High-Performance Computer Architecture (HPCA '21). 789–801. https://doi.org/10.1109/HPCA51647.2021.00071
- [43] Sanyam Mehta, Gautham Beeraka, and Pen-Chung Yew. 2013. Tile size selection revisited. ACM Transactions on Architecture and Code Optimization 10, 4 (Dec. 2013), Article 35, 27 pages. https://doi.org/10.1145/2541228.2555292
- [44] Ravi Teja Mullapudi, Vinay Vasista, and Uday Bondhugula. 2015. PolyMage: Automatic optimization for image processing pipelines. In Proceedings of the 20th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS '15). ACM, New York, NY, 429–443. https://doi.org/10.1145/2694344.2694364
- [45] Wei Niu, Jiexiong Guan, Yanzhi Wang, Gagan Agrawal, and Bin Ren. 2021. DNNFusion: Accelerating deep neural networks execution with advanced operator fusion. In Proceedings of the 42nd ACM SIGPLAN International Conference on Programming Language Design and Implementation (PLDI '21). ACM, New York, NY, 883–898. https://doi.org/10.1145/3453483.3454083
- [46] NVIDIA. 2013. cuBLAS. Retrieved December 7, 2023 from https://developer.nvidia.com/cublas
- [47] Sylvain Paris, Samuel W. Hasinoff, and Jan Kautz. 2015. Local Laplacian filters: Edge-aware image processing with a Laplacian pyramid. *Communications of the ACM* 58, 3 (Feb. 2015), 81–91. https://doi.org/10.1145/2723694
- [48] Sylvain Paris, Pierre Kornprobst, and Jack Tumblin. 2009. Bilateral Filtering. Now Publishers, Hanover, MA.
- [49] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zach DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. 2019. PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems. 8026–8037
- [50] Arjun Pitchanathan, Christian Ulmann, Michel Weber, Torsten Hoefler, and Tobias Grosser. 2021. FPL: Fast presburger arithmetic through transprecision. *Proceedings of the ACM on Programming Languages* 5, OOPSLA (Oct. 2021), Article 162, 26 pages. https://doi.org/10.1145/3485539
- [51] Louis-Noël Pouchet, Uday Bondhugula, Cédric Bastoul, Albert Cohen, J. Ramanujam, P. Sadayappan, and Nicolas Vasilache. 2011. Loop transformations: Convexity, pruning and optimization. In Proceedings of the 38th Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages (POPL '11). ACM, New York, NY, 549–562. https://doi.org/10.1145/1926385.1926449
- [52] Louis-Noël Pouchet and Tomofumi Yuki. 2016. PolyBench/C 4.2. Retrieved December 7, 2023 from https://sourceforge.net/projects/polybench
- [53] William Pugh and Evan Rosser. 2000. Iteration space slicing for locality. In Languages and Compilers for Parallel Computing, Larry Carter and Jeanne Ferrante (Eds.). Springer, Berlin, Germany, 164–184.
- [54] Jonathan Ragan-Kelley, Connelly Barnes, Andrew Adams, Sylvain Paris, Frédo Durand, and Saman Amarasinghe. 2013. Halide: A language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines. In Proceedings of the 34th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI '13). ACM, New York, NY, 519–530. https://doi.org/10.1145/2491956.2462176
- [55] Samyam Rajbhandari, Jinsung Kim, Sriram Krishnamoorthy, Louis-Noël Pouchet, Fabrice Rastello, Robert J. Harrison, and P. Sadayappan. 2016. On fusing recursive traversals of K-d trees. In Proceedings of the 25th International Conference on Compiler Construction (CC '16). ACM, New York, NY, 152–162. https://doi.org/10.1145/2892208.2892228
- [56] Joseph Redmon and Ali Farhadi. 2017. YOLO9000: Better, faster, stronger. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR '17). 6517–6525. https://doi.org/10.1109/CVPR.2017.690
- [57] Karen Simonyan and Andrew Zisserman. 2014. Very Deep convolutional networks for large-scale image recognition. arXiv:cs.CV/1409.1556 (2014).
- [58] Ramakrishna Upadrasta and Albert Cohen. 2013. Sub-polyhedral scheduling using (unit-)two-variable-per-inequality polyhedra. In Proceedings of the 40th Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages (POPL '13). ACM, New York, NY, 483–496. https://doi.org/10.1145/2429069.2429127

- [59] Nicolas Vasilache, Oleksandr Zinenko, Theodoros Theodoridis, Priya Goyal, Zachary Devito, William S. Moses, Sven Verdoolaege, Andrew Adams, and Albert Cohen. 2019. The next 700 accelerated layers: From mathematical expressions of network computation graphs to accelerated GPU kernels, automatically. ACM Transactions on Architecture and Code Optimization 16, 4 (Oct. 2019), Article 38, 26 pages. https://doi.org/10.1145/3355606
- [60] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Proceedings of the 31st International Conference on Neural Information Processing Systems (NIPS '17). 6000–6010.
- [61] Sven Verdoolaege. 2010. Isl: An integer set library for the polyhedral model. In Proceedings of the Third International Congress Conference on Mathematical Software (ICMS'10). Springer-Verlag, Berlin, Heidelberg, 299–302. https://doi. org/10.1007/978-3-642-15582-6\_49
- [62] Sven Verdoolaege, Juan Carlos Juega, Albert Cohen, José Ignacio Gómez, Christian Tenllado, and Francky Catthoor. 2013. Polyhedral parallel code generation for CUDA. ACM Transactions on Architecture and Code Optimization 9, 4, Article 54 (Jan. 2013), 23 pages. https://doi.org/10.1145/2400682.2400713
- [63] Sven Verdoolaege and Gerda Janssens. 2017. Scheduling for PPCG. Report CW 706. KU Leuven.
- [64] Jie Zhao and Albert Cohen. 2019. Flextended tiles: A flexible extension of overlapped tiles for polyhedral compilation. ACM Transactions on Architecture and Code Optimization 16, 4 (Dec. 2019), Article 47, 25 pages. https://doi.org/10.1145/3369382
- [65] Jie Zhao and Peng Di. 2020. Optimizing the memory hierarchy by compositing automatic transformations on computations and data. In *Proceedings of the 2020 53rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO '20)*. IEEE, Los Alamitos, CA, 427–441. https://doi.org/10.1109/MICRO50266.2020.00044
- [66] Jie Zhao, Xiong Gao, Ruijie Xia, Zhaochuang Zhang, Deshi Chen, Lei Chen, Renwei Zhang, Zhen Geng, Bin Cheng, and Xuefeng Jin. 2022. Apollo: Automatic partition-based operator fusion through layer by layer optimization. In Proceedings of Machine Learning and Systems, Vol. 4. 1–19.
- [67] Jie Zhao, Michael Kruse, and Albert Cohen. 2018. A polyhedral compilation framework for loops with dynamic data-dependent bounds. In *Proceedings of the 27th International Conference on Compiler Construction (CC '18)*. ACM, New York, NY, 14–24. https://doi.org/10.1145/3178372.3179509
- [68] Jie Zhao, Bojie Li, Wang Nie, Zhen Geng, Renwei Zhang, Xiong Gao, Bin Cheng, Chen Wu, Yun Cheng, Zheng Li, Peng Di, Kun Zhang, and Xuefeng Jin. 2021. AKG: Automatic kernel generation for neural processing units using polyhedral transformations. In Proceedings of the 42nd ACM SIGPLAN International Conference on Programming Language Design and Implementation (PLDI '21). ACM, New York, NY, 1233–1248. https://doi.org/10.1145/3453483.3454106
- [69] Lianmin Zheng, Chengfan Jia, Minmin Sun, Zhao Wu, Cody Hao Yu, Ameer Haj-Ali, Yida Wang, Jun Yang, Danyang Zhuo, Koushik Sen, Joseph E. Gonzalez, and Ion Stoica. 2020. Ansor: Generating high-performance tensor programs for deep learning. In Proceedings of the 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI '20). 863–879. https://www.usenix.org/conference/osdi20/presentation/zheng
- [70] Oleksandr Zinenko, Sven Verdoolaege, Chandan Reddy, Jun Shirako, Tobias Grosser, Vivek Sarkar, and Albert Cohen. 2018. Modeling the conflicting demands of parallelism and temporal/spatial locality in affine scheduling. In Proceedings of the 27th International Conference on Compiler Construction (CC '18). ACM, New York, NY, 3–13. https://doi.org/10. 1145/3178372.3179507

Received 21 March 2022; revised 11 April 2023; accepted 1 November 2023