Optimization passes

With a program in hand as a computation graph, the compiler makes it faster by applying a sequence of optimization passes: graph-to-graph rewrites that leave the computed result unchanged while lowering its cost. Some are classic compiler transformations that would speed up any program; others are specific to CKKS, where the costs that matter most are multiplicative depth (which forces expensive bootstraps), the level each ciphertext sits at, and the sheer volume of polynomial arithmetic feeding the GPU. The passes run at both levels of the IR: the first three rewrite the ciphertext graph before it is lowered, and the rest rewrite the polynomial graph afterward, closer to the hardware.

Classic simplifications

[WORK IN PROGRESS]

The standard compiler cleanups, applied to the ciphertext graph: common-subexpression elimination so a repeated computation runs only once, dead-code elimination to prune nodes whose results are never used, and constant folding to evaluate any plaintext-only subgraph ahead of time.
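The three cleanups can be sketched on a toy graph. This is a minimal illustration, not the compiler's actual IR: the `Node` type, the `simplify` function, and the convention that the graph is a dict in topological order are all assumptions made for the example, and plaintext constants are stood in for by ordinary integers.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Node:
    op: str           # "input", "const", "add", or "mul" (hypothetical op set)
    args: tuple = ()  # operand ids; for "input" the name, for "const" the value

FOLD = {"add": lambda a, b: a + b, "mul": lambda a, b: a * b}

def simplify(nodes, outputs):
    """nodes: dict id -> Node in topological order; outputs: list of ids."""
    canon, seen, kept = {}, {}, {}
    for nid, node in nodes.items():
        if node.op in FOLD:
            args = tuple(canon[a] for a in node.args)
            if all(kept[a].op == "const" for a in args):
                # Constant folding: every operand is plaintext, so evaluate now.
                value = FOLD[node.op](*(kept[a].args[0] for a in args))
                node = Node("const", (value,))
            else:
                node = Node(node.op, args)
        if node in seen:
            # CSE: an identical node already exists, so reuse it.
            canon[nid] = seen[node]
        else:
            seen[node] = canon[nid] = nid
            kept[nid] = node
    # DCE: keep only the nodes reachable from the outputs.
    roots = [canon[o] for o in outputs]
    live, stack = set(), list(roots)
    while stack:
        nid = stack.pop()
        if nid not in live:
            live.add(nid)
            if kept[nid].op in FOLD:
                stack.extend(kept[nid].args)
    return {n: v for n, v in kept.items() if n in live}, roots

graph = {
    "x": Node("input", ("x",)),
    "c2": Node("const", (2,)),
    "c3": Node("const", (3,)),
    "k": Node("mul", ("c2", "c3")),   # plaintext-only, folds to a constant
    "a": Node("mul", ("x", "x")),
    "b": Node("mul", ("x", "x")),     # duplicate of "a"
    "s": Node("add", ("a", "b")),
    "t": Node("mul", ("s", "k")),
    "dead": Node("add", ("x", "c2")), # never used by an output
}
opt, roots = simplify(graph, ["t"])
```

In this example `b` is merged into `a`, `k` becomes a single constant, and `c2`, `c3`, and `dead` are pruned. The sketch matches operands in the order written, so it would not recognize `x * y` and `y * x` as the same computation; a fuller pass would canonicalize commutative operands first.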

Depth reduction

[WORK IN PROGRESS]

Multiplicative depth is what forces bootstraps, so this pass restructures the graph to use as little of it as possible, for example by rebalancing chains of multiplications into trees, so that a product of n ciphertexts is computed at depth ⌈log2 n⌉ instead of n − 1.
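Rebalancing can be sketched as follows, assuming a hypothetical expression encoding in which a leaf is a string and a multiplication is a tuple `("mul", left, right)`. The rewrite relies only on associativity: it flattens a product into its factors and rebuilds it as a balanced tree, keeping the number of multiplications the same.

```python
def factors(expr):
    """Flatten nested multiplications into the list of leaf operands, in order."""
    if isinstance(expr, str):
        return [expr]
    _, left, right = expr
    return factors(left) + factors(right)

def balanced(xs):
    """Build a balanced multiplication tree over the operands xs."""
    if len(xs) == 1:
        return xs[0]
    mid = len(xs) // 2
    return ("mul", balanced(xs[:mid]), balanced(xs[mid:]))

def rebalance(expr):
    return balanced(factors(expr))

def depth(expr):
    """Multiplicative depth of an expression."""
    if isinstance(expr, str):
        return 0
    return 1 + max(depth(expr[1]), depth(expr[2]))

def count_muls(expr):
    if isinstance(expr, str):
        return 0
    return 1 + count_muls(expr[1]) + count_muls(expr[2])

# A left-leaning chain ((((x0 * x1) * x2) * ...) * x7) over eight ciphertexts.
chain = "x0"
for i in range(1, 8):
    chain = ("mul", chain, f"x{i}")

tree = rebalance(chain)
```

For the eight-operand chain the depth falls from 7 to 3 while the multiplication count stays at 7.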

Level and scale management

[WORK IN PROGRESS]

This pass decides where level drops and rescales go, placing them as early as the scale bookkeeping allows so that ciphertexts spend more of the computation at lower levels, where every polynomial operation is cheaper.
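How the position of a rescale changes total cost can be illustrated with a deliberately crude model. The assumption, labeled as such, is that every operation on a ciphertext at level l costs l + 1 units (one per RNS limb); real costs differ per operation, and the op names and `total_cost` function are hypothetical.

```python
def total_cost(ops, start_level):
    """Sum a toy per-op cost of (level + 1); a "rescale" then drops one level."""
    level, cost = start_level, 0
    for op in ops:
        cost += level + 1      # assumed: each op touches every limb once
        if op == "rescale":
            level -= 1
    return cost

# The same work with the rescale placed right after the multiply, or deferred
# to the end. Both orders are assumed valid for the scale bookkeeping.
early = ["mul", "rescale", "rot", "rot", "rot", "add"]
late = ["mul", "rot", "rot", "rot", "add", "rescale"]

cost_early = total_cost(early, start_level=10)
cost_late = total_cost(late, start_level=10)
```

Under this model the early placement costs 62 units and the late one 66, because the three rotations and the addition run one level lower.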

NTT and inverse-NTT minimization

[WORK IN PROGRESS]

Working on the polynomial graph, this pass inserts a transform only where an operation actually needs the other representation, and removes redundant ones by keeping a polynomial in evaluation form across consecutive multiplications instead of converting back and forth.
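The removal of redundant transforms can be sketched as a peephole rewrite over the sequence of operations applied to one polynomial. The op names below are hypothetical, and `"mul"` stands for a pointwise multiplication by an operand that is assumed to be in evaluation form already.

```python
INVERSE = {"ntt": "intt", "intt": "ntt"}

def cancel_transforms(ops):
    """Drop any transform that is immediately undone by its inverse."""
    out = []
    for op in ops:
        if out and INVERSE.get(out[-1]) == op:
            out.pop()          # the pair is a round trip, so remove both
        else:
            out.append(op)
    return out

# Naive lowering converts back to coefficient form after every multiplication.
naive = ["ntt", "mul", "intt", "ntt", "mul", "intt"]
optimized = cancel_transforms(naive)
```

Here the `intt`/`ntt` pair between the two multiplications disappears, leaving `["ntt", "mul", "mul", "intt"]`. A pass over a full graph would also have to check that no other consumer needs the intermediate in coefficient form before removing the pair.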

Kernel fusion

[WORK IN PROGRESS]

Adjacent pointwise polynomial operations are merged into a single GPU kernel, so that intermediate results stay in registers instead of being written out to memory and read back, cutting the memory traffic that these bandwidth-bound kernels are limited by.
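The effect of fusion can be shown in plain Python on a modular multiply followed by an add. This is only a model of the idea, with a toy modulus `Q` and list-based polynomials chosen for the example; on the GPU the fused version would be a single kernel and the per-element temporary would live in a register.

```python
Q = 97  # toy modulus, assumed for illustration

def mul_mod(a, b):
    return [(x * y) % Q for x, y in zip(a, b)]

def add_mod(a, b):
    return [(x + y) % Q for x, y in zip(a, b)]

def unfused(a, b, c):
    # Two passes: the product t is written out in full, then read back.
    # For length-N inputs: 4N element reads and 2N element writes.
    t = mul_mod(a, b)
    return add_mod(t, c)

def fused(a, b, c):
    # One pass: the product is a per-element temporary, never stored as a list.
    # For length-N inputs: 3N element reads and N element writes.
    return [((x * y) % Q + z) % Q for x, y, z in zip(a, b, c)]

a = [3, 50, 96, 7]
b = [5, 60, 96, 0]
c = [90, 1, 2, 3]
```

Both versions return the same polynomial. The fused one moves 4N elements instead of 6N, which is where the saving comes from when the kernels are limited by bandwidth.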

Scheduling and liveness

[WORK IN PROGRESS]

The order in which independent operations run determines how many large polynomials are live at once, so this pass schedules the graph to keep the peak working set within the GPU's memory.
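The dependence of the working set on the order can be sketched by counting live values under a given schedule. The assumptions here are that every node produces one polynomial, that a value stays live until its last consumer has run, and that an operation's result and its operands are resident at the same time; `peak_live` and the example graph are hypothetical.

```python
def peak_live(order, deps):
    """order: schedule as a list of node ids; deps: node id -> operand ids."""
    last_use = {}
    for step, node in enumerate(order):
        for d in deps[node]:
            last_use[d] = step
    live, peak = set(), 0
    for step, node in enumerate(order):
        live.add(node)
        peak = max(peak, len(live))
        # Free operands whose last consumer is this node.
        live -= {d for d in deps[node] if last_use[d] == step}
    return peak

# s = (a * b) + (c * d), where a, b, c, d are each produced by their own node.
deps = {
    "a": [], "b": [], "c": [], "d": [],
    "ab": ["a", "b"], "cd": ["c", "d"],
    "s": ["ab", "cd"],
}
breadth_first = ["a", "b", "c", "d", "ab", "cd", "s"]
depth_first = ["a", "b", "ab", "c", "d", "cd", "s"]
```

Producing all four inputs first peaks at five live polynomials, while finishing `ab` before starting on `c` and `d` peaks at four. The gap widens as the graph has more independent branches.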