Part 4. How is CKKS implemented?

Part 3 explained how CKKS works. But knowing how the scheme works doesn't translate directly into fast, easy-to-use, flexible code. Our system answers two questions.

First: when a user writes a sequence of ciphertext operations, what's the best sequence of polynomial operations to actually run? When should ciphertext levels be dropped? Where should NTTs and inverse NTTs be inserted? We want the compiler to make these decisions automatically.

Second: how should polynomials be stored in GPU memory? GPU architecture rewards certain data layouts and punishes others, and the right choice can change performance by an order of magnitude.

Our system addresses both questions. Developers express their computations as Python programs that operate on ciphertexts. The compiler turns each program into a graph, rewrites it through a series of optimization passes, decides on the data layout, and compiles the result into CUDA kernels.
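To make the first two steps of that pipeline concrete, here is a minimal sketch of how a Python program on ciphertexts can be traced into a graph and then rewritten by a pass. Every name in it (Node, Graph, Ciphertext, trace, dedupe) is hypothetical and chosen for illustration; it is not our system's API, and the dedupe pass is a generic stand-in rather than one of the passes described in 4.2.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    op: str            # "input", "add", or "mul"
    args: tuple = ()   # indices of operand nodes in Graph.nodes
    name: str = ""     # only used for inputs


@dataclass
class Graph:
    nodes: list = field(default_factory=list)

    def emit(self, op, args=(), name=""):
        """Append a node and return its index."""
        self.nodes.append(Node(op, tuple(args), name))
        return len(self.nodes) - 1


class Ciphertext:
    """Symbolic handle (hypothetical): operators record nodes, they compute nothing."""

    def __init__(self, graph, idx):
        self.graph, self.idx = graph, idx

    def __add__(self, other):
        return Ciphertext(self.graph, self.graph.emit("add", (self.idx, other.idx)))

    def __mul__(self, other):
        return Ciphertext(self.graph, self.graph.emit("mul", (self.idx, other.idx)))


def trace(fn, *names):
    """Run fn on symbolic inputs and return (graph, index of the output node)."""
    g = Graph()
    inputs = [Ciphertext(g, g.emit("input", name=n)) for n in names]
    return g, fn(*inputs).idx


def dedupe(graph, output):
    """Toy pass (stand-in): merge nodes that have the same op and operands."""
    new, remap, seen = Graph(), {}, {}
    for old_idx, node in enumerate(graph.nodes):  # nodes are in topological order
        args = tuple(remap[a] for a in node.args)
        key = (node.op, args, node.name)
        if key not in seen:
            seen[key] = new.emit(node.op, args, node.name)
        remap[old_idx] = seen[key]
    return new, remap[output]


def program(x, y):
    # The user writes ordinary Python; x * y appears twice on purpose.
    return (x * y) + (x * y)


graph, out = trace(program, "x", "y")
optimized, optimized_out = dedupe(graph, out)
print(len(graph.nodes), "nodes before,", len(optimized.nodes), "nodes after")
```

The point of the sketch is the shape of the flow: user code never touches polynomials directly, so the compiler is free to rewrite the graph before any GPU code is generated.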

By the end of this section, you should have a clear sense of how a CKKS program goes from a high-level computation graph to fast GPU code. We'll cover:

the IR our system uses to represent CKKS programs (4.1),
the optimization passes we apply to it (4.2), and
how polynomials are laid out in GPU memory and why it matters (4.3).
