Cost model
The cost of a CKKS program has two parts: the time it takes to run and the space it needs to hold its keys and ciphertexts. This page covers both at the default parameters.
Time cost
It is most useful to reason about running time at three increasing levels of granularity.
Level 1: minimize bootstrapping
In typical applications, bootstrapping accounts for roughly $50\%$ to $80\%$ of the total computation time. The first-order rule is thus to minimize the number of bootstraps. This mostly comes down to keeping the multiplicative depth low.
Level 2: minimize key-switching
Once bootstrapping is under control, the next dominant cost is key-switching, the subroutine underlying rotation and the multiplication of two ciphertexts. A useful accounting is in terms of how many key-switches each operation incurs:
- A single rotation or conjugation incurs one key-switch.
- A dot product of two length-$k$ lists of ciphertexts, which computes the sum of pointwise products $\sum_{i=1}^{k} \mathbf{u}_i \odot \mathbf{v}_i$, incurs a single key-switch regardless of $k$.
- Polynomial evaluation of degree $d$ costs about $\tfrac{3}{2}\sqrt{2d}$ key-switches.
- Matrix multiplication costs about $2\sqrt{m}$ key-switches, where $m$ is the number of nonzero diagonals of the matrix.
Level 3: benchmark it
The heuristics above are only approximate. In particular, every computation runs more slowly at a higher level than the same computation at the lower level, and this relationship is hard to capture in a simple formula. For anything more precise, benchmark it directly.
Space cost
The space a program needs is the sum of two things: the keys it has generated and the ciphertexts it holds live at any one time.
- Each public key (a rotation, relinearization, or conjugation key) is about $126$ MB.
- A ciphertext at level $\ell$ takes about $(\ell + 1)$ MB.
Adding these up gives the working set, which must fit in GPU memory. A typical GPU has on the order of $16$ to $80$ GB. When a computation's working set is too large for a single GPU, whether from many keys or many large intermediate ciphertexts held at once, the work can be sharded across additional GPUs.