Composite operations

This section covers four composite ciphertext operations: $k$-ary product, dot product, matrix multiplication, and polynomial evaluation. Each of these can be implemented by stringing together the fundamental operations from Section 3.3, but is exposed as its own operation for two reasons:

convenience: they show up often enough in practice that callers shouldn't have to assemble them from scratch every time, and
efficiency: a fused implementation can be more efficient than the naive composition.

$k$-ary product

The $k$-ary product operation multiplies $k$ ciphertexts together with optimal depth. The algorithm maintains a priority queue of the input ciphertexts keyed by level, repeatedly popping the two highest-level ciphertexts, multiplying them, and pushing the result back. When all $k$ ciphertexts start at the same level, this takes $\lceil \log_2 k \rceil$ depth.

Exercise: the multiplications above use level drops whenever the two input ciphertexts are not at the same level. Find an algorithm that uses at most one level drop in the entire process.

Dot product

The dot product operation evaluates the pointwise dot product of two lists, where each list element is either a plaintext or a ciphertext. Internally, it is delegated into two parts: the plaintext–ciphertext part (covering the index pairs where one side is a plaintext) and the ciphertext–ciphertext part (covering the index pairs where both sides are ciphertexts). Fusing the $k$ summands into a single accumulation also avoids the per-multiplication rescales and key-switches that the naive version would incur.

To compute a plaintext–ciphertext dot product of $k$ summands, first level-match the plaintexts and ciphertexts at a common level $\ell$ by re-encoding the plaintexts and level-dropping the ciphertexts as needed. Then, given level-$\ell$ plaintexts $m^{(1)}, \ldots, m^{(k)}$ encoding vectors $\mathbf{u}^{(1)}, \ldots, \mathbf{u}^{(k)} \in \mathbb{C}^{32768}$ and level-$\ell$ ciphertexts $(c_0^{(1)}, c_1^{(1)}), \ldots, (c_0^{(k)}, c_1^{(k)})$ encrypting vectors $\mathbf{v}^{(1)}, \ldots, \mathbf{v}^{(k)} \in \mathbb{C}^{32768}$, compute

$$ \begin{aligned} c_0' &= \underset{q_0 \cdots q_\ell \,\to\, q_0 \cdots q_{\ell - 1}}{\mathrm{rescale}_{\mathrm{approx}}}\!\Bigl(\textstyle\sum_{i=1}^{k} m^{(i)}\, c_0^{(i)}\Bigr), \\ c_1' &= \underset{q_0 \cdots q_\ell \,\to\, q_0 \cdots q_{\ell - 1}}{\mathrm{rescale}_{\mathrm{approx}}}\!\Bigl(\textstyle\sum_{i=1}^{k} m^{(i)}\, c_1^{(i)}\Bigr). \end{aligned} $$

The pair $(c_0', c_1')$ encrypts $\sum_{i=1}^{k} \mathbf{u}^{(i)} \odot \mathbf{v}^{(i)}$, where $\odot$ denotes the pointwise (Hadamard) product. Fusing the $k$ multiplications into a single accumulation uses only two rescales instead of $2k$ rescales.

To compute a ciphertext–ciphertext dot product of $k$ summands, first level-drop all $2k$ input ciphertexts to a common level $\ell$. Then, given level-$\ell$ ciphertexts $(c_0^{(1)}, c_1^{(1)}), \ldots, (c_0^{(k)}, c_1^{(k)})$, $(d_0^{(1)}, d_1^{(1)}), \ldots, (d_0^{(k)}, d_1^{(k)})$ encrypting vectors $\mathbf{u}^{(1)}, \ldots, \mathbf{u}^{(k)} \in \mathbb{C}^{32768}$, $\mathbf{v}^{(1)}, \ldots, \mathbf{v}^{(k)} \in \mathbb{C}^{32768}$ respectively, form the triple

$$ \begin{aligned} c_0' &= \textstyle\sum_{i=1}^{k} c_0^{(i)}\, d_0^{(i)}, \\ c_1' &= \textstyle\sum_{i=1}^{k} \bigl(c_0^{(i)}\, d_1^{(i)} + c_1^{(i)}\, d_0^{(i)}\bigr), \\ c_2' &= \textstyle\sum_{i=1}^{k} c_1^{(i)}\, d_1^{(i)}. \end{aligned} $$

The same expansion as before shows that $c_0' + c_1'\, s + c_2'\, s^2$ approximates $\sum_{i=1}^{k} m^{(i)} n^{(i)}$ to relative error $O(2^{-40})$, where $m^{(i)}, n^{(i)}$ are the plaintexts encrypted by the $i$th ciphertexts. Then, applying a key-switch to $c_2'$ with $\mathrm{rlk}$ and rescaling down by one prime gives a ciphertext encrypting $\sum_{i=1}^{k} \mathbf{u}^{(i)} \odot \mathbf{v}^{(i)}$. Fusing the $k$ multiplications into a single accumulation saves $k - 1$ key-switches.

Exercise: this is not the most optimal way to evaluate a general dot product. Optimize it by carefully combining level drops.

Matrix multiplication

The matrix multiplication operation takes a plaintext $32768 \times 32768$ matrix $M$ and a level-$\ell$ ciphertext $\boxed{\mathbf{v}} = \mathrm{Encrypt}(\mathrm{Encode}(\mathbf{v}))$ encrypting a vector $\mathbf{v} \in \mathbb{C}^{32768}$, and produces a level-$(\ell - 1)$ encryption of $M\mathbf{v}$. The algorithm is most efficient when $M$ is diagonal-sparse.

To describe the matrix multiplication algorithm, first define some notation. For $d \in \{0, 1, \ldots, 32767\}$, define the $d$th diagonal of $M$ to be the vector

$$ \mathbf{m}_d = (M_{0,\, d},\; M_{1,\, d+1},\; M_{2,\, d+2},\; \ldots,\; M_{32767,\, d+32767}) \in \mathbb{C}^{32768}, $$

with indices taken modulo $32768$. Define $D \subseteq \{0, 1, \ldots, 32767\}$ to be the set of $d$ for which $\mathbf{m}_d \neq \mathbf{0}$. The expression

$$ \sum_{d \in D} \mathrm{Encode}(\mathbf{m}_d) \odot \mathrm{rot}_d(\,\boxed{\mathbf{v}}\,) $$

is then an encryption of $M\mathbf{v}$, where $\odot$ denotes plaintext–ciphertext pointwise (Hadamard) multiplication. Evaluating this expression naively takes $|D \setminus \{0\}|$ rotations; we can do better with a baby-step giant-step algorithm.

Exercise: verify that the above expression is an encryption of $M\mathbf{v}$.

Baby-step giant-step

The baby-step giant-step algorithm reduces the rotation count by factoring each rotation. For each $d \in D$, choose a representative pair $(b_d, g_d) \in \{0, 1, \ldots, 32767\}^2$ with $b_d + g_d \equiv d \pmod{32768}$. Let $B = \{b_d : d \in D\}$ and $G = \{g_d : d \in D\}$, and choose the representatives so that $|B \setminus \{0\}| + |G \setminus \{0\}|$ is minimal.

Define $\sigma : \mathbb{C}^{32768} \to \mathbb{C}^{32768}$ to be the cyclic shift by one index, so that

$$ \sigma(v_0, v_1, \ldots, v_{32767}) = (v_1, v_2, \ldots, v_{32767}, v_0). $$

Then the encryption of $M\mathbf{v}$ rearranges to

$$ \sum_{g \in G} \mathrm{rot}_g\!\left( \sum_{\substack{d \in D \\ g_d = g}} \mathrm{Encode}(\sigma^{-g}(\mathbf{m}_d)) \odot \mathrm{rot}_{b_d}(\,\boxed{\mathbf{v}}\,) \right). $$

The plaintexts $\mathrm{Encode}(\sigma^{-g}(\mathbf{m}_d))$ can be precomputed. At runtime, the work splits into three stages:

baby steps: for each $b \in B \setminus \{0\}$, compute $\mathrm{rot}_b(\,\boxed{\mathbf{v}}\,)$,
inner sums: for each $g \in G$, fuse the inner $b$-sum into a single plaintext–ciphertext dot product against the precomputed plaintexts, and
giant steps: for each $g \in G \setminus \{0\}$, apply $\mathrm{rot}_g$ to the corresponding inner sum, and accumulate the $|G|$ rotated sums into one ciphertext.

This uses $|B \setminus \{0\}|$ rotations in the baby steps and $|G \setminus \{0\}|$ in the giant steps, for a total of $|B \setminus \{0\}| + |G \setminus \{0\}|$ rotations. For most structured matrices, $B$ and $G$ can be chosen so that this is much better than the naive approach.

Exercise: show that any $B, G$ covering $D$ via the sumset satisfy $|B| + |G| \geq 2\sqrt{|D|}$, and explain why $\lceil 2\sqrt{|D|}\,\rceil$ rotations suffice when the nonzero diagonal indices form an arithmetic progression.

Hoisting

Hoisting is an optimization that shares the modulus-raising step of the rotation key-switch across many rotations of the same ciphertext, such as what happens in the baby-step stage of matrix multiplication. To hoist many rotations of a ciphertext $\boxed{\mathbf{v}}$, let $\boxed{\mathbf{v}} = (c_0, c_1)$ and recall from Section 3.3 that each rotation key-switch begins by lifting $\mathrm{rot}_b(c_1)$ into the extended ring $\tilde R_\ell$ via $\lfloor \ell / 3 \rfloor + 1$ approximate modulus raises:

$$ (\tau_{5^b}(c_1))^{(i)} \;=\; \underset{Q_i \,\to\, q_0 \cdots q_\ell\, p_0 p_1 p_2}{\mathrm{modraise}_{\mathrm{approx}}}\!\bigl(\tau_{5^b}(c_1) \bmod Q_i\bigr). $$

Because $\tau_{5^b}$ acts on polynomial coefficients by permuting them and flipping some signs, it commutes with both modular reduction and approximate modulus raising, giving

$$ (\tau_{5^b}(c_1))^{(i)} \;=\; \tau_{5^b}\bigl(c_1^{(i)}\bigr). $$

So the lifts $c_1^{(0)}, c_1^{(1)}, \ldots, c_1^{(\lfloor \ell/3 \rfloor)}$ can be computed once on $c_1$, and for each $b \in B \setminus \{0\}$ the per-rotation lifts can be obtained by applying the cheap permutation $\tau_{5^b}$. This reduces the baby-step modulus-raising work from $|B \setminus \{0\}|$ batches of lifts to one lift.

Exercise: check that $\tau_{5^b}$ commutes with approximate modulus raising.

Double hoisting

Double hoisting is an optimization that shares the rescaling step of the rotation key-switch across rotations whose outputs are summed, such as what happens in the giant-step stage of matrix multiplication. To double-hoist a sum of rotations of ciphertexts

$$ \sum_{g \in G \setminus \{0\}} \mathrm{rot}_g(\,\boxed{\mathbf{w}_g}\,) $$

at the same level, let $(\tilde k_0^{(g)}, \tilde k_1^{(g)}) \in \tilde R_\ell \times \tilde R_\ell$ denote the unrescaled output of the $g$th rotation key-switch and recall from Section 3.3 that the rescaled outputs are

$$ \begin{aligned} k_0^{(g)} &= \mathrm{rescale}_{\mathrm{approx}}(\tilde k_0^{(g)}), \\ k_1^{(g)} &= \mathrm{rescale}_{\mathrm{approx}}(\tilde k_1^{(g)}) \end{aligned} $$

for $g \in G \setminus \{0\}$. Since

$$ \begin{aligned} \sum_{g \in G \setminus \{0\}} \mathrm{rescale}_{\mathrm{approx}}\!\bigl(\tilde k_0^{(g)}\bigr) &\approx \mathrm{rescale}_{\mathrm{approx}}\!\Bigl(\textstyle\sum_{g \in G \setminus \{0\}} \tilde k_0^{(g)}\Bigr), \\ \sum_{g \in G \setminus \{0\}} \mathrm{rescale}_{\mathrm{approx}}\!\bigl(\tilde k_1^{(g)}\bigr) &\approx \mathrm{rescale}_{\mathrm{approx}}\!\Bigl(\textstyle\sum_{g \in G \setminus \{0\}} \tilde k_1^{(g)}\Bigr), \end{aligned} $$

the giant-step rescales can be deferred: accumulate the unrescaled $\tilde k_j^{(g)}$'s in $\tilde R_\ell$ first, then rescale once at the end. This reduces the giant-step rescaling count from $2\,|G \setminus \{0\}|$ rescales to two rescales.

Optimizing key count

Rotation keys take up a lot of space, so reducing the number of distinct rotation amounts used by the baby-step giant-step algorithm directly reduces the memory footprint of the matrix multiplication. The two sums in the algorithm — the inner sum over $B$ and the outer sum over $G$ — each admit different optimizations when their rotation amounts form arithmetic progressions.

Inner sum (over $B$). The baby-step rotations $\mathrm{rot}_b(\boxed{\mathbf{v}})$ for $b \in B$ can be computed in sequence from a single base rotation key. When $B$ is an arithmetic progression $\{0, d, 2d, \ldots, (k-1) d\}$, the rotations are obtained by iterated application of $\mathrm{rot}_d$:

$$ \mathrm{rot}_{i\, d}(\boxed{\mathbf{v}}) = \underbrace{\mathrm{rot}_d \circ \cdots \circ \mathrm{rot}_d}_{i \text{ times}}(\boxed{\mathbf{v}}). $$

This uses only the single rotation key $\mathrm{rot}_d$, saving $|B \setminus \{0\}| - 1$ keys. The trade-off is that the rotations now operate on a chain of distinct ciphertexts rather than on the same $\boxed{\mathbf{v}}$, so hoisting no longer applies.

Outer sum (over $G$). The giant-step sum $\sum_{g \in G} \mathrm{rot}_g(\boxed{\mathbf{w}_g})$, where each $\boxed{\mathbf{w}_g}$ is the inner sum at giant step $g$, cannot be reduced by precomputing rotations of a single ciphertext, because the $\boxed{\mathbf{w}_g}$'s themselves differ. Instead, when $G$ is an arithmetic progression $\{0, d, 2d, \ldots, (k-1) d\}$, a Horner-style nesting reduces the giant-step rotation amounts to one:

$$ \sum_{i=0}^{k-1} \mathrm{rot}_{i\, d}(\boxed{\mathbf{w}_i}) = \boxed{\mathbf{w}_0} + \mathrm{rot}_d\Bigl(\boxed{\mathbf{w}_1} + \mathrm{rot}_d\bigl(\boxed{\mathbf{w}_2} + \cdots\bigr)\Bigr). $$

This saves $|G \setminus \{0\}| - 1$ rotation keys for giant steps. The trade-off is that the chain of giant-step rotations is sequential and operates on distinct accumulators, so double-hoisting no longer applies.

Exercise: the optimizations above only work directly when the rotation amounts form an arithmetic progression. For a generic set of rotation amounts $\{r_1, \ldots, r_k\}$, explain how it becomes a Steiner tree problem over $\mathbb{Z}/32768\mathbb{Z}$.

Polynomial evaluation

Because CKKS's fundamental operations are limited to addition, subtraction, multiplication, and rotation, the only functions that can be evaluated on a ciphertext are polynomials. Non-polynomial functions like $\sqrt{\,\cdot\,}$, $\lfloor \cdot \rfloor$, or $|\cdot|$ have to be approximated by polynomials first, so polynomial evaluation is a workhorse worth optimizing.

The polynomial evaluation operation takes a degree-$d$ polynomial with plaintext coefficients and a level-$\ell$ ciphertext $\boxed{\mathbf{v}}$, and returns a level-$(\ell - \lceil \log_2(d+1) \rceil)$ ciphertext encrypting the slotwise application of the polynomial to $\mathbf{v}$. There are two regimes, distinguished by the basis the polynomial is expressed in: low-degree evaluation uses the monomial basis $\{1, x, x^2, \ldots\}$, while high-degree evaluation uses a numerically stable Chebyshev-like basis.

Low-degree evaluation

For low-degree evaluation, write the polynomial as $p(x) = \sum_{i=0}^{d} c_i\, x^i$. One way to evaluate the polynomial is to first precompute the powers $x^2, x^3, \ldots, x^d$ via $x^i = x^{\lceil i/2 \rceil}\, x^{\lfloor i/2 \rfloor}$ and then compute the dot product $\sum_i c_i \cdot x^i$. However, this has two inefficiencies:

it uses depth $\lceil \log_2 d \rceil + 1$, one more than the optimum $\lceil \log_2(d+1) \rceil$, and
it requires $d - 1$ ciphertext–ciphertext multiplications, which is expensive.

The Paterson–Stockmeyer algorithm optimizes the ciphertext–ciphertext multiplication count. To evaluate a polynomial using the Paterson–Stockmeyer algorithm, pick a block size $s$ equal to a power of two near $\sqrt{d+1}$, and precompute

the baby steps $x, x^2, \ldots, x^s$ via $x^i = x^{\lceil i/2 \rceil}\, x^{\lfloor i/2 \rfloor}$, and
the giant steps $x^s, x^{2s}, \ldots, x^{\lfloor d/s \rfloor s}$ via $x^{is} = x^{\lceil i/2 \rceil s}\, x^{\lfloor i/2 \rfloor s}$.

Then write

$$ p(x) = \sum_{k=0}^{\lfloor d/s \rfloor} \Bigl(\sum_{i=0}^{s-1} c_{ks + i}\, x^i\Bigr)\, x^{ks}. $$

The $\lfloor d/s \rfloor$ outer multiplications by giant steps all sum together, so they fuse into a single ciphertext–ciphertext dot product sharing one key-switch. The total key-switch count is therefore $s + \lfloor d/s \rfloor - 1 \approx \tfrac{3}{2}\sqrt{2d}$ (the constant degrades from the ideal $2\sqrt{d}$ because $s$ must be a power of two and so deviates from $\sqrt{d+1}$ by up to a factor of $\sqrt{2}$), much better than the $d - 1$ key-switches of the naive approach.

Exercise: verify that with $s$ constrained to a power of two, the choice of $s$ closest to $\sqrt{d+1}$ minimizes this key-switch count, giving the $\tfrac{3}{2}\sqrt{2d}$ bound above.

However, this is still not depth-optimal: the topmost giant step $x^{\lfloor d/s \rfloor s}$ sits at depth $\lceil \log_2 d \rceil$, and multiplying its block by it pushes the total depth to $\lceil \log_2 d \rceil + 1$. To save the last level, replace the flat block sum with a recursive partition. Let $L$ be the smallest power of two greater than $d$, and compute the giant steps $x^s, x^{2s}, \ldots, x^{L/2}$. Then define a recursive evaluator $\mathrm{eval}(a, b)$ that returns the partial sum $\sum_{i \in [a, b)} c_i\, x^{i - a}$ by:

if $b - a \le 2$, returning the plaintext–ciphertext dot product $\sum_{i \in [a, b)} c_i\, x^{i - a}$ directly;
otherwise, splitting at the midpoint $m = (a + L) / 2$ of the enclosing power-of-two interval $[a, L)$ (so $m - a$ is a power of two), decomposing the fully populated lower half $[a, m)$ as a flat block sum $\sum_{k=0}^{(m-a)/s - 1} \bigl(\sum_{i=0}^{s-1} c_{a + ks + i}\, x^i\bigr)\, x^{ks}$, recursively computing $\mathrm{eval}(m, b)$ for the partially populated upper half $[m, b)$, multiplying by $x^{m - a}$ (one of the precomputed giant steps, or a baby step if $m - a < s$), and summing the two halves.

Calling $\mathrm{eval}(0, d+1)$ gives $p(x)$. For example, for $d = 15$ with $s = 4$, the recursion unfolds as

$\mathrm{eval}(0, 16)$, $m = 8$: lower half $c_0 + c_1 x + c_2 x^2 + c_3 x^3 + (c_4 + c_5 x + c_6 x^2 + c_7 x^3)\, x^4$; recurse on $[8, 16)$ and multiply by $x^8$.
$\mathrm{eval}(8, 16)$, $m = 12$: lower half $c_8 + c_9 x + c_{10} x^2 + c_{11} x^3$; recurse on $[12, 16)$ and multiply by $x^4$.
$\mathrm{eval}(12, 16)$, $m = 14$: lower half $c_{12} + c_{13} x$ (here $m - a = 2 < s$, so the giant step $x^2$ is a baby step); recurse on $[14, 16)$ and multiply by $x^2$.
$\mathrm{eval}(14, 16)$: base case, returns $c_{14} + c_{15} x$.

Combining gives the expansion

$$ \begin{aligned} p(x) =\;& c_0 + c_1 x + c_2 x^2 + c_3 x^3 \\ &+ (c_4 + c_5 x + c_6 x^2 + c_7 x^3)\, x^4 \\ &+ \bigl[(c_8 + c_9 x + c_{10} x^2 + c_{11} x^3) + \bigl(c_{12} + c_{13} x + (c_{14} + c_{15} x)\, x^2\bigr)\, x^4\bigr]\, x^8. \end{aligned} $$

Exercise: check that this uses optimal depth $\lceil \log_2(d+1) \rceil$.

Exercise: verify that with $s$ a power of two, the optimal block size for the depth-aware version is $s \approx \sqrt{L/2}$, and that the total key-switch count at this optimum is at most $\sqrt{2d} + \log_2 d$.

Exercise: the depth-aware analysis above assumes $d = 2^k - 1$ for some $k$ (so that $L = d + 1$). Optimize the algorithm further for $d$ not of this form, where $L$ may be much larger than $d + 1$.

High-degree evaluation

For large $d$, monomial evaluation is numerically unstable: one of $c_n$ and $x^n$ can be very large while the other is very small, so the intermediate $c_n x^n$ products cannot be computed within CKKS's $\sim 2^{-40}$ precision. The fix is to switch to a well-conditioned basis.

The Chebyshev basis of the first kind on $[-2, 2]$ is the sequence $\{\tilde T_n\}_{n \ge 0}$ given by

$$ \tilde T_0 = 2,\quad \tilde T_1(x) = x,\quad \tilde T_{n+1}(x) = x\, \tilde T_n(x) - \tilde T_{n-1}(x), $$

obtained by rescaling the usual Chebyshev polynomials via $\tilde T_n(x) = 2\, T_n(x/2)$. The first few entries are

$$ \begin{aligned} \tilde T_0(x) &= 2, \\ \tilde T_1(x) &= x, \\ \tilde T_2(x) &= x^2 - 2, \\ \tilde T_3(x) &= x^3 - 3x, \\ \tilde T_4(x) &= x^4 - 4x^2 + 2, \\ \tilde T_5(x) &= x^5 - 5x^3 + 5x, \\ \tilde T_6(x) &= x^6 - 6x^4 + 9x^2 - 2, \\ \tilde T_7(x) &= x^7 - 7x^5 + 14x^3 - 7x, \end{aligned} $$

and so on. The relevant quantity for numerical stability is the intermediate product $c_n\, \tilde T_n(x)$, which is numerically stable on $[-2, 2]$ because both factors are bounded:

the basis itself is bounded: substituting $x = 2\cos\theta$ gives $\tilde T_n(2\cos\theta) = 2\cos(n\theta)$, so $|\tilde T_n(x)| \le 2$ on $[-2, 2]$;
the coefficients are bounded: for nonzero $n$, the orthogonality relation gives $$ |c_n| = \left|\frac{1}{2\pi} \int_{-2}^{2} \frac{p(x)\, \tilde T_n(x)}{\sqrt{4 - x^2}}\, dx\right| \le \frac{1}{2\pi} \int_{-2}^{2} \frac{2\,\max_{x \in [-2, 2]} |p(x)|}{\sqrt{4 - x^2}}\, dx = \max_{x \in [-2, 2]} |p(x)|. $$

Paterson–Stockmeyer carries over to this basis because the $\tilde T_n$'s satisfy the product identity

$$ \tilde T_{m+n}(x) = \tilde T_m(x)\, \tilde T_n(x) - \tilde T_{|m-n|}(x). $$

The baby and giant steps become $\tilde T_1, \ldots, \tilde T_s$ and $\tilde T_s, \tilde T_{2s}, \ldots, \tilde T_{L/2}$, built via this identity with depth-balanced splits.

The recursive partition of $[0, d+1)$ is otherwise identical to the monomial case, but each split at $\tilde T_N$ now produces a cross-term that has to be absorbed into the lower-index coefficients:

$$ \sum_{n=N}^{2N-1} c_n\, \tilde T_n = \tilde T_N \cdot \Bigl(c_N + \sum_{j=1}^{N-1} c_{N+j}\, \tilde T_j\Bigr) - \sum_{j=1}^{N-1} c_{N+j}\, \tilde T_{N-j}. $$

The final negative sum has indices in $[1, N-1]$, which get folded into the existing coefficients of $\tilde T_1, \ldots, \tilde T_{N-1}$.

For example, for $d = 15$ with $s = 4$, the recursion produces the expansion

$$ \begin{aligned} p(x) =\;& \gamma_0 + \gamma_1 \tilde T_1 + \gamma_2 \tilde T_2 + \gamma_3 \tilde T_3 \\ &+ (\gamma_4 + \gamma_5 \tilde T_1 + \gamma_6 \tilde T_2 + \gamma_7 \tilde T_3)\, \tilde T_4 \\ &+ \bigl[(\gamma_8 + \gamma_9 \tilde T_1 + \gamma_{10} \tilde T_2 + \gamma_{11} \tilde T_3) + \bigl(\gamma_{12} + \gamma_{13} \tilde T_1 + (\gamma_{14} + \gamma_{15} \tilde T_1)\, \tilde T_2\bigr)\, \tilde T_4\bigr]\, \tilde T_8, \end{aligned} $$

where the $\gamma_i$ are linear combinations of the $c_j$:

$$ \begin{aligned} \gamma_0 &= c_0, \\ \gamma_1 &= c_1 - c_7 + c_9 - c_{15}, \\ \gamma_2 &= c_2 - c_6 + c_{10} - c_{14}, \\ \gamma_3 &= c_3 - c_5 + c_{11} - c_{13}, \\ \gamma_4 &= c_4 - c_{12}, \\ \gamma_5 &= c_5 - c_{11}, \\ \gamma_6 &= c_6 - c_{10}, \\ \gamma_7 &= c_7 - c_9, \\ \gamma_8 &= c_8, \\ \gamma_9 &= c_9 - c_{15}, \\ \gamma_{10} &= c_{10} - c_{14}, \\ \gamma_{11} &= c_{11} - c_{13}, \\ \gamma_{12} &= c_{12}, \\ \gamma_{13} &= c_{13} - c_{15}, \\ \gamma_{14} &= c_{14}, \\ \gamma_{15} &= c_{15}. \end{aligned} $$

The encrypted runtime and depth match the monomial case, up to the extra subtraction in each Chebyshev recursion step.

Note that this basis is only numerically stable on $[-2, 2]$: outside this range, $|\tilde T_n(x)|$ grows quickly with $n$. To evaluate a polynomial stably on a different interval, the input ciphertext should first be rescaled and shifted so that its slot values land in $[-2, 2]$.

Exercise: why not use the standard Chebyshev basis $T_n$ on $[-1, 1]$ instead of $\tilde T_n$ on $[-2, 2]$?

Per-slot evaluation

The coefficients $c_i$ in either algorithm above can be plaintext vectors with different values in different slots, since they only enter as plaintext–ciphertext multiplications. Thus, it is possible to run a different polynomial on each of the $32768$ slots of $\boxed{\mathbf{v}}$ in parallel; the runtime and depth are set by the largest degree across all slots.