DFT Template: Geometric Resilience in Steganography

Abstract

Geometric transformations – rotation, scaling, cropping, and their compositions – are fatal to DCT-domain steganography. A 5-degree rotation or a 10% downscale destroys the 8x8 block grid that all coefficient-domain methods depend on, driving the bit error rate to 50% (random chance). This paper presents the DFT magnitude template embedding system used in Phasm’s Armor mode to solve the geometric synchronization problem: recovering the encoder-decoder coordinate mapping after an unknown affine transform. We embed 32 passphrase-derived peaks in a mid-frequency annular band of the 2D DFT magnitude spectrum, exploiting the translation invariance and rotation/scale covariance of the Fourier transform. At decode time, a least-squares estimator recovers the rotation angle and scale factor from detected peak displacements, allowing the decoder to invert the transform and proceed with standard DCT-domain extraction. We derive the closed-form estimation formulas, analyze the detection confidence metric, describe the Hermitian symmetry constraint for real-valued images, and present a secondary DFT ring payload channel that embeds 64 raw bits (8 bytes) in the magnitude spectrum using QIM on annular sectors – designed as a resize-robust fallback when the primary STDM payload cannot be recovered, though current frame overhead limits its usable capacity. The complete system is implemented in pure Rust with deterministic arithmetic, producing bit-identical results across native and WebAssembly targets.

1. The Geometric Synchronization Problem

1.1 Why Geometry Breaks DCT-Domain Steganography

JPEG steganography embeds data by modifying quantized DCT coefficients within 8x8 pixel blocks. The encoder and decoder must agree on which coefficients carry which bits – a mapping defined by the block grid, the coefficient selection algorithm, and the passphrase-derived permutation. This agreement is the synchronization between encoder and decoder.

Armor mode achieves high-reliability JPEG recompression survival (>99% same-QF, >95% cross-library) precisely because JPEG recompression preserves the block grid. The decoder reads the same 8x8 blocks from the same grid positions, finds the same coefficients (most of which have survived re-quantization), and successfully extracts the message through Reed-Solomon error correction.

Geometric transformations destroy this synchronization completely. Consider what happens when a stego image undergoes even a modest rotation:

The spatial-domain pixels are interpolated onto a rotated grid.
The rotated image is re-blocked into new 8x8 DCT blocks.
The new blocks have no spatial correspondence to the original blocks – they cover entirely different image regions.
The decoder reads coefficients from blocks that bear no relation to the ones written by the encoder.
The extracted bits are effectively random: BER approaches 50%.

The same argument applies to scaling. Downscaling a 4032x3024 image to 2016x1512 halves each dimension, producing a new 8x8 block grid that samples entirely different spatial regions. Even a scale factor of 0.99 – barely perceptible to the eye – shifts every block boundary by a fraction of a pixel, and after re-quantization, the accumulated positional error corrupts the majority of embedded bits.

Cropping is equally destructive unless the crop happens to be block-aligned (a multiple of 8 pixels on each edge). Non-aligned crops shift the entire block grid, and the decoder has no way to determine the offset. Even block-aligned crops require the decoder to know which blocks survived – information that the standard Armor pipeline does not encode.
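The block-alignment condition can be made concrete with a short sketch. The function names and the (left, top) crop-offset convention are illustrative assumptions and are not taken from Phasm's code.

```rust
/// JPEG DCT block size in pixels.
const BLOCK: u32 = 8;

/// Offset of the first original 8x8 block boundary inside a cropped image,
/// given how many pixels were removed from the left and top edges.
/// (0, 0) means the original grid lines up with the new one.
fn grid_offset_after_crop(crop_left: u32, crop_top: u32) -> (u32, u32) {
    (
        (BLOCK - crop_left % BLOCK) % BLOCK,
        (BLOCK - crop_top % BLOCK) % BLOCK,
    )
}

/// True when the crop keeps the original block grid aligned.
fn is_block_aligned(crop_left: u32, crop_top: u32) -> bool {
    grid_offset_after_crop(crop_left, crop_top) == (0, 0)
}
```

A decoder that does not know the crop would have to try all 64 possible offsets, and even then it would not know which blocks survived.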

1.2 Prior Art: Three Paradigms for Geometric Resilience

The watermarking and steganography literature offers three broad strategies for surviving geometric transforms.

Feature-point registration (SIFT/SURF). Detect geometrically invariant image features (SIFT keypoints), use feature matching to estimate the applied transform, invert it, then extract the payload from the corrected image. The advantage is that no additional signal is embedded – the synchronization comes from the image content itself. The disadvantage is high implementation complexity, content-dependent feature counts (uniform images yield few features), and instability under aggressive JPEG compression that shifts keypoint positions.

Invariant-domain embedding (Log-Polar DFT / Fourier-Mellin Transform). Transform the image into a domain that is inherently rotation/scale invariant, then embed directly in that domain. The Fourier-Mellin Transform (FMT) maps the DFT magnitude into log-polar coordinates where rotation becomes a circular shift and scaling becomes a linear shift. The advantage is that no transform estimation is needed at decode. The disadvantage is severe capacity limitation – typically 32-256 bits (4-32 bytes) – far below Phasm’s target of short text messages under 1 KB.

Template/pilot signal embedding. Embed a known pattern (a set of peaks at predetermined frequencies in the DFT magnitude spectrum) alongside the data payload. At decode, search for the template to estimate the geometric transform parameters, invert the transform, then extract the data payload using the standard algorithm. The advantage is high compatibility with existing DCT-domain embedding systems and high capacity (the template consumes negligible bandwidth compared to the payload). The disadvantage is the template estimation attack (Holliman and Memon, 2000): if an adversary knows the template structure, they can estimate and remove the peaks.

1.3 Phasm’s Approach: Passphrase-Derived DFT Template

Phasm adopts the template paradigm because it offers the best combination of compatibility with the existing Armor STDM pipeline, high capacity (the data payload is unaffected), and reasonable implementation complexity. The key innovation is that template peak positions are passphrase-derived – generated from a ChaCha20 PRNG seeded by an Argon2id-derived key – so an attacker who does not know the passphrase cannot estimate or remove the template.

The template system operates in the 2D DFT magnitude spectrum, exploiting three fundamental properties:

Translation invariance: shifting the image in the spatial domain does not change the DFT magnitude spectrum.
Rotation covariance: rotating the spatial image by angle $\theta$ rotates the magnitude spectrum by the same angle $\theta$.
Scale covariance: scaling the spatial image by factor $s$ scales the magnitude spectrum by $1/s$.

These properties mean that a template embedded in the DFT magnitude can be detected after arbitrary rotation, scaling, and translation – and the peak displacements directly reveal the transform parameters.

2. DFT Magnitude Properties

2.1 The 2D Discrete Fourier Transform

The 2D DFT of an $M \times N$ image $f(x, y)$ is:

$$F(u, v) = \sum_{x=0}^{M-1} \sum_{y=0}^{N-1} f(x, y) \, e^{-j2\pi(ux/M + vy/N)}$$

where $u \in [0, M-1]$ and $v \in [0, N-1]$ are frequency indices, and $j = \sqrt{-1}$. Each complex coefficient $F(u, v) = |F(u, v)| \, e^{j\phi(u,v)}$ has a magnitude $|F(u, v)|$ and phase $\phi(u, v)$.

Phasm computes the 2D DFT using a custom in-house FFT engine that replaces the rustfft crate entirely. The engine implements radix-2 Cooley-Tukey for power-of-two dimensions and Bluestein’s chirp-z transform for arbitrary dimensions. All twiddle factors are computed via FDLIBM-based deterministic trigonometric functions (det_sincos) from the det_math module, which provides a complete port of FDLIBM deterministic transcendentals that guarantee bit-identical results across native ARM64, x86-64, and WebAssembly targets. This deterministic arithmetic is critical: the encoder and decoder must agree on exact peak positions, and any platform-dependent rounding in twiddle factors would cause positional disagreement. Memory optimizations in the Phase 3 geometric decode pipeline achieved an 81-90% reduction in peak memory usage compared to naive implementations, making the full 2D FFT + template detection + geometric correction pipeline practical on mobile devices and in WebAssembly’s constrained memory environment. (For details on the cross-platform math challenge, see our deep dive on deterministic math in WASM, which covers the FDLIBM port, det_math module, and in-house FFT engine.)

2.2 Translation Invariance

The shift theorem states that translating the spatial image by $(x_0, y_0)$ multiplies the DFT by a complex exponential:

$$\mathcal{F}\{f(x - x_0, y - y_0)\} = F(u, v) \cdot e^{-j2\pi(ux_0/M + vy_0/N)}$$

The magnitude is:

$$\left| F(u, v) \cdot e^{-j2\pi(ux_0/M + vy_0/N)} \right| = |F(u, v)|$$

because the exponential factor has unit magnitude. This means that the DFT magnitude spectrum is completely unchanged by spatial translation – including the translation component of a crop-and-paste operation. The template peaks remain at exactly the same frequency positions regardless of how the image is repositioned.
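The invariance is easy to check numerically. The following is a minimal sketch, assuming a 1D signal and a circular shift (the case in which the discrete shift theorem holds exactly); it uses a naive DFT with standard-library trigonometry rather than the deterministic det_sincos path described in Section 2.1.

```rust
/// Naive O(n^2) DFT returning (re, im) pairs.
fn dft(signal: &[f64]) -> Vec<(f64, f64)> {
    let n = signal.len();
    (0..n)
        .map(|k| {
            let mut re = 0.0;
            let mut im = 0.0;
            for (x, &s) in signal.iter().enumerate() {
                // Reduce k*x modulo n first to keep the angle small.
                let angle = -2.0 * std::f64::consts::PI * ((k * x) % n) as f64 / n as f64;
                re += s * angle.cos();
                im += s * angle.sin();
            }
            (re, im)
        })
        .collect()
}

/// Magnitude of every spectrum bin.
fn magnitudes(spectrum: &[(f64, f64)]) -> Vec<f64> {
    spectrum.iter().map(|&(re, im)| re.hypot(im)).collect()
}

/// Circularly shifts a signal to the right by `shift` samples.
fn circular_shift(signal: &[f64], shift: usize) -> Vec<f64> {
    let n = signal.len();
    (0..n).map(|i| signal[(i + n - shift % n) % n]).collect()
}
```

Comparing `magnitudes(&dft(&signal))` with the magnitudes of the shifted signal gives the same values up to floating-point rounding, while the phases differ.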

2.3 Rotation Covariance

The rotation theorem for the continuous Fourier transform extends to the discrete case (with aliasing considerations at the boundaries): rotating the spatial image by angle $\theta$ rotates the frequency spectrum by the same angle $\theta$.

If we denote spatial rotation by $R_\theta$:

$$\mathcal{F}\{f(R_\theta(x, y))\} = F(R_\theta(u, v))$$

In Cartesian coordinates, if a template peak is originally at frequency $(u_0, v_0)$, after spatial rotation by $\theta$ it will be detected at:

$$\begin{pmatrix} u' \\ v' \end{pmatrix} = \begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix} \begin{pmatrix} u_0 \\ v_0 \end{pmatrix}$$

The magnitude is preserved: $|F(u', v')| = |F(u_0, v_0)|$. By measuring the angular displacement of the detected peaks relative to their expected positions, the decoder recovers $\theta$.
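As an illustration, the rotation matrix and the recovery of $\theta$ from a single expected/detected pair can be sketched as follows. The function names are hypothetical, and standard-library trigonometry stands in for the deterministic routines.

```rust
/// Rotates a centered frequency coordinate (u0, v0) by `theta` radians.
fn rotate_peak(u0: f64, v0: f64, theta: f64) -> (f64, f64) {
    let (sin_t, cos_t) = theta.sin_cos();
    (cos_t * u0 - sin_t * v0, sin_t * u0 + cos_t * v0)
}

/// Signed angle in [-pi, pi] from the expected peak to the detected peak.
fn angular_displacement(expected: (f64, f64), detected: (f64, f64)) -> f64 {
    // atan2 of the cross and dot products gives the signed angle between two vectors.
    let cross = expected.0 * detected.1 - expected.1 * detected.0;
    let dot = expected.0 * detected.0 + expected.1 * detected.1;
    cross.atan2(dot)
}
```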

2.4 Scale Covariance

Scaling the spatial image by factor $s$ (uniform scaling) produces the dual scaling in the frequency domain:

$$\mathcal{F}\{f(x/s, y/s)\} = s^2 \, F(su, sv)$$

The magnitude at frequency $(u, v)$ in the original becomes the magnitude at frequency $(u/s, v/s)$ in the scaled image (modulo the $s^2$ amplitude factor, which affects all peaks equally). A peak originally at radius $r_0 = \sqrt{u_0^2 + v_0^2}$ from the origin will be detected at radius $r_0 / s$ after scaling by $s$.

This means that a 0.8x downscale pushes all peaks outward in the frequency domain (the image is compressed spatially, so its spectrum expands and the peaks appear at higher relative frequencies in the new, smaller DFT), while a 1.5x upscale pulls them inward. By measuring the radial displacement of detected peaks relative to their expected positions, the decoder recovers $s$.
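The radial relationship is a one-line computation. This sketch uses the convention of this section, where $s$ is the spatial scale factor; the function names are illustrative.

```rust
/// Radius at which a peak originally at `r0` appears after spatial scaling by `s`.
fn radius_after_scaling(r0: f64, s: f64) -> f64 {
    r0 / s
}

/// Spatial scale factor implied by an expected and a detected radius.
fn scale_from_radii(r0: f64, r_detected: f64) -> f64 {
    r0 / r_detected
}
```

For a peak at radius 150, a 0.8x downscale gives a detected radius of 187.5 and a 1.5x upscale gives 100.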

2.5 Why Magnitude, Not Phase

The DFT magnitude carries all the properties we need: translation invariance, rotation covariance, and scale covariance. The DFT phase, by contrast, encodes spatial position information. The shift theorem shows that translation multiplies the DFT by a phase term, so the phase depends on a translation offset that the decoder does not know, making it unsuitable for a synchronization template.

Magnitude-domain embedding also has a robustness advantage: the magnitude spectrum is more stable than the phase spectrum under JPEG recompression, because quantization noise affects phase much more than magnitude (phase errors grow linearly with coefficient perturbation, while magnitude errors grow sublinearly due to the vector addition geometry).

3. Passphrase-Derived Peak Generation

3.1 The Security Requirement

If template peaks are placed at fixed, publicly known positions in the frequency spectrum – as in the original Pereira and Pun (2000) scheme – an adversary can estimate the template by averaging the magnitude in those positions across multiple stego images (or by comparing with a clean version of the same image). Once estimated, the template can be subtracted without significantly degrading image quality. This is the template estimation attack described by Holliman and Memon (2000).

Phasm mitigates this vulnerability by deriving peak positions from the passphrase. An attacker who does not know the passphrase cannot determine which frequency bins contain template energy, because the positions are generated by a cryptographically secure pseudorandom number generator seeded by a key derived from the passphrase through a memory-hard function.

3.2 Key Derivation Chain

The key derivation for template peak generation follows a two-stage process:

Argon2id key derivation. The user’s passphrase is processed through Argon2id with a fixed salt ("phasm-tmpl-v1") dedicated to the template subsystem. This produces a 32-byte key. The salt is distinct from the salts used for Ghost-mode permutation, Armor STDM spreading, Fortress BA-QIM, and encryption – ensuring that the template peak positions are cryptographically independent of all other passphrase-derived structures. The Argon2id parameters follow Phasm’s standard configuration (memory-hard, resistant to GPU brute-force).
ChaCha20 PRNG seeding. The 32-byte Argon2id output is used as the seed for a ChaCha20Rng instance. ChaCha20 is a cryptographically secure PRNG that produces a deterministic sequence given a fixed seed. Both encoder and decoder, given the same passphrase, will instantiate the same PRNG and generate identical peak positions.

3.3 Peak Position Generation

The PRNG generates $K = 32$ peak positions in the mid-frequency annular band of the 2D DFT spectrum. Each peak is defined by a random angle $\alpha \in [0, 2\pi)$ and a random radius $r \in [R_{\min}, R_{\max})$, where:

$$R_{\min} = 0.05 \cdot \min(W, H), \quad R_{\max} = 0.25 \cdot \min(W, H)$$

Here $W$ and $H$ are the image dimensions. The peak’s frequency coordinates relative to the spectrum center are:

$$u_k = r_k \cos\alpha_k, \quad v_k = r_k \sin\alpha_k$$

The trigonometric functions are computed using det_sincos – the FDLIBM-based deterministic implementation – to guarantee that the peak positions are bit-identical on all platforms.
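The generation step can be sketched as below. This is not Phasm's implementation: a small SplitMix64 generator stands in for ChaCha20 (it is not cryptographically secure), the `u64` seed stands in for the 32-byte Argon2id key, `sin_cos` from the standard library stands in for det_sincos, and the order in which angle and radius are drawn is an assumption.

```rust
/// Stand-in PRNG (SplitMix64). NOT cryptographically secure.
struct SplitMix64(u64);

impl SplitMix64 {
    fn next_u64(&mut self) -> u64 {
        self.0 = self.0.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.0;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }

    /// Uniform f64 in [0, 1) built from the top 53 bits.
    fn next_f64(&mut self) -> f64 {
        (self.next_u64() >> 11) as f64 / (1u64 << 53) as f64
    }
}

/// Generates `k` centered peak coordinates (u, v) inside the annulus
/// [0.05, 0.25) * min(width, height).
fn generate_peaks(seed: u64, width: usize, height: usize, k: usize) -> Vec<(f64, f64)> {
    let min_dim = width.min(height) as f64;
    let (r_min, r_max) = (0.05 * min_dim, 0.25 * min_dim);
    let mut rng = SplitMix64(seed);
    (0..k)
        .map(|_| {
            let alpha = rng.next_f64() * 2.0 * std::f64::consts::PI;
            let r = r_min + rng.next_f64() * (r_max - r_min);
            let (sin_a, cos_a) = alpha.sin_cos();
            (r * cos_a, r * sin_a)
        })
        .collect()
}
```

The property that matters is determinism: the same seed and image dimensions always yield the same peak list, so encoder and decoder agree without exchanging anything beyond the passphrase.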

3.4 Why the Mid-Frequency Band

The choice of the $[5\%, 25\%]$ radial band is governed by two competing constraints:

Avoid low frequencies. The DC and near-DC region of the DFT magnitude is dominated by image content (the average brightness, large-scale gradients). Template peaks embedded near DC would need to compete with enormous natural magnitudes, requiring disproportionate embedding energy to be detectable. Furthermore, modifying DC-region magnitudes would produce visible brightness or contrast shifts.

Avoid high frequencies. High-frequency DFT coefficients correspond to fine spatial detail. JPEG quantization attenuates high frequencies far more aggressively than low ones (in the standard luminance quantization table, the entry at position (7,7) is 99 against 16 at (0,0), roughly 6x coarser), meaning that high-frequency template peaks are unlikely to survive JPEG compression. Additionally, the DFT of a finite-size image has a limited number of frequency bins at high radii, and the discrete sampling grid makes peak detection unreliable near the Nyquist limit.

The $[5\%, 25\%]$ band represents the mid-frequency “sweet spot” where:

Natural image energy is moderate (not overwhelming)
JPEG quantization is mild (peaks survive recompression)
The frequency grid is dense enough for reliable peak detection
Sufficient radial range exists for accurate scale estimation

For a typical 2048x1536 image, this band spans radii from approximately 77 to 384 frequency bins – a 307-bin radial range containing roughly 445,000 frequency bins. The 32 template peaks occupy a vanishingly small fraction of this space.
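These figures follow directly from the band definition. A minimal sketch of the arithmetic, approximating the bin count by the area of the annulus (the function name is illustrative):

```rust
/// Returns (r_min, r_max, approximate bin count) for the template band.
fn template_band(width: usize, height: usize) -> (f64, f64, f64) {
    let min_dim = width.min(height) as f64;
    let r_min = 0.05 * min_dim;
    let r_max = 0.25 * min_dim;
    // The annulus area in bins^2 approximates the number of integer bins inside it.
    let bins = std::f64::consts::PI * (r_max * r_max - r_min * r_min);
    (r_min, r_max, bins)
}
```

For 2048x1536 this yields radii of 76.8 and 384 and about 444,700 bins, matching the rounded values above.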

3.5 Search Space Against Brute Force

An attacker attempting to locate the template peaks without the passphrase faces a combinatorial search problem. With 32 peaks to be found among approximately $N_{\text{bins}} \approx \pi(R_{\max}^2 - R_{\min}^2) \approx 445{,}000$ candidate frequency bins (for a 2048x1536 image), the number of possible peak configurations is:

$$\binom{N_{\text{bins}}}{K} = \binom{445{,}000}{32} \approx 10^{145}$$

This is computationally infeasible. The attacker cannot enumerate peak configurations, and without knowing at least some peak positions, blind spectral analysis cannot distinguish template energy from the natural image spectrum – especially because the embedding amplitude is scaled relative to the local magnitude (see Section 4), blending the peaks into the natural spectral landscape.
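The order of magnitude can be reproduced by summing logarithms, which avoids overflowing any integer type. The helper below is an illustrative sketch.

```rust
/// log10 of the binomial coefficient C(n, k), computed term by term:
/// C(n, k) = prod_{i=0}^{k-1} (n - i) / (i + 1).
fn log10_binomial(n: u64, k: u64) -> f64 {
    (0..k)
        .map(|i| ((n - i) as f64).log10() - ((i + 1) as f64).log10())
        .sum()
}
```

`log10_binomial(445_000, 32)` evaluates to roughly 145.3, consistent with the $10^{145}$ estimate.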

4. Embedding with Hermitian Symmetry

4.1 The Real-Valued IFFT Constraint

After embedding template peaks in the DFT spectrum, we must transform back to the spatial domain via the inverse FFT to obtain modified pixel values. For the spatial image to be real-valued (as all pixel images must be), the DFT must satisfy the Hermitian symmetry condition:

$$F(-u, -v) = \overline{F(u, v)}$$

where $\overline{(\cdot)}$ denotes complex conjugation. This means that for every modification we make at frequency $(u, v)$, we must make the conjugate modification at frequency $(-u, -v)$ (which, in the discrete index space of an $M \times N$ DFT, corresponds to position $((M - u) \bmod M, (N - v) \bmod N)$, so that index 0 maps to itself).

Violating Hermitian symmetry would introduce imaginary components in the inverse FFT output, which manifest as numerical noise when taking the real part. While the magnitudes of these imaginary components might be small, they would introduce pixel-level errors that compound under JPEG recompression – exactly the kind of systematic distortion that the Armor pipeline is designed to avoid.

4.2 Local Magnitude Scaling

The embedding amplitude for each peak is proportional to the local spectral magnitude. For a peak at frequency indices $(s_u, s_v)$, the embedding magnitude is:

$$A_k = \alpha \cdot \bar{M}_{3 \times 3}(s_u, s_v)$$

where $\alpha = 0.4$ is the embedding strength parameter and $\bar{M}_{3 \times 3}$ is the mean magnitude in a $3 \times 3$ neighborhood around the peak position:

$$\bar{M}_{3 \times 3}(s_u, s_v) = \frac{1}{9} \sum_{du=-1}^{1} \sum_{dv=-1}^{1} |F(s_u + du, s_v + dv)|$$

A floor of 1.0 is applied so that peaks falling in spectral nulls still receive a non-negligible amplitude: $A_k = \alpha \cdot \max(\bar{M}_{3 \times 3}, 1.0)$.

Local scaling serves two purposes. First, it makes the template perceptually adaptive – peaks embedded in spectrally active regions (where the local magnitude is high) receive proportionally larger amplitude, exploiting the visual masking effect. Second, it makes the template harder to detect by statistical steganalysis, because the embedding energy at each peak is proportional to the natural energy at that position rather than being a constant that could be identified as anomalous.
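The amplitude rule can be sketched as follows. The row-major layout with $s_u$ along the width, and the wrap-around handling at the array edges, are assumptions of this example.

```rust
/// Embedding amplitude A_k = alpha * max(mean 3x3 magnitude, 1.0).
/// `mag` is a row-major magnitude spectrum of size width x height.
fn embedding_amplitude(
    mag: &[f64],
    width: usize,
    height: usize,
    su: usize,
    sv: usize,
    alpha: f64,
) -> f64 {
    let mut sum = 0.0;
    // Adding (dim - 1) modulo dim is the same as subtracting 1, without underflow.
    for dv in [height - 1, 0, 1] {
        for du in [width - 1, 0, 1] {
            let x = (su + du) % width;
            let y = (sv + dv) % height;
            sum += mag[y * width + x];
        }
    }
    let mean = sum / 9.0;
    alpha * mean.max(1.0)
}
```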

4.3 Phase-Preserving Additive Embedding

The template is embedded by adding energy along the existing phase direction of each frequency bin, rather than by replacing the bin’s value. This preserves the spatial structure encoded in the phase while increasing the magnitude.

For a peak at index $\text{idx}$ in the spectrum array, the existing coefficient is $F_{\text{idx}} = |F_{\text{idx}}| \, e^{j\phi}$. The phase unit vector is:

$$\hat{\phi} = \frac{F_{\text{idx}}}{|F_{\text{idx}}|} = e^{j\phi}$$

(with a fallback to $\hat{\phi} = 1 + 0j$ if $|F_{\text{idx}}| < 10^{-12}$). The modified coefficient is:

$$F'_{\text{idx}} = F_{\text{idx}} + A_k \cdot \hat{\phi}$$

The effect on magnitude is:

$$|F'_{\text{idx}}| = |F_{\text{idx}}| + A_k$$

which is a clean additive increase. Because the added energy is collinear with the existing coefficient, the phase $\phi$ is preserved exactly. This is important: the DFT phase encodes the spatial arrangement of image features, and modifying it would introduce visible artifacts (edge shifts, ringing) even at modest embedding strengths.
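A minimal sketch of this update for a single coefficient, representing a complex number as an `(re, im)` tuple (a choice made for this example only):

```rust
/// Adds `amplitude` along the existing phase direction of a complex
/// coefficient. Falls back to the direction 1 + 0j when the coefficient
/// is numerically zero.
fn add_along_phase(coeff: (f64, f64), amplitude: f64) -> (f64, f64) {
    let (re, im) = coeff;
    let mag = re.hypot(im);
    let (unit_re, unit_im) = if mag < 1e-12 {
        (1.0, 0.0)
    } else {
        (re / mag, im / mag)
    };
    (re + amplitude * unit_re, im + amplitude * unit_im)
}
```

For the coefficient $3 + 4j$ (magnitude 5) and $A_k = 2.5$, the result is $4.5 + 6j$ with magnitude 7.5 and an unchanged phase.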

4.4 Hermitian Conjugate Mirroring

For each primary peak modification at index $\text{idx}$ corresponding to frequency $(s_u, s_v)$, the conjugate position at $(W - s_u, H - s_v)$ must receive the conjugate modification. The conjugate coefficient is treated identically:

$$F'_{\text{conj}} = F_{\text{conj}} + A_k \cdot \hat{\phi}_{\text{conj}}$$

where $\hat{\phi}_{\text{conj}}$ is the phase unit vector of the conjugate bin. This ensures $F'(-u, -v) = \overline{F'(u, v)}$ and the inverse FFT produces a real-valued spatial image.

The complete embedding rule for one peak can be written as a single formula. Let $\mathbf{e}_k = (s_u, s_v)$ be the primary frequency position and $\mathbf{e}_k^* = (W - s_u, H - s_v)$ be the conjugate position. Then:

$$F'(\mathbf{e}_k) = F(\mathbf{e}_k) + \alpha \cdot \max\!\bigl(\bar{M}_{3\times3}(\mathbf{e}_k),\, 1\bigr) \cdot \frac{F(\mathbf{e}_k)}{|F(\mathbf{e}_k)|}$$

$$F'(\mathbf{e}_k^*) = F(\mathbf{e}_k^*) + \alpha \cdot \max\!\bigl(\bar{M}_{3\times3}(\mathbf{e}_k),\, 1\bigr) \cdot \frac{F(\mathbf{e}_k^*)}{|F(\mathbf{e}_k^*)|}$$

Both modifications use the same amplitude $A_k$ (computed from the primary position), ensuring symmetric magnitude enhancement.
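The pair of updates can be sketched as one routine. The tuple-based complex type, the row-major layout, and the function names are assumptions of this example; the modulo in the conjugate index keeps edge bins such as index 0 mapped to themselves.

```rust
type Complex = (f64, f64); // (re, im)

/// Index of the Hermitian partner of bin (su, sv) in a width x height spectrum.
fn conjugate_index(su: usize, sv: usize, width: usize, height: usize) -> (usize, usize) {
    ((width - su) % width, (height - sv) % height)
}

/// Adds `amplitude` along the existing phase of `c` (direction 1 + 0j if c is ~0).
fn add_along_phase(c: Complex, amplitude: f64) -> Complex {
    let mag = c.0.hypot(c.1);
    let unit = if mag < 1e-12 { (1.0, 0.0) } else { (c.0 / mag, c.1 / mag) };
    (c.0 + amplitude * unit.0, c.1 + amplitude * unit.1)
}

/// Boosts a peak and its conjugate mirror by the same amplitude.
fn embed_peak_pair(
    spectrum: &mut [Complex],
    width: usize,
    height: usize,
    su: usize,
    sv: usize,
    amplitude: f64,
) {
    let (cu, cv) = conjugate_index(su, sv, width, height);
    let primary = sv * width + su;
    let mirror = cv * width + cu;
    spectrum[primary] = add_along_phase(spectrum[primary], amplitude);
    // Self-conjugate bins (such as DC) must not be boosted twice.
    if mirror != primary {
        spectrum[mirror] = add_along_phase(spectrum[mirror], amplitude);
    }
}
```

If the input spectrum is Hermitian, the two bins start as conjugates of each other, so adding the same amplitude along each bin's own phase keeps them conjugate.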

5. Peak Detection and Transform Estimation

5.1 Detection: Neighborhood Search

At decode time, the decoder computes the 2D DFT of the received (possibly transformed) image and searches for template peaks near their expected positions. Because the peaks may have shifted due to rotation, scaling, or both, the search covers a neighborhood of radius $R_s = 5$ bins around each expected position.

For each of the $K = 32$ expected peaks at centered frequency $(u_k, v_k)$, the detector:

Converts to spectrum index: $s_u = \lfloor W/2 + u_k \rceil$, $s_v = \lfloor H/2 + v_k \rceil$.
Scans all bins within a $(2R_s + 1) \times (2R_s + 1) = 11 \times 11$ square neighborhood.
Records the position $(s_u^*, s_v^*)$ of the maximum magnitude within the neighborhood.
Computes noise statistics from the outer ring of the neighborhood (bins at distance > 1 from center), excluding the 3x3 center region.

5.2 Confidence Scoring

The detection confidence for each peak is defined as the number of standard deviations the peak magnitude exceeds the local noise floor:

$$\text{conf}_k = \frac{|F(s_u^*, s_v^*)| - \bar{\mu}_{\text{noise}}}{\sigma_{\text{noise}}}$$

where $\bar{\mu}_{\text{noise}}$ and $\sigma_{\text{noise}}$ are the mean and standard deviation of magnitudes in the outer ring (bins with $|du| > 1$ or $|dv| > 1$). The noise statistics are computed per-peak from the local spectral neighborhood:

$$\bar{\mu}_{\text{noise}} = \frac{1}{N_{\text{noise}}} \sum_{(du,dv) \in \text{ring}} |F(s_u + du, s_v + dv)|$$

$$\sigma_{\text{noise}} = \sqrt{\frac{1}{N_{\text{noise}}} \sum_{(du,dv) \in \text{ring}} |F(s_u + du, s_v + dv)|^2 - \bar{\mu}_{\text{noise}}^2}$$

A peak is accepted if $\text{conf}_k \geq \tau$, where $\tau = 3.0$ (three sigma above the local noise). This threshold balances detection sensitivity against false alarm rate: for Gaussian-distributed noise, the probability of a single bin exceeding $3\sigma$ by chance is approximately 0.13%, so with 121 bins in the search window the expected number of false alarms per peak search is about 0.16. A genuine template peak, by contrast, exceeds the noise mean by roughly its embedding amplitude, giving an expected confidence of about $A_k / \sigma_{\text{noise}} \gg 3$.
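The search and scoring steps of Sections 5.1 and 5.2 can be sketched together. The `Detection` struct, the function name, and the decision to skip peaks whose window would leave the array are assumptions of this example; as in the formulas above, the noise ring is centered on the expected position.

```rust
/// Result of searching for one template peak.
struct Detection {
    su: usize,
    sv: usize,
    confidence: f64,
}

/// Searches a (2*radius+1)^2 window around the expected bin (su, sv) of a
/// row-major magnitude spectrum and scores the strongest bin.
fn detect_peak(
    mag: &[f64],
    width: usize,
    height: usize,
    su: usize,
    sv: usize,
    radius: usize,
) -> Option<Detection> {
    // This sketch skips peaks whose search window would leave the array.
    if su < radius || sv < radius || su + radius >= width || sv + radius >= height {
        return None;
    }
    let mut best = (su, sv, f64::MIN);
    let (mut sum, mut sum_sq, mut count) = (0.0, 0.0, 0.0);
    for y in (sv - radius)..=(sv + radius) {
        for x in (su - radius)..=(su + radius) {
            let m = mag[y * width + x];
            if m > best.2 {
                best = (x, y, m);
            }
            // Noise ring: every bin outside the central 3x3 block.
            if x.abs_diff(su) > 1 || y.abs_diff(sv) > 1 {
                sum += m;
                sum_sq += m * m;
                count += 1.0;
            }
        }
    }
    if count == 0.0 {
        return None;
    }
    let mean = sum / count;
    let sigma = (sum_sq / count - mean * mean).max(0.0).sqrt();
    if sigma == 0.0 {
        return None;
    }
    Some(Detection { su: best.0, sv: best.1, confidence: (best.2 - mean) / sigma })
}
```

A caller would keep the detection only if `confidence >= 3.0`. Note that a peak displaced by more than one bin falls inside the noise ring and inflates $\sigma_{\text{noise}}$, which lowers its own score.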

5.3 Minimum Peak Requirement

The affine transform estimation (Section 5.4) requires at least $K_{\min} = 8$ successfully detected peaks out of the 32 embedded. This threshold is set conservatively: each peak contributes two equations, so a single peak is enough in principle to solve for the two unknowns (rotation and scale), but 8 peaks provide:

Redundancy against outlier peaks (misdetections in spectrally busy regions)
Statistical averaging that reduces estimation error
Confidence that the template is genuinely present (rather than 2-3 coincidental noise peaks passing the $3\sigma$ threshold)

If fewer than 8 peaks are detected, the template is considered undetectable and the decoder proceeds to the DFT ring extraction fallback (Section 6).

5.4 Least-Squares Affine Estimation

Given $n \geq 8$ detected peaks with expected positions $(u_k, v_k)$ and detected positions $(u'_k, v'_k)$, the decoder must estimate the rotation angle $\theta$ and scale factor $s$ of the geometric transform as observed in the frequency domain (by Section 2.4, this frequency-domain $s$ is the reciprocal of the spatial scale factor). The relationship between expected and detected positions is:

$$\begin{pmatrix} u'_k \\ v'_k \end{pmatrix} = s \begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix} \begin{pmatrix} u_k \\ v_k \end{pmatrix}$$

Expanding the matrix multiplication:

$$u'_k = s\cos\theta \cdot u_k - s\sin\theta \cdot v_k$$

$$v'_k = s\sin\theta \cdot u_k + s\cos\theta \cdot v_k$$

Define $a = s\cos\theta$ and $b = s\sin\theta$. Then:

$$u'_k = a \cdot u_k - b \cdot v_k$$

$$v'_k = b \cdot u_k + a \cdot v_k$$

The total squared error over all $n$ detected peaks is:

$$E(a, b) = \sum_{k=1}^{n} \left[ (u'_k - a u_k + b v_k)^2 + (v'_k - b u_k - a v_k)^2 \right]$$

Setting partial derivatives to zero:

$$\frac{\partial E}{\partial a} = 0: \quad a \sum_k (u_k^2 + v_k^2) = \sum_k (u_k u'_k + v_k v'_k)$$

$$\frac{\partial E}{\partial b} = 0: \quad b \sum_k (u_k^2 + v_k^2) = \sum_k (u_k v'_k - v_k u'_k)$$

The closed-form solution is:

$$a = \frac{\displaystyle\sum_{k=1}^{n} (u_k u'_k + v_k v'_k)}{\displaystyle\sum_{k=1}^{n} (u_k^2 + v_k^2)}, \quad b = \frac{\displaystyle\sum_{k=1}^{n} (u_k v'_k - v_k u'_k)}{\displaystyle\sum_{k=1}^{n} (u_k^2 + v_k^2)}$$

From which the transform parameters are recovered:

$$\theta = \text{atan2}(b,\, a)$$

$$s = \sqrt{a^2 + b^2}$$

The atan2 is computed using det_atan2 – the FDLIBM-based deterministic implementation – to ensure cross-platform agreement. The sqrt function is IEEE 754 correctly-rounded and therefore deterministic by specification across all compliant platforms, including WebAssembly.
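The closed-form estimator translates almost line for line into code. This is a sketch under stated assumptions: the function name and the `min_peaks` parameter are illustrative, and the standard-library `atan2` stands in for det_atan2.

```rust
/// Least-squares estimate of (theta, s) from matched peak positions.
/// Returns None if fewer than `min_peaks` pairs are supplied or the
/// expected positions are degenerate.
fn estimate_rotation_scale(
    expected: &[(f64, f64)],
    detected: &[(f64, f64)],
    min_peaks: usize,
) -> Option<(f64, f64)> {
    if expected.len() != detected.len() || expected.len() < min_peaks {
        return None;
    }
    let (mut num_a, mut num_b, mut denom) = (0.0, 0.0, 0.0);
    for (&(u, v), &(ud, vd)) in expected.iter().zip(detected) {
        num_a += u * ud + v * vd; // numerator of a = s*cos(theta)
        num_b += u * vd - v * ud; // numerator of b = s*sin(theta)
        denom += u * u + v * v;
    }
    if denom == 0.0 {
        return None;
    }
    let (a, b) = (num_a / denom, num_b / denom);
    Some((b.atan2(a), (a * a + b * b).sqrt()))
}
```

Feeding it 16 peaks at radius 150 that have been rotated by 15 degrees and scaled by 0.8 returns those two parameters up to rounding error.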

5.5 Estimation Accuracy

The estimation accuracy depends on the number of detected peaks and the geometric spread of their frequency positions. The error in $\theta$ can be approximated by:

$$\text{Var}(\hat{\theta}) \approx \frac{\sigma_{\text{pos}}^2}{\displaystyle\sum_{k=1}^{n} (u_k^2 + v_k^2)}$$

where $\sigma_{\text{pos}}$ is the standard deviation of the peak position detection error (approximately 0.5-1.0 bins for the $5$-bin neighborhood search). For 16 detected peaks at an average radius of 150 bins (a 2048-wide image):

$$\text{Var}(\hat{\theta}) \approx \frac{1.0}{16 \times 150^2} \approx 2.8 \times 10^{-6} \text{ rad}^2$$

$$\sigma_{\hat{\theta}} \approx 0.0017 \text{ rad} \approx 0.096\degree$$

This sub-degree angular resolution is more than sufficient for the subsequent image resampling step, which must only correct to within half a pixel at the image edges.
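The worked figure above can be reproduced with a short helper that evaluates the variance approximation for a given set of peak radii (the function name is illustrative):

```rust
/// Approximate standard deviation of the rotation estimate, in radians:
/// sqrt(sigma_pos^2 / sum(r_k^2)).
fn rotation_std_dev(sigma_pos: f64, radii: &[f64]) -> f64 {
    let sum_sq: f64 = radii.iter().map(|r| r * r).sum();
    (sigma_pos * sigma_pos / sum_sq).sqrt()
}
```

With `sigma_pos = 1.0` and sixteen radii of 150, the result is 1/600 rad, about 0.0017 rad or 0.096 degrees.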

The unit tests in Phasm’s implementation verify these bounds. The transform_estimation_identity test confirms that when detected positions exactly match expected positions, the estimated rotation is below 0.01 radians and scale is within 0.01 of 1.0. The transform_estimation_rotation test simulates a 15-degree rotation on 20 peaks at radius 25, recovering the angle to within 0.01 radians (0.57 degrees). The transform_estimation_scale test simulates 0.8x scaling on 16 peaks, recovering the scale factor to within 0.01.

6. The DFT Ring Payload

6.1 Motivation: A Resize-Robust Secondary Channel

The DFT template described in Sections 3-5 provides geometric synchronization – it tells the decoder how the image was transformed. But the template itself carries no data. The primary data payload is embedded in DCT-domain STDM coefficients, which require the 8x8 block grid to be restored before extraction.

For extreme geometric transformations – particularly aggressive downscaling that discards too much spatial information – the STDM payload may be unrecoverable even after the template-guided geometric correction. The block grid of the corrected image, while approximately aligned with the original, may have accumulated sufficient interpolation error from the resample step to corrupt the majority of coefficient values.

The DFT ring payload provides a last-resort fallback: a small secondary message embedded directly in the DFT magnitude spectrum, in a domain that inherently survives the same transforms that the template detects. Because the ring payload is read from the magnitude spectrum – which is translation-invariant, rotation-covariant, and scale-covariant – it can be extracted even when the spatial-domain STDM extraction fails.

6.2 Ring Geometry

The ring payload occupies an annular band in the DFT magnitude spectrum, defined by inner and outer radii:

$$R_{\text{inner}} = 0.08 \cdot \min(W, H), \quad R_{\text{outer}} = 0.20 \cdot \min(W, H)$$

This band overlaps substantially with the template peak band ($[0.05, 0.25]$ of $\min(W, H)$) but serves a different purpose. The ring is divided into $N_s = 256$ angular sectors, each covering an angular arc of $2\pi / 256 \approx 1.41\degree$.

Each sector aggregates the mean magnitude of all frequency bins that fall within its angular and radial bounds:

$$\bar{M}_i = \frac{1}{|S_i|} \sum_{(u,v) \in S_i} |F(u, v)|$$

where $S_i$ is the set of frequency bins in sector $i$.
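Assigning a frequency bin to its sector amounts to a radius check followed by an angle quantization. In this sketch the half-open radial interval, the choice of sector 0 starting at angle 0, and the function name are assumptions.

```rust
/// Maps a centered frequency coordinate (u, v) to its angular sector, or
/// None if the bin lies outside the ring [r_inner, r_outer).
fn sector_index(u: f64, v: f64, r_inner: f64, r_outer: f64, n_sectors: usize) -> Option<usize> {
    let r = u.hypot(v);
    if r < r_inner || r >= r_outer {
        return None;
    }
    // Map the angle from [-pi, pi] to [0, 2*pi).
    let two_pi = 2.0 * std::f64::consts::PI;
    let angle = v.atan2(u).rem_euclid(two_pi);
    let idx = (angle / two_pi * n_sectors as f64) as usize;
    // Guard against rounding pushing the index to n_sectors.
    Some(idx.min(n_sectors - 1))
}
```

The sector mean $\bar{M}_i$ is then the average magnitude over all bins that map to index $i$.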

6.3 QIM Embedding Per Sector

Data bits are embedded into sector magnitudes using scalar Quantization Index Modulation (QIM) with a quantization step of $\Delta_{\text{ring}} = 50.0$:

$$\bar{M}'_i = \begin{cases} \text{round}\!\left(\bar{M}_i / \Delta\right) \cdot \Delta & \text{if bit} = 0 \\ \left(\text{round}\!\left(\bar{M}_i / \Delta - 0.5\right) + 0.5\right) \cdot \Delta & \text{if bit} = 1 \end{cases}$$

The large quantization step ($\Delta = 50$) ensures that the QIM embedding structure survives moderate spectral perturbations from JPEG recompression and geometric interpolation. Extraction is blind:

$$\hat{b}_i = \text{round}\!\left(\bar{M}_i / (\Delta / 2)\right) \bmod 2$$

After QIM embedding, all frequency bins within the sector are rescaled to match the target magnitude while preserving their individual phases – maintaining Hermitian symmetry for the conjugate bins.
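The embedding and extraction rules for a single sector mean can be sketched as follows; the function names are illustrative, and the sketch covers only the scalar quantization, not the rescaling of individual bins.

```rust
/// Quantizes a sector mean onto the lattice for the given bit:
/// multiples of delta for 0, multiples of delta offset by delta/2 for 1.
fn qim_embed(value: f64, bit: bool, delta: f64) -> f64 {
    if bit {
        ((value / delta - 0.5).round() + 0.5) * delta
    } else {
        (value / delta).round() * delta
    }
}

/// Blind extraction: even multiples of delta/2 decode as 0, odd ones as 1.
fn qim_extract(value: f64, delta: f64) -> bool {
    ((value / (delta / 2.0)).round() as i64).rem_euclid(2) == 1
}
```

With $\Delta = 50$, a sector mean of 37 becomes 50 for bit 0 and 25 for bit 1. Either value decodes correctly as long as later perturbations stay below $\Delta/4 = 12.5$.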

6.4 Spreading and Sector Assignment

Each data bit is spread across $L_s = 4$ sectors (the spreading factor), giving $N_s / L_s = 256 / 4 = 64$ effective data bits. The sector-to-bit assignment is determined by a Fisher-Yates permutation seeded by a passphrase-derived PRNG (ChaCha20, with a key derived from Argon2id using a salt independent of the template key: "phasm-ring-v1"). This ensures:

The sector assignment is deterministic given the passphrase.
The assignment is secret – an attacker cannot determine which sectors carry which bits.
Each data bit is uniformly distributed across the ring, providing spatial diversity against local spectral interference.
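The assignment can be sketched with a Fisher-Yates shuffle. As a loudly stated assumption, a SplitMix64 generator with a `u64` seed stands in for the passphrase-derived ChaCha20 PRNG (it is not cryptographically secure), and the mapping of shuffled slots to bits in consecutive groups of four is illustrative.

```rust
/// Stand-in PRNG (SplitMix64). NOT cryptographically secure.
struct SplitMix64(u64);

impl SplitMix64 {
    fn next_u64(&mut self) -> u64 {
        self.0 = self.0.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.0;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }
}

/// Returns `assignment[sector] = data-bit index`, with each bit receiving
/// exactly `spreading` sectors scattered around the ring.
fn sector_assignment(seed: u64, n_sectors: usize, spreading: usize) -> Vec<usize> {
    let mut order: Vec<usize> = (0..n_sectors).collect();
    let mut rng = SplitMix64(seed);
    // Fisher-Yates: walk from the end, swapping each slot with a random earlier one.
    for i in (1..n_sectors).rev() {
        // The modulo introduces a tiny bias, acceptable in a sketch.
        let j = (rng.next_u64() % (i as u64 + 1)) as usize;
        order.swap(i, j);
    }
    let mut assignment = vec![0; n_sectors];
    for (slot, &sector) in order.iter().enumerate() {
        assignment[sector] = slot / spreading;
    }
    assignment
}
```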

At extraction, the 4 QIM decisions for each data bit are combined by majority voting:

$$\hat{b}_k = \begin{cases} 0 & \text{if } \sum_{i : \text{sector } i \to \text{bit } k} \text{vote}_i \geq 0 \\ 1 & \text{otherwise} \end{cases}$$

where $\text{vote}_i = +1$ if the QIM extraction for sector $i$ yields 0, and $\text{vote}_i = -1$ if it yields 1.
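The voting rule can be sketched as below. The representation of the sector-to-bit mapping as a slice of bit indices is an assumption of this example.

```rust
/// Combines per-sector QIM decisions into data bits by majority vote.
/// `sector_bits[i]` is the bit decoded from sector i and `assignment[i]`
/// is the data-bit index that sector i carries. Ties resolve to 0.
fn decode_bits(sector_bits: &[bool], assignment: &[usize], n_bits: usize) -> Vec<bool> {
    let mut scores = vec![0i32; n_bits];
    for (&bit, &k) in sector_bits.iter().zip(assignment) {
        // +1 for a decoded 0, -1 for a decoded 1.
        scores[k] += if bit { -1 } else { 1 };
    }
    scores.iter().map(|&s| s < 0).collect()
}
```

With four sectors per bit, a single corrupted sector is always outvoted. Two corrupted sectors produce a tie, which decodes as 0 regardless of the embedded bit.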

6.5 Error Correction and Capacity

The 64 effective bits (8 bytes of raw data) are intended to pass through heavy Reed-Solomon coding with 192 parity symbols out of 255 – an RS(255, 63) configuration. However, the 8-byte channel capacity is smaller than the standard frame overhead of approximately 50 bytes (2-byte length + 16-byte salt + 12-byte nonce + 16-byte AES-GCM tag + 4-byte CRC), so the ring payload currently cannot carry usable plaintext with the standard frame format. The ring_capacity() function returns 0 for all image sizes.
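The capacity shortfall is plain arithmetic, sketched here with illustrative constant and function names (this is not the ring_capacity() implementation itself):

```rust
/// Raw ring channel: 64 bits.
const RAW_RING_BYTES: usize = 64 / 8;

/// Standard frame overhead: length + salt + nonce + AES-GCM tag + CRC.
const FRAME_OVERHEAD_BYTES: usize = 2 + 16 + 12 + 16 + 4;

/// Plaintext bytes left after the frame overhead, saturating at zero.
fn usable_ring_bytes(raw_bytes: usize, overhead_bytes: usize) -> usize {
    raw_bytes.saturating_sub(overhead_bytes)
}
```

With 8 raw bytes against 50 bytes of overhead, the usable capacity is zero.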

This is a known limitation – a future compact ring frame format (without per-message salt and nonce) could reclaim approximately 1–2 bytes of usable capacity from the 8-byte raw channel. The ring payload is designed as a fallback channel for future development. Its intended value is diagnostic: confirming that the image was produced by Phasm and that the passphrase is correct, even when the full STDM payload cannot be recovered.

6.6 Why the Ring Survives Resize

The ring payload survives resizing because the DFT magnitude of a resized image is a radially scaled version of the original DFT magnitude. A 0.8x downscale compresses the image in the spatial domain, which expands the ring in the frequency domain by the reciprocal factor. The template detection system (Section 5) estimates the scale factor $s$, and the ring extraction adjusts the inner and outer radii accordingly:

$$R'_{\text{inner}} = R_{\text{inner}} / s, \quad R'_{\text{outer}} = R_{\text{outer}} / s$$

The angular sector boundaries are unaffected by uniform scaling (scaling is radial), so the sector-to-bit assignment remains valid. The QIM step $\Delta$ may need adjustment for the changed magnitude scale ($s^2$ factor), but the aggressive step size of 50.0 provides margin for scale factors in the range $[0.5, 2.0]$.
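A small sketch of the radius compensation follows. The `Ring` type, the example radii in the usage note, and the uniform angular partition used in `sector_index` are assumptions for illustration. The sketch also uses the standard library's `atan2`, whereas the text describes deterministic FDLIBM-based trigonometry.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
struct Ring {
    inner: f64,
    outer: f64,
}

/// Apply R' = R / s to both radii. Returns None for a non-positive or
/// non-finite scale estimate.
fn rescale_ring(ring: Ring, s: f64) -> Option<Ring> {
    if !(s.is_finite() && s > 0.0) {
        return None;
    }
    Some(Ring {
        inner: ring.inner / s,
        outer: ring.outer / s,
    })
}

/// Assumed layout: sectors partition the full circle uniformly by angle.
/// The index depends only on the direction of (u, v), not on its length,
/// which is why uniform scaling leaves the sector assignment unchanged.
fn sector_index(u: f64, v: f64, num_sectors: usize) -> usize {
    let two_pi = 2.0 * std::f64::consts::PI;
    let angle = v.atan2(u).rem_euclid(two_pi);
    let idx = (angle / two_pi * num_sectors as f64) as usize;
    idx.min(num_sectors - 1)
}
```

For example, a ring with hypothetical radii 100 and 120 becomes 200 and 240 when the estimated scale is $s = 0.5$, while the frequency point $(30, 40)$ and its scaled counterpart $(37.5, 50)$ fall into the same sector.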

This scale resilience is the key advantage over the DCT-domain STDM payload. STDM depends on the 8x8 block grid, which is destroyed by resizing. The DFT ring depends only on the frequency-domain magnitude, which transforms predictably under scaling and can be compensated using the template-estimated scale factor.

7. The Complete Decode Pipeline

7.1 Three-Stage Fallback Architecture

Armor decode follows a three-stage fallback architecture, where each stage handles a progressively more severe class of image manipulation:

                graph TD
                    A["Receive stego JPEG"] --> B["Stage 1: Fortress BA-QIM
(magic byte check on DC)"]
                    B -->|Success| Z["Return message + quality"]
                    B -->|Fail| C["Stage 2: Standard STDM Decode
(delta sweep, Phase 1 + Phase 2 search)"]
                    C -->|Success| Z
                    C -->|Fail| D["Stage 3: Geometric Recovery"]
                    D --> E["Compute 2D FFT of luminance"]
                    E --> F["Generate expected template peaks
(passphrase → Argon2id → ChaCha20)"]
                    F --> G["Detect peaks in DFT magnitude
(11x11 neighborhood search, 3σ threshold)"]
                    G --> H{"≥ 8 peaks
detected?"}
                    H -->|No| Q["Return error:
FrameCorrupted"]
                    H -->|Yes| J["Estimate affine transform
(least-squares: θ, s)"]
                    J --> K{"Transform ≈
identity?"}
                    K -->|Yes| I["Try DFT ring extraction
(no resample)"]
                    K -->|No| L["Resample image
(bilinear, inverse transform)"]
                    L --> M["Retry standard STDM decode
on corrected image"]
                    M -->|Success| N["Return message + quality
(geometry_corrected = true)"]
                    M -->|Fail| O["Try DFT ring extraction
on corrected image"]
                    O -->|Success| P["Return ring message
(dft_ring_used = true)"]
                    O -->|Fail| Q["Return error:
FrameCorrupted"]
                    I -->|Success| P
                    I -->|Fail| Q

7.2 Stage 1: Fortress BA-QIM (Fast Path)

The decoder first checks for the Fortress sub-mode – a BA-QIM embedding in DC block averages designed for social media survival, using Watson perceptual masking to adapt embedding strength per block. Fortress uses a magic-byte header (56 blocks with 7x majority voting) that can be checked with minimal computation. If the magic matches, the Fortress-specific extraction path runs. This fast path handles the majority of short-message Armor images.

7.3 Stage 2: Standard STDM Decode

If Fortress does not match, the decoder attempts standard STDM extraction with a delta sweep. The delta sweep tries multiple mean quantization table values (the embedded header value, the current image’s QT, and ±30% variations in 3% steps) to account for QT changes from recompression. For each candidate delta, the decoder attempts both Phase 1 (fixed RS parity, no repetition) and Phase 2 (brute-force search over repetition factor $r$ and RS parity tier) decoding. This stage handles all non-geometric attacks – recompression, cross-library re-encoding, and format round-trips.
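The candidate list for the delta sweep might be generated as in the sketch below. It assumes the ±30% variations are taken around the current image's mean QT value, which the description leaves open; `delta_candidates` and its parameters are hypothetical names.

```rust
/// Candidate mean-QT values for the delta sweep: the header value, the
/// current image's value, then -30%..+30% of the current value in 3% steps.
/// Assumption: the percentage variations are centred on `current_qt`.
fn delta_candidates(header_qt: f64, current_qt: f64) -> Vec<f64> {
    let mut out = vec![header_qt, current_qt];
    // Integer step counts avoid accumulating floating-point drift.
    for step in -10i32..=10 {
        if step == 0 {
            continue; // 0% is already present as `current_qt`
        }
        out.push(current_qt * (1.0 + 0.03 * step as f64));
    }
    out
}
```

Under this reading the sweep has 22 candidates, and each one is then tried with both Phase 1 and Phase 2 decoding.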

7.4 Stage 3: Geometric Recovery

When standard STDM fails (indicating that the block grid has been disrupted by a geometric transform), the decoder enters the geometric recovery pipeline:

FFT. Convert the stego JPEG to luminance pixels, compute the 2D FFT.
Template detection. Generate the expected 32 peak positions from the passphrase, search for each in the DFT magnitude within an 11x11 neighborhood.
Transform estimation. If $\geq 8$ peaks are detected, compute the least-squares rotation and scale estimates.
Identity check. If the estimated transform is essentially identity ($|\theta| < 0.001$ rad and $|s - 1| < 0.001$), skip resampling and try the DFT ring directly.
Resample. Apply the inverse transform (rotate by $-\theta$, scale by $1/s$) via bilinear interpolation to produce a geometry-corrected image.
Retry STDM. Write the corrected pixels back into JPEG coefficients and retry the standard STDM decode.
Ring fallback. If STDM still fails, extract the DFT ring payload from the corrected image’s spectrum.
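The branching in the steps above reduces to a small decision function. The sketch below encodes only the thresholds stated in the list (at least 8 peaks, identity tolerance of 0.001 on both $\theta$ and $s$). The enum and function names are hypothetical, and the guard against a non-positive scale is an added assumption.

```rust
#[derive(Debug, PartialEq)]
enum Stage3Plan {
    /// Too few peaks, or an unusable scale estimate.
    Fail,
    /// Transform is essentially identity: skip resampling, try the ring.
    RingOnly,
    /// Resample with the inverse transform, then retry STDM.
    ResampleThenRetry { rotate_by_rad: f64, scale_by: f64 },
}

const MIN_PEAKS: usize = 8;
const IDENTITY_TOL: f64 = 0.001;

fn plan_stage3(peaks_detected: usize, theta: f64, s: f64) -> Stage3Plan {
    if peaks_detected < MIN_PEAKS {
        return Stage3Plan::Fail;
    }
    // Added guard (assumption): reject non-finite or non-positive estimates.
    if !(theta.is_finite() && s.is_finite() && s > 0.0) {
        return Stage3Plan::Fail;
    }
    if theta.abs() < IDENTITY_TOL && (s - 1.0).abs() < IDENTITY_TOL {
        return Stage3Plan::RingOnly;
    }
    Stage3Plan::ResampleThenRetry {
        rotate_by_rad: -theta,
        scale_by: 1.0 / s,
    }
}
```

For instance, 20 detected peaks with $\theta = 0.25$ rad and $s = 0.5$ yield a plan to rotate by $-0.25$ rad and scale by 2 before the STDM retry.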

The bilinear resampling module inverse-maps each output pixel through the estimated affine transform (center-relative rotation by $-\theta$ followed by scaling by $1/s$) and interpolates from the four nearest source pixels. Out-of-bounds pixels default to mid-gray (128.0) to avoid edge artifacts.
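A minimal sketch of such a resampler is shown below. It is an illustration under stated assumptions rather than the module itself: the `Gray` type is hypothetical, the output keeps the input dimensions, and the sign convention is one possible reading, in which producing the corrected image (rotation by $-\theta$, scaling by $1/s$) means each output pixel is fetched from the source at $c + s\,R(\theta)(p - c)$. It also uses the standard library's `sin_cos` rather than deterministic trigonometry.

```rust
/// Hypothetical row-major grayscale image.
#[derive(Debug, Clone)]
struct Gray {
    w: usize,
    h: usize,
    px: Vec<f64>,
}

const FILL: f64 = 128.0; // mid-gray for out-of-bounds samples

/// Bilinear interpolation from the four nearest source pixels.
fn sample_bilinear(img: &Gray, x: f64, y: f64) -> f64 {
    let (max_x, max_y) = ((img.w - 1) as f64, (img.h - 1) as f64);
    // The negated form also rejects NaN coordinates.
    if !(x >= 0.0 && y >= 0.0 && x <= max_x && y <= max_y) {
        return FILL;
    }
    let (x0, y0) = (x.floor() as usize, y.floor() as usize);
    let (x1, y1) = ((x0 + 1).min(img.w - 1), (y0 + 1).min(img.h - 1));
    let (fx, fy) = (x - x0 as f64, y - y0 as f64);
    let at = |xx: usize, yy: usize| img.px[yy * img.w + xx];
    let top = at(x0, y0) * (1.0 - fx) + at(x1, y0) * fx;
    let bot = at(x0, y1) * (1.0 - fx) + at(x1, y1) * fx;
    top * (1.0 - fy) + bot * fy
}

/// Undo an estimated rotation `theta` (radians) and scale `s` about the
/// image center by inverse mapping every output pixel into the source.
fn resample_corrected(src: &Gray, theta: f64, s: f64) -> Gray {
    let (cx, cy) = ((src.w as f64 - 1.0) / 2.0, (src.h as f64 - 1.0) / 2.0);
    let (sin_t, cos_t) = theta.sin_cos();
    let mut px = Vec::with_capacity(src.w * src.h);
    for oy in 0..src.h {
        for ox in 0..src.w {
            let (dx, dy) = (ox as f64 - cx, oy as f64 - cy);
            let sx = cx + s * (cos_t * dx - sin_t * dy);
            let sy = cy + s * (sin_t * dx + cos_t * dy);
            px.push(sample_bilinear(src, sx, sy));
        }
    }
    Gray { w: src.w, h: src.h, px }
}
```

Inverse mapping guarantees that every output pixel receives exactly one value, which avoids the holes a forward mapping would leave. With $\theta = 0$ and $s = 1$ the function returns the input unchanged.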

7.5 Quality Reporting

The DecodeQuality structure returned by Armor decode includes geometric recovery metadata:

geometry_corrected: whether the template-guided resample path was used
template_peaks_detected: number of peaks found (out of 32)
estimated_rotation_deg: recovered rotation angle in degrees
estimated_scale: recovered scale factor
dft_ring_used: whether the ring payload (rather than STDM) provided the message
dft_ring_capacity: ring capacity in bytes

This metadata allows the UI to inform the user about the nature and severity of the geometric manipulation that was detected and corrected.
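As an illustration of how a UI might consume these fields, the sketch below declares a struct containing only the geometric fields listed above and formats a one-line summary. The field types and the `describe_geometry` helper are assumptions; the actual `DecodeQuality` structure may carry additional fields and different types.

```rust
/// Geometric subset of the decode-quality metadata (types assumed).
#[derive(Debug, Clone, PartialEq)]
struct DecodeQuality {
    geometry_corrected: bool,
    template_peaks_detected: u8,
    estimated_rotation_deg: f32,
    estimated_scale: f32,
    dft_ring_used: bool,
    dft_ring_capacity: usize,
}

/// Hypothetical helper producing a user-facing summary line.
fn describe_geometry(q: &DecodeQuality) -> String {
    if !q.geometry_corrected {
        return format!(
            "No geometric correction applied ({} of 32 template peaks found).",
            q.template_peaks_detected
        );
    }
    let channel = if q.dft_ring_used { "DFT ring" } else { "STDM" };
    format!(
        "Corrected rotation {:.1} deg, scale {:.2} using {} of 32 peaks; message from {}.",
        q.estimated_rotation_deg, q.estimated_scale, q.template_peaks_detected, channel
    )
}
```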

8. Template Estimation Attack and Mitigations

8.1 The Holliman-Memon Attack

Holliman and Memon (2000) demonstrated that DFT template watermarks are vulnerable when the template positions are publicly known. The attack exploits the fact that template peaks create predictable structures in the magnitude spectrum. Given knowledge of the peak frequencies, an attacker can:

Measure the magnitude at each known template position.
Estimate the template contribution by comparing with the local spectral background.
Subtract the estimated template energy from those positions.
Release the modified image, which passes through normal processing while the template is no longer detectable.

This attack is particularly effective because the template is a global signal – it affects specific frequency bins across the entire image, and those bins can be locally interpolated from their neighbors to estimate and remove the template contribution.

8.2 Passphrase-Derived Positions as Mitigation

Phasm’s primary defense is that the template peak positions are secret. An attacker who does not know the passphrase cannot determine which of the $\sim$74,000 frequency bins in the mid-frequency band contain template energy. The search space of $\binom{445{,}000}{32} \approx 10^{145}$ possible configurations makes exhaustive search infeasible.
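The order of magnitude of the binomial quoted above can be checked without big-integer arithmetic by summing logarithms. The helper name below is hypothetical.

```rust
/// log10 of the binomial coefficient C(n, k), computed as
/// sum over i = 1..=k of log10(n - k + i) - log10(i).
/// Requires k <= n.
fn log10_binomial(n: u64, k: u64) -> f64 {
    (1..=k)
        .map(|i| ((n - k + i) as f64).log10() - (i as f64).log10())
        .sum()
}
```

Evaluating `log10_binomial(445_000, 32)` gives a value between 145 and 146, consistent with the $10^{145}$ figure.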

Even statistical approaches – such as scanning for bins with anomalously high magnitude – are unlikely to succeed because:

The embedding amplitude is proportional to the local spectral magnitude ($\alpha = 0.4$ of the local mean), so template peaks do not create unusual magnitude ratios relative to their neighbors.
Natural image spectra contain numerous bins with high magnitude due to image content (edges, textures, periodic patterns), creating a high false-positive rate for any anomaly detector.
With only 32 peaks among 74,000 bins, the signal-to-noise ratio for any blind statistical search is extremely low.

8.3 Known-Cover and Known-Message Considerations

An attacker who obtains both the original image and the stego image (a known-cover scenario) can compute the difference spectrum and directly observe the template peaks. This enables the Holliman-Memon attack regardless of passphrase secrecy. However, the known-cover scenario is a strong assumption: in the steganographic threat model, the adversary typically has access only to the stego image (or a collection of stego images), not the original cover.

In the weaker known-message scenario (the attacker knows the plaintext but not the passphrase), the template peaks are still unrecoverable because the template positions are derived solely from the passphrase, not from the message content.

8.4 Multiple-Image Attack

An adversary who intercepts multiple stego images produced with the same passphrase could potentially average their magnitude spectra to detect consistent template peaks above the varying natural background. With $n$ images, the template SNR improves by $\sqrt{n}$, so approximately $n \geq (3/\alpha)^2 \approx 56$ images would be needed to bring the average template peak to $3\sigma$ above the averaged noise floor. This attack is plausible in a surveillance scenario but requires significant data collection and assumes the same passphrase is reused across many images – a practice that Phasm’s security guidance discourages.
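The image count follows from the $\sqrt{n}$ averaging gain. The sketch below uses the same simplified model as the paragraph above, in which the single-image template SNR is taken to be $\alpha$; the function names are hypothetical.

```rust
/// Threshold n = (sigma_threshold / alpha)^2 from the sqrt(n) averaging gain.
/// Model assumption: single-image template SNR equals `alpha`.
fn required_images_exact(alpha: f64, sigma_threshold: f64) -> f64 {
    (sigma_threshold / alpha).powi(2)
}

/// Smallest whole number of images meeting the threshold.
fn required_images(alpha: f64, sigma_threshold: f64) -> u64 {
    required_images_exact(alpha, sigma_threshold).ceil() as u64
}
```

For $\alpha = 0.4$ and a $3\sigma$ target the threshold is 56.25, so 57 whole images are needed, in line with the approximate figure above.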

9. Comparison with Commercial and Academic Systems

9.1 System Overview

The DFT template approach used by Phasm belongs to a family of synchronization techniques employed across academic and commercial watermarking systems. The table below contextualizes Phasm’s design against notable alternatives:

System	Synchronization Method	Payload Capacity	Geometric Robustness	Computational Cost	GPU Required
Phasm (Armor + DFT Template)	32 passphrase-derived DFT peaks	54-2,600 bytes (STDM) + ~2 bytes (ring, planned)	Rotation, scale, moderate crop	~200-800 ms (WASM)	No
Digimarc	Fixed DFT template (proprietary)	48-96 bits (cloud ID lookup)	Full RST + crop	Proprietary	No
Meta Seal / VideoSeal	Neural sync network (SyncSeal)	32-256 bits	Full RST + perspective	~100 ms (GPU)	Yes
StegaStamp (UC Berkeley)	End-to-end neural codec	56 bits (100 raw)	RST + print-photograph	~200 ms (GPU)	Yes
Adobe TrustMark	Neural encoder/decoder	70 bits (100 raw)	Partial RST	~150 ms (GPU)	Yes
Pereira & Pun (2000)	Fixed DFT template (academic)	Spread-spectrum (64-256 bits)	RST	~100-400 ms	No
Zheng et al. DWT-DFT (2003)	DFT template + DWT payload	~100-500 bits	RST + compression	~200-600 ms	No
Log-Polar DFT / FMT	Fourier-Mellin invariant domain	32-256 bits	RST (inherent)	~300-1000 ms	No

9.2 Key Differentiators

Capacity. Phasm’s architecture separates the synchronization channel (DFT template, ~0 data bits) from the data channel (STDM, 54-2,600 bytes). This decoupling gives Phasm 1-3 orders of magnitude more payload capacity than systems that embed both synchronization and data in the same domain. Digimarc sidesteps the capacity problem entirely by embedding only a short ID and resolving the actual content via cloud lookup – a design incompatible with Phasm’s client-side, privacy-first architecture.

No GPU requirement. The neural watermarking systems in the table (Meta Seal, StegaStamp, TrustMark) rely on GPU inference to reach the quoted latencies for both embedding and extraction. Phasm’s DFT template system runs entirely in pure Rust, compiled to native code and WebAssembly, executing in 200-800 ms on a mobile CPU or browser. This makes it deployable on commodity hardware without backend infrastructure.

Secret template positions. The passphrase-derived template is a significant security improvement over fixed-template systems (Pereira & Pun, Digimarc). While Digimarc compensates with legal and contractual protections (the template is proprietary and reverse-engineering is legally prohibited), Phasm’s template security is cryptographic rather than legal.

Client-side privacy. All processing in Phasm occurs on the user’s device – no images or messages are transmitted to a server. This is architecturally incompatible with cloud-ID systems like Digimarc but aligns with Phasm’s design philosophy of privacy by default.

10. Conclusion

The geometric synchronization problem – recovering encoder-decoder alignment after an unknown rotation, scaling, or crop – is fundamentally different from the recompression survival problem addressed by DCT-domain STDM. Recompression perturbs coefficient values within a preserved block grid (exploiting the three invariants of JPEG recompression); geometric transforms destroy the grid entirely. Solving both requires a dual-domain approach: DCT-domain STDM for data embedding (robust to value perturbations) and DFT-domain template embedding for synchronization (robust to coordinate perturbations).

Phasm’s implementation of this dual-domain architecture embeds 32 passphrase-derived peaks in the mid-frequency band of the DFT magnitude spectrum, using phase-preserving additive embedding with Hermitian symmetry to ensure real-valued spatial output. The three core properties of the DFT magnitude – translation invariance, rotation covariance, and scale covariance – enable the decoder to estimate the applied geometric transform from peak displacements using a closed-form least-squares estimator that requires only basic arithmetic operations (sums, products, atan2, sqrt) and a minimum of 8 detected peaks.

The DFT ring payload provides a secondary, resize-robust data channel – small (8 raw bytes, of which roughly 1–2 would be usable only once a compact frame format is in place) and intended to confirm message presence and passphrase correctness when the primary STDM payload cannot be recovered. Together, the template, the ring, and the existing STDM pipeline form a three-stage decode architecture that gracefully degrades from full message recovery (no geometric transform) through geometry-corrected recovery (template-guided resample + STDM retry) to minimal confirmation (ring extraction).

The entire system is implemented in pure Rust with no external FFT or linear algebra dependencies, using FDLIBM-based deterministic trigonometric functions and a custom Cooley-Tukey + Bluestein FFT engine to guarantee bit-identical results across native ARM64, x86-64, and WebAssembly targets. This deterministic arithmetic is not merely a convenience – it is a correctness requirement: if the encoder and decoder disagree on even a single peak position due to platform-dependent floating-point rounding, the transform estimation degrades, and the subsequent resample fails to restore the block grid.

The DFT template approach is not a complete solution to the geometric resilience problem. Heavy cropping (>40% area removal) weakens the template beyond detectability. Non-uniform scaling (aspect ratio changes) introduces errors in the rotation/scale estimation model. And the bilinear resampling step introduces interpolation noise that the STDM extractor must absorb, reducing the effective error correction margin. Implementation of the Phase 3 DFT template system is currently in progress, with the research and algorithm design complete. The 81-90% memory reduction achieved through optimization makes the full pipeline viable on all target platforms. Remaining limitations motivate future work on SIFT-based registration (for crop-heavy scenarios) and DFT-domain geometry estimation refinement.

For practitioners building robust watermarking or steganographic systems, the key takeaways are:

Separate synchronization from data. The template carries no payload; the payload knows nothing about geometry. This decoupling maximizes capacity and simplifies each subsystem.
Derive positions from a secret. Passphrase-derived peak positions convert the template estimation attack from feasible (known positions) to infeasible ($\binom{445{,}000}{32}$ search space).
Embed along existing phase. Phase-preserving additive embedding avoids spatial artifacts and maintains Hermitian symmetry with minimal bookkeeping.
Use the closed-form estimator. The $a, b, \theta, s$ formulas require only accumulators over peak pairs – no matrix inversion, no iterative optimization, no convergence concerns.
Plan for graceful degradation. The three-stage fallback (STDM, template + STDM, ring) ensures that some information is recoverable even under transforms that destroy the primary channel.

Phasm is a free steganography app that hides encrypted text messages inside JPEG photos. It runs on iOS, Android, and the web. All processing happens on your device. The DFT template research described in this post is complete, with implementation in progress as part of Armor mode’s Phase 3 geometric resilience. This complements the STDM recompression survival and Fortress social media survival layers. Ghost mode now uses the J-UNIWARD cost function (fully implemented as of version 1.3.0) for maximum stealth. The pure-Rust JPEG codec and deterministic FFT engine make all of this possible without C FFI or GPU dependencies.

References

Pereira, S., and Pun, T. (2000). “Robust Template Matching for Affine Resistant Image Watermarks.” IEEE Transactions on Image Processing, 9(6), 1123-1129.
Holliman, M., and Memon, N. (2000). “Counterfeiting Attacks on Oblivious Block-Wise Independent Invisible Watermarking Schemes.” IEEE Transactions on Image Processing, 9(3), 432-441.
Zheng, D., Zhao, J., and El Saddik, A. (2003). “RST-Invariant Digital Image Watermarking Based on Log-Polar Mapping and Phase Correlation.” IEEE Transactions on Circuits and Systems for Video Technology, 13(8), 753-765.
O’Ruanaidh, J. J. K., and Pun, T. (1998). “Rotation, Scale and Translation Invariant Spread Spectrum Digital Image Watermarking.” Signal Processing, 66(3), 303-317.
Solachidis, V., and Pitas, I. (2001). “Circularly Symmetric Watermark Embedding in 2-D DFT Domain.” IEEE Transactions on Image Processing, 10(11), 1741-1753.
Chen, B., and Wornell, G. W. (2001). “Quantization Index Modulation: A Class of Provably Good Methods for Digital Watermarking and Information Embedding.” IEEE Transactions on Information Theory, 47(4), 1423-1443.
Comesana, P., and Perez-Gonzalez, F. (2006). “On the Capacity of Stego-Systems.” Proceedings of the 8th ACM Workshop on Multimedia and Security, 15-24.
Bas, P., Chassery, J.-M., and Macq, B. (2002). “Geometrically Invariant Watermarking Using Feature Points.” IEEE Transactions on Image Processing, 11(9), 1014-1028.
Kang, X., et al. (2025). “Efficient Geometric Synchronization for Deep Watermarking.” arXiv preprint.
Cooley, J. W., and Tukey, J. W. (1965). “An Algorithm for the Machine Calculation of Complex Fourier Series.” Mathematics of Computation, 19(90), 297-301.
Bluestein, L. I. (1970). “A Linear Filtering Approach to the Computation of Discrete Fourier Transform.” IEEE Transactions on Audio and Electroacoustics, 18(4), 451-455.
Anderson, R., and Biham, E. (1996). “Tiger: A Fast New Hash Function.” Fast Software Encryption, Springer.
Butora, J., and Fridrich, J. (2023). “Errorless Robust JPEG Steganography using Outputs of JPEG Coders.” IEEE Transactions on Dependable and Secure Computing.