WebAssembly promises portable, near-native performance – but its floating-point determinism guarantee has a critical gap. Transcendental functions like sin, cos, and atan2 delegate to host Math.* implementations that differ across browsers. For Phasm, a steganography application that must produce identical encode/decode behavior across iOS, Android, and WebAssembly, this gap caused three independent cross-platform failures. We describe the problems and solutions: an FDLIBM-based deterministic math module using only WASM-intrinsic operations, an in-house FFT replacing the rustfft crate, and explicit u32 type casts to fix PRNG divergence between 32-bit and 64-bit targets.
1. The Problem: WASM Math Is Not Deterministic
WebAssembly’s specification makes strong guarantees about basic arithmetic. IEEE 754 operations – addition, subtraction, multiplication, division, square root, floor, ceiling, absolute value, and copysign – are all defined as WASM intrinsics. When the Rust compiler emits f64.add or f64.sqrt, the WASM engine executes that operation directly, producing bit-identical results on every conformant runtime. These operations are the bedrock of portable numeric computation, and they work exactly as advertised.
The trouble starts with transcendental functions. When Rust code calls f64::sin(), the compiler does not emit a WASM intrinsic – because no such intrinsic exists. Instead, it emits a call to an imported host function: the browser’s Math.sin, implemented natively inside the JavaScript engine. V8 (Chrome), SpiderMonkey (Firefox), and JavaScriptCore (Safari) each provide their own Math.sin implementation, and neither the ECMAScript specification nor IEEE 754 requires these implementations to agree beyond a generous accuracy bound.
The discrepancies are small – typically 1-3 ULP (units in the last place) – but they are real and they are systematic. Consider a concrete example. For the input $x = 1.7$:
$$\text{Math.sin}(1.7) = \begin{cases} \texttt{0x3FEFB4DD1B4CBE10} & \text{V8 (Chrome)} \\ \texttt{0x3FEFB4DD1B4CBE0F} & \text{SpiderMonkey (Firefox)} \\ \texttt{0x3FEFB4DD1B4CBE10} & \text{JavaScriptCore (Safari)} \end{cases}$$
A 1-ULP difference in the last bit of a 64-bit float. For most applications, this is irrelevant – both values are correct to 15 decimal digits. But for a system where floating-point results feed into downstream integer decisions, a 1-ULP difference can cascade into completely different outputs.
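The cascade is easy to demonstrate: when a value lands on a rounding boundary, a single-ULP perturbation flips the integer result. A minimal sketch (17.5 is an arbitrary boundary value chosen for illustration):

```rust
// A 1-ULP difference flips an integer rounding decision at a bin boundary.
fn main() {
    let a: f64 = 17.5;                        // sits exactly on the rounding boundary
    let b = f64::from_bits(a.to_bits() - 1);  // nearest representable value below 17.5

    assert_eq!(a.round(), 18.0); // rounds away from zero
    assert_eq!(b.round(), 17.0); // one ULP lower rounds the other way
}
```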
1.1 Where This Breaks Steganography
Phasm is a steganography application with a shared Rust core compiled to three targets: native static libraries for iOS and Android, and WebAssembly for the browser. The core implements two steganographic modes: Ghost mode for stealth (resists detection) and Armor mode for robustness (survives JPEG recompression and geometric transformations). Both modes must produce identical encode/decode behavior across all platforms – if an image encoded on an iPhone cannot be decoded in Chrome, the application is broken.
The first cross-platform failure we encountered was not in the trigonometric functions themselves, but in the systems built on top of them. Armor mode’s DFT template embedding places 32 passphrase-derived peaks in the 2D Discrete Fourier Transform magnitude spectrum of the cover image. These peaks survive geometric transforms (rotation, scaling, cropping) and allow the decoder to estimate and invert any transformation applied during transmission. The peak generation process is:
1. Derive a 256-bit key from the passphrase via Argon2id
2. Seed a ChaCha20 PRNG with this key
3. Generate 32 random angles $\theta_k \in [0, 2\pi)$ and radii $r_k \in [r_\text{min}, r_\text{max}]$
4. Convert polar to Cartesian: $u_k = r_k \cos\theta_k$, $v_k = r_k \sin\theta_k$
Steps 1-3 use only integer and basic arithmetic operations – they are perfectly deterministic across all platforms. Step 4 calls sin and cos. On native ARM64, these compile to platform libm calls that produce consistent results within each OS. On WASM, they compile to call Math.sin and call Math.cos. The results differ by 1-3 ULP.
A 1-ULP difference in a peak coordinate might seem harmless – after all, the peaks are placed at continuous frequency positions that get rounded to the nearest integer bin. But when the coordinate falls near a bin boundary (which happens for roughly 1 in every 16 peaks, statistically), the rounding goes different ways on different platforms. The encoder on Safari places a peak at bin $(42, 17)$; the decoder on Chrome expects it at bin $(42, 18)$. The peak detection search window compensates for small displacements, but when enough peaks are misplaced, the transform estimation degrades and decode fails.
1.2 The Paradox of Partial Determinism
What made this bug insidious was that most of the system was deterministic. ChaCha20 is pure integer math – identical on every platform. AES-256-GCM-SIV encryption is integer math. Syndrome-Trellis Codes are binary operations over GF(2). The pure-Rust JPEG coefficient codec uses only integer arithmetic (exploiting the deterministic invariants of JPEG recompression). Even the Fisher-Yates permutation worked correctly (after a separate fix discussed in Section 5). The non-determinism was confined to exactly the subsystems that called transcendental functions – but those subsystems were critical.
1.3 Scope of the Problem
This is not a Phasm-specific issue. Any WASM application requiring deterministic transcendentals faces the same gap – multiplayer game engines, distributed scientific computing, blockchain smart contracts. The WASM specification leaves transcendental functions to the host, the ECMAScript specification permits implementation-approximated accuracy for Math.* functions, and browser vendors have no incentive to converge.
2. Which Operations ARE Deterministic in WASM?
Before building custom math functions, it is worth understanding exactly which floating-point operations the WASM specification guarantees to be deterministic.
2.1 WASM Intrinsics: The Safe Foundation
The WASM specification defines intrinsic instructions for all basic IEEE 754 operations: f64.add, f64.sub, f64.mul, f64.div, f64.sqrt, f64.abs, f64.neg, f64.floor, f64.ceil, f64.trunc, f64.nearest, f64.copysign, f64.min, f64.max, and all comparison operators. These are correctly rounded per IEEE 754 – every WASM engine must produce the same result for the same inputs, down to the last bit.
2.2 Host-Delegated Functions: The Danger Zone
The following operations have no WASM intrinsic and delegate to the host JavaScript runtime:
| Function | Rust Syntax | WASM Behavior | Deterministic? |
|---|---|---|---|
| Sine | f64::sin() | call Math.sin | No |
| Cosine | f64::cos() | call Math.cos | No |
| Tangent | f64::tan() | call Math.tan | No |
| Arctangent | f64::atan() | call Math.atan | No |
| Arctangent2 | f64::atan2() | call Math.atan2 | No |
| Exponential | f64::exp() | call Math.exp | No |
| Logarithm | f64::ln() | call Math.log | No |
| Power | f64::powf() | call Math.pow | No |
| Hypot | f64::hypot() | call Math.hypot | No |
2.3 Mapping the Dependency
With this taxonomy in hand, we audited every subsystem in Phasm’s core for transcendental function usage:
| Subsystem | Transcendental Functions Used | Impact of Non-Determinism |
|---|---|---|
| ChaCha20 PRNG | None (integer only) | None |
| AES-256-GCM-SIV | None (integer only) | None |
| Argon2id KDF | None (integer only) | None |
| JPEG coefficient codec | None (integer only) | None |
| STC embedding/extraction | None (binary GF(2)) | None |
| J-UNIWARD cost function | cos (IDCT table) | Different embedding costs |
| Fisher-Yates permutation | None (integer PRNG) | None (after u32 fix) |
| DFT template peaks | sin, cos | Different peak positions |
| 2D FFT twiddle factors | sin, cos | Different spectrum values |
| FFT magnitude | hypot | Different magnitude values |
| Transform estimation | atan2 | Different rotation estimate |
The audit revealed that only five functions needed custom implementations: sin, cos, sincos (combined), atan2, and hypot. Every other subsystem either uses only WASM-intrinsic operations or operates entirely in the integer domain.
3. FDLIBM: Freely Distributable LIBM
3.1 History and Design Philosophy
FDLIBM (Freely Distributable LIBM) was developed at Sun Microsystems in 1993 by K.C. Ng and colleagues. Its design goal was unusual for the era: produce a math library that gives bit-identical results on every IEEE 754 platform, regardless of hardware, compiler, or operating system. While other math libraries (like glibc’s libm or Intel’s MKL) optimize for speed and may exploit hardware-specific features, FDLIBM uses only the basic IEEE 754 operations – precisely the operations that WASM guarantees to be deterministic.
This makes FDLIBM uniquely suited to the WASM determinism problem. Its algorithms were designed thirty years ago for a different portability challenge (SPARC vs. x86 vs. MIPS), but the same property – using only add, subtract, multiply, divide, floor, sqrt, abs, and copysign – makes them perfectly portable across WASM engines today.
Phasm’s det_math module implements FDLIBM’s algorithms in Rust, targeting five functions: det_sin, det_cos, det_sincos, det_atan2, and det_hypot. The module is approximately 275 lines of Rust and has zero external dependencies.
3.2 Cody-Waite Range Reduction
All polynomial approximations for trigonometric functions require the input to be in a narrow range – typically $[-\pi/4, \pi/4]$. For arbitrary inputs, range reduction maps $x$ to a reduced argument $r$ and a quadrant index $q$ such that:
$$x = q \cdot \frac{\pi}{2} + r, \quad |r| \leq \frac{\pi}{4}$$
The challenge is that $\pi/2$ cannot be represented exactly in floating-point. Naive subtraction of $n \cdot \pi/2$ from $x$ suffers from catastrophic cancellation for large $x$, losing many significant bits.
The Cody-Waite technique addresses this by splitting $\pi/2$ into two parts:
$$\frac{\pi}{2} \approx \text{PIO2\_HI} + \text{PIO2\_LO}$$
where $\text{PIO2\_HI}$ has its low bits zeroed so that $n \cdot \text{PIO2\_HI}$ is exactly representable (no rounding) for moderate $n$. The reduction then proceeds as:
$$r = (x - n \cdot \text{PIO2\_HI}) - n \cdot \text{PIO2\_LO}$$
The first subtraction is exact (no rounding error); only the second subtraction introduces a small error, proportional to $n \cdot \text{PIO2\_LO} \approx n \cdot 6.1 \times 10^{-17}$. For inputs $|x| < 10^6$ (which covers all practical use cases in Phasm), this provides approximately 70 bits of effective precision in the reduced argument – more than enough for the subsequent polynomial evaluation.
In Phasm’s implementation, the quadrant index is computed as:
$$n = \left\lfloor x \cdot \frac{2}{\pi} + 0.5 \right\rfloor$$
and the reduced argument is:
$$r = (x - n \cdot \text{PIO2\_HI}) - n \cdot \text{PIO2\_LO}$$
The floor operation is a WASM intrinsic, and all arithmetic is multiplication and subtraction – both WASM intrinsics. The entire range reduction uses only deterministic operations.
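The reduction above fits in a few lines. The constants below are the standard FDLIBM-style split of $\pi/2$, shown here in decimal for illustration – the actual module would pin them as exact bit patterns:

```rust
// Cody-Waite range reduction using only WASM-intrinsic operations
// (multiply, subtract, floor). Constants are the FDLIBM split of pi/2.
const INV_PIO2: f64 = 6.36619772367581382433e-01; // 2/pi
const PIO2_HI: f64 = 1.57079632673412561417e+00;  // high part, low bits zeroed
const PIO2_LO: f64 = 6.07710050650619224932e-11;  // pi/2 - PIO2_HI

// Returns the reduced argument r and the quadrant index q = n mod 4.
fn reduce(x: f64) -> (f64, u32) {
    let n = (x * INV_PIO2 + 0.5).floor();
    let r = (x - n * PIO2_HI) - n * PIO2_LO; // first subtraction is exact
    (r, (n as i64 & 3) as u32)               // bitwise AND handles negative n
}

fn main() {
    let (r, q) = reduce(1.7);
    assert_eq!(q, 1); // 1.7 is just past pi/2, so quadrant 1
    assert!((r - (1.7 - std::f64::consts::FRAC_PI_2)).abs() < 1e-12);
    assert!(r.abs() <= std::f64::consts::FRAC_PI_4 + 1e-9);
}
```

Note the `n as i64 & 3` trick: two's-complement AND yields the correct quadrant even for negative $n$, avoiding a branch.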
3.3 Polynomial Kernels
With the reduced argument $r \in [-\pi/4, \pi/4]$, the sine and cosine are evaluated using minimax polynomial approximations.
Sine kernel. For $|x| \leq \pi/4$, the sine is approximated by a 13th-degree odd polynomial:
$$\sin(x) \approx x + x^3 \cdot (S_1 + x^2 \cdot (S_2 + x^2 \cdot (S_3 + x^2 \cdot (S_4 + x^2 \cdot (S_5 + x^2 \cdot S_6)))))$$
The coefficients $S_1$ through $S_6$ are the FDLIBM minimax coefficients, stored as exact bit patterns via f64::from_bits() to avoid any compiler-dependent float parsing:
$$S_1 = -1.6667 \times 10^{-1}, \quad S_2 = 8.3333 \times 10^{-3}, \quad S_3 = -1.9841 \times 10^{-4}$$ $$S_4 = 2.7557 \times 10^{-6}, \quad S_5 = -2.5051 \times 10^{-8}, \quad S_6 = 1.5897 \times 10^{-10}$$
These are the first terms of the Taylor series for $\sin(x)/x$, adjusted by the Remez algorithm to minimize the maximum error over the interval $[-\pi/4, \pi/4]$ rather than the error at $x = 0$.
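A sketch of the kernel in Horner form. The coefficients are the published FDLIBM values (truncated to decimal here for readability; the real module loads them via f64::from_bits):

```rust
// FDLIBM minimax coefficients for sin(x) on [-pi/4, pi/4].
const S1: f64 = -1.66666666666666324348e-01;
const S2: f64 = 8.33333333332248946124e-03;
const S3: f64 = -1.98412698298579493134e-04;
const S4: f64 = 2.75573137070700676789e-06;
const S5: f64 = -2.50507602534068634195e-08;
const S6: f64 = 1.58969099521155010221e-10;

// Odd polynomial in Horner form: only add and mul, both WASM intrinsics.
fn sin_kern(x: f64) -> f64 {
    let z = x * x;
    let r = S2 + z * (S3 + z * (S4 + z * (S5 + z * S6)));
    x + x * z * (S1 + z * r)
}

fn main() {
    // Within the reduced range the kernel tracks libm to well under 1e-15.
    for &x in &[0.1, 0.5, std::f64::consts::FRAC_PI_4] {
        assert!((sin_kern(x) - x.sin()).abs() < 1e-15);
    }
}
```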
Cosine kernel. The cosine uses a 14th-degree even polynomial with 6 coefficients, evaluated in Horner form:
$$\cos(x) \approx 1 - \frac{x^2}{2} + x^4 \cdot (C_1 + x^2 \cdot (C_2 + x^2 \cdot (C_3 + \cdots)))$$
The subtlety is in how $1 - x^2/2$ is computed. The implementation groups the subtraction and correction as $1 - (hz - z \cdot r)$ where $hz = x^2/2$ and $r$ is the polynomial term, computing the correction in a single expression to minimize rounding steps. (The full FDLIBM reference uses a Kahan-style compensated summation for additional precision near $x^2/2 \approx 1$, but at Phasm’s input ranges – angles within $[-\pi/4, \pi/4]$ from range reduction – the simplified form provides sufficient accuracy.)
3.4 Quadrant Dispatch
After range reduction yields $(r, q)$ where $q = n \bmod 4$, the full-range sine and cosine are computed by dispatching on the quadrant:
$$\sin(x) = \begin{cases} \sin_\text{kern}(r) & q = 0 \\ \cos_\text{kern}(r) & q = 1 \\ -\sin_\text{kern}(r) & q = 2 \\ -\cos_\text{kern}(r) & q = 3 \end{cases}$$
$$\cos(x) = \begin{cases} \cos_\text{kern}(r) & q = 0 \\ -\sin_\text{kern}(r) & q = 1 \\ -\cos_\text{kern}(r) & q = 2 \\ \sin_\text{kern}(r) & q = 3 \end{cases}$$
The det_sincos function computes both simultaneously, sharing the range reduction and evaluating both kernels once:
```rust
pub fn det_sincos(x: f64) -> (f64, f64) {
    let (r, q) = reduce(x);
    let s = sin_kern(r);
    let c = cos_kern(r);
    match q {
        0 => ( s,  c),
        1 => ( c, -s),
        2 => (-s, -c),
        3 => (-c,  s),
        _ => unreachable!(),
    }
}
```
This is the function called by every FFT twiddle factor computation and every DFT template peak generation. It uses only WASM-intrinsic operations throughout – multiply, add, subtract, floor, and negation. No call to Math.sin or Math.cos is ever emitted.
3.5 The atan2 Implementation
The atan2 function is needed for transform estimation: given detected peak displacements, the decoder estimates the rotation angle via:
$$\theta = \text{atan2}(b, a)$$
where $a$ and $b$ are least-squares estimates from matched peak pairs (discussed in Section 4). FDLIBM’s atan uses a 5-range argument reduction scheme that maps arbitrary inputs to a small interval where an 11-coefficient polynomial provides accuracy within 1 ULP.
The five ranges partition $|x|$ into regions where the argument is reduced using the identity:
$$\arctan(x) = \arctan(x_0) + \arctan\left(\frac{x - x_0}{1 + x \cdot x_0}\right)$$
with reference points $x_0 \in \{0, 0.5, 1.0, 1.5, \infty\}$ and precomputed values $\arctan(x_0)$. The polynomial evaluation splits odd and even terms into separate summations for accuracy, then combines them:
$$\arctan(x) \approx x - x \cdot (s_1 + s_2)$$
where $s_1$ accumulates the even-index coefficients and $s_2$ the odd-index coefficients of an 11-term polynomial in $z = x^2$.
The det_atan2(y, x) wrapper handles the four-quadrant logic, special cases (zeros, infinities, NaN), and sign adjustments, then delegates to det_atan(|y/x|) for the core computation.
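The quadrant dispatch can be sketched as follows. Here std’s atan stands in for the polynomial core purely so the sketch runs; det_atan and det_atan2 are the assumed names of the deterministic versions:

```rust
use std::f64::consts::PI;

// Placeholder for the deterministic polynomial core described above;
// std's atan is used here only so the sketch is self-contained.
fn det_atan(x: f64) -> f64 {
    x.atan()
}

// Four-quadrant dispatch: reduce atan2 to atan(|y/x|) plus sign/offset fixes.
// Infinity/NaN special cases are omitted for brevity.
fn det_atan2(y: f64, x: f64) -> f64 {
    if x > 0.0 {
        det_atan(y / x)                        // quadrants I and IV
    } else if x < 0.0 {
        let a = PI - det_atan((y / x).abs());  // fold into quadrant II
        if y >= 0.0 { a } else { -a }          // mirror for quadrant III
    } else if y > 0.0 {
        PI / 2.0                               // positive y-axis
    } else if y < 0.0 {
        -PI / 2.0                              // negative y-axis
    } else {
        0.0                                    // atan2(0, 0) convention
    }
}

fn main() {
    for &(y, x) in &[(1.0, 1.0), (1.0, -1.0), (-1.0, -1.0), (-1.0, 1.0), (0.3, 2.0)] {
        assert!((det_atan2(y, x) - y.atan2(x)).abs() < 1e-12);
    }
}
```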
3.6 Error Bounds and Verification
FDLIBM guarantees all functions are accurate to within 1 ULP – the result is either the correctly rounded value or one of its two nearest floating-point neighbors. In practice, for the input ranges encountered in Phasm (angles $[0, 2\pi]$ and moderate magnitudes), the error is typically less than 0.6 ULP.
Phasm’s test suite verifies determinism at two levels:
Accuracy tests compare det_sin, det_cos, and det_atan2 against known exact values ($\sin(\pi/6) = 0.5$, $\cos(\pi/3) = 0.5$, $\arctan(1) = \pi/4$) with tolerance $10^{-15}$.
Bit-identity tests pin the exact u64 bit representations of function outputs at specific inputs. These pinned values are the ground truth for cross-platform compatibility – if any pinned value changes, the cross-platform test fails and the build breaks. For example:
$$\text{det\_sin}(1.0) = \texttt{0x3FEAED548F090CEE}$$ $$\text{det\_cos}(1.0) = \texttt{0x3FE14A280FB5068C}$$
These bit patterns are verified on every CI run across native and WASM targets. A mismatch means the deterministic math layer has been compromised and cross-platform decode is broken.
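The bit-identity style is worth showing concretely. Since f64.sqrt is a WASM intrinsic and correctly rounded, its output can be pinned to an exact bit pattern on every platform – the pattern below is the well-known IEEE 754 encoding of $\sqrt{2}$:

```rust
// Bit-identity testing: pin the exact u64 encoding, not an epsilon band.
fn main() {
    let v = (2.0f64).sqrt(); // f64.sqrt is a WASM intrinsic, correctly rounded
    assert_eq!(v.to_bits(), 0x3FF6A09E667F3BCD); // sqrt(2) as IEEE 754 bits

    // An epsilon test would also pass here, but only the bit test can
    // detect a 1-ULP divergence between platforms.
    assert_eq!(v, f64::from_bits(0x3FF6A09E667F3BCD));
}
```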
4. The FFT Challenge: Bluestein’s Algorithm
4.1 Why Steganography Needs FFT
Phasm’s Armor mode includes a DFT template embedding system for geometric resilience – the ability to recover hidden messages even after the image has been rotated, scaled, or cropped. The template works by embedding 32 known peaks in the 2D DFT magnitude spectrum at passphrase-derived positions. Because the DFT magnitude is invariant to translation and transforms predictably under rotation ($\theta$) and scaling ($s$):
$$|F'(u, v)| = \frac{1}{s^2} \left| F\left(\frac{u \cos\theta + v \sin\theta}{s}, \frac{-u \sin\theta + v \cos\theta}{s}\right) \right|$$
the decoder can detect the embedded peaks, measure how they have shifted relative to their expected positions, and estimate the geometric parameters via least-squares:
$$a = \frac{\sum(u_k u'_k + v_k v'_k)}{\sum(u_k^2 + v_k^2)}, \quad b = \frac{\sum(u_k v'_k - v_k u'_k)}{\sum(u_k^2 + v_k^2)}$$
$$\theta = \text{atan2}(b, a), \quad s = \sqrt{a^2 + b^2}$$
Computing the 2D DFT of a megapixel-scale image requires a fast Fourier transform. The standard approach is to use an FFT library – and Rust has excellent options, including the rustfft crate.
4.2 The External Crate Problem
Our initial implementation used rustfft for the 2D FFT. It worked perfectly on native targets. On WASM, the FFT output differed by small amounts – 1-3 ULP in individual frequency bins – because rustfft computes twiddle factors (the complex exponentials $e^{-2\pi i k/N}$) using the standard library’s sin and cos, which on WASM delegate to Math.sin and Math.cos.
The twiddle factor for position $k$ in an $N$-point FFT is:
$$W_N^k = e^{-2\pi i k/N} = \cos\left(\frac{2\pi k}{N}\right) - i \sin\left(\frac{2\pi k}{N}\right)$$
Every butterfly operation in the FFT multiplies by a twiddle factor. A 1-ULP error in a twiddle factor propagates through $\log_2 N$ butterfly stages, accumulating to errors of $O(\log N)$ ULP in the output bins. For a 1024-point FFT, this means output errors of up to 10 ULP – still small in absolute terms, but large enough to shift magnitude peaks across bin boundaries in the DFT template detection.
We could not patch rustfft to use our deterministic math functions without forking the crate and maintaining a permanent divergence from upstream. Instead, we replaced it entirely with an in-house FFT implementation that uses det_sincos for all twiddle factor computation.
4.3 Radix-2 Cooley-Tukey
For power-of-two input sizes, the classic Cooley-Tukey radix-2 decimation-in-time algorithm provides $O(N \log N)$ performance with a straightforward in-place implementation. The algorithm has three phases:
Bit-reversal permutation. Reorder the input so that element at index $i$ moves to the bit-reversed position $\text{rev}(i)$. This is performed in-place using the standard bit-reversal loop.
Butterfly stages. For each stage $m = 1, 2, \ldots, \log_2 N$, the array is divided into groups of size $2^m$. Within each group, pairs of elements are combined using a butterfly operation:
$$\begin{aligned} t &\leftarrow W \cdot X[k + h] \\ X[k + h] &\leftarrow X[k] - t \\ X[k] &\leftarrow X[k] + t \end{aligned}$$
where $h = 2^{m-1}$ and $W = e^{-i\pi k/h}$ is the twiddle factor computed via:
```rust
let (s, c) = det_sincos(angle_step * k as f64);
let w = Complex64::new(c, s);
```
Every twiddle factor in every stage uses det_sincos. No call to Math.sin or Math.cos is ever emitted.
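For reference, here is a compact sketch of the three phases – bit-reversal, butterfly stages, twiddle multiplication – over bare (f64, f64) pairs. Phasm’s version computes each twiddle with det_sincos and uses Complex64; std’s sin/cos stand in here so the sketch is self-contained:

```rust
use std::f64::consts::PI;

// Minimal radix-2 decimation-in-time FFT, in place.
fn fft(buf: &mut [(f64, f64)]) {
    let n = buf.len();
    assert!(n.is_power_of_two());

    // Phase 1: bit-reversal permutation.
    let mut j = 0;
    for i in 1..n {
        let mut bit = n >> 1;
        while j & bit != 0 {
            j ^= bit;
            bit >>= 1;
        }
        j |= bit;
        if i < j {
            buf.swap(i, j);
        }
    }

    // Phase 2: butterfly stages of width len = 2, 4, ..., n.
    let mut len = 2;
    while len <= n {
        let ang = -2.0 * PI / len as f64; // forward transform: e^{-2 pi i k / len}
        for start in (0..n).step_by(len) {
            for k in 0..len / 2 {
                let (c, s) = ((ang * k as f64).cos(), (ang * k as f64).sin());
                let (ur, ui) = buf[start + k];
                let (vr, vi) = buf[start + k + len / 2];
                let (tr, ti) = (vr * c - vi * s, vr * s + vi * c); // t = W * X[k+h]
                buf[start + k] = (ur + tr, ui + ti);
                buf[start + k + len / 2] = (ur - tr, ui - ti);
            }
        }
        len <<= 1;
    }
}

fn main() {
    // A constant signal concentrates all energy in the DC bin.
    let mut buf = vec![(1.0, 0.0); 8];
    fft(&mut buf);
    assert!((buf[0].0 - 8.0).abs() < 1e-12);
    assert!(buf.iter().skip(1).all(|&(re, im)| re.abs() < 1e-12 && im.abs() < 1e-12));
}
```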
4.4 Bluestein’s Chirp-Z Transform
JPEG images rarely have power-of-two dimensions. A 4032x3024 photo, divided into 8x8 blocks, produces a luminance grid of 504x378 blocks – neither dimension is a power of two. For arbitrary-length FFTs, we implement Bluestein’s algorithm, which reduces an $N$-point DFT to a circular convolution of length $M = 2^{\lceil\log_2(2N-1)\rceil}$ – a power of two – computed via three radix-2 FFTs.
The key identity underlying Bluestein’s algorithm uses the chirp factor $w_k = e^{i\pi k^2/N}$:
$$X[k] = w_k^* \sum_{n=0}^{N-1} \left(x[n] \cdot w_n^*\right) \cdot w_{k-n}$$
This expresses the DFT as a convolution of the “chirped” input sequence $a[n] = x[n] \cdot w_n^*$ with the chirp sequence $b[n] = w_n$. The convolution is computed via:
- Zero-pad $a$ and $b$ to length $M$ (with wrap-around for $b$’s negative indices)
- Forward FFT both sequences (radix-2, since $M$ is a power of two)
- Pointwise multiplication in the frequency domain
- Inverse FFT the product
- Extract the first $N$ elements and multiply by the conjugate chirp
All chirp factors are computed using det_sincos:
```rust
let angle = sign * PI * (k as f64 * k as f64) / n as f64;
let (s, c) = det_sincos(angle);
chirp[k] = Complex64::new(c, s);
```
The full pipeline for a single arbitrary-length FFT is:
```mermaid
graph LR
A["Input x[n]
length N"] --> B["Chirp multiply
a[n] = x[n] · w*[n]"]
B --> C["Zero-pad to M
(power of 2)"]
C --> D["Radix-2 FFT
length M"]
D --> E["Pointwise ×"]
F["Chirp sequence b[n]
with wrap-around"] --> G["Zero-pad to M"]
G --> H["Radix-2 FFT
length M"]
H --> E
E --> I["Radix-2 IFFT
length M"]
I --> J["Normalize ÷ M"]
J --> K["Chirp multiply
X[k] · w*[k]"]
K --> L["Output X[k]
first N elements"]
```
4.5 2D FFT via Row-Column Decomposition
The 2D DFT is computed by applying 1D FFTs along rows, transposing, applying 1D FFTs along columns, and transposing back. Each 1D FFT dispatches to either radix-2 or Bluestein depending on length. The magnitude spectrum uses det_hypot for overflow protection (the standard scaling trick $|F| = m \cdot \sqrt{1 + (n/m)^2}$, where $m = \max(|a|,|b|)$ and $n = \min(|a|,|b|)$); since sqrt is a WASM intrinsic, this computation is inherently deterministic.
4.6 Verification: Parseval’s Theorem
The correctness of the FFT implementation is verified using Parseval’s theorem, which states that the total energy in the spatial domain equals the total energy in the frequency domain:
$$\sum_{x,y} |f(x,y)|^2 = \frac{1}{N} \sum_{u,v} |F(u,v)|^2$$
where $N$ is the total number of pixels. The test suite verifies this identity for both power-of-two and non-power-of-two dimensions, with the error consistently below $10^{-6}$.
Additionally, the roundtrip test – forward FFT followed by inverse FFT – recovers the original pixel values with error below $10^{-10}$ for power-of-two sizes and below $10^{-8}$ for arbitrary sizes (the slightly larger error for Bluestein is due to the additional FFT stages in the convolution).
The DC component (frequency bin $(0,0)$) equals the sum of all input values – a simple but effective sanity check that is verified exactly.
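Both identities can be checked against a naive $O(N^2)$ DFT acting as an independent oracle. A self-contained sketch of the 1D case, with an arbitrary test vector:

```rust
use std::f64::consts::PI;

// Naive O(N^2) DFT, small enough to serve as an independent oracle.
fn dft(x: &[f64]) -> Vec<(f64, f64)> {
    let n = x.len();
    (0..n)
        .map(|k| {
            let mut acc = (0.0, 0.0);
            for (j, &v) in x.iter().enumerate() {
                let ang = -2.0 * PI * (k * j) as f64 / n as f64;
                acc.0 += v * ang.cos();
                acc.1 += v * ang.sin();
            }
            acc
        })
        .collect()
}

fn main() {
    let x = [3.0, -1.0, 2.0, 0.5, -2.5, 1.0];
    let spec = dft(&x);

    // Parseval: spatial energy equals frequency energy divided by N.
    let e_space: f64 = x.iter().map(|v| v * v).sum();
    let e_freq: f64 =
        spec.iter().map(|(re, im)| re * re + im * im).sum::<f64>() / x.len() as f64;
    assert!((e_space - e_freq).abs() < 1e-9);

    // DC sanity check: bin 0 is the plain sum of the inputs.
    let sum: f64 = x.iter().sum();
    assert!((spec[0].0 - sum).abs() < 1e-12 && spec[0].1.abs() < 1e-12);
}
```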
5. The u32 Trap: When Integer Width Breaks Cross-Platform Crypto
The third cross-platform determinism failure was not in floating-point math at all – it was in the integer domain, and it was the most dangerous because it produced silently wrong results.
5.1 The usize Width Problem
Rust’s usize type matches the pointer width of the target platform: 64 bits on native ARM64 and x86-64, but 32 bits on WASM (wasm32-unknown-unknown). This is well-documented and generally harmless – most code does not depend on usize being a specific width. But when usize is used as a type parameter for random number generation, the consequences are severe.
Phasm uses a Fisher-Yates shuffle to permute the order of embeddable DCT coefficients. The permutation is deterministic: seeded by the passphrase-derived key, it produces the same coefficient ordering on every platform, ensuring that the encoder and decoder process coefficients in the same sequence. The original implementation:
```rust
for i in (1..n).rev() {
    let j = rng.gen_range(0..=i); // i has type usize
    positions.swap(i, j);
}
```
The gen_range(0..=i) call generates a uniform random integer in $[0, i]$. The rand crate’s implementation dispatches based on the integer type: for u64 (native usize), it calls rng.next_u64(), consuming 64 bits of PRNG state per step; for u32 (WASM usize), it calls rng.next_u32(), consuming only 32 bits per step.
5.2 The Cascade Effect
ChaCha20 produces output in 32-bit words. A next_u64() call consumes two words; next_u32() consumes one. After $k$ Fisher-Yates steps, the native PRNG has consumed $2k$ words while WASM has consumed $k$. For a typical image with 100,000+ embeddable coefficients, the permutations diverge completely – an image encoded on iOS cannot be decoded in Chrome, and vice versa.
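The divergence is easy to see with a toy generator that counts word consumption (a hypothetical WordRng, not the rand crate – its next_u64 is built from two next_u32 calls, mirroring how rand drives a 32-bit-word stream cipher like ChaCha20):

```rust
// Toy generator that tracks how many 32-bit PRNG words have been consumed.
struct WordRng {
    state: u32,
    words_consumed: usize,
}

impl WordRng {
    fn next_u32(&mut self) -> u32 {
        self.words_consumed += 1;
        self.state = self.state.wrapping_mul(1664525).wrapping_add(1013904223);
        self.state
    }
    fn next_u64(&mut self) -> u64 {
        let lo = self.next_u32() as u64;
        let hi = self.next_u32() as u64;
        (hi << 32) | lo
    }
}

fn main() {
    let mut wasm32 = WordRng { state: 42, words_consumed: 0 };
    let mut native = WordRng { state: 42, words_consumed: 0 };

    // Simulate k = 1000 Fisher-Yates steps on each target.
    for _ in 0..1000 {
        let _ = wasm32.next_u32(); // usize is 32 bits on wasm32
        let _ = native.next_u64(); // usize is 64 bits on native
    }

    assert_eq!(wasm32.words_consumed, 1000);
    assert_eq!(native.words_consumed, 2000); // 2k words: the streams diverge
}
```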
5.3 The Fix: Explicit u32
The fix is a single type annotation:
```rust
for i in (1..n).rev() {
    let j = rng.gen_range(0..=(i as u32)) as usize;
    positions.swap(i, j);
}
```
By casting i to u32 before passing it to gen_range, the call always uses the u32 code path – consuming exactly one 32-bit word per step on every platform. The PRNG state advances identically on native and WASM, producing identical permutations.
This works because the maximum number of embeddable coefficients in a JPEG image is bounded by image dimensions. For the maximum supported resolution of 8192x8192 pixels, there are approximately 1 million 8x8 blocks with 63 AC coefficients each – roughly 66 million positions, well within u32::MAX ($\approx 4.3 \times 10^9$).
5.4 Pinned Regression Tests
To prevent regression, the cross-platform test suite pins the first 20 permutation indices for a known seed and cost map configuration:
```text
Seed: [42; 32], 4x4 blocks (1008 positions)
Expected: [258, 980, 673, 988, 76, 41, 725, 301,
           438, 872, 667, 574, 867, 881, 46, 240,
           965, 56, 339, 941]
```
These values were recorded on 2026-02-23 after the u32 fix. If any change to the shuffle algorithm, the PRNG, or the coefficient selection logic produces different values, the test fails with an explicit message: “Permutation output changed! This breaks WASM/native compatibility.”
Similar pinned tests exist for det_sin, det_cos, det_atan2, det_hypot, FFT outputs, IDCT cosine table entries, and DFT template peak positions. Together, these pinned values form an immutable contract between the encoder and decoder across all platforms.
6. Lessons for WASM Developers
6.1 Three Rules for WASM Determinism
The experience of building and debugging three independent cross-platform determinism failures distills into three principles:
Rule 1: Audit every transcendental function call. If your WASM module calls sin, cos, tan, atan, atan2, exp, ln, log, pow, or hypot and requires identical results across engines, you must replace those calls with implementations built exclusively from WASM intrinsics. FDLIBM provides battle-tested algorithms; the implementation effort for a targeted subset is modest (Phasm’s five functions required approximately 275 lines of Rust).
Rule 2: Fix your integer widths. usize is not portable. Any code path where usize determines PRNG consumption, hash computation, serialization format, or any other cross-platform-visible behavior must use explicitly-sized types (u32, u64). This includes indirect uses: Vec::len() returns usize, and passing it to a generic function parameterized on usize can trigger width-dependent code paths in the standard library or in crates like rand.
Rule 3: Pin bit-exact outputs in your test suite. Accuracy tests (“|a - b| < epsilon”) catch algorithmic errors but miss the 1-ULP differences that cause cross-platform divergence. Bit-identity tests (a.to_bits() == expected_bits) catch everything. Pin the outputs of every determinism-critical function at representative inputs, record the exact u64 bit patterns, and fail the build if any pattern changes. These pinned values are the specification; the code is the implementation.
6.2 Performance Considerations
Replacing Math.sin/Math.cos with FDLIBM polynomial evaluation has a measurable but tolerable performance impact. The det_sincos combined function shares range reduction across both kernels, halving the overhead for the common case where both sine and cosine are needed (as in every FFT twiddle factor). In Phasm’s case, switching to deterministic math increased FFT time by roughly 20-30% in microbenchmarks, but the overall encode operation – which includes JPEG parsing, J-UNIWARD cost computation, STC embedding, and JPEG reassembly – slowed by less than 5%. For an operation that completes in 200-800ms total, this is imperceptible to the user.
6.3 Applicability Beyond Steganography
The deterministic math approach applies directly to any WASM application requiring cross-engine reproducibility: multiplayer game engines running physics simulations client-side, distributed computing frameworks partitioning work across heterogeneous nodes, and blockchain smart contracts executing in WASM-based virtual machines (e.g., CosmWasm). WASM’s lack of deterministic transcendentals is a known issue in the standards community, but browser vendors have no incentive to converge on a single implementation. Until the standard evolves, the FDLIBM approach works today, on every engine, with no special compiler flags.
7. Conclusion
WebAssembly delivers on its promise of portable, near-native performance – with one critical caveat. The transcendental functions that delegate to host Math.* implementations are a hidden source of cross-platform divergence. For Phasm, three independent bugs – transcendental non-determinism, FFT twiddle factor divergence, and usize-width PRNG consumption – each produced correct-looking but silently wrong cross-platform behavior. The solutions are modest in code size (a 275-line FDLIBM math module, an in-house FFT, and explicit u32 casts) but represent hard-won understanding of the WASM execution model.
The broader lesson: cross-platform determinism is not a property you get for free. It is a property you must design for, audit for, and test for – with bit-exact pinned outputs, not just epsilon-tolerance accuracy checks. The WASM specification guarantees deterministic basic arithmetic; everything beyond that is your responsibility.
Phasm is a free steganography app that hides encrypted text messages inside JPEG photos. It runs on iOS, Android, and the web. All processing happens on your device – no images or messages are sent to any server. The deterministic math layer described in this post ensures that images encoded on any platform can be decoded on any other platform, down to the last bit. Learn more about Phasm’s architecture: pure-Rust JPEG codec, Armor mode robustness, Ghost mode stealth, Syndrome-Trellis Codes, DFT template for geometric resilience, and how 15 platforms process your photos.
References
- J. W. Cooley and J. W. Tukey, “An Algorithm for the Machine Calculation of Complex Fourier Series,” Mathematics of Computation, vol. 19, no. 90, pp. 297–301, 1965.
- L. I. Bluestein, “A Linear Filtering Approach to the Computation of Discrete Fourier Transform,” IEEE Transactions on Audio and Electroacoustics, vol. 18, no. 4, pp. 451–455, 1970.
- W. J. Cody and W. Waite, Software Manual for the Elementary Functions. Prentice-Hall, 1980.
- K. C. Ng, “Argument Reduction for Huge Arguments: Good to the Last Bit,” Sun Microsystems, 1992. (FDLIBM: Freely Distributable LIBM, Sun Microsystems, 1993.)
- IEEE, “IEEE Standard for Floating-Point Arithmetic,” IEEE Std 754-2019, 2019.
- W3C/WebAssembly Community Group, “WebAssembly Specification,” Version 2.0, 2024. https://webassembly.github.io/spec/
- S. Pereira and T. Pun, “Robust Template Matching for Affine Resistant Image Watermarks,” IEEE Transactions on Image Processing, vol. 9, no. 6, pp. 1123–1129, 2000.
- M.-A. Parseval, “Mémoire sur les séries et sur l’intégration complète d’une équation aux différences partielles linéaires du second ordre, à coefficients constants,” Mémoires présentés à l’Institut des Sciences, Lettres et Arts, par divers savans, et lus dans ses assemblées. Sciences, mathématiques et physiques, vol. 1, pp. 638–648, 1806.