Abstract
Block-Average Quantization Index Modulation (BA-QIM) provides near-perfect survival under JPEG recompression, but a fixed quantization step introduces visible artifacts in smooth image regions – gradients, skies, and out-of-focus backgrounds reveal the 8x8 block grid. This paper describes how Watson’s 1993 perceptual model for DCT coefficient visibility was adapted to make QIM embedding strength content-adaptive: embed harder in texture, softer in smooth areas. The Watson masking system is now fully integrated in Fortress mode as of version 1.3.0, with a 4-tier adaptive QIM step sizing scheme that scales the embedding strength per-block based on local texture complexity. We detail the AC energy ratio metric used as a texture proxy, the piecewise-linear base curve that maps energy to a masking factor, and the key innovation of coupling the masking range to the repetition factor so that short messages (high redundancy) get narrow adaptation for maximum robustness while long messages (low redundancy) get wide adaptation for maximum quality. The soft majority voting decoder uses LLR (log-likelihood ratio) integrity scoring to weight extraction confidence per-block, naturally complementing the Watson masking by downweighting low-confidence extractions from smooth blocks. We also recount a cautionary development failure: a discrete skip-tier scheme that appeared to work in isolation but caused catastrophic encoder-decoder misalignment after recompression, and the continuous-masking fix that resolved it. All parameter values are derived from production source code, and all experimental claims are grounded in end-to-end WhatsApp survival tests.
1. Introduction: The Quality-Robustness Tradeoff
Robust steganography presents a fundamental tension. The embedding must alter the host image enough to survive channel noise – recompression, format conversion, cross-library IDCT differences – but not so much that the alterations become visible. In Quantization Index Modulation (QIM), this tension reduces to a single parameter: the quantization step $\Delta$.
A large $\Delta$ creates wide decision regions that tolerate large perturbations, yielding low bit error rates (BER) after recompression. A small $\Delta$ produces minimal distortion but narrow decision regions that are easily destroyed by noise. For BA-QIM – the Fortress sub-mode within Phasm’s Armor architecture – the embedding domain is the average brightness of each 8x8 luminance block, derived from the DC coefficient:
$$\text{avg} = \frac{\text{DC} \times Q_{\text{DC}}}{8}$$
At a fixed step of $\Delta = 12$ pixel levels, Fortress survives JPEG recompression down to quality factor 53 (sufficient for even WeChat’s aggressive pipeline). The maximum brightness shift observed across three encoder families (sips/AppleJPEG, libjpeg-turbo, MozJPEG) and four quality factors is under 2 pixel levels – well within the $\pm 6$ decision margin that a step of 12 provides.
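The QIM embed/extract rule on a block average can be sketched as follows. This is a minimal illustration of binary QIM, not the production Fortress code; function names are hypothetical:

```rust
/// Embed one bit into a block-average value by snapping it to the
/// nearest lattice point of the bit's coset (spacing `delta`, the
/// bit-1 coset offset by `delta / 2`). Hypothetical sketch, not
/// the production Fortress implementation.
fn qim_embed(avg: f64, bit: u8, delta: f64) -> f64 {
    let offset = if bit == 1 { delta / 2.0 } else { 0.0 };
    ((avg - offset) / delta).round() * delta + offset
}

/// Extract: the coset whose lattice point is nearer wins. Any
/// perturbation smaller than delta/4 in magnitude cannot flip the bit,
/// and in practice the margin is delta/2 against one-sided shifts.
fn qim_extract(avg: f64, delta: f64) -> u8 {
    let near0 = (avg / delta).round() * delta;
    let near1 = ((avg - delta / 2.0) / delta).round() * delta + delta / 2.0;
    if (avg - near0).abs() <= (avg - near1).abs() { 0 } else { 1 }
}
```

With $\Delta = 12$, an embedded average lands on a lattice point and a sub-2-level recompression shift leaves it well inside the correct decision region.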
But a fixed step of 12 applied uniformly to every block in the image produces visible artifacts. In smooth regions – blue sky, defocused backgrounds, studio backdrops – the human visual system (HVS) is exquisitely sensitive to brightness perturbations. An 8x8 block shifted by 6 pixel levels against a flat gradient creates a visible mosaic pattern. In textured regions – foliage, fabric, gravel – the same 6-level shift is completely invisible, masked by the local contrast.
This observation has been well-understood since the early days of image compression. The HVS exhibits three forms of masking relevant to DCT-domain processing:
Luminance masking. Sensitivity varies with mean luminance: the eye is less sensitive to perturbations in very dark or very bright regions than at mid-luminance. The Weber fraction quantifies this – in the mid-range, the just-noticeable difference (JND) is roughly a fixed fraction of the background luminance, $\Delta L / L \approx 0.02$.
Texture masking. High spatial-frequency content (edges, texture, noise) raises the visibility threshold for additional perturbations in the same region. A block containing complex texture tolerates much larger modifications than a smooth block at the same mean luminance.
Frequency masking. Sensitivity varies across the DCT frequency positions: the eye is most sensitive to low-frequency components (which correspond to large-area brightness and contrast) and progressively less sensitive to high-frequency detail. JPEG quantization matrices exploit this directly, using larger quantization steps for high-frequency coefficients.
For BA-QIM, which modifies only the DC coefficient (the zero-frequency component), frequency masking is not directly applicable – every modification is at the same frequency position. But luminance masking and, especially, texture masking are directly relevant. A block’s AC energy – the energy in its non-DC coefficients – is a natural measure of local texture complexity. High AC energy means complex texture; low AC energy means smooth content.
Watson’s 1993 perceptual model (Watson, 1993) provides the theoretical foundation. Originally developed for optimizing JPEG quantization matrices, Watson’s model computes per-coefficient, per-block perceptibility thresholds that account for luminance adaptation, frequency sensitivity, and contrast masking. While the full model is richer than what BA-QIM requires – it operates on all 64 DCT positions per block – its core insight translates directly: the visibility threshold for a DCT coefficient modification depends on the local image content, and that dependency can be computed from the DCT coefficients themselves.
The adaptation we describe in this paper takes Watson’s insight and implements it for the specific case of DC-only modification with a QIM embedding constraint. The result is a per-block multiplicative factor that scales the QIM step: smooth blocks get a smaller effective step (less distortion, narrower decision region), and textured blocks get a larger effective step (more distortion, wider decision region). The embedding energy is redistributed from perceptually sensitive regions to perceptually tolerant regions – improving image quality without reducing aggregate robustness.
2. Watson’s Perceptual Model for DCT Coefficients
2.1 The Original Watson Model
Andrew Watson’s 1993 paper “DCTune: A Technique for Visual Optimization of DCT Quantization Matrices for Individual Images” introduced a principled method for computing content-dependent JPEG quantization matrices. Rather than using a single quantization matrix for all blocks (the standard JPEG approach), Watson’s model computes the maximum quantization step for each DCT coefficient in each block such that the quantization error remains below the visibility threshold.
Watson’s model combines three perceptual factors. For DCT coefficient $(i, j)$ in block $b$:
Frequency sensitivity threshold $t_{ij}$: the base visibility threshold for coefficient position $(i, j)$ at mean luminance, determined by psychophysical experiments. These are the well-known CSF (Contrast Sensitivity Function) values in the DCT domain. Low-frequency positions have small thresholds (high sensitivity); high-frequency positions have large thresholds (low sensitivity).
Luminance adaptation $t_{ij}^{(b)}$: the threshold adjusted for the block’s mean luminance $\bar{L}_b$ (proportional to the DC coefficient):
$$t_{ij}^{(b)} = t_{ij} \left( \frac{\bar{L}_b}{\bar{L}_0} \right)^{a_T}$$
where $\bar{L}_0$ is a reference luminance (typically 128) and $a_T \approx 0.649$ is the luminance adaptation exponent.
Contrast masking $t_{ij}^{*(b)}$: the threshold further elevated by the presence of similar-frequency content in the same block. If the existing coefficient magnitude $|c_{ij}^{(b)}|$ exceeds the luminance-adapted threshold, the masking-adjusted threshold becomes:
$$t_{ij}^{*(b)} = \max\left(t_{ij}^{(b)}, \; \left|c_{ij}^{(b)}\right|^{w_{ij}} \cdot \left(t_{ij}^{(b)}\right)^{1 - w_{ij}}\right)$$
where $w_{ij} \approx 0.7$ is the contrast masking exponent. This is the texture masking effect: large existing coefficients (complex texture) raise the visibility threshold, allowing larger modifications to go unnoticed.
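The two threshold adjustments above can be written down directly from the formulas. The sketch below is illustrative only (function names are ours, and it freely uses `powf`, which the production code avoids for determinism reasons discussed later):

```rust
/// Watson luminance adaptation: scale the base threshold t_ij by
/// (block mean luminance / reference luminance)^a_T, with a_T ~ 0.649.
/// Illustrative sketch of the published model, not production code.
fn luminance_adapted(t_ij: f64, mean_lum: f64, ref_lum: f64, a_t: f64) -> f64 {
    t_ij * (mean_lum / ref_lum).powf(a_t)
}

/// Watson contrast masking: a large existing coefficient magnitude
/// raises the threshold via |c|^w * t^(1-w), with w ~ 0.7; the max
/// keeps the threshold from dropping below the luminance-adapted one.
fn contrast_masked(t_lum: f64, coeff_mag: f64, w: f64) -> f64 {
    t_lum.max(coeff_mag.powf(w) * t_lum.powf(1.0 - w))
}
```

At the reference luminance the adaptation is a no-op, and a zero coefficient leaves the threshold unchanged – both easy sanity checks on the formulas.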
2.2 Watson in Watermarking and Steganography
Watson’s model has been extensively applied in the digital watermarking literature. Podilchuk and Zeng (1998) embedded watermarks using Watson-derived JND (Just Noticeable Difference) masks, concentrating watermark energy where it would be least visible. Barni et al. (2001) combined Watson thresholds with DFT-domain embedding for robust watermarking. Perez-Gonzalez et al. (2007) applied perceptual masking to STDM (Spread Transform Dither Modulation) embedding, using Watson’s model to set per-group quantization steps.
The common pattern across this work is: compute a perceptual threshold per embedding location, then scale the embedding strength proportionally. Locations where the HVS threshold is high (textured, high-frequency content) receive stronger embedding; locations where the threshold is low (smooth, low-frequency content) receive weaker embedding. The total embedding energy may remain the same, but its spatial distribution shifts from perceptually sensitive to perceptually tolerant regions.
For QIM-based embedding specifically, the perceptual mask translates to a per-block scaling factor $w_b$ applied to the quantization step:
$$\Delta_b = w_b \cdot \Delta_{\text{base}}$$
where $\Delta_{\text{base}}$ is the nominal QIM step and $w_b \in [w_{\min}, w_{\max}]$ is the Watson factor for block $b$. The extraction rule at the decoder must use the same $w_b$ – which means the decoder must be able to recompute the perceptual mask from the received (possibly degraded) image. This places a critical constraint on the masking metric: it must be recompression-invariant, or at least stable enough that the encoder and decoder agree on each block’s factor despite recompression noise.
2.3 Why Full Watson Is Excessive for DC-Only Modification
The full Watson model computes 64 thresholds per block – one for each DCT coefficient position. For Fortress, which modifies only the DC coefficient, 63 of those 64 computations are unused. Moreover, the luminance adaptation and contrast masking in Watson’s model are formulated for the general case where multiple coefficients in the same block may be modified simultaneously. With DC-only modification, the perceptual analysis simplifies dramatically.
What remains relevant is the texture masking component: does the block contain enough high-frequency content to mask a DC-level brightness shift? The answer depends on the energy in the AC coefficients – the same information that Watson’s contrast masking uses, but reduced from 63 individual coefficient values to a single scalar energy metric.
This simplification is not merely an engineering convenience. It is a necessity. The full Watson model requires the exact DC coefficient value (for luminance adaptation) and exact AC coefficient values (for contrast masking). After recompression, both may change. The DC value changes because the quantization table changes; the AC values change because of the pixel-domain round-trip. Using the full Watson model would require the decoder to compute masking factors from post-recompression coefficient values – values that may differ from those the encoder saw.
By collapsing the Watson analysis to a single AC energy metric with carefully chosen stability properties, we can compute a masking factor that agrees between encoder and decoder even after aggressive recompression. The next section details this metric.
3. Our Adaptation: AC Energy Ratio as Texture Proxy
3.1 The Energy Metric
The core perceptual metric used in Phasm’s Watson masking is the AC energy of each 8x8 luminance block, defined as:
$$E_b = \sum_{\substack{k=1 \\ |c_k| \geq 2}}^{63} c_k^2$$
where $c_k$ is the $k$-th quantized AC coefficient of block $b$ (the DC coefficient at $k = 0$ is excluded). The sum spans all 63 AC positions, but only coefficients with absolute value 2 or greater are included.
3.2 Why the $|c| \geq 2$ Threshold
The threshold at $|c| \geq 2$ is not arbitrary – it is a stability criterion grounded in the recompression experiments documented in our companion post on JPEG recompression invariants. Coefficients with $|c| = 1$ are the most vulnerable to recompression: they sit at the boundary between zero and non-zero, and even a small pixel-domain perturbation can flip them. Experiments across three encoder families (sips, libjpeg-turbo, MozJPEG) at quality factors from 53 to 95 showed that the sign of coefficients with $|c| \geq 2$ is near-perfectly stable: 0.00% sign BER across all tests (with a negligible 2/27,677 exception for MozJPEG at QF 70).
By excluding $|c| = 1$ coefficients from the energy sum, the metric becomes stable across recompression. A block classified as “textured” by the encoder will also be classified as “textured” by the decoder, even if the intervening WhatsApp pipeline has recompressed the JPEG with a completely different quality factor and encoder library.
The squared magnitude $c_k^2$ emphasizes large coefficients (which are the most stable and the most relevant to perceived texture) while avoiding the need for a square root operation, keeping the computation in basic IEEE 754 arithmetic. Since all values are integers squared and summed, the energy is computed identically on all platforms – critical for deterministic encoder-decoder agreement in WebAssembly, native ARM, and native x86 targets (see also: our work on the pure Rust JPEG coefficient codec that provides the coefficient access layer).
3.3 Median Normalization
Raw AC energy values vary enormously across images. A photograph of dense foliage might have median block energy of 50,000, while a studio portrait with a smooth background might have median energy of 500. To produce a consistent masking curve that works across all images, the per-block energy is normalized by the median energy across all blocks:
$$r_b = \frac{E_b}{E_{\text{median}}}$$
where $E_{\text{median}} = \text{median}(\{E_1, E_2, \ldots, E_N\})$, clamped to a minimum of 1.0 to prevent division by zero in nearly uniform images. The resulting ratio $r_b$ has the following interpretation:
| Energy Ratio $r_b$ | Interpretation |
|---|---|
| 0 | Perfectly smooth block (all $\lvert c \rvert < 2$) |
| $\ll 1$ | Smoother than the median block |
| $\approx 1$ | Typical texture for this image |
| 2–4 | Significantly more textured than average |
| $> 4$ | Very heavy texture (foliage, gravel, fur) |
The median normalization is image-adaptive: the masking curve responds to the distribution of texture within each specific image, not to an absolute threshold that would need tuning per image category.
3.4 Computing Energy Ratios in Practice
The computation is straightforward. For each block in the luminance DCT grid:
```rust
// Per-block AC energy: sum of c^2 over the 63 AC positions,
// counting only coefficients with |c| >= 2 (the recompression-stable ones).
let mut energy: f64 = 0.0;
for k in 1..64 {
    let c = block[k];
    if c.abs() >= 2 {
        energy += (c as f64) * (c as f64);
    }
}
```
All block energies are collected, sorted to find the median, and each energy is divided by the median. The energy ratios are computed once per image and cached – on decode, they are reused across all brute-force repetition-factor candidates, avoiding redundant computation.
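The normalization step can be sketched like this (a simplified illustration: it takes the upper median for even-length inputs, and the function name is ours, not the production identifier):

```rust
/// Normalize per-block AC energies by the image-wide median energy,
/// clamped to 1.0 to avoid division by ~0 in nearly uniform images.
/// Sketch only; uses the upper median for even-length inputs.
fn energy_ratios(energies: &[f64]) -> Vec<f64> {
    let mut sorted = energies.to_vec();
    sorted.sort_by(|a, b| a.partial_cmp(b).unwrap());
    let median = sorted[sorted.len() / 2].max(1.0);
    energies.iter().map(|e| e / median).collect()
}
```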
4. The Piecewise-Linear Base Curve
4.1 Curve Definition
The base Watson curve maps the energy ratio $r_b$ to a masking factor $f_b \in [0.3, 1.5]$ via a monotone piecewise-linear function with four segments, forming the 4-tier adaptive QIM step sizing scheme:
$$f_b = \begin{cases} 0.3 & \text{if } r_b \leq 0.01 \\[4pt] 0.3 + \dfrac{r_b - 0.01}{0.35 - 0.01} \times (0.7 - 0.3) & \text{if } 0.01 < r_b \leq 0.35 \\[4pt] 0.7 + \dfrac{r_b - 0.35}{2.0 - 0.35} \times (1.0 - 0.7) & \text{if } 0.35 < r_b \leq 2.0 \\[4pt] 1.0 + \dfrac{r_b - 2.0}{4.0 - 2.0} \times (1.5 - 1.0) & \text{if } 2.0 < r_b \leq 4.0 \\[4pt] 1.5 & \text{if } r_b > 4.0 \end{cases}$$
The four anchor points, taken directly from the production source code, are:
| Anchor | Energy Ratio $R$ | Base Factor $F$ | Interpretation |
|---|---|---|---|
| 0 | $R_0 = 0.01$ | $F_0 = 0.3$ | Smooth: reduce step to 30% of base |
| 1 | $R_1 = 0.35$ | $F_1 = 0.7$ | Low texture: reduce step to 70% |
| 2 | $R_2 = 2.0$ | $F_2 = 1.0$ | Median-ish texture: use base step |
| 3 | $R_3 = 4.0$ | $F_3 = 1.5$ | Heavy texture: increase step to 150% |
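A straightforward way to implement the curve is to interpolate between the anchor points from the table above. This is a sketch consistent with the published anchors, not the production source (the constant and function names are illustrative):

```rust
/// (energy ratio, base factor) anchors of the 4-tier Watson curve.
const ANCHORS: [(f64, f64); 4] = [(0.01, 0.3), (0.35, 0.7), (2.0, 1.0), (4.0, 1.5)];

/// Map an AC energy ratio to a base masking factor in [0.3, 1.5]
/// via monotone piecewise-linear interpolation. Only +, -, *, /
/// are used, so the result is bit-identical across platforms.
fn watson_base_factor(ratio: f64) -> f64 {
    if ratio <= ANCHORS[0].0 {
        return ANCHORS[0].1; // smooth floor
    }
    for pair in ANCHORS.windows(2) {
        let ((r0, f0), (r1, f1)) = (pair[0], pair[1]);
        if ratio <= r1 {
            // Linear interpolation within this segment.
            return f0 + (ratio - r0) / (r1 - r0) * (f1 - f0);
        }
    }
    ANCHORS[3].1 // saturate at 1.5 beyond the last anchor
}
```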
4.2 Why Piecewise-Linear
The choice of piecewise-linear interpolation rather than a smooth sigmoid, power law, or lookup table is deliberate:
Determinism. The function uses only addition, subtraction, multiplication, and division – no transcendental functions ($\sin$, $\cos$, $\exp$, $\log$) whose implementations vary across platforms. IEEE 754 double-precision arithmetic for these basic operations is bit-identical on x86-64, ARM64, and WebAssembly. This is critical: the encoder and decoder must compute identical masking factors from the same energy ratio, even when running on different platforms (e.g., iOS encoder, browser decoder).
Monotonicity. The factor is non-decreasing in the energy ratio. Higher texture always means a larger QIM step (more robust, less visible). No region of the curve reduces the step as texture increases.
Interpretability. Each anchor point has a clear physical meaning. The curve can be adjusted by moving anchor points without unexpected interactions. Contrast this with a polynomial or neural network mapping, where changing one parameter affects the entire curve.
Computational cost. A single comparison chain and one multiply-add per block. No lookup tables, no interpolation coefficients to precompute.
4.3 Curve Shape and Segment Slopes
The three interior segments have different slopes, reflecting the diminishing perceptual returns of additional texture:
| Segment | Ratio Range | Factor Range | Slope ($\Delta F / \Delta R$) |
|---|---|---|---|
| 1 | $[0.01, 0.35]$ | $[0.3, 0.7]$ | $\approx 1.18$ |
| 2 | $[0.35, 2.0]$ | $[0.7, 1.0]$ | $\approx 0.18$ |
| 3 | $[2.0, 4.0]$ | $[1.0, 1.5]$ | $\approx 0.25$ |
Segment 1 (smooth-to-low-texture) has the steepest slope: even a small amount of texture significantly increases the masking tolerance. Segment 2 (low-to-median) is nearly flat: once a block has moderate texture, additional texture provides only marginal perceptual benefit. Segment 3 (median-to-heavy) rises again: blocks with twice-median energy or more can absorb significantly more distortion. Beyond $R_3 = 4.0$, the factor saturates at 1.5 – there is no benefit to embedding harder in extremely textured blocks, and doing so would risk visual artifacts from extreme DC shifts.
The factor range of $[0.3, 1.5]$ defines the shape of the adaptation. The actual QIM step range applied to each block depends on the adaptive remapping described in Section 5, which narrows or widens this range based on the repetition factor.
4.4 The Remapping Function
The base factor $f_b \in [0.3, 1.5]$ is remapped to the adaptive range $[w_{\text{lo}}, w_{\text{hi}}]$ via linear interpolation:
$$w_b = w_{\text{lo}} + \frac{f_b - 0.3}{1.5 - 0.3} \times (w_{\text{hi}} - w_{\text{lo}})$$
When $w_{\text{lo}} = 0.3$ and $w_{\text{hi}} = 1.5$ (the full range), the remap is an identity. When the range is narrower (e.g., $[0.9, 1.1]$ for maximum robustness), the curve’s 5:1 shape ratio is compressed into a 1.22:1 ratio – the adaptation is present but gentle, biasing toward uniform embedding. The effective QIM step for block $b$ is then:
$$\Delta_b = w_b \times \Delta_{\text{base}}$$
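In code, the remap is one multiply-add per block (sketch with illustrative names; the `0.3`/`1.5` endpoints are the base-curve range from the formula above):

```rust
/// Remap a base Watson factor in [0.3, 1.5] to the adaptive range
/// [w_lo, w_hi] by linear interpolation. Identity when
/// (w_lo, w_hi) == (0.3, 1.5). Illustrative sketch.
fn remap_factor(f_base: f64, w_lo: f64, w_hi: f64) -> f64 {
    w_lo + (f_base - 0.3) / (1.5 - 0.3) * (w_hi - w_lo)
}
```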
5. Adaptive Parameters: Coupling Masking with Repetition Factor
5.1 The Key Innovation
The most important design decision in Phasm’s Watson masking is not the curve shape but the coupling between the masking range and the repetition factor $r$. This coupling is, to our knowledge, novel: prior work on perceptual QIM masking (Podilchuk and Zeng, 1998; Perez-Gonzalez et al., 2007) uses fixed masking ranges independent of the error correction scheme.
The intuition is as follows. The repetition factor $r$ (how many times each information bit is repeated across the image) directly determines the system’s tolerance for per-block bit errors. At $r = 61$ (a very short message in a large image), each bit is voted across 61 copies – even a 17% per-copy BER produces a post-voting error rate below $10^{-8}$ (see our companion post on soft majority voting and LLR-based concatenated codes). At $r = 15$ (the minimum for Fortress), the margin is much thinner: a 19% BER at $r = 15$ yields post-voting error rates that RS coding can barely correct.
This asymmetry means that at high $r$, the system can afford to reduce the QIM step in smooth blocks aggressively (improving quality) because the aggregate BER can be much higher before the message is lost. At low $r$, the system cannot afford quality-motivated step reductions because every bit error counts.
5.2 Continuous Linear Interpolation
The adaptive parameters are determined by continuous linear interpolation between two anchor points, using the repetition factor $r$ as the independent variable:
$$t = \frac{\text{clamp}(r, 15, 61) - 15}{61 - 15}$$
where $t \in [0, 1]$ is the interpolation parameter. Then:
$$\Delta_{\text{base}} = 12.0 + t \times (6.5 - 12.0) = 12.0 - 5.5t$$
$$w_{\text{lo}} = 0.9 + t \times (0.62 - 0.9) = 0.9 - 0.28t$$
$$w_{\text{hi}} = 1.1 + t \times (1.26 - 1.1) = 1.1 + 0.16t$$
At the two extremes:
| Parameter | $r = 15$ (max robustness) | $r = 61$ (max quality) |
|---|---|---|
| Base QIM step $\Delta_{\text{base}}$ | 12.0 | 6.5 |
| Watson low $w_{\text{lo}}$ | 0.9 | 0.62 |
| Watson high $w_{\text{hi}}$ | 1.1 | 1.26 |
| Effective step range | $[10.8, 13.2]$ | $[4.03, 8.19]$ |
| Watson range width | 0.20 | 0.64 |
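The interpolation behind this table is three fused linear ramps in the parameter $t$. A sketch (function name illustrative, constants from the formulas in Section 5.2):

```rust
/// Interpolate (delta_base, w_lo, w_hi) from the repetition factor r,
/// clamping r to [15, 61]. Sketch of the Section 5.2 formulas.
fn adaptive_params(r: u32) -> (f64, f64, f64) {
    let t = (r.clamp(15, 61) - 15) as f64 / 46.0;
    let delta_base = 12.0 - 5.5 * t; // base QIM step: 12.0 -> 6.5
    let w_lo = 0.9 - 0.28 * t;       // Watson low:    0.9 -> 0.62
    let w_hi = 1.1 + 0.16 * t;       // Watson high:   1.1 -> 1.26
    (delta_base, w_lo, w_hi)
}
```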
5.3 The Dual Lever
The coupling works as a dual lever:
Lever 1: Base step. At $r = 15$, the base step is 12.0 pixel levels – a wide QIM decision region that tolerates $\pm 6$ levels of recompression noise. At $r = 61$, the base step drops to 6.5 – a narrower region with $\pm 3.25$ margin, but still sufficient since the maximum observed recompression shift is under 2 pixel levels.
Lever 2: Watson range. At $r = 15$, the Watson masking range $[0.9, 1.1]$ is narrow – the step barely varies between smooth and textured blocks. This is deliberately conservative: with only 15 votes per bit, the system cannot afford to shrink the step in smooth blocks. At $r = 61$, the range widens to $[0.62, 1.26]$ – smooth blocks get steps as small as $0.62 \times 6.5 = 4.03$ pixel levels (barely perceptible), while textured blocks get steps up to $1.26 \times 6.5 = 8.19$ pixel levels (invisible in texture).
The two levers multiply: from the smallest effective step (4.03 at $r = 61$, smooth block) to the largest (13.2 at $r = 15$, textured block), the range spans more than 3:1. But this full range is never realized for a single message – a message that yields $r = 15$ uses a narrow Watson range, and a message that yields $r = 61$ uses a small base step.
5.4 Parameter Table
The following table shows the adaptive parameters at representative repetition factors, computed from the production interpolation formulas:
| $r$ | $t$ | $\Delta_{\text{base}}$ | $w_{\text{lo}}$ | $w_{\text{hi}}$ | Effective Range | Watson Width |
|---|---|---|---|---|---|---|
| 15 | 0.00 | 12.00 | 0.900 | 1.100 | [10.80, 13.20] | 0.200 |
| 20 | 0.11 | 11.40 | 0.870 | 1.117 | [9.91, 12.74] | 0.248 |
| 25 | 0.22 | 10.80 | 0.839 | 1.135 | [9.07, 12.26] | 0.296 |
| 30 | 0.33 | 10.21 | 0.809 | 1.152 | [8.25, 11.76] | 0.343 |
| 38 | 0.50 | 9.25 | 0.760 | 1.180 | [7.03, 10.92] | 0.420 |
| 45 | 0.65 | 8.41 | 0.717 | 1.204 | [6.04, 10.13] | 0.487 |
| 53 | 0.83 | 7.46 | 0.669 | 1.232 | [4.99, 9.19] | 0.563 |
| 61 | 1.00 | 6.50 | 0.620 | 1.260 | [4.03, 8.19] | 0.640 |
5.5 The Elbow at $r = 61$
The cap at $R_{\text{MAX}} = 61$ is not arbitrary. It reflects two converging constraints:
Perceptual threshold. At $r = 61$ with $\Delta_{\text{base}} = 6.5$ and Watson $w_{\text{lo}} = 0.62$, the minimum effective step is 4.03 pixel levels. A QIM shift of half the step – 2.0 levels – is approximately the perceptual threshold for a brightness shift in a smooth 8x8 block against a typical photographic background. Reducing the step further would push the shift below perceptibility even in the worst case, but would also reduce the decision margin below the 2-level recompression noise floor.
BER floor. Blocks with a QIM decision margin of 2.0 pixel levels, subjected to recompression shifts of up to 1.875 levels, produce a non-trivial BER. At $r = 61$, soft majority voting comfortably handles this (error rate $\approx 10^{-8}$). Beyond $r = 61$, the quality improvement is imperceptible, but the BER in “diff=2 blocks” (blocks where the DC shift from embedding equals exactly 2 quantization units) starts increasing because the reduced step shrinks the decision margin faster than the increased redundancy compensates.
The elbow at $r = 61$ is therefore the point of diminishing returns: beyond it, quality gains are invisible and robustness begins to erode. All repetition factors above 61 are clamped to the $r = 61$ parameters.
6. The Watson Skip Tier Problem
6.1 The Original Design
The first implementation of Watson masking for Fortress used a discrete four-tier scheme, documented in the design rationale:
| Tier | Energy Range | Watson Factor | Behavior |
|---|---|---|---|
| Tier 0 | Smoothest blocks | Skip | No embedding |
| Tier 1 | Low texture | 0.7 | Smaller QIM step |
| Tier 2 | Average texture | 1.0 | Base QIM step |
| Tier 3 | Heavy texture | 1.5 | Larger QIM step |
The skip tier (Tier 0) was the most aggressive perceptual optimization: if a block was too smooth to tolerate any QIM perturbation, simply exclude it from embedding entirely. The payload would be distributed across the remaining blocks, and since smooth blocks are typically a minority (20-30% of blocks in a typical photograph), the capacity loss was modest.
This design passed all unit tests and local recompression tests with flying colors. The quality improvement was dramatic – smooth-sky regions showed no artifacts at all, because they were left untouched. The first WhatsApp end-to-end test (February 24, 2026) also succeeded: a message encoded into a 1200x1600 photo survived WhatsApp standard recompression and was decoded correctly on the receiving device.
6.2 The Disaster
The skip tier worked because – in the initial test – the set of blocks classified as “smooth” by the encoder happened to coincide with the set classified as “smooth” by the decoder after recompression. But this coincidence was not guaranteed.
The problem: recompression changes AC energy. When WhatsApp’s encoder re-quantizes the image with different quantization tables, some AC coefficients with $|c| = 2$ or $|c| = 3$ are quantized down to $|c| = 1$ or zeroed out entirely. This reduces the block’s AC energy (recall that $|c| = 1$ coefficients are excluded from the energy sum). A block that was barely above the Tier 0/Tier 1 boundary at encode time could fall below it after recompression, flipping from “embed” to “skip.”
The consequence was catastrophic. The encoder embedded payload bits into blocks at indices 100, 101, 102, 103, etc. (skipping smooth blocks). The decoder, seeing a different set of smooth blocks after recompression, tried to extract from a different sequence of blocks – perhaps indices 100, 102, 103, 104, etc. (skipping block 101 because its energy had dropped below the threshold). Every subsequent block index was shifted by one, creating a complete misalignment between the embedded and extracted bit sequences. The majority voting received misaligned copies, producing garbage output. RS decoding failed on every candidate.
6.3 Why Testing Didn’t Catch It
The skip tier failure was insidious because:
- Local recompression (same encoder) preserves AC energy well. When the same libjpeg instance recompresses at a similar QF, coefficient magnitudes change minimally. The skip set is nearly identical.
- The first WhatsApp test happened to work. The specific image and QF combination used in the initial test produced an energy distribution where no blocks were near the skip boundary. This was coincidental.
- The failure mode is binary. A single skip-set disagreement doesn't degrade gracefully – it misaligns the entire payload. There is no partial recovery.
6.4 The Fix: Continuous Masking, No Discrete Tiers
The fix was straightforward: eliminate the skip tier entirely. All blocks participate in embedding. Tier 0 was changed from “skip” to “minimum factor” – smooth blocks receive the lowest Watson factor (which, at $r = 15$, is 0.9 – barely below the base step), but they still carry embedded bits.
This introduces a small quality cost in the smoothest blocks, but eliminates the catastrophic failure mode. The continuous piecewise-linear curve described in Section 4 replaced the discrete tier system: instead of four discrete levels (skip, 0.7, 1.0, 1.5), the masking factor varies continuously from 0.3 to 1.5 (in the base curve) and is remapped to the adaptive range. No block is special. No block is skipped.
The second WhatsApp test (February 25, 2026) confirmed the fix: a message encoded with continuous masking and no skip tier survived WhatsApp standard recompression, decoded correctly with the same passphrase.
6.5 Lessons for the Watermarking Community
The skip tier failure illustrates a broader principle: any discrete classification of embedding locations that depends on host signal properties is fragile under lossy channel noise. If the encoder uses a threshold to decide “embed here, skip there,” the decoder must reproduce the exact same threshold decisions. When the channel changes the host signal (as JPEG recompression does), threshold boundary effects cause misalignment.
This principle applies beyond Watson masking:
- Coefficient selection schemes that use hard thresholds on coefficient magnitude or stability margin face the same risk. A coefficient that passes the encoder’s threshold may fail the decoder’s threshold after recompression.
- Region-of-interest masking in video watermarking, where embedding is restricted to “complex” regions based on a motion or texture metric, is vulnerable if the metric changes under transcoding.
- Adaptive repetition schemes that vary the repetition factor per-block based on local SNR estimates can suffer decoder misalignment if the SNR changes across the channel.
The robust solution is to avoid discrete embed/skip decisions entirely. Use continuous scaling (as our base curve does), or embed everywhere and let the error correction system handle the higher BER in difficult regions (as Fortress does with soft majority voting). The soft majority voting decoder uses LLR (log-likelihood ratio) integrity scoring to compute per-block extraction confidence from the QIM decision margin. This naturally downweights low-confidence bits from smooth blocks (where the QIM decision margin is small) and upweights high-confidence bits from textured blocks (where the margin is large), achieving an analogous effect to skip-tier filtering without the catastrophic misalignment risk. The LLR scores also serve as a diagnostic: aggregate LLR across all blocks provides a message integrity metric that indicates whether the extraction is likely to succeed before committing to full RS decoding.
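The soft-voting idea can be sketched in a few lines. This is a toy illustration of confidence-weighted voting, not the production LLR computation (which derives the weights from the QIM decision margin and is considerably more involved):

```rust
/// Toy confidence-weighted vote over repeated copies of one bit.
/// Each copy is (extracted bit, confidence weight); the weight plays
/// the role of an LLR magnitude. Hypothetical sketch, not production.
fn soft_vote(copies: &[(u8, f64)]) -> u8 {
    let score: f64 = copies
        .iter()
        .map(|&(bit, weight)| if bit == 1 { weight } else { -weight })
        .sum();
    if score > 0.0 { 1 } else { 0 }
}
```

Note how a single high-confidence copy from a textured block can outvote several low-confidence copies from smooth blocks – the effect that makes skip-tier filtering unnecessary.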
7. Experimental Validation
7.1 End-to-End WhatsApp Survival
The definitive validation of the Watson masking scheme is end-to-end survival through real WhatsApp message pipelines. Two controlled tests were conducted:
Test 1 (2026-02-24): Skip-tier Watson masking (pre-fix, original discrete-tier design).
| Stage | File | Size | Dimensions | JPEG Type | QT DC |
|---|---|---|---|---|---|
| Original | HEIC camera photo | 1,811,334 | 4032x3024 | HEIC | – |
| Phasm-encoded | stego JPEG | 400,131 | 1200x1600 | Baseline (SOF0) | 3 |
| After WhatsApp | received JPEG | 265,658 | 1200x1600 | Progressive (SOF2) | 8 |
Message decoded successfully. However, this test coincidentally had no blocks near the skip boundary.
Test 2 (2026-02-25): Continuous Watson masking (post-fix, production design).
| Stage | File | Size | Dimensions | JPEG Type | QT DC |
|---|---|---|---|---|---|
| Original | HEIC camera photo | 1,806,984 | 4032x3024 | HEIC | – |
| Phasm-encoded | stego JPEG | 285,101 | 1200x1600 | Baseline (SOF0) | 8 |
| After WhatsApp | received JPEG | 279,683 | 1200x1600 | Baseline (SOF0) | 6 |
Message decoded successfully. Note the completely different encoder behavior between tests: WhatsApp produced a Progressive JPEG in Test 1 and a Baseline JPEG in Test 2, with different quantization tables (DC step of 8 vs. 6). In both tests the quantization tables change entirely between the encode and decode stages, yet both messages survived, because BA-QIM operates on pixel-domain block averages – a domain that is invariant to quantization table changes.
7.2 Robustness of the Energy Metric
The AC energy metric (sum of $c^2$ for $|c| \geq 2$) must produce similar masking factors before and after recompression for the system to work. We can reason about its stability from the coefficient sign experiment data:
- For coefficients with $|c| \geq 2$, the sign BER is 0.00% across all tested encoder/QF combinations (sips, libjpeg-turbo, MozJPEG at QF 53-95).
- Coefficients with $|c| \geq 2$ that change magnitude typically shift by $\pm 1$ – a squared-magnitude change of at most $|c|^2 - (|c|-1)^2 = 2|c| - 1$ per coefficient.
- For a block with many large coefficients (heavy texture), the total energy is on the order of thousands, and individual $\pm 1$ shifts contribute a relative change of under 1%.
The energy ratios are therefore highly stable for blocks well above the median. The only blocks where the energy ratio changes significantly after recompression are those near the smooth/textured boundary – precisely the blocks that would have caused skip-tier misalignment in the discrete scheme. With continuous masking, a small energy change in these blocks produces a small factor change, which produces a small step change, which produces a small BER increase that the voting scheme absorbs.
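The stability argument can be made concrete with a small sketch (illustrative code, not the production implementation; the helper names are ours):

```python
def ac_energy(coeffs, threshold=2):
    """Texture proxy: sum of c^2 over AC coefficients with |c| >= threshold."""
    return sum(c * c for c in coeffs if abs(c) >= threshold)

def single_shift_relative_change(c, energy):
    """Relative energy change if one counted coefficient's magnitude drops
    by 1: (|c|^2 - (|c| - 1)^2) / E = (2|c| - 1) / E."""
    return (2 * abs(c) - 1) / energy
```

For a heavily textured block with energy in the thousands, a single ±1 magnitude shift perturbs the energy by well under 1%, so the masking factor – and hence the effective step – is essentially unchanged after recompression.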
7.3 Quality Comparison: Fixed vs. Adaptive Step
A direct PSNR/SSIM comparison between fixed-step and adaptive-step embedding was not conducted as a formal controlled experiment, so we refrain from reporting specific numbers. However, the theoretical quality improvement can be estimated from the step reduction in smooth blocks.
For a message yielding $r = 38$ (mid-range), the adaptive parameters are $\Delta_{\text{base}} = 9.25$, $w_{\text{lo}} = 0.76$, $w_{\text{hi}} = 1.18$. In the smoothest blocks (ratio $\leq 0.01$, factor remapped to 0.76), the effective step is $0.76 \times 9.25 = 7.03$, compared to a fixed step of 9.25. The maximum QIM shift is half the step: $3.52$ levels vs. $4.63$ levels. This is a 24% reduction in peak distortion for the most perceptually sensitive blocks.
Conversely, in the most textured blocks (ratio $> 4.0$, factor remapped to 1.18), the effective step is $1.18 \times 9.25 = 10.92$. The increased distortion in these blocks is invisible due to texture masking – the HVS cannot distinguish a 5.46-level shift from a 4.63-level shift when both occur in a block containing edges, texture, and fine detail.
The net effect is a redistribution of embedding energy from perceptually sensitive to perceptually tolerant regions, with no change in aggregate robustness. The total embedding distortion (sum of squared DC shifts) is approximately preserved; its distribution shifts to match the HVS’s spatial sensitivity profile.
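The arithmetic in Section 7.3 can be reproduced directly. The parameter values are from the text; the variable names are ours:

```python
# Worked arithmetic for the mid-range case r = 38.
delta_base = 9.25           # base QIM step for r = 38
w_lo, w_hi = 0.76, 1.18     # masking-factor range endpoints

step_smooth = w_lo * delta_base    # effective step in the smoothest blocks
step_texture = w_hi * delta_base   # effective step in the most textured blocks

shift_smooth = step_smooth / 2     # max QIM shift is half the step
shift_fixed = delta_base / 2       # fixed-step peak shift, for comparison
reduction = 1 - shift_smooth / shift_fixed  # peak-distortion reduction
```

This recovers the figures quoted above: an effective step of 7.03 (peak shift 3.52 levels) in smooth blocks versus 10.92 (peak shift 5.46 levels) in textured blocks, a 24% reduction in peak distortion where the HVS is most sensitive.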
8. Comparison with Other Perceptual Models
8.1 JND Models
Just Noticeable Difference (JND) models (Yang et al., 2005; Wu et al., 2013) compute per-pixel or per-coefficient visibility thresholds similar to Watson’s model, but often incorporate more sophisticated masking effects (edge masking, pattern masking, temporal masking for video). For QIM watermarking, JND-guided step sizing has been proposed by several authors. Compared to Phasm’s approach:
- JND models are more accurate – they better predict the visibility of specific modifications in specific locations. But they are also more complex, often requiring spatial-domain analysis (edge detection, activity measures) that is computationally expensive and potentially non-deterministic across platforms.
- Phasm’s approach is simpler – a single scalar (AC energy ratio) per block, a piecewise-linear curve, basic arithmetic. The simplicity is a feature, not a limitation: it guarantees platform-deterministic masking factors, which is a hard requirement for blind QIM extraction.
8.2 SSIM-Guided Embedding
Structural Similarity (SSIM)-guided watermarking (Wang et al., 2004) adjusts embedding strength to maintain a target SSIM score per block or per region. This approach directly optimizes for a perceptual quality metric rather than using a masking model as a proxy. The advantage is that SSIM correlates well with perceived quality; the disadvantage is that SSIM computation requires the original image (it measures degradation relative to a reference), making it unsuitable for blind decoder-side computation.
8.3 Content-Aware QIM
Several authors have proposed content-dependent QIM step selection based on local image statistics – edge density, variance, histogram features. Zong et al. (2015) used local contrast and luminance to adapt the QIM step in DWT-domain watermarking. These approaches share the same structure as Phasm’s (compute a local metric, map it to a step factor), but differ in the metric, the domain, and the mapping function.
Phasm’s specific contribution is the coupling of the masking range to the error-correction redundancy. Prior content-aware QIM work treats the perceptual adaptation and the error correction as independent design dimensions. By linking them through the repetition factor, Phasm’s system automatically solves the quality-robustness tradeoff: when redundancy is high (short message, large image), the perceptual model is given more latitude; when redundancy is low, the perceptual model is constrained. This is a principled closed-loop design rather than the open-loop “pick a masking strength and hope it’s right” approach of prior work.
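The closed-loop coupling can be sketched as a mapping from the repetition factor to the masking range. This sketch is hypothetical: only the mid-range anchor ($r = 38 \rightarrow w_{\text{lo}} = 0.76$, $w_{\text{hi}} = 1.18$, from Section 7.3) comes from the paper; the endpoints `r_min`/`r_max`, the linear interpolation, and the asymmetric half-widths are illustrative assumptions, not the production mapping.

```python
def masking_range(r, r_min=10, r_max=66):
    """HYPOTHETICAL coupling of masking range to repetition factor r.

    Only the mid-range anchor (r = 38 -> w_lo = 0.76, w_hi = 1.18) is from
    the paper; the endpoints, linear interpolation, and half-widths are
    illustrative assumptions.  High r (short message, high redundancy)
    widens the range, giving the perceptual model more latitude; low r
    narrows it toward the fixed step (factor 1.0).
    """
    t = max(0.0, min(1.0, (r - r_min) / (r_max - r_min)))
    w_lo = 1.0 - t * 0.48    # hits 0.76 at the r = 38 midpoint (t = 0.5)
    w_hi = 1.0 + t * 0.36    # hits 1.18 at the r = 38 midpoint (t = 0.5)
    return w_lo, w_hi
```

Whatever the production curve looks like, the structural point stands: the perceptual model's latitude is a deterministic function of the redundancy budget, so quality and robustness are traded off automatically rather than by hand.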
8.4 Neural Perceptual Models
Recent deep learning watermarking systems (HiDDeN, StegaStamp, TrustMark) learn end-to-end embedding and extraction networks that implicitly incorporate perceptual masking through their loss functions (typically a weighted combination of image fidelity loss and extraction accuracy loss). These systems achieve remarkable robustness, including survival through print-scan channels, but carry 30-100 bits per image (vs. Fortress’s 160-240+ bits) and require GPU inference.
Phasm’s classical approach – handcrafted Watson masking with a deterministic piecewise-linear curve – achieves comparable perceptual quality for the specific case of DC-only QIM embedding, without neural networks, without GPU, and with guaranteed cross-platform determinism. The tradeoff is generality: neural models handle arbitrary distortions (print-scan, photography of screens) that classical models cannot, though DFT-based template synchronization can address the geometric subset of these challenges.
9. Conclusion
Watson’s 1993 perceptual model, adapted from its original JPEG quantization optimization role, provides a principled foundation for content-adaptive QIM embedding. The adaptation is straightforward: compute AC energy as a texture proxy, map it through a piecewise-linear curve to a masking factor, and scale the QIM step per block. The result is invisible embedding in smooth regions and robust embedding in textured regions.
The key innovation is coupling the masking range to the repetition factor. This linkage transforms the quality-robustness tradeoff from a manual engineering decision into an automatic closed-loop system: the encoder chooses masking aggressiveness based on the redundancy budget, which is determined by the message length and image capacity. Short messages (high $r$) get aggressive masking for maximum quality; long messages (low $r$) get gentle masking for maximum robustness.
The skip-tier failure – a discrete embed/skip decision that caused catastrophic misalignment under recompression – is a cautionary lesson. Any discrete classification of embedding locations based on host signal properties is dangerous when the host signal changes across the channel. Continuous masking, combined with soft decoding that naturally downweights low-confidence extraction results, is the robust alternative.
All parameters described in this paper are taken from production Phasm source code (version 1.3.0, open source on GitHub) and validated through end-to-end WhatsApp survival tests. The Watson perceptual masking system, with its 4-tier adaptive QIM step sizing scheme, is fully integrated in Fortress mode, operating within Phasm’s Armor mode. It builds on the recompression invariants that make BA-QIM possible, and it is complemented by soft majority voting with LLR integrity scoring and Reed-Solomon error correction, which together turn noisy per-block extractions into reliable message recovery. Ghost mode now uses the fully implemented J-UNIWARD cost function for maximum stealth, while Fortress’s Watson masking provides the perceptual quality that makes robust embedding invisible.
References
- Watson, A. B. (1993). “DCTune: A Technique for Visual Optimization of DCT Quantization Matrices for Individual Images.” Society for Information Display Digest of Technical Papers, 24, 946–949.
- Podilchuk, C. I., and Zeng, W. (1998). “Image-Adaptive Watermarking Using Visual Models.” IEEE Journal on Selected Areas in Communications, 16(4), 525–539.
- Barni, M., Bartolini, F., and Piva, A. (2001). “Improved Wavelet-Based Watermarking Through Pixel-Wise Masking.” IEEE Transactions on Image Processing, 10(5), 783–791.
- Perez-Gonzalez, F., Mosquera, C., Barni, M., and Abrardo, A. (2007). “Improved Spread Transform Dither Modulation Using a Perceptual Model.” IEEE ICASSP, II-1149–II-1152.
- Chen, B., and Wornell, G. W. (2001). “Quantization Index Modulation: A Class of Provably Good Methods for Digital Watermarking and Information Embedding.” IEEE Transactions on Information Theory, 47(4), 1423–1443.
- Wang, Z., Bovik, A. C., Sheikh, H. R., and Simoncelli, E. P. (2004). “Image Quality Assessment: From Error Visibility to Structural Similarity.” IEEE Transactions on Image Processing, 13(4), 600–612.
- Yang, X. K., Ling, W. S., Lu, Z. K., Ong, E. P., and Yao, S. S. (2005). “Just Noticeable Distortion Model and Its Applications in Video Coding.” Signal Processing: Image Communication, 20(7), 662–680.
- Wu, J., Shi, G., Lin, W., Liu, A., and Qi, F. (2013). “Just Noticeable Difference Estimation for Images with Free-Energy Principle.” IEEE Transactions on Multimedia, 15(7), 1705–1710.
- Zong, T., Xiang, Y., Natgunanathan, I., Guo, S., Zhou, W., and Beliakov, G. (2015). “Robust Histogram Shape-Based Method for Image Watermarking.” IEEE Transactions on Circuits and Systems for Video Technology, 25(5), 717–729.
- Comesana, P., and Perez-Gonzalez, F. (2006). “On the Capacity of Stego-Systems.” Proceedings of the 8th ACM Workshop on Multimedia and Security, 15–24.
- Butora, J., and Fridrich, J. (2023). “Errorless Robust JPEG Steganography Using Outputs of JPEG Coders.” IEEE Transactions on Dependable and Secure Computing.
- Zhu, J., Kaplan, R., Johnson, J., and Fei-Fei, L. (2018). “HiDDeN: Hiding Data with Deep Networks.” European Conference on Computer Vision (ECCV).
- Tancik, M., Mildenhall, B., and Ng, R. (2019). “StegaStamp: Invisible Hyperlinks in Physical Photographs.” IEEE/CVF CVPR.
- Bui, T., et al. (2024). “TrustMark: Universal Watermarking for Arbitrary Resolution Images.” ICCV.