The Syndrome-Trellis Code (STC) at the heart of Phasm’s Ghost mode is elegant: a Viterbi algorithm walks a 128-state trellis across every embeddable DCT coefficient in the cover image, finding the minimum-cost modification sequence that encodes your message. The math is beautiful. The memory bill is not.
A 48-megapixel camera photo has roughly 10 million usable DCT coefficients. The textbook Viterbi stores one predecessor state (a u32) per trellis state per cover position:
$$\text{Memory}_{\text{naive}} = n \times 2^h \times 4 \;\text{bytes}$$
For $n = 10\text{M}$ and $h = 7$ (128 states), that is 5.1 GB. An iPhone gives your app about 1.5 GB before it gets killed. A WebAssembly instance starts with 16 MB.
We needed to make STC practical on the devices people actually use — without changing the embedding output by a single bit.
1. The Standard Viterbi Back-Pointer Problem
In any Viterbi algorithm, the forward pass finds optimal costs; the backward pass (traceback) reconstructs the optimal path. The traceback needs to know, at every step $j$ and every trellis state $s$, which predecessor state led to the minimum cost. The standard approach stores a full predecessor index per state per step.
For STC with constraint height $h = 7$:
| Image size | Cover positions $n$ | Naive back-pointer memory |
|---|---|---|
| 12 MP phone | 2M | 1.0 GB |
| 32 MP DSLR | 5M | 2.6 GB |
| 48 MP camera | 10M | 5.1 GB |
| 100 MP medium format | 30M | 15.4 GB |
Even moderate phone photos exceed what a mobile app can allocate. The J-UNIWARD cost computation and JPEG codec have already been optimized to ~200 MB — the Viterbi back pointers are the bottleneck.
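These figures follow directly from the naive formula; a throwaway check (not Phasm's code) reproduces the table:

```rust
// Naive Viterbi back-pointer memory: n × 2^h × 4 bytes
// (one u32 per trellis state per cover position).
fn naive_bytes(n: u64, h: u32) -> u64 {
    n * (1u64 << h) * 4
}

fn main() {
    // 48 MP camera: ~10M usable coefficients, h = 7 (128 states)
    let gb = naive_bytes(10_000_000, 7) as f64 / 1e9;
    println!("{gb:.2} GB"); // prints "5.12 GB"
}
```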
2. Insight 1: One Bit Is Enough
The key observation is structural. In STC’s trellis, each target state $s$ at step $j$ has exactly two candidate predecessors, determined by the H-hat column vector $\mathbf{c}_j$:
$$\text{pred}_0(s) = s \quad (\text{stego bit} = 0)$$ $$\text{pred}_1(s) = s \oplus \mathbf{c}_j \quad (\text{stego bit} = 1)$$
where $\oplus$ is bitwise XOR. The forward pass picks whichever predecessor gave lower cost:
$$\text{cost}_0 = \text{prev}[s] + \rho_0(j)$$ $$\text{cost}_1 = \text{prev}[s \oplus \mathbf{c}_j] + \rho_1(j)$$
Since there are exactly two candidates, the choice is a single bit: did we pick stego_bit = 0 or stego_bit = 1? We do not need to store the predecessor state index — it is fully determined by the chosen bit and the known column vector.
For 128 trellis states, that is 128 bits = one u128 = 16 bytes per step:
```rust
let mut packed_bp = 0u128;
for s in 0..num_states {
    let cost_0 = prev_cost[s] + rho_0;       // stego bit = 0: predecessor is s
    let cost_1 = prev_cost[s ^ col] + rho_1; // stego bit = 1: predecessor is s ^ col
    if cost_1 < cost_0 {
        curr_cost[s] = cost_1;
        packed_bp |= 1u128 << s;             // 1 bit records the choice
    } else {
        curr_cost[s] = cost_0;
    }
}
```
During traceback, we reconstruct the predecessor from the stored bit:
```rust
let bit = ((back_ptr[j] >> s) & 1) as u8;
stego_bits[j] = bit;
if bit == 1 {
    s ^= col; // undo the XOR to recover the predecessor state
}
```
This is a 32× memory reduction: from 512 bytes per step (a `Vec<u32>` of 128 entries) to 16 bytes per step (one `u128`).
| Image size | $n$ | Naive | 1-bit packed |
|---|---|---|---|
| 12 MP | 2M | 1.0 GB | 32 MB |
| 48 MP | 10M | 5.1 GB | 160 MB |
| 100 MP | 30M | 15.4 GB | 480 MB |
Better — 12 MP photos now fit on a phone. But 48 MP still doesn’t.
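To make the packing concrete, here is a minimal, self-contained sketch of one forward step and the matching predecessor recovery, using eight states ($h = 3$) instead of 128 and hypothetical costs (not Phasm's code):

```rust
// One forward step with 1-bit packed back pointers over 8 states (h = 3).
fn forward_step(prev_cost: &[f32], col: usize, rho_0: f32, rho_1: f32) -> (Vec<f32>, u128) {
    let mut curr = vec![0.0f32; prev_cost.len()];
    let mut packed = 0u128;
    for s in 0..prev_cost.len() {
        let cost_0 = prev_cost[s] + rho_0;       // stego bit 0: predecessor is s
        let cost_1 = prev_cost[s ^ col] + rho_1; // stego bit 1: predecessor is s ^ col
        if cost_1 < cost_0 {
            curr[s] = cost_1;
            packed |= 1u128 << s;                // one bit records the choice
        } else {
            curr[s] = cost_0;
        }
    }
    (curr, packed)
}

/// Recover the predecessor state from the stored bit during traceback.
fn predecessor(packed: u128, s: usize, col: usize) -> usize {
    if (packed >> s) & 1 == 1 { s ^ col } else { s }
}

fn main() {
    let prev = vec![0.0f32, 5.0, 1.0, 9.0, 2.0, 7.0, 3.0, 4.0];
    let col = 0b101; // hypothetical H-hat column vector
    let (curr, packed) = forward_step(&prev, col, 1.0, 0.5);
    // State 1's cheaper predecessor is 1 ^ 5 = 4 (stego bit = 1):
    assert_eq!(predecessor(packed, 1, col), 4);
    assert!((curr[1] - (prev[4] + 0.5)).abs() < 1e-6);
    println!("packed row: {:#010b}", packed as u8);
}
```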
3. Insight 2: Shift Steps Are Free
Every $w$ cover positions (where $w = \lceil n/m \rceil$), the STC trellis crosses a message-block boundary. States whose least-significant bit doesn't match the required message bit are pruned, and the surviving states are shifted right:
$$s' = s \gg 1 \quad \text{where} \; (s \;\&\; 1) = m_i$$
During traceback, this shift is perfectly invertible from the known message bit — no storage needed:
$$s = ((s' \ll 1) \;|\; m_i) \;\&\; (2^h - 1)$$
```rust
// Traceback at a message-block boundary
let msg_bit = message[j / w] as usize;
s = ((s << 1) | msg_bit) & (num_states - 1);
```
The message itself provides the information needed to invert the shift. This means back pointers are only needed for the Viterbi steps (one per cover position), not for the shift steps.
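A small roundtrip sketch (illustrative, not Phasm's code) confirms the inversion: prune and shift forward, then re-append the message bit during traceback to recover the original state.

```rust
// Roundtrip sketch of the shift step, assuming h = 7 (128 trellis states).
fn shift_forward(s: usize, msg_bit: usize) -> Option<usize> {
    if s & 1 == msg_bit { Some(s >> 1) } else { None } // pruned otherwise
}

fn shift_invert(s_shifted: usize, msg_bit: usize, num_states: usize) -> usize {
    ((s_shifted << 1) | msg_bit) & (num_states - 1)
}

fn main() {
    let num_states = 128;
    for s in 0..num_states {
        let bit = s & 1; // the message bit this state survives
        let shifted = shift_forward(s, bit).unwrap();
        // Re-appending the message bit recovers the original state:
        assert_eq!(shift_invert(shifted, bit, num_states), s);
    }
    println!("shift roundtrip ok for all {num_states} states");
}
```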
4. The Remaining Problem
The 1-bit packing reduced the constant factor by 32×, and free shift steps avoid any overhead at block boundaries. But the memory still scales as $O(n)$:
$$\text{Memory}_{\text{1-bit}} = n \times 16 \;\text{bytes}$$
For a 100-megapixel image with 30 million usable positions, that is still 480 MB. The fundamental issue is that the standard Viterbi requires storing all back pointers before traceback can begin — because traceback runs backward through the entire sequence.
We need an approach where memory scales sub-linearly with image size.
5. Segmented Checkpoint Viterbi
The solution is a classic space-time tradeoff, closely related to Hirschberg’s (1975) technique for linear-space sequence alignment and the general checkpoint/recompute pattern from reversible computing.
The idea: instead of storing all back pointers in one pass, run the forward pass twice. The first pass saves lightweight checkpoints. The second pass recomputes back pointers one segment at a time, tracing back and freeing each segment before moving to the next.
Phase A: Forward Scan with Checkpoints
Run the complete Viterbi forward pass without storing any back pointers. At every $K$-th message-block boundary, save a checkpoint: a snapshot of the cost array ($2^h$ floats = 1 KB for $h = 7$).
$$K = \lceil \sqrt{m} \rceil$$
where $m$ is the number of message blocks. This choice of $K$ balances the two memory terms (we will derive why shortly). Total checkpoints: $\lceil m/K \rceil \approx \sqrt{m}$. Total checkpoint memory:
$$\text{Memory}_{\text{checkpoints}} = \frac{m}{K} \times 2^h \times 8 \;\text{bytes} \approx \sqrt{m} \;\text{KB}$$
For a maximum-capacity payload ($m \approx 2$ million message blocks at $h = 7$), that is about 1.4 MB.
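A hypothetical sketch of Phase A (names are illustrative, not Phasm's API): the forward recurrence itself is abstracted behind a closure, and only the checkpointing logic is shown.

```rust
// Phase A: run the full forward pass, snapshotting the cost array at
// every k-th message-block boundary.
fn phase_a(
    num_blocks: usize,
    num_states: usize,
    k: usize,
    mut step_block: impl FnMut(&mut [f32], usize), // one message block forward
) -> Vec<Vec<f32>> {
    let mut cost = vec![f32::INFINITY; num_states];
    cost[0] = 0.0; // trellis starts in state 0
    let mut checkpoints = Vec::new();
    for b in 0..num_blocks {
        if b % k == 0 {
            checkpoints.push(cost.clone()); // 2^h floats per checkpoint
        }
        step_block(&mut cost, b);
    }
    checkpoints // ceil(num_blocks / k) snapshots
}

fn main() {
    let m = 12;
    let k = (m as f64).sqrt().ceil() as usize; // K = ceil(sqrt(12)) = 4
    let cps = phase_a(m, 8, k, |cost, b| cost[b % 8] += 1.0);
    assert_eq!(cps.len(), 3); // checkpoints at blocks 0, 4, 8
    println!("{} checkpoints of {} states", cps.len(), cps[0].len());
}
```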
Phase B: Segment-by-Segment Recomputation
Process segments in reverse order (last segment first). For each segment:
- Restore the cost array from the segment’s checkpoint
- Re-run the Viterbi forward pass for this segment’s $K \times w$ cover positions, this time storing the 1-bit packed back pointers
- Traceback within the segment, using the entry state from the previous (later) segment’s traceback
- Free the segment’s back pointers
Segment layout ($m = 12$ message blocks, $K = 4$):

```text
Checkpoint 0          Checkpoint 1          Checkpoint 2          Terminal
      |                     |                     |                    |
      v                     v                     v                    v
[seg 0: blocks 0-3]  [seg 1: blocks 4-7]  [seg 2: blocks 8-11]
                                      <--- traceback order
```
Each segment stores at most $K \times w$ back pointers. Peak segment memory:
$$\text{Memory}_{\text{segment}} = K \times w \times 16 \;\text{bytes}$$
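The segment loop can be sketched as follows (a hypothetical skeleton, not Phasm's API, with the recompute and traceback steps abstracted behind closures):

```rust
// Phase B: walk segments last-to-first, recomputing each segment's packed
// back pointers from its checkpoint and dropping them before the next.
fn phase_b(
    checkpoints: &[Vec<f32>],
    mut recompute: impl FnMut(&[f32], usize) -> Vec<u128>,     // forward, storing bits
    mut traceback: impl FnMut(&[u128], usize, usize) -> usize, // returns entry state
    terminal_state: usize,
) -> usize {
    let mut entry = terminal_state;
    for seg in (0..checkpoints.len()).rev() {
        let bp = recompute(&checkpoints[seg], seg); // one segment's back pointers
        entry = traceback(&bp, seg, entry);         // trace back within the segment
        // `bp` is dropped here: peak memory is a single segment's rows
    }
    entry // should be 0, the trellis initial state
}

fn main() {
    // Dummy closures stand in for the real forward/traceback steps.
    let cps = vec![vec![0.0f32]; 3];
    let entry = phase_b(&cps, |_, _| vec![0u128; 4], |_, _, e| e.saturating_sub(1), 3);
    assert_eq!(entry, 0); // three segments walk the entry state 3 -> 2 -> 1 -> 0
    println!("final entry state: {entry}");
}
```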
Why $K = \lceil \sqrt{m} \rceil$?
Total memory is the sum of checkpoint memory and peak segment memory:
$$M(K) = \underbrace{\frac{m}{K} \times 2^h \times 8}_{\text{checkpoints}} + \underbrace{K \times w \times 16}_{\text{segment back pointers}}$$
The checkpoint term shrinks as $K$ grows, while the segment term grows with $K$. The minimum is at:
$$\frac{dM}{dK} = 0 \implies K^2 = \frac{m \times 2^h \times 8}{w \times 16} \implies K \propto \sqrt{m}$$
Since $2^h \times 8 / (w \times 16)$ is a small constant for typical parameters ($h = 7$, $w = 10$), $K = \lceil \sqrt{m} \rceil$ is the practical optimum.
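Plugging in illustrative parameters ($m = 1{,}000{,}000$ blocks, $h = 7$, $w = 10$; not tied to a specific image) shows both terms landing at the same $\sqrt{m}$ scale:

```rust
// Worked check of the memory model M(K) = (m/K)·2^h·8 + K·w·16.
fn memory_terms(m: u64, h: u32, w: u64) -> (u64, u64, u64) {
    let k = (m as f64).sqrt().ceil() as u64;               // K = ceil(sqrt(m))
    let checkpoints = ((m + k - 1) / k) * (1u64 << h) * 8; // cost-array snapshots
    let segment = k * w * 16;                              // one segment's packed rows
    (k, checkpoints, segment)
}

fn main() {
    let (k, cp, seg) = memory_terms(1_000_000, 7, 10);
    // K = 1000: ~1 MB of checkpoints plus ~160 KB of segment back pointers.
    println!("K = {k}: checkpoints {} KB + segment {} KB", cp / 1024, seg / 1024);
}
```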
Why Reverse Order?
Traceback runs backward through the trellis. The terminal state (minimum-cost state after the last forward step) is known. Tracing back through the last segment yields an entry state for the second-to-last segment, and so on. Processing segments in reverse provides the correct entry state for each segment naturally:
```rust
let mut entry_state = best_state; // terminal state from Phase A
for seg in (0..num_segments).rev() {
    // Restore checkpoint, recompute forward pass, traceback
    // ...
    entry_state = s; // state at the start of this segment
}
debug_assert_eq!(entry_state, 0); // must return to the initial state
```
The `entry_state == 0` assertion at the end is a powerful correctness check: if the backward path doesn't return to trellis state 0, something went wrong.
6. Memory in Practice
The segmented approach produces a flat memory profile regardless of image size:
| Image | $n$ (positions) | Inline (1-bit) | Segmented | Reduction |
|---|---|---|---|---|
| 12 MP phone | 2M | 32 MB | 3 MB | 10× |
| 32 MP DSLR | 5M | 80 MB | 3 MB | 27× |
| 48 MP camera | 10M | 160 MB | 3 MB | 53× |
| 100 MP medium format | 30M | 480 MB | 3 MB | 160× |
The ~3 MB figure is roughly constant because both the checkpoint memory ($\approx \sqrt{m}$ KB) and the peak segment memory ($\sqrt{m} \times w \times 16$ bytes) scale with $\sqrt{m}$, which grows slowly enough to stay in the low megabytes for practical payload sizes.
The Compute Tradeoff
The segmented path runs the forward pass approximately twice: once in Phase A (without back pointers) and once cumulatively across all Phase B segments. Is this acceptable?
In practice, the STC Viterbi is not the bottleneck. The J-UNIWARD cost computation — which requires computing Daubechies-8 wavelet transforms across three subbands for every DCT coefficient — dominates wall-clock time by 3–5×. Doubling the Viterbi time adds roughly 20–30% to total encode time, while reducing memory from “app gets killed” to “fits comfortably.”
On mobile, memory pressure is far more dangerous than CPU time. An allocation that exceeds the system’s memory budget doesn’t just slow things down — it terminates the app. The 2× compute tradeoff is not a compromise; it is the only viable path.
The Auto-Dispatcher
The implementation uses both paths, selecting automatically:
```rust
const SEGMENTED_THRESHOLD: usize = 1_000_000;

pub fn stc_embed(cover_bits, costs, message, hhat, h, w) -> Option<EmbedResult> {
    let n = cover_bits.len();
    if n > SEGMENTED_THRESHOLD {
        stc_embed_segmented(...) // O(√n) memory, 2× compute
    } else {
        stc_embed_inline(...)    // O(n) memory, fastest
    }
}
```
The threshold of 1 million positions (16 MB of back pointers) is conservative: below it, the inline path stays well within mobile memory budgets; above it, the segmented path takes over. Callers don't need to know which path was used; the output is identical either way.
7. Proving Bit-for-Bit Equivalence
Both paths must produce identical stego bit sequences — a steganographic encoder cannot afford nondeterminism, because the receiver’s decoder must extract the exact same message.
The segmented path preserves equivalence by construction:
- Phase A runs the identical forward recurrence as the inline path — same cost updates, same shift operations, same state filtering. It simply doesn’t store back pointers.
- Phase B re-runs the forward pass for each segment starting from the exact cost array that Phase A computed at that checkpoint. The back pointers produced are identical to what the inline path would have stored.
- Traceback within each segment uses the same bit-extraction logic.
This is verified by explicit equivalence tests that run both paths on the same input and assert identical output:
```rust
#[test]
fn inline_segmented_equivalence_large() {
    // 100K cover elements, K ≈ 100 → ~100 segments
    let inline = stc_embed_inline(&cover, &costs, &msg, &hhat, h, w).unwrap();
    let segmented = stc_embed_segmented(&cover, &costs, &msg, &hhat, h, w).unwrap();
    assert_eq!(inline.stego_bits, segmented.stego_bits);
    assert_eq!(inline.total_cost, segmented.total_cost);
}
```
The tests cover edge cases: single-segment inputs, maximum-length payloads, WET (unmodifiable) coefficient patterns, and varying H-hat matrices.
8. Why This Matters for Client-Side Steganography
Phasm processes images entirely on-device. No server ever sees your photo or your message. This is a non-negotiable design constraint — steganography that requires a server round-trip is steganography that requires trusting the server.
But client-side processing means living within the client’s memory constraints:
- iOS terminates background apps aggressively; foreground apps get ~1.5 GB on most devices
- Android varies wildly — budget phones may offer 500 MB per app
- WebAssembly linear memory defaults to 16 MB, grows in 64 KB pages, and has an architecture-dependent ceiling
The segmented Viterbi makes it possible to encode a message into a 48-megapixel photo — taken moments ago by the same phone — without reducing resolution, without downsampling, and without sending the image anywhere. SI-UNIWARD benefits directly: side-informed embedding on full-resolution camera photos gives the highest capacity, and that capacity is only useful if the encoder can actually handle the full image.
Combined with the 1-bit packing for smaller images, the memory profile of the entire Ghost encode pipeline is now bounded:
- J-UNIWARD cost map: ~187 MB (after 71% optimization)
- STC Viterbi: ~3 MB (segmented) or ~16–32 MB (inline for smaller images)
- JPEG codec: ~10–30 MB depending on image size
Total peak: well under 250 MB for any image up to 100 megapixels.
The Viterbi algorithm was invented in 1967 to decode convolutional codes for deep-space communication. Sixty years later, we are using the same dynamic programming structure to hide messages in photographs — on devices that would have been science fiction when Viterbi published his paper. The challenge is no longer the algorithm itself, but making it fit in the memory of the computer in your pocket.
One bit per state. Checkpoints every $\sqrt{m}$ blocks. Three megabytes for any image on Earth.
Frequently Asked Questions
Does the segmented path produce a different stego image?
No. Both paths produce bit-for-bit identical output. The segmented path recomputes the same forward pass from saved checkpoints — it simply doesn’t store all back pointers simultaneously. This is verified by equivalence tests on every build.
Why not just downsample large photos before embedding?
Downsampling reduces the number of embeddable coefficients, directly cutting capacity. For SI-UNIWARD, full resolution is especially valuable: the side-informed rounding errors that provide ~43% more capacity are computed from the original pixel data. Downsampling destroys that advantage.
Could this approach work for higher constraint heights (h=10, h=12)?
Yes. The segmented checkpoint strategy is agnostic to constraint height — it works on any Viterbi trellis. Higher $h$ means more states per checkpoint (4 KB for $h=10$, 32 KB for $h=12$) but the $O(\sqrt{n})$ scaling is unchanged. The 1-bit packing also generalizes: $h=10$ would use 1024 bits = 128 bytes per step (two u64s or a small array).
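A sketch of how the packing might generalize beyond a single `u128`, storing one bit per state across `u64` words (illustrative, not Phasm's code):

```rust
// One packed back-pointer row for an arbitrary state count:
// h = 10 → 1024 states → 16 u64 words = 128 bytes per step.
struct PackedRow(Vec<u64>);

impl PackedRow {
    fn new(num_states: usize) -> Self {
        PackedRow(vec![0u64; (num_states + 63) / 64])
    }
    fn set(&mut self, s: usize) {
        self.0[s / 64] |= 1u64 << (s % 64); // record "stego bit = 1" choice
    }
    fn get(&self, s: usize) -> u8 {
        ((self.0[s / 64] >> (s % 64)) & 1) as u8
    }
}

fn main() {
    let num_states = 1 << 10; // h = 10
    let mut row = PackedRow::new(num_states);
    assert_eq!(row.0.len(), 16); // 16 u64 words = 128 bytes per step
    row.set(777);
    assert_eq!(row.get(777), 1);
    assert_eq!(row.get(778), 0);
    println!("packed row: {} bytes per step", row.0.len() * 8);
}
```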
What about GPU or SIMD acceleration for the Viterbi inner loop?
The 128-state inner loop is already branchless and data-parallel — each state’s cost update is independent. SIMD (e.g., ARM NEON on mobile) could process 4 or 8 states per instruction. GPU acceleration is theoretically possible but the sequential dependency between steps (each step reads the previous step’s costs) limits parallelism to within a step. These optimizations are orthogonal to the memory improvements described here and could be layered on top.
References
- Filler, T., Judas, J., & Fridrich, J. (2011). Minimizing additive distortion in steganography using syndrome-trellis codes. IEEE Transactions on Information Forensics and Security, 6(3), 920–935. doi:10.1109/TIFS.2011.2134094
- Viterbi, A. (1967). Error bounds for convolutional codes and an asymptotically optimum decoding algorithm. IEEE Transactions on Information Theory, 13(2), 260–269. doi:10.1109/TIT.1967.1054010
- Hirschberg, D. S. (1975). A linear space algorithm for computing maximal common subsequences. Communications of the ACM, 18(6), 341–343. doi:10.1145/360825.360861
- Holub, V., Fridrich, J., & Denemark, T. (2014). Universal distortion function for steganography in an arbitrary domain. EURASIP Journal on Information Security, 2014(1). doi:10.1186/1687-417X-2014-1
- Filler, T. & Fridrich, J. (2010). Gibbs construction in steganography. IEEE Transactions on Information Forensics and Security, 5(4), 705–720. doi:10.1109/TIFS.2010.2077629