The Syndrome-Trellis Code (STC) at the heart of Phasm’s Ghost mode is elegant: a Viterbi algorithm walks a 128-state trellis across every embeddable DCT coefficient in the cover image, finding the minimum-cost modification sequence that encodes your message. The math is beautiful. The memory bill is not.
A 48-megapixel camera photo has roughly 10 million usable DCT coefficients. The textbook Viterbi stores one predecessor state (a u32) per trellis state per cover position:
$$\text{Memory}_{\text{naive}} = n \times 2^h \times 4 \;\text{bytes}$$
For $n = 10\text{M}$ and $h = 7$ (128 states), that is 5.1 GB. An iPhone gives your app about 1.5 GB before it gets killed. A WebAssembly instance starts with 16 MB.
We needed to make STC practical on the devices people actually use — without changing the embedding output by a single bit.
1. The Standard Viterbi Back-Pointer Problem
In any Viterbi algorithm, the forward pass finds optimal costs; the backward pass (traceback) reconstructs the optimal path. The traceback needs to know, at every step $j$ and every trellis state $s$, which predecessor state led to the minimum cost. The standard approach stores a full predecessor index per state per step.
For STC with constraint height $h = 7$:
| Image size | Cover positions $n$ | Naive back-pointer memory |
|---|---|---|
| 12 MP phone | 2M | 1.0 GB |
| 32 MP DSLR | 5M | 2.6 GB |
| 48 MP camera | 10M | 5.1 GB |
| 100 MP medium format | 30M | 15.4 GB |
Even moderate phone photos exceed what a mobile app can allocate. The J-UNIWARD cost computation and JPEG codec have already been optimized to ~200 MB — the Viterbi back pointers are the bottleneck.
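These figures follow directly from the naive formula; a throwaway check (not Phasm's code) reproduces the table:

```rust
// Naive Viterbi back-pointer memory: n × 2^h × 4 bytes
// (one u32 per trellis state per cover position).
fn naive_bytes(n: u64, h: u32) -> u64 {
    n * (1u64 << h) * 4
}

fn main() {
    // 48 MP camera: ~10M usable coefficients, h = 7 (128 states)
    let gb = naive_bytes(10_000_000, 7) as f64 / 1e9;
    println!("{gb:.2} GB"); // prints "5.12 GB"
}
```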
2. Insight 1: One Bit Is Enough
The key observation is structural. In STC’s trellis, each target state $s$ at step $j$ has exactly two candidate predecessors, determined by the H-hat column vector $\mathbf{c}_j$:
$$\text{pred}_0(s) = s \quad (\text{stego bit} = 0)$$ $$\text{pred}_1(s) = s \oplus \mathbf{c}_j \quad (\text{stego bit} = 1)$$
where $\oplus$ is bitwise XOR. The forward pass picks whichever predecessor gave lower cost:
$$\text{cost}_0 = \text{prev}[s] + \rho_0(j)$$ $$\text{cost}_1 = \text{prev}[s \oplus \mathbf{c}_j] + \rho_1(j)$$
Since there are exactly two candidates, the choice is a single bit: did we pick stego_bit = 0 or stego_bit = 1? We do not need to store the predecessor state index — it is fully determined by the chosen bit and the known column vector.
For 128 trellis states, that is 128 bits = one u128 = 16 bytes per step:
```rust
let mut packed_bp = 0u128;
for s in 0..num_states {
    let cost_0 = prev_cost[s] + rho_0;       // stego bit = 0: predecessor is s
    let cost_1 = prev_cost[s ^ col] + rho_1; // stego bit = 1: predecessor is s ^ col
    if cost_1 < cost_0 {
        curr_cost[s] = cost_1;
        packed_bp |= 1u128 << s;             // 1 bit records the choice
    } else {
        curr_cost[s] = cost_0;
    }
}
```
During traceback, we reconstruct the predecessor from the stored bit:
```rust
let bit = ((back_ptr[j] >> s) & 1) as u8;
stego_bits[j] = bit;
if bit == 1 {
    s ^= col; // undo the XOR to recover the predecessor state
}
```
This is a 32× memory reduction: from 512 bytes per step (a `Vec<u32>` of 128 entries) to 16 bytes per step (one `u128`).
| Image size | $n$ | Naive | 1-bit packed |
|---|---|---|---|
| 12 MP | 2M | 1.0 GB | 32 MB |
| 48 MP | 10M | 5.1 GB | 160 MB |
| 100 MP | 30M | 15.4 GB | 480 MB |
Better — 12 MP photos now fit on a phone. But 48 MP still doesn’t.
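To make the packing concrete, here is a minimal, self-contained sketch of one forward step and the matching predecessor recovery, using eight states ($h = 3$) instead of 128 and hypothetical costs (not Phasm's code):

```rust
// One forward step with 1-bit packed back pointers over 8 states (h = 3).
fn forward_step(prev_cost: &[f32], col: usize, rho_0: f32, rho_1: f32) -> (Vec<f32>, u128) {
    let mut curr = vec![0.0f32; prev_cost.len()];
    let mut packed = 0u128;
    for s in 0..prev_cost.len() {
        let cost_0 = prev_cost[s] + rho_0;       // stego bit 0: predecessor is s
        let cost_1 = prev_cost[s ^ col] + rho_1; // stego bit 1: predecessor is s ^ col
        if cost_1 < cost_0 {
            curr[s] = cost_1;
            packed |= 1u128 << s;                // one bit records the choice
        } else {
            curr[s] = cost_0;
        }
    }
    (curr, packed)
}

/// Recover the predecessor state from the stored bit during traceback.
fn predecessor(packed: u128, s: usize, col: usize) -> usize {
    if (packed >> s) & 1 == 1 { s ^ col } else { s }
}

fn main() {
    let prev = vec![0.0f32, 5.0, 1.0, 9.0, 2.0, 7.0, 3.0, 4.0];
    let col = 0b101; // hypothetical H-hat column vector
    let (curr, packed) = forward_step(&prev, col, 1.0, 0.5);
    // State 1's cheaper predecessor is 1 ^ 5 = 4 (stego bit = 1):
    assert_eq!(predecessor(packed, 1, col), 4);
    assert!((curr[1] - (prev[4] + 0.5)).abs() < 1e-6);
    println!("packed row: {:#010b}", packed as u8);
}
```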
3. Insight 2: Shift Steps Are Free
Every $w$ cover positions (where $w = \lceil n/m \rceil$), the STC trellis crosses a message-block boundary. States whose least-significant bit doesn't match the required message bit are pruned, and the surviving states are shifted right:
$$s' = s \gg 1 \quad \text{where} \; (s \;\&\; 1) = m_i$$
During traceback, this shift is perfectly invertible from the known message bit — no storage needed:
$$s = ((s' \ll 1) \;|\; m_i) \;\&\; (2^h - 1)$$
```rust
// Traceback at a message-block boundary
let msg_bit = message[j / w] as usize;
s = ((s << 1) | msg_bit) & (num_states - 1);
```
The message itself provides the information needed to invert the shift. This means back pointers are only needed for the Viterbi steps (one per cover position), not for the shift steps.
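A small roundtrip sketch (illustrative, not Phasm's code) confirms the inversion: prune and shift forward, then re-append the message bit during traceback to recover the original state.

```rust
// Roundtrip sketch of the shift step, assuming h = 7 (128 trellis states).
fn shift_forward(s: usize, msg_bit: usize) -> Option<usize> {
    if s & 1 == msg_bit { Some(s >> 1) } else { None } // pruned otherwise
}

fn shift_invert(s_shifted: usize, msg_bit: usize, num_states: usize) -> usize {
    ((s_shifted << 1) | msg_bit) & (num_states - 1)
}

fn main() {
    let num_states = 128;
    for s in 0..num_states {
        let bit = s & 1; // the message bit this state survives
        let shifted = shift_forward(s, bit).unwrap();
        // Re-appending the message bit recovers the original state:
        assert_eq!(shift_invert(shifted, bit, num_states), s);
    }
    println!("shift roundtrip ok for all {num_states} states");
}
```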
4. The Remaining Problem
The 1-bit packing reduced the constant factor by 32×, and free shift steps avoid any overhead at block boundaries. But the memory still scales as $O(n)$:
$$\text{Memory}_{\text{1-bit}} = n \times 16 \;\text{bytes}$$
For a 100-megapixel image with 30 million usable positions, that is still 480 MB. The fundamental issue is that the standard Viterbi requires storing all back pointers before traceback can begin — because traceback runs backward through the entire sequence.
We need an approach where memory scales sub-linearly with image size.
5. Segmented Checkpoint Viterbi
The solution is a classic space-time tradeoff, closely related to Hirschberg’s (1975) technique for linear-space sequence alignment and the general checkpoint/recompute pattern from reversible computing.
The idea: instead of storing all back pointers in one pass, run the forward pass twice. The first pass saves lightweight checkpoints. The second pass recomputes back pointers one segment at a time, tracing back and freeing each segment before moving to the next.
Phase A: Forward Scan with Checkpoints
Run the complete Viterbi forward pass without storing any back pointers. At every $K$-th message-block boundary, save a checkpoint: a snapshot of the cost array ($2^h$ floats = 1 KB for $h = 7$).
$$K = \lceil \sqrt{m} \rceil$$
where $m$ is the number of message blocks. This choice of $K$ balances the two memory terms (we will derive why shortly). Total checkpoints: $\lceil m/K \rceil \approx \sqrt{m}$. Total checkpoint memory:
$$\text{Memory}_{\text{checkpoints}} = \frac{m}{K} \times 2^h \times 8 \;\text{bytes} \approx \sqrt{m} \;\text{KB}$$
For a maximum-capacity payload ($m \approx 2$ million message blocks at $h = 7$), that is about 1.4 MB.
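A hypothetical sketch of Phase A (names are illustrative, not Phasm's API): the forward recurrence itself is abstracted behind a closure, and only the checkpointing logic is shown.

```rust
// Phase A: run the full forward pass, snapshotting the cost array at
// every k-th message-block boundary.
fn phase_a(
    num_blocks: usize,
    num_states: usize,
    k: usize,
    mut step_block: impl FnMut(&mut [f32], usize), // one message block forward
) -> Vec<Vec<f32>> {
    let mut cost = vec![f32::INFINITY; num_states];
    cost[0] = 0.0; // trellis starts in state 0
    let mut checkpoints = Vec::new();
    for b in 0..num_blocks {
        if b % k == 0 {
            checkpoints.push(cost.clone()); // 2^h floats per checkpoint
        }
        step_block(&mut cost, b);
    }
    checkpoints // ceil(num_blocks / k) snapshots
}

fn main() {
    let m = 12;
    let k = (m as f64).sqrt().ceil() as usize; // K = ceil(sqrt(12)) = 4
    let cps = phase_a(m, 8, k, |cost, b| cost[b % 8] += 1.0);
    assert_eq!(cps.len(), 3); // checkpoints at blocks 0, 4, 8
    println!("{} checkpoints of {} states", cps.len(), cps[0].len());
}
```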
Phase B: Segment-by-Segment Recomputation
Process segments in reverse order (last segment first). For each segment:
- Restore the cost array from the segment’s checkpoint
- Re-run the Viterbi forward pass for this segment’s $K \times w$ cover positions, this time storing the 1-bit packed back pointers
- Traceback within the segment, using the entry state from the previous (later) segment’s traceback
- Free the segment’s back pointers
Segment layout ($m = 12$ message blocks, $K = 4$):

```text
Checkpoint 0          Checkpoint 1          Checkpoint 2          Terminal
      |                     |                     |                    |
      v                     v                     v                    v
[seg 0: blocks 0-3]  [seg 1: blocks 4-7]  [seg 2: blocks 8-11]
                                      <--- traceback order
```
Each segment stores at most $K \times w$ back pointers. Peak segment memory:
$$\text{Memory}_{\text{segment}} = K \times w \times 16 \;\text{bytes}$$
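The segment loop can be sketched as follows (a hypothetical skeleton, not Phasm's API, with the recompute and traceback steps abstracted behind closures):

```rust
// Phase B: walk segments last-to-first, recomputing each segment's packed
// back pointers from its checkpoint and dropping them before the next.
fn phase_b(
    checkpoints: &[Vec<f32>],
    mut recompute: impl FnMut(&[f32], usize) -> Vec<u128>,     // forward, storing bits
    mut traceback: impl FnMut(&[u128], usize, usize) -> usize, // returns entry state
    terminal_state: usize,
) -> usize {
    let mut entry = terminal_state;
    for seg in (0..checkpoints.len()).rev() {
        let bp = recompute(&checkpoints[seg], seg); // one segment's back pointers
        entry = traceback(&bp, seg, entry);         // trace back within the segment
        // `bp` is dropped here: peak memory is a single segment's rows
    }
    entry // should be 0, the trellis initial state
}

fn main() {
    // Dummy closures stand in for the real forward/traceback steps.
    let cps = vec![vec![0.0f32]; 3];
    let entry = phase_b(&cps, |_, _| vec![0u128; 4], |_, _, e| e.saturating_sub(1), 3);
    assert_eq!(entry, 0); // three segments walk the entry state 3 -> 2 -> 1 -> 0
    println!("final entry state: {entry}");
}
```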
Why $K = \lceil \sqrt{m} \rceil$?
Total memory is the sum of checkpoint memory and peak segment memory:
$$M(K) = \underbrace{\frac{m}{K} \times 2^h \times 8}_{\text{checkpoints}} + \underbrace{K \times w \times 16}_{\text{segment back pointers}}$$
The checkpoint term shrinks as $K$ grows, while the segment term grows with $K$. The minimum is at:
$$\frac{dM}{dK} = 0 \implies K^2 = \frac{m \times 2^h \times 8}{w \times 16} \implies K \propto \sqrt{m}$$
Since $2^h \times 8 / (w \times 16)$ is a small constant for typical parameters ($h = 7$, $w = 10$), $K = \lceil \sqrt{m} \rceil$ is the practical optimum.
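Plugging in illustrative parameters ($m = 1{,}000{,}000$ blocks, $h = 7$, $w = 10$; not tied to a specific image) shows both terms landing at the same $\sqrt{m}$ scale:

```rust
// Worked check of the memory model M(K) = (m/K)·2^h·8 + K·w·16.
fn memory_terms(m: u64, h: u32, w: u64) -> (u64, u64, u64) {
    let k = (m as f64).sqrt().ceil() as u64;               // K = ceil(sqrt(m))
    let checkpoints = ((m + k - 1) / k) * (1u64 << h) * 8; // cost-array snapshots
    let segment = k * w * 16;                              // one segment's packed rows
    (k, checkpoints, segment)
}

fn main() {
    let (k, cp, seg) = memory_terms(1_000_000, 7, 10);
    // K = 1000: ~1 MB of checkpoints plus ~160 KB of segment back pointers.
    println!("K = {k}: checkpoints {} KB + segment {} KB", cp / 1024, seg / 1024);
}
```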
Why Reverse Order?
Traceback runs backward through the trellis. The terminal state (minimum-cost state after the last forward step) is known. Tracing back through the last segment yields an entry state for the second-to-last segment, and so on. Processing segments in reverse provides the correct entry state for each segment naturally:
```rust
let mut entry_state = best_state; // terminal state from Phase A
for seg in (0..num_segments).rev() {
    // Restore checkpoint, recompute forward pass, traceback
    // ...
    entry_state = s; // state at the start of this segment
}
debug_assert_eq!(entry_state, 0); // must return to the initial state
```
The `entry_state == 0` assertion at the end is a powerful correctness check: if the backward path doesn't return to trellis state 0, something went wrong.
6. Memory in Practice
The segmented approach produces a flat memory profile regardless of image size:
| Image | $n$ (positions) | Inline (1-bit) | Segmented | Reduction |
|---|---|---|---|---|
| 12 MP phone | 2M | 32 MB | 3 MB | 10× |
| 32 MP DSLR | 5M | 80 MB | 3 MB | 27× |
| 48 MP camera | 10M | 160 MB | 3 MB | 53× |
| 100 MP medium format | 30M | 480 MB | 3 MB | 160× |
The ~3 MB figure is roughly constant because both the checkpoint memory ($\approx \sqrt{m}$ KB) and the peak segment memory ($\sqrt{m} \times w \times 16$ bytes) scale with $\sqrt{m}$, which grows slowly enough to stay in the low megabytes for practical payload sizes.
The Compute Tradeoff
The segmented path runs the forward pass approximately twice: once in Phase A (without back pointers) and once cumulatively across all Phase B segments. Is this acceptable?
In practice, the STC Viterbi is not the bottleneck. The J-UNIWARD cost computation — which requires computing Daubechies-8 wavelet transforms across three subbands for every DCT coefficient — dominates wall-clock time by 3–5×. Doubling the Viterbi time adds roughly 20–30% to total encode time, while reducing memory from “app gets killed” to “fits comfortably.”
On mobile, memory pressure is far more dangerous than CPU time. An allocation that exceeds the system’s memory budget doesn’t just slow things down — it terminates the app. The 2× compute tradeoff is not a compromise; it is the only viable path.
The Auto-Dispatcher
The implementation uses both paths, selecting automatically:
```rust
const SEGMENTED_THRESHOLD: usize = 1_000_000;

pub fn stc_embed(cover_bits, costs, message, hhat, h, w) -> Option<EmbedResult> {
    let n = cover_bits.len();
    if n > SEGMENTED_THRESHOLD {
        stc_embed_segmented(...) // O(√n) memory, 2× compute
    } else {
        stc_embed_inline(...)    // O(n) memory, fastest
    }
}
```
The threshold of 1 million positions (16 MB of back pointers) is conservative: below it, the inline path stays well within mobile memory budgets; above it, the segmented path takes over. Callers don't need to know which path was used; the output is identical either way.
7. Proving Bit-for-Bit Equivalence
Both paths must produce identical stego bit sequences — a steganographic encoder cannot afford nondeterminism, because the receiver’s decoder must extract the exact same message.
The segmented path preserves equivalence by construction:
- Phase A runs the identical forward recurrence as the inline path — same cost updates, same shift operations, same state filtering. It simply doesn’t store back pointers.
- Phase B re-runs the forward pass for each segment starting from the exact cost array that Phase A computed at that checkpoint. The back pointers produced are identical to what the inline path would have stored.
- Traceback within each segment uses the same bit-extraction logic.
This is verified by explicit equivalence tests that run both paths on the same input and assert identical output:
```rust
#[test]
fn inline_segmented_equivalence_large() {
    // 100K cover elements, K ≈ 100 → ~100 segments
    let inline = stc_embed_inline(&cover, &costs, &msg, &hhat, h, w).unwrap();
    let segmented = stc_embed_segmented(&cover, &costs, &msg, &hhat, h, w).unwrap();
    assert_eq!(inline.stego_bits, segmented.stego_bits);
    assert_eq!(inline.total_cost, segmented.total_cost);
}
```
The tests cover edge cases: single-segment inputs, maximum-length payloads, WET (unmodifiable) coefficient patterns, and varying H-hat matrices.
8. Why This Matters for Client-Side Steganography
Phasm processes images entirely on-device. No server ever sees your photo or your message. This is a non-negotiable design constraint — steganography that requires a server round-trip is steganography that requires trusting the server.
But client-side processing means living within the client’s memory constraints:
- iOS terminates background apps aggressively; foreground apps get ~1.5 GB on most devices
- Android varies wildly — budget phones may offer 500 MB per app
- WebAssembly linear memory defaults to 16 MB, grows in 64 KB pages, and has an architecture-dependent ceiling
The segmented Viterbi makes it possible to encode a message into a 48-megapixel photo — taken moments ago by the same phone — without reducing resolution, without downsampling, and without sending the image anywhere. SI-UNIWARD benefits directly: side-informed embedding on full-resolution camera photos gives the highest capacity, and that capacity is only useful if the encoder can actually handle the full image.
Combined with the 1-bit packing for smaller images, the memory profile of the entire Ghost encode pipeline is now bounded:
- J-UNIWARD cost map: ~187 MB (after 71% optimization)
- STC Viterbi: ~3 MB (segmented) or ~16–32 MB (inline for smaller images)
- JPEG codec: ~10–30 MB depending on image size
Total peak: well under 250 MB for any image up to 100 megapixels.
The Viterbi algorithm was invented in 1967 to decode convolutional codes for deep-space communication. Sixty years later, we are using the same dynamic programming structure to hide messages in photographs — on devices that would have been science fiction when Viterbi published his paper. The challenge is no longer the algorithm itself, but making it fit in the memory of the computer in your pocket.
One bit per state. Checkpoints every $\sqrt{m}$ blocks. Three megabytes for any image on Earth.
Frequently Asked Questions
Does the segmented path produce a different stego image?
No. Both paths produce bit-for-bit identical output. The segmented path recomputes the same forward pass from saved checkpoints — it simply doesn’t store all back pointers simultaneously. This is verified by equivalence tests on every build.
Why not just downsample large photos before embedding?
Downsampling reduces the number of embeddable coefficients, directly cutting capacity. For SI-UNIWARD, full resolution is especially valuable: the side-informed rounding errors that provide ~43% more capacity are computed from the original pixel data. Downsampling destroys that advantage.
Could this approach work for higher constraint heights (h=10, h=12)?
Yes. The segmented checkpoint strategy is agnostic to constraint height — it works on any Viterbi trellis. Higher $h$ means more states per checkpoint (4 KB for $h=10$, 32 KB for $h=12$) but the $O(\sqrt{n})$ scaling is unchanged. The 1-bit packing also generalizes: $h=10$ would use 1024 bits = 128 bytes per step (two u64s or a small array).
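A sketch of how the packing might generalize beyond a single `u128`, storing one bit per state across `u64` words (illustrative, not Phasm's code):

```rust
// One packed back-pointer row for an arbitrary state count:
// h = 10 → 1024 states → 16 u64 words = 128 bytes per step.
struct PackedRow(Vec<u64>);

impl PackedRow {
    fn new(num_states: usize) -> Self {
        PackedRow(vec![0u64; (num_states + 63) / 64])
    }
    fn set(&mut self, s: usize) {
        self.0[s / 64] |= 1u64 << (s % 64); // record "stego bit = 1" choice
    }
    fn get(&self, s: usize) -> u8 {
        ((self.0[s / 64] >> (s % 64)) & 1) as u8
    }
}

fn main() {
    let num_states = 1 << 10; // h = 10
    let mut row = PackedRow::new(num_states);
    assert_eq!(row.0.len(), 16); // 16 u64 words = 128 bytes per step
    row.set(777);
    assert_eq!(row.get(777), 1);
    assert_eq!(row.get(778), 0);
    println!("packed row: {} bytes per step", row.0.len() * 8);
}
```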
What about GPU or SIMD acceleration for the Viterbi inner loop?
The 128-state inner loop is already branchless and data-parallel — each state’s cost update is independent. SIMD (e.g., ARM NEON on mobile) could process 4 or 8 states per instruction. GPU acceleration is theoretically possible but the sequential dependency between steps (each step reads the previous step’s costs) limits parallelism to within a step. These optimizations are orthogonal to the memory improvements described here and could be layered on top.
References
- Filler, T., Judas, J., & Fridrich, J. (2011). Minimizing additive distortion in steganography using syndrome-trellis codes. IEEE Transactions on Information Forensics and Security, 6(3), 920–935. doi:10.1109/TIFS.2011.2134094
- Viterbi, A. (1967). Error bounds for convolutional codes and an asymptotically optimum decoding algorithm. IEEE Transactions on Information Theory, 13(2), 260–269. doi:10.1109/TIT.1967.1054010
- Hirschberg, D. S. (1975). A linear space algorithm for computing maximal common subsequences. Communications of the ACM, 18(6), 341–343. doi:10.1145/360825.360861
- Holub, V., Fridrich, J., & Denemark, T. (2014). Universal distortion function for steganography in an arbitrary domain. EURASIP Journal on Information Security, 2014(1). doi:10.1186/1687-417X-2014-1
- Filler, T. & Fridrich, J. (2010). Gibbs construction in steganography. IEEE Transactions on Information Forensics and Security, 5(4), 705–720. doi:10.1109/TIFS.2010.2077629