The roof was reflecting light like a painted wall.
The input photo was fine. The segmentation was fine. The diffusion pipeline ran without errors. And yet the output didn’t pass the most basic human test: “does this look like the same material if I move the sun?” The wall’s reflectance characteristics were bleeding across boundaries, and the roof was paying the price.
Single-image PBR decomposition isn't actually a single-image problem—it's a per-region one.
The key insight: decomposition isn’t global—materials are local
RGB‑X (a diffusion-based decomposition pipeline) can produce the maps I need—albedo, normal, roughness, irradiance—from a single photo. The research problem is in good shape.
The engineering problem is the one that actually matters in a construction visualization workflow: a phone photo is a collage of different materials, and a global pass happily blends them.
If you decompose the full frame, you’re asking the model to explain everything at once. That’s how you get:
- “wall roughness” influencing “roof roughness”
- lighting estimates that smear across high-contrast boundaries
- normals that look plausible in isolation but don’t respect region edges
So I made a hard rule: I only decompose a region in isolation.
That means the pipeline becomes:
- segment the photo into architectural regions (walls, roof, trim)
- crop each region with padding
- resize to satisfy the U‑Net constraint (multiples of 8)
- run RGB‑X once per channel, per region
- return per-region PBR maps to drive interactive rendering
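The shape of that loop can be sketched end to end. This is a simplified sketch, not the production code: `decompose_photo`, `segment_fn`, and `decompose_fn` are placeholder names standing in for the SAM3 segmentation and per-channel RGB‑X calls.

```python
# Sketch of the per-region pipeline. `segment_fn` yields region dicts with a
# "label" and a SAM3-style "bbox" [x, y, w, h]; `decompose_fn` stands in for
# one RGB-X inference pass on a padded crop, for one channel.
def decompose_photo(photo_size, segment_fn, decompose_fn,
                    channels=("albedo", "normal", "roughness"), pad_frac=0.1):
    img_w, img_h = photo_size
    regions_out = []
    for region in segment_fn():
        bx, by, bw, bh = region["bbox"]
        # Pad by a fraction of the bbox, clamped to the image bounds.
        px, py = int(bw * pad_frac), int(bh * pad_frac)
        x0, y0 = max(0, bx - px), max(0, by - py)
        x1, y1 = min(img_w, bx + bw + px), min(img_h, by + bh + py)
        # One inference pass per channel, on the cropped region only.
        maps = {ch: decompose_fn((x0, y0, x1, y1), ch) for ch in channels}
        regions_out.append({"label": region["label"], "maps": maps})
    return regions_out
```

The whole-frame model never sees pixels outside the padded crop, which is the entire point.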
My one analogy for this: doing RGB‑X on a whole house photo is like trying to match paint by sampling the entire room—walls, ceiling, couch, and floor—then wondering why the “average color” looks wrong on the trim.
How it works under the hood
The pipeline has two “loops” that matter:
- a region loop (per SAM3 mask)
- a channel loop (per PBR map)
And there are two unglamorous constraints that decide whether the whole thing feels production-grade:
- padding to survive mask edge artifacts
- resizing to keep the diffusion U‑Net happy
Here’s the full data flow at the level that actually matters.
Channel-specific prompts: I don’t let the model guess what I want
RGB‑X is driven by a prompt per channel. I keep the mapping explicit and boring—because the worst failure mode is “it returned something that looks like a map.”
```python
CHANNEL_PROMPTS = {
    "albedo": "Albedo (diffuse basecolor)",
    "normal": "Normal",
    "roughness": "Roughness",
    "irradiance": "Irradiance (estimate lighting)",
}
```
The thing I like about this is that it’s not clever. It’s declarative. When a map looks off, I can reason about exactly which pass produced it.
One inference pass per channel (and a weird nested-list unwrap)
Each channel is a separate diffusion inference call. That’s expensive, but it buys controllability and clarity.
The non-obvious footgun I had to handle: RGB‑X sometimes wraps the image output in an extra list, so I unwrap it before encoding.
```python
def _run_rgbx_channel(photo, channel, steps, h, w):
    prompt = CHANNEL_PROMPTS.get(channel)
    result = rgbx_pipe(prompt=prompt, photo=photo,
                       num_inference_steps=steps, height=h, width=w)
    img = result.images[0]
    if isinstance(img, list) and len(img) > 0:
        img = img[0]  # Unwrap nested list (RGB-X wraps each channel in an extra list)
    if isinstance(img, np.ndarray):
        if img.max() <= 1.0:  # float map in [0, 1] -> scale to uint8
            img = (img * 255).astype(np.uint8)
        return Image.fromarray(img.astype(np.uint8))
    return img  # already a PIL image
```
That unwrap looks tiny, but it’s the kind of “one weird trick” that prevents silent corruption—because without it, you can end up encoding the wrong object shape and not notice until the viewer renders garbage.
Per-region decomposition: the crop-and-decompose loop
The region pipeline starts by decoding the photo and preparing an output array.
```python
photo = _decode_image(request.image_base64)
img_w, img_h = photo.size
regions_out = []
```
The reason I keep img_w, img_h around is simple: every bounding-box operation needs to clamp to the original image bounds. Region pipelines that don’t clamp eventually crash on edge cases—or worse, “work” by producing misaligned crops.
Iterating masks and normalizing bbox formats
SAM3 bbox data can arrive in more than one format, so I normalize both forms before doing anything else.
```python
with torch.no_grad():
    for i, mask_info in enumerate(request.masks):
        bbox = mask_info.get("bbox", {})
        label = mask_info.get("label", f"region_{i}")
        # Handle both [x, y, w, h] (SAM3) and {x, y, width, height} formats
        if isinstance(bbox, (list, tuple)) and len(bbox) >= 4:
            bx, by, bw, bh = int(bbox[0]), int(bbox[1]), int(bbox[2]), int(bbox[3])
        elif isinstance(bbox, dict):
            bx = int(bbox.get("x", 0))
            by = int(bbox.get("y", 0))
            bw = int(bbox.get("width", img_w))
            bh = int(bbox.get("height", img_h))
```
What surprised me here: bbox format inconsistencies are the kind of “integration tax” that never shows up in model demos. If you don’t normalize early, you end up debugging downstream artifacts that are actually just coordinate parsing bugs.
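One way to keep that tax contained is to factor the normalization into a single helper. This is a hypothetical refactor (`normalize_bbox` is my name, not the pipeline's), with a full-frame fallback added for bboxes that match neither format:

```python
def normalize_bbox(bbox, img_w, img_h):
    """Accept SAM3-style [x, y, w, h] lists or {x, y, width, height} dicts."""
    if isinstance(bbox, (list, tuple)) and len(bbox) >= 4:
        return int(bbox[0]), int(bbox[1]), int(bbox[2]), int(bbox[3])
    if isinstance(bbox, dict):
        return (int(bbox.get("x", 0)), int(bbox.get("y", 0)),
                int(bbox.get("width", img_w)), int(bbox.get("height", img_h)))
    return 0, 0, img_w, img_h  # unknown format: fall back to the full frame
```

Doing this in one place means a coordinate bug has exactly one home.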
The 10% padding rule (because masks are never perfect)
Even with good segmentation, edges are messy: anti-aliasing, partial coverage, and thin structures all conspire to produce boundary artifacts.
So I pad the crop by 10% of the bbox size.
```python
# 10% padding, clamped to the original image bounds
pad_x, pad_y = int(bw * 0.1), int(bh * 0.1)
x0, y0 = max(0, bx - pad_x), max(0, by - pad_y)
x1, y1 = min(img_w, bx + bw + pad_x), min(img_h, by + bh + pad_y)
```
Padding is one of those decisions that feels “too simple to matter” until you see the before/after: the model gets a little context around the boundary, and the maps stop looking like they were cut out with dull scissors.
Cropping + resizing for inference
After padding and clamping, I crop the region and resize it for inference.
```python
crop = photo.crop((x0, y0, x1, y1))
crop_resized, _ = _resize_for_inference(crop, request.target_size)
cw, ch = crop_resized.size
```
The important detail is not the resize itself—it’s the constraint it enforces.
The U‑Net constraint: dimensions must be multiples of 8
Diffusion pipelines that use a U‑Net architecture tend to have stride/downsample constraints. In this pipeline, I enforce “multiple of 8” sizing.
```python
def _resize_for_inference(img, target_size):
    w, h = img.size
    scale = target_size / max(w, h) if max(w, h) > target_size else 1.0
    new_w = max(8, (int(w * scale) // 8) * 8)
    new_h = max(8, (int(h * scale) // 8) * 8)
    return img.resize((new_w, new_h), Image.LANCZOS), scale
```
Two things are doing real work here:
- I scale down only if the region is larger than `target_size`.
- I quantize dimensions to multiples of 8, with a floor of 8.
I’ve learned to treat this as a hard requirement, not a suggestion. When you violate it, you don’t always get a clean error; sometimes you get subtly wrong outputs or shape mismatches deep in the stack.
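The quantization is easy to sanity-check in isolation. Here is the same sizing math restated without PIL (`inference_size` is an illustrative name, assuming the formula above):

```python
def inference_size(w, h, target_size):
    """Restates _resize_for_inference's sizing math for a (w, h) image."""
    scale = target_size / max(w, h) if max(w, h) > target_size else 1.0
    # Quantize to multiples of 8 with a floor of 8, as the U-Net requires.
    new_w = max(8, (int(w * scale) // 8) * 8)
    new_h = max(8, (int(h * scale) // 8) * 8)
    return new_w, new_h
```

Note that a 300×200 region stays unscaled but still gets snapped down to 296×200, and a degenerate 5×3 sliver is floored to 8×8 rather than crashing.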
Running decomposition per region, per channel
Once I have a clean crop, I run all requested channels and base64-encode the results.
```python
region_maps = {}
for channel in channels:
    channel_img = _run_rgbx_channel(crop_resized, channel,
                                    request.num_inference_steps, ch, cw)
    region_maps[f"{channel}_base64"] = _encode_image(channel_img)
```
The “boring but correct” part: I pass `ch, cw` explicitly into `_run_rgbx_channel`. I don’t trust implicit resizing inside the model call, because it makes it harder to reason about what resolution each map was actually produced at.
Reliability engineering: endpoints still register when a model fails
GPU servers fail in annoying ways: missing wheels, mismatched CUDA builds, dependency regressions. When that happens, I still want the server to start and expose its API surface, even if a feature is temporarily unavailable.
So I load RGB‑X conditionally and register stub request models so the endpoint can exist.
```python
from typing import List
from pydantic import BaseModel

RGBX_IMPORT_OK = False

class PBREstimateRequest(BaseModel):  # Stub so the endpoint still registers
    image_base64: str
    channels: List[str] = ["albedo", "normal", "roughness"]

async def estimate_pbr(request):
    raise ValueError("RGB-X pipeline not available")
```
This pattern keeps the system honest: if RGB‑X isn’t there, it fails loudly at call time instead of failing mysteriously at boot.
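For completeness, the guard that sets `RGBX_IMPORT_OK` looks roughly like this. The `rgbx` module and `load_pipeline` name are placeholders, not the real import path; the point is the pattern:

```python
# Conditional-load pattern: the server always starts, the feature degrades.
RGBX_IMPORT_OK = False
try:
    from rgbx import load_pipeline  # hypothetical import; real path differs
    rgbx_pipe = load_pipeline()
    RGBX_IMPORT_OK = True
except Exception:
    rgbx_pipe = None  # endpoint still registers; calls fail loudly, boot does not

async def estimate_pbr(request):
    if not RGBX_IMPORT_OK:
        raise ValueError("RGB-X pipeline not available")
    ...  # real per-region decomposition runs here
```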
Patch for Transformers constant removals
RGB‑X depends on constants that aren’t present in newer transformers.utils. I patch them in-place before import paths explode.
```python
import transformers.utils as _tu

if not hasattr(_tu, "FLAX_WEIGHTS_NAME"):
    _tu.FLAX_WEIGHTS_NAME = "flax_model.msgpack"
if not hasattr(_tu, "ONNX_WEIGHTS_NAME"):
    _tu.ONNX_WEIGHTS_NAME = "model.onnx"
```
I’m not romantic about this kind of patching—it’s a pragmatic “keep the server alive” move. The win is operational: the rest of the pipeline can still run, and the failure mode is constrained.
The procedural PBR fallback: the viewer always works
Even with a robust GPU pipeline, I don’t want the 3D experience to depend on it. If decomposition is unavailable, I still want materials that respond to light in a plausible way.
So on the client side I generate textures procedurally using FBM noise and a height-to-normal conversion.
FBM noise: cheap structure that reads like material
```typescript
function fbmNoise(x: number, y: number, octaves = 4, seed = 0): number {
  let value = 0, amplitude = 0.5, frequency = 1, maxValue = 0;
  for (let i = 0; i < octaves; i++) {
    value += amplitude * smoothNoise(x * frequency, y * frequency, seed + i * 100);
    maxValue += amplitude;
    amplitude *= 0.5;
    frequency *= 2;
  }
  return value / maxValue; // normalize by total amplitude
}
```
The detail that matters is the amplitude/frequency schedule: halve amplitude, double frequency. It’s the simplest way to get multi-scale texture without shipping image assets.
Height → normal: turning scalar bumps into lighting response
```typescript
function heightToNormal(heightData: Uint8Array, width: number, height: number, strength = 1.0): Uint8Array {
  const normalData = new Uint8Array(width * height * 4);
  for (let y = 0; y < height; y++) {
    for (let x = 0; x < width; x++) {
      const idx = (y * width + x) * 4;
      // Sample neighbors with edge clamping
      const left = heightData[y * width + Math.max(0, x - 1)] / 255;
      const right = heightData[y * width + Math.min(width - 1, x + 1)] / 255;
      const up = heightData[Math.max(0, y - 1) * width + x] / 255;
      const down = heightData[Math.min(height - 1, y + 1) * width + x] / 255;
      const dx = (left - right) * strength;
      const dy = (up - down) * strength;
      const dz = 1.0;
      const len = Math.sqrt(dx * dx + dy * dy + dz * dz);
      // Pack the unit normal into [0, 255] RGB
      normalData[idx] = Math.floor((dx / len * 0.5 + 0.5) * 255);
      normalData[idx + 1] = Math.floor((dy / len * 0.5 + 0.5) * 255);
      normalData[idx + 2] = Math.floor((dz / len * 0.5 + 0.5) * 255);
      normalData[idx + 3] = 255;
    }
  }
  return normalData;
}
```
This is the difference between a “flat sticker” and something that actually reacts when you move the light. Even as a fallback, it preserves the core promise: materials look different under different illumination.
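The same math ports directly to the server side if it's ever needed there. A vectorized NumPy sketch of the loop above (my naming; the edge-clamped neighbor sampling matches the `Math.max`/`Math.min` indexing in the TypeScript version):

```python
import numpy as np

def height_to_normal(height, strength=1.0):
    """Vectorized height-to-normal: uint8 (H, W) in, uint8 (H, W, 4) RGBA out."""
    h = height.astype(np.float32) / 255.0
    # Neighbor samples with edge clamping, mirroring the per-pixel loop.
    left  = np.pad(h, ((0, 0), (1, 0)), mode="edge")[:, :-1]
    right = np.pad(h, ((0, 0), (0, 1)), mode="edge")[:, 1:]
    up    = np.pad(h, ((1, 0), (0, 0)), mode="edge")[:-1, :]
    down  = np.pad(h, ((0, 1), (0, 0)), mode="edge")[1:, :]
    dx = (left - right) * strength
    dy = (up - down) * strength
    dz = np.ones_like(dx)
    length = np.sqrt(dx * dx + dy * dy + dz * dz)
    n = np.stack([dx, dy, dz], axis=-1) / length[..., None]
    # Pack the unit normal into [0, 255] RGB, alpha fully opaque.
    rgb = np.floor((n * 0.5 + 0.5) * 255).astype(np.uint8)
    alpha = np.full(h.shape + (1,), 255, np.uint8)
    return np.concatenate([rgb, alpha], axis=-1)
```

A flat height map comes out as the canonical "neutral" normal color (127, 127, 255), which is a quick correctness check.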
Nuances that matter in practice
A per-region diffusion pipeline sounds straightforward until you feel the edges.
Why the naive “full image” approach fails
The failure isn’t that RGB‑X is bad. The failure is that the objective is misaligned:
- The model is encouraged to explain the entire frame coherently.
- Coherence across the frame is not what you want when the frame contains multiple materials.
Per-region cropping forces the model into the right local context. It’s not more intelligence—it’s better problem framing.
Why padding is a real parameter, not a magic number
I use 10% padding because it’s proportional, not absolute. Thin regions get a little context; large regions get enough boundary room to avoid edge artifacts.
The tradeoff is obvious: too much padding and you reintroduce cross-material contamination. Too little and the crop becomes “mask-edge dominated.”
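The arithmetic makes the proportionality concrete. Wrapping the same padding-and-clamping math from earlier in a hypothetical `padded_box` helper: a thin 20px-wide trim strip gets 2px of horizontal context, while a 900px-wide wall gets 90px.

```python
def padded_box(bbox, img_w, img_h, frac=0.1):
    """Pad a [x, y, w, h] bbox by a fraction of its size, clamped to the image."""
    bx, by, bw, bh = bbox
    px, py = int(bw * frac), int(bh * frac)
    return (max(0, bx - px), max(0, by - py),
            min(img_w, bx + bw + px), min(img_h, by + bh + py))
```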
The cost tradeoff: channel passes multiply quickly
Each region runs multiple diffusion passes—one per channel. That’s computationally heavy by definition.
The reason I accept it is that it buys a clean mental model:
- each output map has a specific prompt
- each prompt corresponds to a single inference call
When something looks wrong, I can isolate which pass is responsible.
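A back-of-envelope sketch of how the passes multiply (the helper name and the per-pass time are purely illustrative):

```python
def decomposition_cost(num_regions, channels, secs_per_pass):
    """Every (region, channel) pair is one full diffusion inference."""
    calls = num_regions * len(channels)
    return calls, calls * secs_per_pass
```

Five regions and four channels is already 20 full diffusion passes, which is why the per-region crop sizes and step counts matter so much.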
The operational tradeoff: patching dependencies isn’t pretty
The Transformers constant patch is not elegant. It’s a stability move.
The alternative is worse: a server that fails to start because a dependency changed a constant name. I’d rather keep the pipeline alive and constrain the failure mode.
Closing
The moment I stopped decomposing the whole photo and started decomposing regions, the outputs stopped looking like “AI texture soup” and started behaving like materials—local, bounded, and believable under changing light.
