CVPR 2026

Text–Image Conditioned 3D Generation

*Work done during internship at Huawei    Corresponding authors
1MoE Key Lab of AI, AI Institute, School of CS, Shanghai Jiao Tong University   2Huawei Inc.   3Huazhong University of Science and Technology

Joint text–image conditioning yields 3D assets that are simultaneously faithful to the reference appearance and aligned with the textual description.

Abstract

High-quality 3D assets are critical for VR/AR, industrial design, and entertainment, driving growing interest in generative models that create 3D content from user-provided prompts. Most existing 3D generators rely on a single conditioning modality: image-conditioned models deliver high visual fidelity but suffer from viewpoint bias when the input view is limited or ambiguous, whereas text-conditioned models benefit from broad semantic guidance yet lack low-level visual detail. Our diagnostic study shows that even a simple late fusion of text- and image-conditioned predictions improves over single-modality models, evidencing strong cross-modal complementarity. Building on this finding, we formalize the task of Text–Image Conditioned 3D Generation, which requires joint reasoning over a visual exemplar and a textual specification. To address this task, we introduce TIGON (Text–Image conditioned GeneratiON), a minimalist dual-branch baseline that maintains separate image- and text-conditioned DiT backbones coupled via lightweight cross-modal fusion. Extensive experiments demonstrate consistent gains over single-modality methods, suggesting complementary vision–language guidance as a promising direction for future 3D generation research.

Motivation

Image-only 3D generation captures local appearance faithfully but is highly sensitive to viewpoint informativeness. An uninformative view (top/bottom) leaves large parts of the geometry unconstrained, leading to hallucinated regions that deviate from user intent. Text-only generation encodes rich semantics but lacks fine-grained visual constraints; outputs may broadly match the description yet fail on specific shape, color, or style. Our key observation is that the two modalities are complementary: the image anchors appearance while text fills semantic gaps. Even a trivial inference-time velocity-averaging baseline (SimFusion) already achieves an FD_DINOv2 of 82.40 on Toys4K (lower is better), outperforming image-only (125.93) and text-only (154.88) conditioning by a large margin. This motivates training a dedicated joint model, TIGON.

Table 1: Performance on Toys4K under different conditioning signals

| Model | Conditioning | Rep. | CLIP ↑ | FD_DINOv2 ↓ |
|---|---|---|---|---|
| TripoSR | View-0 (front) | M. | 88.67 | 269.58 |
| Step1X-3D † | View-0 (front) | M. | 89.99 | 152.69 |
| Hunyuan3D-2.1 † | View-0 (front) | M. | 89.87 | 114.64 |
| TRELLIS | View-0 (front) | GS | 92.88 | 56.08 |
| UniLat3D | View-0 (front) | GS | 93.34 | 47.41 |
| TripoSR | View-1 (low-angle) | M. | 79.40 | 804.18 |
| Step1X-3D † | View-1 (low-angle) | M. | 80.47 | 562.84 |
| Hunyuan3D-2.1 † | View-1 (low-angle) | M. | 85.33 | 229.36 |
| TRELLIS | View-1 (low-angle) | GS | 88.16 | 143.58 |
| UniLat3D | View-1 (low-angle) | GS | 89.03 | 125.93 |
| TRELLIS | Text | GS | 86.30 | 148.21 |
| UniLat3D | Text | GS | 86.14 | 154.88 |
| SimFusion (Ours) | View-1 + Text | GS | 90.64 | 82.40 |

† Using non-public training data.   Rep.: M.=mesh, GS=3D Gaussian Splatting.   View-0 = frontal; View-1 = low-angle (less informative).

Method


TIGON’s dual-branch DiT. Zero-initialized cross-modal bridges exchange features at every block. At each denoising step, velocity predictions from both branches are averaged to produce the final velocity field v.

01. Dual-Branch Backbone

Two modality-specialized DiT backbones process image tokens (dense, view-grounded) and text tokens (sparse, abstract) independently, avoiding cross-modal granularity mismatch while preserving each branch’s single-modality capability.
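The separation can be sketched in a few lines of NumPy. This is a toy stand-in for the actual DiT backbones: the dimensions and the `make_branch` helper are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # toy token dimension; real DiT widths are far larger

def make_branch():
    # Each branch owns its weights: no parameter sharing, so each
    # branch keeps its single-modality behavior intact.
    W = rng.standard_normal((d, d)) / np.sqrt(d)
    return lambda tokens: np.tanh(tokens @ W)  # stand-in for a DiT stack

img_branch = make_branch()  # consumes dense, view-grounded tokens
txt_branch = make_branch()  # consumes sparse, abstract tokens

img_tokens = rng.standard_normal((256, d))  # e.g. 256 patch tokens
txt_tokens = rng.standard_normal((8, d))    # e.g. 8 word tokens

# Independent forward passes: the differing token granularities never mix.
h_img, h_txt = img_branch(img_tokens), txt_branch(txt_tokens)
print(h_img.shape, h_txt.shape)
```

Because the branches never exchange tokens directly, the dense image sequence and the short text sequence are each processed at their natural granularity.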

02. Early Fusion — Cross-Modal Bridges

Zero-initialized linear projections at every DiT block enable bidirectional feature sharing (inspired by ControlNet). Zero-init ensures stability at the start of joint fine-tuning; gradients progressively open the gates during training.
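Why zero-initialization is safe can be seen in a minimal sketch. Here the bridge is simplified to a single mean-pooled linear exchange per block; the pooling and the `bridge_exchange` name are illustrative assumptions, not the paper's exact bridge design.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # toy token dimension

def bridge_exchange(x_img, x_txt, W_t2i, W_i2t):
    """ControlNet-style bridge: each branch receives a linear projection
    of the other branch's (mean-pooled) features as an additive residual."""
    y_img = x_img + x_txt.mean(axis=0) @ W_t2i
    y_txt = x_txt + x_img.mean(axis=0) @ W_i2t
    return y_img, y_txt

# Zero-initialized projections: at the start of joint fine-tuning the
# bridge is an exact identity map, so neither pre-trained branch is
# perturbed; gradients then open the gates during training.
W_t2i = np.zeros((d, d))
W_i2t = np.zeros((d, d))

x_img = rng.standard_normal((16, d))  # dense image tokens
x_txt = rng.standard_normal((4, d))   # sparse text tokens

y_img, y_txt = bridge_exchange(x_img, x_txt, W_t2i, W_i2t)
assert np.allclose(y_img, x_img) and np.allclose(y_txt, x_txt)
```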

03. Late Fusion — Prediction Averaging

At each denoising step: v = ½(v_txt + v_img). This simple scheme matches or outperforms learned fusion variants while adding no extra parameters at inference.
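A minimal sketch of this late fusion inside an Euler sampling loop, assuming a rectified-flow-style integration from t = 0 to t = 1. The `sample` function and the constant toy velocity fields are hypothetical stand-ins for the two branches' predictions.

```python
import numpy as np

def sample(v_txt, v_img, x0, n_steps=50):
    """Euler-integrate the averaged velocity field from t = 0 to t = 1.
    v_txt / v_img are callables (x, t) -> velocity, standing in for the
    text- and image-branch predictions."""
    x, dt = np.asarray(x0, dtype=float), 1.0 / n_steps
    for i in range(n_steps):
        t = i * dt
        v = 0.5 * (v_txt(x, t) + v_img(x, t))  # late fusion: simple average
        x = x + dt * v
    return x

# Toy check with constant velocity fields: the fused endpoint is the
# midpoint of the two single-branch endpoints.
v_a = lambda x, t: np.full_like(x, 2.0)
v_b = lambda x, t: np.full_like(x, 0.0)
x1 = sample(v_a, v_b, np.zeros(3))
print(x1)  # ≈ [1. 1. 1.]
```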

04. Free-Form Conditioning

Condition dropout (p = 0.5 per modality) during training produces four regimes: unconditional, text-only, image-only, and joint. At inference, TIGON flexibly accepts any combination.
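The dropout scheme can be sketched as independent coin flips per modality; `draw_regime` is an illustrative helper, not the paper's code.

```python
import random

random.seed(0)

def draw_regime(p_drop=0.5):
    """Independently drop each condition with probability p_drop,
    yielding one of four training regimes."""
    keep_img = random.random() >= p_drop
    keep_txt = random.random() >= p_drop
    if keep_img and keep_txt:
        return "joint"
    if keep_img:
        return "image-only"
    if keep_txt:
        return "text-only"
    return "unconditional"

counts = {}
for _ in range(10_000):
    r = draw_regime()
    counts[r] = counts.get(r, 0) + 1

# Each regime occurs ~25% of the time with p_drop = 0.5, so the model
# sees all four conditioning combinations during training.
print(sorted(counts))
```

Because all four regimes are seen during training, the model accepts any subset of conditions at inference without retraining.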

Quantitative Results

Evaluated on Toys4K and UniLat1K with CLIP, FD_DINOv2 (lower is better), ULIP, and Uni3D. Each case is conditioned on three reference views (front, top, bottom).

Toys4K

| Model | Cond. | Rep. | CLIP ↑ | FD_DINOv2 ↓ | ULIP ↑ | Uni3D ↑ |
|---|---|---|---|---|---|---|
| TripoSR | I | M. | 83.14 | 596.44 | 27.37 | 24.38 |
| TRELLIS | I | M. | 89.09 | 171.44 | 39.97 | 35.61 |
| TRELLIS | I | GS | 90.50 | 98.75 | – | – |
| Step1X-3D † | I | M. | 84.77 | 361.44 | 34.15 | 30.04 |
| Hunyuan3D-2.1 † | I | M. | 87.57 | 171.91 | 40.22 | 35.70 |
| UniLat3D | I | M. | 91.85 | 109.68 | 40.32 | 35.75 |
| UniLat3D | I | GS | 91.20 | 85.30 | – | – |
| TRELLIS | T | GS | 86.30 | 148.21 | – | – |
| UniLat3D | T | GS | 86.14 | 154.88 | – | – |
| TIGON (Ours) | I | GS | 91.40 | 84.62 | – | – |
| TIGON (Ours) | T | GS | 86.77 | 152.34 | – | – |
| TIGON (Ours) | I+T | M. | 92.97 | 80.77 | 41.36 | 36.68 |
| TIGON (Ours) | I+T | GS | 92.33 | 61.59 | – | – |

† Using non-public training data.   Cond.: I=image, T=text.   Rep.: M.=mesh, GS=3D Gaussian Splatting.

UniLat1K

| Model | Cond. | Rep. | CLIP ↑ | FD_DINOv2 ↓ | ULIP ↑ | Uni3D ↑ |
|---|---|---|---|---|---|---|
| TripoSR | I | M. | 83.37 | 652.27 | 25.90 | 23.96 |
| TRELLIS | I | M. | 89.40 | 233.53 | 39.37 | 35.40 |
| UniLat3D | I | M. | 90.00 | 205.72 | 39.60 | 35.49 |
| UniLat3D | I | GS | 91.40 | 155.99 | – | – |
| UniLat3D | T | GS | 85.75 | 282.36 | – | – |
| TIGON (Ours) | I | GS | 91.64 | 153.79 | – | – |
| TIGON (Ours) | T | GS | 86.42 | 273.97 | – | – |
| TIGON (Ours) | I+T | M. | 90.91 | 176.69 | 40.95 | 36.74 |
| TIGON (Ours) | I+T | GS | 92.42 | 130.08 | – | – |

Qualitative Comparison

Comparison with Baselines


TIGON vs. TRELLIS and UniLat3D under image-only, text-only, and joint conditioning. With only a top or bottom reference view, image-only models must hallucinate unobserved regions, failing to reconstruct a faithful trophy shape or producing distorted toaster slots. Injecting text semantics via TIGON's joint conditioning recovers the correct shape and appearance.

Mesh Generation vs. Baselines


TIGON vs. Hunyuan3D-2.1 and Step1X-3D. Image-only methods depend on favorable viewpoints — TIGON’s text conditioning recovers plausible geometry in challenging cases (e.g., a bird from an uninformative view; a deer where image-only methods miss the legs entirely).

Controllable 3D Generation

Given the same reference image, TIGON produces diverse 3D objects guided by different text prompts.

Ablation Study

Table: Ablations on Toys4K. "Bridges" = zero-initialized cross-modal bridges. Fusion strategies: Sim = simple averaging, AW = adaptive (learned) weight, AT = attention-based fusion. FT = joint fine-tuning. The Bridges + Sim + FT row is our TIGON configuration.

| Bridges | Sim | AW | AT | FT | CLIP ↑ | FD_DINOv2 ↓ |
|---|---|---|---|---|---|---|
|  | ✓ |  |  |  | 91.95 | 66.78 |
|  | ✓ |  |  | ✓ | 92.05 | 66.04 |
| ✓ | ✓ |  |  | ✓ | 92.33 | 61.59 |
| ✓ |  | ✓ |  | ✓ | 92.31 | 60.90 |
| ✓ |  |  | ✓ | ✓ | 92.26 | 62.00 |


We ablate TIGON's two core design choices on Toys4K: the cross-modal bridges (early fusion) and the late-fusion strategy.

Early Fusion — Cross-Modal Bridges. Without cross-modal bridges, joint fine-tuning of the two branches alone yields only marginal improvement (FD_DINOv2: 66.78 → 66.04). Enabling the zero-initialized bridges brings a substantial gain (66.78 → 61.59), confirming that explicit cross-modal feature exchange is essential. Qualitatively, without bridges the text- and image-conditioned branches diverge during denoising, producing inconsistent or abnormal structures; with bridges they remain aligned and yield coherent outputs.

Late Fusion — Averaging vs. Learned Strategies. Under the same early-fusion setup, simple step-wise velocity averaging (Sim) already achieves 61.59 FD_DINOv2. Replacing it with adaptive weighting (AW) or attention-based fusion (AT) changes the score only marginally (60.90 and 62.00, respectively), despite adding extra parameters and training variance. We therefore adopt parameter-free simple averaging as our default.

Discussion

Discussion

Open challenges. When image and text specify highly inconsistent semantics, the model may struggle to resolve the conflict gracefully. It tends to follow the image branch since it usually provides more reliable guidance. Moreover, the dual-branch design approximately doubles inference FLOPs compared to single-modality baselines. Performance is also bounded by the quality of the underlying pre-trained models. We hope TIGON serves as a solid baseline for the community to build upon for this emerging task.

BibTeX

@inproceedings{cen2026tigon,
  title     = {Text--Image Conditioned {3D} Generation},
  author    = {Cen, Jiazhong and Fang, Jiemin and Li, Sikuang and Wu, Guanjun and Yang, Chen and Yi, Taoran and Zhou, Zanwei and Bao, Zhikuan and Xie, Lingxi and Shen, Wei and Tian, Qi},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year      = {2026}
}