CVPR 2026

Text–Image Conditioned 3D Generation

*Work done during internship at Huawei    Corresponding authors
1MoE Key Lab of AI, AI Institute, School of CS, Shanghai Jiao Tong University   2Huawei Inc.   3Huazhong University of Science and Technology

Joint text–image conditioning yields 3D assets that are simultaneously faithful to the reference appearance and aligned with the textual description.

Abstract

High-quality 3D assets are critical for VR/AR, industrial design, and entertainment, driving growing interest in generative models that create 3D content from user-provided prompts. Most existing 3D generators rely on a single conditioning modality: image-conditioned models deliver high visual fidelity but suffer from viewpoint bias when the input view is limited or ambiguous, whereas text-conditioned models benefit from broad semantic guidance yet lack low-level visual detail. Our diagnostic study shows that even a simple late fusion of text- and image-conditioned predictions improves over single-modality models, evidencing strong cross-modal complementarity. Building on this finding, we formalize the task of Text–Image Conditioned 3D Generation, which requires joint reasoning over a visual exemplar and a textual specification. To address this task, we introduce TIGON (Text–Image conditioned GeneratiON), a minimalist dual-branch baseline that maintains separate image- and text-conditioned DiT backbones coupled via lightweight cross-modal fusion. Extensive experiments demonstrate consistent gains over single-modality methods, suggesting complementary vision–language guidance as a promising direction for future 3D generation research.

Motivation

Image-only 3D generation captures local appearance faithfully but is highly sensitive to viewpoint informativeness. An uninformative view (top/bottom) leaves large parts of the geometry unconstrained, leading to hallucinated regions that deviate from user intent. Text-only generation encodes rich semantics but lacks fine-grained visual constraints; outputs may broadly match the description yet fail on specific shape, color, or style. Our key observation is that the two modalities are complementary: the image anchors appearance while text fills semantic gaps. Even a trivial inference-time velocity-averaging baseline (SimFusion) already achieves an FD_DINOv2 of 82.40 on Toys4K (lower is better), outperforming image-only (125.93) and text-only (154.88) conditioning by a large margin. This motivates training a dedicated joint model, TIGON.

Table 1: Performance on Toys4K under different conditioning signals

| Model | Conditioning | Rep. | CLIP ↑ | FD_DINOv2 ↓ |
|---|---|---|---|---|
| TripoSR | View-0 (front) | M. | 88.67 | 269.58 |
| Step1X-3D † | View-0 (front) | M. | 89.99 | 152.69 |
| Hunyuan3D-2.1 † | View-0 (front) | M. | 89.87 | 114.64 |
| TRELLIS | View-0 (front) | GS | 92.88 | 56.08 |
| UniLat3D | View-0 (front) | GS | 93.34 | 47.41 |
| TripoSR | View-1 (low-angle) | M. | 79.40 | 804.18 |
| Step1X-3D † | View-1 (low-angle) | M. | 80.47 | 562.84 |
| Hunyuan3D-2.1 † | View-1 (low-angle) | M. | 85.33 | 229.36 |
| TRELLIS | View-1 (low-angle) | GS | 88.16 | 143.58 |
| UniLat3D | View-1 (low-angle) | GS | 89.03 | 125.93 |
| TRELLIS | Text | GS | 86.30 | 148.21 |
| UniLat3D | Text | GS | 86.14 | 154.88 |
| SimFusion (Ours) | View-1 + Text | GS | 90.64 | 82.40 |

† Using non-public training data.   Rep.: M.=mesh, GS=3D Gaussian Splatting.   View-0 = frontal; View-1 = low-angle (less informative).

Method


TIGON’s dual-branch DiT. Zero-initialized cross-modal bridges exchange features at every block. At each denoising step, velocity predictions from both branches are averaged to produce the final velocity field v.

01. Dual-Branch Backbone

Two modality-specialized DiT backbones process image tokens (dense, view-grounded) and text tokens (sparse, abstract) independently, avoiding cross-modal granularity mismatch while preserving each branch’s single-modality capability.
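The separation can be sketched in a few lines of NumPy. This is a toy stand-in for the actual DiT backbones: the dimensions and the `make_branch` helper are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # toy token dimension; real DiT widths are far larger

def make_branch():
    # Each branch owns its weights: no parameter sharing, so each
    # branch keeps its single-modality behavior intact.
    W = rng.standard_normal((d, d)) / np.sqrt(d)
    return lambda tokens: np.tanh(tokens @ W)  # stand-in for a DiT stack

img_branch = make_branch()  # consumes dense, view-grounded tokens
txt_branch = make_branch()  # consumes sparse, abstract tokens

img_tokens = rng.standard_normal((256, d))  # e.g. 256 patch tokens
txt_tokens = rng.standard_normal((8, d))    # e.g. 8 word tokens

# Independent forward passes: the differing token granularities never mix.
h_img, h_txt = img_branch(img_tokens), txt_branch(txt_tokens)
print(h_img.shape, h_txt.shape)
```

Because the branches never exchange tokens directly, the dense image sequence and the short text sequence are each processed at their natural granularity.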

02. Early Fusion — Cross-Modal Bridges

Zero-initialized linear projections at every DiT block enable bidirectional feature sharing (inspired by ControlNet). Zero-init ensures stability at the start of joint fine-tuning; gradients progressively open the gates during training.
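Why zero-initialization is safe can be seen in a minimal sketch. Here the bridge is simplified to a single mean-pooled linear exchange per block; the pooling and the `bridge_exchange` name are illustrative assumptions, not the paper's exact bridge design.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # toy token dimension

def bridge_exchange(x_img, x_txt, W_t2i, W_i2t):
    """ControlNet-style bridge: each branch receives a linear projection
    of the other branch's (mean-pooled) features as an additive residual."""
    y_img = x_img + x_txt.mean(axis=0) @ W_t2i
    y_txt = x_txt + x_img.mean(axis=0) @ W_i2t
    return y_img, y_txt

# Zero-initialized projections: at the start of joint fine-tuning the
# bridge is an exact identity map, so neither pre-trained branch is
# perturbed; gradients then open the gates during training.
W_t2i = np.zeros((d, d))
W_i2t = np.zeros((d, d))

x_img = rng.standard_normal((16, d))  # dense image tokens
x_txt = rng.standard_normal((4, d))   # sparse text tokens

y_img, y_txt = bridge_exchange(x_img, x_txt, W_t2i, W_i2t)
assert np.allclose(y_img, x_img) and np.allclose(y_txt, x_txt)
```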

03. Late Fusion — Prediction Averaging

At each denoising step: v = ½(v_txt + v_img). This simple scheme matches or outperforms learned fusion variants while adding no extra parameters at inference.
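A minimal sketch of this late fusion inside an Euler sampling loop, assuming a rectified-flow-style integration from t = 0 to t = 1. The `sample` function and the constant toy velocity fields are hypothetical stand-ins for the two branches' predictions.

```python
import numpy as np

def sample(v_txt, v_img, x0, n_steps=50):
    """Euler-integrate the averaged velocity field from t = 0 to t = 1.
    v_txt / v_img are callables (x, t) -> velocity, standing in for the
    text- and image-branch predictions."""
    x, dt = np.asarray(x0, dtype=float), 1.0 / n_steps
    for i in range(n_steps):
        t = i * dt
        v = 0.5 * (v_txt(x, t) + v_img(x, t))  # late fusion: simple average
        x = x + dt * v
    return x

# Toy check with constant velocity fields: the fused endpoint is the
# midpoint of the two single-branch endpoints.
v_a = lambda x, t: np.full_like(x, 2.0)
v_b = lambda x, t: np.full_like(x, 0.0)
x1 = sample(v_a, v_b, np.zeros(3))
print(x1)  # ≈ [1. 1. 1.]
```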

04. Free-Form Conditioning

Condition dropout (p = 0.5 per modality) during training produces four regimes: unconditional, text-only, image-only, and joint. At inference, TIGON flexibly accepts any combination.
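The dropout scheme can be sketched as independent coin flips per modality; `draw_regime` is an illustrative helper, not the paper's code.

```python
import random

random.seed(0)

def draw_regime(p_drop=0.5):
    """Independently drop each condition with probability p_drop,
    yielding one of four training regimes."""
    keep_img = random.random() >= p_drop
    keep_txt = random.random() >= p_drop
    if keep_img and keep_txt:
        return "joint"
    if keep_img:
        return "image-only"
    if keep_txt:
        return "text-only"
    return "unconditional"

counts = {}
for _ in range(10_000):
    r = draw_regime()
    counts[r] = counts.get(r, 0) + 1

# Each regime occurs ~25% of the time with p_drop = 0.5, so the model
# sees all four conditioning combinations during training.
print(sorted(counts))
```

Because all four regimes are seen during training, the model accepts any subset of conditions at inference without retraining.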

Quantitative Results

Evaluated on Toys4K and UniLat1K with CLIP, FD_DINOv2 (lower is better), ULIP, and Uni3D. Each case is conditioned on three reference views (front, top, bottom).

Toys4K

| Model | Cond. | Rep. | CLIP ↑ | FD_DINOv2 ↓ | ULIP ↑ | Uni3D ↑ |
|---|---|---|---|---|---|---|
| TripoSR | I | M. | 83.14 | 596.44 | 27.37 | 24.38 |
| TRELLIS | I | M. | 89.09 | 171.44 | 39.97 | 35.61 |
| TRELLIS | I | GS | 90.50 | 98.75 | – | – |
| Step1X-3D † | I | M. | 84.77 | 361.44 | 34.15 | 30.04 |
| Hunyuan3D-2.1 † | I | M. | 87.57 | 171.91 | 40.22 | 35.70 |
| UniLat3D | I | M. | 91.85 | 109.68 | 40.32 | 35.75 |
| UniLat3D | I | GS | 91.20 | 85.30 | – | – |
| TRELLIS | T | GS | 86.30 | 148.21 | – | – |
| UniLat3D | T | GS | 86.14 | 154.88 | – | – |
| TIGON (Ours) | I | GS | 91.40 | 84.62 | – | – |
| TIGON (Ours) | T | GS | 86.77 | 152.34 | – | – |
| TIGON (Ours) | I+T | M. | 92.97 | 80.77 | 41.36 | 36.68 |
| TIGON (Ours) | I+T | GS | 92.33 | 61.59 | – | – |

† Using non-public training data.   Cond.: I=image, T=text.   Rep.: M.=mesh, GS=3D Gaussian Splatting.

UniLat1K

| Model | Cond. | Rep. | CLIP ↑ | FD_DINOv2 ↓ | ULIP ↑ | Uni3D ↑ |
|---|---|---|---|---|---|---|
| TripoSR | I | M. | 83.37 | 652.27 | 25.90 | 23.96 |
| TRELLIS | I | M. | 89.40 | 233.53 | 39.37 | 35.40 |
| UniLat3D | I | M. | 90.00 | 205.72 | 39.60 | 35.49 |
| UniLat3D | I | GS | 91.40 | 155.99 | – | – |
| UniLat3D | T | GS | 85.75 | 282.36 | – | – |
| TIGON (Ours) | I | GS | 91.64 | 153.79 | – | – |
| TIGON (Ours) | T | GS | 86.42 | 273.97 | – | – |
| TIGON (Ours) | I+T | M. | 90.91 | 176.69 | 40.95 | 36.74 |
| TIGON (Ours) | I+T | GS | 92.42 | 130.08 | – | – |

Qualitative Comparison

Comparison with Baselines


TIGON vs. TRELLIS and UniLat3D under image-only, text-only, and joint conditioning. With only a top or bottom reference view, image-only models must hallucinate unobserved regions, failing to reconstruct a faithful trophy shape or producing distorted toaster slots. Injecting text semantics via TIGON's joint conditioning recovers the correct shape and appearance.

Mesh Generation vs. Baselines


TIGON vs. Hunyuan3D-2.1 and Step1X-3D. Image-only methods depend on favorable viewpoints — TIGON’s text conditioning recovers plausible geometry in challenging cases (e.g., a bird from an uninformative view; a deer where image-only methods miss the legs entirely).

Controllable 3D Generation

Given the same reference image, TIGON produces diverse 3D objects guided by different text prompts.

Ablation Study

Table: Ablations on Toys4K. "Bridges" = zero-initialized cross-modal bridges. Fusion strategies: Sim = simple averaging, AW = adaptive (learned) weight, AT = attention-based fusion. FT = joint fine-tuning. The Bridges + Sim + FT row is our TIGON configuration.

| Bridges | Sim | AW | AT | FT | CLIP ↑ | FD_DINOv2 ↓ |
|---|---|---|---|---|---|---|
|  | ✓ |  |  |  | 91.95 | 66.78 |
|  | ✓ |  |  | ✓ | 92.05 | 66.04 |
| ✓ | ✓ |  |  | ✓ | 92.33 | 61.59 |
| ✓ |  | ✓ |  | ✓ | 92.31 | 60.90 |
| ✓ |  |  | ✓ | ✓ | 92.26 | 62.00 |


We ablate TIGON's two core design choices on Toys4K: the cross-modal bridges (early fusion) and the late-fusion strategy.

Early Fusion — Cross-Modal Bridges. Without cross-modal bridges, joint fine-tuning of the two branches alone yields only marginal improvement (FD_DINOv2: 66.78 → 66.04). Enabling the zero-initialized bridges brings a substantial gain (66.78 → 61.59), confirming that explicit cross-modal feature exchange is essential. Qualitatively, without bridges the text- and image-conditioned branches diverge during denoising, producing inconsistent or abnormal structures; with bridges they remain aligned and yield coherent outputs.

Late Fusion — Averaging vs. Learned Strategies. Under the same early-fusion setup, simple step-wise velocity averaging (Sim) already achieves 61.59 FD_DINOv2. Replacing it with adaptive weighting (AW) or attention-based fusion (AT) changes the score only marginally (60.90 and 62.00, respectively), despite adding extra parameters and training variance. We therefore adopt parameter-free simple averaging as our default.

Discussion

Discussion

Open challenges. When image and text specify highly inconsistent semantics, the model may struggle to resolve the conflict gracefully. It tends to follow the image branch since it usually provides more reliable guidance. Moreover, the dual-branch design approximately doubles inference FLOPs compared to single-modality baselines. Performance is also bounded by the quality of the underlying pre-trained models. We hope TIGON serves as a solid baseline for the community to build upon for this emerging task.

BibTeX

@inproceedings{cen2026tigon,
  title     = {Text--Image Conditioned {3D} Generation},
  author    = {Cen, Jiazhong and Fang, Jiemin and Li, Sikuang and Wu, Guanjun and Yang, Chen and Yi, Taoran and Zhou, Zanwei and Bao, Zhikuan and Xie, Lingxi and Shen, Wei and Tian, Qi},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year      = {2026}
}