High-quality 3D assets are critical for VR/AR, industrial design, and entertainment, driving growing interest in generative models that create 3D content from user-provided prompts. Most existing 3D generators rely on a single conditioning modality: image-conditioned models deliver high visual fidelity but suffer from viewpoint bias when the input view is limited or ambiguous, whereas text-conditioned models benefit from broad semantic guidance yet lack low-level visual detail. Our diagnostic study shows that even a simple late fusion of text- and image-conditioned predictions outperforms either single-modality model, indicating strong cross-modal complementarity. Building on this finding, we formalize the task of Text–Image Conditioned 3D Generation, which requires joint reasoning over a visual exemplar and a textual specification. To address this task, we introduce TIGON (Text–Image conditioned GeneratiON), a minimalist dual-branch baseline that maintains separate image- and text-conditioned DiT backbones coupled via lightweight cross-modal fusion. Extensive experiments demonstrate consistent gains over single-modality methods, suggesting complementary vision–language guidance as a promising direction for future 3D generation research.
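The abstract does not specify the late-fusion rule used in the diagnostic study; a minimal sketch under the assumption that each branch outputs a prediction of the same shape (e.g. a denoised 3D latent) and that fusion is a convex combination (`late_fuse` and `alpha` are hypothetical names, not from the paper):

```python
import numpy as np

def late_fuse(text_pred: np.ndarray, image_pred: np.ndarray,
              alpha: float = 0.5) -> np.ndarray:
    """Convex combination of two branch predictions.

    Hypothetical fusion rule for illustration only:
    text_pred / image_pred are per-branch predictions of equal shape,
    alpha is the weight placed on the image branch.
    """
    assert text_pred.shape == image_pred.shape
    return alpha * image_pred + (1.0 - alpha) * text_pred

# Toy usage with two 4-dim "latents".
t = np.array([0.0, 1.0, 2.0, 3.0])
i = np.array([4.0, 3.0, 2.0, 1.0])
fused = late_fuse(t, i, alpha=0.5)
# fused is [2.0, 2.0, 2.0, 2.0]
```

With `alpha=0.5` this reduces to a simple average of the two branches, the least-commitment choice when neither modality is known in advance to be more reliable.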