Learning to Reconstruct Multi-Modal 3D Object Representations

Overview of the proposed pipeline, which learns object embeddings to reconstruct 3D Gaussians and support downstream tasks such as visual localization. (a) The method takes a mesh or point cloud of an object along with posed images observing it. The canonical object space is voxelized based on the object geometry, and DINOv2 features extracted from the images are assigned to each voxel. This produces a 64 x 64 x 64 x 8 structured latent (SLat) representation. (b) The SLat is further compressed into a 16 x 16 x 16 x 8 U-3DGS embedding using a 3D U-Net. The embedding is trained with a masked mean squared error loss to ensure accurate reconstruction of the SLat, which in turn enables decoding into 3D Gaussians using standard photometric losses. (c) Additional task-specific losses, such as those for visual localization, can be incorporated to optimize the embedding for multiple objectives.
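
Below is a minimal PyTorch sketch of step (b) under stated assumptions: a 3D convolutional encoder-decoder compresses the 64 x 64 x 64 x 8 SLat into a 16 x 16 x 16 x 8 embedding and is trained with a masked mean squared error on occupied voxels. The layer widths, the lack of U-Net skip connections, and the random stand-in occupancy mask are illustrative choices rather than the paper's architecture; only the tensor shapes and the masked MSE objective are taken from the caption above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SLatCompressor(nn.Module):
    """Toy 3D encoder-decoder: 64^3 x 8 SLat -> 16^3 x 8 embedding -> reconstructed SLat.

    Layer widths and depths are illustrative guesses, not the paper's 3D U-Net.
    """

    def __init__(self, feat_dim: int = 8, hidden: int = 32):
        super().__init__()
        # Encoder: two stride-2 conv blocks take the 64^3 grid down to 16^3.
        self.encoder = nn.Sequential(
            nn.Conv3d(feat_dim, hidden, kernel_size=3, stride=2, padding=1),  # 64 -> 32
            nn.ReLU(inplace=True),
            nn.Conv3d(hidden, hidden, kernel_size=3, stride=2, padding=1),    # 32 -> 16
            nn.ReLU(inplace=True),
            nn.Conv3d(hidden, feat_dim, kernel_size=1),                       # 16^3 x 8 embedding
        )
        # Decoder: mirror of the encoder, upsampling back to 64^3.
        self.decoder = nn.Sequential(
            nn.ConvTranspose3d(feat_dim, hidden, kernel_size=4, stride=2, padding=1),  # 16 -> 32
            nn.ReLU(inplace=True),
            nn.ConvTranspose3d(hidden, hidden, kernel_size=4, stride=2, padding=1),    # 32 -> 64
            nn.ReLU(inplace=True),
            nn.Conv3d(hidden, feat_dim, kernel_size=1),
        )

    def forward(self, slat: torch.Tensor):
        emb = self.encoder(slat)    # (B, 8, 16, 16, 16) compact embedding
        recon = self.decoder(emb)   # (B, 8, 64, 64, 64) reconstructed SLat
        return emb, recon


def masked_mse(recon: torch.Tensor, slat: torch.Tensor, occupancy: torch.Tensor) -> torch.Tensor:
    """MSE restricted to occupied voxels; `occupancy` is a (B, 1, 64, 64, 64) binary mask."""
    diff = (recon - slat) ** 2 * occupancy
    return diff.sum() / occupancy.sum().clamp(min=1) / slat.shape[1]


if __name__ == "__main__":
    model = SLatCompressor()
    slat = torch.randn(2, 8, 64, 64, 64)                # stand-in for DINOv2-derived voxel features
    occ = (torch.rand(2, 1, 64, 64, 64) > 0.8).float()  # stand-in occupancy from object geometry
    emb, recon = model(slat)
    loss = masked_mse(recon, slat, occ)
    loss.backward()
    print(emb.shape, loss.item())
```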

Besides object-wise 3DGS reconstruction, the proposed Object-X learns per-object embeddings that benefit a number of downstream tasks, such as cross-modal visual localization (via image-to-object matching), 3D scene alignment (via object-to-object matching), and full-scene reconstruction by integrating per-object Gaussian primitives.
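
As a rough illustration of how per-object embeddings could support such matching-based tasks, the hypothetical sketch below retrieves the closest reference object for each query descriptor via cosine similarity. Flattening the 16 x 16 x 16 x 8 embedding into a vector and using greedy nearest-neighbor assignment are simplifying assumptions for illustration, not the matching procedure used in Object-X.

```python
import torch
import torch.nn.functional as F


def match_embeddings(query: torch.Tensor, reference: torch.Tensor) -> torch.Tensor:
    """Nearest-neighbor matching between two sets of flattened object descriptors.

    query:     (Nq, D) descriptors, e.g. derived from a query image's objects
    reference: (Nr, D) descriptors, e.g. flattened U-3DGS embeddings of mapped objects
    Returns the index of the best-matching reference object for each query.
    """
    q = F.normalize(query, dim=-1)
    r = F.normalize(reference, dim=-1)
    similarity = q @ r.t()              # (Nq, Nr) cosine similarities
    return similarity.argmax(dim=-1)    # greedy one-way assignment


if __name__ == "__main__":
    # Flatten hypothetical 16x16x16x8 embeddings into vectors for retrieval.
    ref = torch.randn(10, 16 * 16 * 16 * 8)
    qry = ref[[3, 7]] + 0.01 * torch.randn(2, 16 * 16 * 16 * 8)
    print(match_embeddings(qry, ref))   # expected: tensor([3, 7])
```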
We show a qualitative comparison between 3DGS optimized on all images from the dataset (top) and Object-X (bottom) on a few objects. Object-X reconstructs the objects with visual quality comparable to 3DGS while significantly improving geometric accuracy.
If you find our work useful, please consider citing:
@misc{dilorenzo2025objectxlearningreconstructmultimodal,
  title={Object-X: Learning to Reconstruct Multi-Modal 3D Object Representations},
  author={Gaia Di Lorenzo and Federico Tombari and Marc Pollefeys and Daniel Barath},
  year={2025},
  eprint={2506.04789},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2506.04789},
}