CrossJEPA: Cross-Modal Joint-Embedding Predictive Architecture for Efficient 3D Representation Learning from 2D Images

Under review, 2025

This work introduces a simple, purely cross-modal take on the Joint-Embedding Predictive Architecture (JEPA) for 3D vision. Instead of masking, it trains a predictor on 3D point clouds to infer the latent embeddings of 2D rendered views produced by a frozen image foundation encoder, conditioned on known projection parameters. We cache the target image embeddings once to cut pretraining cost, keeping the model compact. This JEPA-style latent view prediction yields rich point cloud representations, requires the fewest GPU pretraining hours and learnable parameters among comparable approaches, and avoids the cross-modal inconsistencies introduced by masking. The resulting representations achieve strong performance on standard downstream benchmarks.
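To make the latent view-prediction objective concrete, the sketch below shows one plausible pretraining step under the assumptions stated in the comments: a point cloud encoder feeds a small predictor conditioned on the view's projection parameters, and the predictor regresses a cached embedding from the frozen image encoder. All module names, dimensions, and the smooth-L1 loss are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Illustrative sketch of a CrossJEPA-style training step (names and sizes are
# assumptions, not the paper's code). A point cloud feature, conditioned on the
# rendering camera's projection parameters, is mapped to the cached embedding
# of the corresponding 2D rendered view from a frozen image foundation encoder.

class CrossModalPredictor(nn.Module):
    def __init__(self, point_dim=384, cam_dim=16, target_dim=768, hidden=512):
        super().__init__()
        self.point_proj = nn.Linear(point_dim, hidden)
        self.cam_embed = nn.Linear(cam_dim, hidden)   # projection-parameter conditioning
        self.mlp = nn.Sequential(
            nn.GELU(), nn.Linear(hidden, hidden),
            nn.GELU(), nn.Linear(hidden, target_dim),
        )

    def forward(self, point_feat, cam_params):
        # point_feat: (B, point_dim) global point cloud feature
        # cam_params: (B, cam_dim) known projection parameters of the rendered view
        h = self.point_proj(point_feat) + self.cam_embed(cam_params)
        return self.mlp(h)

def pretraining_step(point_encoder, predictor, points, cam_params, cached_target):
    """One step: predict the frozen image encoder's embedding of a rendered view.
    `cached_target` (B, target_dim) is precomputed once with the frozen encoder,
    so the image model is never run during pretraining."""
    point_feat = point_encoder(points)          # (B, point_dim)
    pred = predictor(point_feat, cam_params)    # (B, target_dim)
    # Latent-space regression; the exact loss (smooth L1 here) is an assumption.
    return F.smooth_l1_loss(pred, cached_target)
```

Because the image targets are cached and the image encoder stays frozen, only the point cloud encoder and the small predictor carry gradients, which is what keeps the pretraining cost and parameter count low.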