Self-supervised Monocular Depth and Pose Estimation for Endoscopy with Latent Priors.
Xu Z., Li B., Hu Y., Zhang C., East J., Ali S., Rittscher J.
Accurate 3D reconstruction in endoscopy enables quantitative and holistic lesion characterization within the gastrointestinal (GI) tract. Achieving this requires reliable depth and pose estimation. However, endoscopy systems are monocular, and existing methods that rely on synthetic datasets or complex models often generalize poorly under challenging endoscopic conditions. We propose a robust self-supervised monocular depth and pose estimation framework that incorporates a StyleGAN-based generator and a Variational Autoencoder (VAE). The StyleGAN generator leverages extensive depth scenes from natural images to condition the depth network, enhancing the realism and robustness of depth predictions through latent feature priors. We reformulate pose estimation within the VAE framework, treating pose transitions as latent variables to regularize scale, stabilize z-axis prominence, and improve x-y sensitivity. To further enhance pose stability and generalizability, we introduce a prior transfer module that distills motion knowledge from natural-scene SLAM systems. Specifically, pose priors from a SLAM model pretrained with supervision on large-scale natural scene datasets are used to guide the latent distribution of pose through a KL-divergence reparameterization. This mechanism transfers structural motion priors into the endoscopic domain, improving trajectory consistency under challenging conditions. This dual refinement pipeline enables accurate depth and pose predictions, effectively addressing the GI tract's complex textures and lighting. Extensive evaluations on the SimCol, C3VD, and EndoSLAM datasets confirm our framework's superior performance over published self-supervised methods in endoscopic depth and pose estimation. All data descriptions and code are available at https://github.com/EricXuziang/Self-supervised-with-Latent-Priors.git.
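The KL-divergence reparameterization mentioned in the abstract can be sketched in general terms for diagonal Gaussians: a pose latent is sampled from the posterior via the reparameterization trick, and a KL term pulls the posterior toward a prior distribution (here standing in for the SLAM-derived pose prior). This is a minimal illustrative sketch, not the paper's implementation; the 6-DoF latent size and the prior's parameter values are assumptions chosen for the example.

```python
import numpy as np

def kl_diag_gaussians(mu_q, logvar_q, mu_p, logvar_p):
    """KL( N(mu_q, var_q) || N(mu_p, var_p) ) for diagonal Gaussians, summed over dims."""
    var_q = np.exp(logvar_q)
    var_p = np.exp(logvar_p)
    kl = 0.5 * (logvar_p - logvar_q + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0)
    return kl.sum()

def reparameterize(mu, logvar, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, I), keeping sampling differentiable
    with respect to (mu, logvar) in an autodiff setting."""
    return mu + np.exp(0.5 * logvar) * rng.standard_normal(mu.shape)

# Toy 6-DoF pose latent: posterior as predicted by a pose network,
# prior standing in for a pretrained SLAM model's motion prior (hypothetical values).
rng = np.random.default_rng(0)
mu_q, logvar_q = np.zeros(6), np.zeros(6)      # posterior N(0, I)
mu_p, logvar_p = np.full(6, 0.1), np.zeros(6)  # prior N(0.1, I)

z = reparameterize(mu_q, logvar_q, rng)        # sampled pose latent
kl = kl_diag_gaussians(mu_q, logvar_q, mu_p, logvar_p)
# Per dimension: 0.5 * (0.1)^2 = 0.005; summed over 6 dims -> 0.03
print(round(float(kl), 4))
```

In training, this KL term would be added to the self-supervised photometric loss so that minimizing it regularizes the pose latent toward the prior while the reparameterized sample still flows through the rest of the pipeline.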
