
A Single GPU Is All You Need for Self-Supervised Pretraining
In this episode I sat down with Lakshay Sharma, a machine learning scientist at Instacart and former member of Microsoft’s geospatial AI team, to discuss self-supervised learning for remote sensing and his recent research on efficient pretraining for semantic segmentation. Lakshay explains the evolution of self-supervised learning, covering predictive, generative, and contrastive approaches, and discusses how foundation models such as DINO have transformed computer vision and geospatial machine learning. We explore the unique challenges of applying these techniques to remote sensing imagery, where assumptions that work for natural images often break down.

We then dive into Lakshay’s recent paper, Sub-Image Overlap Prediction: Task-Aligned Self-Supervised Pretraining for Semantic Segmentation in Remote Sensing Imagery, presented at the Computer Vision for Earth Observation Workshop at WACV 2026. He walks through the intuition behind the method, which trains models to localize extracted sub-images within larger scenes as a proxy task for semantic segmentation. We discuss the experimental setup, comparisons against established self-supervised learning approaches, and the surprising finding that the method achieves competitive or superior results using only thousands of pretraining images rather than millions. Along the way, we explore transfer learning across datasets, the growing importance of data efficiency, and why targeted pretraining may offer a compelling alternative to increasingly resource-intensive foundation model development for niche geospatial applications.

* 📺 Video of this conversation on YouTube
* 👤 Lakshay on LinkedIn
* 🖥️ Personal website of Lakshay
* 📖 Paper

Bio: Lakshay Sharma is a Senior Machine Learning Scientist / Engineer at Instacart. His research spans Computer Vision (CV) and Vision-Language Models (VLMs) with a focus on Self-Supervised and Semi-Supervised Learning.
He previously worked at Microsoft on multi-modal representation learning and on the use of aerial, satellite, and streetside imagery for maps and geospatial applications. He also worked at Amazon, where he focused on representation learning for videos. Based in New York City, Lakshay is an avid fan of soccer, snowboarding, and cricket. He often daydreams of someday applying his computer vision chops to sports.

This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit www.satellite-image-deep-learning.com















