
Ep. 247 - Part 3 - June 13, 2024
ArXiv Computer Vision research for Thursday, June 13, 2024.

00:21: LRM-Zero: Training Large Reconstruction Models with Synthesized Data
01:56: Scale-Invariant Monocular Depth Estimation via SSI Depth
03:08: GGHead: Fast and Generalizable 3D Gaussian Heads
04:55: Multiagent Multitraversal Multimodal Self-Driving: Open MARS Dataset
06:34: Towards Vision-Language Geo-Foundation Model: A Survey
08:11: SimGen: Simulator-conditioned Driving Scene Generation
09:44: Exploring the Spectrum of Visio-Linguistic Compositionality and Recognition
11:03: Sagiri: Low Dynamic Range Image Enhancement with Generative Diffusion Prior
12:32: LLAVIDAL: Benchmarking Large Language Vision Models for Daily Activities of Living
13:56: WonderWorld: Interactive 3D Scene Generation from a Single Image
15:21: Modeling Ambient Scene Dynamics for Free-view Synthesis
16:29: Too Many Frames, not all Useful: Efficient Strategies for Long-Form Video QA
17:50: Aligning Vision Models with Human Aesthetics in Retrieval: Benchmarks and Algorithms
19:39: Real-Time Deepfake Detection in the Real-World
21:17: OmniTokenizer: A Joint Image-Video Tokenizer for Visual Generation
23:02: Yo'LLaVA: Your Personalized Language and Vision Assistant
24:30: MMScan: A Multi-Modal 3D Scene Dataset with Hierarchical Grounded Language Annotations
26:26: Instruct 4D-to-4D: Editing 4D Scenes as Pseudo-3D Scenes Using 2D Diffusion
28:03: Visual Sketchpad: Sketching as a Visual Chain of Thought for Multimodal Language Models
29:59: ConsistDreamer: 3D-Consistent 2D Diffusion for High-Fidelity Scene Editing
31:24: 4M-21: An Any-to-Any Vision Model for Tens of Tasks and Modalities
33:16: Towards Evaluating the Robustness of Visual State Space Models
34:57: Data Attribution for Text-to-Image Models by Unlearning Synthesized Images
36:09: CodedEvents: Optimal Point-Spread-Function Engineering for 3D-Tracking with Event Cameras
37:37: Scene Graph Generation in Large-Size VHR Satellite Imagery: A Large-Scale Dataset and A Context-Aware Approach
40:02: MuirBench: A Comprehensive Benchmark for Robust Multi-image Understanding
41:40: Explore the Limits of Omni-modal Pretraining at Scale
42:46: Interpreting the Weight Space of Customized Diffusion Models
43:58: Depth Anything V2
45:12: An Image is Worth More Than 16x16 Patches: Exploring Transformers on Individual Pixels
46:23: Alleviating Distortion in Image Generation via Multi-Resolution Diffusion Models
48:11: Rethinking Score Distillation as a Bridge Between Image Distributions
49:44: VideoGPT+: Integrating Image and Video Encoders for Enhanced Video Understanding





