Outstanding Paper Award
xGen-MM (BLIP-3): A Family of Open Large Multimodal Models
Le Xue, Manli Shu, Anas Awadalla, Jun Wang, An Yan, Senthil Purushwalkam, Honglu Zhou, Viraj Uday Prabhu, Yutong Dai, Michael S Ryoo, Shrikant Kendre, Jieyu Zhang, Can Qin, Shu Zhang, Chia-Chih Chen, Ning Yu, Juntao Tan, Tulika Manoj Awalgaonkar, Shelby Heinecke, Huan Wang, Zeyuan Chen, Silvio Savarese, Juan Carlos Niebles, Caiming Xiong, Ran Xu
Response Wide Shut: Surprising Observations in Basic Vision-Language Model Capabilities
Shivam Chandhok, Wan-Cyuan Fan, Leonid Sigal
Can Visual Language Models Replace OCR-Based Visual Question Answering Pipelines in Production? A Case Study in Retail
Bianca Lamm, Janis Keuper
Visual Prompt Engineering for Medical Vision Language Models in Radiology
Stefan Denner, Markus Ralf Bujotzek, Dimitrios Bounias, David Zimmerer, Raphael Stock, Paul F Jaeger, Klaus Maier-Hein
Bootstrap Segmentation Foundation Model under Distribution Shift via Object-Centric Learning
Luyao Tang, Yuxuan Yuan, Chaoqi Chen, Kunze Huang, Xinghao Ding, Yue Huang
Annotation-Free Semantic Segmentation with Vision Foundation Models
Soroush Seifi, Daniel Olmeda Reino, Fabien Despinoy, Rahaf Aljundi
Zero-to-Hero: Enhancing Zero-Shot Novel View Synthesis via Attention Map Filtering
Ido Sobol, Chenfeng Xu, Or Litany
INQUIRE: A Natural World Text-to-Image Retrieval Benchmark
Edward Vendrow, Omiros Pantazis, Alexander Shepard, Gabriel Brostow, Kate E. Jones, Oisin Mac Aodha, Sara Beery, Grant Van Horn
How to Determine the Preferred Image Distribution of a Black-Box Vision-Language Model at Inference Time?
Saeid Asgari, Joseph George Lambourne, Alana Mongkhounsavath
PlotTwist: Vision-Language Models Struggle with Reasoning Over Mathematical Plots
Pulkit Madan, Sanjay Haresh, Apratim Bhattacharyya, Litian Liu, Reza Pourreza, Sunny Panchal, Roland Memisevic
Adversarial Attacks on Text-Recognizable Foundation Models: Optimized Search Space Reduction via Skeletonization
Haruto Namura, Masatomo Yoshida, Nicola Adami, Masahiro Okuda
Vision-Language Models Do Not Understand Negation
Kumail Alhamoud, Shaden Naif Alshammari, Yonglong Tian, Guohao Li, Philip Torr, Yoon Kim, Marzyeh Ghassemi
Analyzing CLIP’s Performance Limitations in Multi-Object Scenarios: A Controlled High-Resolution Study
Reza Abbasi, Ali Nazari, Aminreza Sefid, Mohammadali Banayeeanzade, Mohammad Hossein Rohban, Mahdieh Soleymani Baghshah
ComTie: Fine-Grained Compositional Metric for Text-to-Image Evaluation
Amirmohammad Izadi, Seyed Mohammad Hadi Hosseini, Ali Abdollahi, Armin Saghafian, Mahdieh Soleymani Baghshah