(EVAL-FoMo 24)
September 30 - Milan
Hall: Amber 5
New York University
University of Oxford
University of Illinois Urbana-Champaign
Massachusetts Institute of Technology
Massachusetts Institute of Technology
Massachusetts Institute of Technology
*Due to unforeseen circumstances, Antonio Torralba is unable to attend in person. He will be represented by Tamar Rott Shaham, a senior researcher from his group.
This workshop focuses on analyses and evaluations that identify emerging visual capabilities and pinpoint visual limits in foundation models.
The computer vision landscape is evolving dramatically thanks to the capabilities of foundation models. These models exhibit capabilities they were not explicitly trained for, i.e., emergent abilities. Foundation models surpass standard benchmarks, highlighting the need for new evaluations. The inadequacy of current benchmarks inhibits the understanding and characterization of the emergent capabilities of these models (are they a mirage?). Moreover, limits in the visual abilities of these models have recently come into the spotlight, pointing to the need for innovative evaluations to identify these shortcomings. Below, we give examples of such emergent visual abilities (and limits), with the goal of inspiring, not limiting, relevant contributions to our workshop:
Visual Reasoning: Investigate how foundation models explicitly or implicitly support spatial, temporal, and causal reasoning about the visual world.
Visual Grounding: For example, are the generated outputs of multimodal LLMs grounded in the visual input, or are they merely likely responses from the LLM?
Interpreting for Emergence: Probe neurons and latent representations to identify whether they form a semantic or mechanistic understanding of visual concepts.
Object, Part, and Segment Discovery: Identify the ability to inherently detect objects, their boundaries, and their parts [example].
Prompting: Study how prompting these models makes learned but unobserved capabilities emerge or helps them overcome shortcomings [examples].
Event-Centric Understanding: Capabilities that go beyond object-centric understanding by leveraging relationships, context, and external knowledge.
Visual Imagination: E.g., do LLMs have spatial and visual understanding while solving tasks? [example]
Invariant Understanding: Do models form similar embeddings for different representations of the same concept?
Semantic Metrics: Do the embedding spaces, and the distance metrics defined over them, capture specific stylistic or semantic aspects of the input? [example]
Understanding 3D: E.g., do foundation models trained on 2D data show 3D understanding? [example]
Hallucination: How do we evaluate the hallucinated content of image generation models? How do we identify the hallucinations of VLMs in tasks such as image captioning and VQA?
Visual Abstraction: Are these models capable of visual abstraction similar to that of humans? Do they follow Gestalt perceptual principles?
Visual Chain of Thought: Analyzing chain-of-thought reasoning for solving visual tasks in multimodal LLMs.
Visual In-Context Learning: How and to what extent can foundation models learn novel tasks in context with limited examples?
Miscellaneous visual capabilities (or shortcomings), such as understanding quantity and count, text understanding, fine-grained recognition, and understanding the direction and orientation of objects.
Submission Deadline: July 31 23:59 GMT
Submission Portal: EVAL-FoMo
Submissions now open!
Final Decisions: Aug 24
The workshop has no proceedings.
We accept papers that have already been submitted elsewhere (e.g., ECCV 2024 main conference submissions), and papers submitted to the workshop can later be submitted to future venues.
We accept extended abstracts only. Extended abstracts are a maximum of 4 pages (excluding references) in the CVPR_24_Author_Kit format. (We use the CVPR format to provide more space for additional content if needed.)
There will be no supplementary material. We limit submissions exclusively to the 4-page extended abstract format to focus on showcasing only the main ideas and insights.
Submissions will remain anonymous and will not be made public on OpenReview.
Reviewing is double-blind.
UC Berkeley, BAIR
University of Chicago
University of Oxford