[R] Evaluating MLLMs with Child-Inspired Cognitive Tasks

Hey there, we’re sharing KidGym, an interactive 2D grid-based benchmark for evaluating MLLMs in continuous, trajectory-based interaction, accepted to ICLR 2026.

Motivation: Many existing MLLM benchmarks are static and test isolated skills, so they characterize model capabilities poorly in continuous, interactive settings. Inspired by the Wechsler Intelligence Scale for Children (WISC), we organize evaluation into five cognitive dimensions and design tasks that probe both single abilities and compositions of abilities.

[Figure: Previews of the 12 tasks in KidGym]

KidGym Features:

  • 5 abilities: Execution, Memory, Learning, Planning, Perception Reasoning
  • 12 task categories × 3 difficulty levels, covering single-ability and compositional tasks
  • Randomized layouts and diverse scenarios to emphasize generalization beyond memorization / data leakage
  • LLM-friendly interaction design: backpack system, hint panel, item indexing, and high-level actions
  • Gym-style API for easy customization, extension, and reuse by the community
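
To give a feel for what a Gym-style interaction loop with a backpack, a hint panel, item indexing, and high-level actions might look like, here is a minimal sketch. Everything in it is an assumption for illustration: `ToyPickupEnv`, `run_episode`, and `hint_following_policy` are hypothetical names, the task is a toy, and the `reset`/`step` return shapes simply follow the common Gymnasium convention. It is not KidGym's actual API; check the GitHub repo for the real interface.

```python
import random


class ToyPickupEnv:
    """Hypothetical toy task for illustration; NOT KidGym's real environment."""

    def __init__(self, size=4, n_items=3, max_steps=5):
        self.size, self.n_items, self.max_steps = size, n_items, max_steps

    def reset(self, seed=None):
        rng = random.Random(seed)
        cells = [(r, c) for r in range(self.size) for c in range(self.size)]
        # Randomized layout: indexed items land on distinct random cells.
        self.items = dict(enumerate(rng.sample(cells, self.n_items)))
        self.target = rng.randrange(self.n_items)
        self.backpack = []
        self.steps = 0
        return self._obs(), {}

    def _obs(self):
        # Observation exposes item indices, the backpack, and a text hint.
        return {
            "items": dict(self.items),
            "backpack": list(self.backpack),
            "hint": f"collect item {self.target}",
        }

    def step(self, action):
        # High-level action ("pick", item_index) instead of low-level moves.
        kind, index = action
        self.steps += 1
        if kind == "pick" and index in self.items:
            del self.items[index]
            self.backpack.append(index)
        terminated = self.target in self.backpack
        truncated = not terminated and self.steps >= self.max_steps
        reward = 1.0 if terminated else 0.0
        return self._obs(), reward, terminated, truncated, {}


def run_episode(env, policy, seed=None):
    """Roll out one trajectory and return (total reward, steps taken)."""
    obs, _ = env.reset(seed=seed)
    total, done = 0.0, False
    while not done:
        obs, reward, terminated, truncated, _ = env.step(policy(obs))
        total += reward
        done = terminated or truncated
    return total, env.steps


def hint_following_policy(obs):
    # Stand-in for a model call: read the target index off the hint text.
    return ("pick", int(obs["hint"].split()[-1]))
```

In a real evaluation the policy would be replaced by a call to the model under test, and the same loop would be repeated over seeds so that scores reflect randomized layouts rather than one memorized scene.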

[Figure: Five-dimensional capability radar chart]

Findings:

We find that while strong models can perform very well on some single-ability tasks, performance drops noticeably on tasks requiring:

  • Abstract / non-semantic visual reasoning
  • Numerical sensitivity / counting
  • Multi-rule coordination and compositional reasoning across abilities

We hope KidGym can provide a more fine-grained, interpretable, and interaction-oriented perspective for evaluating multimodal large models.

Feedback and discussion are very welcome!

Paper: https://arxiv.org/abs/2603.20209

Project Page: https://bobo-ye.github.io/KidGym/

GitHub: https://github.com/BoBo-Ye/KidGym

submitted by /u/Matwe_