MTU-Bench

A Multi-Granularity Tool-Use Benchmark
for Large Language Models

Alibaba Group.
*Equal contribution; +Corresponding author
ICLR 2025 Poster

Introduction

Large Language Models (LLMs) have shown substantial improvements in reasoning and decision-making and can hold natural conversations with users. Many tool-use benchmark datasets have been proposed recently, but existing datasets have two limitations: (1) insufficient evaluation scenarios (e.g., they cover only a limited set of tool-use scenes), and (2) high evaluation costs (e.g., GPT API costs). To address these limitations, we propose MTU-Bench, a multi-granularity tool-use benchmark for large language models. "Multi-granularity" means that MTU-Bench covers five tool-usage scenes: single-turn single-tool, single-turn multiple-tool, multiple-turn single-tool, multiple-turn multiple-tool, and out-of-distribution tasks. In addition, all evaluation metrics in MTU-Bench are computed from the prediction results and the ground truth, without any GPT-based or human evaluation. MTU-Bench is collected by transforming existing high-quality datasets to simulate real-world tool-usage scenarios. We also propose an instruction dataset, MTU-Instruct, to enhance the tool-use abilities of existing LLMs. Comprehensive experimental results demonstrate the effectiveness of MTU-Bench.
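To illustrate what scoring against the ground truth without a GPT or human judge can look like, here is a minimal Python sketch. It is not the official MTU-Bench scoring code: the `ToolCall` structure and the names `tool_selection_match`, `exact_match`, and `accuracy` are hypothetical, chosen only for illustration, and the benchmark's actual metrics may be defined differently.

```python
import json
from dataclasses import dataclass, field


@dataclass
class ToolCall:
    """Hypothetical representation of one tool call (name plus arguments)."""
    name: str
    arguments: dict = field(default_factory=dict)


def _canon(call: ToolCall) -> tuple:
    # Serialize arguments with sorted keys so key order does not affect matching.
    return (call.name, json.dumps(call.arguments, sort_keys=True))


def tool_selection_match(pred: list, gold: list) -> bool:
    """True if the predicted tool names equal the gold tool names (order-insensitive)."""
    return sorted(c.name for c in pred) == sorted(c.name for c in gold)


def exact_match(pred: list, gold: list) -> bool:
    """True if tool names and all arguments match the gold calls (order-insensitive)."""
    return sorted(_canon(c) for c in pred) == sorted(_canon(c) for c in gold)


def accuracy(pairs, match_fn) -> float:
    """Fraction of (prediction, ground truth) pairs accepted by match_fn."""
    pairs = list(pairs)
    if not pairs:
        return 0.0
    return sum(match_fn(p, g) for p, g in pairs) / len(pairs)


if __name__ == "__main__":
    gold = [ToolCall("get_weather", {"city": "Paris"})]
    samples = [
        ([ToolCall("get_weather", {"city": "Paris"})], gold),  # fully correct
        ([ToolCall("get_weather", {"city": "Rome"})], gold),   # right tool, wrong argument
    ]
    print("tool selection accuracy:", accuracy(samples, tool_selection_match))
    print("exact match accuracy:", accuracy(samples, exact_match))
```

Because both functions compare a prediction directly with a reference, scoring is deterministic and incurs no API cost, which is the property the paragraph above describes.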

Leaderboard on Normal Set

Leaderboard on Hard Set

Ablation Studies and Analysis

Error Analysis

BibTeX

@inproceedings{Wang2024mtubench,
  title={MTU-Bench: A Multi-granularity Tool-Use Benchmark for Large Language Models},
  author={Pei Wang and Yanan Wu and Noah Wang and Jiaheng Liu and Xiaoshuai Song and Z.Y. Peng and Ken Deng and Chenchen Zhang and Jiakai Wang and Junran Peng and Ge Zhang and Hangyu Guo and Zhaoxiang Zhang and Wenbo Su and Bo Zheng},
  year={2024},
  url={https://openreview.net/forum?id=6guG2OlXsr},
  pdf={https://openreview.net/pdf?id=6guG2OlXsr}
}