STATE-Bench: Memory-agnostic Benchmark
Microsoft Developer introduces STATE-Bench (Stateful Task Agent Evaluation Benchmark), an open-source, memory-agnostic benchmark that measures whether memory improves AI agent performance on realistic, stateful enterprise tasks.
Overview
STATE-Bench evaluates AI agents beyond simple recall by focusing on how well they execute procedural, stateful workflows and how they behave across repeated runs.
What STATE-Bench is
- STATE-Bench stands for Stateful Task Agent Evaluation Benchmark.
- It is designed to test production-readiness characteristics of agents on realistic enterprise tasks.
- It is memory-agnostic and supports a “bring your own memory” approach.
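To make "bring your own memory" concrete, here is a minimal sketch of how an evaluation harness can stay independent of any memory backend. It is not STATE-Bench's actual API: the names `Memory`, `NoMemory`, `KeywordMemory`, `run_task`, and `toy_agent` are all hypothetical and exist only for illustration.

```python
from typing import Callable, Protocol


class Memory(Protocol):
    """Hypothetical interface: any backend with these two methods can be plugged in."""

    def store(self, text: str) -> None: ...

    def retrieve(self, query: str) -> list[str]: ...


class NoMemory:
    """Baseline backend that remembers nothing."""

    def store(self, text: str) -> None:
        pass

    def retrieve(self, query: str) -> list[str]:
        return []


class KeywordMemory:
    """Toy backend: returns stored notes that share at least one word with the query."""

    def __init__(self) -> None:
        self.notes: list[str] = []

    def store(self, text: str) -> None:
        self.notes.append(text)

    def retrieve(self, query: str) -> list[str]:
        words = set(query.lower().split())
        return [n for n in self.notes if words & set(n.lower().split())]


def run_task(agent: Callable[[str, list[str]], str], memory: Memory, task: str) -> str:
    """Run one task; the harness only ever touches the Memory interface."""
    context = memory.retrieve(task)
    result = agent(task, context)
    memory.store(f"{task} -> {result}")
    return result


def toy_agent(task: str, context: list[str]) -> str:
    # Stand-in for a real agent: it only reports how many notes it received.
    return f"done with {len(context)} notes"
```

Because `run_task` depends only on the two-method interface, the same tasks can be run with `NoMemory()` and with any other backend, and the results compared to see whether memory helped.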
Why traditional memory benchmarks fall short
- Many existing memory benchmarks emphasize recall-style tests, which check whether an agent can retrieve a stored fact.
- STATE-Bench targets stateful task execution, where success depends on reliably completing multi-step procedures rather than remembering isolated facts.
What STATE-Bench measures
STATE-Bench evaluates agent performance across dimensions such as:
- Procedural workflow handling (stateful, multi-step tasks)
- Reliability across repeated runs
- Efficiency
- User experience
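As an illustration of the "reliability across repeated runs" dimension, the sketch below summarizes the outcomes of running the same task several times. The function `repeated_run_stats` and the two metrics it reports are assumptions made for this example, not the scoring that STATE-Bench itself defines.

```python
from typing import Callable


def repeated_run_stats(run_once: Callable[[int], bool], k: int) -> dict[str, float]:
    """Run the same task k times and summarize how consistently it succeeds.

    run_once takes a run index and returns True if that run succeeded.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    outcomes = [run_once(i) for i in range(k)]
    return {
        # Fraction of runs that succeeded.
        "success_rate": sum(outcomes) / k,
        # 1.0 only if every run succeeded; a stricter view of reliability.
        "all_passed": float(all(outcomes)),
    }


def flaky_task(run_index: int) -> bool:
    # Hypothetical stand-in for an agent run: fails on every fourth attempt.
    return run_index % 4 != 0


stats = repeated_run_stats(flaky_task, k=8)
```

An agent that succeeds on most runs can still score zero on the stricter "all runs passed" measure, which is why repeated-run behavior is worth reporting separately from a single-run result.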
It includes domains such as:
- Customer support
- Travel
- Shopping
How to contribute and learn more
- GitHub repository: https://github.com/microsoft/STATE-Bench
- Related video, "Using Microsoft Agent Framework with Foundry managed memory": https://youtu.be/DZn9bNDEs4U?si=IV2itRlRjMXPYQl8
- Short link for this video: https://aka.ms/memory-benchmark
Video chapters
- 00:00 What is Project STATE-Bench
- 03:45 Why this benchmark is different
- 13:06 How it works
- 18:57 What's Next and How to Contribute
- 20:58 Final statements