Benchmarks
Verification benchmarks are carefully specified problems used to evaluate and compare the capabilities of formal verification tools. VSComp problems have served this purpose since the competition's inception, providing a shared set of challenges against which tool developers and researchers can measure progress. This page explains what verification benchmarks are, how VSComp problems function as benchmarks, and which tools have been applied to them. For the problems themselves, see the problem catalogue; for background on the techniques involved, visit the formal verification overview.

What Are Verification Benchmarks?
A verification benchmark is a problem — typically a programming task with a formal specification — that is used to evaluate the effectiveness of a verification tool or technique. Good benchmarks are precisely specified so that there is no ambiguity about what constitutes a correct solution. They are grounded in realistic programming patterns so that the results are relevant to practical software development. And they cover a range of difficulty levels and problem domains so that different aspects of a tool's capabilities can be assessed.
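To make this concrete, the sketch below shows a small illustrative task of the kind a benchmark might pose: an integer absolute-value function written in C with an ACSL contract of the sort checked by tools such as Frama-C. The example (and the name abs_int) is hypothetical rather than taken from any VSComp edition; the point is that the precondition and postconditions leave no room for disagreement about what a correct solution must do.

```c
#include <limits.h>

/*@ requires x > INT_MIN;
  @ assigns \nothing;
  @ ensures \result >= 0;
  @ ensures \result == x || \result == -x;
  @*/
int abs_int(int x) {
    /* The precondition excludes INT_MIN, whose negation would overflow. */
    return x < 0 ? -x : x;
}
```

Even this tiny contract illustrates why precision matters: without the precondition, negating INT_MIN would overflow, and the postconditions could not be soundly established for every input.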
Benchmarks matter because claims about verification tools are difficult to evaluate in the abstract. A tool developer might assert that their system handles pointer reasoning or concurrency effectively, but without concrete evidence — a set of problems solved, a comparison with other tools — such claims are hard to assess. Benchmarks provide that evidence. They turn qualitative claims into quantitative data: which problems a tool can solve, how much human effort is required, how long the automated reasoning takes, and how complete the resulting proof is.
How VSComp Problems Serve as Benchmarks
The problems published for each VSComp edition were designed primarily as competition challenges, but they have acquired a secondary life as benchmarks. This was no accident: the competition organisers recognised from the outset that a well-specified, publicly available set of verification challenges would have value beyond the competition itself. In practice, VSComp problems have been adopted as benchmarks by tool developers, used in academic papers as case studies, and incorporated into university courses as teaching exercises.
Several properties make VSComp problems effective benchmarks. They span multiple problem domains: sorting algorithms, search procedures, data-structure invariants, arithmetic computations, and concurrent protocols. They vary in difficulty, from problems that can be handled by most modern tools to problems that challenge even the most capable systems. And they come with documented solutions from multiple tools, providing a basis for comparison.
The competition years pages provide access to the problems from each edition, along with information about how participants approached them. The solutions archive documents the tools and techniques used, making it possible to compare approaches across different verification systems.
Tools Used in VSComp
Over the course of its editions, VSComp has attracted submissions using a wide variety of verification tools. Each tool brings a different approach to the verification task — different specification languages, different proof strategies, different degrees of automation. Understanding these tools provides context for interpreting benchmark results.
Developed at Microsoft Research, Dafny is a verification-aware programming language that integrates specification, implementation, and proof into a single workflow. Programs are annotated with preconditions, postconditions, and loop invariants, and the Dafny verifier automatically generates and discharges proof obligations using the Z3 SMT solver. Dafny has been one of the most frequently used tools in VSComp, and its design has been influenced by the kinds of problems that competitions pose.
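Dafny's own syntax is not reproduced here. For consistency with the other examples on this page, the sketch below illustrates the same annotate-and-verify workflow in ACSL-annotated C, on a hypothetical array_max routine: the contract states what the function computes, the loop invariants carry the argument through the loop, and the verifier discharges the resulting obligations with an SMT solver. In Dafny, the corresponding annotations appear directly in the program text as requires, ensures, invariant, and decreases clauses.

```c
/*@ requires n > 0;
  @ requires \valid_read(a + (0 .. n-1));
  @ assigns \nothing;
  @ ensures \forall integer j; 0 <= j < n ==> a[j] <= \result;
  @ ensures \exists integer j; 0 <= j < n && a[j] == \result;
  @*/
int array_max(const int *a, int n) {
    int best = a[0];
    int i = 1;
    /* The invariants record what has been established for the prefix a[0..i-1]. */
    /*@ loop invariant 1 <= i <= n;
      @ loop invariant \forall integer j; 0 <= j < i ==> a[j] <= best;
      @ loop invariant \exists integer j; 0 <= j < i && a[j] == best;
      @ loop assigns i, best;
      @ loop variant n - i;
      @*/
    while (i < n) {
        if (a[i] > best) {
            best = a[i];
        }
        i++;
    }
    return best;
}
```

Writing suitable invariants is typically where the human effort goes; once they are in place, the remaining obligations are the kind an SMT-backed verifier can usually handle automatically.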
Why3 is a platform for deductive program verification developed at Inria. It provides a rich specification language (WhyML) and interfaces with multiple automated and interactive theorem provers. Why3's ability to delegate proof obligations to different solvers makes it versatile: easy obligations can be handled automatically, while harder ones can be tackled interactively. It has been used effectively in several VSComp editions.
Frama-C is a framework for the analysis of C programs, with a specification language (ACSL) based on first-order logic. Its WP plugin performs deductive verification, generating proof obligations that are discharged by SMT solvers or interactive provers. Frama-C is particularly relevant for benchmarks involving systems-level code, where the semantics of C — pointer arithmetic, memory layout, undefined behaviour — create distinctive verification challenges.
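As a minimal illustration of that kind of reasoning, the sketch below gives an ACSL contract for a hypothetical swap routine (a toy example, not a competition problem): \valid requires that both pointers can be safely dereferenced, assigns bounds the side effects to the two pointed-to cells, and \old lets the postconditions refer to the values held before the call.

```c
/*@ requires \valid(p) && \valid(q);
  @ assigns *p, *q;
  @ ensures *p == \old(*q);
  @ ensures *q == \old(*p);
  @*/
void swap(int *p, int *q) {
    /* Plain three-assignment swap; the contract constrains only *p and *q. */
    int tmp = *p;
    *p = *q;
    *q = tmp;
}
```

The postconditions hold even if p and q alias, since each cell ends up holding the other's original value; contracts over more intricate pointer structures are where the C-specific challenges mentioned above appear.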
KeY is a deductive verification tool for Java programs, based on dynamic logic and a custom sequent calculus. It supports both automatic and interactive proof construction, with a graphical interface that allows users to inspect and guide the proof process. KeY has been used in VSComp to verify Java implementations, bringing object-oriented reasoning to the benchmark suite.
SPARK is a formally defined subset of Ada designed for the development of high-integrity software. Its integrated verification toolset, based on Why3, supports both proof of absence of runtime errors and proof of functional correctness against contracts. SPARK represents the industrial end of the verification spectrum, and its appearance in VSComp entries demonstrates that competition problems are relevant to practitioners as well as researchers.
Coq and Isabelle/HOL are interactive theorem provers that offer some of the highest levels of proof assurance available. Users construct proofs within a rigorous logical framework, and the tool checks every step mechanically. These tools demand more human effort than automated verifiers but produce proofs of exceptional rigour. VSComp submissions using Coq or Isabelle tend to be thorough and detailed, often going beyond the minimum required by the specification.
How Benchmarks Help Evaluate Tool Progress
The value of a benchmark suite lies not just in individual problem-solving but in the ability to track progress over time and compare tools against one another. When the same problem is solved by multiple tools — as happens in VSComp — the resulting solutions can be compared along several dimensions: the degree of automation achieved, the amount of annotation required, the time spent on proof construction, the completeness of the verification, and the clarity of the proof artefacts.
This comparative perspective is one of the most useful outputs of the competition. A problem that required extensive manual proof guidance in 2010 might be handled almost automatically by improved tools in 2017. A specification pattern that caused difficulties for one tool might be straightforward for another. These observations help tool developers identify areas for improvement and help users choose the right tool for their verification needs.
The competition does not produce formal rankings or scores. Instead, the evaluation committee assesses each submission on three dimensions — correctness, completeness, and clarity — and provides qualitative feedback. This approach reflects the nature of verification work: a partial but rigorous proof may be more valuable than a complete but superficial one, and the quality of a solution depends on context in ways that a single numerical score cannot capture.
Comparison Methodology
Comparing verification tools is inherently difficult. Different tools target different languages, use different specification formalisms, and make different trade-offs between automation and expressiveness. A fair comparison requires careful attention to what is being measured and an honest acknowledgement of what the comparison does not capture.
VSComp addresses this by defining problems in language-independent terms. The specification describes what the program should do, not how it should be implemented. This allows participants to choose whatever language and tool they prefer, and the evaluation focuses on the quality of the verification rather than the choice of tool. The result is not a tool ranking but a landscape — a picture of how different approaches perform on different classes of problems.
Researchers who use VSComp problems as benchmarks in their own work should be aware of this context. A tool that solves a particular VSComp problem is demonstrating capability on that problem, not claiming superiority over tools that took a different approach or chose different problems to focus on. The formal verification overview provides background on the different techniques that these tools employ.
The problem catalogue lists every verification challenge from every VSComp edition, with domain tags and difficulty information.
The solutions archive documents how participants applied verification tools to competition problems across all editions.
The years index provides access to edition-specific information including problems, tools used, and participation details.