Security researchers from several universities have introduced ARVO, a large-scale vulnerability dataset that prioritises reproducibility — a quality long sacrificed in favour of sheer volume when compiling historical bug records.
The dataset, described in a paper posted to arXiv, builds on OSS-Fuzz, Google's continuous fuzzing platform and currently the largest open-source software vulnerability repository. The research team, drawn from institutions including NYU and Arizona State University, developed a methodology to bring full reproducibility to that existing corpus, packaging each vulnerability so it can be consistently rebuilt, triggered, and analysed across different software versions.
The reproducibility problem
Vulnerability datasets have historically faced a three-way trade-off between reproducibility, quantity, and diversity. In practice, reproducibility has been the dimension most commonly dropped. When a bug cannot be reliably recreated in a controlled environment, researchers cannot easily study how it behaves, how it was patched, or how tools perform against it — limiting the usefulness of the record for both human analysts and automated systems.
ARVO addresses this gap directly by identifying the key obstacles to large-scale bug reproduction and proposing general solutions applicable across projects and programming environments.
What ARVO provides
Beyond simply storing vulnerability records, ARVO automatically identifies the patch corresponding to each bug and supports direct interaction with the vulnerable code after changes have been applied. According to the paper, these capabilities are not available in existing large-scale datasets.
In evaluation, the system successfully reproduced 81% of the vulnerabilities in the corpus and located the correct patch with 89.4% accuracy — figures the authors describe as strong given the scale and diversity of the projects involved.
The dataset spans 311 projects and covers a wide range of vulnerability types, giving downstream researchers access to a representative cross-section of real-world software bugs rather than a narrow or synthetic sample.
Implications for automated security tools
Reproducible vulnerabilities are particularly valuable for training and evaluating automated security tools, including AI-assisted vulnerability detection and patch verification systems. Without reliable reproduction, it is difficult to measure whether a tool has genuinely identified or fixed a bug, or merely produced output that appears correct.
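To make that measurement concrete, here is a minimal Python sketch of a reproduction-based fix check. It is not ARVO's own tooling or interface: the names ReproResult, crashes, and check_fix are hypothetical, and the sketch assumes that each build can be run as a command on the triggering input and that a non-zero exit code (as sanitizer-instrumented fuzz targets typically produce) signals a crash.

```python
import subprocess
import sys
from dataclasses import dataclass
from typing import Sequence


@dataclass
class ReproResult:
    crashes_before: bool  # vulnerable build failed on the triggering input
    crashes_after: bool   # candidate-fixed build failed on the same input

    @property
    def verdict(self) -> str:
        if not self.crashes_before:
            # Without a reproducible crash there is no baseline to compare against.
            return "not-reproduced"
        return "still-vulnerable" if self.crashes_after else "fixed"


def crashes(cmd: Sequence[str], timeout: float = 60.0) -> bool:
    """Run one build on the triggering input and report whether it failed.

    Assumption: a non-zero exit code means a crash, and a hang counts as a failure.
    """
    try:
        proc = subprocess.run(cmd, capture_output=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        return True
    return proc.returncode != 0


def check_fix(vulnerable_cmd: Sequence[str], fixed_cmd: Sequence[str]) -> ReproResult:
    """Compare behaviour on the same input before and after a candidate fix."""
    return ReproResult(crashes(vulnerable_cmd), crashes(fixed_cmd))


# Stand-ins for real builds: one process that exits 1 and one that exits 0.
crashing = [sys.executable, "-c", "import sys; sys.exit(1)"]
clean = [sys.executable, "-c", "pass"]

if __name__ == "__main__":
    print(check_fix(crashing, clean).verdict)
```

In practice the two commands would invoke the vulnerable and patched builds of a real project on the recorded triggering input. The "not-reproduced" verdict illustrates why reproducibility matters here: if the original crash cannot be recreated, a clean run after the fix says nothing about whether the fix worked.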
The authors note that ARVO is intended to influence both upstream practices — encouraging better documentation of bugs as they are discovered — and downstream research, where reproducible datasets can serve as rigorous benchmarks.
The paper is available on arXiv and the dataset is described as openly accessible to the research community.