Researchers Release Dataset of 6,100 Reproducible Open-Source Vulnerabilities to Accelerate Security Research

Security researchers from several universities have introduced ARVO, a large-scale vulnerability dataset that prioritises reproducibility — a quality long sacrificed in favour of sheer volume when compiling historical bug records.

The dataset, described in a paper posted to arXiv, builds on OSS-Fuzz, Google's continuous fuzzing platform and currently the largest open-source software vulnerability repository. The research team, which includes academics from NYU and Arizona State University, developed a methodology to bring full reproducibility to that existing corpus, ultimately packaging each vulnerability so it can be consistently rebuilt, triggered, and analysed across different software versions.

The reproducibility problem

Vulnerability datasets have historically faced a three-way trade-off between reproducibility, quantity, and diversity. In practice, reproducibility has been the dimension most commonly dropped. When a bug cannot be reliably recreated in a controlled environment, researchers cannot easily study how it behaves, how it was patched, or how tools perform against it — limiting the usefulness of the record for both human analysts and automated systems.

ARVO directly addresses this by identifying the key obstacles to large-scale bug reproduction and proposing general solutions applicable across projects and programming environments.

What ARVO provides

Beyond simply storing vulnerability records, ARVO automatically identifies the patch corresponding to each bug and lets researchers rebuild and interact with the vulnerable code after changes have been applied to it. According to the paper, these capabilities are not available in existing large-scale datasets.

In evaluation, the system successfully reproduced 81% of the vulnerabilities in the corpus and located the correct patch with 89.4% accuracy — figures the authors describe as strong given the scale and diversity of the projects involved.

The dataset spans 311 projects and covers a wide range of vulnerability types, giving downstream researchers access to a representative cross-section of real-world software bugs rather than a narrow or synthetic sample.

Implications for automated security tools

Reproducible vulnerabilities are particularly valuable for training and evaluating automated security tools, including AI-assisted vulnerability detection and patch verification systems. Without reliable reproduction, it is difficult to measure whether a tool has genuinely identified or fixed a bug, or merely produced output that appears correct.
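To illustrate why reproduction matters for measurement, here is a minimal sketch of the kind of check a reproducible record makes possible. It is not ARVO's interface: `verify_patch`, `Verdict`, and the two toy runner functions are hypothetical names invented for this example, and each runner is assumed to return True when the build crashes on the given input.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical type: a runner takes an input and reports whether the build crashed.
Runner = Callable[[bytes], bool]


@dataclass(frozen=True)
class Verdict:
    reproduced: bool  # the trigger input crashes the vulnerable build
    fixed: bool       # ...and no longer crashes the patched build


def verify_patch(run_vulnerable: Runner, run_patched: Runner, trigger: bytes) -> Verdict:
    """Judge a candidate patch using a known triggering input."""
    reproduced = run_vulnerable(trigger)
    # A fix can only be judged if the bug reproduces in the first place.
    fixed = reproduced and not run_patched(trigger)
    return Verdict(reproduced=reproduced, fixed=fixed)


# Toy stand-ins for real builds (assumptions for illustration only).
def toy_vulnerable(data: bytes) -> bool:
    return len(data) > 8  # "crashes" on inputs longer than 8 bytes


def toy_patched(data: bytes) -> bool:
    return False  # never crashes


good_patch = verify_patch(toy_vulnerable, toy_patched, b"A" * 16)
no_op_patch = verify_patch(toy_vulnerable, toy_vulnerable, b"A" * 16)
not_reproduced = verify_patch(toy_vulnerable, toy_patched, b"A")
```

In this sketch a patch that leaves the crash in place is reported as not fixed, and a bug that never reproduces yields no verdict on the patch at all. Passing a single triggering input does not show that a fix is complete; it only rules out patches that plainly fail.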

The authors note that ARVO is intended to influence both upstream practices — encouraging better documentation of bugs as they are discovered — and downstream research, where reproducible datasets can serve as rigorous benchmarks.

The paper is available on arXiv and the dataset is described as openly accessible to the research community.

Why This Matters

Reproducible vulnerability datasets are the foundation for training and fairly evaluating automated security tools, including AI-powered patch generation and bug detection; without them, benchmark results are difficult to trust or compare.
As AI-assisted code analysis matures, the quality of underlying training data becomes a critical bottleneck — ARVO directly addresses that gap at significant scale.
Open availability of the dataset means smaller research teams and independent security researchers gain access to infrastructure previously only practical for well-resourced organisations.

Background

Vulnerability datasets have been a contested resource in cybersecurity research for over two decades. Early efforts like the National Vulnerability Database (NVD) and NIST's Software Assurance Reference Dataset (SARD) provided useful catalogues but often lacked the environmental context needed to reproduce bugs reliably. Google's OSS-Fuzz, launched in 2016, dramatically expanded the volume of known vulnerabilities in open-source software through continuous automated fuzzing, but the resulting records were not packaged in a form that allowed consistent, automated reproduction across time or across computing environments.

The problem deepened as machine learning techniques were applied to vulnerability detection. Models trained on historical bug data often performed well on paper but failed in practice because the training data itself was noisy — bugs listed in datasets could not always be confirmed to actually manifest in the code as described. Patch-finding and automated program repair research suffered similar limitations.

ARVO represents one of the most systematic attempts to retrofit reproducibility onto an existing large-scale corpus rather than building a new, smaller, and more carefully curated dataset from scratch.

Key Perspectives

Security researchers and academics: A reproducible, large-scale dataset lowers the barrier to rigorous experimentation, enabling fairer comparisons between tools and more meaningful evaluation of AI-assisted vulnerability detection techniques.
Automated security tool developers: ARVO provides a concrete benchmark corpus against which patch-generation and bug-detection systems can be measured, potentially accelerating development cycles and improving the credibility of published results.
Critics and sceptics: An 81% reproduction rate, while notable at scale, also means roughly one in five vulnerabilities in the dataset still cannot be reliably recreated, a limitation that could introduce bias if users assume the dataset is complete. Additionally, open publication of a large, easily triggered vulnerability corpus raises questions about potential misuse, though the bugs involved are drawn from historical, patched issues.

What to Watch

Whether major AI security tool developers adopt ARVO as a standard benchmark, which would signal broader community acceptance and increase pressure on competing datasets to improve reproducibility.
Updates to the dataset's reproduction rate: the authors report 81%, and improvements toward 90% or more would substantially increase its utility and credibility.
Any policy response from open-source foundations or platform operators regarding how reproducible vulnerability datasets should be governed, shared, or restricted to prevent misuse.
