Philosophy
Omnibenchmark is an automated benchmarking system.
What, precisely, is a benchmark? Here, a benchmark is a conceptual entity that encompasses all the components of a study designed to understand the performance of a set of methods for a given task. The primary context is computational methods, but the concept applies more broadly, e.g., to the evaluation of laboratory-based methods.
A benchmark requires a well-defined task, such as inferring which genes are differentially expressed from a certain type of data (e.g., gene expression from RNA sequencing counts) or determining whether a high-dimensional dataset can be clustered into the 'correct' cell types (this correctness, or ground truth, should be precisely defined in advance). Given a well-defined task, a benchmark consists of benchmark components (datasets, preprocessing steps, methods, metrics). Ideally, a benchmark system organizes and orchestrates the running of the benchmark and creates the benchmark artifacts (code snapshots, file outputs that will be shared, results tables). The downstream results and interpretations, which may involve rankings and various additional analyses, are ultimately derived from these artifacts. Such a system should be public, open, and allow contributions.
Importantly, at the outset of a benchmark, a benchmark definition can be envisioned that gives a formal specification of the entire set of components (and the pattern of artifacts to be generated). The benchmark definition can be expressed as a single YAML file that specifies the scope and topology of components to include, details of the repositories (code implementations with versions), the instructions to create software environments (accessible across compute architectures), and the parameters used.
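To make the idea of a benchmark definition more concrete, the following is a minimal Python sketch of how such a definition might look once loaded into memory. It is not the Omnibenchmark schema: the class names (`Component`, `BenchmarkDefinition`), all field names, the example repository URLs, and the environment path are hypothetical, and preprocessing steps are left out for brevity.

```python
from dataclasses import dataclass, field
from itertools import product


@dataclass(frozen=True)
class Component:
    """One benchmark component (dataset, method, or metric). Hypothetical fields."""
    name: str
    repository: str          # location of the code implementation (placeholder URL)
    version: str             # pinned tag or commit
    parameters: tuple = ()   # (key, value) pairs passed to the component


@dataclass
class BenchmarkDefinition:
    """Hypothetical in-memory form of a benchmark definition."""
    task: str
    environment: str                      # recipe used to build the software environment
    datasets: list = field(default_factory=list)
    methods: list = field(default_factory=list)
    metrics: list = field(default_factory=list)

    def planned_runs(self):
        """Enumerate every dataset/method/metric combination in the topology."""
        return [
            (d.name, m.name, s.name)
            for d, m, s in product(self.datasets, self.methods, self.metrics)
        ]


# Illustrative definition; every value below is made up.
definition = BenchmarkDefinition(
    task="clustering",
    environment="envs/clustering.yaml",
    datasets=[Component("dataset_a", "https://example.org/dataset_a", "v1.0")],
    methods=[
        Component("method_x", "https://example.org/method_x", "v0.3",
                  parameters=(("k", 5),)),
        Component("method_y", "https://example.org/method_y", "v2.1"),
    ],
    metrics=[Component("ari", "https://example.org/ari", "v1.2")],
)

for run in definition.planned_runs():
    print(run)
```

Because the definition is declared up front, the full set of runs (and therefore the pattern of artifacts to expect) can be enumerated before anything is executed.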
A benchmark can also be described in terms of layers, each with its own challenges and opportunities.
There are various ways to structure the running of a benchmark study: it can be run by a single person on a laptop or a server, or it can be organized as a challenge or within a hackathon, where a group of people typically come together to competitively assess their approaches and/or populate benchmark components.
Community
Several roles describe how people can contribute to or interact with a benchmark system.
The benchmarker is responsible for planning and coordinating a benchmark, defining the task, possibly splitting it into subtasks or processing stages, and defining the data formats across the stages; the benchmarker also acts as the authority on how contributions are reviewed and approved.
The contributors curate and add content to the benchmark, which could be new datasets, analysis methods or evaluation metrics, adhering to the guidelines set up by the benchmarker. Finally, the viewers of benchmark results are users who retrieve one or more of the artifacts (datasets, intermediate results, or metric scores). This spans a range of use cases, including a data analyst choosing which method to use for a specific application, an instructor retrieving a curated dataset for teaching purposes, or a methods researcher prototyping their method.
Software
Software plays an important role in the current scientific landscape, since it is widely used throughout the scientific process, including data collection, simulation, analysis and reporting. Researchers are encouraged to publish their data and code to enhance transparency and reproducibility in their work. Ideally, this would increase adoption of the software by the wider scientific community. In practice, however, it is sometimes easier to develop new code than to reuse existing software. This leads to the phenomenon of academic abandonware, where projects are forgotten in code repositories, exacerbating the reproducibility crisis in research. One common reason for these abandoned projects is their failure to adhere to the FAIR (Findable, Accessible, Interoperable, Reusable) principles [1]. Another common cause is the life cycle inherent to academic practice: after delivering a research output, there are often no resources to maintain a software package in the longer term; left alone, the likelihood that a particular library still compiles or runs after 5-10 years decreases drastically.
Operational costs are determined by the expected compute and storage requirements. In the context of benchmarking, predicting CPU and memory usage is challenging. The methods under evaluation typically have variable workloads, and their performance can fluctuate significantly. Benchmark designers could impose resource limits, thereby evaluating the methods under different constraints. Storage costs, on the other hand, are more predictable and can be directly influenced by design decisions of the benchmark: for example, low retention for re-generatable artifacts, cold storage (e.g., Amazon Glacier, Azure Archive Storage) for code archives and software images, and public repositories for method performance artifacts. Cold storage, in particular, is a cost-effective solution for the long-term storage of infrequently accessed data. Developments in IT infrastructure have led to the rise of cloud service providers and the commoditization of standardized compute and storage solutions, enabling these cost optimizations.
Storage costs for a benchmarking system can be efficiently managed by implementing specific storage and retention policies for different types of data. First, datasets and intermediate artifacts, which are often not the primary focus, can be managed with a low retention policy. Since these items are typically not crucial in the long term and can be easily recomputed, this approach helps reduce storage costs. Second, code archives and software environment images can be stored in cold storage, ensuring they remain archived and accessible when needed without incurring high costs. Although retaining method code and software environments may seem unnecessary, taking snapshots of benchmark dependencies helps mitigate issues related to dead links and ensures mid-term reproducibility. Finally, benchmark artifacts can be managed in several ways: they can be stored for the long term, or alternatively, signed with a cryptographic key for later verification, with the responsibility for long-term storage delegated to the end user.
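The retention policies above can be sketched as a simple lookup from artifact kind to policy, together with a way to verify an artifact that an end user has stored elsewhere. This is only an illustration under stated assumptions: the artifact kinds, the `Retention` names, and the helper functions are hypothetical, and a keyed HMAC from the Python standard library stands in for a real public-key signature scheme, which a production system would more likely use.

```python
import hashlib
import hmac
from enum import Enum


class Retention(Enum):
    SHORT = "short retention, recomputable on demand"
    COLD = "cold storage archive"
    LONG = "long-term public storage"


# Hypothetical mapping from artifact kind to retention policy.
RETENTION_BY_KIND = {
    "dataset": Retention.SHORT,
    "intermediate": Retention.SHORT,
    "code_archive": Retention.COLD,
    "environment_image": Retention.COLD,
    "benchmark_artifact": Retention.LONG,
}


def retention_for(kind: str) -> Retention:
    """Look up the retention policy for an artifact kind."""
    return RETENTION_BY_KIND[kind]


def sign_artifact(content: bytes, key: bytes) -> str:
    """Return a keyed SHA-256 digest (HMAC) of the artifact content."""
    return hmac.new(key, content, hashlib.sha256).hexdigest()


def verify_artifact(content: bytes, key: bytes, signature: str) -> bool:
    """Check that the content still matches a previously issued signature."""
    expected = sign_artifact(content, key)
    return hmac.compare_digest(expected, signature)


# Illustrative usage with made-up content and key.
key = b"benchmark-signing-key"
results = b"method,score\nmethod_x,0.91\n"
signature = sign_artifact(results, key)

print(retention_for("code_archive").value)
print(verify_artifact(results, key, signature))
print(verify_artifact(results + b"tampered", key, signature))
```

With this arrangement the benchmark system only needs to keep the signatures, while the artifacts themselves can live wherever the end user chooses to store them.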
Reproducible software environments
Software reproducibility comes from controlling the triad of data, code, and environment. Computationally, a benchmark executes code that transforms input data into outputs. This execution takes place in, and is affected by, the execution environment, which encompasses the base system (OS, compiler toolchains, libraries) and a set of configurations. The benchmark definition should control as much of the execution environment as possible.
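One lightweight way to notice when the execution environment has changed between two runs is to record a description of it and hash that description. The sketch below uses only the Python standard library; the function names and the choice of recorded fields are assumptions made for illustration, and a real system would capture far more (compiler toolchains, library versions, container image digests).

```python
import hashlib
import json
import platform


def describe_environment(extra_config=None):
    """Collect a small, illustrative description of the execution environment."""
    return {
        "os": platform.system(),
        "os_release": platform.release(),
        "machine": platform.machine(),
        "python": platform.python_version(),
        # Configuration values supplied by the caller (hypothetical).
        "config": dict(extra_config or {}),
    }


def fingerprint(description):
    """Hash a canonical JSON form of the description with SHA-256."""
    canonical = json.dumps(description, sort_keys=True).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()


# Two runs with the same configuration share a fingerprint on the same machine;
# changing the configuration changes the fingerprint.
run_a = fingerprint(describe_environment({"threads": 4}))
run_b = fingerprint(describe_environment({"threads": 4}))
run_c = fingerprint(describe_environment({"threads": 8}))

print(run_a == run_b)
print(run_a == run_c)
```

Storing such a fingerprint next to each output makes it possible to tell afterwards whether two results were produced under the same recorded conditions.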
Leaving data aside, it can be useful to divide the codebase used into three distinct categories:
- The benchmark system itself has an impact on operational costs, such as storage needs, execution platform, etc. For example, it imposes a certain tech stack and base dependencies, and orchestrates the execution of a benchmark plan; usually, the system mandates the choice of a particular workflow manager. At this level, any requirements for input/output formats and shapes are defined; different systems can impose different constraints.
- Contributions related to individual benchmark components (e.g., datasets, methods, metrics) are expected to be small, typically short scripts that process data and wrap functionality by importing external libraries (where the component implementation itself is developed). Even so, imposing good practices at this level (output validation, testing for abnormal termination, ability to run with a subset of input data) can be beneficial to increase the quality and maintainability of the contributions. The toolset can also enforce validation of metadata annotations (authorship, versioning) for each contributed module.
- The software dependencies have by far the biggest impact on replicability and maintainability: the number of combinations of a large dependency tree across an arbitrary number of base OS images grows combinatorially, so archiving all of them sharply increases retention and maintenance costs. Turning the long-term replicability of a benchmark into a tractable problem is linked to choosing a sane software management system. Several software management systems have emerged to address these problems, from containerization (e.g., Apptainer) and automating reproducible and efficient software management (e.g., EasyBuild [2], Spack [3]), to distributing software installations (e.g., Compute Canada [4], EESSI [5]).
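The good practices mentioned for contributed components (output validation, testing for abnormal termination, ability to run with a subset of input data) can be sketched as a thin wrapper around a method. This is a hypothetical illustration, not part of any existing toolset: `run_component`, `toy_method`, and the required output keys are invented for the example.

```python
def run_component(method, records, subset=None, required_keys=("id", "score")):
    """Run `method` on each record, optionally on a subset, and validate outputs."""
    if subset is not None:
        # Allow a quick smoke test on the first `subset` records only.
        records = records[:subset]
    try:
        outputs = [method(record) for record in records]
    except Exception as exc:
        # Surface abnormal termination with a uniform error type.
        raise RuntimeError(f"component terminated abnormally: {exc}") from exc
    for output in outputs:
        missing = [key for key in required_keys if key not in output]
        if missing:
            raise ValueError(f"output is missing required keys: {missing}")
    return outputs


def toy_method(record):
    """Stand-in for a contributed method; doubles the input value."""
    return {"id": record["id"], "score": record["value"] * 2}


records = [{"id": i, "value": i} for i in range(5)]
smoke_test = run_component(toy_method, records, subset=2)
print(smoke_test)
```

Running the same wrapper with `subset=None` processes the full input, so the quick check and the real run exercise the same code path.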
[1] Mark D Wilkinson, Michel Dumontier, IJsbrand Jan Aalbersberg, Gabrielle Appleton, Myles Axton, Arie Baak, Niklas Blomberg, Jan-Willem Boiten, Luiz Bonino da Silva Santos, Philip E Bourne, and others. The FAIR guiding principles for scientific data management and stewardship. Scientific Data, 3(1):1–9, 2016.
[2] Kenneth Hoste, Jens Timmerman, Andy Georges, and Stijn De Weirdt. EasyBuild: building software with ease. In 2012 SC Companion: High Performance Computing, Networking Storage and Analysis, 572–582. IEEE, 2012.
[3] Todd Gamblin, Matthew LeGendre, Michael R Collette, Gregory L Lee, Adam Moody, Bronis R De Supinski, and Scott Futral. The Spack package manager: bringing order to HPC software chaos. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 1–12. 2015.
[4] Maxime Boissonneault, Bart E Oldeman, and Ryan P Taylor. Providing a unified software environment for Canada's national advanced computing centers. In Proceedings of the Practice and Experience in Advanced Research Computing on Rise of the Machines (learning), 1–6. 2019.
[5] Bob Dröge, Victor Holanda Rusu, Kenneth Hoste, Caspar van Leeuwen, Alan O'Cais, and Thomas Röblitz. EESSI: a cross-platform ready-to-use optimised scientific software stack. Software: Practice and Experience, 53(1):176–210, 2023.