
Snakemake
Repository: github.com/snakemake/snakemake
Written in: Python, YAML
Platform: Windows, Linux, macOS
License: MIT License
Website: snakemake.github.io

Snakemake is a scientific workflow management framework that uses a domain-specific language (DSL) to specify the allowed data transformations, from which processing pipelines are generated automatically. Following the declarative paradigm, a Snakemake user states the desired products rather than an explicit course of action. Individual steps, however, must still be defined as rules, in a structure similar to that of GNU Make. These rules are written in a workflow configuration file called a "Snakefile" using Snakemake's DSL, which accepts both Python[1] and YAML syntax, since the DSL is implemented as an extension of Python. Snakemake executes the pipeline defined in a Snakefile by selecting and running each step as its prerequisites become satisfied. Steps with no interdependence may be executed in parallel.

The author of Snakemake has stated that the goals of the tool are to enhance reproducibility, transparency, and adaptability in bioinformatic analysis.[2] The following characteristics of Snakemake support these goals. Native integration with external tools such as Conda and Singularity provides dependency resolution and containerization. Platform independence allows the same workflow to run in local or high-performance computing (HPC) environments. Reports can be generated to describe and log the conditions and actions performed at each step.

Operation

 
Figure: Example generated workflow from available actions and goal

Rules

The set of actions that Snakemake can take towards generating a desired data product is specified in a file called the Snakefile. These actions are called "rules", and each includes a name, inputs, outputs, and a way to start the represented action.[1] Inputs and outputs are declared as explicit file paths, cloud storage URIs, or sets of locations expressed through patterns and wildcards. Several methods exist to start an action, including a shell command, a Python script, or a Snakemake wrapper.[3] A wrapper is a predefined configuration that specifies how Snakemake can interface with existing software such as Samtools.[4] A Snakefile may contain both YAML adhering to Snakemake's DSL and native Python,[3] but upon execution the rules are translated into Python via a state machine algorithm.[1]
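
The pattern-and-wildcard matching described above can be illustrated with a small Python sketch. This is not Snakemake's implementation; the helper names (pattern_to_regex, match_wildcards, fill_pattern) are hypothetical, each wildcard is assumed to appear only once per pattern, and each wildcard is assumed to match any non-empty string.

```python
import re


def pattern_to_regex(pattern: str) -> re.Pattern:
    """Convert a pattern such as 'sorted/{sample}.bam' into a compiled regex.

    Assumption: every wildcard name occurs once and matches one or more
    characters. This is an illustrative choice, not Snakemake's own logic.
    """
    # Splitting on a single capture group alternates literal text
    # (even indices) and wildcard names (odd indices).
    parts = re.split(r"\{(\w+)\}", pattern)
    regex = ""
    for index, part in enumerate(parts):
        if index % 2 == 0:
            regex += re.escape(part)        # literal path fragment
        else:
            regex += f"(?P<{part}>.+)"      # named group for the wildcard
    return re.compile(regex)


def match_wildcards(pattern: str, path: str):
    """Return the wildcard values that make `pattern` equal `path`, or None."""
    match = pattern_to_regex(pattern).fullmatch(path)
    return match.groupdict() if match else None


def fill_pattern(pattern: str, wildcards: dict) -> str:
    """Substitute wildcard values into a pattern to get a concrete path."""
    return pattern.format(**wildcards)


if __name__ == "__main__":
    # A requested output determines the wildcard values...
    values = match_wildcards("sorted/{sample}.bam", "sorted/A.bam")
    print(values)
    # ...which are then substituted into the rule's input pattern.
    print(fill_pattern("mapped/{sample}.bam", values))
```

In this sketch, the values recovered from a requested output file are reused to compute the concrete input path, which is the idea that lets one rule describe a whole set of files.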

Example rule:

rule user_defined_name:
    input:
        "path/to/input.file"
    output:
        "path/to/output.file"
    shell:
        "shell_command {input} > {output}"

Workflows and Execution

When given a set of rules and target outputs, Snakemake first either looks for files that match the targets or searches for defined rules that produce them. It then redefines the targets as the inputs of the rules it found and repeats the search recursively until all inputs are existing files, generating an implicit workflow in the form of a directed acyclic graph (DAG) of steps.[2] The execution order of these steps is computed by optimizing a mixed integer linear program for various parameters such as runtime, total disk space required for temporary files, and user-defined priority.[5] Where possible, intermediate data products from previous runs can be used to skip ahead if prerequisite steps are found to be equivalent. Equivalence is determined by comparing the prior steps used to generate the data, recorded through a ledger system analogous to that of blockchains.[5]
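
The recursive target resolution described above can be sketched in a few lines of Python. This is a simplified illustration, not Snakemake's code. The rule table, file names, and function names are hypothetical; wildcards are ignored; and the ordering step uses a plain topological sort from the standard library rather than the mixed integer linear program mentioned above.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical rules: name -> (input files, output files).
RULES = {
    "align": (["raw/a.fastq"], ["mapped/a.bam"]),
    "sort": (["mapped/a.bam"], ["sorted/a.bam"]),
    "index": (["sorted/a.bam"], ["sorted/a.bam.bai"]),
}


def build_dag(targets, rules, existing):
    """Work backwards from the targets to the rules needed to produce them.

    Returns a dict mapping each required rule to the set of rules it
    depends on. `existing` is the set of files already present.
    """
    producers = {out: name for name, (_, outs) in rules.items() for out in outs}
    dag = {}
    pending = list(targets)
    while pending:
        path = pending.pop()
        if path in existing:
            continue  # file already exists, so no rule is needed
        if path not in producers:
            raise FileNotFoundError(f"no file or rule provides {path!r}")
        name = producers[path]
        if name in dag:
            continue  # rule already scheduled
        inputs, _ = rules[name]
        dag[name] = {
            producers[i] for i in inputs if i not in existing and i in producers
        }
        pending.extend(inputs)  # the rule's inputs become the new targets
    return dag


def execution_order(targets, rules, existing):
    """Return one valid order in which the required rules can run."""
    return list(TopologicalSorter(build_dag(targets, rules, existing)).static_order())


if __name__ == "__main__":
    print(execution_order(["sorted/a.bam.bai"], RULES, {"raw/a.fastq"}))
```

If an intermediate file such as sorted/a.bam is already present, the sketch skips the rules upstream of it, loosely mirroring the reuse of intermediate data products. It checks only whether a file exists, however, and does not model the provenance comparison described above.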

Features

 
Figure: Example Snakefile and tool wrapper reuse

Reproducibility

Reproducibility refers to the level of determinism exhibited by a piece of software, that is, how similar the outputs of different executions are when the same input is processed each time. Snakemake can use Conda, or containerization via Docker or Singularity,[2] to reconstruct the compute environment and its required software dependencies[6] for executions on different machines. Reports can also be generated to summarize the various parameters and events at each processing step, including an image of the computed DAG.[2]

Transparency

Transparency refers to the interpretability of a piece of software's execution, or how well the software reports its actions. In addition to the report generation mentioned above, Snakefiles use commonly understood terms rather than technical or Snakemake-specific language.[2] This supports transparency by reducing the amount of domain knowledge required to understand what a pipeline does and what the results mean.

Reusability

Previously written Snakefiles can be referenced in new Snakefiles, thus allowing rules to be reused.[2] Certain predefined rules, called tool wrappers, integrate external software for workflow generation and are publicly available.[4]

Comparisons

Snakemake belongs to a family of implicit convention frameworks,[7] members of which include Make and Nextflow. The common characteristic of this group of tools is the ability to generate specific workflows from a collection of discrete steps. In particular, GNU Make, which is used for building source code, is considered the source of inspiration for Snakemake and contributes to its model of operation and to certain features such as wildcards.[8] In terms of usage, Snakemake relies on text-based configuration rather than a GUI such as that of Galaxy, and attempts to be environment agnostic.[9][2]

Usage Examples

Bioinformatics

Other

  • HEXME is a dataset of tetrahedral meshes generated from CAD models, using Snakemake. Rules were specified to generate three variants for each input.[16]

References

  1. ^ a b c Köster, Johannes; Rahmann, Sven (2012). Building and Documenting Workflows with Python-Based Snakemake. OpenAccess Series in Informatics (OASIcs). Vol. 26. Marc Herbstritt. pp. 49–56. doi:10.4230/OASICS.GCB.2012.49. ISBN 9783939897446.
  2. ^ a b c d e f g Mölder, Felix; Jablonski, Kim Philipp; Letcher, Brice; Hall, Michael B.; Tomkins-Tinch, Christopher H.; Sochat, Vanessa; Forster, Jan; Lee, Soohyun; Twardziok, Sven O.; Kanitz, Alexander; Wilm, Andreas (2021-04-19). "Sustainable data analysis with Snakemake". F1000Research. 10: 33. doi:10.12688/f1000research.29032.2. ISSN 2046-1402. PMC 8114187. PMID 34035898.
  3. ^ a b "Snakemake — Snakemake 7.0.0 documentation". snakemake.readthedocs.io. Retrieved 2022-02-26.
  4. ^ a b "The Snakemake Wrappers repository — Snakemake Wrappers tags/v1.2.0 documentation". snakemake-wrappers.readthedocs.io. Retrieved 2022-02-26.
  5. ^ a b Köster, Johannes; Rahmann, Sven (2012-10-01). "Snakemake—a scalable bioinformatics workflow engine". Bioinformatics. 28 (19): 2520–2522. doi:10.1093/bioinformatics/bts480. ISSN 1367-4803. PMID 22908215.
  6. ^ Strozzi, Francesco; Janssen, Roel; Wurmus, Ricardo; Crusoe, Michael R.; Githinji, George; Di Tommaso, Paolo; Belhachemi, Dominique; Möller, Steffen; Smant, Geert (2019), Anisimova, Maria (ed.), "Scalable Workflows and Reproducible Data Analysis for Genomics", Evolutionary Genomics: Statistical and Computational Methods, Methods in Molecular Biology, vol. 1910, New York, NY: Springer, pp. 723–745, doi:10.1007/978-1-4939-9074-0_24, ISBN 978-1-4939-9074-0, PMC 7613310, PMID 31278683, S2CID 195813647, retrieved 2022-02-27
  7. ^ Leipzig, Jeremy (2017-05-01). "A review of bioinformatic pipeline frameworks". Briefings in Bioinformatics. 18 (3): 530–536. doi:10.1093/bib/bbw020. ISSN 1467-5463. PMC 5429012. PMID 27013646.
  8. ^ Jackson, Michael; Kavoussanakis, Kostas; Wallace, Edward W. J. (2021-02-25). "Using prototyping to choose a bioinformatics workflow management system". PLOS Computational Biology. 17 (2): e1008622. Bibcode:2021PLSCB..17E8622J. doi:10.1371/journal.pcbi.1008622. ISSN 1553-7358. PMC 7906312. PMID 33630841.
  9. ^ Spjuth, Ola; Bongcam-Rudloff, Erik; Hernández, Guillermo Carrasco; Forer, Lukas; Giovacchini, Mario; Guimera, Roman Valls; Kallio, Aleksi; Korpelainen, Eija; Kańduła, Maciej M.; Krachunov, Milko; Kreil, David P. (2015-08-19). "Experiences with workflows for automating data-intensive bioinformatics". Biology Direct. 10 (1): 43. doi:10.1186/s13062-015-0071-8. ISSN 1745-6150. PMC 4539931. PMID 26282399.
  10. ^ Rioualen, Claire; Charbonnier-Khamvongsa, Lucie; Helden, Jacques van (2017-07-19). "SnakeChunks: modular blocks to build Snakemake workflows for reproducible NGS analyses": 165191. doi:10.1101/165191. S2CID 195970637.
  11. ^ Wilks, Christopher; Zheng, Shijie C.; Chen, Feng Yong; Charles, Rone; Solomon, Brad; Ling, Jonathan P.; Imada, Eddie Luidy; Zhang, David; Joseph, Lance; Leek, Jeffrey T.; Jaffe, Andrew E. (2021-11-29). "recount3: summaries and queries for large-scale RNA-seq expression and splicing". Genome Biology. 22 (1): 323. doi:10.1186/s13059-021-02533-6. ISSN 1474-760X. PMC 8628444. PMID 34844637.
  12. ^ Cornwell, MacIntosh; Vangala, Mahesh; Taing, Len; Herbert, Zachary; Köster, Johannes; Li, Bo; Sun, Hanfei; Li, Taiwen; Zhang, Jian; Qiu, Xintao; Pun, Matthew (2018-04-12). "VIPER: Visualization Pipeline for RNA-seq, a Snakemake workflow for efficient and complete RNA-seq analysis". BMC Bioinformatics. 19 (1): 135. doi:10.1186/s12859-018-2139-9. ISSN 1471-2105. PMC 5897949. PMID 29649993.
  13. ^ François, Pauline; Arbes, Hugo; Demais, Stéphane; Baudin-Baillieu, Agnès; Namy, Olivier (2021-01-01). "RiboDoc: A Docker-based package for ribosome profiling analysis". Computational and Structural Biotechnology Journal. 19: 2851–2860. doi:10.1016/j.csbj.2021.05.014. ISSN 2001-0370. PMC 8141510. PMID 34093996.
  14. ^ Weißbecker, Christina; Schnabel, Beatrix; Heintz-Buschart, Anna (2020-11-30). "Dadasnake, a Snakemake implementation of DADA2 to process amplicon sequencing data for microbial ecology". GigaScience. 9 (12): giaa135. doi:10.1093/gigascience/giaa135. ISSN 2047-217X. PMC 7702218. PMID 33252655.
  15. ^ Schmied, Christopher; Steinbach, Peter; Pietzsch, Tobias; Preibisch, Stephan; Tomancak, Pavel (2016-04-01). "An automated workflow for parallel processing of large multiview SPIM recordings". Bioinformatics. 32 (7): 1112–1114. doi:10.1093/bioinformatics/btv706. ISSN 1367-4803. PMC 4896369. PMID 26628585.
  16. ^ Beaufort, Pierre-Alexandre; Reberol, Maxence; Liu, Heng; Ledoux, Franck; Bommes, David (2021-11-19). "Hex Me If You Can". arXiv:2111.10295 [cs.CG].