Draft:Snakemake

Snakemake
Repository	github.com/snakemake/snakemake
Written in	Python, YAML
Platform	Windows, Linux, macOS
License	MIT License
Website	snakemake.github.io

Snakemake is a scientific workflow management framework that uses a domain-specific language (DSL) to specify allowed data transformations, from which processing pipelines are generated automatically. Following the declarative paradigm, a user states the desired products rather than an explicit course of action. Individual steps must still be defined as rules, in a structure similar to that of GNU Make. These rules are written in a workflow definition file called a "Snakefile" using Snakemake's DSL, which combines Python^[1] with YAML-like rule declarations because the DSL is implemented as an extension of Python. Snakemake executes the pipeline defined in a Snakefile by selecting and running each step as its prerequisites become satisfied. Steps with no interdependence may be executed in parallel.
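
The execution model described above, in which a step runs once its prerequisites are satisfied and independent steps may run together, can be illustrated with a short Python sketch. It is not Snakemake's scheduler: the step names and the `parallel_batches` helper are hypothetical, and the standard-library `graphlib` module (Python 3.9 or later) stands in for the real dependency tracking.

```python
from graphlib import TopologicalSorter


def parallel_batches(dependencies):
    """Group steps into batches whose members do not depend on each other.

    `dependencies` maps each step name to the set of steps it requires.
    """
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises CycleError if the graph is not acyclic
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # steps whose prerequisites are done
        batches.append(ready)
        sorter.done(*ready)  # mark the whole batch as finished
    return batches


# Hypothetical pipeline: two samples are trimmed and aligned independently,
# then combined into a single report.
steps = {
    "trim_a": set(),
    "trim_b": set(),
    "align_a": {"trim_a"},
    "align_b": {"trim_b"},
    "report": {"align_a", "align_b"},
}
print(parallel_batches(steps))
```

Each inner list holds steps that could, in principle, be dispatched at the same time; the report step appears only after both alignment steps have finished.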

The author of Snakemake has stated that the goals of the tool are to enhance reproducibility, transparency, and adaptability in bioinformatic analysis.^[2] The following characteristics support these goals: native integration with external tools such as Conda and Singularity provides dependency resolution and containerization; platform independence allows the same workflow to run on local or high-performance computing (HPC) environments; and reports can be generated to describe and log the conditions and actions performed at each step.

Operation

Example generated workflow from available actions & goal

Rules

The set of actions that Snakemake can take towards generating a desired data product is specified in a file called the Snakefile. These actions are called "rules", and each includes a name, inputs, outputs, and a way to start the represented action.^[1] Inputs and outputs are declared as explicit file paths, cloud storage URIs, or sets of locations described by patterns and wildcards. An action can be started in several ways, including a shell command, a Python script, or a Snakemake wrapper.^[3] A wrapper is a predefined configuration that specifies how Snakemake can interface with existing software such as Samtools.^[4] A Snakefile may contain both YAML-like declarations following Snakemake's DSL and native Python,^[3] but upon execution the rules are translated into Python via a state machine algorithm.^[1]

Example rule:

rule user_defined_name:
    input:
        "path/to/input.file"
    output:
        "path/to/output.file"
    shell:
        "shell_command {input} > {output}"
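
How a wildcard pattern in an output path can be matched against a requested file may be sketched in plain Python. This is an illustration only, not Snakemake's implementation: the `match_wildcards` helper and the example paths are hypothetical, each `{name}` placeholder is assumed to match one or more arbitrary characters, and each wildcard name is assumed to appear only once in a pattern.

```python
import re


def match_wildcards(pattern, target):
    """Return the wildcard values if `target` fits `pattern`, else None."""
    regex = ""
    position = 0
    # Turn each {name} placeholder into a named group; escape everything else.
    for placeholder in re.finditer(r"\{(\w+)\}", pattern):
        regex += re.escape(pattern[position:placeholder.start()])
        regex += f"(?P<{placeholder.group(1)}>.+)"
        position = placeholder.end()
    regex += re.escape(pattern[position:])
    found = re.fullmatch(regex, target)
    return found.groupdict() if found else None


# A rule whose output is "sorted/{sample}.bam" can produce "sorted/A.bam",
# with the wildcard `sample` bound to "A"; it cannot produce "mapped/A.bam".
print(match_wildcards("sorted/{sample}.bam", "sorted/A.bam"))
print(match_wildcards("sorted/{sample}.bam", "mapped/A.bam"))
```

The bound values would then be substituted into the rule's input pattern to find out which files the step needs.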

Workflows and Execution

When given a set of rules and target outputs, Snakemake first looks for existing files that match the targets and otherwise searches for defined rules that produce them. The inputs of the rules found then become the new targets, and the search repeats recursively until all inputs are existing files. This yields an implicit workflow in the form of a directed acyclic graph (DAG) of steps.^[2] The execution order of these steps is computed by solving a mixed integer linear program that accounts for parameters such as runtime, total disk space required for temporary files, and user-defined priority.^[5] Where possible, intermediate data products from previous runs are reused, so that prerequisite steps found to be equivalent are skipped. Equivalence is determined by comparing the prior steps used to generate the data, which are recorded in a ledger system analogous to that of blockchains.^[5]
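
The backward search from targets to existing files can be sketched in Python. The sketch is a simplification under stated assumptions: rules are plain dictionaries with fixed file names (no wildcards), the `resolve` helper and all file names are hypothetical, file timestamps are ignored, and the ordering comes from a simple depth-first traversal rather than the optimization Snakemake performs.

```python
def resolve(targets, rules, existing):
    """Collect the rules needed to build `targets`, prerequisites first.

    `rules` maps a rule name to {"input": [...], "output": [...]};
    `existing` is the set of files already present.
    """
    producers = {out: name for name, rule in rules.items() for out in rule["output"]}
    plan = []

    def visit(path, in_progress):
        if path in existing:
            return  # the file is already there, so nothing has to run
        if path not in producers:
            raise ValueError(f"no rule produces {path!r}")
        name = producers[path]
        if name in in_progress:
            raise ValueError(f"cyclic dependency at rule {name!r}")
        if name in plan:
            return  # the rule is already scheduled
        for dependency in rules[name]["input"]:
            visit(dependency, in_progress | {name})  # inputs become new targets
        plan.append(name)

    for target in targets:
        visit(target, frozenset())
    return plan


# Hypothetical three-step workflow, deliberately listed out of order.
rules = {
    "sort": {"input": ["mapped.bam"], "output": ["sorted.bam"]},
    "map": {"input": ["reads.fastq"], "output": ["mapped.bam"]},
    "plot": {"input": ["sorted.bam"], "output": ["plot.pdf"]},
}
print(resolve(["plot.pdf"], rules, existing={"reads.fastq"}))
print(resolve(["plot.pdf"], rules, existing={"reads.fastq", "sorted.bam"}))
```

In the second call the intermediate file `sorted.bam` is treated as already available, so only the final step remains in the plan.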

Features

Example Snakefile and tool wrapper reuse

Reproducibility

Reproducibility refers to the level of determinism exhibited by a piece of software, that is, how similar the outputs of different executions are when the same input is processed each time. Snakemake can use Conda, or containerization via Docker or Singularity,^[2] to reconstruct the compute environment with the required software dependencies^[6] for executions on different machines. Reports can also be generated to summarize the parameters and events at each processing step, including an image of the computed DAG.^[2]

Transparency

Transparency refers to the interpretability of a software's execution, or how well the software reports its actions. In addition to the report generation mentioned above, Snakefiles use commonly understood terms rather than technical or Snakemake-specific language.^[2] This supports transparency by reducing the amount of domain knowledge required to understand what a pipeline does and what the results mean.

Reusability

Previously written Snakefiles can be referenced in new Snakefiles, thus allowing rules to be reused.^[2] Certain predefined rules, called tool wrappers, integrate external software for workflow generation and are publicly available.^[4]

Comparisons

Snakemake belongs to a family of implicit convention frameworks,^[7] members of which include Make and Nextflow. The common characteristic of this group of tools is the ability to generate specific workflows from a collection of discrete steps. GNU Make, which is used for building source code, is considered to be the source of inspiration for Snakemake and contributes to its model of operation and certain features such as wildcards.^[8] Unlike Galaxy, which provides a graphical user interface, Snakemake is configured through text files, and it attempts to be environment agnostic.^[9]^[2]

Usage Examples

Bioinformatics

SnakeChunks is a set of Snakemake rules used for processing next-generation sequencing data, including quality control, peak calling, and visualization steps. The authors demonstrated its use on transcriptome (RNA-seq) and genome-wide location (ChIP-seq) data.^[10]
Recount3 is a database of human and mouse RNA-seq data that was curated and annotated with a Snakemake-based workflow.^[11]
VIPER is a Snakemake-based pipeline for producing graphical reports of RNA-seq data.^[12]
RiboDoc is a ribosome profiling package in which quality control and data-intensive statistical analysis steps are managed by Snakemake and Docker.^[13]
Dadasnake integrates amplicon sequencing analysis with DADA2 using Snakemake. The implied workflow is for phylogenetic analysis of specific genes.^[14]
Snakemake was used to parallelize the compilation of time-lapses from selective plane illumination microscopy images of cells on high-performance computing clusters.^[15]

Other

HEXME is a dataset of tetrahedral meshes generated from CAD models, using Snakemake. Rules were specified to generate three variants for each input.^[16]

References

[1] Köster, Johannes; Rahmann, Sven (2012). Building and Documenting Workflows with Python-Based Snakemake. OpenAccess Series in Informatics (OASIcs). Vol. 26. Marc Herbstritt. pp. 49–56. doi:10.4230/OASICS.GCB.2012.49. ISBN 9783939897446.
[2] Mölder, Felix; Jablonski, Kim Philipp; Letcher, Brice; Hall, Michael B.; Tomkins-Tinch, Christopher H.; Sochat, Vanessa; Forster, Jan; Lee, Soohyun; Twardziok, Sven O.; Kanitz, Alexander; Wilm, Andreas (2021-04-19). "Sustainable data analysis with Snakemake". F1000Research. 10: 33. doi:10.12688/f1000research.29032.2. ISSN 2046-1402. PMC 8114187. PMID 34035898.
[3] "Snakemake — Snakemake 7.0.0 documentation". snakemake.readthedocs.io. Retrieved 2022-02-26.
[4] "The Snakemake Wrappers repository — Snakemake Wrappers tags/v1.2.0 documentation". snakemake-wrappers.readthedocs.io. Retrieved 2022-02-26.
[5] Köster, Johannes; Rahmann, Sven (2012-10-01). "Snakemake—a scalable bioinformatics workflow engine". Bioinformatics. 28 (19): 2520–2522. doi:10.1093/bioinformatics/bts480. ISSN 1367-4803. PMID 22908215.
[6] Strozzi, Francesco; Janssen, Roel; Wurmus, Ricardo; Crusoe, Michael R.; Githinji, George; Di Tommaso, Paolo; Belhachemi, Dominique; Möller, Steffen; Smant, Geert (2019), Anisimova, Maria (ed.), "Scalable Workflows and Reproducible Data Analysis for Genomics", Evolutionary Genomics: Statistical and Computational Methods, Methods in Molecular Biology, vol. 1910, New York, NY: Springer, pp. 723–745, doi:10.1007/978-1-4939-9074-0_24, ISBN 978-1-4939-9074-0, PMC 7613310, PMID 31278683, S2CID 195813647, retrieved 2022-02-27
[7] Leipzig, Jeremy (2017-05-01). "A review of bioinformatic pipeline frameworks". Briefings in Bioinformatics. 18 (3): 530–536. doi:10.1093/bib/bbw020. ISSN 1467-5463. PMC 5429012. PMID 27013646.
[8] Jackson, Michael; Kavoussanakis, Kostas; Wallace, Edward W. J. (2021-02-25). "Using prototyping to choose a bioinformatics workflow management system". PLOS Computational Biology. 17 (2): e1008622. Bibcode:2021PLSCB..17E8622J. doi:10.1371/journal.pcbi.1008622. ISSN 1553-7358. PMC 7906312. PMID 33630841.
[9] Spjuth, Ola; Bongcam-Rudloff, Erik; Hernández, Guillermo Carrasco; Forer, Lukas; Giovacchini, Mario; Guimera, Roman Valls; Kallio, Aleksi; Korpelainen, Eija; Kańduła, Maciej M.; Krachunov, Milko; Kreil, David P. (2015-08-19). "Experiences with workflows for automating data-intensive bioinformatics". Biology Direct. 10 (1): 43. doi:10.1186/s13062-015-0071-8. ISSN 1745-6150. PMC 4539931. PMID 26282399.
[10] Rioualen, Claire; Charbonnier-Khamvongsa, Lucie; Helden, Jacques van (2017-07-19). "SnakeChunks: modular blocks to build Snakemake workflows for reproducible NGS analyses". bioRxiv: 165191. doi:10.1101/165191. S2CID 195970637.
[11] Wilks, Christopher; Zheng, Shijie C.; Chen, Feng Yong; Charles, Rone; Solomon, Brad; Ling, Jonathan P.; Imada, Eddie Luidy; Zhang, David; Joseph, Lance; Leek, Jeffrey T.; Jaffe, Andrew E. (2021-11-29). "recount3: summaries and queries for large-scale RNA-seq expression and splicing". Genome Biology. 22 (1): 323. doi:10.1186/s13059-021-02533-6. ISSN 1474-760X. PMC 8628444. PMID 34844637.
[12] Cornwell, MacIntosh; Vangala, Mahesh; Taing, Len; Herbert, Zachary; Köster, Johannes; Li, Bo; Sun, Hanfei; Li, Taiwen; Zhang, Jian; Qiu, Xintao; Pun, Matthew (2018-04-12). "VIPER: Visualization Pipeline for RNA-seq, a Snakemake workflow for efficient and complete RNA-seq analysis". BMC Bioinformatics. 19 (1): 135. doi:10.1186/s12859-018-2139-9. ISSN 1471-2105. PMC 5897949. PMID 29649993.
[13] François, Pauline; Arbes, Hugo; Demais, Stéphane; Baudin-Baillieu, Agnès; Namy, Olivier (2021-01-01). "RiboDoc: A Docker-based package for ribosome profiling analysis". Computational and Structural Biotechnology Journal. 19: 2851–2860. doi:10.1016/j.csbj.2021.05.014. ISSN 2001-0370. PMC 8141510. PMID 34093996.
[14] Weißbecker, Christina; Schnabel, Beatrix; Heintz-Buschart, Anna (2020-11-30). "Dadasnake, a Snakemake implementation of DADA2 to process amplicon sequencing data for microbial ecology". GigaScience. 9 (12): giaa135. doi:10.1093/gigascience/giaa135. ISSN 2047-217X. PMC 7702218. PMID 33252655.
[15] Schmied, Christopher; Steinbach, Peter; Pietzsch, Tobias; Preibisch, Stephan; Tomancak, Pavel (2016-04-01). "An automated workflow for parallel processing of large multiview SPIM recordings". Bioinformatics. 32 (7): 1112–1114. doi:10.1093/bioinformatics/btv706. ISSN 1367-4803. PMC 4896369. PMID 26628585.
[16] Beaufort, Pierre-Alexandre; Reberol, Maxence; Liu, Heng; Ledoux, Franck; Bommes, David (2021-11-19). "Hex Me If You Can". arXiv:2111.10295 [cs.CG].
