Code standards

Here the standards and conventions are defined on how to write rules, configs, and other code in the hydra-genetics framework.

Documentation

README

Each module and pipeline should include a README with the following parts:

Short description
Github action badges
Introduction on what the module should be used for
Dependencies as badges
Section about input data and reference files
How to run the test dataset (if possible)
Rule graph which is contained in images/rulegraph.svg.

readthedocs

Each module and pipeline are recommended to have a more in depth documentation using readthedocs.

Snakemake rules

Rules should be placed in the workflow/rules directory.
Use only small letters and connect words with underscore (e.g. picard_mark_duplicates).
Use alphabetical order when applicable.
For structure, all rules that use the same tool (e.g. picard) should be added to the same rule file named tool.smk. Rule names should be tool name followed by command, for example picard_collect_wgs_metrics.
Rules that produce new output should place this file in module/rule/ while rules that only modify an output file (bgzip, annotate) should place the file in the input directory under the same name as the input with a descriptive but short suffix module/input_rule/input_file.suffix.
Input and Output files/directories should get a reasonable tag (specific suffix, e.g. vcf, bam, is preferred), such as:

input:
    vcf=”module/input_rule/{sample}.vcf”,

The name of the main output file (e.g. .bam, .vcf) will also be used for naming the log and benchmark files of a rule, only adding .log and .benchmark.tsv in the end, respectively.
Output files should be marked with the temp() directive in order to save space. Use a rule that copies final results files to a results folder.
When accessing values in the config object they should be retrieved with the get directive while also setting a sensible default:

sorting=config.get("bwa_mem", {}).get("sort", "samtools"),

For all rules, threads and resources (mem_mb, mem_per_cpu, partition, threads, time) should be specified having the default pointing to default_resources in the config object.

threads=config.get("bwa_mem", {}).get("threads", config["default_resources"]["threads"]),

Container images are used for execution. Containers should be located at dockerHub and docker images provided by hydra-genetics should be used. New containers are added and uploaded via the docker module.
All rules should contain a message for logging starting with the rule name followed by a colon and a brief description of what is done and on which file.

message:
      "{rule}: align fastq files {input.reads} using bwa mem"

Last but not least, the execution needs to be specified which can either be shell, run, script or wrapper. We prefer to use official wrappers if they exist. Otherwise, the command should be specified or a script file referenced. The command should be split in several lines, each starting and ending with quotes, for each new flag. Don’t forget to include logging.

common.smk

This is a general rule taking care of any actions that are not directly connected with running a specific program.

Set up

On the top, include a snakemake version check, import of config, resources, tsv-files and respective checks:

min_version("6.0.0")

configfile: "config.yaml"

validate(config, schema="../schemas/config.schema.yaml")
config = load_resources(config, config["resources"])
validate(config, schema="../schemas/resources.schema.yaml")

samples = pd.read_table(config["samples"], dtype=str).set_index("sample", drop=False)
validate(samples, schema="../schemas/samples.schema.yaml")

units = pandas.read_table(config["units"], dtype=str).set_index(["sample", "type", "run", "lane"], drop=False).sort_index()
validate(units, schema="../schemas/units.schema.yaml")

wildcard_constraints:
    sample="|".join(samples.index),
    unit="N|T|R",

Functions

The next part should comprise necessary functions used by rules, input or parameters. There are a number of functions available in hydra-genetics/tools which should be used where possible.

Output

The bottom should be the function compile_output_list which programmatically generates a list of all necessary output files for the module to be targeted in the all rule defined in the Snakemake file. See further Result files.

Scripts

Scripts in python (at least 3.8.0) or R (at 4.0.0) should be placed in the scripts directory.
Try to keep your names concise and use only lowercase and underscores.
Scripts should comprise of functions (DRY) which are called in your main function like so:

    if __name__ == "__main__":
        …

Logging is to be included - anything from info to warnings and errors to make troubleshooting easier.
Unit tests are mandatory, putting them in scriptname_test.py files. Refer to the section unit tests for details.

Unit tests

Unit tests use the unittest library and should be defined as TestCases. In the class, define your test function. We are using table-driven testing by exploiting the functionality of dataclasses. All edge cases should be defined in a list of TestCase and subsequently looped through to test the function output and compare it to the expected result:

import unittest
from dataclasses import dataclass
from my_script import my_function

class TestInsertSize(unittest.TestCase):
    def test_my_function(self):
        @dataclass
            class TestCase:
                    name: str
                    input: str
                    expected: str

                testcases = [
                        TestCase(
                                name="Successful test",
                                input=”input string”,
                                expected="expected string",
                        ),
                ]

            for case in testcases:
                    actual = my_function(case.input)
                    self.assertEqual(
                        case.expected,
                        actual,
                        "failed test '{}': expected {}, got {}".format(
                                case.name, case.expected, actual
                        ),
                    )

Config

The modules use a config.yaml file to tie all file and other dependencies as well as parameters for different rules together. To make configuration easier, add an example to the config folder. See further pipeline configuration.

`sample.tsv` and `units.tsv`

The files samples.tsv and units.tsv store all sample meta data needed to run pipelines that uses hydra-genetics. These can be automatically generated from the .fastq-files by the hydra-genetics help tool, see create sample files.

Schemas

For config.yaml, resources.yaml and input tsv-files (samples.tsv and units.tsv), appropriate schemas should be included in workflow/schemas/. * Each entry defined should include a type and description. * Use the required keyword for stanzas that are absolutely necessary for the whole module, e.g. resources, samples and units. * To make configuration easier, add examples to the config folder. An example schema for the config.yaml:

$schema: "http://json-schema.org/draft-04/schema#"
description: snakemake configuration file
type: object
properties:
  resources:
    type: string
    description: path to resources.yaml file

  samples:
    type: string
    description: path to samples.tsv file

  units:
    type: string
    description: path to units.tsv file

  default_container:
    type: string
    description: name or path to a default docker/singularity container

  bwa_mem_merge:
    type: object
    description: parameters for merging of bam files, directly after alignment step
    properties:
      benchmark_repeats:
        type: integer
        description: set number of times benchmark should be repeated
      container:
        type: string
        description: name or path to docker/singularity container
      extra:
        type: string
        description: parameters that should be forwarded

An example schema for the unit.tsv:

$schema: "http://json-schema.org/draft-04/schema#"
description: row represents one dataset
properties:
  sample:
    type: string
    description: sample id
  type:
    type: string
    description: type of sample data Tumor, Normal, RNA (N|T|R)
    pattern: "^(N|T|R)$"
  flowcell:
    type: string
    description: flowcell id
  fastq1:
    type: string
    description: absolute path to R1 fastq file
  fastq2:
    type: string
    description: absolute path to R2 fastq file
required:
  - sample
  - type
  - flowcell
  - fastq1
  - fastq2

Github

The following branches are used: main, develop (default), feature, bugfix and release branches. Branches are merged into main and then into main follwoing a new release. Read more about gitflow: https://www.atlassian.com/git/tutorials/comparing-workflows/gitflow-workflow https://lucamezzalira.com/2014/03/10/git-flow-vs-github-flow/

Code contribution should be kept small, be done via pull-request.
Use commit tags and meaningful commit messages to make it easier for the reviewer to understand the purpose of your contribution. sgc is a nice CLI that will guide you when committing staged changes. You need to get npm or yarn though.
If possible try to squash commits before doing the first pull-request and reformat the commit messages
All contributions should be reviewed before being incorporated in the code base, by at least 2 persons for main and 1 for develop.

Continuous Integration/Actions

To ensure the quality of the code submitted, we include github actions that are automatically run when pull requests are generated or merged into develop/main. These need to finish successfully. These tasks include integration tests (snakemake dry run and run on test data with both conda and singularity), unit tests (pytest) and linting/formatting (snakemake lint, pycodestyle and snakefmt). See further testing.

Releases

For versioning, we follow the semantic versioning standard: Major.Minor.Patch. Release versioning can be set up automatically by using release-please.