Snakemake

Lev Kovalenko

Syntax

Snakemake rule

rule bwa_map:
    input:
        "data/genome.fa",
        "data/samples/A.fastq"
    output:
        "mapped_reads/A.bam"
    shell:
        "bwa mem {input} | samtools view -Sb - > {output}"

Start snakemake

snakemake -np mapped_reads/A.bam

Wildcards

Snakemake rule

rule bwa_map:
    input:
        "data/genome.fa",
        "data/samples/{sample}.fastq"
    output:
        "mapped_reads/{sample}.bam"
    shell:
        "bwa mem {input} | samtools view -Sb - > {output}"

Start snakemake

snakemake -np mapped_reads/B.bam
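
How a requested target is matched against a wildcard pattern can be sketched in plain Python. This is only an illustration, not Snakemake's actual implementation: the helper name match_wildcards is hypothetical, and the sketch assumes each wildcard appears once in the pattern and matches the default regex ".+".

```python
import re


def match_wildcards(pattern, target):
    """Return a dict of wildcard values if target matches pattern, else None.

    Hypothetical helper for illustration; assumes each wildcard name
    occurs only once in the pattern.
    """
    regex = ""
    pos = 0
    for m in re.finditer(r"\{(\w+)\}", pattern):
        # Literal text before the wildcard is matched verbatim.
        regex += re.escape(pattern[pos:m.start()])
        # Each wildcard becomes a named group matching one or more characters.
        regex += f"(?P<{m.group(1)}>.+)"
        pos = m.end()
    regex += re.escape(pattern[pos:])
    found = re.fullmatch(regex, target)
    return found.groupdict() if found else None


# The output pattern of bwa_map matched against the requested file.
wildcards = match_wildcards("mapped_reads/{sample}.bam", "mapped_reads/B.bam")
print(wildcards)  # {'sample': 'B'}

# The same value is then substituted into the input pattern.
print("data/samples/{sample}.fastq".format(**wildcards))  # data/samples/B.fastq
```

Requesting mapped_reads/B.bam therefore makes the rule look for data/samples/B.fastq as its input.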

Configuration

Snakemake rule

configfile: "config.yaml"

rule bcftools_call:
    input:
        fa="data/genome.fa",
        bam=expand("sorted_reads/{sample}.bam", sample=config["samples"]),
        bai=expand("sorted_reads/{sample}.bam.bai", sample=config["samples"])
    output:
        "calls/all.vcf"
    shell:
        "bcftools mpileup -f {input.fa} {input.bam} | "
        "bcftools call -mv - > {output}"

config.yaml

samples:
    A: data/samples/A.fastq
    B: data/samples/B.fastq

Start snakemake

snakemake -np bcftools_call
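
What expand() produces for this config can be approximated in plain Python. The function expand_sketch below is a hypothetical stand-in written for illustration, not the Snakemake helper itself; it assumes every keyword value is an iterable of strings (a dict is iterated over its keys).

```python
from itertools import product


def expand_sketch(pattern, **wildcards):
    """Fill pattern with every combination of the given wildcard values.

    Hypothetical stand-in for illustration; values must be iterables
    of strings, not bare strings.
    """
    names = list(wildcards)
    combos = product(*(wildcards[name] for name in names))
    return [pattern.format(**dict(zip(names, combo))) for combo in combos]


# Same structure as config.yaml above, written as a Python dict.
config = {
    "samples": {
        "A": "data/samples/A.fastq",
        "B": "data/samples/B.fastq",
    }
}

bams = expand_sketch("sorted_reads/{sample}.bam", sample=config["samples"])
print(bams)  # ['sorted_reads/A.bam', 'sorted_reads/B.bam']
```

The rule thus lists one BAM file per sample name found under samples in config.yaml.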

Additional features

Benchmark

rule bwa_map:
    input:
        "data/genome.fa",
        lambda wildcards: config["samples"][wildcards.sample]
    output:
        temp("mapped_reads/{sample}.bam")
    params:
        rg=r"@RG\tID:{sample}\tSM:{sample}"
    log:
        "logs/bwa_mem/{sample}.log"
    benchmark:
        "benchmarks/{sample}.bwa.benchmark.txt"
    threads: 8
    shell:
        "(bwa mem -R '{params.rg}' -t {threads} {input} | "
        "samtools view -Sb - > {output}) 2> {log}"

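The lambda in the input section is an ordinary Python function that receives the matched wildcards and returns a path from the config. A minimal sketch, assuming the config.yaml shown earlier and using SimpleNamespace as a stand-in for the wildcards object Snakemake passes in:

```python
from types import SimpleNamespace

# Same structure as config.yaml, written as a Python dict.
config = {
    "samples": {
        "A": "data/samples/A.fastq",
        "B": "data/samples/B.fastq",
    }
}

# The input function from the rule, given a name here for illustration.
get_fastq = lambda wildcards: config["samples"][wildcards.sample]

# Stand-in for the wildcards matched from "mapped_reads/A.bam".
wildcards = SimpleNamespace(sample="A")
print(get_fastq(wildcards))  # data/samples/A.fastq
```

This keeps the mapping from sample name to FASTQ path in the config file instead of hard-coding it in the rule.
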
Modules

include: "path/to/other.snakefile"
include: "path/to/other.smk"

Conda environments

rule samtools_index:
    input:
        "sorted_reads/{sample}.bam"
    output:
        "sorted_reads/{sample}.bam.bai"
    conda:
        "envs/samtools.yaml"
    shell:
        "samtools index {input}"

Clusters

  • Compute clusters
  • Kubernetes
  • Google cloud
  • AWS

Pros & Cons

Pros

  • Language-agnostic
  • File-based DAG
  • Rules live next to the code
  • Distributed execution
  • Supports constraints
  • Python-like syntax
  • Integrated with conda and Docker
  • Modularity
  • Wildcard-based parameterization

Cons

  • Some concepts are hard to grasp
  • Resource limits are nominal rather than enforced
  • Distributed computation is not centralized