2. Data processing from FASTQ to GVCF

Learning outcome

After this practical, the students recognise FASTQ, BAM, GVCF and VCF data formats and can explain their roles and relationships. They can map FASTQ data to a reference genome using bwa mem and manipulate the resulting BAM data with samtools; and can call variants with GATK package using the joint-calling approach.


Getting to Puhti

Get to your working directory. If you are on a login node (if you access Puhti through a stand-alone terminal, you get to a login node; in the web interface, you can select either a login node or a compute node), start an interactive job (see sinteractive -h for the options):

$ WORK=/scratch/project_2000598
$ cd $WORK/$USER
$ # the command below is optional, see the text!
$ sinteractive -A project_2000598 -p small -c 2 -m 2000 -t 8:00:00
Warning

Depending on the utilisation of the Puhti cluster, the last step may take time. If it doesn’t progress in a reasonable time (a couple of minutes), you may skip it and work on the login node. The work below is computationally light and should not put much strain on the login nodes.

Preparations for the analyses

We have previously done the FASTA-index for our reference genome (reference/ninespine_rm.fa.fai) and will now generate a FASTA-dictionary:

$ module load samtools
$ samtools dict reference/ninespine_rm.fa > reference/ninespine_rm.dict

The dictionary will later be included in the VCF file.

The role of the dictionary is to provide a means to confirm that the VCF file (where it is integrated) and the external reference genome match. Without a dictionary, it would be possible that the VCF data were analysed against an incorrect reference and the results would therefore be wrong.

So, how to verify that the sequences are the same without writing the sequences? There are multiple versions of the human reference genome and each has chromosome names “chr1”, “chr2” etc. and the names are clearly not informative. The chromosome lengths would be better but, in theory, it is possible that a revision of the reference genome would remove sequence somewhere and add elsewhere, and produce an equally long but different sequence. The same length of two sequences does not mean that they are the same.

The start of the dictionary looks like this:

$ head -3 reference/ninespine_rm.dict 
@HD VN:1.0  SO:unsorted
@SQ SN:ctg7180000005484 LN:697083   M5:530cf96b9ecf6c85885876fa01820b82 UR:file:///scratch/project_2000598/$USER/reference/ninespine_rm.fa
@SQ SN:ctg7180000005545 LN:816998   M5:785bd5fdfe0aaf255c50a930e9c4cdba UR:file:///scratch/project_2000598/$USER/reference/ninespine_rm.fa

The key is the field starting with “M5:”. This combination of commands produces the M5 string for the first contig:

$ samtools faidx reference/ninespine_rm.fa ctg7180000005484 \
  | grep -v ">" | tr -d '\n' | tr [acgtn] [ACGTN] | md5sum 
530cf96b9ecf6c85885876fa01820b82  -
Important

The hash sum is the first part of the output. The second part - indicates that the sum was computed for STDIN. You can compare the output to that of:

$ md5sum reference/ninespine_rm.fa.fai

Now, the sum and the target are clear.

Think about the commands and what they do. md5sum and cryptographic hash functions are explained in Wikipedia. There are many hash functions but md5sum is the most commonly used. It can be computed fairly quickly to any file and is widely used e.g. to verify that the file hasn’t been corrupted during transfer (e.g. when downloading of a large file from a remote server).

bwa mem needs an index of it own to map the read against:

$ module load bwa
$ bwa index reference/ninespine_rm.fa

This creates multiple new files (ls reference/ninespine_rm.fa.*):

  • .amb is text file, to record appearance of non-ATGC characters in the ref FASTA.
  • .ann is text file, to record ref sequences, name, length, etc.
  • .bwt is binary, the Burrows-Wheeler transformed sequence.
  • .pac is binary, packaged sequence (four base pairs encode one byte).
  • .sa is binary, suffix array index.

If unfamiliar, check the difference between text and binary files at https://en.wikipedia.org/wiki/Text_file and https://en.wikipedia.org/wiki/Binary_file.

Exercise: Identification of differences with md5sum

Our reference genome is soft-masked such that the inferred repeat regions are written in lower case and the rest of the sequence in upper case. The sequence dictionary doesn’t consider that and converts the sequence to upper case before computing the hash sum.

A naive approach to check whether two sequences are different would be to compute their word count. The commands below show that the word counts are the same for soft-masked and non-masked sequences and this naive approach is not able to distinguish the two sequence versions:

$ samtools faidx reference/ninespine_rm.fa ctg7180000005484   | grep -v ">" | tr -d '\n' | wc 
$ samtools faidx reference/ninespine_rm.fa ctg7180000005484   | grep -v ">" | tr -d '\n' | tr [acgtn] [ACGTN] | wc 

Compute the md5sum for the soft-masked sequence above (only the sequence part!) and show that it is different from that of the non-masked sequence discussed earlier (and included in the dict-file).

Exercise 1 in Moodle.

Mapping with bwa mem

First, let’s copy some data. The sequences are in FASTQ-format that is not hugely different from FASTA:

$ mkdir fastq
$ cp $WORK/shared/fastq/sample-1_[12].fq.gz fastq/
$ zcat fastq/sample-1_1.fq.gz | head -8
@K00137:79:H3KNNBBXX:3:1101:1030:21850
AAAAGGTCATTTGCTTGATAGAAACAGTGTAAGCTAAAGCATTCCTCTTACATTCTGTTTTAGCTGTTTACTCCTTCTGAGAAAGCTTTAAAAGTAACAATAATGGTCAAAACGTTGTTGAGTGAATAAGGTTCTAAAACTAAACATAAT
+
AAFFFKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKFKKKKF7FKKKKKKKAKKKKKKKKKKKKKAAKFKKKKKKKKKKKKKKKKKKKFF<KKFFAKFKFKKKKFFKKKAAAAK<<FAF7AKA7A7A7<AAFFAFFKAAF7F
@K00137:79:H3KNNBBXX:3:1101:1040:23907
GTGTACTTTAAGTATAACAAATGCAGTAAGTCCCCTATTGTTACATAATTACATATTATGTAAAAGAATAATCAATTTGTAAGTAGCCTTTTATTATTGTGGTCAAAGGGGAGCTAATTACATATTTTATATTCAGAAGAAACTGATGCA
+
AAF<FKKKFKKKK<FKKKKKKKKAKKKFKKKFFKFKKAKFAKKKKKKAFKKFKKKAKKKK7KFKKKKKFFFA<FAKKKKKK77<FAA,,,<<<A7FFK7<,,,,,<AA<AFA7F,<7<FA7<,,,<<77A<7,77<FFKFA7,,,7<<,<```

Each read consists of four lines and above we see the first two reads. The first line is the header; the second line is the sequence; the third line is nowadays unused; and the fourth line is the base calling quality.

The reads in a fastq file should represent the same sample but, in principle, they could come from different devices, runs and lanes. (It would be good to keep them separate and only combine later; see below.) Above, we see that the two reads come from the same lane and tile, but from different clusters (x- and y-coordinates). See the earlier description of Illumina sequencing to understand the meaning of “clusters”.

Files “*_1.fq.gz” and “*_2.fq.gz” are pairs and list the two parts of the pair-end reads, that is the sequence from the two ends of a fragment. You can do zcat fastq/sample-1_1.fq.gz | head -1 and zcat fastq/sample-1_2.fq.gz | head -1 to confirm tha the reads have the same information and come from the same cluster.

The sequence line is pretty obvious, it lists the 150 bases from one of the fragment. The line with base calling quality scores is equally long and has one quality score per base. The scale of Illumina quality scores is from 0 to 41 and these 42 numbers need to be reported with one character per each possible score.

In the ASCII table, the first characters are control characters (such as newline and TAB) and thus unsuitable; the 32nd character is space ’ ’ and still confusing. From 33 on, the characters can be printed:

$ awk 'BEGIN{for(i=0;i<=41;i+=1){printf "%i %i %c\n", i, i+33, i+33}}' | less

In the command above, the first column shows Phred score, the second column the ASCII number (Phred score + 33) and the third column the corresponding ASCII character.

Exercise: Phred scores

Phred scores and their coding with ASCII characters provide a compact approach for reporting the base quality values. There is one-to-one correspondance between probabilities, Phred scores, ASCII numbers and printable caharacters. Use the command above and the explanation in the theory part to connect the terms correctly.

Exercise 2 in Moodle.

bwa mem

Execute bwa mem to see the available options. We use it with fairly default settings.

$ bwa mem
Usage: bwa mem [options] <idxbase> <in1.fq> [in2.fq]

Algorithm options:
.
.

Although not specified as obligatory above, we need to specify a “readgroup” that will later allow identifying reads that belong to a particular set. There’s a nice explanation of this on the GATK web site. There are minimal requirements for the information and we could just fill those, faking the info that we don’t know.

$ rgroup="@RG\tID:sample-1\tLB:Lib\tSM:sample-1\tPL:Plfm"
$ echo $rgroup

Above, the important field is “SM” standing for “sample”. We repeat the sample name for “ID” and fake the “LB” (library) and “PL” (platform) fields.

Often, the reads for a particular sample come from different lanes, runs or even devices, and are in the end combined in one file. It could happen that we later learn that one of these sources has been contaminated and the reads may be erroneous. It is possible to integrate the information of the data source into the mapping and then use this information to later subset or remove the data based on the information.

Below, we have an example how one can integrate some of this information, in an automated manner, in the ID field:

$ sample=sample-1
$ file=fastq/sample-1_1.fq.gz
$ device=ILLUMINA
$ lib=LIB-1

$ flowcell=`zcat $file |head -1|cut -f3 -d":"`
$ lane=`zcat $file |head -1|cut -f4 -d":"`
$ ID=$sample.$flowcell.$lane

$ echo "@RG\tID:$ID\tSM:$sample\tPL:$device\tLB:$lib"
@RG\tID:sample-1.H3KNNBBXX.3\tSM:sample-1\tPL:ILLUMINA\tLB:LIB-1

The GATK web site has some more ideas. For simplicity, we skip this here.

Given this, we could now create a directory for our results and run bwa mem:

$ mkdir bams
$ bwa mem -R ${rgroup} reference/ninespine_rm.fa fastq/sample-1_1.fq.gz fastq/sample-1_2.fq.gz > bams/sample-1.sam

The mapping takes approximately two minutes. You can see the resulting output with less:

$ less -S bams/sample-1.sam

After the header lines (starting with “@”) come the reads. Each row has the following fields.

# Field Meaning
1 QNAME Query template/pair NAME
2 FLAG bitwise FLAG
3 RNAME Reference sequence NAME
4 POS 1-based leftmost POSition/coordinate of clipped sequence
5 MAPQ MAPping Quality (Phred-scaled)
6 CIGAR extended CIGAR string
7 MRNM Mate Reference sequence NaMe (`=’ if same as RNAME)
8 MPOS 1-based Mate POSition
9 TLEN inferred Template LENgth (insert size)
10 SEQ query SEQuence on the same strand as the reference
11 QUAL query QUALity (ASCII-33 gives the Phred base quality)
12+ OPT variable OPTional fields in the format TAG:VTYPE:VALUE

We’ll return to the details later, but you can now confirm that the fields QNAME, SEQ and QUAL of the first read are the three lines of the first read in the fastq file:

$ samtools view bams/sample-1.sam | head -1 | cut -f1,10,11 | tr '\t' '\n'
$ zcat fastq/sample-1_1.fq.gz | head -4

Can you explain what you see? Remember that the DNA is a double helix!

Confirm also that the reads in the mapping file are in the same order as they were in the FASTQ files, always as pairs with one read from files *_1.fq.gz and *_2.fq.gz:

$ samtools view bams/sample-1.sam | head -4 | cut -f1
$ zcat fastq/sample-1_1.fq.gz | head -8 | grep "^@"
$ zcat fastq/sample-1_2.fq.gz | head -8 | grep "^@"


Mapping and processing simultaneously

The output of bwa mem is in SAM-format which is a human readable text file. The downstream programs need the data in the binary BAM-format and we could convert SAM to BAM with samtools view. After conversion to BAM, we would use samtools fixmate to mark the pair-end reads (that are now next to each other and thus recognisable) as “mates”; sort the data with samtools sort and finally index the file with samtools index.

We could write all these commands separately but it is more elegant to write them as one command, piping the data from one command to the next. Moreover, we could use variables to hold the file paths and make our commands (i) cleaner and (ii) reusable.

$ module load samtools

$ ref=reference/ninespine_rm.fa
$ smp=sample-1

$ bwa mem -R ${rgroup} $ref fastq/${smp}_[12].fq.gz \
  | samtools fixmate -m - -  \
  | samtools sort - -o bams/${smp}_bwa.bam \
  && samtools index bams/${smp}_bwa.bam

Here, bwa mem sends the mapped data to samtools fixmate which send it to samtools sort. We don not need to make the format conversion as many samtools commands accept both SAM and BAM; by default, they write out BAM. Technically, samtools index is a separate command and will be executed after finalising the piped commands (as it should, the file cannot be indexed before its complete). The extra dashes in the commands tell the program to use STDIN or STDOUT instead of regular files.

Above, we used commands such as:

$ samtools view -h bams/sample-1.sam | less -s

but, as SAM files are text files, we could also have written:

$ less -S bams/sample-1.sam

You can try using less for a BAM file (less -S bams/sample-1_bwa.bam) and find out that it won’t work. With BAMs, we have to read the file with samtools view and pipe that to less to control the output:

$ samtools view -h bams/sample-1_bwa.bam | less -s

We can compare the output of bwa mem and the more complex pipeline of commands:

$ samtools view bams/sample-1.sam | head -4 | cut -f1,3,4
$ samtools view bams/sample-1_bwa.bam | head -4 | cut -f1,3,4

We see that in the first file, the reads are in the order that they were mapped (i.e. the order they were in the input data). In the second file, they are sorted, first by the contig (field 3) and then by position (field 4).

We can visualise the mapped data using samtools tview:

$ samtools tview bams/sample-1_bwa.bam reference/ninespine_rm.fa -p ctg7180000005484:13100

You can find the instructions for its usage by typing the command name only (samtools tview) and pressing ? inside the program. Useful commands within the program are . and n. Find out what they mean.

Exercise: samtools tview

Use samtools tview to see the region starting from position 33100 in the contig “ctg7180000005484”. Which view matches that region?

View 1

View 1

View 2

View 2

View 3

View 3

View 4

View 4

Exercise 3 in Moodle.

Realignment around indels

One problem of bwa mem is that it aligns the reads separately and do not consider the alignment of other reads. This can cause errors around indels (=insertions and deletions). We use GATK to realign those regions although the developers of the tool say that this is no more necessary when using a pipeline like ours. I disagree with this and have seen the results with other GATK tools to change by realignment of the reads. Moreover, other tools use BAM files as such, and making the BAMs as good as possible is worthwhile.

Because of the decisions of the GATK development team, the realignment tools are no more a part of the GATK (v.4 and higher) packages. We use them from GATK v3. Due to licensing issues, CSC cannot provide these as a model and we have to use an installation of my own.

The process consists of two parts. First, we create a list of regions to realign:

$ gatk3=/projappl/project_2000598/bin/gatk3
$ ref=reference/ninespine_rm.fa
$ smp=sample-1
$ bams=bams

$ $gatk3 -T RealignerTargetCreator \
   -R ${ref} -I $bams/${smp}_bwa.bam -o $bams/${smp}.intervals

then, realign the BAM files around the target:

$ $gatk3 -T IndelRealigner \
   -R ${ref} -I $bams/${smp}_bwa.bam -targetIntervals $bams/${smp}.intervals -o $bams/${smp}_real.bam

$ samtools markdup $bams/${smp}_real.bam $bams/${smp}_mdup.bam
$ samtools index $bams/${smp}_mdup.bam

The last two commands mark the duplicates and index the resulting file.

Important

Note that in the commands starting:

$ $gatk3 ...

the first dollar sign is the prompt symbol and is not copied to your command, but the second dollar sign indicates that “$gatk3” is variable and must be copied to your command.

As a part of short-read sequencing, the samples undergo PCR amplification. As a result, the pool of fragments to be sequenced may contain PCR duplicates of exactly the same sequence. On the other hand, a cluster on the sequencing chip (where the reactions happen) may occasionally spread such that it forms two neighbouring spots containing the same sequence. These are called PCR and optical duplicates, respectively.

Such duplicates appear as independent reads in the data but, in reality, do not represent independent observations of the genome. The odds that two random fragments of a genome have exactly the same endpoints by chance (and then produce identical sequencing reads) are vanishingly small. Given that, the read-pairs with exactly the same end points are considered duplicates and all but one copy of them is marked as “duplicate”. The data are not removed but analysis methods aware of the practice do not consider them in calculations.

The information about duplicates is included in the FLAG field of SAM/BAM files. The format specification of SAM is at https://samtools.github.io/hts-specs/SAMv1.pdf, the FLAG field is explained on page 7.

Binary numbers are explained on EEB-021. In a FLAG, the zeros and ones can be seen as switches – ON vs OFF – such that the combination of switches makes the full binary number; for practical reasons, this binary number is converted to a decimal number for the samtools command. A useful tool to play with flags is at https://broadinstitute.github.io/picard/explain-flags.html. If we check “read is PCR or optical duplicate”, the SAM Flag is “1024”. We can read the instruction with the command:

$ samtools view

and learn that -f FLAG or --require-flags FLAG means “Only include in output reads that have all of the FLAGs present”. With that information, we can now select only those reads that have been marked as duplicates:

$ samtools view -f 1024 $bams/${smp}_mdup.bam | less -S

There are lots of them. We can select the first four of them and output four fields (look at the table above to see what these are):

$ samtools view -f 1024 $bams/${smp}_mdup.bam | head -4 | cut -f2,3,4,8
1187    ctg7180000005484    19207   19358
1123    ctg7180000005484    19294   19434
1107    ctg7180000005484    19358   19207
1171    ctg7180000005484    19434   19294

First we notice that none of them has FLAG value “1024”. Pick the first of these, “1187”, and copy it to “SAM Flag” field on the “Decoding SAM flags” web page and click “Explain”. That should explain it. You can confirm all other flag values if still in doubt.

Next we notice that the numbers are repeated: “19358” is the last column on the 1st row, and the second last column on the 3rd row; the same for “19434” on the 2nd and 4th row.

These are mate pairs that we marked earlier and when marking reads as duplicates, mate pairs are always marked together.

We can confirm the duplicates visually:

$ samtools tview $bams/${smp}_mdup.bam $ref -p ctg7180000005484:19206
$ samtools tview $bams/${smp}_mdup.bam $ref -p ctg7180000005484:19357

The two reads with exactly the same start and end points are near the bottom.

Exercise: Information in a SAM/BAM file

Execute the command:

$ samtools view bams/sample-1.sam | head -1

and study the SAM flag of this one read here. What does it tell about this read?

Exercise 4 in Moodle.

Variant calling

In the variant calling, we approximately follow the GATK guidelines for “Germline short variant discovery (SNPs + Indels)”. GATK v.4+ is provided as a module on Puhti and we start with its tool HaplotypeCaller. You can see the tool documentation here.

Using the earlier approach, the command looks like this:

$ ref=reference/ninespine_rm.fa
$ smp=sample-1
$ bams=bams
$ gvcfs=gvcfs
$ mkdir $gvcfs

$ module load gatk

$ gatk HaplotypeCaller \
    -R ${ref} -I $bams/${smp}_mdup.bam \
    -O $gvcfs/${smp}.gvcf.gz -ERC GVCF

Here, the only optional parameter is -ERC GVCF that tells to the GVCF approach for reporting the confidence on non-variant sites. We’ll have a closer look on that elsewhere.

Important

Note that two versions of GATK differ on the format for selecting the analysis tool: in GATK v.3 one has to use the parameter -T to select the tool (gatk3 -T <toolname>) but in GATK v.4 the tool name comes without any parameter (gatk4 <toolname>).

The run takes less than five minutes. You can look at the resulting file with less but a more formal way is to use the program bcftools from the “samtools” module:

$ less -S gvcfs/sample-1.gvcf.gz
$ bcftools view gvcfs/sample-1.gvcf.gz | less -S

you have to browse a few hundred rows down to see the calls. They start after “#CHROM …”.

Exercise: GVCF data columns

GVCF and VCF files consist of a (potentially long) header that explains the process of how the data was generated as well as the contents of the data part, and a table that lists the variant data using the terms described in the header. The lines in the header part start with two hash characters (##); the line describing the data columns starts with one hash character (#), and the variant data has no hash characters. Study the file using the command:

$ less -S gvcfs/sample-1.gvcf.gz

Exercise 5 in Moodle.

Doing it all at once with SLURM

This seems like a lot of work to map and align a single sample and we may have hundreds of them in a study. Fortunately multiple commands can be collected into a script and this script may then be automated to handle multiple samples. Below is a naive approach to map all samples in the directory “fastq”. It works for our one sample but, with many more samples, the 15 minutes time limit of the partition “test” would quickly run out. Furthermore, the idea of computer cluster is to parallelise tasks (i.e. run them simultaneously), not to serialise the tasks as the loop over files below does.

Nevertheless, you can copy the script using the Heredoc format:

$ cat > map_and_call.sh << 'EOF'
#!/bin/bash
#SBATCH --time=15:00
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=2
#SBATCH --mem=6G
#SBATCH --partition=test
#SBATCH --account=project_2000598
#SBATCH --output=logs/map_and_call-%j.out

gatk3=/projappl/project_2000598/bin/gatk3
ref=reference/ninespine_rm.fa

fastq=fastq
bams=bams
gvcfs=gvcfs

module load bwa
module load gatk
module load samtools 

ls $fastq/sample-*1.fq.gz | xargs -n1 -I% basename % _1.fq.gz | while read smp; do

  rgroup="@RG\tID:$smp\tLB:Lib\tSM:$smp\tPL:Plfm"

  bwa mem -t2 -R ${rgroup} ${ref} $fastq/${smp}_1.fq.gz $fastq/${smp}_2.fq.gz \
    | samtools fixmate -m - -  | samtools sort - -o $bams/${smp}_bwa.bam \
    && samtools index $bams/${smp}_bwa.bam

  $gatk3 -T RealignerTargetCreator \
    -R ${ref} -I $bams/${smp}_bwa.bam -o $bams/${smp}.intervals
    
  $gatk3 -T IndelRealigner \
    -R ${ref} -I $bams/${smp}_bwa.bam -targetIntervals $bams/${smp}.intervals -o $bams/${smp}_real.bam

  samtools markdup $bams/${smp}_real.bam $bams/${smp}_mdup.bam
  samtools index $bams/${smp}_mdup.bam

  gatk HaplotypeCaller \
    -R ${ref} -I $bams/${smp}_mdup.bam -O $gvcfs/${smp}.gvcf.gz -ERC GVCF

done

EOF

You can then create the log directory and check that the script looks good:

$ mkdir logs
$ less map_and_call.sh

If you are happy with that, you can submit it to the queue:

$ sbatch map_and_call.sh

and check how it progresses there:

$ squeue --me

If everything goes fine, in less than minutes the script should be ready and BAM and GVCF files are created in correct directories. You can browse the log file with the command:

$ less logs/map_and_call-*.out
Important

The last part utilises the SLURM job scheduling system at Puhti. You can rehearse the vocabulary and concepts used at https://intscicom.biocenter.helsinki.fi/section5.html#running-jobs-on-a-linux-cluster


Take-home message

Short-read sequencing data are provided in FASTQ format that consists of a header line, a sequence line and a base-call quality line. The reads are aligned – or mapped – to a reference genome; the output of that is provided in SAM/BAM format that retains everything from the FASTQ and adds information about the mapping. In the variant calling, the differences between the sample and the reference are identified, reducing the size of the data dramatically. In the analysis of many samples, a common practice is to report the variants in GVCF format that 1) lists the observed differences in comparison to the reference genome and 2) reports the certainty that there are truly no variants in the intervening positions. The latter is important as seeing no variants can be caused by the true lack of variants in the existing data and lack of data to observe the variants.