Introducing Awk

In this entry I’m going to explain the very basics of a little programming language called Awk. The primary reason why I’m advocating the use of Awk (and not for instance Perl or Python) is that Awk is absolutely great for writing very short programs, or one-liners. In fact, in my opinion Awk should never be used for anything longer than one-liners – if your task appears to demand a longer piece of code, it will probably be wiser to choose Perl or Python instead. Combined with the brevity of Unix pipes and process substitution, the concise syntax of Awk permits extremely powerful one-liners.

The first example in this entry describes constructing a stability file for Mothur, described here. You can test everything described in this entry using this example data set. After uncompressing the data set, check the fastq files in the directory using ls *.fastq. You should see the following:

F3D0_S188_L001_R1_001.fastq
F3D0_S188_L001_R2_001.fastq
F3D141_S207_L001_R1_001.fastq
F3D141_S207_L001_R2_001.fastq
F3D142_S208_L001_R1_001.fastq
F3D142_S208_L001_R2_001.fastq
F3D143_S209_L001_R1_001.fastq
F3D143_S209_L001_R2_001.fastq
F3D144_S210_L001_R1_001.fastq
F3D144_S210_L001_R2_001.fastq
F3D145_S211_L001_R1_001.fastq
F3D145_S211_L001_R2_001.fastq
...

Each fastq file is present in two flavors: R1 in the name indicates a forward read from Illumina MiSeq and R2 a reverse read. Mothur has a built-in functionality to join these reads into single contigs – however for that it needs something called a stability file. A stability file is a tab-separated table containing the sample names and the associated forward and reverse read file names. Mothur tutorial provides such file but does not describe how it can be constructed.

F3D0 F3D0_S188_L001_R1_001.fastq F3D0_S188_L001_R2_001.fastq
F3D141 F3D141_S207_L001_R1_001.fastq F3D141_S207_L001_R2_001.fastq
F3D142 F3D142_S208_L001_R1_001.fastq F3D142_S208_L001_R2_001.fastq
F3D143 F3D143_S209_L001_R1_001.fastq F3D143_S209_L001_R2_001.fastq
F3D144 F3D144_S210_L001_R1_001.fastq F3D144_S210_L001_R2_001.fastq
...

Columns two and three of a stability file correspond to the forward and reverse read file names, respectively, and column one corresponds to the beginning of the file name that in this example is also the sample name. Extracting the sample names from the file names is a task for which we need Awk. To cut the chase, here’s how it’s done:

paste <(ls *R1*.fastq | awk -F_ '{print $1}') <(ls *R1*.fastq) <(ls *R2*.fastq)

If you’ve read the previous entry, you might recognize the process substitution syntax – that is, the <( ... ) -structures that are feeding stuff to the paste command. Please review the previous entry if you’re not familiar with process substitution.

Awk can be used for a lot of things but one of the most common uses is splitting lines at particular characters. The -F switch specifies the character by which the line is to be chopped. The resulting pieces can be accessed using variables $1, $2, $3 etc. The entire non-chopped line is also available under variable $0. Armed with this information, let’s take apart the first process substitution part of the command pipeline above:

ls *R1*.fastq | awk -F_ '{print $1}'

What’s happening here is that we’re first getting a list of files that have letters R1 in their name and a suffix fastq , and piping this to Awk where it’s processed line by line. First we’re telling Awk to split the lines at each occurrence of ‘_’ by providing the -F_ switch. Next comes the actual Awk program – that is the part specified inside the ‘-characters. For reasons that I will not discuss here, all print statements in Awk must be surrounded by curly brackets. By specifying variable $1 for the print statement, the first part of the splitted line gets printed. Experiment by yourself and see what happens if you choose $2, $3 or $0 instead.

You can also add your own text to the Awk print statement, for instance

ls *R1*.fastq | awk -F_ '{print "I am sample " $1}'

In this example it really makes no sense but in other cases can be very useful. One simple example is converting a fastq to a fasta. This is done simply by taking the two first lines of each fastq entry and prepending the first line by “>”-character. Here’s how it goes:

cat file.fastq | paste -d, - - - - | awk -F, '{print "&gt;" $1 "\n" $2 "\n"}' &gt; file.fasta

What’s going on here is that the four successive lines of a fastq are concatenated into a single line separated by commas. This functionality is provided by the paste -d, - - - - construct, where the ‘-d,’ switch tells paste to use comma as a line separator, and the four ‘-’-characters inform paste to read in four consecutive lines at a time. The output from paste – the four concatenated lines – is piped to Awk. Awk is told to use comma as line splitting character – this is how we parse the concatenated lines of fastq back into separate lines, available in variables $1, $2, $3 and $4. We only need variables $1 and $2 here since they contain the fasta ID line and the DNA sequence, respectively. We need to add ‘>’ before the ID line and provide newline characters ‘\n’ between the ID and sequence lines. Please examine the print statement to get the idea of how it works and experiment by adding your own text. Any of the fastq files in the example data set can be used for experimenting.

Awk is an extremely versatile and powerful tool and I have only scratched the surface of its uses here. I will cover more examples in future posts – if you’re feeling curious in the meantime, check my Unix tricks page!

Advertisements
Posted in Bioinformatics

Pipes, process substitution and why should a biologist ever care

This is the first entry in my series about how to keep DNA sequence processing as simple as possible. Each entry attempts to teach a useful Unix trick or two, focusing on relevance for biologists who might not have much prior Unix experience. Because of the very clever design choices made by the folks behind Unix, every trick can be combined with everything else in ways that lead to a much greater flexibility and expressiveness than any of the tricks alone – I attempt to demonstrate that in the following posts. These tricks work on Bash and Zsh, so be sure that your Unix profile is running one of them1.

A lot of the computational work that a modern genetically inclined biologist has to do is actually just converting data between incompatible formats and other trivial processing tasks – think converting fastq to fasta, trimming off adapter sequences, removing problematic characters from fasta ids etc… A lot of labs have their stash of Perl and Python scribbles to handle these tasks and certainly these scripts do their thing. However, I personally prefer doing as much of this as possible using just the plain Unix command line and avoid writing scripts if possible. The brevity of expression that Unix command line provides for simple text and sequence processing tasks is about unparalleled. Plus the commands are available on any Unix system, right out of the box.

What’s the deal with pipes?

Unix command line and standard tools were not designed for DNA sequence processing but nevertheless do that really well. The basic Unix commands form a vocabulary of simple text and sequence processing operations and this vocabulary can be combined into sentences using Unix pipes. For a better introduction for pipes than I can provide here, this video tutorial will get you started. Piping is an easy but powerful concept and definitely worth learning.

Here are three common commands in the Unix text and sequence processing vocabulary:

  • cat prints contents of a file on screen or sends them down a pipe
  • rev reverses any line of text
  • tr replaces a set of characters with a different set of characters

Let’s put some of these into practice. A simple example would be taking reverse-complements of DNA sequences in file sequences.txt:

sequences.txt

AAGTGTCGTA
GGTCGCAAATG
GAAGTGCCTAG
TGAAGATGC

A very simple command pipeline to do the task would be

cat sequences.txt | rev | tr AGCT TCGA

This crunches through the file sequences.txt, first reversing each sequence and then converting each into a respective complement, line by line. The output is – not surprisingly – a reverse-complement of each sequence:

TACGACACTT
CATTTGCGACC
CTAGGCACTTC
GCATCTTCA

A trick I use all the time is a minor variant of this: during primer design I reverse-complement sequences a lot and want to do that with a minimal typing effort. In that case I won’t even bother to create a file; rather I just send a text string to rev and tr using command echo and the following pipeline2.

echo TGGTAGTGTAACGCCTACA | rev | tr AGCT TCGA

Pipes can be used to join pretty much any Unix commands together. After all my years with Unix, I’m still discovering new useful ways of piping stuff together. I’ll be coming back to pipes again and again in my future posts.

When are pipes not enough?

Pipes are great but not always enough. A good example is a more realistic case of reverse-complementing sequences: in real world DNA sequences are typically not provided as simple lists. Rather, a typical format would be fasta where the sequence ids are interspersed by the DNA sequences, as shown in the example below:

example.fasta

>Sequence_id_1
AATTGATAGGTCCGTT
>Sequence_id_2
TTGTCGTTATATGGTTAC
>Sequence_id_3
GTTGAGTCGTGGTCGT

With the kind of command pipeline we used above, the sequence ids would also get reversed and their letter As translated into Ts etc… How could we get around this?

I first read about process substitution in this excellent blog entry. Don’t be put off by a scary name – the concept is about as simple as Unix pipes and about as useful. Where piping directs the output of one command to the input of next one, process substitution directs outputs of multiple commands to the input of next one. The figure below attempts to illuminate the concepts and their difference.

mockup

 
Let me demonstrate this using an example. To reverse-complement every sequence in a multiple fasta file and leave the sequence ids untouched, one has to separate the id and sequence parts. For this we also have to introduce two new words into our text processing vocabulary:

  • paste joins together two lines of text
  • grep finds lines containing (or excluding) a particular text string or character
paste -d "\n" <(grep ">" sequence.fasta) <(grep -v ">" sequence.fasta | rev | tr ATCG TAGC)

That’s it3 – one line of text and quite readable if you get used to the process substitution notation.

Let’s open this up a little: after the paste command there are two text strings enclosed inside a <( ... ) -structure. This structure is used to create a stream of text. Since we’re having two of those structures after paste, we’re directing two separate text streams to the paste command. Paste takes these text streams and intersperses their lines.

The first text stream is the fasta headers – no processing is taking place for those.

grep ">" sequence.fasta

The second text stream is the sequence parts of the fasta file (filtered by grep using the -v switch) that are, before streaming out to paste, streamed through rev and tr to take a reverse-complement of each sequence.

grep -v ">" sequence.fasta | rev | tr AGCT TCGA

The -d “\n” switch instructs paste to use a newline character as a separator for the text streams. See for yourself what happens if you leave that out!

The output of our command pipeline currently gets printed on screen where we can inspect it and see that it’s good. However, we typically want to save the output on disk for subsequent use. There’s a very easy way to redirect the text output from screen to a file – that is the redirection operator >. Just append your command pipeline with character > and the file name, and there you have it. The example below saves the reverse-complemented sequences into a new file called reverse_complemented_sequences.fasta.

paste -d "\n" <(grep ">" sequence.fasta) <(grep -v ">" sequence.fasta | rev | tr ATCG TAGC) &gt; reverse_complemented_sequences.fasta

In this example we only modified the sequence part by taking a reverse-complement. Basically one could do many other modifications to both the sequence and id parts, or convert the file into some entirely different format. There are other standard Unix tools that provide more neat text manipulation tricks and what I find especially useful is a little programming language called Awk. But that’s another story!


  1. If you’re not sure which shell you’re running, just run chsh -s /bin/bash on your prompt. That will enable bash for you. You don’t have to worry too much about what this means – Bash is a good general choice for a lot of purposes.
  2. Writing this out every single time is actually not lazy enough for me – I will cover lazy stuff like aliases and text editor tricks in future posts.
  3. Things get more complicated if the sequence part of the fasta file spans multiple lines, as it often does. The way to deal with that is a bit more involved and will be a topic of a future post.
Posted in Bioinformatics