Unix tricks for bioinformagicians

I use these Unix one-liners and such all the time in my bioinformatics work. I hope you’ll find them useful as well!

Reverse complement all sequences in a fasta file. Assumes uppercase bases and that each sequence sits on a single line.

paste -d "\n" \
    <(grep ">" sequences.fasta) \
    <(grep -v ">" sequences.fasta | tr ATGC TACG | rev) \
    > reverse_complemented_sequences.fasta
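A minimal sketch of a variant that also complements lowercase bases and leaves anything else (such as N) untouched, run on a made-up demo file. The file names are hypothetical, and the loop is meant as an illustration rather than something fast enough for large files.

```shell
# Hypothetical two-record demo file, one line per sequence.
printf '>seq1\nAAGC\n>seq2\nacgtN\n' > demo_rc_input.fasta

# Pass header lines through; complement and reverse everything else.
while IFS= read -r line; do
    case $line in
        '>'*) printf '%s\n' "$line" ;;
        *)    printf '%s\n' "$line" | tr 'ACGTacgt' 'TGCAtgca' | rev ;;
    esac
done < demo_rc_input.fasta > demo_rc_output.fasta

cat demo_rc_output.fasta
```

Here AAGC becomes GCTT and acgtN becomes Nacgt.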



Prepare a stability file from paired end reads.

paste \
    <(ls *R1*.fastq | awk -F"_" '{print $1}') \
    <(ls *R1*.fastq) \
    <(ls *R2*.fastq) \
    > stability.txt
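The paste approach relies on the R1 and R2 listings lining up. A rough sketch of a loop that derives each mate's name instead and skips reads without a partner, assuming the files are named like sample_..._R1_....fastq (the demo_reads directory and file names below are made up):

```shell
# Hypothetical demo reads (empty files are enough to show the pairing).
mkdir -p demo_reads
touch demo_reads/sampleA_S1_R1_001.fastq demo_reads/sampleA_S1_R2_001.fastq \
      demo_reads/sampleB_S2_R1_001.fastq demo_reads/sampleB_S2_R2_001.fastq

(
    cd demo_reads || exit 1
    for r1 in *_R1*.fastq; do
        # Derive the mate's name rather than trusting two listings to line up.
        r2=$(printf '%s\n' "$r1" | sed 's/_R1/_R2/')
        if [ ! -e "$r2" ]; then
            echo "no R2 file found for $r1" >&2
            continue
        fi
        # Sample name = everything before the first underscore.
        printf '%s\t%s\t%s\n' "${r1%%_*}" "$r1" "$r2"
    done > stability.txt
)

cat demo_reads/stability.txt
```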



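Convert a fastq to a fasta. Assumes four-line records and header lines without commas, since a comma is used as a temporary field separator. The leading "@" of each header is replaced by ">".

paste -d, - - - - < file.fastq | awk -F, '{print ">" substr($1, 2) "\n" $2}' > file.fasta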
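If the headers might contain the separator character, one alternative is to skip the pasting and pick lines by position. A small sketch with awk, still assuming strict four-line records; the demo file names are made up:

```shell
# Hypothetical demo file; note the comma in the first quality string.
printf '@read1\nACGT\n+\nII,I\n@read2\nGGCC\n+\nIIII\n' > demo_reads.fastq

# Line 1 of each record is the header (swap "@" for ">"), line 2 the sequence.
awk 'NR % 4 == 1 {print ">" substr($0, 2)}
     NR % 4 == 2 {print}' demo_reads.fastq > demo_from_fastq.fasta

cat demo_from_fastq.fasta
```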



Trim the first 24 and last 20 bases from each sequence in a fasta file.

paste -d "\n" \
    <(grep ">" sequences.fasta) \
    <(grep -v ">" sequences.fasta | cut -c25- | rev | cut -c21- | rev)



Remove line breaks in a fasta file. Credit goes to this thread.

awk '/^>/ {print s ? s "\n" $0 : $0; s=""; next} \
          {s=s sprintf("%s", $0)} END {if (s) print s}' \
    sequences.fasta > sequences_with_no_line_breaks.fasta
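Several of the recipes here treat a fasta as strictly alternating header and sequence lines, so it can be worth checking that after unwrapping. A minimal sketch on a made-up file (file names are hypothetical, and the awk program is a simplified take on the same idea, not the code from the credited thread):

```shell
# Hypothetical wrapped demo file: record "a" is split over two lines.
printf '>a\nAC\nGT\n>b\nGG\n' > wrapped_demo.fasta

# Join the sequence lines of each record into one line.
awk '/^>/ {if (s != "") print s; print; s = ""; next}
     {s = s $0}
     END {if (s != "") print s}' wrapped_demo.fasta > unwrapped_demo.fasta

# A single-line fasta has exactly two lines per header.
headers=$(grep -c '^>' unwrapped_demo.fasta)
lines=$(awk 'END {print NR}' unwrapped_demo.fasta)
if [ "$lines" -eq $((headers * 2)) ]; then
    echo "OK: one sequence line per header"
else
    echo "still wrapped, or empty records present" >&2
fi
```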



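Split a fasta file into size bins (intervals of 10 bases), one output file per bin (e.g. 120.fas). Assumes each sequence is on a single line and that headers contain no commas.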

cat file.fasta | paste -d, - - | \
    awk -F, '{f = int(length($2)/10)*10 ".fas"; print $1 "\n" $2 > f}'



Make an indexed fasta from a list of sequences (one per line). Each sequence gets its line number as its header.

awk '$0 = ">" NR "\n" $0' sequences.seq > sequences.fasta



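Clip adapters from a bunch of fastq files. Depends on FASTX-Toolkit. Note that fastx_clipper removes adapter sequences (pass your own with -a, otherwise a dummy default is used); for quality-based trimming, the toolkit's fastq_quality_trimmer is the tool to reach for.

for file in *.fastq; \
    do echo "$file"; fastx_clipper -Q 33 -i "$file" -o "${file}_trimmed"; done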



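Intersperse paired end reads from two fastq files into a single fastq file. A tab is used as the temporary separator, because "%" is a valid Phred+33 quality character and could appear in the quality lines.

paste -d "\n" \
    <(paste - - - - < reads1.fastq) \
    <(paste - - - - < reads2.fastq) | tr '\t' '\n' \
    > joined_reads.fastq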
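A small demo of the same record-wise pasting on made-up files, written without process substitution for shells that lack it. It uses a tab as the temporary separator, since quality strings can contain "%" but not tabs. File names and reads are hypothetical.

```shell
# Hypothetical demo reads; some quality strings contain "%" on purpose.
printf '@r1/1\nACGT\n+\nII%%I\n@r2/1\nGGCC\n+\nIIII\n' > demo_R1.fastq
printf '@r1/2\nTTAA\n+\nIIII\n@r2/2\nCCGG\n+\n%%III\n' > demo_R2.fastq

# Collapse each 4-line record onto one tab-separated line.
paste - - - - < demo_R1.fastq > demo_R1.tab
paste - - - - < demo_R2.fastq > demo_R2.tab

# Alternate records from the two files, then restore the line breaks.
paste -d '\n' demo_R1.tab demo_R2.tab | tr '\t' '\n' > demo_interleaved.fastq
rm demo_R1.tab demo_R2.tab

cat demo_interleaved.fastq
```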



Calculate the number of entries in a fasta file.

grep -c "^>" file.fasta



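Calculate the length distribution (bin size 1000) of sequences in a fasta file. Each bin is labeled by its upper bound; the END block makes sure the last sequence is counted too.

awk '/^>/ {if (len != 0) print int(len/1000)*1000+1000; \
    len = 0; next} {len += length($0)} \
    END {if (len != 0) print int(len/1000)*1000+1000}' sequences.fasta | \
    sort | uniq -c | sort -rn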
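To see the binning in action, here is a sketch on a generated demo file with sequences of 500, 1500 and 1200 bases. The file names and the helper functions rep and upper_bound are made up for the example.

```shell
# Hypothetical demo: three sequences, the second one wrapped over two lines.
awk 'function rep(ch, n,   s, i) { for (i = 0; i < n; i++) s = s ch; return s }
BEGIN {
    print ">a"; print rep("A", 500)
    print ">b"; print rep("C", 1000); print rep("C", 500)
    print ">c"; print rep("G", 1200)
}' > demo_lengths.fasta

# Sum the line lengths per record and emit the bin label at each new header;
# the END block handles the final record, which has no header after it.
awk 'function upper_bound(n) { return int(n / 1000) * 1000 + 1000 }
     /^>/ {if (len != 0) print upper_bound(len); len = 0; next}
     {len += length($0)}
     END {if (len != 0) print upper_bound(len)}' demo_lengths.fasta |
    sort -n | uniq -c | sort -rn > demo_length_bins.txt

cat demo_length_bins.txt
```

The result is two sequences in the 2000 bin and one in the 1000 bin. A sequence of exactly 1000 bases would land in the 2000 bin, since the label is an exclusive upper bound.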