AWK & Computational Biology

February 3, 2013

My new book gives a concise introduction to AWK and its application in the fields of bioinformatics and computational biology. Take a look here: web

AWK & MySQL

December 13, 2012

Have you ever wanted to access MySQL from an AWK script? SPAWK offers one solution.

Compare Strings

July 9, 2012

Ever needed to compare two strings? This script compares two text strings of equal length letter by letter and prints the number and percentage of matches and mismatches. Save it as compare-strings.awk. Enjoy.

# USAGE: awk -f compare-strings.awk string1 string2
BEGIN{
a = ARGV[1]; b = ARGV[2]
la = length(a); lb = length(b)
if(la!=lb){print "ERROR: UNEQUAL STRING LENGTH"; exit 1}
if(la==0){print "ERROR: EMPTY STRINGS"; exit 1}
summatch = 0; summismatch = 0   # initialise so that a zero count prints as 0
for(row=1;row<=la;row++){
  ai = substr(a,row,1)
  bi = substr(b,row,1)
  if(ai == bi){summatch++}
  else{summismatch++}
}
print "LENGTH = "la
print "MISMATCHES = "summismatch" ("summismatch/la*100"%)"
print "MATCHES = "summatch" ("summatch/la*100"%)"
}

Levenshtein Distance

January 23, 2012

Have you ever wanted to calculate the Levenshtein distance between two strings, i.e. the minimum number of single-character insertions, deletions, and substitutions needed to convert STRING1 into STRING2? Here is an AWK script that uses dynamic programming and a two-dimensional array. Save the file as levenshtein.awk and have fun …

# USAGE: awk -f levenshtein.awk STRING1 STRING2
BEGIN{
one = ARGV[1]; two = ARGV[2]
print one"  "two
print "Levenshtein Distance: "distance(one, two)
}
# names after the extra spaces in a parameter list are local variables
function distance (a, b,    la, lb, row, col, ai, bi, cost, m){
  la = length(a); lb = length(b)
  if(la==0){return lb}; if(lb==0){return la}
  # first column and first row: cost of deleting or inserting every character
  for(row=0;row<=la;row++){m[row,0] = row}
  for(col=1;col<=lb;col++){m[0,col] = col}
  for(row=1;row<=la;row++){
    ai = substr(a,row,1)
    for(col=1;col<=lb;col++){
      bi = substr(b,col,1)
      if(ai == bi){cost = 0}
      else{cost = 1}
      # cheapest of deletion, insertion, and substitution (or match)
      m[row,col]=min(m[row-1,col]+1,m[row,col-1]+1,m[row-1,col-1]+cost)
    }
  }
  return m[la,lb]
}
function min (a, b, c,    result){
  result = a; if(b < result){result = b}; if(c < result){result = c}
  return result
}
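
A quick way to check the recurrence is a known example: turning kitten into sitting takes three edits (k to s, e to i, append g). The sketch below wraps the same dynamic-programming table in a shell function so that it can be called repeatedly. The names lev and min3 are made up for this illustration, and the strings are passed with -v, which means backslashes in them would be interpreted as escape sequences.

```shell
# lev STRING1 STRING2 (hypothetical helper name): prints the Levenshtein distance
lev() {
  awk -v a="$1" -v b="$2" '
  function min3(x, y, z,    r) { r = x; if (y < r) r = y; if (z < r) r = z; return r }
  BEGIN {
    la = length(a); lb = length(b)
    # row 0 and column 0 hold the cost of building a prefix from nothing
    for (i = 0; i <= la; i++) m[i, 0] = i
    for (j = 0; j <= lb; j++) m[0, j] = j
    for (i = 1; i <= la; i++)
      for (j = 1; j <= lb; j++) {
        cost = (substr(a, i, 1) == substr(b, j, 1)) ? 0 : 1
        m[i, j] = min3(m[i-1, j] + 1, m[i, j-1] + 1, m[i-1, j-1] + cost)
      }
    print m[la, lb]
  }'
}

lev kitten sitting   # prints 3
lev flaw lawn        # prints 2
```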

MySQL Tool

December 15, 2011

Wow, the year 2011 is almost over and I did not post anything … Well, let me use this last chance to present a script that determines the maximum field length for every field in a tab-delimited file. This might help you create an appropriate MySQL table.

awk -F"\t" '
{
if(NF>maxnf){maxnf=NF}
for(i=1; i<=NF; i++){
  if(length($i)>field[i]){
    field[i]=length($i)
  }
}
}
END{
for(x=1; x<=maxnf; x++){
  print field[x]+0
}
}' filename.tab
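
As a rough sketch of how the lengths could feed into a table definition, the following variant prints a CREATE TABLE statement instead of bare numbers. The function name make_table, the table name mytable, the column names col1, col2, … and the choice of VARCHAR for every column are assumptions made for this illustration; real data will usually call for proper names and types.

```shell
# make_table [FILE] (hypothetical helper): reads tab-delimited data from FILE
# or stdin and prints a CREATE TABLE statement with one VARCHAR per column
make_table() {
  awk -F"\t" '
  {
    if (NF > maxnf) maxnf = NF
    for (i = 1; i <= NF; i++)
      if (length($i) > field[i]) field[i] = length($i)
  }
  END {
    print "CREATE TABLE mytable ("
    for (x = 1; x <= maxnf; x++)
      printf "  col%d VARCHAR(%d)%s\n", x, field[x], (x < maxnf ? "," : "")
    print ");"
  }' "$@"
}

# made-up sample data with three columns
printf 'id1\tACGT\tx\nid22\tAC\tyyy\n' | make_table
```

For the sample data this prints col1 VARCHAR(4), col2 VARCHAR(4) and col3 VARCHAR(3) between the opening and closing lines.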

Calculate GC-Content

July 23, 2010

Given a tabular file (ID – SEQ; see fasta2tbl) with DNA sequences, you can easily calculate the GC content (as a fraction between 0 and 1) and print it together with the ID.

awk '{print $1"\t"gsub(/[GCgc]/,"x",$2)/length($2)}' sequences.tbl
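
A small worked example may help: ACGT contains two G/C bases out of four (0.50), and GGCCAT four out of six (0.67). The sketch below counts on a copy of the sequence so that $2 stays untouched, matches both upper and lower case, and rounds to two decimals. The function name gc_content and the sample IDs are invented for this illustration.

```shell
# gc_content [FILE] (hypothetical helper): prints ID and GC fraction per line
gc_content() {
  awk '{
    seq = $2
    n = gsub(/[GCgc]/, "", seq)      # gsub returns the number of replacements
    printf "%s\t%.2f\n", $1, (length($2) ? n / length($2) : 0)
  }' "$@"
}

printf 'seq1\tACGT\nseq2\tGGCCAT\n' | gc_content
# seq1	0.50
# seq2	0.67
```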

Mutate DNA

February 8, 2010

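The following script mutates the DNA sequences in the first column of a tab-delimited file. Every 1 to 10 nucleotides, a base is replaced by a randomly chosen one (printed in upper case), so it may or may not actually change.
Call it with awk -f script.awk input.seq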

BEGIN{srand(); bases="ATGC"}
{
n=length($1)
seq=tolower($1)
out=""; last=0
# jump ahead 1-10 positions each time and overwrite the base found there
for(i=int(rand()*10)+1; i<=n; i+=int(rand()*10)+1){
        r=substr(bases,int(rand()*4)+1,1)
        # copy the untouched stretch, then the (possibly) mutated base
        out=out substr(seq,last+1,i-last-1) r
        last=i
}
print out substr(seq,last+1)
}

Random Numbers

February 8, 2010

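To get different random numbers on each run, AWK’s random number generator needs to be seeded; in most AWK implementations rand() otherwise returns the same sequence every time. This is best done by calling srand() in a BEGIN block. The following script prints a random integer between 1 and 4 (inclusive).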

awk 'BEGIN{srand(); print int(rand()*4+1)}'
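
The same idea extends to any integer range: rand() returns a value of at least 0 and below 1, so lo + int(rand() * (hi - lo + 1)) yields an integer from lo to hi. The following sketch wraps this in a shell function; the name rand_int and its optional count argument are assumptions for this illustration. Since srand() without an argument seeds from the time of day, two calls within the same second may print the same numbers.

```shell
# rand_int LO HI [N] (hypothetical helper): prints N random integers in LO..HI
rand_int() {
  awk -v lo="$1" -v hi="$2" -v n="${3:-1}" 'BEGIN {
    srand()
    for (k = 0; k < n; k++)
      print lo + int(rand() * (hi - lo + 1))
  }'
}

rand_int 1 4        # one integer from 1 to 4
rand_int 10 20 5    # five integers from 10 to 20
```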

Find all lines followed by lines containing xyz

November 10, 2009

The following script prints every line of a text file that is directly followed by a line matching the regular expression xyz.

awk 'NR>1 && /xyz/{print prev} {prev=$0}' input.file
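
To see what the script does, here is a small sketch with made-up input. It takes the pattern as a parameter and remembers only the previous line rather than the whole file; the function name before_match is invented for this illustration.

```shell
# before_match REGEX [FILE] (hypothetical helper): prints each line that is
# directly followed by a line matching REGEX
before_match() {
  pat=$1; shift
  awk -v pat="$pat" 'NR > 1 && $0 ~ pat { print prev } { prev = $0 }' "$@"
}

printf 'first\nxyz 1\nsecond\nthird\nxyz 2\n' | before_match xyz
# first
# third
```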

Enjoy.

Include Language Settings

April 20, 2009

If sorting or the like gives strange results, try passing the locale environment variables to AWK.

LC_ALL=en_GB awk '{...}'
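
As a minimal illustration of the effect, the C locale is a convenient reference point because it compares strings byte by byte, so all upper-case letters come before all lower-case ones. The sample letters below are made up; other locales such as en_GB typically interleave upper and lower case instead, depending on what is installed on the system.

```shell
# In the C locale "B" (byte 66) sorts before "a" (byte 97)
LC_ALL=C awk 'BEGIN { r = ("B" < "a"); print r }'
# 1

# The same setting applies to sort when AWK output is piped into it
printf 'b\nA\na\nB\n' | LC_ALL=C sort
# A
# B
# a
# b
```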