My new book gives a concise introduction to AWK and its application in the fields of bioinformatics and computational biology. Take a look here: web
AWK & Computational Biology
February 3, 2013
AWK & MySQL
December 13, 2012
Did you ever want to access MySQL from an AWK script? SPAWK gives you one solution.
Compare Strings
July 9, 2012
Ever needed to compare two strings? This script compares two text strings of equal length character by character and prints the number and percentage of matches and mismatches. Enjoy.
# USAGE: awk -f compare-strings.awk string1 string2
BEGIN {
    a = ARGV[1]; b = ARGV[2]
    la = length(a); lb = length(b)
    if (la != lb) { print "ERROR: UNEQUAL STRING LENGTH"; exit 1 }
    if (la == 0) { print "ERROR: EMPTY STRINGS"; exit 1 }   # avoids division by zero
    summatch = 0; summismatch = 0                            # so that a zero count prints as 0
    for (pos = 1; pos <= la; pos++) {
        if (substr(a, pos, 1) == substr(b, pos, 1)) { summatch++ } else { summismatch++ }
    }
    print "LENGTH = " la
    print "MISMATCHES = " summismatch " (" (summismatch / la * 100) "%)"
    print "MATCHES = " summatch " (" (summatch / la * 100) "%)"
}
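The mismatch count is the Hamming distance of the two strings. If you just want to try the idea without saving a file, a shell wrapper is enough. This is only a minimal sketch: the function name compare_strings is made up for this example, and the strings are passed with -v instead of ARGV (note that -v interprets backslash escapes, which does not matter for plain sequences).

```shell
# compare_strings is a hypothetical helper name, not part of the original script
compare_strings() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        la = length(a)
        if (la != length(b) || la == 0) { print "ERROR: UNEQUAL OR EMPTY STRINGS"; exit 1 }
        match_n = 0
        for (pos = 1; pos <= la; pos++)
            if (substr(a, pos, 1) == substr(b, pos, 1)) match_n++
        # every position is either a match or a mismatch
        printf "LENGTH = %d\nMISMATCHES = %d\nMATCHES = %d\n", la, la - match_n, match_n
    }'
}

compare_strings GATTACA GACTATA
```

For GATTACA and GACTATA this reports a length of 7 with 2 mismatches (positions 3 and 6) and 5 matches.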
Levenshtein Distance
January 23, 2012
Did you ever want to calculate the Levenshtein distance between two strings, i.e. the minimum number of single-character insertions, deletions and substitutions it takes to convert STRING1 into STRING2? Here is an AWK script that employs dynamic programming and two-dimensional arrays. Save the file as levenshtein.awk and have fun …
# USAGE: awk -f levenshtein.awk STRING1 STRING2
BEGIN {
    one = ARGV[1]; two = ARGV[2]
    print one " " two
    print "Levenshtein Distance: " distance(one, two)
}
# m[row, col] holds the distance between the first row characters of a
# and the first col characters of b; the extra parameters are local variables
function distance(a, b,    la, lb, row, col, ai, bi, cost, m) {
    la = length(a); lb = length(b)
    if (la == 0) { return lb }
    if (lb == 0) { return la }
    for (row = 0; row <= la; row++) { m[row, 0] = row }
    for (col = 0; col <= lb; col++) { m[0, col] = col }
    for (row = 1; row <= la; row++) {
        ai = substr(a, row, 1)
        for (col = 1; col <= lb; col++) {
            bi = substr(b, col, 1)
            if (ai == bi) { cost = 0 } else { cost = 1 }
            m[row, col] = min(m[row-1, col] + 1, m[row, col-1] + 1, m[row-1, col-1] + cost)
        }
    }
    return m[la, lb]
}
function min(a, b, c,    result) {
    result = a
    if (b < result) { result = b }
    if (c < result) { result = c }
    return result
}
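To check the recurrence against values that are easy to verify by hand, here is a compact sketch wrapped in a shell function. The names lev and min3 are made up for this example, and the strings come in via -v instead of ARGV; the matrix is filled exactly as in the script above.

```shell
# lev is a hypothetical wrapper name; min3 is a hypothetical helper
lev() {
    awk -v a="$1" -v b="$2" '
    function min3(x, y, z,    r) { r = x; if (y < r) r = y; if (z < r) r = z; return r }
    BEGIN {
        la = length(a); lb = length(b)
        for (i = 0; i <= la; i++) m[i, 0] = i     # delete i characters
        for (j = 0; j <= lb; j++) m[0, j] = j     # insert j characters
        for (i = 1; i <= la; i++)
            for (j = 1; j <= lb; j++) {
                cost = (substr(a, i, 1) == substr(b, j, 1)) ? 0 : 1
                m[i, j] = min3(m[i-1, j] + 1, m[i, j-1] + 1, m[i-1, j-1] + cost)
            }
        print m[la, lb]
    }'
}

lev kitten sitting   # prints 3
lev flaw lawn        # prints 2
```

kitten becomes sitting with two substitutions (k to s, e to i) and one insertion (g), hence 3.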
MySQL Tool
December 15, 2011
Wow, the year 2011 is almost over and I did not post anything … Well, let me use this last chance to present a script that determines the maximum length of each field in a tab-delimited file. This might help you create an appropriate MySQL table.
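awk -F"\t" '
{
    for (i = 1; i <= NF; i++) {
        if (length($i) > field[i]) { field[i] = length($i) }
    }
    if (NF > maxnf) { maxnf = NF }      # remember the widest record
}
END {
    for (x = 1; x <= maxnf; x++) { print field[x] + 0 }
}' filename.tab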
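A small worked example may help. The sketch below reads tab-delimited text from standard input and prints one column type per field. The function name field_widths and the idea of emitting VARCHAR(n) directly are my assumptions for illustration; treat the output as a starting point for a table definition, nothing more.

```shell
# field_widths is a hypothetical name; it reads tab-delimited text from standard input
field_widths() {
    awk -F"\t" '
    {
        for (i = 1; i <= NF; i++)
            if (length($i) > width[i]) width[i] = length($i)
        if (NF > maxnf) maxnf = NF
    }
    END {
        # one line per column, sized to the longest value seen
        for (i = 1; i <= maxnf; i++) printf "VARCHAR(%d)\n", width[i]
    }'
}

printf 'id1\tacgt\tHomo sapiens\nid22\tac\tMus\n' | field_widths
```

For the two sample records this prints VARCHAR(4), VARCHAR(4) and VARCHAR(12), one per line.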
Calculate GC-Content
July 23, 2010
Given a tabular file (ID – SEQ; see fasta2tbl) with DNA sequences, you can easily calculate the GC content of each sequence as a fraction between 0 and 1 and print it together with the ID.
awk '{n = gsub(/[GCgc]/, "x", $2); print $1 "\t" n / length($2)}' sequences.tbl
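The trick is that gsub returns the number of replacements it made, which here equals the number of G and C bases. A minimal sketch with sample data follows. The function name gc_content is made up, and I chose to count on a copy of the sequence, skip empty sequences and round to two decimals; none of that is required.

```shell
# gc_content is a hypothetical name; input is ID<TAB>SEQ on standard input
gc_content() {
    awk -F"\t" 'length($2) > 0 {
        seq = $2
        gc = gsub(/[GCgc]/, "", seq)       # gsub returns the number of replacements
        printf "%s\t%.2f\n", $1, gc / length($2)
    }'
}

printf 'seq1\tacgt\nseq2\tGGCC\nseq3\tatat\n' | gc_content
```

This prints 0.50 for seq1, 1.00 for seq2 and 0.00 for seq3.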
Mutate DNA
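February 8, 2010
The following script mutates DNA sequences in a tab file (the sequence is expected in the first column). Every 1-10 nucleotides a base is replaced by a randomly chosen one, so it may or may not actually change. Replaced bases are printed in upper case, all others in lower case.
Call it with: awk -f script.awk input.seq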
BEGIN { srand() }
{
    n = split(tolower($1), dna, "")     # one array element per base (works in gawk and mawk)
    # jump 1-10 positions ahead, then overwrite that base with a random one
    for (i = int(rand() * 10) + 1; i <= n; i += int(rand() * 10) + 1) {
        dna[i] = substr("ATGC", int(rand() * 4) + 1, 1)
    }
    for (i = 1; i <= n; i++) { printf "%s", dna[i] }
    printf "\n"
}
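For testing it helps to make the mutations reproducible. The following sketch takes the sequence and an optional seed as arguments. The name mutate and the seed parameter are my additions, and it uses substr instead of splitting into an array so it does not rely on the empty-separator behaviour of split.

```shell
# mutate is a hypothetical name; the optional second argument seeds the generator
mutate() {
    awk -v seq="$1" -v seed="$2" 'BEGIN {
        if (seed != "") { srand(seed) } else { srand() }
        seq = tolower(seq); n = length(seq); out = ""; last = 0
        # pick the next position 1-10 bases ahead until the end is passed
        for (i = int(rand() * 10) + 1; i <= n; i += int(rand() * 10) + 1) {
            # copy the untouched stretch, then append one random upper-case base
            out = out substr(seq, last + 1, i - last - 1) substr("ATGC", int(rand() * 4) + 1, 1)
            last = i
        }
        print out substr(seq, last + 1)
    }'
}

mutate acgtacgtacgtacgtacgtacgt 42
```

With the same seed and the same AWK implementation the output repeats; the result always has the length of the input.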
Random Numbers
February 8, 2010
To create random numbers, AWK's random number generator needs to be seeded; otherwise rand() produces the same sequence on every run. This is best done by placing the command srand() in a BEGIN block. The following script prints a random integer between 1 and 4.
awk 'BEGIN{srand(); print int(rand()*4+1)}'
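The same pattern generalises to any integer range: rand() returns a value of at least 0 and below 1, so int(rand() * k) yields 0 to k-1. Here is a minimal sketch; the function name rand_int and its three arguments are made up for this example. Keep in mind that srand() without an argument seeds from the time of day, so two calls within the same second will typically return the same numbers.

```shell
# rand_int is a hypothetical helper: prints N integers between LO and HI (inclusive)
rand_int() {
    awk -v lo="$1" -v hi="$2" -v n="${3:-1}" 'BEGIN {
        srand()
        for (k = 1; k <= n; k++) print lo + int(rand() * (hi - lo + 1))
    }'
}

rand_int 1 4 5
```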
Find all lines followed by lines containing xyz
November 10, 2009
The following script prints every line of a text file that is immediately followed by a line matching the regular expression xyz.
awk 'NR > 1 && /xyz/ {print prev} {prev = $0}' input.file
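Only the previous line has to be remembered for this, so the whole file never needs to sit in memory. A minimal sketch with sample input follows; the function name before_match is made up, and the pattern is passed in as an AWK regular expression.

```shell
# before_match is a hypothetical name; its argument is an awk regular expression
before_match() {
    awk -v pat="$1" 'NR > 1 && $0 ~ pat { print prev } { prev = $0 }'
}

printf 'one\ntwo xyz\nthree\nxyz four\nxyz five\n' | before_match xyz
```

This prints "one", "three" and "xyz four": each of them is directly followed by a line containing xyz.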
Enjoy.
Include Language Settings
April 20, 2009
If sorting or the like gives strange results, try passing locale environment variables to AWK.
LC_ALL=en_GB awk '{...}'
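A variable set this way is also inherited by any command AWK starts itself. The sketch below uses the C locale, which should be available everywhere and sorts by byte value; whether en_GB is installed on your system is something you have to check. The function name sort_c is made up for this example.

```shell
# sort_c is a hypothetical name; awk pipes every line to sort(1), which inherits LC_ALL
sort_c() {
    LC_ALL=C awk '{ print | "sort" }'
}

printf 'b\nA\na\nB\n' | sort_c
```

In the C locale all upper-case letters sort before the lower-case ones, so the result is A, B, a, b. Other locales typically order the letters differently.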