Showing posts with label linux. Show all posts

Thursday, 13 May 2021


This command lets you see which apps are using your internet connection.

ss -p
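A few handy variations, a sketch using flags documented in the ss man page (output depends on your system):

```shell
# -t TCP only, -u UDP only, -n numeric addresses, -p show the owning process
# (run as root to see processes belonging to other users)
ss -tnp          # TCP sockets with the program using each one
ss -tunp         # TCP and UDP together, numeric addresses
```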

Thursday, 30 May 2013

bash script to timestamp snapshot backups of a directory

Modified a script from
http://aaronparecki.com/articles/2010/07/09/1/how-to-back-up-dropbox-automatically-daily-weekly-monthly-snapshots

into a simplified snapshot copy of a folder, with the date appended to the folder name. Useful for keeping a backup on Dropbox, since files sometimes go missing suddenly from shared folders.

#!/bin/bash
# Usage:
#   sh snapshot.sh FOLDER [go]
# Pass "go" as the second argument to actually run the rsync command;
# otherwise the command is only printed (useful for testing).
path=Snapshotbackup-$(date +%m-%b-%y)
if [[ "$2" == "go" ]]
then
        rsync -avP --delete "$1" "$1-$path"
else
        echo rsync -avP --delete "$1" "$1-$path"
fi

Saturday, 18 May 2013

What does your bash prompt look like?



Decided to add color and the full path to my bash prompt to make it easier to see which SSH window belongs where. An absolute lifesaver when you are running several things and the tabs just don't show enough info.
Plus, having user@host:/fullpath/folder in your bash prompt makes it easy to copy and paste a 'url' to scp files into that folder directly, instead of typing it from scratch. The only downside is that if you are buried deep in folders, you are not going to have a lot of screen estate left to type long bash commands without confusing yourself.
   
My final choice:

PS1="\[\033[35m\]\t\[\033[m\]-\[\033[36m\]\u\[\033[m\]@\[\033[32m\]\h:\[\033[33;1m\]\w\[\033[m\]\$ "
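For reference, here is the same prompt split into pieces, with the escape meanings as given in the bash manual annotated as comments (a sketch; the colors are a matter of taste):

```shell
# \t = time, \u = user, \h = host, \w = full working directory
# \[ ... \] wrap the non-printing color escapes so bash counts the prompt width correctly
PS1="\[\033[35m\]\t\[\033[m\]"        # magenta clock
PS1+="-\[\033[36m\]\u\[\033[m\]"      # cyan username
PS1+="@\[\033[32m\]\h:"               # green hostname
PS1+="\[\033[33;1m\]\w\[\033[m\]\$ "  # bold yellow working directory
echo "$PS1"   # paste the assembled value into ~/.bashrc to try it out
```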


image from 8 Useful and Interesting Bash Prompts - Make Tech Easier

Source:
8 Useful and Interesting Bash Prompts - Make Tech Easier
http://www.maketecheasier.com/8-useful-and-interesting-bash-prompts/2009/09/04

How to: Change / Setup bash custom prompt (PS1)
http://www.cyberciti.biz/tips/howto-linux-unix-bash-shell-setup-prompt.html


Friday, 22 March 2013

Adventures with my WD My Book Live (A PowerPC Debian Linux Server with 2 TB HDD)


I should have googled before probing at the CLI, and I would have found out what I needed to know. Oh well, damage done. What I needed to know is that it runs Debian Linux (quite up to date!) with the standard Perl/Python/SQLite installed. The CPU and RAM aren't super impressive, but if you are just looping through text files I doubt that matters a lot. Heck, it's roughly equivalent to an older generation of Raspberry Pi with 256 MB of RAM.

The My Book Live is based upon the APM82181, an 800 MHz PowerPC 464 based platform (PDF). It has a host of features which are not utilized by the MyBook Live. For example, the PCI-E ports as well as the USB 2.0 OTG ports are fully disabled. The SATA port and GbE MAC are the only active components. The unit also has 256 MB of DRAM. (Source: anandtech.com)

It's such a shame that the PCI-E and USB ports are disabled, but at least the root account isn't, which opens up possibilities to install things and hack the system into a low-power device with a 2 TB HDD to do a bit of bioinformatics, eh?

Imagine shipping someone's genomic data in one of these babies, set up to let you slice and dice the FASTQ file to extract pertinent info! After all, it already runs a web server; it wouldn't be too much of a strain to make web apps, or just a simple web interface as a wrapper for scripts that generate graphical reports (*dreams of putting the Galaxy web server on the WD MyBookLive*).
Or perhaps use HTSeq or ERANGE to do something that doesn't strain the 256 MB of DRAM.

Post in the comments what you might do with an 800 MHz CPU and 256 MB of RAM with Debian under the hood.

UPDATE: Unfortunately I have managed to brick my WD MyBookLive by being overzealous in installing stuff that also required the HTTP web server. Doing that to a headless server with NO terminal/keyboard access is a BAD idea, especially if it breaks the SSH login when boot-up hangs :(

Sigh. I hope to fix it soon, and I will be more careful to test packages on my Ubuntu box before trying them on the MyBookLive.


MyBookLive:~# cat /proc/cpuinfo
processor       : 0
cpu             : APM82181
clock           : 800.000008MHz
revision        : 28.130 (pvr 12c4 1c82)
bogomips        : 1600.00
timebase        : 800000008
platform        : PowerPC 44x Platform
model           : amcc,apollo3g
Memory          : 256 MB




MyBookLive:~# apt-get update
Get:1 http://ftp.us.debian.org squeeze Release.gpg [1672B]
Get:2 http://ftp.us.debian.org wheezy Release.gpg [836B]
Get:3 http://ftp.us.debian.org squeeze Release [99.8kB]
Ign http://ftp.us.debian.org squeeze Release
Get:4 http://ftp.us.debian.org wheezy Release [223kB]
Ign http://ftp.us.debian.org wheezy Release
Get:5 http://ftp.us.debian.org squeeze/main Packages [6493kB]
Get:6 http://ftp.us.debian.org wheezy/main Packages [5754kB]
Fetched 12.6MB in 1min17s (163kB/s)
Reading package lists... Done



MyBookLive:~# perl -v

This is perl, v5.10.1 (*) built for powerpc-linux-gnu-thread-multi
(with 51 registered patches, see perl -V for more detail)

Copyright 1987-2009, Larry Wall

Perl may be copied only under the terms of either the Artistic License or the
GNU General Public License, which may be found in the Perl 5 source kit.

Complete documentation for Perl, including FAQ lists, should be found on
this system using "man perl" or "perldoc perl".  If you have access to the
Internet, point your browser at http://www.perl.org/, the Perl Home Page.


MyBookLive:~# python
Python 2.5.2 (r252:60911, Jan 24 2010, 18:51:01)
[GCC 4.3.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.


MyBookLive:~# sqlite3
SQLite version 3.7.3
Enter ".help" for instructions
Enter SQL statements terminated with a ";"
sqlite>


MyBookLive:~# free
             total       used       free     shared    buffers     cached
Mem:        253632     250112       3520          0      53568      52352
-/+ buffers/cache:     144192     109440
Swap:       500608     146048     354560

OK, if you are interested, below is the exact model of WD MyBookLive that I own right now.




Related Links
Hacking WD My Book Live
http://mybookworld.wikidot.com/mybook-live

Saturday, 9 March 2013

Linux CLI gems

Was alerted to this 'shell script' on GitHub that contains several Linux command-line gems. That's how I store my code snippets too, so that I can see them in contextual colors when opened in Vim or some other editor that is aware of the content (see example below).


# remove spaces from filenames in current directory (-n is a dry run; drop it to rename for real)
rename -n 's/\s//g' *
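If the perl-based rename isn't installed, a plain-bash loop does the same job; a sketch using parameter expansion:

```shell
# "${f// /}" deletes every space in the name; mv -- guards against odd filenames
for f in *' '*; do
    [ -e "$f" ] || continue   # glob matched nothing; skip the literal pattern
    mv -- "$f" "${f// /}"
done
```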



Another good resource for picking up new CLI magic is commandlinefu.com

Enjoy!

Wednesday, 6 February 2013

Handling R packages Feb 2013 issue Linux Journal

The kind folks at http://www.linuxjournal.com/ have provided me with the Feb 2013 issue. I can't tell you how much Linux I have picked up from there, with its easy prose and graphical how-tos. The Feb 2013 issue focuses on the theme of system administration. There are definitely useful things inside for the starting bioinformatician who wishes to dabble with working directly off a *nix machine :)

Other topics in the February 2013 issue include:
  • Manage Your Virtual Deployment with ConVirt
  • Use Fabric for Sysadmin Tasks on Remote Machines
  • Spin up Linux VMs on Azure
  • Make Your Android Device Play with Your Linux Box
  • Create a Colocated Server with Raspberry Pi


You can check out a preview of the contents here

February 2013 Issue of Linux Journal: System Administration

Tuesday, 11 December 2012

odd way to promote Fedora

Hmm, I wouldn't have done a Facebook ad with these exact words to describe Fedora

 Fedora
 Fedora is a Linux OS, a collection of software to run on your computer.
Join us today.
Like · 1,299 people like this.



Even a random quote like
"Fedora has [...] released an amazingly rock-solid operating system." 
− Jack Wallen, TechRepublic.com

would have enticed me to click like if I didn't know Linux

Wednesday, 1 August 2012

SSD / HDD benchmarking on Linux

Found this gem of a wiki on Arch Linux, the distro for speed:
https://wiki.archlinux.org/index.php/SSD_Benchmarking

There are several ways to benchmark an SSD/HDD. Some, like dd (see excerpt below), are available on all Linux systems by default and are good for quick-and-dirty benchmarking. I recently found that running on a HDD RAID that was twice as fast might have saved me half the run time of a samtools mpileup command!


Using dd

Note: This method requires the command to be executed from a mounted partition on the device of interest!
First, enter a directory on the SSD with at least 1.1 GB of free space (and one that obviously gives your user wrx permissions) and write a test file to measure write speeds and to give the device something to read:
$ cd /path/to/SSD
$ dd if=/dev/zero of=tempfile bs=1M count=1024 conv=fdatasync,notrunc
1024+0 records in
1024+0 records out
w bytes (x GB) copied, y s, z MB/s
Next, clear the buffer-cache to accurately measure read speeds directly from the device:
# echo 3 > /proc/sys/vm/drop_caches
$ dd if=tempfile of=/dev/null bs=1M count=1024
1024+0 records in
1024+0 records out
w bytes (x GB) copied, y s, z MB/s
Now that the last file is in the buffer, repeat the command to see the speed of the buffer-cache:
$ dd if=tempfile of=/dev/null bs=1M count=1024
1024+0 records in
1024+0 records out
w bytes (x GB) copied, y s, z GB/s

Thursday, 22 March 2012

Flash on 64-bit Ubuntu is one good reason to install Google Chrome

http://support.google.com/chrome/bin/answer.py?hl=en&answer=108086


Adobe Flash is directly integrated with Google Chrome and enabled by default. Available updates for Adobe Flash are automatically included in Chrome system updates.



Thursday, 1 March 2012

Mac (BSD) awk != Linux (GNU) awk, split BED file by chromosomes


To split your BED file by chromosome, this simple GNU AWK script will create "my.chrNNN.bed" file for each chromosome:
  awk '{ print >> "my." $1 ".bed" }' my.bed


BSD's AWK 
 awk '{ file = "TFBS." $1 ".bed" ; print >> file }' TFBS.bed





credit: Assaf Gordon on BedTools mailing list for pointing this out






On a side note, 
To split your BAM file by chromosome, you can use "bamtools split" ( bamtools here: https://github.com/pezmaster31/bamtools ) .


There is a SAMtools filter option as well. 
Anyone benchmarked both to see which runs faster? 




This discussion arose from someone trying to compute the mean coverage:

bedtools coverage -abam my.bam -b my.bed -d | sort -k1,1 -k2,2n | groupby -g 1,2,3,4,5,6 -c 8 -o mean > my.txt

It doesn't scale too well with the number of entries in the BED file:
1,000 lines takes a few minutes;
1,200,000 lines: "it's been running for 12 hours and still not done yet" on a Mac Pro at 2.66 GHz with 8 GB of memory.

So splitting the file by chromosome helps to parallelize the process, although the total work still scales linearly.
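The split-then-parallelize idea can be sketched like this (file names are illustrative, and the coverage step is a simplified stand-in for the full pipeline above):

```shell
# 1) one BED per chromosome, written portably for both GNU and BSD awk
awk '{ out = "my." $1 ".bed"; print >> out; close(out) }' my.bed
# 2) run the (simplified) coverage step per chromosome in the background
for bed in my.chr*.bed; do
    bedtools coverage -abam my.bam -b "$bed" -d > "${bed%.bed}.cov" &
done
wait   # block until every per-chromosome job has finished
```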

Friday, 24 February 2012

bash_profile bashrc Mac users think differently

Been trying to hack my MacBook to replace my Ubuntu work environment.
Although MOST things are portable, I am finding that Mac users actually live with a lot of inconveniences for which there are hacks/solutions.

Like the Aero Snap feature in Win7?
Do yourself a favour and get this:
https://github.com/fikovnik/ShiftIt


One thing that brings a chuckle to my face:
there are a lot of sites that 'solve' the missing .bashrc problem by pointing at the .bash_profile that 'Mac uses instead'.
http://superuser.com/questions/147043/where-to-find-the-bashrc-file-on-mac-os-x-snow-leopard-and-lion

Oh gosh, just rename your .bashrc to .bash_profile and hope your bash customizations aren't Mac-averse.

I am still trying to find out how to write to NTFS in Lion... I can't believe that something that works fine in Ubuntu might cost me money here. Seriously, Apple should just pay for the NTFS licence to make the point that the Mac is friendlier to other platforms, OR let users enable NTFS write with the caveat that it's a reverse-engineered hack.


Update:
Vim syntax highlighting isn't turned on by default (WHY??)
Quick fix:
cp  /usr/share/vim/vim73/vimrc_example.vim ~/.vimrc


Context coloring in the bash Terminal (why would the Mac ship with a default B&W color scheme?)
http://superuser.com/questions/324207/how-do-i-get-context-coloring-in-mac-os-x-terminal
Check out the link above; essentially, insert these two lines

export CLICOLOR=1
export LSCOLORS=GxFxCxDxBxegedabagaced

into your .bash_profile (not .profile)

Note: when working with Vim, try to remember that ctrl works as control and command is as per Linux/Win (not fun to mix up the keys on important documents).

Adding this alias will keep you from pulling your hair out when working across Mac and Linux environments:
alias md5sum='md5 -r'
It will mean nothing in Linux. But if you like to use 'md5sum -c' the way I do, you might have to install md5sum proper :( (I am delaying this, but I don't think checking md5sums by eye is fun or accurate.)
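The alias can also be guarded so the same .bash_profile works on both systems; a sketch (the 'md5 -r' output-order trick is the one above):

```shell
# Only alias when md5sum is genuinely missing (i.e. on macOS, not Linux);
# "md5 -r" flips macOS md5 output into md5sum's "hash  filename" order
if ! command -v md5sum >/dev/null 2>&1 && command -v md5 >/dev/null 2>&1; then
    alias md5sum='md5 -r'
fi
```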

Wednesday, 21 December 2011

Hilbertvis installation in R (no admin rights required!)


Playing around with
HilbertVis: Visualization of genomic data with the Hilbert curve

HilbertVis - Bioconductor www.bioconductor.org/packages/release/bioc/html/HilbertVis.html





Trying out an idea to use Hilbert Curves as a method to visually inspect WGS mapped bams for regions of low coverage or unequal coverage across samples.

It has a standalone GUI version that requires gtk+ packages that may not be avail on all systems.

The cool thing is that it can be installed locally within your user directory with a few simple commands

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.
Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.
>
> source("http://bioconductor.org/biocLite.R")
> biocLite("HilbertVis")

Using R version 2.10.0, biocinstall version 2.5.11.
Installing Bioconductor version 2.5 packages:
[1] "HilbertVis"
Please wait...
Warning in install.packages(pkgs = pkgs, repos = repos, ...) :
  argument 'lib' is missing: using '/usr/lib64/R/library'
Warning in install.packages(pkgs = pkgs, repos = repos, ...) :
  'lib = "/usr/lib64/R/library"' is not writable
Would you like to create a personal library
'~/R/x86_64-redhat-linux-gnu-library/2.10'
to install packages into?  (y/n)
y
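Answering y creates the personal library for you, but you can also set it up from the shell beforehand so R never has to ask; a sketch using R's standard R_LIBS_USER variable (the path is illustrative):

```shell
# Create a per-user package library and tell R about it
# (R adds R_LIBS_USER to .libPaths() at startup)
mkdir -p "$HOME/R/library"
export R_LIBS_USER="$HOME/R/library"
```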

Friday, 15 July 2011

How add new line to start and end of a file, SED / Linux Goodness

saw this usage of sed in the forums posted by ghostdog74

sed '1 i this is first line' file


sed '$ a this is last line' file
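A quick demonstration on a throwaway file (GNU sed accepts this compact one-line i/a syntax; POSIX sed wants a backslash and newline after i or a):

```shell
printf 'middle\n' > demo.txt
sed '1 i this is first line' demo.txt   # inserted line appears before "middle"
sed '$ a this is last line' demo.txt    # appended line appears after "middle"
```

Note that sed prints to stdout; neither command modifies demo.txt itself.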


link

Monday, 27 June 2011

Note to self: CentOS Yum cache

To save bandwidth when doing multiple installs from netinstall.iso, remember to change /etc/yum.conf from

keepcache=0

to

keepcache=1

This keeps all downloaded packages cached.
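The edit can be done with a one-liner; a sketch using GNU sed's in-place mode (it keeps a .bak backup, and /etc/yum.conf is root-owned, so run it with sudo):

```shell
# Flip keepcache=0 to keepcache=1 in place, saving the original as yum.conf.bak
sed -i.bak 's/^keepcache=0/keepcache=1/' /etc/yum.conf
```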

Tuesday, 22 March 2011

Cheat Sheets Galore-bioinformatics, biology, linux,perl, python, R

started with Keith's post here
http://omicsomics.blogspot.com/2011/03/whats-on-your-cheat-sheet.html

and a thread at
http://biostar.stackexchange.com/questions/6683/bioinformatics-cheat-sheet


I have soooo many of them! *this is going to be a long post
Vim
http://www.viemu.com/vi-vim-cheat-sheet.gif
Python
http://www.addedbytes.com/cheat-sheets/python-cheat-sheet/
R (pdf)
http://cran.r-project.org/doc/contrib/Short-refcard.pdf
Perl 
Hmmmm where did that go to?
AWK one liners
Sed examples
Linux common tasks


I have these too 
  • IUPAC ambiguity codes for nucleotides:
  • Amino acid single letter codes.


Wednesday, 21 July 2010

Google Chrome in CentOS? You will have to wait

Rant warning:
Gah!
Another crippling experience of working with CentOS.

I still can't install Chrome, despite Google having an official Linux port, because of an outdated package (lsb) on CentOS 5.4.

Others are having the same issues.

Wednesday, 14 July 2010

Shiny new tool to index NGS reads G-SQZ

This is a long-overdue tool for those trying to do non-typical analyses with their reads.
Finally you can index and compress your NGS reads.

http://www.ncbi.nlm.nih.gov/pubmed/20605925

Bioinformatics. 2010 Jul 6. [Epub ahead of print]
G-SQZ: Compact Encoding of Genomic Sequence and Quality Data.

Tembe W, Lowey J, Suh E.

Translational Genomics Research Institute, 445 N 5th Street, Phoenix, AZ 85004, USA.
Abstract

SUMMARY: Large volumes of data generated by high-throughput sequencing instruments present non-trivial challenges in data storage, content access, and transfer. We present G-SQZ, a Huffman coding-based sequencing-reads specific representation scheme that compresses data without altering the relative order. G-SQZ has achieved from 65% to 81% compression on benchmark datasets, and it allows selective access without scanning and decoding from start. This paper focuses on describing the underlying encoding scheme and its software implementation; the more theoretical problem of optimal compression is out of scope. The immediate practical benefits include reduced infrastructure and informatics costs in managing and analyzing large sequencing data.

AVAILABILITY: http://public.tgen.org/sqz Academic/non-profit: source available at no cost under a non-open-source license by request from the website; binary available for direct download at no cost. For-profit: submit a request for a for-profit license from the website.

CONTACT: Waibhav Tembe (wtembe@tgen.org)

Read the discussion thread on SEQanswers for more tips and benchmarks.

I am not affiliated with the author, btw.

Friday, 11 June 2010

Do you talk awk ?

Neat file! I have a recipe collection of my own, but it pales in comparison to this... anyway, is it even possible to write a two-line awk script? rofl.

HANDY ONE-LINE SCRIPTS FOR AWK                               30 April 2008
Compiled by Eric Pement - eric [at] pement.org               version 0.27

Latest version of this file (in English) is usually at:
   http://www.pement.org/awk/awk1line.txt

This file will also be available in other languages:
   Chinese  - http://ximix.org/translation/awk1line_zh-CN.txt   

USAGE:

   Unix: awk '/pattern/ {print "$1"}'    # standard Unix shells
DOS/Win: awk '/pattern/ {print "$1"}'    # compiled with DJGPP, Cygwin
         awk "/pattern/ {print \"$1\"}"  # GnuWin32, UnxUtils, Mingw

Note that the DJGPP compilation (for DOS or Windows-32) permits an awk
script to follow Unix quoting syntax '/like/ {"this"}'. HOWEVER, if the
command interpreter is CMD.EXE or COMMAND.COM, single quotes will not
protect the redirection arrows (<, >) nor do they protect pipes (|).
These are special symbols which require "double quotes" to protect them
from interpretation as operating system directives. If the command
interpreter is bash, ksh or another Unix shell, then single and double
quotes will follow the standard Unix usage.

Users of MS-DOS or Microsoft Windows must remember that the percent
sign (%) is used to indicate environment variables, so this symbol must
be doubled (%%) to yield a single percent sign visible to awk.

If a script will not need to be quoted in Unix, DOS, or CMD, then I
normally omit the quote marks. If an example is peculiar to GNU awk,
the command 'gawk' will be used. Please notify me if you find errors or
new commands to add to this list (total length under 65 characters). I
usually try to put the shortest script first. To conserve space, I
normally use '1' instead of '{print}' to print each line. Either one
will work.

FILE SPACING:

 # double space a file
 awk '1;{print ""}'
 awk 'BEGIN{ORS="\n\n"};1'

 # double space a file which already has blank lines in it. Output file
 # should contain no more than one blank line between lines of text.
 # NOTE: On Unix systems, DOS lines which have only CRLF (\r\n) are
 # often treated as non-blank, and thus 'NF' alone will return TRUE.
 awk 'NF{print $0 "\n"}'

 # triple space a file
 awk '1;{print "\n"}'

NUMBERING AND CALCULATIONS:

 # precede each line by its line number FOR THAT FILE (left alignment).
 # Using a tab (\t) instead of space will preserve margins.
 awk '{print FNR "\t" $0}' files*

 # precede each line by its line number FOR ALL FILES TOGETHER, with tab.
 awk '{print NR "\t" $0}' files*

 # number each line of a file (number on left, right-aligned)
 # Double the percent signs if typing from the DOS command prompt.
 awk '{printf("%5d : %s\n", NR,$0)}'

 # number each line of file, but only print numbers if line is not blank
 # Remember caveats about Unix treatment of \r (mentioned above)
 awk 'NF{$0=++a " :" $0};1'
 awk '{print (NF? ++a " :" :"") $0}'

 # count lines (emulates "wc -l")
 awk 'END{print NR}'

 # print the sums of the fields of every line
 awk '{s=0; for (i=1; i<=NF; i++) s=s+$i; print s}'

 # add all fields in all lines and print the sum
 awk '{for (i=1; i<=NF; i++) s=s+$i}; END{print s}'

 # print every line after replacing each field with its absolute value
 awk '{for (i=1; i<=NF; i++) if ($i < 0) $i = -$i; print }'
 awk '{for (i=1; i<=NF; i++) $i = ($i < 0) ? -$i : $i; print }'

 # print the total number of fields ("words") in all lines
 awk '{ total = total + NF }; END {print total}' file

 # print the total number of lines that contain "Beth"
 awk '/Beth/{n++}; END {print n+0}' file

 # print the largest first field and the line that contains it
 # Intended for finding the longest string in field #1
 awk '$1 > max {max=$1; maxline=$0}; END{ print max, maxline}'

 # print the number of fields in each line, followed by the line
 awk '{ print NF ":" $0 } '

 # print the last field of each line
 awk '{ print $NF }'

 # print the last field of the last line
 awk '{ field = $NF }; END{ print field }'

 # print every line with more than 4 fields
 awk 'NF > 4'

 # print every line where the value of the last field is > 4
 awk '$NF > 4'

STRING CREATION:

 # create a string of a specific length (e.g., generate 513 spaces)
 awk 'BEGIN{while (a++<513) s=s " "; print s}'

 # insert a string of specific length at a certain character position
 # Example: insert 49 spaces after column #6 of each input line.
 gawk --re-interval 'BEGIN{while(a++<49)s=s " "};{sub(/^.{6}/,"&" s)};1'

ARRAY CREATION:

 # These next 2 entries are not one-line scripts, but the technique
 # is so handy that it merits inclusion here.
 
 # create an array named "month", indexed by numbers, so that month[1]
 # is 'Jan', month[2] is 'Feb', month[3] is 'Mar' and so on.
 split("Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec", month, " ")

 # create an array named "mdigit", indexed by strings, so that
 # mdigit["Jan"] is 1, mdigit["Feb"] is 2, etc. Requires "month" array
 for (i=1; i<=12; i++) mdigit[month[i]] = i

TEXT CONVERSION AND SUBSTITUTION:

 # IN UNIX ENVIRONMENT: convert DOS newlines (CR/LF) to Unix format
 awk '{sub(/\r$/,"")};1'   # assumes EACH line ends with Ctrl-M

 # IN UNIX ENVIRONMENT: convert Unix newlines (LF) to DOS format
 awk '{sub(/$/,"\r")};1'

 # IN DOS ENVIRONMENT: convert Unix newlines (LF) to DOS format
 awk 1

 # IN DOS ENVIRONMENT: convert DOS newlines (CR/LF) to Unix format
 # Cannot be done with DOS versions of awk, other than gawk:
 gawk -v BINMODE="w" '1' infile >outfile

 # Use "tr" instead.
 tr -d '\r' <infile >outfile       # GNU tr version 1.22 or higher

 # delete leading whitespace (spaces, tabs) from front of each line
 # aligns all text flush left
 awk '{sub(/^[ \t]+/, "")};1'

 # delete trailing whitespace (spaces, tabs) from end of each line
 awk '{sub(/[ \t]+$/, "")};1'

 # delete BOTH leading and trailing whitespace from each line
 awk '{gsub(/^[ \t]+|[ \t]+$/,"")};1'
 awk '{$1=$1};1'           # also removes extra space between fields

 # insert 5 blank spaces at beginning of each line (make page offset)
 awk '{sub(/^/, "     ")};1'

 # align all text flush right on a 79-column width
 awk '{printf "%79s\n", $0}' file*

 # center all text on a 79-character width
 awk '{l=length();s=int((79-l)/2); printf "%"(s+l)"s\n",$0}' file*

 # substitute (find and replace) "foo" with "bar" on each line
 awk '{sub(/foo/,"bar")}; 1'           # replace only 1st instance
 gawk '{$0=gensub(/foo/,"bar",4)}; 1'  # replace only 4th instance
 awk '{gsub(/foo/,"bar")}; 1'          # replace ALL instances in a line

 # substitute "foo" with "bar" ONLY for lines which contain "baz"
 awk '/baz/{gsub(/foo/, "bar")}; 1'

 # substitute "foo" with "bar" EXCEPT for lines which contain "baz"
 awk '!/baz/{gsub(/foo/, "bar")}; 1'

 # change "scarlet" or "ruby" or "puce" to "red"
 awk '{gsub(/scarlet|ruby|puce/, "red")}; 1'

 # reverse order of lines (emulates "tac")
 awk '{a[i++]=$0} END {for (j=i-1; j>=0;) print a[j--] }' file*

 # if a line ends with a backslash, append the next line to it (fails if
 # there are multiple lines ending with backslash...)
 awk '/\\$/ {sub(/\\$/,""); getline t; print $0 t; next}; 1' file*

 # print and sort the login names of all users
 awk -F ":" '{print $1 | "sort" }' /etc/passwd

 # print the first 2 fields, in opposite order, of every line
 awk '{print $2, $1}' file

 # switch the first 2 fields of every line
 awk '{temp = $1; $1 = $2; $2 = temp}' file

 # print every line, deleting the second field of that line
 awk '{ $2 = ""; print }'

 # print in reverse order the fields of every line
 awk '{for (i=NF; i>0; i--) printf("%s ",$i);print ""}' file

 # concatenate every 5 lines of input, using a comma separator
 # between fields
 awk 'ORS=NR%5?",":"\n"' file

SELECTIVE PRINTING OF CERTAIN LINES:

 # print first 10 lines of file (emulates behavior of "head")
 awk 'NR < 11'

 # print first line of file (emulates "head -1")
 awk 'NR>1{exit};1'

  # print the last 2 lines of a file (emulates "tail -2")
 awk '{y=x "\n" $0; x=$0};END{print y}'

 # print the last line of a file (emulates "tail -1")
 awk 'END{print}'

 # print only lines which match regular expression (emulates "grep")
 awk '/regex/'

 # print only lines which do NOT match regex (emulates "grep -v")
 awk '!/regex/'

 # print any line where field #5 is equal to "abc123"
 awk '$5 == "abc123"'

 # print only those lines where field #5 is NOT equal to "abc123"
 # This will also print lines which have less than 5 fields.
 awk '$5 != "abc123"'
 awk '!($5 == "abc123")'

 # matching a field against a regular expression
 awk '$7  ~ /^[a-f]/'    # print line if field #7 matches regex
 awk '$7 !~ /^[a-f]/'    # print line if field #7 does NOT match regex

 # print the line immediately before a regex, but not the line
 # containing the regex
 awk '/regex/{print x};{x=$0}'
 awk '/regex/{print (NR==1 ? "match on line 1" : x)};{x=$0}'

 # print the line immediately after a regex, but not the line
 # containing the regex
 awk '/regex/{getline;print}'

 # grep for AAA and BBB and CCC (in any order on the same line)
 awk '/AAA/ && /BBB/ && /CCC/'

 # grep for AAA and BBB and CCC (in that order)
 awk '/AAA.*BBB.*CCC/'

 # print only lines of 65 characters or longer
 awk 'length > 64'

 # print only lines of less than 65 characters
 awk 'length < 64'

 # print section of file from regular expression to end of file
 awk '/regex/,0'
 awk '/regex/,EOF'

 # print section of file based on line numbers (lines 8-12, inclusive)
 awk 'NR==8,NR==12'

 # print line number 52
 awk 'NR==52'
 awk 'NR==52 {print;exit}'          # more efficient on large files

 # print section of file between two regular expressions (inclusive)
 awk '/Iowa/,/Montana/'             # case sensitive

SELECTIVE DELETION OF CERTAIN LINES:

 # delete ALL blank lines from a file (same as "grep '.' ")
 awk NF
 awk '/./'

 # remove duplicate, consecutive lines (emulates "uniq")
 awk 'a !~ $0; {a=$0}'

 # remove duplicate, nonconsecutive lines
 awk '!a[$0]++'                     # most concise script
 awk '!($0 in a){a[$0];print}'      # most efficient script

CREDITS AND THANKS:

Special thanks to the late Peter S. Tillier (U.K.) for helping me with
the first release of this FAQ file, and to Daniel Jana, Yisu Dong, and
others for their suggestions and corrections.

For additional syntax instructions, including the way to apply editing
commands from a disk file instead of the command line, consult:

  "sed & awk, 2nd Edition," by Dale Dougherty and Arnold Robbins
  (O'Reilly, 1997)

  "UNIX Text Processing," by Dale Dougherty and Tim O'Reilly (Hayden
  Books, 1987)

  "GAWK: Effective awk Programming," 3d edition, by Arnold D. Robbins
  (O'Reilly, 2003) or at http://www.gnu.org/software/gawk/manual/

To fully exploit the power of awk, one must understand "regular
expressions." For detailed discussion of regular expressions, see
"Mastering Regular Expressions, 3d edition" by Jeffrey Friedl (O'Reilly,
2006).

The info and manual ("man") pages on Unix systems may be helpful (try
"man awk", "man nawk", "man gawk", "man regexp", or the section on
regular expressions in "man ed").

USE OF '\t' IN awk SCRIPTS: For clarity in documentation, I have used
'\t' to indicate a tab character (0x09) in the scripts.  All versions of
awk should recognize this abbreviation.

#---end of file---

Yet another fasta splitter in perl

I love Miguel's description of bioinformatics: "another way to torture biological data".
For me, I just massage the data into the myriad input formats that programs declare they require... I never torture it!
Anyway, check out his variant on splitting FASTA (I currently have Python, csplit (Linux), and BioPerl versions of FASTA splitters).

Miguel's FASTA splitter (into batches of FASTA records) requires BioPerl.
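For comparison, splitting doesn't strictly need BioPerl; here is a minimal awk sketch that writes one file per record (output names are illustrative, and it assumes the input starts with a '>' header line):

```shell
# Each ">" header closes the previous output file and starts rec1.fa, rec2.fa, ...
awk '/^>/ { close(out); out = "rec" ++n ".fa" } { print > out }' input.fa
```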

Saturday, 5 June 2010

Sequence Variant Analyzer SVA

The rapidly evolving high-throughput DNA sequencing technologies now allow fast generation of large amounts of sequence data for whole-genome sequencing studies at a reasonable cost. Sequence Variant Analyzer (SVA) is a software tool that we have been developing to analyze the genetic variants identified from such studies.

SVA is designed for two specific aims:
(1) To annotate the biological functions of the identified genetic variants, visualize and organize them;
(2) To help find the genetic variants associated with or responsible for the biological traits or medical outcomes of interest.

SVA is:
a program designed to run on a Linux platform, with a graphical user interface (GUI), meaning that the main functions of this program can be driven by clicking buttons.

a program specifically designed for analyzing genetic variants that have already been called (identified) from a whole-genome sequencing study. So do not try to find a function here to align short reads and call variants; SVA is not designed for those purposes. Many other software tools, for example BWA and SAMtools, were developed for those purposes.

How it works

Datanami, Woe be me