Bioinformatics
on the command line

Paul Harrison

Victorian Bioinformatics Consortium

Purpose of this talk

Give an idea of how we do bioinformatics.

Greater facility exploring data.
Work with larger data sets.
Be a first-class citizen in the world of bioinformatics.

Helps us help you.

I'm going to be covering a lot of details to give an idea of what is possible and what to ask for.

Please interrupt.

UNIX

  • UNIX
    • Linux
      • Ubuntu (recommended)
      • Debian
      • etc
    • OS X

All have a similar underlying system, and a similar set of command line tools.

Most scientific computing is done in Linux.

I use Ubuntu, with an older style interface.

Getting started

Linux is open source. Download Ubuntu from
http://www.ubuntu.com/

If you have OS X, you're already good to go.

Opening a terminal

History

  • Originally a terminal was a machine like a typewriter.
  • Many terminals could be plugged into one computer, allowing many people to use it at once.
  • UNIX operating system allows multiple programs to run at once on the computer.

Linux / OS X:


Look for a program called "Terminal" or similar.

Some basic commands

The terminal connects you to a "shell" program which lets you run other programs. Usually the shell program is "bash".

echo Hello world
Run the echo program. It just prints the parameters it is given back at you.

ls
List files in the current directory.

nano somefile.txt
Use a simple text editor to edit somefile.txt.
I was told I need some diagrams.

The terminal connects to the computer, and initially talks to a shell program.

When the shell starts up another program it connects it to your terminal as well.

The shell goes to sleep until the program finishes.

Connect to a remote machine

ssh username@hostname.monash.edu.au
"Securely connect me to a shell on a remote machine."

scp myfile username@hostname.monash.edu.au:.
scp username@hostname.monash.edu.au:myfile .
"Securely copy files to or from a remote machine."

Windows:


Use PuTTY

Text

UNIX uses text in a variety of "languages" to refer to or describe things. These are arrangments of letters and characters according to precise rules. Some languages are very simple and others are more complicated.

  • Files and directories are referred to using a combination of file and directory names, slashes, and dots.
  • An email address is made up of a name, an @-sign, and a domain name.
  • There are many different file formats. For example, FASTA format is used to store sequence data, GFF format is used to store sequence annotations, and CSV format can be used to store tabular data.
  • The bash shell has a language for running programs, the most basic part of which is that the program name is given followed by parameters, all separated by spaces.
  • There are many other programming languages.

Other things the shell can do

  • The shell doesn't have to connect a program to the terminal, it can connect the output of one program to another program, or to a file: | >
  • Specify multiple files with wildcards: *
  • Pause running commands, run commands in the background, kill running commands: Ctrl-C, Ctrl-Z, &, fg, bg, kill

More commands

  • Create and delete directories (folders): mkdir, rmdir
  • Change the current working directory: cd
  • Copy, rename or delete files: cp, mv, rm
  • Output or view the contents of a file: cat, less
  • Get the first or last n lines of a file: head, tail
  • Compress and uncompress files: gzip, gunzip, zcat, tar, zip, unzip, etc etc
  • Search or replace a pattern of text in a file: grep, sed

This is too much. How am I meant to learn all this?

Learning more

software-carpentry.org has a good introduction specifically for scientific users, "The Shell".

To look at the manual page for a command, type

man commandname

superuser.com is a StackExchange site for the UNIX command line. Ask questions.

Find someone who knows a bit of UNIX (such as anyone in the VBC) and ask them.

Graphical user interfaces are easy to use.

Why would I learn all this?

Payoff 1:
scripting

Create a text file called hello.sh

nano hello.sh

containing this

echo Hello $1
echo How are you?

"bash" shell can use the commands in the file rather than typed in directly.

bash hello.sh Paul
Hello Paul
How are you?

We wrote a "script" or "program".

It precisely documents what we did.

We can run the same script on other data. It's exactly like having a new command that can be run from the shell.

We can give the script to other people.

Payoff 2:
More, more powerful, software

Text is a much more flexible way to communicate than pointing at things or filling in forms. Web based or graphical versions of software, where they exist, usually don't offer all of the features of the command-line version.

Your local bioinformatics people can develop command line tools much more easily than graphical or web-based tools.

Bioinformatics software:
BLAST

Look for alignments to a query sequence in a (possibly large) library of sequences.

Query is a text file in "fasta" format.

>sequencename1
ACGTGCGCTAGCTGATCGA
GCGATCGTGCAGCGCAGAA
TGCGCGCTAG
>sequencename2
GGTAGGGTAAATTGCCTAC
CGTCGATCGAGTA

etc

blastn -db nt -query myquery.fa

Very nice, but I can do that on the web quite easily.

Query a custom database, eg chromosomes of your organism of interest.

makeblastdb -dbtype nucl -in myorganism.fa
blastn -db myorganism.fa -query myquery.fa

More advanced usage generally involves taking the output of BLAST as a first step in some kind of script. For example, Torsten's "prokka" tool uses BLAST (amongst other things) to automatically annotate a sequence.

Non-trivial example

Try to work out what organism a set of reads came from.

zcat myreads.fq.gz |head -n 400 |bp_seqconvert --from fastq --to fasta \
    |blastn -task megablast -db nt -num_descriptions 20 -num_alignments 0 \
    |less
"Uncompress my compressed fastq file, take the first 400 lines, convert from FASTQ to FASTA format, run them through megablast, and let me look at the result in a text file viewer."

Bioinformatics software:
multiple alignment

Create an alignment of two or more sequences.

clustalo --seqtype DNA --in mysequences.fa

or clustalw, mafft, muscle, etc etc

Bioinformatics software:
High throughput sequencing software

Create a report on the quality of a read set:
fastqc

Assemble reads into contigs:
velvet, SPAdes, etc etc

Align reads to a known reference sequence:
SHRiMP, Bowtie2, etc etc

Many other tools:
samtools, picard, GATK, etc etc

Bioinformatics software

Hundreds of small tools for format conversion, etc.

However there is a more direct way to perform simple tasks, which I will get to shortly.

VBC

vicbioinformatics.com

prokka

Torsten Seemann

Annotate a prokaryotic sequence, eg contigs created by velvet, by similarity to existing hand-annotated genomes.

VelvetOptimizer

Simon Gladman

Find parameters for velvet to produce the best assembly.

nesoni

Paul Harrison

Toolbox, automates

  • Alignment of reads to reference.
  • Variant calling (SNPs and indels)
  • RNAseq expression analysis
  • Automated transcript / peak calling from depth of coverage
  • Various small utilities

As a way to explore data: we give you a "counts file", use nesoni to produce heatmaps, differential expression analysis.

Easy to add new tools. Nesoni has grown by people asking for things.

Nesoni cookbook
www.vicbioinformatics.com/nesoni-cookbook

Display nesoni help text:

nesoni

Example:
Produce BAM files, list of SNPs and indels, depth of coverage plots for Artemis.

nesoni make-reference: myref organism.gbk
nesoni analyse-sample: mysample myref \
    pairs: read_1.fq.gz read_2.fq.gz
(Quality-clip reads, align them to the reference sequences, filter alignments and produce depth of coverage plots, calls SNPs and indels.)

Learning more about
bioinformatics software

seqanswers.com forum.

BioStar is a StackExchange-like site for asking questions.

Ask Torsten.

Winter School in Mathematical and Computational Biology
University of Queensland, 1-5 July 2013


Life Sciences Computation Centre is planning two workshops for this year:
RNAseq expression, variant calling
(VBC is Monash node of LSCC)

Other languages

"bash" is a "shell", a language that is especially good for running other programs.

If you want to go further, my highly opinionated recommendation is:

  • Learn "R" to do statistics.
  • Learn "python" as a general purpose language.

These are fairly modern languages. Like bash, they can be used either interactively or can run a script. Easier to use than older languages, but not toys. I use R and Python almost every day.

R

R
x <- c(1,2,3,4)
y <- c(5,7,6,8)
plot(x,y)
cor.test(x,y)

Ctrl-D to exit

Works well with spreadsheet-like data.

data <- read.csv('mydata.csv')

Bioconductor:
Huge library of bioinformatics packages, notably including "limma" developed by WEHI for differential expression testing.

We can provide code to help load, analyse, and visualize your data, or suggest how to use existing packages.

Python

A teaching language with batteries included.

Good for format conversions and basic data mangling.
More flexible than a set of command-line programs.
Python is the glue in much scientific computing.

Good for large projects.

BioPython:
Load and manipulate DNA sequences, annotations, etc. Automated use of web services such as NCBI.

Again, we can provide code snippets, packages, advice on existing packages.

Pedagogy

Web-based or graphical interfaces are easy to use, but there's no next step after mastering them.

Things we know don't work:
Diagram-based languages have been tried, do not scale. Natural language (English) interfaces are an unsolved problem, would probably be terrible anyway.

But the command-line is full of historical artifacts and bioinformatics on it is a mess of inconsistent commands and data formats.

Pedagogy

Python as a model of how to actually solve this kind of problem. Python has opened up programming to many people who previously weren't able to program.
  • Not English or a diagram, but easy for an English speaker to read.
  • Presents necessary concepts progressively and self-consistently. Here's a series of steps, an if-then-else, a list, a way to do something with each item in the list, how to write a re-usable series of steps as a "function", how to make a new kind of object, and so on.
  • Abstracts out details the computer can handle by itself. You don't need to worry about brace-indentation inconsistency, compilation, memory allocation, exact representation of data in memory, hashtable tuning, etc etc, as you would in FORTRAN / C / C++ / etc.

Conclusion

Hopefully given you an idea of the path to becoming a bioinformatician. Even a few steps down this path are useful.

Ask for an account on a machine you can log in to, find a mentor. Ask how they would solve a problem and look at the elements that make up that solution.

vicbioinformatics.com/documents/command-line