Victorian Bioinformatics Consortium
Give an idea of how we do bioinformatics.
Greater facility exploring data.
Work with larger data sets.
Be a first-class citizen in the world of bioinformatics.
Helps us help you.
I'm going to be covering a lot of details to give an idea of what is possible and what to ask for.
UNIX-like systems such as Linux and OS X all have a similar underlying system, and a similar set of command line tools.
Most scientific computing is done in Linux.
I use Ubuntu, with an older style interface.
Linux is open source.
Download Ubuntu from
If you have OS X, you're already good to go.
The terminal connects you to a "shell" program which lets you run other programs. Usually the shell program is "bash".
echo Hello world
Run the echo program. It just prints the parameters it is given back at you.
ls
List files in the current directory.
nano somefile.txt
Use a simple text editor to edit somefile.txt.
The terminal connects to the computer, and initially talks to a shell program.
When the shell starts up another program it connects it to your terminal as well.
The shell goes to sleep until the program finishes.
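The "start a program, then sleep until it finishes" behaviour can be sketched in Python with the standard subprocess module. This is only an illustration, not how bash is implemented: the helper name run_and_wait is made up for this example, and the child's output is captured here so it can be inspected, whereas a real shell leaves the child connected to your terminal.

```python
import subprocess
import sys


def run_and_wait(args):
    """Start a child program and block until it exits, much as a shell does.

    Returns the child's exit status and whatever it printed.
    """
    # subprocess.run() does not return until the child has finished.
    completed = subprocess.run(args, capture_output=True, text=True)
    return completed.returncode, completed.stdout


# Use the Python interpreter itself as the child program, so the sketch
# does not depend on any particular command being installed.
child = [sys.executable, "-c", "print('Hello world')"]
code, output = run_and_wait(child)
print("exit status:", code)
print("output:", output.strip())
```

An exit status of 0 conventionally means the program finished without error.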
ssh firstname.lastname@example.org
"Securely connect me to a shell on a remote machine."
scp myfile email@example.com:.
scp firstname.lastname@example.org:myfile .
"Securely copy files to or from a remote machine."
UNIX uses text in a variety of "languages" to refer to or describe things.
These are arrangements of letters and characters according to precise rules.
Some languages are very simple and others are more complicated.
This is too much. How am I meant to learn all this?
To look at the manual page for a command, type
superuser.com is a StackExchange site for the UNIX command line. Ask questions.
Find someone who knows a bit of UNIX (such as anyone in the VBC) and ask them.
Graphical user interfaces are easy to use.
Why would I learn all this?
Create a text file called hello.sh
echo Hello $1
echo How are you?
The "bash" shell can read commands from the file rather than having them typed in directly.
bash hello.sh Paul
Hello Paul
How are you?
We wrote a "script" or "program".
It precisely documents what we did.
We can run the same script on other data. It's exactly like having a new command that can be run from the shell.
We can give the script to other people.
Text is a much more flexible way to communicate than pointing at things or filling in forms. Web-based or graphical versions of software, where they exist, usually don't offer all of the features of the command-line version.
Your local bioinformatics people can develop command line tools much more easily than graphical or web-based tools.
Look for alignments to a query sequence in a (possibly large) library of sequences.
Query is a text file in "fasta" format.
>sequencename1
ACGTGCGCTAGCTGATCGA
GCGATCGTGCAGCGCAGAA
TGCGCGCTAG
>sequencename2
GGTAGGGTAAATTGCCTAC
CGTCGATCGAGTA
etc
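To make the format concrete, here is a minimal Python sketch that reads FASTA text into (name, sequence) pairs. The function name parse_fasta is made up for this example, and the sketch assumes plain, well-formed input. For real work an existing package is likely a better choice.

```python
def parse_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of FASTA-format lines."""
    name = None
    chunks = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header line: emit the record collected so far, if any.
            if name is not None:
                yield name, "".join(chunks)
            name = line[1:].strip()
            chunks = []
        elif name is not None:
            # Sequence lines are joined together without line breaks.
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


example = """>sequencename1
ACGTGCGCTAGCTGATCGA
GCGATCGTGCAGCGCAGAA
TGCGCGCTAG
>sequencename2
GGTAGGGTAAATTGCCTAC
CGTCGATCGAGTA
"""

records = list(parse_fasta(example.splitlines()))
for record_name, sequence in records:
    print(record_name, len(sequence))
```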
blastn -db nt -query myquery.fa
Very nice, but I can do that on the web quite easily.
Query a custom database, eg chromosomes of your organism of interest.
makeblastdb -dbtype nucl -in myorganism.fa
blastn -db myorganism.fa -query myquery.fa
More advanced usage generally involves taking the output of BLAST as a first step in some kind of script. For example, Torsten's "prokka" tool uses BLAST (amongst other things) to automatically annotate a sequence.
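As a rough sketch of "BLAST output as a first step", the Python below picks the best-scoring hit for each query from tabular BLAST output. It assumes blastn was run with the extra option -outfmt 6 (tab-separated columns), which is not part of the commands shown above. The function names and the sample rows are invented for illustration. This is not how prokka itself works.

```python
# Default column order of BLAST+ tabular output (-outfmt 6).
BLAST_COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]


def parse_blast_tabular(lines):
    """Turn tab-separated BLAST lines into a list of dictionaries."""
    hits = []
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        hit = dict(zip(BLAST_COLUMNS, line.split("\t")))
        hit["pident"] = float(hit["pident"])
        hit["length"] = int(hit["length"])
        hit["evalue"] = float(hit["evalue"])
        hit["bitscore"] = float(hit["bitscore"])
        hits.append(hit)
    return hits


def best_hit_per_query(hits):
    """Keep only the highest bit-score hit for each query sequence."""
    best = {}
    for hit in hits:
        current = best.get(hit["qseqid"])
        if current is None or hit["bitscore"] > current["bitscore"]:
            best[hit["qseqid"]] = hit
    return best


# Made-up rows, standing in for real blastn output.
sample = [
    "sequencename1\tcontigA\t98.50\t200\t3\t0\t1\t200\t1001\t1200\t1e-95\t350",
    "sequencename1\tcontigB\t91.00\t180\t16\t0\t5\t184\t40\t219\t2e-60\t230",
    "sequencename2\tcontigB\t100.00\t32\t0\t0\t1\t32\t500\t531\t3e-10\t60.2",
]

best = best_hit_per_query(parse_blast_tabular(sample))
for query, hit in best.items():
    print(query, hit["sseqid"], hit["bitscore"])
```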
Try to work out what organism a set of reads came from.
zcat myreads.fq.gz |head -n 400 |bp_seqconvert --from fastq --to fasta \
    |blastn -task megablast -db nt -num_descriptions 20 -num_alignments 0 \
    |less
"Uncompress my compressed fastq file, take the first 400 lines, convert from FASTQ to FASTA format, run them through megablast, and let me look at the result in a text file viewer."
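The first three stages of that pipeline (uncompress, take 400 lines, convert FASTQ to FASTA) can also be sketched in Python. This sketch assumes the usual four lines per FASTQ record, so 400 lines is 100 reads. The function names are made up, and the tiny read file is generated on the spot purely so the example runs by itself.

```python
import gzip
import itertools
import os
import tempfile


def head_lines(path, n):
    """Return the first n lines of a gzip-compressed text file."""
    with gzip.open(path, "rt") as handle:
        return [line.rstrip("\n") for line in itertools.islice(handle, n)]


def fastq_to_fasta(lines):
    """Convert FASTQ lines (assumed four per record) to FASTA text."""
    out = []
    # Step through complete records; a trailing partial record is ignored.
    for i in range(0, len(lines) - 3, 4):
        header, sequence = lines[i], lines[i + 1]
        if not header.startswith("@"):
            raise ValueError("expected a FASTQ header line, got: " + header)
        out.append(">" + header[1:])
        out.append(sequence)
    if not out:
        return ""
    return "\n".join(out) + "\n"


# Write a made-up two-read file so the sketch is self-contained.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "myreads.fq.gz")
    with gzip.open(path, "wt") as handle:
        handle.write("@read1\nACGT\n+\nIIII\n@read2\nGGCA\n+\nIIII\n")
    fasta_text = fastq_to_fasta(head_lines(path, 400))

print(fasta_text, end="")
```

The resulting FASTA text could then be saved to a file and passed to blastn with -query.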
Create an alignment of two or more sequences.
clustalo --seqtype DNA --in mysequences.fa
or clustalw, mafft, muscle, etc.
Create a report on the quality of a read set:
Assemble reads into contigs:
velvet, SPAdes, etc etc
Align reads to a known reference sequence:
SHRiMP, Bowtie2, etc etc
Many other tools:
samtools, picard, GATK, etc etc
Hundreds of small tools for format conversion, etc.
However there is a more direct way to perform simple tasks, which I will get to shortly.
Annotate a prokaryotic sequence, eg contigs created by velvet, by similarity to existing hand-annotated genomes.
Find parameters for velvet to produce the best assembly.
As a way to explore data: we give you a "counts file", and you use nesoni to produce heatmaps and differential expression analysis.
Easy to add new tools. Nesoni has grown by people asking for things.
Display nesoni help text:
Produce BAM files, list of SNPs and indels, depth of coverage plots for Artemis.
nesoni make-reference: myref organism.gbk
nesoni analyse-sample: mysample myref \
    pairs: read_1.fq.gz read_2.fq.gz
(Quality-clip reads, align them to the reference sequences, filter alignments, produce depth of coverage plots, and call SNPs and indels.)
BioStar is a StackExchange-like site for asking questions.
Winter School in Mathematical and Computational Biology
University of Queensland, 1-5 July 2013
Life Sciences Computation Centre is planning two workshops for this year:
RNAseq expression, variant calling
(VBC is Monash node of LSCC)
"bash" is a "shell", a language that is especially good for running other programs.
If you want to go further, my highly opinionated recommendation is:
These are fairly modern languages. Like bash, they can be used either interactively or can run a script. Easier to use than older languages, but not toys. I use R and Python almost every day.
x <- c(1,2,3,4)
y <- c(5,7,6,8)
plot(x,y)
cor.test(x,y)
Ctrl-D to exit
Works well with spreadsheet-like data.
data <- read.csv('mydata.csv')
Huge library of bioinformatics packages, notably including "limma" developed by WEHI for differential expression testing.
We can provide code to help load, analyse, and visualize your data, or suggest how to use existing packages.
A teaching language with batteries included.
Good for format conversions and basic data mangling.
More flexible than a set of command-line programs.
Python is the glue in much scientific computing.
Good for large projects.
Load and manipulate DNA sequences, annotations, etc. Automated use of web services such as NCBI.
Again, we can provide code snippets, packages, advice on existing packages.
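As a small taste of manipulating DNA sequences in Python, here is a sketch using only the standard library. The function names are made up for this example, and it assumes sequences contain only the letters A, C, G and T. Existing packages handle far more cases.

```python
# Translation table mapping each base to its complement.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(sequence):
    """Return the reverse complement of a DNA sequence."""
    return sequence.translate(COMPLEMENT)[::-1]


def gc_fraction(sequence):
    """Return the fraction of bases that are G or C (0.0 for an empty sequence)."""
    sequence = sequence.upper()
    if not sequence:
        return 0.0
    return (sequence.count("G") + sequence.count("C")) / len(sequence)


seq = "GGTAGGGTAAATTGCCTAC"
print(reverse_complement(seq))
print(round(gc_fraction(seq), 3))
```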
Web-based or graphical interfaces are easy to use, but there's no next step after mastering them.
Things we know don't work:
Diagram-based languages have been tried, and do not scale.
Natural language (English) interfaces are an unsolved problem, and would probably be terrible anyway.
But the command-line is full of historical artifacts and bioinformatics on it is a mess of inconsistent commands and data formats.
Hopefully this has given you an idea of the path to becoming a bioinformatician. Even a few steps down this path are useful.
Ask for an account on a machine you can log in to, find a mentor. Ask how they would solve a problem and look at the elements that make up that solution.