Nesoni

Download

Read the Nesoni Cookbook

25 March 2014:

Nesoni is free software, released under the GPL (version 2).

Description

Nesoni is a high-throughput sequencing data analysis toolset, developed by the Victorian Bioinformatics Consortium (VBC) to cope with the flood of Illumina, 454, and SOLiD data now being produced.

Our work is largely with bacterial genomes, and the design tradeoffs in nesoni reflect this.

Alignment to reference

Nesoni focusses on analysing the alignment of reads to a reference genome. We use the SHRiMP read aligner, as it is able to detect small insertions and deletions in addition to SNPs.

Nesoni can call a consensus of read alignments, taking care to indicate ambiguity. This can then be used in various ways: to determine the protein level changes resulting from SNPs and indels, to find differences between multiple strains, or to produce n-way comparison data suitable for phylogenetic analysis in SplitsTree4.
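To make "taking care to indicate ambiguity" concrete, here is a minimal sketch of calling a consensus base from the bases observed at one reference position, falling back to IUPAC ambiguity codes when no single base dominates. This is an illustration only, not Nesoni's actual algorithm: the function names (call_base, call_consensus) and the thresholds (min_depth, min_fraction) are assumptions made for the example.

```python
from collections import Counter

# IUPAC nucleotide codes for each possible set of bases.
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C",
    frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("CG"): "S",
    frozenset("AT"): "W", frozenset("GT"): "K", frozenset("AC"): "M",
    frozenset("CGT"): "B", frozenset("AGT"): "D", frozenset("ACT"): "H",
    frozenset("ACG"): "V", frozenset("ACGT"): "N",
}


def call_base(observed, min_depth=3, min_fraction=0.8):
    """Call one consensus base from the bases observed at one position.

    min_depth and min_fraction are hypothetical thresholds for this sketch.
    """
    counts = Counter(b for b in observed.upper() if b in "ACGT")
    depth = sum(counts.values())
    if depth < min_depth:
        return "N"  # too little evidence to say anything
    chosen = set()
    covered = 0
    # Add bases, most frequent first, until they explain enough of the reads.
    for base, n in counts.most_common():
        chosen.add(base)
        covered += n
        if covered >= min_fraction * depth:
            break
    return IUPAC[frozenset(chosen)]


def call_consensus(pileup, **thresholds):
    """Call a consensus string from a list of per-position base strings."""
    return "".join(call_base(column, **thresholds) for column in pileup)
```

For example, a position covered by "AAAGGG" is reported as the ambiguity code R (A or G) rather than being forced to a single base, and a position with too few reads is reported as N.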

Explore your data in IGV

Besides producing BAM files, Nesoni can produce multi-sample depth of coverage plots for the IGV browser. Display and quickly navigate depth of coverage plots from tens of samples simultaneously, from whole chromosome overview to the individual base level.

Tracks indicating the presence of read multi-mapping, locations of 5' and 3' read ends, and total depth are also produced.
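As a rough illustration of what a depth of coverage track contains, the sketch below computes per-base depth from a list of alignment intervals. It assumes alignments are given as half-open (start, end) positions on a single reference sequence; the function name depth_of_coverage is hypothetical and this is not how Nesoni itself produces its plots.

```python
def depth_of_coverage(length, alignments):
    """Return per-base depth for a reference of the given length.

    alignments is an iterable of half-open (start, end) intervals,
    0-based, as assumed for this sketch.
    """
    # Record +1 where an alignment starts and -1 just past where it ends.
    delta = [0] * (length + 1)
    for start, end in alignments:
        start = max(start, 0)
        end = min(end, length)
        if start < end:
            delta[start] += 1
            delta[end] -= 1

    # A running sum of the changes gives the depth at each base.
    depth = []
    running = 0
    for change in delta[:length]:
        running += change
        depth.append(running)
    return depth
```

Recording only the start and end of each alignment keeps the work proportional to the number of reads plus the reference length, rather than to the total number of aligned bases.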

Multiple perspectives on expression data

Nesoni provides multiple tools for exploring expression data. Don't just blindly reach into your data and pick out a handful of genes; understand the story your data is telling you.

  • Differential expression testing with limma and edgeR. Accompanying heatmaps make it easy to understand the tradeoffs between within-group and between-group variation that drive these tools.
  • Heatmaps with hierarchical clustering.
  • Non-negative Matrix Factorization, a fuzzy clustering technique.

Workflow engine

Nesoni includes tools that make it easy to write parallel processing pipelines in Python.

Pipelines are expressed as Python functions. The translation of a serial program with for-loops and function calls into a parallel program requires only simple localized modifications to the code.

Pipelines expressed in this way are composable, just like ordinary functions.

Much like make, the resulting program only re-runs tools as necessary when parameters change. The dependency structure is implicit in the parallel program: if a tool needs to be re-run, only the things that must execute after that tool also need re-running.
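The general idea can be sketched with the Python standard library alone. The example below is not Nesoni's workflow API: run_if_needed, pipeline, the state_dir argument and the JSON stamp files are all hypothetical names invented for this illustration. It shows a serial for-loop turned into parallel submissions, with each step skipped when it has already run with identical parameters. Unlike the behaviour described above, this sketch does not track which later steps depend on a re-run step.

```python
import json
import os
from concurrent.futures import ThreadPoolExecutor


def run_if_needed(state_dir, name, func, **params):
    """Run func(**params) unless it already ran with identical parameters.

    Returns True if the function was run, False if it was skipped.
    """
    os.makedirs(state_dir, exist_ok=True)
    stamp = os.path.join(state_dir, name + ".json")
    record = json.dumps(params, sort_keys=True)
    if os.path.exists(stamp):
        with open(stamp) as f:
            if f.read() == record:
                return False  # up to date, nothing to do
    func(**params)
    # Only record the parameters once the step has completed.
    with open(stamp, "w") as f:
        f.write(record)
    return True


def pipeline(state_dir, samples, tool, quality=10):
    """Apply tool to every sample in parallel, skipping up-to-date steps.

    The serial version would simply be:
        for sample in samples:
            run_if_needed(state_dir, "step-" + sample, tool, ...)
    """
    with ThreadPoolExecutor() as pool:
        futures = [
            pool.submit(run_if_needed, state_dir, "step-" + sample, tool,
                        sample=sample, quality=quality)
            for sample in samples
        ]
        return [future.result() for future in futures]
```

Because pipeline is an ordinary function, it can be called from a larger pipeline function in the same way, which is the sense in which such pipelines compose.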

Further documentation

k-mer and De Bruijn graph tools

Nesoni also includes some highly experimental tools for working with sets of k-mers and De Bruijn graphs. You can:

  • Produce a 2D layout of a De Bruijn graph, and interact with it: zoom in to examine details, overlay read-pair data, overlay sequences on top of it (for example, to examine the behaviour of a de-novo assembler such as Velvet).
  • Clip a set of reads to remove low-frequency or high-frequency k-mers, or k-mers where there is a more frequent k-mer differing by one SNP.
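For readers unfamiliar with these ideas, the following sketch counts k-mers in a set of reads, links k-mers that overlap by k-1 bases (the edges of a De Bruijn graph), and clips a read to its longest stretch of sufficiently frequent k-mers. It is a simplified illustration, not Nesoni's implementation: the function names and the min_count threshold are assumptions, reverse complements are ignored, and only the low-frequency case of clipping is handled.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer occurring in the reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def de_bruijn_edges(counts):
    """Return (kmer, next_kmer) pairs where the k-mers overlap by k-1 bases."""
    edges = set()
    for kmer in counts:
        for base in "ACGT":
            successor = kmer[1:] + base
            if successor in counts:
                edges.add((kmer, successor))
    return edges


def clip_read(read, counts, k, min_count=2):
    """Keep the longest stretch of the read made only of frequent k-mers."""
    n = len(read) - k + 1          # number of k-mers in the read
    best_start, best_end = 0, 0    # best run of good k-mers, half-open
    run_start = None
    for i in range(n + 1):
        good = i < n and counts[read[i:i + k]] >= min_count
        if good and run_start is None:
            run_start = i
        elif not good and run_start is not None:
            if i - run_start > best_end - best_start:
                best_start, best_end = run_start, i
            run_start = None
    if best_end == best_start:
        return ""
    # The last good k-mer starts at best_end - 1 and spans k bases.
    return read[best_start:best_end + k - 1]
```

K-mers seen only once are often the result of sequencing errors, which is the usual motivation for removing low-frequency k-mers before further analysis.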


This poster, presented at BA2009, describes some of Nesoni's basic capabilities.

Requirements

Python 2.7 or higher. Use of PyPy where possible is highly recommended 
for performance.

Python libraries
* Recommended:
  * BioPython [1]
* Optional (used by non-core nesoni tools):
  * matplotlib
  * numpy

External programs:
* Required:
  * SHRiMP or Bowtie2
  * samtools (0.1.19 or higher)
* Required for VCF based variant calling:
  * Picard [2]
  * Freebayes (9.9.2 or higher)
* Optional for VCF based variant calling:
  * SplitsTree4

R libraries required by R-based tools (mostly for RNA-seq):
* Required:
  * limma, edgeR (3.2.4 or higher) from BioConductor
  * seriation
* Optional:
  * goseq from BioConductor
  * NMF


[1] BioPython is used for reading GenBank files.
    Compiled modules may need to be disabled when installing in PyPy.

[2] There does not seem to be a standard way to install .jar files. 
    Nesoni will search for .jar files in directories listed in 
    environment variables $PATH and $JARPATH.
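To check in advance whether a .jar file would be visible through these variables, a lookup along the following lines can be used. This is a sketch of the behaviour described in note [2], not Nesoni's own code: the function name find_jar is hypothetical, and the order in which the two variables are searched is an assumption.

```python
import os


def find_jar(name, environ=None):
    """Return the first path to `name` found via $PATH or $JARPATH, or None."""
    environ = os.environ if environ is None else environ
    for variable in ("PATH", "JARPATH"):  # search order is an assumption
        for directory in environ.get(variable, "").split(os.pathsep):
            if not directory:
                continue
            candidate = os.path.join(directory, name)
            if os.path.isfile(candidate):
                return candidate
    return None
```

Calling find_jar with the file name of a .jar you have installed returns its full path if it can be found, or None if the directory containing it still needs to be added to one of the two variables.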



Installation

The easy way to install or upgrade:

    pip install -I nesoni

Then run "nesoni" and follow the command it displays to install the R module.

See below for more ways to install nesoni.


 Advanced Installation 
-----------------------

From source, download and untar the source tarball, then:

    python setup.py install

Optional:

    R CMD INSTALL nesoni/nesoni-r


For PyPy, it currently seems easiest to install nesoni in 
a virtualenv:

    virtualenv -p pypy my-pypy-env
    my-pypy-env/bin/pip install -I biopython 
    my-pypy-env/bin/pip install -I nesoni

You can also set up a CPython virtualenv like this:

    virtualenv my-python-env
    my-python-env/bin/pip install -I numpy 
    my-python-env/bin/pip install -I matplotlib 
    my-python-env/bin/pip install -I biopython 
    my-python-env/bin/pip install -I nesoni


 Installing older versions
---------------------------

The interface and API for nesoni may change between versions (I try to
keep this to a minimum). In order to run a script or python program
that needs an older version, I suggest setting up a virtualenv.

For example, if you want version 0.95:

    virtualenv -p pypy my-old-env
    my-old-env/bin/pip install -I biopython 
    my-old-env/bin/pip install -I nesoni==0.95

(Note: I don't have a neat way to make this work with the R components 
of nesoni.)



Documentation

Nesoni provides the following specific usage information when run with no parameters:


nesoni 0.117 - high-throughput sequencing data analysis toolset

Usage:

    nesoni <tool>: ...

Give <tool>: without further arguments for help on using that tool.


Alignment to reference -- core tools:

    make-reference:
                  - Set up a directory containing a reference sequence,
                    annotations, and files for SHRiMP and/or Bowtie.

    shrimp:       - Run SHRiMP 2 on a read set to set up a working
                    directory.
    
    bowtie:       - Run Bowtie 2 on a read set to set up a working
                    directory.
    
    consensus:    - Filter read hits, and try to call a consensus for each 
                    position in reference.
    
    (import:      - Pipe SAM alignments to set up a working directory)
    (filter:      - Filter read hits, but do not call consensus)
    (reconsensus: - Re-call consensus, using previously filtered hits)


Alignment to reference -- VCF based tools: (under development)

These provide an alternative to consensus calling using "nesoni consensus:"
- better handling of complicated Multi-Nucleotide Polymorphisms
- can't distinguish between absence of a variant and insufficient data
  (but can distinguish absence of a variant from insufficient data 
   in a single sample if variant present in other samples)

    freebayes:    - Run FreeBayes to produce a VCF file.
    
    vcf-filter:   - Filter a VCF file, eg as produced by "nesoni freebayes:".
    
    snpeff:       - Run snpEff to annotate variants with their effects.
    
    vcf-nway:     - Summarize a VCF file in a variety of possible ways.
    
    vcf-patch:    - Patch in variants to produce genome of samples.
                    (similar to consensus_masked.fa produced by "nesoni consensus:")
    
    test-variant-call:
                  - Generate synthetic reads, see what variant is called.
    
    power-variant-call:
                  - Apply "nesoni test-variant-call:" to a variety of
                    different variants over a range of depths.


Alignment to reference -- analysis tools:

    igv-plots:    - Generate plots for IGV.

    nway:         - Compare results of two or more runs of nesoni consensus,
                    amongst themselves and optionally with the reference.
                 
                    Can produce output suitable for phylogenetic analysis
                    in SplitsTree4.
        
    fisher:       - Compare results of two runs of nesoni consensus using
                    Fisher's Exact Test for each site in the reference.

    core:         - Infer core genome present in a set of strains.

    (consequences: 
                  - Determine effects at the amino acid level of SNPs and INDELs
                    called by nesoni consensus. Most of the features of this tool
                    are now a part of "consensus:".)


Alignment to reference -- differential expression:

    count:        - Count number of alignments to genes, using output from
                    "shrimp:".

    test-counts:  - Use edgeR or limma from BioConductor to detect differentially
                    expressed genes, using output from "count:".
    
    test-power:   - Test the statistical power of "nesoni test-counts:" with
                    simulated data.

    plot-counts:  - Plot counts against each other.
    
    norm-from-counts:
                  - Calculate normalizing multipliers from counts using TMM.
    
    norm-from-samples:
                  - Calculate normalizing multipliers from working directories.
    
    heatmap:      - Draw a heat map of counts.
    
    nmf:          - Perform a Non-negative Matrix Factorization of counts.
                    NMF is a type of fuzzy clustering.
    
    compare-tests:
                  - Compare the output from two runs of "test-counts:"
                    eg to compare the results of different "--mode"s
    
    similarity:
                  - Compare samples in a counts file in various ways.
    
    glog:
                  - Obtain glog2 RPM values from a counts file.

An R+ module is included with nesoni which will help load the output from
"count:", for analysis with BioConductor packages.


Peak calling and annotation manipulation tools:

    islands:
    transcripts:
    modes:
                  - Various peak and transcript calling algorithms.
    
    modify-features:
                  - Shift start or end position of features,
                    filter by type, change type.
    
    collapse-features:
                  - Merge overlapping features.
    
    relate-features:
                  - Find features from one set that are near or overlapping
                    features from another set.
    
    as-gff:       - Output an annotation in GFF format,
                    optionally filtering by annotation type.

    
k-mer tools: (experimental)

    bag:          - Create an index of kmers in a read set for analysis with
                    nesoni graph.
    
    graph:        - Use a bag or bags to lay out a deBruijn graph.
                    Interact with the graph in various ways.


Utility tools:

    clip:         - Remove Illumina adaptor sequences and low quality bases
                    from reads.

    shred:        - Break a sequence into small overlapping pieces.
                    In case you want to run an existing sequence through the 
                    above tools. Yes, this isn't ideal.

    as-fasta:     - Output a sequence file in FASTA format.
    
    as-userplots: - Convert a .igv file to a set of .userplot files
                    for viewing in Artemis.
    
    make-genome:  - Make an IGV .genome file.
    
    run-igv:      - Run IGV with a specified .genome file.
    
    sample:       - Randomly sample from a sequence file.
    
    stats:        - Show some statistics about a sequence or annotation file.

    fill-scaffolds: - Guess what might be in the gaps in a 454 scaffold.
    
    pastiche:     - Use MUMMER to plaster a set of contigs over reference 
                    sequences.

    changes:      - Prints out change log file.
                    

Pipeline tools:
    
    analyse-sample:
                  - Clip, align, and call consensus on a set of reads.
                  
    analyse-variants:
                  - Produce a VCF file listing SNPs and other variants in
                    a set of samples.
    
    analyse-expression:
                  - Count alignments of fragments to genes,
                    then perform various types of statistics and 
                    visualization on this.

    analyse-samples:
                  - Run "analyse-sample:" on a set of different samples,
                    then run "analyse-variants:" and/or "analyse-expression:".
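
As background for the "fisher:" tool listed above, the sketch below computes a two-sided Fisher's Exact Test p-value for a single 2x2 table using only the Python standard library (Python 3.8 or higher, for math.comb). The choice of table is an assumption made for illustration: counts of reference and non-reference bases at one site in each of two samples. The function name is hypothetical and this is not Nesoni's implementation.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's Exact Test p-value for the table [[a, b], [c, d]].

    For example (an assumed layout): a, b are reference and non-reference
    base counts in sample 1, and c, d are the same counts in sample 2.
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    total = row1 + row2
    if total == 0:
        return 1.0

    def probability(x):
        # Hypergeometric probability of seeing x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / comb(total, col1)

    observed = probability(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    # Sum the probabilities of all tables at least as unlikely as the
    # observed one, with a small tolerance for floating point error.
    p_value = sum(
        probability(x)
        for x in range(lowest, highest + 1)
        if probability(x) <= observed * (1 + 1e-9)
    )
    return min(1.0, p_value)
```

A small p-value at a site suggests the two samples differ in base composition there by more than chance would explain. When testing every site in a reference, some form of multiple testing correction is normally needed as well.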


Contact

Older versions