Nesoni

Download

14 May 2013:

Nesoni is free software, released under the GPL (version 2).

Description

Nesoni is a high-throughput sequencing data analysis toolset, which the VBC has developed to cope with the flood of Illumina, 454, and SOLiD data now being produced.

Our work is largely with bacterial genomes, and the design tradeoffs in nesoni reflect this.

Alignment to reference

Nesoni focusses on analysing the alignment of reads to a reference genome. We use the SHRiMP read aligner, as it is able to detect small insertions and deletions in addition to SNPs.

Nesoni can call a consensus of read alignments, taking care to indicate ambiguity. This can then be used in various ways: to determine the protein level changes resulting from SNPs and indels, to find differences between multiple strains, or to produce n-way comparison data suitable for phylogenetic analysis in SplitsTree4.
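
As a rough illustration of what "indicating ambiguity" can mean, the following Python 3 sketch calls a single position from the bases of the reads covering it and falls back to an IUPAC ambiguity code when more than one base is well supported. This is not nesoni's actual algorithm; the function name and the min_depth and min_fraction thresholds are assumptions made for the example.

```python
from collections import Counter

# Standard IUPAC nucleotide codes for sets of bases.
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C",
    frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y",
    frozenset("CG"): "S", frozenset("AT"): "W",
    frozenset("GT"): "K", frozenset("AC"): "M",
    frozenset("CGT"): "B", frozenset("AGT"): "D",
    frozenset("ACT"): "H", frozenset("ACG"): "V",
    frozenset("ACGT"): "N",
}

def call_position(bases, min_depth=3, min_fraction=0.2):
    """Call one reference position from the read bases covering it.

    min_depth and min_fraction are hypothetical thresholds for this sketch.
    """
    counts = Counter(b for b in bases.upper() if b in "ACGT")
    depth = sum(counts.values())
    if depth < min_depth:
        return "N"  # too little data to make any call
    # Keep every base that makes up a meaningful share of the pile-up.
    kept = frozenset(b for b, n in counts.items() if n / depth >= min_fraction)
    return IUPAC.get(kept, "N")
```

With these thresholds, a clean pile-up such as "AAAAAAAA" is called as A, while an even mix of A and G is reported as R rather than being forced to a single base.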

Alternatively, the raw counts of bases at each position in the reference seen in two different sequenced strains can be compared using Fisher's Exact Test.
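
To make the per-site comparison concrete, here is a minimal Python 3 sketch of a two-sided Fisher's Exact Test on a 2x2 table. It assumes the counts at a site have been reduced to "reference base" versus "any other base" for each of the two strains; that reduction and the function name are assumptions for illustration, and the "fisher:" tool may treat the counts differently.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's Exact Test p-value for the table [[a, b], [c, d]].

    Rows are the two strains; columns are reference / non-reference counts.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell when the
        # row and column totals are held fixed.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    observed = prob(a)
    lo = max(0, col1 - row2)
    hi = min(row1, col1)
    # Sum the probability of every table at least as unlikely as the
    # observed one (small tolerance for floating point ties).
    p = sum(prob(x) for x in range(lo, hi + 1)
            if prob(x) <= observed * (1 + 1e-9))
    return min(1.0, p)
```

For example, fisher_exact_two_sided(10, 0, 0, 10) describes a site where one strain shows only the reference base and the other shows none of it, and yields a very small p-value, whereas identical counts in both strains give a p-value of 1.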

Workflow engine

Nesoni includes tools that make it easy to write parallel processing pipelines in Python.

Pipelines are expressed as Python functions. The translation of a serial program with for-loops and function calls into a parallel program requires only simple localized modifications to the code.

Pipelines expressed in this way are composable, just like ordinary functions.
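
The general idea of turning a serial loop into a parallel one with a small local change can be sketched with Python's standard library. This Python 3 example uses concurrent.futures rather than nesoni's own workflow API, and every function name in it is hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

def analyse_sample(name):
    # Stand-in for a slow per-sample step such as clipping and aligning.
    return name + ".aligned"

def pipeline_serial(samples):
    # Ordinary serial version: a plain loop over samples.
    return [analyse_sample(s) for s in samples]

def pipeline_parallel(samples, cores=4):
    # Same loop, but each iteration is handed to a worker pool.
    # map() returns results in input order.
    with ThreadPoolExecutor(max_workers=cores) as pool:
        return list(pool.map(analyse_sample, samples))

def compare_strains(groups):
    # Pipelines compose like ordinary functions: one pipeline calls another.
    return {group: pipeline_parallel(samples)
            for group, samples in groups.items()}
```

Only the body of the loop's enclosing function changes between the serial and parallel versions, and compare_strains shows one pipeline being reused inside another.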

Much like make, the resultant program re-runs tools only as necessary when parameters are changed. The dependency structure is implicit in the parallel program: if a tool needs to be re-run, only the things that must execute after that tool also need re-running.
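
A make-like "re-run only if parameters changed" rule can be sketched as follows in Python 3. This is an illustration of the idea under simple assumptions (parameters are JSON-serializable, one stamp file per step), not nesoni's implementation; run_if_changed, pipeline, and the step names are hypothetical.

```python
import hashlib
import json
import os

def run_if_changed(state_dir, name, params, action, force=False):
    """Run action(**params) unless it already ran with identical params.

    Returns True if the action was run, False if it was skipped.
    """
    os.makedirs(state_dir, exist_ok=True)
    stamp_path = os.path.join(state_dir, name + ".json")
    digest = hashlib.sha256(
        json.dumps(params, sort_keys=True).encode("utf-8")
    ).hexdigest()

    if not force and os.path.exists(stamp_path):
        with open(stamp_path) as f:
            if json.load(f).get("digest") == digest:
                return False  # parameters unchanged: nothing to do

    action(**params)
    # Record the parameters only after the action has succeeded.
    with open(stamp_path, "w") as f:
        json.dump({"digest": digest}, f)
    return True

def pipeline(state_dir, quality):
    log = []
    reran = run_if_changed(state_dir, "clip", {"quality": quality},
                           lambda quality: log.append("clip"))
    # "align" executes after "clip", so it must re-run whenever "clip" did.
    run_if_changed(state_dir, "align", {},
                   lambda: log.append("align"), force=reran)
    return log
```

Calling pipeline twice with the same quality does no work the second time; changing quality re-runs "clip" and, because it comes later in the program, "align" as well.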

Further documentation

k-mer and De Bruijn graph tools

Nesoni also includes some highly experimental tools for working with sets of k-mers and De Bruijn graphs.

Documentation

This poster, presented at BA2009, gives an overview of Nesoni's capabilities.

Nesoni provides the following specific usage information when run with no parameters:


nesoni 0.104 - high-throughput sequencing data analysis toolset

Usage:

    nesoni <tool>: ...

Give <tool>: without further arguments for help on using that tool.


Alignment to reference -- core tools:

    make-reference:
                  - Set up a directory containing a reference sequence,
                    annotations, and files for SHRiMP and/or Bowtie.

    shrimp:       - Run SHRiMP 2 on a read set to set up a working
                    directory.
    
    bowtie:       - Run Bowtie 2 on a read set to set up a working
                    directory.
    
    consensus:    - Filter read hits, and try to call a consensus for each
                    position in reference.
    
    (import:      - Pipe SAM alignments to set up a working directory)
    (filter:      - Filter read hits, but do not call consensus)
    (reconsensus: - Re-call consensus, using previously filtered hits)


Alignment to reference -- VCF based tools: (under development)

These provide an alternative to consensus calling using "nesoni consensus:"
- better handling of complicated Multi-Nucleotide Polymorphisms
- can't distinguish between absence of a variant and insufficient data
  (but can distinguish absence of a variant from insufficient data 
   in a single sample if variant present in other samples)

    freebayes:    - Run FreeBayes to produce a VCF file.
    
    vcf-filter:   - Filter a VCF file, eg as produced by "nesoni freebayes:".
    
    snpeff:       - Run snpEff to annotate variants with their effects.
    
    vcf-nway:     - Summarize a VCF file in a variety of possible ways.
    
    vcf-patch:    - Patch in variants to produce genome of samples.
                    (similar to consensus_masked.fa produced by "nesoni consensus:")
    
    test-variant-call:
                  - Generate synthetic reads, see what variant is called.
    
    power-variant-call:
                  - Apply "nesoni test-variant-call:" to a variety of
                    different variants over a range of depths.


Alignment to reference -- analysis tools:

    igv-plots:    - Generate plots for IGV.

    nway:         - Compare results of two or more runs of nesoni consensus,
                    amongst themselves and optionally with the reference.
                 
                    Can produce output suitable for phylogenetic analysis
                    in SplitsTree4.
        
    fisher:       - Compare results of two runs of nesoni consensus using
                    Fisher's Exact Test for each site in the reference.

    normalize:    - Create normalized Artemis depth plots.
                    See also "igv-plots:".

    core:         - Infer core genome present in a set of strains.

    (consequences: 
                  - Determine effects at the amino acid level of SNPs and INDELs
                    called by nesoni consensus. Most of the features of this tool
                    are now a part of "consensus:".)


Alignment to reference -- differential expression:

    count:        - Count number of alignments to genes, using output from
                    "shrimp:".

    test-counts:  - Use edgeR or limma from BioConductor to detect differentially
                    expressed genes, using output from "count:".
    
    test-power:   - Test the statistical power of "nesoni test-counts:" with
                    simulated data.

    plot-counts:  - Plot counts against each other.
    
    norm-from-counts:
                  - Calculate normalizing multipliers from counts.
    
    heatmap:      - Draw a heat map of counts.
    
    nmf:          - Perform a Non-negative Matrix Factorization of counts.
                    NMF is a type of fuzzy clustering.
    
    compare-tests:
                  - Compare the output from two runs of "test-counts:"
                    eg to compare the results of different "--mode"s

An R module is included with nesoni which will help load the output from
"count:", for analysis with BioConductor packages.


Peak calling and annotation manipulation tools:

    islands:
    transcripts:
    modes:
                  - Various peak and transcript calling algorithms.
    
    modify-features:
                  - Shift start or end position of features,
                    filter by type, change type.
    
    collapse-features:
                  - Merge overlapping features.
    
    relate-features:
                  - Find features from one set that are near or overlapping
                    features from another set.
    
    as-gff:       - Output an annotation in GFF format,
                    optionally filtering by annotation type.

    
k-mer tools: (experimental)

    bag:          - Create an index of kmers in a read set for analysis with
                    nesoni graph.
    
    graph:        - Use a bag or bags to lay out a deBruijn graph.
                    Interact with the graph in various ways.
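
For readers unfamiliar with the terms, the following Python 3 sketch shows the basic data structures involved: a "bag" counting every k-mer in a read set, and a De Bruijn graph in which each k-mer is an edge between its (k-1)-mer prefix and suffix. It ignores reverse complements and sequencing errors, and the function names are hypothetical rather than taken from nesoni.

```python
from collections import Counter, defaultdict

def kmer_bag(reads, k):
    """Count every k-mer occurring in a collection of reads."""
    bag = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            bag[read[i:i + k]] += 1
    return bag

def de_bruijn_edges(bag):
    """Build a graph of (k-1)-mers: each k-mer links its prefix to its suffix."""
    graph = defaultdict(set)
    for kmer in bag:
        graph[kmer[:-1]].add(kmer[1:])
    return graph
```

For instance, the reads "ACGT" and "CGTA" with k=3 share the k-mer CGT, so the resulting graph forms a single path AC, CG, GT, TA.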


Utility tools:

    clip:         - Remove Illumina adaptor sequences and low quality bases
                    from reads.

    shred:        - Break a sequence into small overlapping pieces.
                    In case you want to run an existing sequence through the 
                    above tools. Yes, this isn't ideal.

    as-fasta:     - Output a sequence file in FASTA format.
    
    as-userplots: - Convert a .igv file to a set of .userplot files
                    for viewing in Artemis.
    
    make-genome:  - Make an IGV .genome file.
    
    run-igv:      - Run IGV with a specified .genome file.
    
    sample:       - Randomly sample from a sequence file.
    
    stats:        - Show some statistics about a sequence or annotation file.

    fill-scaffolds: - Guess what might be in the gaps in a 454 scaffold.
    
    pastiche:     - Use MUMMER to plaster a set of contigs over reference 
                    sequences.

    changes:      - Prints out change log file.
                    

Pipeline tools:
    
    analyse-sample:
                  - Clip, align, and call consensus on a set of reads.
                  
    analyse-variants:
                  - Produce a VCF file listing SNPs and other variants in
                    a set of samples.
    
    analyse-expression:
                  - Count alignments of fragments to genes,
                    then perform various types of statistics and 
                    visualization on this.

    analyse-samples:
                  - Run "analyse-sample:" on a set of different samples,
                    then run "analyse-variants:" and/or "analyse-expression:".

If a pipeline tool is run again, it restarts only from the point affected 
by the changed parameters. The following global flags control pipeline tool 
behaviour:

    --make-cores 64
      # Approximate number of cores to use.
    --make-do ''
      # Force this selection of tool names to be recomputed.
      # Examples: --make-do all  --make-do analyse-samples/analyse-sample
    --make-done ''
      # Mark this selection of tool names as done without recomputing
      # them, if they would be recomputed.
    --make-show no
      # Show the first actions that would be made (other than those
      # specified by "--make-do"), then abort.
    --make-address 127.0.1.1
      # IP address of the network interface you want the job manager to
      # listen to.
    --make-job '__command__ &'
      # Command to launch a new python. Should either contain
      # __command__, which will be substituted with the full shell
      # command, including the job name, or __token__ and __jobname__,
      # which should be used in something like "python -m nesoni.legion
      # __token__ __jobname__".
    --make-kill 'pkill -f __jobname__'
      # Command to kill all processes identified by __jobname__.


                    
Input files:
- sequence files can be in FASTA, FASTQ, or GENBANK format.
- annotation files can be in GENBANK or GFF format 
  (GFF is not yet supported by all tools).
- nesoni is able to read files compressed with gzip or bzip2.
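
If you need to read the same mix of plain and compressed files in your own Python 3 scripts, one common approach is to look at the first bytes of the file. This is a sketch of that approach, not a description of how nesoni detects compression; open_maybe_compressed is a hypothetical helper name.

```python
import bz2
import gzip

def open_maybe_compressed(path):
    """Open a text file that may be gzip- or bzip2-compressed."""
    with open(path, "rb") as f:
        magic = f.read(3)
    if magic[:2] == b"\x1f\x8b":   # gzip magic number
        return gzip.open(path, "rt")
    if magic == b"BZh":            # bzip2 magic number
        return bz2.open(path, "rt")
    return open(path, "r")
```

Checking the content rather than the file extension means a renamed or extension-less file is still read correctly.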


Selections and sorts:
Working directories can be given a set of tags using "tag:". They also 
implicitly have a tag for the name of the directory, and a tag "all".

A selection expression is a logical expression used to select a subset 
of working directories. It may consist of (grouped by precedence):

  tag        - Working directories with tag

  [exp]      - exp

  -exp       - not exp

  exp1:exp2  - exp1 and exp2
  exp1/exp2  - exp1 or exp2
  exp1^exp2  - exp1 xor exp2

Example:

  [strain1:time1]/[-strain1:-time1]   
             - Samples either from strain1 at time1, 
               or not from strain1 and not from time1.
               Equivalently: strain1^-time1
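
The grammar above can be checked with a small evaluator. The Python 3 sketch below is an independent illustration, not nesoni's parser. It assumes that tags contain only letters, digits, "_" and ".", and that the three binary operators share one precedence level and associate to the left; the function name matches is hypothetical.

```python
import re

# Assumption for this sketch: tags use only letters, digits, "_" and ".".
TOKEN = re.compile(r"[\[\]\-:/^]|[A-Za-z0-9_.]+")

def matches(expression, tags):
    """Return True if a working directory with these tags is selected."""
    expression = expression.replace(" ", "")
    tokens = TOKEN.findall(expression)
    if "".join(tokens) != expression:
        raise ValueError("unrecognised character in expression")
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take():
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("unexpected end of expression")
        pos += 1
        return tokens[pos - 1]

    def atom():                      # tag or [exp]
        token = take()
        if token == "[":
            value = binary()
            if take() != "]":
                raise ValueError("expected ]")
            return value
        if token in ("]", "-", ":", "/", "^"):
            raise ValueError("unexpected " + token)
        return token in tags

    def unary():                     # -exp
        if peek() == "-":
            take()
            return not unary()
        return atom()

    def binary():                    # exp1:exp2, exp1/exp2, exp1^exp2
        value = unary()
        while peek() in (":", "/", "^"):
            op = take()
            right = unary()
            if op == ":":
                value = value and right
            elif op == "/":
                value = value or right
            else:
                value = value != right
        return value

    result = binary()
    if pos != len(tokens):
        raise ValueError("unexpected " + tokens[pos])
    return result
```

Evaluating both forms of the example against every combination of the tags strain1 and time1 gives the same answer each time, which confirms the stated equivalence under these assumptions.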

A sort expression is a comma separated list of selection expressions,
used to sort a list of working directories.

Example:

  strain1,strain2,time1,time2,time3,replicate1,replicate2
             - Sort, grouping by strain, then by time, then by replicate
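
One plausible reading of a sort expression is that working directories matching an earlier expression come first, with later expressions breaking ties. The Python 3 sketch below implements that reading; it is an assumption about the behaviour, not taken from nesoni's source. For brevity, each expression is treated as a plain tag unless a fuller matcher is passed in, and sort_directories is a hypothetical name.

```python
def sort_directories(sort_expression, directories,
                     matches=lambda exp, tags: exp in tags):
    """Sort {name: set_of_tags} by a comma-separated sort expression."""
    expressions = sort_expression.split(",")

    def key(name):
        # False sorts before True, so a directory matching an earlier
        # expression comes first; later expressions break ties.
        return [not matches(exp, directories[name]) for exp in expressions]

    return sorted(directories, key=key)
```

Under this reading, sorting by "strain1,strain2,time1,time2" places all strain1 directories before strain2 ones, and orders by time within each strain.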


 Requirements
==============

Python 2.7 or higher. Use of PyPy where possible is highly recommended 
for performance.

Python libraries
* Recommended:
  * BioPython [1]
* Optional (used by non-core nesoni tools):
  * matplotlib
  * numpy

External programs:
* Required:
  * SHRiMP or Bowtie2
  * samtools
* Required for VCF based variant calling:
  * Picard [2]
  * Freebayes
* Optional for VCF based variant calling:
  * SplitsTree4

R libraries required by R-based tools (mostly for RNA-seq):
* Required:
  * limma, edgeR from BioConductor
  * seriation
* Optional:
  * goseq from BioConductor
  * NMF


[1] BioPython is used for reading GenBank files.
    Compiled modules may need to be disabled when installing in PyPy.

[2] There does not seem to be a standard way to install .jar files. 
    Nesoni will search for .jar files in directories listed in 
    environment variables $PATH and $JARPATH.
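
    The search described in [2] can be pictured with the following Python 3
    sketch. The order in which the two variables are searched and the helper
    name find_jar are assumptions made for illustration.

```python
import os

def find_jar(jar_name, environ=None):
    """Return the first path to jar_name found via $PATH or $JARPATH, else None."""
    environ = os.environ if environ is None else environ
    for variable in ("PATH", "JARPATH"):
        for directory in environ.get(variable, "").split(os.pathsep):
            candidate = os.path.join(directory, jar_name)
            # Skip empty entries so the current directory is not searched
            # by accident.
            if directory and os.path.isfile(candidate):
                return candidate
    return None
```

    In practice this means adding the directory that holds the .jar files to
    $JARPATH (or $PATH) before running nesoni.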



 Installation
==============

The easy way to install or upgrade:

    pip install -I nesoni

Then type "nesoni" and follow the command to install the R module.

See below for more ways to install nesoni.


 Advanced Installation 
-----------------------

From source, download and untar the source tarball, then:

    python setup.py install

Optional:

    R CMD INSTALL nesoni/nesoni-r


For PyPy it seems to be currently easiest to install nesoni in 
a virtualenv:

    virtualenv -p pypy my-pypy-env
    my-pypy-env/bin/pip install -I biopython 
    my-pypy-env/bin/pip install -I nesoni

You can also set up a CPython virtualenv like this:

    virtualenv my-python-env
    my-python-env/bin/pip install -I numpy 
    my-python-env/bin/pip install -I matplotlib 
    my-python-env/bin/pip install -I biopython 
    my-python-env/bin/pip install -I nesoni


 Installing older versions
---------------------------

The interface and API for nesoni may change between versions (I try to
keep this to a minimum). In order to run a script or python program
that needs an older version, I suggest setting up a virtualenv.

For example, if you want version 0.95:

    virtualenv -p pypy my-old-env
    my-old-env/bin/pip install -I biopython 
    my-old-env/bin/pip install -I nesoni==0.95

(Note: I don't have a neat way to make this work with the R components 
of nesoni.)



Contact

Older versions