SPAdes Manual: Installation and Running Guide for SPAdes Genome Assembler
How to Download and Use SPAdes Assembler
SPAdes - St. Petersburg genome assembler - is an assembly toolkit that contains various assembly pipelines for different types of sequencing data. It was originally developed for de novo assembly of bacterial and viral genomes from single-cell or isolate samples, but it has been extended to support metagenomic, plasmid, transcriptomic, and biosynthetic gene cluster assembly as well. SPAdes can also perform hybrid assembly using short reads (Illumina or IonTorrent) and long reads (PacBio, Oxford Nanopore, or Sanger). SPAdes is one of the most widely used assemblers in the field, and it has several advantages over other assemblers, such as:
It can handle complex repeat structures and large genome variations.
It can produce high-quality assemblies with low error rates and high gene completeness.
It can assemble genomes from low-coverage or unevenly distributed data.
It can assemble multiple genomes from mixed samples.
It can assemble novel sequences that are not present in reference genomes.
In this article, I will show you how to download and use SPAdes assembler for your own genome assembly projects. I will cover the following topics:
How to download SPAdes binaries or source code for Linux or Mac.
How to verify your installation and run a self-test.
How to provide input data and command line options for different assembly pipelines.
How to evaluate the output files and statistics.
By the end of this article, you should be able to perform de novo genome assembly using SPAdes with confidence and ease. Let's begin!
Downloading SPAdes
The first step is to download SPAdes from its official website: http://cab.spbu.ru/software/spades/. You can choose to download either the pre-compiled binaries or the source code, depending on your operating system and preference. The latest version of SPAdes is 3.15.5, which was released on July 14th, 2022 under GPLv2 license.
Downloading SPAdes binaries for Linux
If you are using a Linux system (64-bit only), you can download the pre-compiled binaries from the website. The file name is SPAdes-3.15.5-Linux.tar.gz. You can use the following command to download it:
wget http://cab.spbu.ru/files/release3.15.5/SPAdes-3.15.5-Linux.tar.gz
Alternatively, you can use a web browser to download it manually. After downloading, you need to extract the file using the following command:
tar -xzf SPAdes-3.15.5-Linux.tar.gz
This will create a folder named SPAdes-3.15.5-Linux, which contains the executable files and other resources for SPAdes.
Downloading SPAdes binaries for Mac
If you are using a Mac system (64-bit only), you can download the pre-compiled binaries from the website as well. The file name is SPAdes-3.15.5-Darwin.tar.gz. You can use the following command to download it:
wget http://cab.spbu.ru/files/release3.15.5/SPAdes-3.15.5-Darwin.tar.gz
Alternatively, you can use a web browser to download it manually. After downloading, you need to extract the file using the following command:
tar -xzf SPAdes-3.15.5-Darwin.tar.gz
This will create a folder named SPAdes-3.15.5-Darwin, which contains the executable files and other resources for SPAdes.
Downloading SPAdes source code
If you prefer to compile SPAdes from source code, or if you are using a different operating system, you can download the source code from the website as well. The file name is SPAdes-3.15.5.tar.gz. You can use the following command to download it:
wget http://cab.spbu.ru/files/release3.15.5/SPAdes-3.15.5.tar.gz
Alternatively, you can use a web browser to download it manually. After downloading, you need to extract the file using the following command:
tar -xzf SPAdes-3.15.5.tar.gz
This will create a folder named SPAdes-3.15.5, which contains the source code and other resources for SPAdes.
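The three download commands above differ only in the archive suffix, so the URL can be built from the version and the platform. Here is a minimal sketch of that idea. The function name spades_url is my own invention for illustration, and the URL pattern is simply the one used in the wget commands in this article; if the site layout changes, the pattern will need updating.

```shell
# Hypothetical helper: build the download URL used in the wget commands above.
# $1 = version (for example 3.15.5)
# $2 = platform suffix: Linux, Darwin, or empty for the source archive
spades_url() {
    version=$1
    platform=${2:-}
    if [ -n "$platform" ]; then
        archive="SPAdes-$version-$platform.tar.gz"
    else
        archive="SPAdes-$version.tar.gz"
    fi
    printf '%s\n' "http://cab.spbu.ru/files/release$version/$archive"
}

# Only prints the URL; nothing is downloaded here.
spades_url 3.15.5 Linux
```

To download for real, you could pass the result to wget, for example wget "$(spades_url 3.15.5 Darwin)".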
To compile SPAdes from source code, you need to have some prerequisites installed on your system, such as CMake, GCC, Python 2 or 3, zlib, bzip2, and Boost libraries. You can check the detailed instructions on how to install these prerequisites on the SPAdes website: http://cab.spbu.ru/software/spades/#prereq. Once you have installed the prerequisites, you can use the following commands to compile SPAdes:
cd SPAdes-3.15.5
./spades_compile.sh
This will build SPAdes into the bin folder, where you will find the spades.py launcher script.
Installing SPAdes
After downloading and extracting (or compiling) SPAdes, you need to install it on your system. The installation process is very simple and straightforward. You just need to add the bin folder of SPAdes to your system's PATH variable, so that you can run SPAdes from any directory.
Installing SPAdes on Linux
If you are using a Linux system, you can add the bin folder of SPAdes to your PATH variable by editing your .bashrc file (or equivalent) in your home directory. You can use the following command to open the file with a text editor (such as nano):
nano ~/.bashrc
Then, add the following line at the end of the file (replace /path/to/SPAdes-3.15.5-Linux/bin with the actual path of your SPAdes bin folder):
export PATH=$PATH:/path/to/SPAdes-3.15.5-Linux/bin
Save and close the file, and then run the following command to apply the changes:
source ~/.bashrc
You can now run SPAdes from any directory by typing spades.py.
Installing SPAdes on Mac
If you are using a Mac system, you can add the bin folder of SPAdes to your PATH variable by editing your .bash_profile file (or equivalent) in your home directory. You can use the following command to open the file with a text editor (such as nano):
nano ~/.bash_profile
Then, add the following line at the end of the file (replace /path/to/SPAdes-3.15.5-Darwin/bin with the actual path of your SPAdes bin folder):
export PATH=$PATH:/path/to/SPAdes-3.15.5-Darwin/bin
Save and close the file, and then run the following command to apply the changes:
source ~/.bash_profile
You can now run SPAdes from any directory by typing spades.py.
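One side effect of a plain export line is that the same folder is appended again each time the startup file is sourced. A small guard avoids that. The sketch below is one way to write it; add_to_path is a hypothetical helper name, and the folder shown in the comment is the same placeholder path used above.

```shell
# Hypothetical helper: append a directory to PATH only if it is not there yet.
add_to_path() {
    case ":$PATH:" in
        *":$1:"*) ;;                 # already present, nothing to do
        *) PATH="$PATH:$1" ;;        # append once
    esac
    export PATH
}

# Example call for a startup file (placeholder path, replace with your own):
# add_to_path /path/to/SPAdes-3.15.5-Linux/bin
```

Wrapping the path in colons before matching makes sure that only a complete PATH entry counts as a match, not a longer path that happens to contain the same text.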
Verifying SPAdes installation and running a self-test
After installing SPAdes, you should verify that it works properly on your system. You can do this by running a self-test that comes with SPAdes. The self-test will run SPAdes on a small dataset and check if the output matches the expected results.
To run the self-test, run spades.py with the --test option. Because the bin folder is on your PATH, you can do this from any directory:
spades.py --test
This will launch SPAdes in test mode and run it on a small dataset of E. coli reads. The test will take a few minutes to complete, and it will generate some output files in a folder named spades_test. You should see something like this at the end of the test:
========= TEST PASSED CORRECTLY.
This means that SPAdes ran successfully and produced the correct output. If you see any errors or warnings, you should check the log file (spades.log) for more details and troubleshoot the problem.
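If you run the self-test from a script, you can check the log instead of reading the screen. The sketch below assumes that the log is written to spades_test/spades.log and that it contains the "test passed" message shown above; selftest_status is a hypothetical helper name. The match is case-insensitive, so small differences in the wording between versions should not matter.

```shell
# Hypothetical helper: report whether a SPAdes self-test log contains
# the "test passed" message. $1 = path to the log file.
selftest_status() {
    if grep -qi 'test passed' "$1" 2>/dev/null; then
        echo "passed"
    else
        echo "failed"
    fi
}

# Example call after running the self-test (assumed log location):
# selftest_status spades_test/spades.log
```

A missing log file is reported as "failed", which is usually what you want in an automated check.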
Running SPAdes
Now that you have installed and verified SPAdes, you are ready to use it for your own genome assembly projects. To run SPAdes, you need to provide some input data and some command line options for different assembly pipelines.
Providing input data
The input data for SPAdes are sequencing reads from one or more samples. SPAdes can handle various types of reads, such as:
Illumina paired-end (PE) or mate-pair (MP) reads.
IonTorrent PE or MP reads.
PacBio single-molecule real-time (SMRT) reads.
Oxford Nanopore MinION or GridION reads.
Sanger reads.
Mixed reads from different sources.
You need to specify the type and format of your input reads using different command line options. The most common options are:
Option                  Description
-1 <filename>           The file name with forward PE reads (in FASTQ or FASTA format).
-2 <filename>           The file name with reverse PE reads (in FASTQ or FASTA format).
--s1 <filename>         The file name with unpaired reads (in FASTQ or FASTA format).
--pacbio <filename>     The file name with PacBio SMRT reads (in FASTQ or FASTA format).
--nanopore <filename>   The file name with Oxford Nanopore reads (in FASTQ or FASTA format).
--sanger <filename>     The file name with Sanger reads (in FASTQ or FASTA format).
--pe1-12 <filename>     The file name with interlaced forward and reverse PE reads (in FASTQ or FASTA format).
--mp1-12 <filename>     The file name with interlaced forward and reverse MP reads (in FASTQ or FASTA format).
You can use multiple options to provide reads from different sources or libraries. For example, if you have PE reads from Illumina and SMRT reads from PacBio, you can use the following options:
-1 illumina_pe_1.fastq -2 illumina_pe_2.fastq --pacbio pacbio_smrt.fastq
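To see how these options fit into a complete command line, here is a small dry-run sketch. The helper spades_cmd is hypothetical: it only prints the command so that you can review it before running anything. The -o option sets the output folder, and the file names are the example names from this article. The sketch does not handle file names that contain spaces.

```shell
# Hypothetical helper: print a spades.py command for one paired-end library
# and an optional PacBio file. Nothing is executed.
# $1 = output folder, $2 = forward reads, $3 = reverse reads, $4 = PacBio reads (optional)
spades_cmd() {
    out=$1
    r1=$2
    r2=$3
    pacbio=${4:-}
    cmd="spades.py -1 $r1 -2 $r2"
    if [ -n "$pacbio" ]; then
        cmd="$cmd --pacbio $pacbio"
    fi
    printf '%s\n' "$cmd -o $out"
}

# Print the command for the hybrid example above.
spades_cmd my_project illumina_pe_1.fastq illumina_pe_2.fastq pacbio_smrt.fastq
```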
You can also use the --dataset <filename> option to provide a YAML file that describes your input data in more detail. For example, you can specify the library type, the read orientation, and the read files that belong to each library. You can find more information on how to create a YAML file on the SPAdes website: http://cab.spbu.ru/software/spades/#dataset.
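As an illustration, the sketch below writes a data set description for a single paired-end library. The field names (orientation, type, left reads, right reads) follow the layout of the SPAdes manual as I understand it, so please check them against the manual for your version before relying on this. The helper name write_dataset_yaml and the file paths are made up for the example.

```shell
# Hypothetical helper: write a YAML description of one paired-end library.
# $1 = output YAML file, $2 = forward reads, $3 = reverse reads
# Assumption: field names as described in the SPAdes manual.
write_dataset_yaml() {
    cat > "$1" <<EOF
[
  {
    orientation: "fr",
    type: "paired-end",
    left reads: ["$2"],
    right reads: ["$3"]
  }
]
EOF
}

# Example call (placeholder paths):
# write_dataset_yaml my_dataset.yaml /data/illumina_pe_1.fastq /data/illumina_pe_2.fastq
# spades.py --dataset my_dataset.yaml -o my_project
```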
Choosing command line options for different assembly pipelines
The next step is to choose the appropriate command line options for the assembly pipeline that suits your data and goal. SPAdes has several assembly pipelines for different types of data, such as:
--sc: Single-cell assembly pipeline for data obtained with multiple displacement amplification (MDA).
--meta: Metagenomic assembly pipeline for mixed microbial communities.
--plasmid: Plasmid assembly pipeline for plasmid detection and extraction.
--rna: Transcriptomic assembly pipeline for RNA-Seq data.
--isolate: Isolate assembly pipeline for bacterial or viral genomes from isolate samples.
--moleculo: Moleculo assembly pipeline for long synthetic reads from Moleculo technology.
--bio: Biosynthetic gene cluster assembly pipeline (biosyntheticSPAdes) for secondary metabolite gene clusters.
You can use one of these options to run the corresponding pipeline, or you can omit them to run the default pipeline, which is suitable for most cases. For example, if you want to assemble a bacterial genome from single-cell data, you can use the following option:
--sc
If you want to assemble a metagenomic sample from mixed reads, you can use the following option:
--meta
If you want to assemble a transcriptome from RNA-Seq data, you can use the following option:
--rna
In addition to these pipeline options, you can also use some other options to customize your assembly process, such as:
-k <value>: The k-mer size to use for assembly. You can specify a single value (e.g. -k 21) or a comma-separated list of values (e.g. -k 21,33,55). The default value is auto, which means that SPAdes will choose the optimal k-mer size based on your data.
-t <value>: The number of threads to use for assembly. The default value is 16.
-m <value>: The memory limit for assembly in GB. SPAdes stops if it reaches this limit. The default value is 250.
--careful: The option to run SPAdes in careful mode, which will reduce the number of mismatches and short indels in the resulting assembly.
--only-assembler: The option to run only the assembly module of SPAdes, without performing read error correction.
--continue: The option to resume a previously interrupted run of SPAdes from the last available checkpoint.
You can find more information on the available command line options on the SPAdes website: http://cab.spbu.ru/software/spades/#manual.
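When you set -k by hand, keep in mind that SPAdes expects odd k-mer sizes below 128, listed in ascending order. The sketch below checks a comma-separated list against these three rules before you start a long run. The helper name valid_kmers is hypothetical, and the check is deliberately simple: it does not handle values with leading zeros.

```shell
# Hypothetical helper: return 0 if a comma-separated k-mer list contains only
# odd numbers below 128 in ascending order, and 1 otherwise.
valid_kmers() {
    [ -n "${1:-}" ] || return 1
    old_ifs=$IFS
    IFS=,
    set -- $1                                # split the list on commas
    IFS=$old_ifs
    prev=0
    for k in "$@"; do
        case $k in
            ''|*[!0-9]*) return 1 ;;         # empty or not a plain number
        esac
        [ "$k" -lt 128 ] || return 1         # too large
        [ $((k % 2)) -eq 1 ] || return 1     # even value
        [ "$k" -gt "$prev" ] || return 1     # not in ascending order
        prev=$k
    done
    return 0
}

if valid_kmers 21,33,55; then
    echo "k-mer list looks fine"
fi
```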
Evaluating SPAdes output
After running SPAdes, you will get some output files and statistics in a folder named after your project. For example, if you run SPAdes with the following command:
spades.py -1 illumina_pe_1.fastq -2 illumina_pe_2.fastq --pacbio pacbio_smrt.fastq -o my_project
You will get a folder named my_project, which contains the following files and subfolders:
File or subfolder        Description
spades.log               The log file that records the progress and status of SPAdes.
params.txt               The file that contains the parameters and options used for SPAdes.
dataset.info             The file that contains the information about the input data.
corrected/               The subfolder that contains the error-corrected reads.
mismatch_corrector/      The subfolder that contains the mismatch-corrected contigs and scaffolds.
K21/ K33/ K55/ .../      The subfolders that contain the intermediate assemblies for each k-mer size.
scaffolds.fasta          The final assembly file that contains the scaffolds (sequences with gaps).
contigs.fasta            The final assembly file that contains the contigs (sequences without gaps).
assembly_graph.fastg     The final assembly graph file in FASTG format.
scaffolds.paths          The file that contains the paths of contigs in scaffolds.
contigs.paths            The file that contains the paths of edges in contigs.
spades.yaml              The file that contains the summary statistics and quality metrics of the final assembly.
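Before you look at the results in detail, it is worth checking that the main files are actually there. The sketch below lists which of three key files are missing from an output folder and counts the sequences in a FASTA file. Both helper names are hypothetical, and the list of files to check is my own selection from the table above.

```shell
# Hypothetical helper: print the names of key output files that are missing.
# $1 = SPAdes output folder
missing_outputs() {
    for f in contigs.fasta scaffolds.fasta spades.log; do
        [ -e "$1/$f" ] || printf '%s\n' "$f"
    done
}

# Hypothetical helper: count the sequences in a FASTA file
# by counting the header lines that start with ">".
count_contigs() {
    grep -c '^>' "$1"
}

# Example calls for the run above:
# missing_outputs my_project
# count_contigs my_project/contigs.fasta
```

If missing_outputs prints nothing, all three files exist.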
To evaluate the quality and accuracy of your assembly, you can look at some of these output files and statistics. For example, you can check the following metrics:
The number and length of scaffolds and contigs. You can use tools like QUAST or MetaQUAST to generate a comprehensive report on these metrics.
The N50 and NG50 values of scaffolds and contigs. These are measures of the contiguity of your assembly. N50 is the length of the shortest sequence among the longest sequences that together cover at least half of the total assembly length; NG50 is computed the same way, but against the genome size instead of the assembly length. Higher values generally indicate a more contiguous assembly, but they do not tell you whether the assembly is correct.
A: You can cite SPAdes using the following reference: Bankevich A, Nurk S, Antipov D, Gurevich AA, Dvorkin M, Kulikov AS, Lesin VM, Nikolenko SI, Pham S, Prjibelski AD, Pyshkin AV, Sirotkin AV, Vyahhi N, Tesler G, Alekseyev MA, Pevzner PA. SPAdes: A New Genome Assembly Algorithm and Its Applications to Single-Cell Sequencing. Journal of Computational Biology. 2012 May;19(5):455-77. doi: 10.1089/cmb.2012.0021. You can also use the BibTeX format:
@article{bankevich2012spades,
  title={SPAdes: A New Genome Assembly Algorithm and Its Applications to Single-Cell Sequencing},
  author={Bankevich, Anton and Nurk, Sergey and Antipov, Dmitry and Gurevich, Alexey A and Dvorkin, Mikhail and Kulikov, Alexander S and Lesin, Vladislav M and Nikolenko, Sergey I and Pham, Son and Prjibelski, Andrey D and Pyshkin, Alexey V and Sirotkin, Alexander V and Vyahhi, Nikolay and Tesler, Glenn and Alekseyev, Max A and Pevzner, Pavel A},
  journal={Journal of Computational Biology},
  volume={19},
  number={5},
  pages={455--477},
  year={2012},
  publisher={Mary Ann Liebert Inc}
}
Q: How do I