Spats

Read Processing and Reactivity Generation for SHAPE-Seq

Spats processes reads and calculates SHAPE reactivities for SHAPE-Seq experiments on multiple RNAs. It accepts raw paired-end sequencing reads in FASTQ format, and a target sequence file containing the sequences of RNAs present in the experimental pool. Spats then performs read alignment to calculate distributions of read ends in the SHAPE (+) and (-) channel for each nucleotide in each RNA. Spats then estimates nucleotide resolution SHAPE reactivities for each RNA, using a model-driven maximum likelihood procedure based on a model of the reverse transcriptase process used in the SHAPE-Seq experiment. Spats is a collaborative effort between the Aviran Lab at UC Davis, the Pachter Lab at UC Berkeley, the Trapnell Lab at the University of Washington, and the Lucks Lab at Northwestern University.

Spats is provided under the OSI-approved Boost License. fastx_spats_toolkit_v_0.1.0 is provided under the AGPLv3 License.

Site Map

Releases

spats version 1.0.2	5/30/2017
Source code
spats version 1.0.1	9/10/2016
Source code
spats version 1.0.0	3/21/2016
Source code
spats version 0.8.0	9/4/2014
Source code
fastx_spats_toolkit version 0.1.0	9/4/2014
Source code
fastx_spats_toolkit version 0.1.0 - binary distribution for 64-bit GNU/Linux 2.6.18	9/4/2014
Source code
spats version 0.7.0	4/10/2012
Source code

Publications

Aviran S, Trapnell C, Lucks JB, Mortimer SA, Luo S, Schroth GP, Doudna JA, Arkin AP, Pachter L. Modeling and automation of sequencing-based characterization of RNA structure
PNAS 10.1073/pnas.1106541108
Aviran S, Lucks JB, Pachter L. RNA structure characterization from chemical mapping experiments
Proceedings of the 49th Allerton Conference on Communication, Control, and Computing, 1743-1750, 2011
Lucks JB, Mortimer SA, Trapnell C, Luo S, Aviran S, Schroth GP, Pachter L, Doudna JA, Arkin AP. Multiplexed RNA structure characterization with selective 2'-hydroxyl acylation analyzed by primer extension sequencing (SHAPE-Seq)
PNAS 10.1073/pnas.1106501108
Mortimer SA, Trapnell C, Aviran S, Pachter L and Lucks JB. SHAPE-Seq High Throughput RNA Structure Analysis
Current Protocols in Chemical Biology10.1002/9780470559277.ch120019
Loughrey D, Watters KE, Settle AH, and Lucks JB. SHAPE-Seq 2.0: Systematic Optimization and Extension of High-Throughput Chemical Probing of RNA Structure with Next Generation Sequencing
Nucleic Acids Research 10.1093/nar/gku909
Aviran S and Pachter L. Rational experiment design for sequencing-based RNA structure mapping
RNA 10.1261/rna.043844.113
Watters KE, Abbott TR, and Lucks JB. Simultaneous characterization of cellular RNA structure and function with in-cell SHAPE-Seq
Nucleic Acids Research 10.1093/nar/gkv879
Watters KE, Yu AM, Strobel EJ, Settle AH, Lucks JB. Characterizing RNA structures in vitro and in vivo with selective 2'-hydroxyl acylation analyzed by primer extension sequencing (SHAPE-Seq)
Methods 10.1016/j.ymeth.2016.04.002
Watters KE*, Strobel EJ*, Yu AM, Lis JT, Lucks JB. Cotranscriptional folding of a riboswitch at nucleotide resolution
Nature Structure and Molecular Biology 10.1038/nsmb.3316

Other Contributors

Stefanie Mortimer
Shujun Luo
Gary Schroth
Adam Arkin
Jennifer Doudna

Manual

Prerequisites
Adapter Trimming
Running Spats

Example
Detailed Usage
Input Files
Output Files

Splitting reactivities.out

Prerequisites

Spats runs on Intel-based computers running Linux or Mac OS X and that have GCC 4.0 or greater installed. Currently Spats must be built from the source code, though we recognize the need for an easier installation system. Detailed instructions on building from the source are described in the installation manual.

Adapter Trimming

Sometimes it may be necessary to trim adapter sequences from sequencing reads prior to alignment and processing. For example, when SHAPE cDNAs are generated that are shorter than the sequencing read length used in the experiment, sequences from the required library adapters may show up in the read. Without trimming, these 'hybrid' adapter-cDNA sequences will not align with RNA sequences included in the spats rna_targets.fa index file (see below). This will result in the absence of reactivity information at the very 3' end of molecules.

We highly recommend using the adapter trimming algorithm available via the spats v1.0.0 and newer releases.

Running Spats

The Spats pipeline is run in two stages: adapter trimming and spats analysis. There is an optional third stage to split the reactivities.out output file into many files. Adapter trimming is implemented in the adapter_trimmer code, while spats analysis is implemented in the spats toolkit. A typical workflow is as follows:

adapter_trimmer --A-b-sequence=<second_adapter_sequence> --A-t-sequence=<first_adapter_sequence> --read-len=35 R1.fastq R2.fastq RNA_targets.fa

spats --num-mismatches 0 -o Output RNA_targets.fa RRRY YYYR combined_R1.fastq combined_R2.fastq

where <second_adapter_sequence> is the sequence of the second adapter (present at the 3' end of cDNAs), <first_adapter_sequence> is the sequence of the first adapter (present at the 5' end of cDNAs), R1.fastq and R2.fastq are the FASTQ sequencing files from the sequencing run, and RNA_targets.fa is the FASTA formatted file containing the RNA sequences under study. Spats outputs text files (in Output) containing (+) and (-) fragment counts, beta's and theta’s, for each position in each RNA in the targets file as discussed below. Note that adapter_trimmer creates the files combined_R1.fastq and combined_R2.fastq which are then fed into the spats command.

Example

Given a paired end run (2x35bp) with target sequences (in DNA alphabet) in the file target_bcs.fa, "_1" reads in HCT20611_50bp_s_7_1_sequence.fq, and "_2" reads in HCT20611_50bp_s_7_2_sequence.fq, you can run the mapping pipeline like this:

adapter_trimmer --A-b-sequence=AGATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCGATCTCGTATGCCGTCTTCTGCTTG --A-t-sequence=AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATC --read-len=35 HCT20611_50bp_s_7_1_sequence.fq HCT20611_50bp_s_7_2_sequence.fq target_bcs.fa

spats --num-mismatches 0 -o HCT20611_50bp.final RRRY YYYR combined_R1.fastq combined_R2.fastq

Detailed Usage

adapter_trimmer

Usage: adapter_trimmer [options]* <R1_seq.fastq> <R2_seq.fastq> <targets.fa>

Arguments:

`<R1_seq.fastq>`	A fastq file containing "_1" reads from a paired-end Illumina run.
`<R2_seq.fastq>`	A fastq file containing "_2" reads from a paired-end Illumina run.
`<targets.fa>`	A FASTA file containing the target RNA sequences in the SHAPE experiment (excluding reactivty handles, Spats will add these automatically).
Options:
`--trim-match <N>`	Number of nucleotides of adapter to search for with clipper (default = 6).
`--read-len <N>`	Number of length of reads (ex. 35 for 2x35 bp paired end reads) (default = 35). (autodetected)
`--min-read-len <N>`	Number of nt on the 3' end to leave after (w/ adapter) (default = 6).
`-e, --error-rate <N>`	Allowed error rate when clipping adapters (0.1 is 1/10 bases, etc.)
`-q, --quality-min <N>`	Minimum quality score to consider.
`-p, --num-threads <N>`	Number of processors to use (default = 1).
`--allow-mismatches <N>`	Allow downstream mismatches for Spats (keep sequences that don't perfectly align).
`--set-quality <N>`	Sets all read qualities to the given quality where ascii is (N+33).
`--A-b-sequence <string>`	The sequence of A_adapter_b (which is ligated to the single-stranded cDNA products of the RT reaction). (default: AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC used for Illumina TrueSeq indexed reads).
`--A-t-sequence <string>`	The sequence of A_adapter_t (which contains the RT primer). (default: AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT).
`--max-handle-len <N>`	Number of nucleotides in the longest (+/-) handle (default = 4 for RRRY/YYYR).
`-o/--output-dir <string>`	If running many simultaneously runs in the same folder, use this option to name the treated reads folder to avoid overwriting during processing.
Output:

`combined_R1.fastq, combined_R2.fastq`	Fastq files containing "_1" and "_2" reads (respectively) for which all adpater sequences have been removed.

spats

Usage: spats [options]* <rna_targets.fa> <treated_handle> <untreated_handle> <reads1_1[,...,readsN_1]> <reads1_1[,...,readsN_1]>

The following is a detailed description of the options used to control Spats:

Arguments:
`<rna_targets.fa>`	A FASTA file containing the target RNA sequences in the SHAPE experiment (excluding reactivty handles (i.e. RRRY/YYYR), Spats will add these automatically).
`<treated_handle>`	A nucleotide sequence with which all targets in the treated sample are tagged. You can use IUPAC codes here, e.g. `RRRY`.
`<untreated_handle>`	A nucleotide sequence with which all targets in the untreated sample are tagged. You can use IUPAC codes here, e.g. `YYYR`.
`<reads1_1[,...,readsN_1]>`	A comma separated list of FASTQ files. These must all be "_1" reads from a paired-end Illumina run (e.g. s_1_1_sequence.txt,s_2_1_sequence.txt)
`<reads1_2[,...,readsN_2]>`	A comma separated list of FASTQ files. These must all be "_2" reads from a paired-end Illumina run (e.g. s_1_2_sequence.txt,s_2_2_sequence.txt). NOTE: There must be one file here for each file in the _1 list, and the files must appear in the same order.
Options:
`-o/--output-dir <output_dir_name>`	The name of the directory in which you'd like Spats to write its output. Default `spats_out`.
`--adapter-t <adapter sequence>`	Deprecatated in v0.8.0 and later. It is preferrable to handle adapter trimming via the `adapter_trimmer` script outlined above. The sequence of adapter_t (which contains the RT primer). This option is used when adapter trimming is desired. Example `AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATC`
`--adapter-b <adapter sequence>`	Deprecatated in v0.8.0 and later. It is preferrable to handle adapter trimming via the `adapter_trimmer` script outlined above. The sequence of adapter_b (which is ligated to the single-stranded cDNA products of the RT reaction). This option is used when adapter trimming is desired. Example `AGATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCGATCTCGTATGCCGTCTTCTGCTTG`
`--num-mismatches <0-3>`	The number of mismatches you want Bowtie to allow in its alignments for each read to the target RNAs.Note: this option does not work in v0.8.0 and later because of the algorithm implemented in `adapter_trimmer`.
`--m <int>`	This option adds stringency to alignments, where <int> is the max number of aligenments allowed for each read. Reads with more than <int> number of valid alignments are not used in the reactivities calculation. Default is not stringent.
`-p/--num-threads <threads>`	The number of threads you want Bowtie to use when aligning the reads to the target RNAs
`--all-RT-starts`	If present prints out reactivties for each RT site detected. Default not used.
Output:

`see below`

Spats Pipeline Input

The spats pipeline (adapter_trimmer and spats) takes three files as input: the target RNA file in FASTA format, and two files containing the paired-end sequencing reads from the experiment in FASTQ format. The sequecing reads for both the plus and minus channels of the SHAPE experiment must be contained in the files. Please refer to the references on the right hand side for more information on how these handles are included in library preparation and sequencing. It is critical that the mated reads appear in the same order in each file. The target RNA file is in FASTA format, and contains one entry for each RNA that is included in the experiment, e.g.

		>target_GTAC
		GGTCGTGCCTAGCGAAGTCATAAGCTAGGGCAGTCTTTAGAGGCTGACGGCAGGAAAAAAG
		CCTACGTCTTCGGATATGGCTGAGTATCCTTGAAAGTGCCACAGTGACGAAGTCTCACTAG
		AAATGGTGAGAGTGGAACGCGGTAAACCCCTCGACCGATCCGCTTCGGCGGATCCGTACAA
		ATCGGGCTTCGGTCCGGTTC
		>target_TTAA
		GGTCGTGCCTAGCGAAGTCATAAGCTAGGGCAGTCTTTAGAGGCTGACGGCAGGAAAAAAG
		CCTACGTCTTCGGATATGGCTGAGTATCCTTGAAAGTGCCACAGTGACGAAGTCTCACTAG
		AAATGGTGAGAGTGGAACGCGGTAAACCCCTCGACCGATCCGCTTCGGCGGATCCTTAAAA
		ATCGGGCTTCGGTCCGGTTC
		...

In this case, two identical RNA sequences were barcoded with two sequences (GTAC, TTAA) and run in the same experiment.

Spats Pipeline Output

The spats pipeline produces a directory (by default called spats_out) containing several output files.

De-multiplexed reads

Spats splits the reads into the plus and minus channels according to the handle sequences you've specified. The output directory will contain FASTQ files named <treated_handle>_1.fq, <treated_handle>_2.fq, <untreated_handle>_1.fq and <untreated_handle>_2.fq

Reactivities

The reactivities.out file is a tab-delimited file of values providing the reactivity at each base of each target RNAs. The example below shows the first few rows of a run against wild-type RNaseP (barcoded with CTTG) and tRNA (barcoded with AAAG>). Data for different targets are concatenated within the same output file.

		sequence        rt_start  five_prime_offset  nucleotide  treated_mods  untreated_mods  beta           theta                c
		RNASeP_WT_CTTG     202            0               *          1144           4201         -              -               1.22577
		RNASeP_WT_CTTG     202            1               G          28             110          0              0               1.22577
		RNASeP_WT_CTTG     202            2               G          87             315          0.000908854    0.000883849     1.22577
		RNASeP_WT_CTTG     202            3               T          10             6            0.00553418     0.00539681      1.22577
		RNASeP_WT_CTTG     202            4               C          7              23           0.000459691    0.000446923     1.22577
		RNASeP_WT_CTTG     202            5               G          8              21           0.00146664     0.00142677      1.22577	
		...
		tRNA_AAAG          136            0               *          1267           30440        -              -               2.53892
		tRNA_AAAG          136            1               G          118            1145         0.0213467      0.0205298       2.53892
		tRNA_AAAG          136            2               G          28             123          0.00672447     0.00635241      2.53892
		tRNA_AAAG          136            3               C          17             51           0.00432867     0.0040774       2.53892
		tRNA_AAAG          136            4               C          84             319          0.0193321      0.0185458       2.53892
		tRNA_AAAG          136            5               T          95             529          0.0182941      0.0175276       2.53892
		...

`sequence`	Name of target RNA (from the input sequences FASTA file).
`rt_start`	The base index where RT priming started. For SHAPE-Seq v2.0 this is the end of the RNA. For RT primers that target internal sequences this is internal to the RNA sequence.
`five_prime_offset`	Position in target RNA, 0=full length
`nucleotide`	Base at position in target, *=full length
`treated_mods`	Number of (+) fragments ending at position in target
`untreated_mods`	Number of (-) fragments ending at position in target
`beta`	Beta value at position in target
`theta`	Theta value at position in target
`c`	Estimated SHAPE modifications per target

See the references, and especially Aviran, Lucks, Pachter: RNA structure characterization from chemical mapping experiments, for definitions of beta, theta and c.

Splitting reactivities.out

reactivities_split takes a Spats reactivities output file (reactivities.out) and divides it into individual reactivities files for each RNA, containing all of the RT start positions (unsplit). These reactivities files are then split into smaller files that contain only one RT start position each.

reactivities_split

Usage: reactivities_split [options] <reactivities_file.out>;

Arguments:

`<reactivities_file.out>`	Spats reactivities output file containing reactivities information for all targets.
Options:
`-h`	Prints this help menus
`-v`	Prints the version number
`-q,--quiet`	Suppresses output of reactivity filenames being generated (useful for a lot of RNA targets)
`-g,--group`	Groups all split reactivities files into one directory, regardless of RT start position (useful for random priming)
`-l,--long-only`	Only generates reactivities files for the longest RT start position of any target (most 3'), takes precedence over -g or --group