Spats

Read Processing and Reactivity Generation for SHAPE-Seq

   
Spats processes reads and calculates SHAPE reactivities for SHAPE-Seq experiments on multiple RNAs. It accepts raw paired-end sequencing reads in FASTQ format, and a target sequence file containing the sequences of RNAs present in the experimental pool. Spats then performs read alignment to calculate distributions of read ends in the SHAPE (+) and (-) channel for each nucleotide in each RNA. Spats then estimates nucleotide resolution SHAPE reactivities for each RNA, using a model-driven maximum likelihood procedure based on a model of the reverse transcriptase process used in the SHAPE-Seq experiment. Spats is a collaborative effort between the Aviran Lab at UC Davis, the Pachter Lab at UC Berkeley, the Trapnell Lab at the University of Washington, and the Lucks Lab at Northwestern University.

Spats is provided under the OSI-approved Boost License. fastx_spats_toolkit_v_0.1.0 is provided under the AGPLv3 License.
Open Source
                    Software

Site Map

Releases

Publications

Other Contributors

Links

Manual


Prerequisites


Spats runs on Intel-based computers running Linux or Mac OS X and that have GCC 4.0 or greater installed. Currently Spats must be built from the source code, though we recognize the need for an easier installation system. Detailed instructions on building from the source are described in the installation manual.



Adapter Trimming


Sometimes it may be necessary to trim adapter sequences from sequencing reads prior to alignment and processing. For example, when SHAPE cDNAs are generated that are shorter than the sequencing read length used in the experiment, sequences from the required library adapters may show up in the read. Without trimming, these 'hybrid' adapter-cDNA sequences will not align with RNA sequences included in the spats rna_targets.fa index file (see below). This will result in the absence of reactivity information at the very 3' end of molecules.


We highly recommend using the adapter trimming algorithm available via the spats v1.0.0 and newer releases.


Running Spats


The Spats pipeline is run in two stages: adapter trimming and spats analysis. There is an optional third stage to split the reactivities.out output file into many files. Adapter trimming is implemented in the adapter_trimmer code, while spats analysis is implemented in the spats toolkit. A typical workflow is as follows:


adapter_trimmer --A-b-sequence=<second_adapter_sequence> --A-t-sequence=<first_adapter_sequence> --read-len=35 R1.fastq R2.fastq RNA_targets.fa
spats --num-mismatches 0 -o Output RNA_targets.fa RRRY YYYR combined_R1.fastq combined_R2.fastq


where <second_adapter_sequence> is the sequence of the second adapter (present at the 3' end of cDNAs), <first_adapter_sequence> is the sequence of the first adapter (present at the 5' end of cDNAs), R1.fastq and R2.fastq are the FASTQ sequencing files from the sequencing run, and RNA_targets.fa is the FASTA formatted file containing the RNA sequences under study. Spats outputs text files (in Output) containing (+) and (-) fragment counts, beta's and theta’s, for each position in each RNA in the targets file as discussed below. Note that adapter_trimmer creates the files combined_R1.fastq and combined_R2.fastq which are then fed into the spats command.


Example


Given a paired end run (2x35bp) with target sequences (in DNA alphabet) in the file target_bcs.fa, "_1" reads in HCT20611_50bp_s_7_1_sequence.fq, and "_2" reads in HCT20611_50bp_s_7_2_sequence.fq, you can run the mapping pipeline like this:


adapter_trimmer --A-b-sequence=AGATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCGATCTCGTATGCCGTCTTCTGCTTG --A-t-sequence=AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATC --read-len=35 HCT20611_50bp_s_7_1_sequence.fq HCT20611_50bp_s_7_2_sequence.fq target_bcs.fa
spats --num-mismatches 0 -o HCT20611_50bp.final RRRY YYYR combined_R1.fastq combined_R2.fastq


Detailed Usage


adapter_trimmer


Usage: adapter_trimmer [options]* <R1_seq.fastq> <R2_seq.fastq> <targets.fa>
Arguments:
<R1_seq.fastq> A fastq file containing "_1" reads from a paired-end Illumina run.
<R2_seq.fastq> A fastq file containing "_2" reads from a paired-end Illumina run.
<targets.fa> A FASTA file containing the target RNA sequences in the SHAPE experiment (excluding reactivty handles, Spats will add these automatically).
Options:
--trim-match <N> Number of nucleotides of adapter to search for with clipper (default = 6).
--read-len <N> Number of length of reads (ex. 35 for 2x35 bp paired end reads) (default = 35). (autodetected)
--min-read-len <N> Number of nt on the 3' end to leave after (w/ adapter) (default = 6).
-e, --error-rate <N> Allowed error rate when clipping adapters (0.1 is 1/10 bases, etc.)
-q, --quality-min <N> Minimum quality score to consider.
-p, --num-threads <N> Number of processors to use (default = 1).
--allow-mismatches <N> Allow downstream mismatches for Spats (keep sequences that don't perfectly align).
--set-quality <N> Sets all read qualities to the given quality where ascii is (N+33).
--A-b-sequence <string> The sequence of A_adapter_b (which is ligated to the single-stranded cDNA products of the RT reaction). (default: AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC used for Illumina TrueSeq indexed reads).
--A-t-sequence <string> The sequence of A_adapter_t (which contains the RT primer). (default: AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT).
--max-handle-len <N> Number of nucleotides in the longest (+/-) handle (default = 4 for RRRY/YYYR).
-o/--output-dir <string> If running many simultaneously runs in the same folder, use this option to name the treated reads folder to avoid overwriting during processing.
Output:
combined_R1.fastq, combined_R2.fastq Fastq files containing "_1" and "_2" reads (respectively) for which all adpater sequences have been removed.

spats


Usage: spats [options]* <rna_targets.fa> <treated_handle> <untreated_handle> <reads1_1[,...,readsN_1]> <reads1_1[,...,readsN_1]>

The following is a detailed description of the options used to control Spats:


Arguments:
<rna_targets.fa> A FASTA file containing the target RNA sequences in the SHAPE experiment (excluding reactivty handles (i.e. RRRY/YYYR), Spats will add these automatically).
<treated_handle> A nucleotide sequence with which all targets in the treated sample are tagged. You can use IUPAC codes here, e.g. RRRY.
<untreated_handle> A nucleotide sequence with which all targets in the untreated sample are tagged. You can use IUPAC codes here, e.g. YYYR.
<reads1_1[,...,readsN_1]> A comma separated list of FASTQ files. These must all be "_1" reads from a paired-end Illumina run (e.g. s_1_1_sequence.txt,s_2_1_sequence.txt)
<reads1_2[,...,readsN_2]> A comma separated list of FASTQ files. These must all be "_2" reads from a paired-end Illumina run (e.g. s_1_2_sequence.txt,s_2_2_sequence.txt). NOTE: There must be one file here for each file in the _1 list, and the files must appear in the same order.
Options:
-o/--output-dir <output_dir_name> The name of the directory in which you'd like Spats to write its output. Default spats_out.
--adapter-t <adapter sequence> Deprecatated in v0.8.0 and later. It is preferrable to handle adapter trimming via the adapter_trimmer script outlined above. The sequence of adapter_t (which contains the RT primer). This option is used when adapter trimming is desired. Example AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATC
--adapter-b <adapter sequence> Deprecatated in v0.8.0 and later. It is preferrable to handle adapter trimming via the adapter_trimmer script outlined above. The sequence of adapter_b (which is ligated to the single-stranded cDNA products of the RT reaction). This option is used when adapter trimming is desired. Example AGATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCGATCTCGTATGCCGTCTTCTGCTTG
--num-mismatches <0-3> The number of mismatches you want Bowtie to allow in its alignments for each read to the target RNAs.Note: this option does not work in v0.8.0 and later because of the algorithm implemented in adapter_trimmer.
--m <int> This option adds stringency to alignments, where <int> is the max number of aligenments allowed for each read. Reads with more than <int> number of valid alignments are not used in the reactivities calculation. Default is not stringent.
-p/--num-threads <threads> The number of threads you want Bowtie to use when aligning the reads to the target RNAs
--all-RT-starts If present prints out reactivties for each RT site detected. Default not used.
Output:
see below

Spats Pipeline Input


The spats pipeline (adapter_trimmer and spats) takes three files as input: the target RNA file in FASTA format, and two files containing the paired-end sequencing reads from the experiment in FASTQ format. The sequecing reads for both the plus and minus channels of the SHAPE experiment must be contained in the files. Please refer to the references on the right hand side for more information on how these handles are included in library preparation and sequencing. It is critical that the mated reads appear in the same order in each file. The target RNA file is in FASTA format, and contains one entry for each RNA that is included in the experiment, e.g.


		>target_GTAC
		GGTCGTGCCTAGCGAAGTCATAAGCTAGGGCAGTCTTTAGAGGCTGACGGCAGGAAAAAAG
		CCTACGTCTTCGGATATGGCTGAGTATCCTTGAAAGTGCCACAGTGACGAAGTCTCACTAG
		AAATGGTGAGAGTGGAACGCGGTAAACCCCTCGACCGATCCGCTTCGGCGGATCCGTACAA
		ATCGGGCTTCGGTCCGGTTC
		>target_TTAA
		GGTCGTGCCTAGCGAAGTCATAAGCTAGGGCAGTCTTTAGAGGCTGACGGCAGGAAAAAAG
		CCTACGTCTTCGGATATGGCTGAGTATCCTTGAAAGTGCCACAGTGACGAAGTCTCACTAG
		AAATGGTGAGAGTGGAACGCGGTAAACCCCTCGACCGATCCGCTTCGGCGGATCCTTAAAA
		ATCGGGCTTCGGTCCGGTTC
		...
		
In this case, two identical RNA sequences were barcoded with two sequences (GTAC, TTAA) and run in the same experiment.

Spats Pipeline Output


The spats pipeline produces a directory (by default called spats_out) containing several output files.


De-multiplexed reads

Spats splits the reads into the plus and minus channels according to the handle sequences you've specified. The output directory will contain FASTQ files named <treated_handle>_1.fq, <treated_handle>_2.fq, <untreated_handle>_1.fq and <untreated_handle>_2.fq


Reactivities

The reactivities.out file is a tab-delimited file of values providing the reactivity at each base of each target RNAs. The example below shows the first few rows of a run against wild-type RNaseP (barcoded with CTTG) and tRNA (barcoded with AAAG>). Data for different targets are concatenated within the same output file.


		sequence        rt_start  five_prime_offset  nucleotide  treated_mods  untreated_mods  beta           theta                c
		RNASeP_WT_CTTG     202            0               *          1144           4201         -              -               1.22577
		RNASeP_WT_CTTG     202            1               G          28             110          0              0               1.22577
		RNASeP_WT_CTTG     202            2               G          87             315          0.000908854    0.000883849     1.22577
		RNASeP_WT_CTTG     202            3               T          10             6            0.00553418     0.00539681      1.22577
		RNASeP_WT_CTTG     202            4               C          7              23           0.000459691    0.000446923     1.22577
		RNASeP_WT_CTTG     202            5               G          8              21           0.00146664     0.00142677      1.22577	
		...
		tRNA_AAAG          136            0               *          1267           30440        -              -               2.53892
		tRNA_AAAG          136            1               G          118            1145         0.0213467      0.0205298       2.53892
		tRNA_AAAG          136            2               G          28             123          0.00672447     0.00635241      2.53892
		tRNA_AAAG          136            3               C          17             51           0.00432867     0.0040774       2.53892
		tRNA_AAAG          136            4               C          84             319          0.0193321      0.0185458       2.53892
		tRNA_AAAG          136            5               T          95             529          0.0182941      0.0175276       2.53892
		...
	
sequence Name of target RNA (from the input sequences FASTA file).
rt_start The base index where RT priming started. For SHAPE-Seq v2.0 this is the end of the RNA. For RT primers that target internal sequences this is internal to the RNA sequence.
five_prime_offset Position in target RNA, 0=full length
nucleotide Base at position in target, *=full length
treated_mods Number of (+) fragments ending at position in target
untreated_mods Number of (-) fragments ending at position in target
beta Beta value at position in target
thetaTheta value at position in target
c Estimated SHAPE modifications per target
See the references, and especially Aviran, Lucks, Pachter: RNA structure characterization from chemical mapping experiments, for definitions of beta, theta and c.


Splitting reactivities.out


reactivities_split takes a Spats reactivities output file (reactivities.out) and divides it into individual reactivities files for each RNA, containing all of the RT start positions (unsplit). These reactivities files are then split into smaller files that contain only one RT start position each.


reactivities_split


Usage: reactivities_split [options] <reactivities_file.out>;
Arguments:
<reactivities_file.out> Spats reactivities output file containing reactivities information for all targets.
Options:
-h Prints this help menus
-v Prints the version number
-q,--quiet Suppresses output of reactivity filenames being generated (useful for a lot of RNA targets)
-g,--group Groups all split reactivities files into one directory, regardless of RT start position (useful for random priming)
-l,--long-only Only generates reactivities files for the longest RT start position of any target (most 3'), takes precedence over -g or --group