Manual
Prerequisites
Spats runs on Intel-based computers running Linux or Mac OS X and
that have GCC 4.0 or greater installed. Currently Spats must be built from the source code, though we recognize the need
for an easier installation system. Detailed instructions on building from the source are described in the installation manual.
Adapter Trimming
Sometimes it may be necessary to trim adapter sequences from sequencing reads prior to alignment and
processing. For example, when SHAPE cDNAs are generated that are shorter than the sequencing read length
used in the experiment, sequences from the required library adapters may show up in the read. Without
trimming, these 'hybrid' adapter-cDNA sequences will not align with RNA sequences included in the spats
rna_targets.fa index file (see below). This will result in the absence of reactivity information at the very
3' end of molecules.
We highly recommend using the adapter trimming algorithm available via the spats v1.0.0 and newer releases.
Running Spats
The Spats pipeline is run in two stages: adapter trimming and spats analysis. There is an optional third stage to split the reactivities.out output file into many files. Adapter trimming is implemented in the
adapter_trimmer code, while spats analysis is implemented in the spats toolkit. A typical workflow is as follows:
adapter_trimmer --A-b-sequence=<second_adapter_sequence> --A-t-sequence=<first_adapter_sequence> --read-len=35 R1.fastq R2.fastq RNA_targets.fa
spats --num-mismatches 0 -o Output RNA_targets.fa RRRY YYYR combined_R1.fastq combined_R2.fastq
where <second_adapter_sequence> is the sequence of the second adapter (present at the 3' end of cDNAs),
<first_adapter_sequence> is the
sequence of the first adapter (present at the 5' end of cDNAs), R1.fastq and R2.fastq are the FASTQ sequencing files from the sequencing
run, and RNA_targets.fa is the FASTA formatted file containing the RNA sequences under study.
Spats outputs text files (in Output) containing (+) and (-) fragment counts, beta's and theta’s, for each position in each RNA in the targets file
as discussed below. Note that adapter_trimmer creates the files combined_R1.fastq and combined_R2.fastq which are then fed into
the spats command.
Example
Given a paired end run (2x35bp) with target sequences (in DNA alphabet) in the file
target_bcs.fa, "_1" reads in HCT20611_50bp_s_7_1_sequence.fq, and
"_2" reads in HCT20611_50bp_s_7_2_sequence.fq, you can run the mapping pipeline like this:
adapter_trimmer --A-b-sequence=AGATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCGATCTCGTATGCCGTCTTCTGCTTG
--A-t-sequence=AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATC
--read-len=35 HCT20611_50bp_s_7_1_sequence.fq HCT20611_50bp_s_7_2_sequence.fq target_bcs.fa
spats --num-mismatches 0 -o HCT20611_50bp.final RRRY YYYR combined_R1.fastq combined_R2.fastq
Detailed Usage
adapter_trimmer
Usage: adapter_trimmer [options]* <R1_seq.fastq> <R2_seq.fastq> <targets.fa>
Arguments:
|
|
|
<R1_seq.fastq>
|
A fastq file containing "_1" reads from a paired-end Illumina run.
|
<R2_seq.fastq>
|
A fastq file containing "_2" reads from a paired-end Illumina run.
|
<targets.fa>
|
A FASTA file containing the target RNA sequences in the SHAPE experiment
(excluding reactivty handles, Spats will add these automatically).
|
Options:
|
|
--trim-match <N>
|
Number of nucleotides of adapter to search for with clipper (default = 6).
|
--read-len <N>
|
Number of length of reads (ex. 35 for 2x35 bp paired end reads) (default = 35). (autodetected)
|
--min-read-len <N>
|
Number of nt on the 3' end to leave after (w/ adapter) (default = 6).
|
-e, --error-rate <N>
|
Allowed error rate when clipping adapters (0.1 is 1/10 bases, etc.)
|
-q, --quality-min <N>
|
Minimum quality score to consider.
|
-p, --num-threads <N>
|
Number of processors to use (default = 1).
|
--allow-mismatches <N>
|
Allow downstream mismatches for Spats (keep sequences that don't perfectly align).
|
--set-quality <N>
|
Sets all read qualities to the given quality where ascii is (N+33).
|
--A-b-sequence <string>
|
The sequence of A_adapter_b (which is ligated to the single-stranded cDNA products of the RT reaction). (default:
AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC used for Illumina TrueSeq indexed reads).
|
--A-t-sequence <string>
|
The sequence of A_adapter_t (which contains the RT primer). (default: AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT).
|
--max-handle-len <N>
|
Number of nucleotides in the longest (+/-) handle (default = 4 for RRRY/YYYR).
|
-o/--output-dir <string>
|
If running many simultaneously runs in the same folder, use this option to name the treated reads folder to avoid overwriting during processing.
|
Output:
|
|
|
combined_R1.fastq, combined_R2.fastq
|
Fastq files containing "_1" and "_2" reads (respectively) for which all adpater sequences have been removed.
|
spats
Usage: spats [options]* <rna_targets.fa> <treated_handle> <untreated_handle> <reads1_1[,...,readsN_1]> <reads1_1[,...,readsN_1]>
The following is a detailed description of the options used to control
Spats:
Arguments:
|
|
<rna_targets.fa>
|
A FASTA file containing the target RNA sequences in the SHAPE experiment
(excluding reactivty handles (i.e. RRRY/YYYR), Spats will add these automatically).
|
<treated_handle>
|
A nucleotide sequence with which all targets in the treated sample are tagged. You can use IUPAC codes here, e.g. RRRY.
|
<untreated_handle>
|
A nucleotide sequence with which all targets in the untreated sample are tagged. You can use IUPAC codes here, e.g. YYYR.
|
<reads1_1[,...,readsN_1]>
|
A comma separated list of FASTQ files. These must all be "_1" reads from a paired-end Illumina run (e.g. s_1_1_sequence.txt,s_2_1_sequence.txt)
|
<reads1_2[,...,readsN_2]>
|
A comma separated list of FASTQ files. These must all be "_2" reads
from a paired-end Illumina run (e.g. s_1_2_sequence.txt,s_2_2_sequence.txt).
NOTE: There must be one file here for each file in the _1 list,
and the files must appear in the same order.
|
Options:
|
|
-o/--output-dir <output_dir_name>
|
The name of the directory in which you'd like Spats to write its output. Default spats_out.
|
--adapter-t <adapter sequence>
|
Deprecatated in v0.8.0 and later. It is preferrable to handle adapter trimming via the adapter_trimmer script outlined above. The sequence of
adapter_t (which contains the RT primer). This option is used when adapter trimming is desired. Example
AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATC
|
--adapter-b <adapter sequence>
|
Deprecatated in v0.8.0 and later. It is preferrable to handle adapter trimming via the adapter_trimmer script outlined above. The sequence of
adapter_b (which is ligated to the single-stranded cDNA products of the RT reaction). This option is used when adapter trimming is desired. Example
AGATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCGATCTCGTATGCCGTCTTCTGCTTG
|
--num-mismatches <0-3>
|
The number of mismatches you want Bowtie to allow in its alignments for each read to the target RNAs.Note: this option does not work in v0.8.0 and later because of
the algorithm implemented in adapter_trimmer.
|
--m <int>
|
This option adds stringency to alignments, where <int> is the max number of aligenments allowed for each read. Reads with more than <int> number of valid alignments are not used in the reactivities calculation. Default is not stringent.
|
-p/--num-threads <threads>
|
The number of threads you want Bowtie to use when aligning the reads to the target RNAs
|
--all-RT-starts
|
If present prints out reactivties for each RT site detected. Default not used.
|
Output:
|
|
|
see below
|
|
The spats pipeline (adapter_trimmer and spats) takes three files as input: the target RNA file in FASTA format, and two files containing
the paired-end sequencing reads from the experiment in FASTQ format. The sequecing reads for both the plus and minus channels of the
SHAPE experiment must be contained in the files. Please refer to the references on the right hand side for
more information on how these handles are included in library preparation and sequencing. It is critical
that the mated reads appear in the same order in each file. The target RNA file is in FASTA format, and
contains one entry for each RNA that is included in the experiment, e.g.
>target_GTAC
GGTCGTGCCTAGCGAAGTCATAAGCTAGGGCAGTCTTTAGAGGCTGACGGCAGGAAAAAAG
CCTACGTCTTCGGATATGGCTGAGTATCCTTGAAAGTGCCACAGTGACGAAGTCTCACTAG
AAATGGTGAGAGTGGAACGCGGTAAACCCCTCGACCGATCCGCTTCGGCGGATCCGTACAA
ATCGGGCTTCGGTCCGGTTC
>target_TTAA
GGTCGTGCCTAGCGAAGTCATAAGCTAGGGCAGTCTTTAGAGGCTGACGGCAGGAAAAAAG
CCTACGTCTTCGGATATGGCTGAGTATCCTTGAAAGTGCCACAGTGACGAAGTCTCACTAG
AAATGGTGAGAGTGGAACGCGGTAAACCCCTCGACCGATCCGCTTCGGCGGATCCTTAAAA
ATCGGGCTTCGGTCCGGTTC
...
In this case, two identical RNA sequences were barcoded with two sequences (GTAC, TTAA) and run in the same
experiment.
Spats Pipeline Output
The spats pipeline produces a directory (by default called spats_out) containing several output files.
De-multiplexed reads
Spats splits the reads into the plus and minus channels according to the
handle sequences you've specified. The output directory will contain FASTQ
files named <treated_handle>_1.fq, <treated_handle>_2.fq, <untreated_handle>_1.fq and <untreated_handle>_2.fq
Reactivities
The reactivities.out file is a tab-delimited file of values
providing the reactivity at each base of each target RNAs. The example
below shows the first few rows of a run against wild-type RNaseP (barcoded with CTTG) and
tRNA (barcoded with AAAG>). Data for different targets are concatenated within the same output file.
sequence rt_start five_prime_offset nucleotide treated_mods untreated_mods beta theta c
RNASeP_WT_CTTG 202 0 * 1144 4201 - - 1.22577
RNASeP_WT_CTTG 202 1 G 28 110 0 0 1.22577
RNASeP_WT_CTTG 202 2 G 87 315 0.000908854 0.000883849 1.22577
RNASeP_WT_CTTG 202 3 T 10 6 0.00553418 0.00539681 1.22577
RNASeP_WT_CTTG 202 4 C 7 23 0.000459691 0.000446923 1.22577
RNASeP_WT_CTTG 202 5 G 8 21 0.00146664 0.00142677 1.22577
...
tRNA_AAAG 136 0 * 1267 30440 - - 2.53892
tRNA_AAAG 136 1 G 118 1145 0.0213467 0.0205298 2.53892
tRNA_AAAG 136 2 G 28 123 0.00672447 0.00635241 2.53892
tRNA_AAAG 136 3 C 17 51 0.00432867 0.0040774 2.53892
tRNA_AAAG 136 4 C 84 319 0.0193321 0.0185458 2.53892
tRNA_AAAG 136 5 T 95 529 0.0182941 0.0175276 2.53892
...
sequence | Name of target RNA (from the input sequences FASTA file). |
rt_start | The base index where RT priming started. For SHAPE-Seq v2.0 this is the end of the RNA. For RT
primers that target internal sequences this is internal to the RNA sequence. |
five_prime_offset | Position in target RNA, 0=full length |
nucleotide | Base at position in target, *=full length |
treated_mods | Number of (+) fragments ending at position in target |
untreated_mods | Number of (-) fragments ending at position in target |
beta | Beta value at position in target |
theta | Theta value at position in target |
c | Estimated SHAPE modifications per target |
See the references, and especially Aviran, Lucks, Pachter: RNA structure characterization from chemical mapping experiments, for definitions of beta, theta and c.
Splitting reactivities.out
reactivities_split takes a Spats reactivities output file (reactivities.out) and divides it into individual reactivities files for each RNA, containing all of the
RT start positions (unsplit). These reactivities files are then split into smaller files that contain only one RT start position each.
reactivities_split
Usage: reactivities_split [options] <reactivities_file.out>;
Arguments:
|
|
|
<reactivities_file.out>
|
Spats reactivities output file containing reactivities information for all targets.
|
Options:
|
|
-h
|
Prints this help menus
|
-v
|
Prints the version number
|
-q,--quiet
|
Suppresses output of reactivity filenames being generated (useful for a lot of RNA targets)
|
-g,--group
|
Groups all split reactivities files into one directory, regardless of RT start position (useful for random priming)
|
-l,--long-only
|
Only generates reactivities files for the longest RT start position of any target (most 3'), takes precedence over -g or --group
|
|