Ribostral: An RNA 3D alignment analyzer and viewer based on basepair isostericities
Table of Contents:
1. Introduction
Ribostral (Ribonucleic Structural Aligner) is a suite of programs designed
to integrate known structural data with homologous sequence alignments, with
the purpose of evaluating the quality of the alignments and guiding efforts
to improve them. The main GUI (Graphical User Interface) of this program provides
an expandable user-friendly platform through which other related programs can
be run. The related programs gather and analyze atomic resolution structure
data, parse and automatically align sequences, and perform other manipulations
on sequence alignments including extracting sub-alignments corresponding
to individual motifs or domains, removing
repeated sequences from an alignment to build a “unique” alignment
with higher phylogenetic diversity, and creating a .fasta alignment file from
.mat, which is another alignment format used by Ribostral. These tools are
covered in detail at the end of this manual. The main functions of the program,
namely analyzing, evaluating, and viewing RNA sequence alignments are discussed
first.
RNA sequence analysis programs are common, but none of them provides the valuable structural information provided by ribostral. One of the most widely used editors for manual alignment of RNA sequences is the program BioEdit, which runs under Windows platforms (1). BioEdit reads a sequence alignment file and allows the user to choose pairs of nucleotides one pair at a time and display substitution (covariation) patterns for them (in BioEdit, this is called Mutual Information Examination). The resulting substitution table, however, provides no description of what the observed substitution patterns mean in terms of structure. BioEdit does provide one kind of link between sequence alignments and structure: if the sequence alignment contains a “mask” describing cis Watson-Crick nested basepair, BioEdit will color nucleotides in the alignment in terms of how well they conform to the basepair family. A mask is a representation of the locations of basepairs occurring between nucleotides represented in the sequence alignment, with paired characters such as "(" and ")" indicating the positions of basepairs in the alignment. BioEdit however does not provide such information for any of the other eleven or so families of edge-to-edge interactions that are possible between basepairs, comprising about one third of all interactions (2). BioEdit is designed to be primarily an alignment editor and viewer, and not a tool for structural alignment of RNA the way Ribostral is. Coseq, a program that runs under UNIX platforms, measures substitution patterns of basepairs without using any structural information (Massire and Westhof, unpublished). The user then needs to analyze the structure manually to see if sequence substitution patterns agree with it. Finally, S2S is another more recent UNIX program that dynamically displays parts of structure, creates full 2D annotations of them, and shows corresponding positions in sequence alignments in an alignment editor (3). However, it too does not evaluate alignments based on isostericities of basepairs formed in structure the way Ribostral does.
Ribostral, which runs under multiple platforms, can either be used like other sequence analysis programs, i.e., to simply provide substitution patterns of basepairs in a sequence alignment without any relation to structure, or can be used as the much more powerful tool that it is designed to be: to provide substitution patterns, and at the same time superimpose structural information on top of the substitution patterns to make sense of it. Ribostral does that by coloring substitution patterns of each basepair in a way that reflects the edge-to-edge family and the isosteric subfamily it belongs to (4). Even in its more simple usage as a program that provides sequence substitution analysis without the use of 3D structural information, Ribostral is more convenient than other programs, because it allows for simultaneous analysis of lists of basepairs and produces a single and easily portable HTML output, without having to input nucleotide numbers of interest one position at a time. Interactive position-by-position analysis is also possible, where in addition to what has been described above, an integrated structure viewer is also available. In addition to providing substitution information for basepairs, Ribostral is also capable of analyzing substitution patterns for more than two nucleotides at once, such as base triples, quadruples, and so on. The sequences of whole motifs can be analyzed this way, as will be shown below.
1.1. Supported platforms and deployment process
Ribostral is designed and fully tested under Windows XP. It is free and can
be obtained from http://rna.bgsu.edu/Ribostral. The program is distributed
in two forms: MATLAB source files capable of running on any PC or MAC platform
with MATLAB version 7 SP3 or higher (with loss of some non-essential options
on
MAC), or a stand-alone program capable of running under the PC
platform, after installation of a free compiler provided by Mathworks
(details can be found here).
Figures used in this text are based on the Windows XP version with system
appearance set to Windows Classic style. Upon downloading the MATLAB source
files or the stand-alone version, the user will end up with the specific
hierarchy of folders shown in Figure 1.
Figure 1. Ribostral default installation subdirectories. Using
folders marked with asterisk (*) is optional; on Windows platforms, these are
the locations where Ribostral starts browsing for the corresponding input files.
1.2. Disclaimer
No guarantee, expressed or implied, is made as
to the suitability of this software for any purpose, computer, or person. The
author shall not be held responsible, nor be liable for any damage occurring
in any way to equipment or health while using this software.
Author contact information: Ali Mokdad, M.D., Ph.D., Department of Biological Sciences, Bowling Green State University, Bowling Green, Ohio 43403. Email: mali@bgsu.edu.
2. Executing
Ribostral
Ribostral can run as a GUI or as a script under MATLAB (by
running the program ribostralNoGUI; follow the header information to point
to your files of interest). The following discussion concerns mainly the
default GUI version.
The startup window of Ribostral is a blank GUI with
three menu options: File, Tools,
and Help
(Figure
2).
The
other
GUIs
of the
program
are activated
from
this window.
Figure 2. Ribostral’s main GUI.
To avoid ambiguity, the GUI only displays options that are allowed at the specific stage of the analysis. For example, before doing any analysis the user needs to load a sequence alignment file on which the analysis will be performed. So initially, the GUI does not display the buttons that start the analysis. These buttons will appear only after the sequence alignment file is loaded into the program. To provide additional help for the user, the upper right-hand corner of the main GUI is reserved for messages describing the status of the program, any errors in execution, or hints on how to proceed further.
2.1. Loading an alignment file
To start the analysis, the user needs to follow the hint that appears in the
upper right-hand corner of the main GUI: “Hint: start from FILE”.
By clicking on “File” from the menu bar, the user sees the options:
Open FASTA alignment, Open NT list, and Preferences (Figure 3).
Figure 3. File menu options.
The first thing to do is to open an alignment file. In its
current version, Ribostral reads only alignments in FASTA format, the simplest
and most common sequence alignment format available. In this format, each sequence
is represented by a header line preceded by the symbol “>” and
one or more sequence lines. Upon choosing “Open FASTA alignment” from
the File menu, the user can browse the local drive for alignment files (Figure
4).
Figure 4. Browsing for an alignment file.
On PC machines, the browser window initially starts looking
for alignment files in the folder:
“<
installation directory>\FASTA_alignments” (refer to Figure 1). Here,
there are two main options: reading a raw alignment file in its text form (with
extensions such as .fasta, or .txt), or reading a MATLAB data file (.mat) which
is derived from the raw alignment file. This last option is processed faster
by Ribostral, especially if the alignment of interest contains thousands of
sequences, like the 16S rRNA alignments. A .mat file is created after a FASTA
alignment is read for the first time. It is saved in the same directory and
under the same name except for the extension. Time can then be saved by reading
the data file instead of the text file the next time the same alignment is
used. Note that a new data file is only created if no data file with the same
name is already present in the directory. Therefore, if the raw FASTA alignment
is deliberately modified in any way, the old .mat file referring to it needs
to be deleted to allow for its re-creation. Notice that when a FASTA format
file is read, all characters indicating unknown nucleotides (“N”, “n”, “O”,
and “o”) are transformed into “o”, and all other characters
are capitalized. When analyzing basepairs, Ribostral only recognizes dashes
(“–”) which represent gaps or deletions, o’s, and the
four RNA nucleotide letters A, C, G, and U as valid characters in sequences.
When analyzing longer motifs, the program recognizes all characters.
2.2.
Dividing Sequences into Subgroups
After choosing the alignment file of interest, Ribostral looks in the same
directory for an Excel file called “KnownFASTAFilenames.xls”. This
is an optional user-created Excel sheet that gives additional details about
the FASTA file being read, such as the names of different groups (or domains)
of organisms it represents, and the boundaries of these groups. If this Excel
file is not present in the same folder as the FASTA file being read, or if
the exact name of the FASTA file being read cannot be found in the Excel file,
this information will be ignored and all FASTA sequences in the alignment will
be considered as one group. The analysis of specific subgroups (such as phylogenetic
domains) will not be possible later. Figure 5 shows the format of the “KnownFASTAFilenames.xls” file
and the information it contains.
Figure 5. A snapshot of the “KnownFASTAFilenames.xls” file.
The highlighted entry (number 6) is the FASTA file analyzed in this manual.
The Excel file is organized in the following way: The first
column contains the exact names of “known” FASTA alignments; the
second column contains the names of the subgroups the FASTA alignment contains
separated by empty spaces; and the third column defines the limits of these
subgroups in the order of their names, separated by empty spaces. The highlighted
file (entry 6, 5S_ABE_2004_UNIQUE.fasta) is the one used for the sample study
in this manual. It contains sequences from the phylogenetic subgroups: archaea
(sequences 0+1 to 39), bacteria (sequences 39+1 to 390), and eukarya (sequences
390+1 to 667). To ignore the first sequence in the alignment (if it is a structure
mask for example, or if it is a reference sequence that does not belong to
the subgroup), “1” can be placed instead of “0” in
the third column. In that case, the program still reads the first sequence
but does not include it in the analysis.
The main GUI then displays any known details associated with the FASTA file
and also unlocks some buttons or check boxes that permit the user to analyze
the alignment (Figure 6). Note that if this step (loading a new FASTA file)
is repeated at any stage during the analysis, all previous data that corresponds
to previously read alignments will be erased from memory and the GUI will be
reinitialized.
Figure 6. GUI changes when an alignment file
is successfully loaded. The text box in the middle displays the name of the
sequence file in memory.
At this stage, two options are possible for carrying out sequence (and optionally structure) analysis: analyzing a list of nucleotides (with or without structure data) for the complete sequence analysis of all positions listed in it, or interactively and directly analyzing individual positions on screen. The first option is covered first.
2.3.
Preparations for the analysis of a list of nucleotides
After loading a FASTA file, the user can load an Excel file containing a list
of the nucleotides of interest. This is done by activating the option “Open
NT list” from the File menu. A browser opens and on PC platforms search
for Excel files starts in the local directory “<installation directory>\NT_lists” (Figure
7).
Figure 7. Browsing for an Excel nucleotide list.
The NT list Excel file must contain names of the source
organisms and nucleotide numbers of interest. The first row is a header row
and anything in it will not be read. Figures 13 and 14 show what such a file
should look like.
Figure 8. Nucleotide list for analysis. This
Excel list from 23S rRNA allows the sequence analysis of four nucleotide positions
simultaneously.
Figure 9. Basepair list for analysis. This
Excel list from 5S rRNA includes structural data about some of the basepairs
(to show that structural data is optional).
As seen in Figures 8 and 9, the first column of the Excel sheet contains the name of the reference organism for that row, and subsequent columns contain the nucleotide numbers to be analyzed. The name must be the full name or part of a name (case insensitive) present in the loaded FASTA alignment file. It does not necessarily need to be the beginning of the name (e.g. “Halomari” can be used if the actual name in the FASTA alignment is “Eu_halomari”). If several FASTA comment lines share this name, the first of the occurrences will be considered as the reference organism (e.g. if “halo” is used, and it is present three times in the sequence alignment, first as “EU_HALOJAPO”, then as “Eu_halomari” and then as “Eu_halomedi”, the first will be used). It is advisable for the user to manually check sequence names in the FASTA file to prevent inadvertent reference to unintended organisms. The most common error is caused by incorrect spelling of the source organism in the NT list Excel file; if such a spelling cannot be found in the FASTA file an error message will appear and the program will stop. If universal numbers for the alignment are desired the word “universal” can be used instead of an organism name (similar to entry 8 in Figure 9).
Structural information (i.e. basepair type) is allowed only
for basepair lists. These are lists comprising two columns of nucleotide numbers,
like
the one
shown in Figure 9. Basepair types are named by reference to the interacting
edges of the nucleotides. These are the Watson-Crick edge (W), Hoogsteen
edge (H), and sugar edge (S). Edge-to-edge interactions in addition may
be cis (c)
or trans (t) with respect to the glycosidic bond. This gives rise to twelve
main families of basepairs (2,4), and
some intermediate families (5), only one of which with
currently characterized isostericity matrix (the bifurcated cWW/tWH family).
These thirteen
basepair types are coded as follows in order for Ribostral to understand
them (case insensitive for all except tSs, because it is a directional
interaction
with asymmetric isostericity matrix): cWW, tWW, cWH, tWH, cWS, tWS, cHH,
tHH, cHS, tHS, cSS, tSs, and bif. Asymmetric codes (such as cWS) can be
reversed,
so in the table if nucleotides 22 and 26 in this order are coded as tSW,
this is the same as 26 and 22 being tWS (see Table 1 for a definition of
all the
basepair codes used by Ribostral). Ribostral always presents data about basepairs
as one of the thirteen codes listed above (and not the reverse form).
Table 1. Basepair codes used in Ribostral.
The last column shows the codes used for constructing the structural mask in
Alignment Viewer, discussed later in text.
If the type of basepairing is known for a pair of bases, the output of the sequence analysis will be colored to indicate its isosteric subfamilies, so the investigator can determine whether the aligned nucleotides for each sequence represent isosteric or near-isosteric substitutions. If no structural information is available, the table output will only display sequence substitution data observed in the corresponding columns of the alignment.
After successfully reading a nucleotide list file, Ribostral displays a new
button allowing for the analysis of the whole list at once (Figure 10).
Figure 10. The main GUI after successful loading
of a nucleotide list file. The status bar (upper right hand corner) displays “NT
list loaded” and the new button “NT list analysis” becomes
available.
2.4.
Analyzing a list of nucleotides
Upon activation of the button “NT list analysis” the program takes
each row of the Excel NT list file and counts the number of each substitution
of its nucleotides in corresponding columns of the alignment. This process
may take a few minutes for long lists of nucleotides or for large alignments.
The MATLAB command prompt (or the DOS prompt if the compiled stand-alone version
of Ribostral is executed) displays different messages and counters indicating
the status and progress of the analysis. Upon successful execution, output
files are created in the folder “<installation directory>\Output” (refer
to Figure 1), with descriptive names indicating the NT list file and the alignment
file they represent. At the same time a new button providing a quick link to
the main output file will also appear on the main GUI.
3.
Types of output files and their interpretations
Depending on the type of input NT list, several types of output files are
possible (Figure 11). The following sections discuss each type of these output
files
in detail.
Figure 11. Output files produced depending
on input files analyzed. The first output file in each case is the “main” file
accessed directly by clicking the button “Display list results” on
the main GUI. Outputs indicated by asterisks (*) are produced optionally by
changing the program preferences.
3.1. Output
files for a non-basepair list
If the input NT list is not a list of basepairs but a list of nucleotides forming
distinct motids (such as the one shown in Figure 8), two text output files
will be created, one giving counts and the other percentages of sequences that
share a common pattern (Figure 12). The patterns are listed in alphabetical
order. To clarify this, the following example from Figure 12 is considered.
The first entry in the Figure (highlighted) shows that in the archaeal part
of the alignment, nucleotides corresponding to Haloarcula_marismortui local
numbers 545, 611, 529, and 14 (in this order) are 88% GUGC and 12% GUGo. From
this one can deduce that Nucleotides 545, 611, and 529 are GUG in 100% of the
cases, and so on. The values in the percent output file are rounded according
to the preferences of the user (Ribostral preferences will be discussed in
detail later).
Figure 12. Percent output text file created
upon sequence analysis of an NT list. Sequence patterns are listed in alphabetical
order.
A similar output file with counts instead of percents is also created. This
analysis is for the NT list shown in Figure 8.
3.2.
Output files for a basepair list
In case the Excel input NT file gives a list of basepairs from an atomic resolution
structure, Ribostral generates output files that are better tuned and more
informative for this kind of input. These include colored HTML outputs in the
form of tables with sequence substitution values. The tables use different
background colors to indicate for each basepair whether observed substitutions
are isosteric, near-isosteric, heterosteric, or forbidden as compared to what
is observed in the structure. By default, three HTML output files (and a fourth
text output in case the input Excel list is a basepair list with structure
information) are produced upon the analysis of a basepair list (see Figure
11). The main output file accessed directly by the “Display list results” button
on Ribostral is the percent covariation file whose name ends with “_COV_Percents.html”.
A similar “_COV_Counts.html” file shows the counts instead of percents.
Figure 13 displays a snapshot of such an output. A third file (ending with “_SEQ.html”)
displays the names of organisms giving rise to the values in the first two
output files. The positions of the names correspond to the values observed,
as clarified in Figure 14. Note that the basepair identity observed in structure
is indicated by bold font in the corresponding cell.
Figure 13. Percent output HTML file created
upon phylogenetic analysis of an basepair list. A similar output file with
counts instead of
percents is also
created. This analysis is for the NT list presented in Figure 8 above.
Figure 14. Matching sequence values and names.
The “_COV_Counts.html” output file (left panel) displays the sequence
counts, and the corresponding “_SEQ.html” (right panel) displays
the names of sequences. Note the value indicated by the black arrow in both
outputs: The left panel shows that there are three bacterial counts for AA
occupying the position A3/G21 (halomari local numbers). The right panel shows
that these three counts come from the sequences of G_complan, P_palmata, and
S_vulgare.
The output files have two purposes: First, to display substitution patterns corresponding to each BP position analyzed, and second, to provide information about how well each of these positions in alignments agrees with 3D structure. Based on this, if there are any potential mistakes in the alignment they are easily pinpointed for their manual correction with available sequence editors.
All three HTML output files mentioned start with a total score assigned for each phylogenetic subgroup in the alignment (in this case the subgroups are the three domains: archaea, bacteria, and eukarya). Each subgroup will have its own results printed on a separate line in the substitution tables. The areas in Figure 14 indicated by the black arrows for example represent the substitutions in eukaryal sequences of the basepair A3/G21 (using halomari local numbers). On top of each table that represents a basepair position the following information is provided: The names of the subgroups in the order they are analyzed, the numbers of sequences in each subgroup, the count of forbidden substitutions present in the table (in red color, with the letter F following the value), the count of gaps present including gaps on both sides or just on one side (in gray, with the letter G following the value), the count of isosteric substitutions (in blue, with the letter I following the value), nearly-isosteric substitutions (in cyan, with the letters NI following the value), non-isosteric but not forbidden substitutions that we refer to as heterosteric substitutions (in pink, with the letter H following the value), and finally individual scores of each subgroup at that basepair position.
The score is a value describing how structurally “acceptable” the sequence substitutions at analyzed basepair positions are. This is based on the measure of how structurally compatible these substitutions are with the basepair in the reference organism, which is typically the organism with known structure. The individual basepair scores are currently calculated based on an ad hoc formula derived from experience with structures and knowledge of the patterns of allowed substitutions for each type of basepair. This formula can be easily modified by the user, by manually entering desired weight parameters in the file “<installation directory>\Ribostral\SCORES.txt”. The formula used throughout this text is:
Individual BP score = c * SUM(3I + 2NI – H – 2F – 2G1 – 3G2) / number of sequences
Where c is the correction coefficient: c = 100 / (Highest positive weight),
in this case c =100/3.
I, NI, H, F, G1 and G2 are the counts of sequences having substitutions that
are isosteric, nearly-isosteric, heterosteric, forbidden, gap on one side,
or gaps on both sides respectively. The Highest positive weight is 3 in this
case. The correction coefficient c insures that the maximum score is +100 (in
this case, the minimum score when all substitutions are gaps on both sides
is –
100, but that is not always the case depending on the formula used). The formula
used here is asymmetric: unfavorable terms that contribute to its reduction
are more numerous and weigh slightly more than favorable terms that contribute
to its increase. But for our purposes, this is not a critical point. What is
important is to easily identify low-scoring spots in the alignment to guide
manual realignment efforts. To better locate these trouble spots in the alignment,
the score is printed in red if it is worse than a certain threshold (i.e. if
it is below zero, which is the score in case no structural data is available)
and in black otherwise. The total score printed at the top of each HTML output
file is an adjusted sum (sum divided by number of basepairs studied) of the
individual BP scores in the study (so it is actually an average of the individual
scores). This means that its maximum is also +100 and its minimum, in case
the formula presented here is used, is –100.
Note that another valid formula used can be something like:
Individual BP score = c * SUM(2I + NI – 3F) / number of sequences
Both formulas are ad hoc and the difference between their weights is not essential. The ease of reading and understanding normalized values between +100 and –100 or –150 is the reason for using a correction coefficient.
The most obvious and important element of the HTML output files discussed
thus far is the background color patterns that appear in their tables. These
colors reflect the isostericity matrices of the thirteen characterized families
of basepairs. Each family has its own pattern of isosteric, nearly-isosteric,
heterosteric, and forbidden substitutions. Boxes with the same colors are isosteric.
Boxes that are nearly-isosteric with each other have “similar” colors.
There are five groups of these “similar” colors: pink/red/orange,
red/orange/yellow (so substitutions with pink and yellow backgrounds are not
nearly isosteric to each other), blue/cyan, dark green/light green, and brown
(Figure 15). Forbidden boxes are colored in dark gray, and gaps in light gray.
Boxes with non-similar colors indicate that they are heterosteric to each other.
If the basepair family for a certain basepair is not specified, all nucleotide
boxes in its table will be colored in neutral beige (like the second BP listed
in Figure 13).
Figure 15. Basepair
families and their isosteric subfamilies. Gray colors indicate forbidden combinations
of nucleotides,
i.e., combinations
that cannot form basepairs due to steric clashes or incompatible distribution
of H-bond donor or acceptor atoms. In each family, isosteric subfamilies have
the same color, and nearly isosteric-families have similar colors (the five “similar” color
groups are shown on the bottom right corner). Letters correspond to the original
reference (4). Asterisk (*), corrected from the original
reference.
One more output file is produced by default in case a basepair
input list with structure information is analyzed. This is the “_Statistics.out” file,
which is best viewed with Microsoft Excel or WordPad. This output summarizes
some of the results by stating the percentage of basepairs analyzed that have
mostly allowed substitutions (containing <10% forbidden or gaps), the percent
of basepairs that have >10% forbidden substitutions, and the percent of
basepairs having >10% of sequences with gaps at their positions. It also
shows the percent of basepairs among the ones mostly allowed that have only
isosteric substitutions. Instead of the total score which is just one value
that describes the quality of the alignment studied, this output file gives
a more quantitative measure of the alignment quality.
When a Basepair list is analyzed, a button labeled “Plot scores” also
appears on the main GUI. When activated, this allows the user to quickly analyze
the scores of each basepair and define places of misalignment or motif swaps
(Figure 16).
Figure 16. Score plots for the archaeal
5S rRNA alignment. The red curve describes the scores of corresponding basepairs,
and the blue
line represents the average score (or total score as defined in this text)
for the whole alignment based on the basepair list provided. By modifying the
ad hoc formula according to which scores are calculated (this can be done by
changing parameters in the “SCORES.txt” file), the user can plot
any combination of one or more aspects of the alignment, such as percent isosteric
substitutions, percent isosteric and near-isosteric substitutions, and so on.
3.3. Ribostral Alignment
Viewer
There are additional output files not produced by default that can be obtained
by changing the Preferences under the File menu options. The reason for not
producing those output files by default is not to overwhelm the casual user
with too many output files, and not to unnecessarily extend the execution time.
If the “AlignViewer” option is selected in Preferences, an HTML-format
alignment viewer is created where the sequence alignment is shown in colors
indicating substitutions that are compatible with the 3D structure and those
that are not. This tool is the first of its kind taking into consideration
all thirteen families of basepairs instead of just the classical cis Watson-Crick
family. The sequences in the Alignment Viewer are colored in a way to describe
how well each column (more precisely, each pair of columns) in the alignment
agrees with structure (Figure 17). Basepairs from each sequence are colored
individually based on their isosteric agreement with the homologous basepair
from the reference sequence, which is usually the sequence with known 3D structure.
If the substitution pattern is isosteric to the one in the reference sequence
it is colored in green, if it is nearly-isosteric to it, it is colored blue,
heterosteric is pink, forbidden is red, gap on one side is dark gray, and gap
on both sides is light gray. If no basepair information is available the nucleotides
are printed in black (these color assignments can be modified by changing the
MATLAB script file “mColorCode.m”). For easy interpretation, the
color scale is printed at the top of the alignment viewer HTML file. Thanks
to this tool, it is now possible to look at the sequence alignment and directly
get a good idea about whether the alignment is structurally valid or not, and
where the major areas for local alignment mistakes (or motif swaps) are located.
Note that if base triplets are present (i.e., two basepairs with one nucleotide
in common), the basepair listed earlier in the Excel BP list file takes priority
in coloring. If the user prefers to give priority to cWW interactions for example,
then these should be listed first in the BP list.
Figure
17. Ribostral HTML alignment viewer. The visible part of the 5S alignment
describes how well sequences agree with structure, represented here by Eu_Halomari,
or sequence number 11. The color code is printed at the top. Note that the
whole content including colors can be copy/pasted into Excel or other editors
for manipulation.
Ribostral Alignment Viewer starts with several title lines
containing the names of the analyzed files as well as the legend of all codes
and colors used. Following this is the listing of all organism names and their
sequences, exactly as they appear in the original FASTA alignment. Organism
names change their color gradually as they approach the limit of the subgroup
or domain they belong to. This is how it is made clear that sequence 39 for
example is the end of the archaea domain. The first five sequences (those on
black background) represent the universal numbers, local numbers, and three
types of structural masks that describe basepairing patterns. Universal numbers
are assigned to each character seen in the sequence, including all indels (“–”).
Local numbers are the numbers that correspond to the reference sequence; unlike
universal numbers, local numbers are not assigned to indels in this sequence.
In the Alignment Viewer only decimal representatives of local numbers are listed,
i.e. the first “1” stands for 10, the first “2” stands
for 20, and so on until the first “0” is seen, which stands for
100. After that the numbering cycle is repeated, so the second “1” stands
for 110, the second “2” stands for 120, and so on.
The structural masks describe basepairing patterns reported in the Excel BP
list directly on top of the sequences. Figure 18 is a schematic description
of the information represented by structural masks.
Figure 18. Schematic explanation
of structural masks. These are one dimensional representations of structure.
The example shown here is from Helix 95 containing the sarcin/ricin motif
in the large ribosomal subunit of H. marismortui (pdb file 1S72). Color codes
correspond to basepairs. Note that the 13D and 3D masks may have several
symbols overwriting each other; only one of the cyan RG and ][ symbols displayed
in this Figure on top of each other is displayed in the 13D and 3D masks
produced by Ribostral.
The 2D mask describes only nested basepairings, irrespective of their geometric families (most of these would be helical cWW basepairs). Each nested basepair (one which does not cross in 2D with other basepairs) is represented by the two parentheses symbols “(” for its opening or 5’ nucleotide and “)” for its closing or 3’ nucleotide. The 3D mask is the same as the 2D mask, but in addition to representing nested basepairs it also represents other non-nested basepairs using the bracket symbols “[” and “]”. When one nucleotide forms more than one basepair by using more than one of its edges at once (e.g. in base triplets), the number of bracket symbols of the opening and closing type may not be equal. Therefore, it is not possible from this mask alone to know which opening brackets correspond to which closing brackets. Another limitation of both the 2D and 3D masks is that they do not tell anything about the type of basepairs formed. The 13D mask is designed to solve this problem. Here, instead of just one or two pairs of symbols representing opening and closing of basepairs, thirteen different pairs of symbols are used to represent all known families. These codes are AB for the cWW family, CD for tWW family, and so on (see Table 1). Since some families also are not symmetric (C/G cWS has a different meaning than C/G cSW), uppercase symbols and lowercase symbols are used to indicate the direction of the interaction. The symbols must be read from uppercase to lowercase direction whenever the interaction is asymmetric, with the uppercase symbol referring to the base edge of higher priority (edge priority is assigned in the order W, H, S). Thus, if an interaction exists between columns (universal numbers) 38/42 and is represented by kL, this means that nucleotides occupying these positions form a tSW interaction in the source organism (or structure). This is the same as saying that 42/38 is a tWS (or Lk) interaction. The 13D mask has the same limitation as the 3D mask, in that finding matching opening and closing symbols is not always straight forward. Using the 2D mask together with the other two masks makes it easier to at least find nested pairs.
3.4. Other output formats
Ribostral can also create other output files upon request (by checking the
option “GU special” in Preferences). These files represent another
way of classifying some similar isosteric groups together, the way the G/U
wobble basepairs are analyzed in our previous work (6).
Ribostral scripts can easily be modified to produce other similar outputs.
4.
Interactive analysis of nucleotides
Besides analyzing lists of nucleotides or basepairs at once and producing dedicated
output files that describe their results, Ribostral is also capable of analyzing
nucleotides of interest one by one and directly displaying their results on
screen. This can be done by activating the button “Interactive analysis” from
the main GUI, which starts a new GUI window (Figure 19). This option becomes
possible only after successful loading of a FASTA alignment file. If a BP list
file is also loaded before the interactive analysis is activated, the new GUI
will have more options and information extracted from that file. In the following
text we will discuss the options available in case a BP list is loaded before
activating the interactive analysis GUI.
Figure 19. Initial screenshot of the interactive analysis GUI. Here, both
a FASTA file and an NT list file were loaded from Ribostral main GUI before
starting this GUI. If only a FASTA file was loaded, some of the options or
buttons seen here would not have been made visible.
The interactive analysis tool is a powerful tool providing yet a new array of functions not provided by the list analysis tools discussed in previous sections. After specifying the right choices (such as source organism name, domain of interest, etc…) the user can either analyze a specific family of basepairs, or analyze a particular position.
4.1.
Interactive analysis of a family of basepairs
The right-hand side of the interactive analysis GUI contains the options and
buttons capable of gathering statistics about a whole family of basepairs at
once. This option is only possible if the user has loaded a basepair list before
opening this GUI. An example of what can be done here is the analysis of all
the occurrences of the tHS family in the archaeal part of the 5S rRNA alignment.
The result of such an analysis is shown in Figure 20.
Figure 20. Analysis of all occurrences of a
basepair family. All basepairs of family 10 (tHS) are analyzed here in the
archaeal
5S rRNA alignment.
The GUI shown in Figure 20 states on the top right the number
of sequences in the chosen domain (in this case the domain archeae has 39 sequences).
Below that, the number of basepairs forming this particular interaction in
the source organism is shown. In this case six such tHS basepairs are found.
The isosterically colored buttons that appeared in the middle of the GUI display
the substitutions in the sequence positions corresponding to all these six
positions at once. Notice how most of them are clumped in the yellow isosteric
subfamily (the colors are the same as defined previously for the HTML output
files). Upon clicking on any of these colored buttons the names of sequences
giving rise to them (preceded by the corresponding nucleotide numbers in the
source organism) will be displayed in the lower part of the GUI (Figure 21).
This allows for easy identification of organisms with potential mistakes in
their alignments, so that the investigator can realign them by hand.
Figure 21. Getting sequence names with specific
substitution patterns. Upon clicking any button in the substitution matrix
(button GU
is clicked here), the names of sequences giving rise to the substitutions
become displayed in the lower part of the GUI. Sequence names are preceded
by the corresponding nucleotide numbers from the source organism.
4.2. Interactive analysis of individual basepairs or motifs
In addition to analyzing a family of basepairs, it is possible through this
GUI to analyze a particular position by entering individual numbers separated
by commas “,” or dashes “-” into the editable box in
the middle of the GUI. But first, the source of local nucleotide numbers has
to be specified. If this GUI is activated after a BP or NT list is loaded,
then the sequence name of the first entry will be initially displayed in the
white editable box and gray drop-menu box that define the source sequence for
the analysis (upper left corner of the GUI). Both of these boxes can be used
to specify the source organism, but in case the sequence names in these two
boxes are different, then the white editable box takes priority. If universal
numbers are desired the word “universal” should be typed in the
white editable box and carriage return pressed. If only two
nucleotide numbers separated by “,” are entered (such as “22,
26”), the sequence substitutions for these two positions are displayed
in the same output format seen for the basepair family analysis described in
the previous section. The value in the box that corresponds to the source organism
will be printed in bold. If the nucleotide numbers entered are present in the
input Excel BP list and the interaction they make is known, the buttons will
be colored according to their BP family isostericity matrix. The family name
also will be printed on the upper left corner. If any of the buttons is clicked
the names of sequences giving rise to the value in it are displayed in the
lower part of the GUI (Figure 22). (Note: it is possible to scroll between
the basepairs from the input BP list one by one by clicking the green buttons
labeled “<<” or “>>” that appear in the
middle of the GUI).
Figure 22. Basepair interactive analysis. Notice
that if the query is “22, 26” instead of “26, 22” the
same result is obtained, since this is a known BP present in the input BP list
and
it is always colored in the same way. The BP identity present in the source
organism (also the crystal structure) is CG (printed in bold font).
If instead of a basepair position (such as “22, 26”),
the user enters only one, or more that two numbers separated by commas, or
if the user enters two numbers separated by a dash, or any logical combination
of comma-separated and dash-separated numbers, then a motif sequence analysis
will be performed and results will be produced in a different format, where
the motif pattern and its number of occurrences will be printed in text instead
of the colored table. The pattern present in the source sequence will also
be indicated by the “<<<” sign. Once again, if the user
clicks on any one of the nucleotide patterns, the names of sequences giving
rise to them will be displayed in the lower portion of the GUI (Figure 23).
Figure 23. Motif interactive analysis. Individual
nucleotides in the motif are arranged in the order of the positions entered.
Since the query here is “26-22, 37”, the letters that correspond
to local number 26 of source organism is displayed first, then those that
correspond for 25, then 24, 23, 22, and finally 37. If instead of this the
query was “22-26, 37” the result would be displayed differently.
Finally, the interactive GUI provides a quick link to a basic structure viewer through the activation of the button “Display 3D”. When this button is clicked the first time it prompts the user to choose the PDB file corresponding to the sequence studied, and atomic coordinates are read (the bigger the PDB file the slower this process). The structure of the input nucleotides is then shown. Subsequent activations of the button within the same session display the structure immediately.
4.3.
Notes
1- The user can change the color distribution of the substitution
table to be similar to another basepair family. This may be important in
case mistakes in the BP list are suspected, or in case no BP family is stated,
causing the whole table to be in one neutral color.
2- It has been shown that the colored GUI buttons may display the results
of either a whole BP family or an individual position. To prevent any confusion,
a thin white box appears everytime one of these two types of analyses is
requested. This line surrounds the output buttons as well as the input buttons
giving rise to them. The difference can be clearly seen between Figures 20
and 21 on the one hand and Figures 22 and 23 on the other.
5. Ribostral
preferences
Ribostral is a diverse program capable of dealing with data in a variety
of ways. Through its preferences, the user can enhance his experience with
the program and make it do what is best for the investigation in hands. Ribostral
preferences can be accessed from the File menu of the main GUI. They are
then saved as a data file on the users disk, for easy passing between all
subordinate Ribostral GUIs. Last positions of GUIs before exiting them are
also saved in this file, so that GUIs always open in the last known position
they were closed in. Figure 24 shows the default preferences of the program.
Figure 24. Ribostral default preferences.
The preferences are divided into several categories. Round
radio buttons belonging to the same category are mutually exclusive. The first
two categories, “Display Totals” and “Display Expected”,
affect only the list output files and do not affect the interactive analysis
GUI. If “Display Totals” is enabled, the totals of each row and
column of the BP list output are printed with the output (as in Figure 13).
Otherwise, the totals will be left out. If “Display Expected” is
enabled, the expected value for each substitution is printed in parentheses
after the observed substitution value, as in Figure 25.
Figure 25. Substitution counts with expected
values (in parentheses).
The expected value for each box in the table is obtained by applying the formula:
Expected (of box in row a, column b) = Sum of boxes in row a * Sum of boxes
in columns b / Table total
This value is a measure of the likelihood of observing substitutions in the
specific box, based on the values observed in the neighboring boxes.
Enabling the “Audio” option allows the program to play specific musical chimes every time an operation is completed or an error occurs. This is useful in case a large file is being read or analyzed. In such cases the user will be notified by the sound to carry on to the next step.
Enabling “Consider Gaps” means that the program will not ignore insertion characters (“–”) seen in the alignment. Disabling this option not only stops the program from showing gaps in the output, it also ignores gaps from any other calculations, such as score or percent substitution calculations. The “consider Ns” option does the same thing but with characters representing undetermined nucleotides (usually symbolized by “o”, “O”, “n”, and “N”, which as stated earlier are all converted to “o” when the FASTA alignment file is originally read). These two options do not affect analysis of non-basepairs, such as base triples or longer motifs.
Several rounding options are available through a dropdown menu. The choice here affects the presentation of all decimal calculations done by Ribostral. Activating any of the “Additional Output” options results in the production of one or two more output formats that Ribostral does not produce by default (discussed earlier). Preferences are saved only if the “SAVE” button is activated.
Finally, upon activating the button “Modify Score Parameters”, the file containing score parameters (“<installation directory>\Ribostral\SCORES.txt”) is opened for editing. The parameters entered in this file affect all score calculations. The resulting formula is shown on screen and is also printed out in some output files every time sequence analysis is carried out.
6. Supporting tools
Ribostral provides an expandable sequence and structure analysis platform.
Additional tools can be easily and smoothly integrated into the main program.
There are currently five additional tools that allow Ribostral to do operations
beyond structurally analyzing sequence alignments (Figure 26). Each of these
tools is discussed separately below.
Figure 26. Ribostral tools.
6.1.
Tool 1: Generate BP list from PDB
This tool currently works only on PC platforms. PDB files are text files containing
all the atomic coordinates and related information about a structure. By analyzing
this data, interacting bases can be identified and classified into geometric
families (7). Upon activation of this tool, the user first chooses a PDB file
(browsing starts from the default PDB directory associated with Ribostral “<installation
directory>\PDB_structures”). Then, the user is prompted to enter the
sequence name in the alignment that corresponds to this 3D structure. Ribostral
then creates an Excel BP list for each of the basepair families found in the
crystal structure. It also creates a separate Excel BP list containing all
of them, and another Excel BP list containing all the non-cWW ones among them
together. These lists are created in the proper format to be read directly
by Ribostral and are saved in the default “<installation directory>\NT_lists” folder.
Note that the automatic classification of basepairs is not 100% accurate, so
the resulting automatic lists should be visually compared to the 3D structure
and corrected where needed.
6.2. Tool 2: Align sequences
This tool opens a GUI that interfaces with automatic motif alignment programs
based on a hybrid Stochastic Context Free Grammars/Markov Random Field (SCFG/MRF)
model. The GUI facilitates some of the steps in running these programs, and
allows the user to parse and align motifs that have their SCFG/MRF nodes
already characterized based on known 3D structure. Currently, MATLAB script
files still
need to be modified in order to parse and align new motifs. This tool mainly
provides an easily expandable user-friendly platform for applying automatic
sequence alignment.
6.3. Tool
3: Extract parts of a FASTA file
This tool, shown in Figure 27, requires loading a FASTA file from Ribostral
main GUI before it can be activated.
Figure 27. GUI for extracting parts of a FASTA
file.
The user then simply needs to indicate the motif of interest by entering its nucleotide numbers in the editable box. All positions homologous to these numbers will be extracted from the original alignment and saved as a new sub-alignment in FASTA format. If there are one or more commas separating two or more groups of numbers, the sub-alignment will represent these breaks in continuity by four consecutive dots (....). This continuity separation code is understood by other Ribostral tools, such as the alignment parser tool discussed in the previous section. Naturally, nucleotide numbers refer to the source or reference organism specified in the GUI. Similarly to the interactive analysis GUI, in case there is disagreement between the sequence name in the white editable box and that in the gray drop-menu box, the white editable box takes precedence. The user can choose to ignore up to five organisms from the top of the sequence alignment (for instance, if these are structure masks or sequences that do not belong to the same phylogenetic group), and can choose to extract one domain or all domains at once. The extracted sub-alignment is saved in the same directory as the original source alignment file.
If there is more than one motif to be extracted, the user can create a tab-delimited input text file containing the names of all of them and their nucleotide numbers for batch processing. The nucleotide numbers should correspond to the Reference Organism Name selected in the GUI. The input text file would look like this:
h5IL 53-56,356-358 h6IL 64-69,99-103 h7IL1 128-131,231-233 h7IL2 133-136,227-229
The nucleotide positions are then extracted from the alignment the same way as if they were entered one by one in the GUI editable box. An advantage here is that the user can assign specific names to the sub-alignments containing these motifs, instead of the otherwise generic names derived from nucleotide numbers extracted.
6.4.
Tool 4: Merge & remove repeats from FASTA files
This tool opens a new GUI from which the user can browse for a minimum of one
and a maximum of three input FASTA files. Each one of the files read will be
checked for individual sequence lengths and repeats in sequence names. If some
sequence lengths are shorter than others, gap characters (“–”)
are added to the ends of the short ones. This prevents some errors when the
alignments are analyzed. If name repeats are found (the name meant here is
the first separate word after the “>” sign in a FASTA header
line), only the sequence with most valid nucleotide letters (A, C, G, and U)
is kept (it is possible to change this default criteria and choose sequences
with shortest internal gaps; this can be done by changing the value for the
variable “Criteria” in the script file mFastaPrinter.m). The resulting “unique” alignment
represents a more divergent set of phylogenetic taxa which may be more meaningful
to analyze. The “unique” FASTA files are saved in the same directory
and under the same names (with descriptive suffixes) as the original FASTA
files they are created from. If more than one original file was inputted into
this tool, a merged file containing all of them is also created.
6.5. Tool 5: Create
.fasta from .mat
This is a simple tool that allows the user to open a “*.mat” file
representing a sequence alignment in MATLAB data format and recreate the “*.fasta” file
it was originally created from. If the file contains more than one subgroup
(or phylogenetic domain), separate FASTA files containing the sequence of each
of them, in addition to one containing them all together, are created and saved
in the same directory as the input “*.mat” file.
7. Help menu
Under the help menu, the manual, website, and version information of Ribostral
can be found. To check if any updated version of the program is available,
the user can compare the local version number to the version number listed
on Ribostral website.
8. Sample study
In this section, a simple example of how Ribostral can be used to improve a
sequence alignment based on structure is given. We will do this exercise on
a small part of an alignment to keep things simple and clear. This is the starting
alignment:
>Structure
CGCCA-....-UAAGG
>Seq2
CGCCU-....-AAAUG
>Seq3
GGCAU-....-AAGGC
>Seq4
GGCAC-....-GGGUC
>Seq5
CACAC-....-GGGUG
>Seq6
CGCAG-....-CAGUG
>Seq7
AGUCG-....-CGCUG
>Seq8
UACAC-....-GGAUA
>Seq9
UGUCA-....-UAUGA
This is an alignment of an internal loop motif similar to
one seen in 16S rRNA internal loop 20 (IL20), the four small dots in the middle
are just to indicate strand dicontinuity (Ribostral
will consider points like this in an alignment as N's, in other words, they
are assigned local numbers unlike the gap symbol "-").
It is hard to tell whether this is a good alignment or not by just looking
at
it.
Some
purines
and pyrimidines
"seem" to be aligned well, there also
seems
to be a couple of cWW interactions that are covarying nicely, but
it is difficult to tell anything specific. Suppose that we know the 3D structure
of the
first sequence seen in this alignment (the one labeled "Structure"),
and that
it looks as in Figure 28:
Figure 28. Structure representation of the
first sequence in the alignment. Althought the sequence alignment is modified
to simplify the example, this motif does occur several times in rRNA, as in
internal loop 20 (IL20) of the 16S rRNA.
This structure can easily be encoded in Excel in
the form that Ribostral can handle. Figure 29 shows how this file looks like:
Figure 29. The same information in Figure 28
(BPs observed in 3D structure) written into Excel in the form required by Ribostral.
Given this structural information, we can objectively
measure the quality of the alignment. All that
needs
to be done
is to load
both
the FASTA
file and the Excel file into Ribostral, and to
analyze the output files produced. We will first look at the HTML Alignment
Viewer file,
which
displays
the alignment
with nucleotides colored in a way describing how well they conform to isostericity
rules inferred from structure:
Figure 30. The HTML Alignment Viewer output,
showing areas
where the alignment is not consistent with structure (red and pink areas of Seq7).
So clearly, this alignment is not perfect, and it needs some
enhancement based on what is isosterically acceptable in 3D and what is not
(the
score of this alignment based on the formula described in section 3.2
above is 87).
The sequence
that is the cause of the problem is also clearly identified (Seq7). To fix
the alignment, we need to edit it in a sequence editor or text editor where
nucleotides can be moved by
hand (an automatic aligner based on Stochastic Context Free Grammar/Markov
Random Fields or SCFG/MRF is in preparation; Mokdad, A., Sarver, M., Stombaugh,
J., Zirbel, C., Leontis, N.B.). Ribostral also provides some insight on how
the alignment
should be fixed through its "_Cov.html" output files (where a substitution
table for each BP position is provided with isosterically inspired colored).
In this
particular case, and because of the specific BP interactions formed in
this motif, a small manipulation of Seq7 second part (after the dots) creates
an insertion, but fixes problems with forbidden and heterosteric BPs.
After this manipulation, the resulting alignment was viewed again with the
HTML Alignment Viewer, and Figure 31 shows the results:
Figure 31. The corrected alignment as seen
again with the Alignment Viewer.
Obviously, Figure 31 shows an improvement in the alignment (new score = 96), eventhough it created an insertion in Seq7. The manipulation also created an addition gap on the far right of Seq7, but this may not be a true gap because depending on the unseen nucleotides that are present beyond this motif (remember, for simplicity the alignment in this example is just an small extract from a bigger alignment). The reason the alignment was enhanced to this degree is the fact that the CUG- at the end of Seq7 fit better the isostericity rules of the tSH, tWH, tHS, and cWW BPs they now occupy (respectively), as compared to the GCUG.
References:
1. Hall, T.A. (1999). BioEdit: a user-friendly biological sequence alignment
editor and analysis program for Windows 95/98/NT. Nucl. Acids. Symp. Ser.,
41, 95-98.
2. Leontis, N.B. and Westhof, E. (2001). Geometric nomenclature and classification
of RNA base pairs. RNA, 7, 499-512.
3. Jossinet, F. and Westhof, E. (2005). Sequence to Structure (S2S): display,
manipulate and interconnect RNA data from sequence to structure. Bioinformatics,
21, 3320-3321.
4. Leontis, N.B., Stombaugh, J. and Westhof, E. (2002). The non-Watson-Crick
base pairs and their associated isostericity matrices. Nucleic Acids Res, 30,
3497-3531.
5. Lemieux, S. and Major, F. (2002). RNA canonical and non-canonical base pairing
types: a recognition method and complete repertoire. Nucleic Acids Res, 30,
4250-4263.
6. Mokdad, A., Krasovska, M.V., Sponer, J. and Leontis, N.B. (2006). Structural
and evolutionary classification of G/U wobble basepairs in the ribosome. Nucleic
Acids Res, 34, 1326-1341.
7. Sarver M., Zirbel C., Stombaugh J., Mokdad A., Leontis N. (2006). Finding
Local and Composite Recurrent Structural Motifs in RNA 3D Structure. Journal
of Mathematical Biology (accepted into special RNA issue).
Citation:
Please cite the following work if you use
any component of Ribostral or the nomenclatures proposed in it:
Mokdad, A., and Leontis, N. (2006). Ribostral: An RNA 3D alignment analyzer
and viewer based on basepair isostericities. Bioinformatics, 22(17): 2168-70.