next up previous contents
Next: 10 Related programs Up: SAM (Sequence Alignment and Previous: 8 Regularizers and models   Contents

Subsections


9 The buildmodel estimation process

A detailed discussion of the estimation process can be found in ``Hidden Markov models in computational biology: Applications to protein modeling,'' mentioned in the Introduction. This section provides an overview of the mechanics of model estimation.

After the sequences have been divided into training and test sets, and the initial model or models have been created, buildmodel will iteratively train the model using expectation-maximization (EM). For each iteration, a comment line (beginning with a percent sign `%') is written to the output file (e.g., test.mod) that includes the iteration number and the average NLL distance between the set of training sequences and the model. Iterations continue until either an iteration gains less improvement than the stopcriterion (and noise is less than 10% of its starting value) or re-estimates iterations have been performed. When multiple models are being trained (but not multiple subfamily models) training on each model is stopped individually when that model reaches the stopcriterion (provided noise is less than one tenth its initial value).

If surgery and multiple initial models are used, one model is selected for the surgery procedure, which will attempt to prune and grow the model as appropriate. After each surgery procedure (up to nsurgery), the re-estimation process is repeated. Once either the limit on the number of surgeries is reached, or the surgery parameters produce no model modifications, the training procedure is complete.

After the model has been trained, the NLL scores for the test set are computed and reported, and the final model is written to the output file. This model file may be used as an input file to further refine the model, perhaps by setting the stopcriterion to a smaller value.


9.1 Noise and annealing

It is possible to add noise to the initial model(s). By setting initial_noise to a positive number that amount of noise is added to a model in the beginning of the program. It serves the important purpose to make models differ, if the program runs many models simultaneously -- each model will have a different noise added.

To try to avoid local minima, one can add noise to the models during their estimation, and decrease the noise level gradually in a technique similar to the general optimization method called simulated annealing. The initial level of the noise in this annealing process is called anneal_noise. If anneal_noise is greater than 0, annealing is performed. (If initial_noise is also given, that will determine the noise for the first iteration, and anneal_noise the noise in the following iterations.) During the estimation process the annealing noise is decreased by a speed determined by anneal_length. There are two ways it can be done:

Linearly:
If anneal_length is greater than or equal to 1, the noise is decreased linearly to zero in anneal_length iterations by the formula

\begin{displaymath}
\mbox{noise} = \mbox{\tt anneal\_noise}(1-\mbox{number\_of\_iterations}/\mbox{\tt anneal\_length})
\end{displaymath}

Exponentially:
If anneal_length is less than 1, the noise is decreased exponentially by multiplying the noise with anneal_length in each iteration, which gives the noise

\begin{displaymath}
\mbox{noise} = \mbox{\tt anneal\_noise} *
\mbox{\tt anneal\_length}^{\mbox{\scriptsize number\_of\_iterations}}.
\end{displaymath}

In the exponential schedule, noise injection is halted when the amount of noise reaches 10% of its initial value.

Once the noise level has been calculated, there are three possible ways noise can be added, as controlled by whether randomize is positive, negative, or zero.

positive
A set of randomize random paths are calculated through the model according the the regularizer probabilities. Each of these sequences is weighted by the amount of noise. These sequences are added to the counts generated by the real frequencies, thus the noise setting is somewhat dependent on the number of sequences being trained.

negative
A set of $-$randomize random paths are calculated through the model. With a weight of $-\mbox{noise}/\mbox{\tt randomize}$, these counts are added to the normalized (probability, rather frequency) model, and than the model is renormalized. Thus, the noise generation is similar to that of the first case, but total noise added is independent of both the number of sequences and the randomize setting.

zero
For each probability parameter in the model, a random number between 0 and the corresponding parameter in the normalized regularizer is found. This number is scaled by the level of the noise, given for instance by initial_noise, and added to the probability in the model. After doing this for the whole model, all the probabilities are normalized. For example, if the noise is a random number between 0 and 2, random pseudo counts corresponding to up to two sequences will be added to each transition and each match state. This is the fastest means of noise generation.

Note that the annealing schedules are ad hoc. Still, according to our experience even fast and crude annealing generally improves performance. By default, exponential noise at a ration of 0.8 is used with no initial_noise, an anneal_noise of 5, and a randomize setting of 50 (corresponding to 50 random sequences). These values were chosen experimentally (see the Hughey and Krogh paper mentioned in the introduction).

After a model has been created, adding too much noise to the model may eliminate all the trained information. Therefore, if an initial model or an initial alignment is specified, noise (initial_noise or anneal_noise) is reduced from the default setting by a factor of retrain_noise_scale, which has a default of 0.1. Thus, the effective noise during a retraining would be 0.5 rather than 5. The same is true of surgery iterations, discussed in the next section. In this case, the starting noise of the re-estimation process after a surgery, whether or not an initial model is specified, is the anneal_noise scaled by surgery_noise_scale parameter, which also has a default of 0.1.

A second annealing option, one based on slowing increasing the weights of the sequences being trained is discussed in Section 9.4.2.


9.2 Surgery

It is often the case that during the course of learning, some match states in the model are used by few sequences, while some insertion states are used by many sequences. Model Surgery is a means of dynamically adjusting the model during training.

Surgery will be nsurgery times: a full re-estimation process is performed including reestimates re-estimations, or until the stopcriterion is reached. By default, as in the tRNA example above, surgery is performed up to two times.

The basic operation of surgery is to delete unused match states and to insert match states in place of over-used insert states (the special node types described in Section 8.4.2 are never subjected to surgery modification). In the default case, any match state used by less than one half of the sequences is removed, forcing those sequences to use an insert state or to significantly change their alignment to the model.2 Similarly, any insert state used by more than half sequences is replaced with a number of match states approximating the average number of characters inserted by that insert state.

The surgery heuristic can be adjusted with one parameter or with three. In the first case, setting mainline_cutoff to a number other than the default 0.5 will indicate how much non-match, or main line, activity will be accepted. For example, a setting of 0.25 indicates that any match state used by less than one quarter of the sequences should be removed, while any insert state used by more than one quarter of the sequences should be expanded into a number of match states approximately equal to the average number of characters emitted by that state.

For finer tuning of the surgery process, the parameters cutmatch, cutinsert, and fracinsert, can be used. During surgery, any match state with a smaller portion of sequences than cutmatch is removed, and any insert state with a higher portion of sequences than cutinsert is replaced by the average number of characters emitted by that insert state multiplied by fracinsert. By default, fracinsert is 1.0, and cutmatch and cutinsert are both equal to mainline_cutoff.

These parameters can be set in ways that cause large amounts of surgery. For example, setting cutmatch to 0.5 and cutinsert to 0.25 will delete any match state used by fewer than half the sequences, and insert match states for any insert node used by greater than one quarter of the sequences. Typically, this will result in an oscillating model in several positions -- those positions used by more than one quarter and less than one half of the sequences. Such excessive surgery can sometimes aid in forming a good model.


9.3 Training statistics

In addition to the trained model, a report of the training procedure is included in buildmodel's output. The comment sections of this file for the training example in the introduction is reproduced below.


%  SAM: buildmodel v3.3.1 (December 20, 2001) compiled 12/27/01_15:14:02
%  (c) 1992-2001 Regents of the University of California, Santa Cruz
%
%          Sequence Alignment and Modeling Software System
%         http://www.cse.ucsc.edu/research/compbio/sam.html
%
% --------- Citations (SAM, SAM-T99, HMMs) ----------
% R. Hughey, A. Krogh, Hidden Markov models for sequence analysis:
%  Extension and analysis of the basic method, CABIOS 12:95-107, 1996.
% K. Karplus, C. Barrett, R. Hughey, Hidden Markov models for detecting
%  remote protein homologies, Bioinformatics 14(10):846-856, 1998.
% A. Krogh et al., Hidden Markov models in computational biology:
%  Applications to protein modeling, JMB 235:1501-1531, Feb 1994.
% -----------------------------------
%  test   Host: perch    Thu Dec 27 16:02:45 2001
%  rph    Dir:  /projects/kestrel/rph/sam32/SAMBUILD/perch/demos
% -----------------------------------
% Internal sequence weighting method 1
% Regularizer FIM_method_train training letter frequencies (1)
% Regularizer Insert_method_train training letter frequencies (1)
% Model initial FIM_method_train training letter frequencies (1)
% Model initial insert_method_train training letter frequencies (1)
% Generic, Insert, and FIM tables dynamically reset to 
%   train_reset_inserts geometric mean of match probabilities (6)
% All models generated from regularizer.
% Sequence-model (global) (SW = 0)
% Sequence-model (global) (SW = 0)
% Sequence-model (global) (SW = 0)
%
%  Model lengths:    79  77  71
% Itera-   Average
% tion     distance
%     1   105.476   105.322   108.176
%     2   104.216   104.560   104.437
%     3   103.944   103.390   103.529
%     4   103.810   103.632   104.093
%     5   104.238   102.931   103.757
%     6   104.922   103.555   104.190
%     7   104.146   102.260   103.825
%     8   103.410   103.806   103.212
%     9   102.292   100.572   101.983
%    10   101.249   100.869   101.556
%    11   100.166   100.650   100.988
%    12    99.637    97.719    99.634
%    13    98.986    96.319    96.970
%    14    96.642    94.471    95.979
%    15    80.853    72.673    75.154
%    16    74.963    69.673    70.740
%    17    71.174    68.998    69.168
%    18    70.187    68.709    68.662

%    19    69.642    68.457    68.336
%    20    69.402    68.197    68.177
%    21    69.271    68.003    68.077
%    22    69.164    67.803    68.001
%    23    69.061    67.615    68.001
%    24    68.978    67.482    68.001
%    25    68.978    67.339    68.001
%    26    68.978    67.189    68.001
%    27    68.978    67.040    68.001
%    28    68.978    66.861    68.001
%    29    68.978    66.697    68.001
%    30    68.978    66.629    68.001
% Model 1 (counting from 0) wins!!
% -----------------------------------
%  TRAIN 10 sequences (average length 74).  Distance statistics...
%     Min:-41.36 (  2) Max:-34.91 (  6) Ave:-37.52 SampDev:  1.97 
%     MinSDs:  -1.95  MaxSDs:   1.33
%
%  Total CPU time:  user   0: 0: 2    system   0: 0: 0
%  Finished at:  Thu Dec 27 16:02:50 2001
% -----------------------------------
%
% Parameters from command line and insert files: 
%
% trainseed 4989
% randseed 0
train /projects/kestrel/rph/sam32/demos/trna10.seq
a RNA
print_frequencies 1
%
% -----------------------------------
% -----------------------------------
MODEL -  Final model for run test
Here, the initial information includes program version and run information. In this case, no initial models were specified, so buildmodel created 3 models (the default value of Nmodels) from the regularizer of the specified lengths. Next, all three models are trained, and the one with the best score is selected. In this case, the best model did not require any surgery, so the process was complete. If the best model did need surgery, that single model would be further refined. Next, statistics on the scores of the training sequences (and test sequences, if present) and CPU time are presented, followed by non-default parameter settings. The randseed entry is commented out to prevent any future training iterations from reusing the old seed. Finally, the model, along with its generic, letter count, and frequency average nodes is printed. The model has not been included in the example.

If the many_files variable is set, then the results of buildmodel are broken up into three files: the .stat file contains the run statistics and parameter settings, the .mod file contains the final model, and the .freq file contains the frequency model if print_frequencies is set.


9.4 Weighted training

SAM is able to perform a variety of weighted training options. Sequence weighting is particularly important when, as normal, the sequence data given to SAM is biased toward some type or subfamily of sequences (for example, from those organisms that have been most studied). Prior to Version 2.0, the SAM software system did not include any internal sequence weighting schemes, but could use weights generated by some other program. Version 2.0 added two internal weighting methods described below, the first of which is turned on by default when external weighting is not used. Version 3.0 includes internal weighting of alignments with several algorithm choices, which is the preferred form of sequence weighting.

9.4.1 External weighting

For all external sequence weighting options, a sequence weights file is specified with either the sequence_weights variable (for buildmodel training sets) or the alignment_weights variable (for buildmodel or modelfromalign initial alignments). In this file, any line starting with a percent sign (%) is ignored as a comment. The first non-comment line is presumed to be a description of the weights file, for example including the program that generated the sequence weights. The next non-comment line contains two integers, the number of weighted sequences and the number of weighted subfamilies. Remaining uncommented lines consist of a sequence identifier, white-space, and floating-point sequence weights, one per family. Weights can be positive, negative, or zero, and need not sum to one. If a sequence does not have a corresponding weight, its weight is set to 1.0 and a message is printed. If a weight does not have a corresponding sequence, a message is printed. Sequences and weights do not need to be in the same order within their respective files.

For plain sequence weighting, the number of families is set to 1, and each sequence is assigned a single weight in the sequence_weights file. During the re-estimation cycle, the frequency counts for each sequence will be multiplied by its weight.

Sequence weighting is particularly important in database discrimination: without sequence weighting, the model may specialize to an over-represented subset of the sequences, meaning that family members that do not happen to be in that sub-family will receive low scores.

The file gseg4.seq contains the initial 70-character segments of each of 4 globins. The last three are quite similar.


;
BAHG$VITSP
mldqqtiniikatvpvlkehgvtitttfyknlfakhpevrplfdmgrqesleqpkalamtvlaaaqnien
;
GLB$APLJU
alsaadagllaqswapvfansdangasflvalftqfpesanffndfkgksladiqaspklrdvssrifar
;
GLB$APLKU
slsaaeadlvgkswapvyankdadganfllslfekfpnnanyfadfkgksiadikaspklrdvssriftr
;
GLB$APLLI
slsaaeadlagkswapvfanknangadflvalfekfpdsanffadfkgksvadikaspklrdvssriftr
When a model is trained on this file without weighted training, the model is pecialized to the latter group of sequences, resulting in the following scores:

GLB$APLKU          70      -142.44     -135.82     4.15e-59
GLB$APLLI          70      -142.21     -134.61     1.38e-58
GLB$APLJU          70      -135.59     -128.07     9.55e-56
BAHG$VITSP         70      -101.36      -94.29     4.49e-41

The following simple weight file is an attempt to correct this bias:


% gseg4.weights
Weight file for the simple globin segment example.
% 4 sequences and 1 family
4 1
BAHG$VITSP  2.0
GLB$APLJU   0.66
GLB$APLKU   0.66
GLB$APLLI   0.66

 

Note that in this weight file, to make the results of the two examples comparable, the weights were made to sum to 4. The reason for this is that in addition to sequences, the regularizer (provided the various confidence parameters are non-zero) shapes the model. Setting all sequence weights uniformly high (e.g., 100.0) will have a similar effect to setting all the regularizer confidences to 0.

With the simple weight file, the following scores are produced.


BAHG$VITSP         70      -134.24     -126.82     3.34e-55
GLB$APLKU          70      -118.19     -111.71     1.23e-48
GLB$APLLI          70      -118.40     -111.26     1.91e-48
GLB$APLJU          70      -111.35     -104.36     1.90e-45
Here, with the three similar sequences weighted less, the model better matches (perhaps too much) the dissimilar sequence.

The above example is definitely a toy problem: weights must be set using statistically and biologically valid means.


9.4.1.1 Multi-subfamily weighting

Warning: This feature is not completely available or completely debugged.

A particularly interesting use of weighting schemes is when a family of sequences can be divided into several subfamilies. This special type of training is used whenever the number of families in a weights file is greater than one.

In this case, SAM will train one model per family in parallel so that each model can specialize to its subfamily. Although this sounds just like training each subfamily separately, there is an important difference. During the regularization procedure, the counts across all subfamily models are taken into account when re-estimating each subfamily's model. This means that, in the case of multiple alignments, a full-family multiple alignment can be generated by aligning each sequence to its appropriate subfamily model and combining the results. At the moment, subfamily modeling is not fully implemented and not recommended for use.

There are a few changes in the functionality of buildmodel when subfamily modeling is used. First, only a single suite of subfamily models is trained at a time, so multiple runs must be performed to match the functionality of starting with more than one model and selecting the best. Second, prior libraries must be used. Third, a more conservative approach to model surgery is taken. The subfamily models are always modified in parallel, and only if all the subfamily models agree on the surgery procedure (if all subfamily models believe inserting new model positions is appropriate, the minimum of all proposed insertion lengths is used). To encourage more surgery, users may wish to lower the surgery thresholds when training with multiple models. See Section 9.2.

Also, model files are treated somewhat differently. The many_files option is always turned on. The subfamily models are writing to files named, for example, runname.3.mod, where the number indicates which subfamily (starting from zero) that model is for. It is possible to retrain a suite of subfamily models by setting family_base_file to the root name of the suite of models (i.e., runname in the above example).

The hmmscore program does not yet score against multiple models--to perform database search against a suite of models, hmmscore must be run independently for each subfamily, and then the results combined by, for example, classifying each sequence as a member of the subfamily with which it scored best.


9.4.2 Annealing with Weights

Sequence weights can also be an effective means of annealing (Section 9.1) during the training process. When using this option, the sequence weights are slowly increased over the first several re-estimation cycles. Thus, at first, the sequences will have little effect on the model for the next training iteration: the regularizer and prior library will dominate, though particularly strong signals in the sequences, such as strongly conserved regions, will show through. As the training continues, the sequence weight multiplier is brought up to its final value, giving full weight to the sequences.

The annealing schedule options are similar to that available with noise generation. The relevant parameters are weight_length, which indicates how long the annealing should last, and weight_final, indicating the final sequence weight multiplier (the default is 1.0). The sequence weight multiplier found be the formulas below is multiplier by the sequence weight (which is 1.0 if no weight file is used in training) to find each sequence's weight during the given re-estimation iteration.

Linearly:
If weight_length is greater than or equal to 1, the weight multiplier is increased linearly to weight_final in anneal_length iterations by the formula

\begin{displaymath}
\mbox{multiplier} = \mbox{\tt weight\_final} *
\mbox{number\_of\_iterations} / \mbox{\tt weight\_length}
\end{displaymath}

Exponentially:
If weight_length is less than 1, the multiplier is increased exponentially until the multiplier reaches 90% of its final value as follows:

\begin{displaymath}
\mbox{multiplier} = \mbox{\tt weight\_final} *
(1.0-\mbox{\tt weight\_length}^{\mbox{\scriptsize number\_of\_iterations}}).
\end{displaymath}

Sequence weights are never zero: the first re-estimation cycle uses the first non-zero value of the weight formula.


9.4.3 Internal weighting of alignments

Version 3.0 of SAM incorporates the weighting methods used in the SAM-T99 remote homology detection method. See Section 4.. Relevant parameters include: aweight_method, for which 0 indicates no internal alignment weighting, 1 indicates Karplus relative weighting, 2 Henikoff relative weighting, and 3 flat relative weighting; aweight_bits indicating the target number of bits to save per column; and aweight_exponent indicating the weighting exponent. The Dirichlet prior_library and alignfile specified for buildmodel or modelfromalign are the remaining components of the method.

For the Henikoff and flat weighting schemes, if aweight_bits is set negative, the initial weighting is used, without adjusting the total weights to get a specified number of bits saved. The initial weighting in these schemes has a total weight of

\begin{displaymath}\mbox{number\_sequences}^{\mbox{\tt aweight\_exponent}}.\end{displaymath}

By default, internal weighting of alignments is turned on, with Henikoff relative weighting, 0.5 target bits per column, and a 0.5 exponent. If an alignment_weights file is specified, the external weights are used, and unless aweight_method is set to zero, a warning message is printed.


9.4.4 Internal weighting during training

Version 2.0 of SAM introduced two methods of internal weighting during buildmodel training. These methods are based entirely on the log-odds score of the sequence against the model being trained. Their invention was motivated by HMMer's maximum-discrimination training method. In general, these methods do not produce as good results as the alignment-based weighting.

Given a linear hidden Markov model $M$, a dynamic programming calculation can be used to calculate, for a given sequence, the probability that sequence $a$ was generated by the model, $P(a\vert M)$. The question of interest is, however, does that model match the sequence? That is, is the sequence more likely to be generated by the hidden Markov model than some other, less structured null model, $\phi$. Making the assumption that the models $M$ and $\phi$ are a priori equally likely, this reduces to the log-odds probability of

\begin{displaymath}
P(M\vert a) = \frac{p(a\vert M)}{p(a\vert M)+p(a\vert\phi)}.
\end{displaymath}

Typically, a log-odds score $S$ is used instead [2]:

\begin{eqnarray*}
S_a&=&\ln\frac{P(a\vert M)}{ P({\phi}\vert a)}\\
P(M\vert a)& =& \frac{1}{1+e^{-S}}
\end{eqnarray*}



This score measures whether the probability that the sequence was generated by the model is greater than the probability it was generated by a null model. This log-odds score is used to calculate the weight of a sequence. As model training iterations proceed, sequences that poorly match the model (i.e. with poor log-odds scores) are given higher weight.

Internal weighting method 1 uses the following equation to calculate a sequence weight.

\begin{displaymath}
W = e^{ (w-S) * (\frac{\ln K}{(w-b)} )}
\end{displaymath}

$S$ is the log-odds score of the sequence. The program keeps track of the current worst score $w$ and the current best score $b$ and these are used to decide on the two extreme weights. The worst scoring sequence will have a weight of 1, while the best scoring sequence will have a weight of $K$, typically in the 0.01 to 0.1 range. $K$ is a user-controlled parameter entered on the command line as $iweight$.

To use method 1 with $K$=.1, run buildmodel with the following arguments:


 buildmodel runname -train train.seq -internal_weight 1 -iweight .1 

Internal weighting method 2 is a variation of method 1. When using method 1, sequences with very poor scores may get excessively large weights. Method 2 modifies the weights of such outlier sequences. If a sequence scores so badly that it exceeds the median score by three standard deviations, it is weighted with a decreasing linear weight function, reaching a minimum of 1.0 for the sequence with the worst score.

Method 2:


\begin{displaymath}
W = e^{ ( w-S * \frac{\ln K}{(w-b)} )}
\end{displaymath}

is applied to scores no more than 3 deviations below the median, while

\begin{displaymath}
W = 1 - \frac{S-w}{b-w}
\end{displaymath}

is applied to the remaining sequences.

To use method 1 with $K$=0.05, run buildmodel with the following arguments:


 buildmodel runname -train train.seq -internal_weight 2 -iweight 0.05 

The current SAM default is to not use internal weighting. If internal weighting is selected and no iweight parameter entered, SAM defaults to an iweight of 0.1.

Looking again at the toy problem example demonstrated in the previous section, we saw the following scores when a model was trained on 4 globins without weighting.


GLB$APLKU          70      -142.44     -135.82     4.15e-59
GLB$APLLI          70      -142.21     -134.61     1.38e-58
GLB$APLJU          70      -135.59     -128.07     9.55e-56
BAHG$VITSP         70      -101.36      -94.29     4.49e-41

The unweighted model is overspecialized.

Internal weighting method 1 produces these scores:


GLB$APLKU          70      -107.98      -99.96     1.56e-43
GLB$APLLI          70      -109.40      -99.94     1.57e-43
GLB$APLJU          70      -106.94      -98.36     7.65e-43
BAHG$VITSP         70      -103.21      -95.55     1.27e-41

Internal weighting method 2 produces these scores:


GLB$APLLI          70      -111.28     -101.90     2.23e-44
GLB$APLKU          70      -109.35     -101.45     3.49e-44
GLB$APLJU          70      -109.54     -101.03     5.34e-44
BAHG$VITSP         70      -104.57      -96.92     3.23e-42

When using internal weighting, you can inspect all sequence weights generated during each iteration of the model building process.


buildmodel runname -train train.seq -internal_weight 2 -iweight 0.05 -print_all_weights 1

The print_all_weights option when set to 1 will produce a weight output file once per iteration. The files are named runname1.weightoutput, where 1 is the iteration number.

By default, print_all_weights is set to off.

Continuing the example of Figure 14 on page [*], performing the two commands:


buildmodel train4w -train globins50.seq
            -randseed 0 -trainseed 0 -ntrain 4 -internalweight 2
buildmodel train4wreg -train globins50.seq
            -priorlibrary recode1.20comp 
            -regularizerfile weak-gap.regularizer 
            -randseed 0 -trainseed 0 -ntrain 4 -internalweight 2   
hmmscore train4w  -i train4w.mod  -db globins50.seq -sw 2
hmmscore train4wreg -i train4wreg.mod -db globins50.seq -sw 2
results in the score histograms of Figure 15, in which the scores improve even further from the use of regularizers and Dirichlet mixtures.

Figure 15: Weighting performance
\begin{figure}\small\begin{tabular}{cc}
\psfig{figure=train4reg.ps,width=0.43\te...
...
Ntrain=10, Dirichlet, weak-gap, internal weighting\\
\end{tabular}\end{figure}


9.5 Viterbi training

For increased speed and performance (and possible worse results), Viterbi training is now possible (though in some instances not robust). The dpstyle parameter should be set to 1 for Viterbi training, rather than the default 0, which indicates EM training. It can also be set to 4 or 5 to train based on the posterior-decoded alignments (these options will more than double computation time and are not recommended). See Section 10.1. Two other values should only be used for scoring. Setting dpstyle to 2 will produce a very large dump of the posterior probabilities of the entire dynamic programming table in the file runname.pdoc. This option should be used in conjunction with the grabdp program. See Section 10.4. Setting dpstyle to 3 will produce a frequency dump for each sequence against each model as scored or aligned, using the EM dynamic programming method (also for use with grabdp). Files will be of the form runname.id.freq.

The related adpstyle parameter can be used to choose beteen Viterbi (1), posterior-decoded alignment based on transitions and character emission posteriors (4), and posterior-decoded alignment based solely on emission posteriors (5) for alignments and multiple domain alignments. See Section 10.1.

Future research will help determine the usefulness of these option. Fragment training (Section 10.1.2) with the SW variable can also be used with both the Viterbi and EM training algorithms, but has not been completely evaluated.


9.6 Building models with constraints

SAM HMM constraints provide a mechanism for associating position in a sequence with a specific node in a model. During model training, a constrained position remains associated with its node, allowing the remaining portions of the sequence to align naturally to the model. This serves as a method of incorporating prior knowledge about the training sequence, such as structurally similar regions.

Currently, there is an inconsistency in scoring constraints, in that for EM-style scoring, the constrained score is reported, which will generally provide a very strong signal. For Viterbi or posterior-alignment scoring, constraints are used to find the best path, but the unconstrained score of that path is reported, and thus the score of a constrained sequence is similar to that of an unconstrained sequence.

9.6.1 Model Node Labels

The node a residue is constrained to is specified using a node label. A node label is a unique, positive integer assigned to a node. Often it is the same as the node number, however this is not required. This is useful for allowing model surgery to add or remove a node. It is specified using the NODELABELS declaration in a model (see Section 8.4.3).

Node labels must be increasing in value from from left to right in the model. Labels are normally specified as contiguous numbers for a set of contiguous nodes to allow for easier constraint definition for contiguous residues. Labeled nodes will not be removed by model surgery; it is possible to have nodes added between labeled nodes.

When creating a model from an alignment, SAM will define node labels automatically.

9.6.2 Constraints Definition File

Constraints for individual residues are specified in line-oriented constraint definition files separate from the sequence databases. By convention, they have an extension of .cst. These files are specified using the -constraints option, which may be specified multiple times. Certain SAM programs can create new constraint definition files using the -constraints_out.

Lines containing only whitespace and those where the first non-whitespace character is ``#'' are comments and ignored. For all types of lines, leading white spaces are ignored.

The file is divided into sections starting with a line containing only the section name followed by a colon. Currently there is only one type of section containing sequence constraints, starting with


    constraints:

Multiple constraints sections may appear, allowing for the easy combining of files.

A constraint entry defines the model nodes to which residues of a sequence are constrained. Multiple entries may appear for a sequence, however a given residue of a sequence may be constrained only once. Constraints entries consists of lines in the form:


    seqid constrdef, constrdef, ...

Where the seqid is any sequence identifier consisting of non-whitespace characters and constrdef defines a set of model node labels for a range of positions in the sequence in any combination of the following formats:

Positions in sequences are identified by one-based indices and node labels are positive numbers.

A simple example of a constraint file is shown here.


constraints:
PROT1   10-30: 8, 80-82: 100
PROT2   12-28: 10, 62: 100
PROT3   16-32: 12

9.6.3 Using Constraints

To train a new model using constraints, specify one or more constraints file using the -constraints option to buildmodel. The node labels in the constraints file are assigned to nodes with the same node number. If necessary, the length of the model will be increased to allow for all labels. If surgery is used, an unlabeled node may be added or removed, resulting in the labels and node numbers differing in the final model. When retraining a model using constraints, the model must already contain labels. Constraints work with both EM and Viterbi

When creating a model from a multiple alignment with modelfromalign or using the -alignfile parameter to buildmodel, constraints can be created for specific positions or for all match positions. Constrained positions are specified using -constraints option. Model labels need not be specified and are ignored if they are. The residues specified in the file will be constrained to the nodes to which they are aligned. Insert sequences positions will not be constrained. To constrain all match positions in an alignment without creating a constraints definition file, use -constraints_from_align. With either approach, a new constraints definition file maybe required as input to other SAM programs. A new constraints file for the alignment sequences can be created using the -constraints_out option.

Constraints can be used when creating multiple-domain alignments with the hmmscore program (See Section 10.2.4.). The constraints for the sequence being aligned are specified with the -constraints option. The model must have node labels that correspond to the constraints file; thus the constraints file are normally the ones used to train the model or created when the model was defined from an alignment. If the extracted domains are to be processed by another SAM program, a constraint definition file for the domains can be written to a file specified by the -constraints_out option.


next up previous contents
Next: 10 Related programs Up: SAM (Sequence Alignment and Previous: 8 Regularizers and models   Contents
SAM
sam-info@cse.ucsc.edu
UCSC Computational Biology Group