Learning a Bayesian Network with BioBayesNet web server

  1. Input data
  2. Feature selection
  3. Bayesian network model learning
  4. Classification of new data
  5. Workflow scheme of BioBayesNet web-server

Input data

There are two ways to provide the data from which our web application learns a network structure:
  1. If you have biological sequences ⇒ Uploading of sequential dataset
  2. If you want to upload your own data ⇒ Uploading of feature vectors

Uploading of sequential dataset

In the case of the biological sequence dataset you have to upload sequences in FASTA format. The FASTA header must contain a sequence name and a class label. The class label is specified with class(A), where A can be any class label.

Optionally, you can specify a subsequence, for example a protein binding site within an entire promoter sequence, with site(a,b), where a and b give its location, assuming that the sequence starts at position 1. If no site(a,b) is given, the whole sequence is considered. To specify a subsequence on the antisense strand, a should be greater than b (e.g. site(20,10)). NOTE: The site label allows you to use relative positions in the feature generation step (step 2). For example, the interval -5 to -1 refers to the 5 nucleotides upstream of the subsequence specified with site(a,b).

>HS_HIST1H4C (1 - 72) site(31,41) class(background)
tcaaggggattagccttattttaacaccaataatcttatcacaaaaacctcccagaggaagaccctgtagat
>R12978 site(31,41) class(Sp1)
ggaaagtgaagtggccaccagcccaaatcaggggaggagggcaggaatgcccctgctgggaaggggctggt
>HS_DKFZp564K142 (1 - 72) site(31,41) class(background)
cgaggcgggcggatcacctgaggtcaggagttcgagaccagcctgcccaacatggcgaaaccccgtctctac
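
The header convention above can be checked before uploading. The following is a minimal sketch in Python, assuming the header layout shown in the example; the function names parse_header and parse_fasta are hypothetical helpers for illustration and are not part of BioBayesNet.

```python
import re

# Hypothetical helpers (not part of BioBayesNet): read FASTA headers of the
# form ">name ... site(a,b) class(label)" as described in the text above.
SITE_RE = re.compile(r"site\((\d+)\s*,\s*(\d+)\)")
CLASS_RE = re.compile(r"class\(([^)]+)\)")


def parse_header(header):
    """Return (name, site, label); site is None if no site(a,b) is given."""
    text = header.lstrip(">").strip()
    name = text.split()[0]
    class_match = CLASS_RE.search(text)
    if class_match is None:
        raise ValueError("missing class(A) label in header: " + header)
    site_match = SITE_RE.search(text)
    site = None
    if site_match is not None:
        # a > b denotes a subsequence on the antisense strand (see text).
        site = (int(site_match.group(1)), int(site_match.group(2)))
    return name, site, class_match.group(1)


def parse_fasta(text):
    """Return a list of (name, site, label, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append(parse_header(header) + ("".join(chunks),))
            header, chunks = line, []
        else:
            chunks.append(line)
    if header is not None:
        records.append(parse_header(header) + ("".join(chunks),))
    return records
```

A header without class(A) raises an error in this sketch, which mirrors the requirement that every training sequence carries a class label.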

After uploading you will be directed to a feature generation input form, where you can select a number of predefined sequence features which should be generated from your input sequences for the BN learning.

Feature generation from the uploaded sequence dataset

After having uploaded biological sequences you can define which features have to be calculated from these sequences. BioBayesNet provides several types of features, ranging from simple sequence positions to sequence-dependent structural parameters for DNA and RNA sequences (Fig.1.), namely:
  1. Nucleotides at certain positions
  2. DNA structural parameters
  3. RNA structural parameters
  4. Nucleotide content
  5. Consensus motif match


Fig.1. Feature generation for the sequence dataset in the step 1.2.

Features of groups 2, 3, and 4 describe continuous properties of sequences. To derive a finite feature range, the continuous ranges are discretized using the entropy-based, supervised discretization algorithm by Fayyad and Irani. This procedure finds a partition of the continuous range that best separates the different classes. After clicking on submit, you are directed to step 2 for feature selection.
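
To illustrate the idea behind entropy-based discretization, the sketch below finds a single cut point that minimizes the class entropy of the two resulting intervals. It is a simplification: the full Fayyad and Irani algorithm applies such splits recursively and uses an MDL-based stopping criterion, both of which are omitted here. The function names are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def best_cut_point(values, labels):
    """Return (cut, weighted_entropy) of the best binary split, or None.

    Only one split step is shown; recursion and the MDL stopping
    criterion of the full algorithm are left out of this sketch.
    """
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = None
    for i in range(1, n):
        low, high = pairs[i - 1][0], pairs[i][0]
        if low == high:
            continue  # no cut possible between equal values
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        score = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
        if best is None or score < best[1]:
            best = ((low + high) / 2, score)
    return best
```

If all values are identical, no cut exists and the function returns None, which corresponds to the case in which no meaningful discretization interval can be found.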



Nucleotides at certain positions

Features of this group all have the same range, namely the four nucleotides A, C, G and T. The feature for position i is the analogue of the ith column of a position weight matrix (PWM).
Please note: Positions are relative to the subsequence specified with site(a,b) and can be negative or positive. If site(a,b) is not given for a sequence, position 1 refers to the sequence start. In this case, you cannot use negative positions.
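
The relative numbering can be sketched as follows. This is an illustration under stated assumptions, not the server's implementation: it handles the sense strand only, treats relative position 1 as the first nucleotide of the site and -1 as the nucleotide directly upstream, and assumes that there is no position 0. The function name nucleotide_at is hypothetical.

```python
def nucleotide_at(sequence, rel_pos, site_start=1):
    """Return the nucleotide at a position relative to the site start.

    Assumptions of this sketch: sense strand only; relative position 1 is
    the first nucleotide of the site, -1 is the nucleotide directly
    upstream, and position 0 does not exist. site_start is the 1-based
    value a from site(a,b); it defaults to the sequence start.
    Returns None if the position falls outside the sequence.
    """
    if rel_pos == 0:
        raise ValueError("position 0 is not defined under this convention")
    offset = rel_pos - 1 if rel_pos > 0 else rel_pos
    index = (site_start - 1) + offset  # 0-based index into the sequence
    if index < 0 or index >= len(sequence):
        return None
    return sequence[index].upper()
```

Returning None for out-of-range positions reflects the remark that negative positions cannot be used when the site starts at the beginning of the sequence.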

DNA structural parameters

DNA structural parameters express the sequence-dependent local variation of geometrical or physicochemical DNA properties at a subsequence. Examples are the average helical twist between two base pairs, the DNA bendability, or the average melting temperature of the subsequence. A feature value for a subsequence is calculated as the mean over all dinucleotide steps in this subsequence. The dinucleotide values were taken from the literature (Ponomarenko et al.; el Hassan MA & Calladine CR). We provide 38 different DNA properties that can be calculated from a user-defined subsequence.
Please note: Positions are relative to the subsequence specified with site(a,b) and can be negative or positive. If site(a,b) is not given for a sequence, position 1 refers to the sequence start. In this case, you cannot use negative positions.
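
The averaging over dinucleotide steps can be sketched as below. The table toy_table contains placeholder numbers made up for illustration only; they are not published DNA parameters, and the function name is hypothetical.

```python
from itertools import product


def mean_dinucleotide_value(subsequence, table):
    """Mean of a dinucleotide property over all overlapping steps."""
    seq = subsequence.upper()
    steps = [seq[i:i + 2] for i in range(len(seq) - 1)]
    if not steps:
        raise ValueError("need at least two nucleotides")
    return sum(table[step] for step in steps) / len(steps)


# Placeholder values (0.0 for AA, 1.0 for AC, ..., 15.0 for TT).
# These are NOT real structural parameters, only a stand-in table.
toy_table = {
    "".join(pair): float(i)
    for i, pair in enumerate(product("ACGT", repeat=2))
}
```

A subsequence of length n contributes n-1 overlapping steps, so "ACGT" is averaged over the steps AC, CG and GT.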


RNA structural parameters

RNA single-strandedness measures the probability that a given RNA subsequence is completely single-stranded (i.e. not part of a secondary structure). It is computed with RNAup from the Vienna RNA package.
Please note: Positions are relative to the subsequence specified with site(a,b) and can be negative or positive. If site(a,b) is not given for a sequence, position 1 refers to the sequence start. In this case, you cannot use negative positions.


Nucleotide content

Subsequence nucleotide contents: These features measure the fraction of a subset of nucleotides in a user-defined subsequence. An example is the fraction of pyrimidines in the subsequence from position 10 to 20.
Please note: Positions are relative to the subsequence specified with site(a,b) and can be negative or positive. If site(a,b) is not given for a sequence, position 1 refers to the sequence start. In this case, you cannot use negative positions.
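
A nucleotide content feature can be sketched in a few lines. The example assumes absolute, 1-based, inclusive positions (the case without site(a,b)); the function name nucleotide_fraction is hypothetical.

```python
def nucleotide_fraction(sequence, start, end, subset="CT"):
    """Fraction of nucleotides from `subset` in positions start..end.

    Positions are 1-based and inclusive (an assumption of this sketch).
    The default subset "CT" corresponds to the pyrimidines.
    """
    if start < 1 or end > len(sequence) or start > end:
        raise ValueError("interval lies outside the sequence")
    window = sequence[start - 1:end].upper()
    wanted = set(subset.upper())
    return sum(1 for nucleotide in window if nucleotide in wanted) / len(window)
```

For instance, nucleotide_fraction(seq, 10, 20) would give the pyrimidine fraction for positions 10 to 20 mentioned above, and subset="GC" would give the GC content of the same interval.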


Consensus motif match

Consensus matches: Features of this group indicate whether a given subsequence matches a given consensus sequence. These features can take the values true or false. You can specify consensus sequences using IUPAC codes. Enter each consensus sequence on a single line in the text area, for example:
NAGHAG
YYNYY
Optionally, the search for consensus matches can be restricted to intervals relative to the site(a,b) given in step 1.1.
Please note: Positions are relative to the subsequence specified with site(a,b) and can be negative or positive. If site(a,b) is not given for a sequence, position 1 refers to the sequence start. In this case, you cannot use negative positions.
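
Matching against an IUPAC consensus can be sketched by translating each code into a character class of a regular expression. This is one possible approach, not necessarily the one used by the server; it searches the sense strand only, and the function names are hypothetical.

```python
import re

# Standard IUPAC nucleotide codes mapped to the nucleotides they stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT",
    "K": "GT", "M": "AC", "B": "CGT", "D": "AGT",
    "H": "ACT", "V": "ACG", "N": "ACGT",
}


def consensus_to_regex(consensus):
    """Translate an IUPAC consensus such as NAGHAG into a compiled regex."""
    return re.compile("".join("[" + IUPAC[code] + "]" for code in consensus.upper()))


def has_consensus_match(sequence, consensus):
    """True if the consensus occurs anywhere in the sequence (sense strand)."""
    return consensus_to_regex(consensus).search(sequence.upper()) is not None
```

To restrict the search to an interval, the sequence can be sliced before calling has_consensus_match.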



Uploading of feature vectors

In this case you have to upload two files in C4.5 format (Fig. 2.):

Fig.2. Dataset upload of user-defined features in two separated files in the step 1.

This option gives you full flexibility, but it requires you to generate your own features in advance. For example, you might upload pre-computed features of protein sequences and/or structures to analyze protein data.


C4.5 format: Definition file

The first line of the definition file has to contain the class labels. In Table 1 there are two classes, alternative and constitutive, for alternatively and constitutively spliced exons. The following lines of this file specify the feature names and their value ranges. Table 1 shows an example:
File 1: Definition file (class labels and feature names)
alternative, constitutive.
donor_splice_site_score: continuous.
exon_length: continuous.
flanking_intron_conservation: high, medium, low.
length_divisible_by_3: yes, no.

File 2: Sample file (data samples)
3.4, 100, high, yes, alternative.
-5.7, 67, medium, no, alternative.
7.4, 167, high, yes, alternative.
13, 231, low, no, constitutive.
9.5, 189, medium, yes, constitutive.
7.8, 345, low, no, constitutive.

Table 1. An example of user-provided feature vectors describing potentially discriminative features of alternatively and constitutively spliced exons. The first line of the first file has to contain the class labels (alternative and constitutive). The following lines of this file specify the feature names and their value ranges. Each line of the second file contains one data sample. The features are given in the order of the first file, and the class label is given at the end of each line. Every line of both files ends with a period.

You can use the keyword continuous if an attribute has non-discrete values. Continuous ranges will be discretized using the entropy-based, supervised discretization algorithm by Fayyad and Irani. This procedure finds a partition of the continuous range that best separates your classes.


C4.5 format: Sample file

Each line of the sample file is a comma-separated list of attribute values ending with a class label followed by a period. The attributes must be in the same order as in the definition file (Table 1).
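
Before uploading, the two files can be checked for consistency. The sketch below parses the simple layout shown in Table 1 and verifies that every sample has the right number of fields and only allowed values. It is an illustration with hypothetical function names, and it does not cover other constructs that the C4.5 format may permit (such as comments).

```python
# Contents of the two files from Table 1.
DEFINITION = """alternative, constitutive.
donor_splice_site_score: continuous.
exon_length: continuous.
flanking_intron_conservation: high, medium, low.
length_divisible_by_3: yes, no.
"""

SAMPLES = """3.4, 100, high, yes, alternative.
-5.7, 67, medium, no, alternative.
7.8, 345, low,no, constitutive.
"""


def parse_definition(text):
    """Return (classes, features); features is a list of (name, values)."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    classes = [c.strip() for c in lines[0].rstrip(".").split(",")]
    features = []
    for line in lines[1:]:
        name, values = line.rstrip(".").split(":", 1)
        features.append((name.strip(), [v.strip() for v in values.split(",")]))
    return classes, features


def parse_samples(text, classes, features):
    """Return a list of (feature_dict, class_label) pairs."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        fields = [f.strip() for f in line.rstrip(".").split(",")]
        if len(fields) != len(features) + 1:
            raise ValueError("wrong number of fields: " + line)
        *raw_values, label = fields
        if label not in classes:
            raise ValueError("unknown class label: " + label)
        row = {}
        for (name, allowed), value in zip(features, raw_values):
            if allowed == ["continuous"]:
                row[name] = float(value)
            elif value in allowed:
                row[name] = value
            else:
                raise ValueError("value %r not allowed for %s" % (value, name))
        samples.append((row, label))
    return samples
```

Calling parse_samples(SAMPLES, *parse_definition(DEFINITION)) returns one dictionary per sample together with its class label.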



Feature selection

After the data input, in step 2, the user can select which features are used to learn the BN. Some of the user-defined features may be omitted if they fall outside the sequence range for some samples or, in the case of continuous features, if the discretization procedure could not find meaningful discretization intervals. Apart from manual selection, this process is assisted by an automatic feature selection method, which selects the most discriminative features from all generated or user-provided features.

Fig.3. Feature Selection in the step 2.

This step also provides an overview of the value ranges and the empirical probability distribution, which could help you to select/unselect your desired features.



Bayesian network model learning

The features selected in step 2 are used in step 3 to learn a Bayesian classifier with TAN (tree-augmented naive Bayes) structure. After learning, the classification quality of the BN is evaluated. The result of BN learning includes
  1. two quality measures ⇒ Network quality measures
  2. download of the learned network ⇒ Bayesian network download
  3. graphical representation of the network structure ⇒ Bayesian network graphic
  4. final classification of your data ⇒ Classification of learning samples
  5. learning a new network ⇒ New learning
  6. classification of new data using this BN model ⇒ Classify new data
  7. download of the feature value dataset in C4.5 format ⇒ download feature dataset in C4.5 format




Network quality measures

Bayesian network quality measures are based on the information loss function and the average posterior probability of the correct class. The information loss is the sum of the logarithmic probabilities P(class(f)|f) over all feature vectors f. The average posterior probability is obtained by averaging the posterior of the correct class over all samples.
Furthermore, the power of each individual feature is estimated by computing the loss of quality when this feature is omitted during learning.
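
Both measures can be computed directly from the posterior of the correct class of each sample. The sketch below follows the description literally (a plain sum of log probabilities); the use of the natural logarithm and the sign convention are assumptions, since the text does not specify them, and the function name is hypothetical.

```python
import math


def quality_measures(posteriors, true_labels):
    """Return (sum of log posteriors, average posterior) of the correct class.

    posteriors: list of dicts mapping class label -> P(class | f)
    true_labels: list with the known class of each feature vector f
    Assumptions of this sketch: natural logarithm, no sign change.
    A posterior of exactly 0 for the correct class raises ValueError.
    """
    correct = [post[label] for post, label in zip(posteriors, true_labels)]
    log_sum = sum(math.log(p) for p in correct)
    average = sum(correct) / len(correct)
    return log_sum, average
```

The power of a single feature could then be estimated by learning the network once with and once without that feature and comparing the resulting values.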



Bayesian network download

The final Bayesian network can be downloaded as a file in the Bayesian Interchange Format (BIF-XML 0.2) for further data classification or for use in Bayesian network software such as JavaBayes.
Downloaded networks can be uploaded again in step 4 to classify new input samples/sequences.



Bayesian network graphic

The server also produces a graphical representation of the network structure, which allows the exploration of learned dependencies between the features (Fig. 4):

Fig.4. Bayesian Network Inference.

Besides this graphical overview, the distribution of a single feature given particular values of some of the other features (variables) is of interest. To this end, our tool allows you to set some variables to particular values and to query the a posteriori probability distribution of another variable given this setting.

Please note: If you use Internet Explorer and want to view web pages with SVG images:
1) You will need an SVG viewer plugin such as the one from Adobe.
2) After the installation, please click here to see your BN graph.



Classification of learning samples

Each of your samples/sequences has been classified using the final learned BN model. The true/false positive rates were calculated separately for each class using 10-fold cross-validation. A sample was counted as a true positive when the probability of its (known) class was higher than for any other class. Otherwise it was counted as a false positive.
The ROC curves for a certain class were calculated in the following way: For each sample of this class, the a posteriori probability for that class given the sample was used as classification threshold. If for any other sample this probability was above this threshold, it was counted as a true positive when this other sample really belonged to the considered class and as a false positive when it belonged to another class. This yields one true positive/false positive pair per threshold.
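
The ROC construction described above can be sketched as follows. The code is a literal reading of the description, with a hypothetical function name: thresholds are taken from the samples of the target class, the thresholding sample itself is excluded, and "above" is interpreted as strictly greater. It returns raw counts; rates are obtained by dividing by the number of samples inside and outside the class.

```python
def roc_points(posteriors, labels, target):
    """Return a list of (threshold, true_positives, false_positives).

    posteriors[i] is P(target | sample i); labels[i] is the known class.
    Each sample of the target class supplies one threshold. All other
    samples with a posterior strictly above it are counted as TP or FP.
    """
    points = []
    for i, (threshold, label) in enumerate(zip(posteriors, labels)):
        if label != target:
            continue
        tp = fp = 0
        for j, (p, other_label) in enumerate(zip(posteriors, labels)):
            if j == i or p <= threshold:
                continue
            if other_label == target:
                tp += 1
            else:
                fp += 1
        points.append((threshold, tp, fp))
    # Highest threshold first, i.e. the strictest classifier first.
    return sorted(points, reverse=True)
```

Each returned triple corresponds to one point of the ROC curve for the chosen class.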
In this case you can see the feature values for each data sample and how these samples were classified (Fig. 5):

Fig.5. Classification of learning samples.

By clicking on a table title you can sort your data.
Please note: These classifications were done without cross-validation and were not used to calculate the network quality and performance given in the tables on this page.




New learning

You can always go back to the feature selection to select/unselect different groups of features or even upload new input data.



Download feature dataset in C4.5 format

To save your work, you can download your feature vectors in C4.5 format as two files. They can be uploaded again in step 1 for a new analysis with BioBayesNet, or used in other applications such as JavaBayes.



Classification of new data

There are two ways to reach this page (step 4):
  1. directly after Bayesian network model learning, using the network learned there, or
  2. by uploading the BIF file of a previously learned BN model:
    Fig.6. Classification with a BN from a BIF file.





Classification of new data in FASTA-format

Now in step 4 you need to upload a new dataset in FASTA format. In this case the FASTA header must only contain a sequence name. You can also select a class of interest and a threshold for the posterior probability (Fig. 7):
Fig.7. Classification of new data in FASTA format.





Classification of new data in C4.5 format

In this case, please pay attention to the two points shown in Fig. 8. You can also select a class of interest and a threshold for the posterior probability:
Fig.8. Classification of new data in C4.5 format.





Workflow scheme of BioBayesNet web-server