Stockholm University,
Biochemistry and Biophysics, Arrhenius Väg 12,
Stockholm 106 91, Sweden
title:
Identification and classification of
glycosyltransferases by multivariate sequence
analysis.
Maria Rosén*, Åke Wieslander
The traditional
method to compare and classify proteins is based on
sequence alignments, where only amino acid identity
or similarity in the same position is accounted for
(e.g. BLAST). Another approach is to compare and
identify property patterns by using Multivariate
Data Analysis. Here each amino acid is described by
three values, the z-scores (hydrophobicity/
hydrophilicity, side chain bulk volume, and
polarizability and charge). The periodicity of the
amino acid properties in a protein is described by
calculating the auto-covariances and
cross-covariances of these values at different
"window" sizes, i.e. lags, and the resulting terms
are analysed by principal component analysis. The
method thus captures the relationship between close
and more distant neighbours in the sequence. The method has
previously been used to classify compartment
proteins in E. coli, ABC transporters in Mycoplasma
pneumoniae and other genomes, and signal peptides
in E. coli, mycoplasmas and other gram-positive
bacteria (1, 2). Here the method has been used to
classify glycosyltransferases. These have
previously been grouped into 56 different families
(CAZy), based on sequence homology (3). Several
crystal structures of different
glycosyltransferases have been solved, and there
are at least two different structure superfamilies
within the glycosyltransferases. The aim of this
study was to separate and classify sequences into the
two different superfamilies by Multivariate Data
Analysis (PLS-DA), and to further divide these
according to enzymatic mechanism. A separation of
the two known structure groups, including proteins
with predicted structure, was possible with
a high score (0.72, where 1.0 denotes a perfect
model). A further division within the different
structure groups between alpha and beta linkages
was also possible (score of 0.73). Predicting
potential glycosyltransferases with unknown
function from the mycoplasma genomes showed that
for the U. urealyticum candidates the closest
homolog was SpsA from B. subtilis, while for the
three M. pneumoniae enzymes the neighbours were
mainly eukaryotic enzymes. With a few exceptions,
these neighbours belong to CAZy family 2. The
method will next be applied to whole genomes to
find and classify hitherto unidentified
glycosyltransferases.
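
The auto- and cross-covariance step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the centring of each property within a sequence, the normalisation by (n - lag), and the toy descriptor values are all assumptions made for the example. The descriptor values are placeholders and not the published z-scores.

```python
import numpy as np

def acc_transform(sequence, descriptors, max_lag):
    """Auto- and cross-covariance terms of per-residue descriptors.

    `descriptors` maps each residue letter to d property values.
    Returns a vector of length max_lag * d * d. For each lag, entry
    [j, k] pairs property j at position i with property k at i + lag
    (j == k: auto-covariance, j != k: cross-covariance).
    """
    z = np.array([descriptors[aa] for aa in sequence], dtype=float)  # (n, d)
    n, _ = z.shape
    if max_lag >= n:
        raise ValueError("max_lag must be smaller than the sequence length")
    # Assumption: centre each property within the sequence.
    z = z - z.mean(axis=0)
    terms = []
    for lag in range(1, max_lag + 1):
        # Assumption: normalise by the number of residue pairs at this lag.
        cov = z[:-lag].T @ z[lag:] / (n - lag)  # (d, d)
        terms.append(cov.ravel())
    return np.concatenate(terms)

def pca_scores(X, n_components=2):
    """Principal component scores of the rows of X, computed via SVD."""
    Xc = X - X.mean(axis=0)
    U, S, _ = np.linalg.svd(Xc, full_matrices=False)
    return U[:, :n_components] * S[:n_components]

# Hypothetical three-value descriptors for a four-letter toy alphabet.
TOY_DESCRIPTORS = {
    "A": [0.1, -1.2, 0.3],
    "C": [0.8, -0.9, 1.5],
    "D": [2.1, 0.4, -0.7],
    "E": [1.9, 0.9, -1.1],
}

toy_sequences = ["ACDEACDEAC", "AACCDDEEAA", "ADADCECEAD"]
X = np.array([acc_transform(s, TOY_DESCRIPTORS, max_lag=3) for s in toy_sequences])
scores = pca_scores(X, n_components=2)  # one row of scores per sequence
```

Because every sequence is reduced to the same number of covariance terms regardless of its length, sequences of different lengths can be compared in one matrix without an alignment.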
References:
1. Edman, M., T. Jarhede, M. Sjöström, and
Å. Wieslander. 1999. Different sequence patterns in signal
peptides from mycoplasmas, other gram-positive
bacteria, and Escherichia coli: a multivariate
data analysis. Proteins. 35:195-205.
2. Edman, M., M. Sjöström, and
Å. Wieslander. 2002. Multivariate analysis
of ABC-dependent membrane transport proteins
from five genomes reveals group-specific
features in sequence properties.
3. Coutinho, P. M., and B. Henrissat. 1999.
Carbohydrate-Active Enzymes server at URL:
http://afmb.cnrsmrs.fr/~cazy/CAZY/index.html.