International Summer School

   From Genome to Life:

    Structural, Functional and Evolutionary approaches

 


 

ROSEN Maria

Stockholm University, Biochemistry and Biophysics, Arrhenius Väg 12, Stockholm 106 91, Sweden

title: Identification and classification of glycosyltransferases by multivariate sequence analysis.

Maria Rosén*, Åke Wieslander

The traditional way to compare and classify proteins is by sequence alignment, which accounts only for amino acid identity or similarity at the same position (e.g. BLAST). Another approach is to compare and identify property patterns using multivariate data analysis. Here, each amino acid is described by three values, the Z scores (hydrophobicity/hydrophilicity, side-chain bulk volume, and polarizability and charge). The periodicity of the amino acid properties in a protein is described by calculating the auto-covariance and cross-covariance of these values at different "window" sizes, i.e. lags, and the resulting covariances are analysed by principal component analysis. The method thus analyses the relationship between close and more distant neighbours in the sequence. It has previously been used to classify compartment proteins in E. coli, ABC transporters in Mycoplasma pneumoniae and other genomes, and signal peptides in E. coli, mycoplasmas and other gram-positive bacteria (1, 2). Here the method has been used to classify glycosyltransferases. These have previously been grouped into 56 different families (CAZy) based on sequence homology (3). Several crystal structures of different glycosyltransferases have been solved, and there are at least two different structural superfamilies within the glycosyltransferases. The aim of this study was to separate and classify sequences into the two superfamilies by multivariate data analysis (PLS-DA), and to further divide these according to enzymatic mechanism. A separation of the two known structure groups, with proteins of predicted structure included, was possible with a high score (0.72, where 1.0 corresponds to a perfect model). A further division within the structure groups between alpha and beta linkages was also possible (score of 0.73).
Prediction for potential glycosyltransferases of unknown function from the mycoplasma genomes showed that, for U. urealyticum, the closest homolog was SpsA from B. subtilis, whereas for the three M. pneumoniae enzymes the neighbours were mainly eukaryotic enzymes. With a few exceptions, these neighbours belong to CAZy family 2. The method will next be applied to whole genomes to find and classify hitherto unidentified glycosyltransferases.
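The auto- and cross-covariance step described above can be illustrated with a minimal sketch. It assumes each residue has already been translated into its three Z scores, and it assumes mean-centring per descriptor and normalisation by the number of residue pairs at each lag; the abstract does not state these details, so they may differ from the procedure actually used. The function name `acc_transform` and its arguments are hypothetical.

```python
from typing import Dict, Sequence, Tuple


def acc_transform(
    descriptors: Sequence[Sequence[float]], max_lag: int
) -> Dict[Tuple[int, int, int], float]:
    """Auto- and cross-covariances of per-residue descriptors.

    descriptors: one row per residue, one column per property
                 (e.g. three Z scores per amino acid).
    max_lag:     largest distance (in residues) between paired positions.

    Returns a dict keyed by (j, k, lag): j == k gives an auto-covariance,
    j != k a cross-covariance between property j and property k.
    Centring and normalisation are assumptions of this sketch.
    """
    n = len(descriptors)
    if n == 0:
        raise ValueError("empty sequence")
    if not 1 <= max_lag < n:
        raise ValueError("max_lag must be between 1 and len(sequence) - 1")
    n_desc = len(descriptors[0])

    # Centre every property column on its mean over the sequence.
    means = [sum(row[j] for row in descriptors) / n for j in range(n_desc)]
    centred = [[row[j] - means[j] for j in range(n_desc)] for row in descriptors]

    features: Dict[Tuple[int, int, int], float] = {}
    for lag in range(1, max_lag + 1):
        pairs = n - lag  # number of residue pairs separated by this lag
        for j in range(n_desc):
            for k in range(n_desc):
                total = sum(
                    centred[i][j] * centred[i + lag][k] for i in range(pairs)
                )
                features[(j, k, lag)] = total / pairs
    return features
```

Every protein, whatever its length, is reduced to the same number of values (properties × properties × lags), which is what allows sequences of different lengths to be compared in a subsequent principal component or PLS-DA model.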

References:

1. Edman, M., T. Jarhede, M. Sjöström, and Å. Wieslander. 1999. Different sequence patterns in signal peptides from mycoplasmas, other gram-positive bacteria, and Escherichia coli: a multivariate data analysis. Proteins. 35:195-205.

2. Edman, M., M. Sjöström, and Å. Wieslander. 2002. Multivariate analysis of ABC-dependent membrane transport proteins from five genomes reveals group-specific features in sequence properties.

3. Coutinho, P. M., and B. Henrissat. 1999. Carbohydrate-Active Enzymes server. URL: http://afmb.cnrsmrs.fr/~cazy/CAZY/index.html.