K lactis forum


Message 81af694b6aU-5168-887-00.htm, number 1, was posted on Tue Feb 24 '04 at 14:46:50
K lactis annotation forum is open

yvan zivanovic
yvan.zivanovic@arch5.igmors.u-psud.fr


dear annotators


here you will find a place to exchange ideas, complain about problems and ask questions of anyone else

this forum will be shared by everyone in this collaborative project

you don't need to authenticate yourself (just be "guest") unless you want to be able to edit the messages you have posted.


Yvan


Message 81af694b6aU-5168-1148-00.htm, number 2, was posted on Tue Feb 24 '04 at 19:08:11
First questions from Yde (Institute of Biology, Leiden University)

yvan zivanovic
yvan.zivanovic@arch5.igmors.u-psud.fr


So we can start. I actually did, and after some playing around I see it will take more time than anticipated, but I will try to finish in two months. Anyhow, that is not the message of today. The question I have is about naming genes. In the "green part" there is a box called gene name. Up to now I have only filled in a name if there was a genuine K.lactis gene known (you will be pleased to hear that the first one of that kind was your HAP2). However, is that the general policy? I came across several ORFs that have high similarity over their entire length with a known S.cerevisiae gene. It is tempting to put "Kl" before the name, although further evidence is lacking. Similarly, will "Product" and "EC numbers" be added solely on the basis of similarity, or not added at this stage?
The second question I have is whether you could have a look at the part I have done so far, the first 100 kb or so of chromosome I. I am totally new to annotation and feel a little insecure, and since I would prefer to go along the chromosome once rather than several times correcting mistakes, I would appreciate your comments (I could not find annotated parts of other chromosomes, but I looked at only a few places). If there is a well-annotated part that I can take as an example, that will also do.

Hope everything is fine.

Best regards

Yde


Message 81af694b6aU-5168-1153+00.htm, number 3, was posted on Tue Feb 24 '04 at 19:13:32
in reply to 81af694b6aU-5168-1148-00.htm

Re: First questions from Yde (Institute of Biology, Leiden University)

yvan zivanovic
yvan.zivanovic@arch5.igmors.u-psud.fr


Dear Yde,

there's no reason for the moment to suspect real sequencing problems. the duplication you are referring to is not at the 100% identity level, so this might reflect an actual duplication. IS or other mobile elements could be responsible for that. keep your eyes open for such issues.
it is good practice to keep track of this kind of feature (until we decide they are real problems), as a sequence revision is not scheduled at the moment.

concerning your other mail,

|At 15:48 +0100 19/02/04, H.Y.Steensma wrote:
|
|The most difficult issue is the coherence/uniformity. I guess you don't
|want to be supervisor, otherwise that would be my suggestion. You have a
|look at the (first) results of different annotators and communicate your
|remarks. It will not help your popularity but could be extremely useful.
|(Hint: maybe sending around general remarks on things you find wrong in
|the annotations will decrease the chance of hurting egos). While
|thinking about this: why not have an e-mail discussion about these
|points (it probably will be very difficult to arrange a meeting on short
|notice).

we in orsay (myself included) are the real "supervisors" anyway. we have been committed to this duty from the very beginning of this project. you'll hear from us sooner or later!

best,

Yvan


Message 81af694b6aU-5168-1155+00.htm, number 3, was edited on Tue Feb 24 '04 at 19:15:31
and replaces message 81af694b6aU-5168-1153+00.htm

Re: First questions from Yde (Institute of Biology, Leiden University)

yvan zivanovic
yvan.zivanovic@arch5.igmors.u-psud.fr


Dear Yde,

i'm in charge of the annotation database: welcome to the "annotation world".
congratulations on being the first one to actively annotate!
as you have seen, many issues come up as soon as you actually start working.

i can answer some of your questions and guide you toward better annotation, but others are closely tied to the way the fungal community is used to dealing with genomic and genetic data and nomenclature.
your first question falls into this category: should you add "Kl" to a gene name that is defined in yeast but not yet in lactis?
as i come from the prokaryote world, i cannot be of much help to you on this point, as we microbiologists usually do not prefix known genes with species-specific information while annotating.
but traditions and rules can be different. anyway, this is a very minor point that can be fixed when annotation is eventually finished, and it should ultimately be decided by you (all k. lactis experts)

let's come to another problem: if an orf is very similar to a known enzyme, should you assign the known enzyme's EC number to this particular orf?
usually the answer is: yes.
but it is your responsibility to ascertain that the similarity/pfam/KOG data is clear cut.
for example, if you're not sure whether you're dealing with a maltose (2.3.1.79), a serine (2.3.1.30) or a galactoside O-acetyltransferase (2.3.1.18), don't put an EC number, or at most only 2.3.1.- . the real enzyme specificity can sometimes be determined by sequence analysis (looking for conserved motifs and specific signatures), but sometimes only biochemistry or genetics can decide.
the rule is: better to give less information than wrong information.
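
as a minimal sketch of this rule (an illustration only, not part of the annotation database; the function name `common_ec_prefix` is made up here), the following python snippet reduces a set of candidate EC numbers to the levels they all share and fills the remaining levels with "-":

```python
def common_ec_prefix(ec_numbers):
    """Return the most specific EC number shared by all candidates.

    Levels on which the candidates disagree are replaced by '-'.
    Returns None when the candidates share nothing at all.
    """
    split = [ec.split(".") for ec in ec_numbers]
    if not split or any(len(parts) != 4 for parts in split):
        raise ValueError("expected EC numbers with four dot-separated levels")

    shared = []
    for level in zip(*split):
        # Stop at the first level where candidates differ or are unspecified.
        if len(set(level)) == 1 and level[0] != "-":
            shared.append(level[0])
        else:
            break

    if not shared:
        return None
    return ".".join(shared + ["-"] * (4 - len(shared)))


# The three O-acetyltransferases mentioned above.
candidates = ["2.3.1.79", "2.3.1.30", "2.3.1.18"]
print(common_ec_prefix(candidates))  # 2.3.1.-
```

a single unambiguous candidate is returned unchanged, and candidates with nothing in common give None, which corresponds to leaving the EC field empty.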

this leads to another important aspect: trying to give as accurate a "definition" as possible for an orf.
the "definition" (or "description") field should be used with caution, because this definition line is the one that identifies a "hit" when you perform any BLAST search in sequence databases (genbank, embl, swissprot etc). most biologists rely on this information when performing a sequence analysis.
the rule is the same as above (better to give less information than wrong information), but you should also avoid being misleading. i can see from your first annotation attempts (eg r_klactI0013 0019 0151) that you have put "hypothetical protein" in the description and added something in the functional annotation fields.
as a rule (one more), if you say "hypothetical protein", this means that nothing more can be said! at most you can add "contains an 'xxx' DNA-binding motif" or "belongs to 'yyy' uncharacterized family".
if you do give some functional information, you should be able to be more informative in the "description" field: eg "conserved transport protein", "probable phosphatase", "putative multidrug resistance protein" etc.
this gives coherence to the annotation and facilitates subsequent analysis of the genome. likewise, when annotating paralogs you should be careful to use the same rules for each one. for example, don't say "multidrug resistance protein" for one copy and "drug transporter" for the other. decide which scheme is more appropriate, and stick to it
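
the "hypothetical protein" rule can be checked mechanically. a minimal python sketch, assuming each annotation is available as a description string plus a dict of functional fields (the field names and the function `check_description` are hypothetical, not the real database schema):

```python
def check_description(description, functional_fields):
    """Flag a 'hypothetical protein' entry that still carries functional annotation.

    `functional_fields` maps a field name (e.g. EC number, MIPS category)
    to its value; empty values count as "not filled in".
    """
    filled = sorted(name for name, value in functional_fields.items() if value)
    is_hypothetical = description.strip().lower().startswith("hypothetical protein")
    if is_hypothetical and filled:
        return "inconsistent: hypothetical protein with " + ", ".join(filled)
    return "ok"


print(check_description("hypothetical protein", {"ec_number": "2.3.1.-", "mips": ""}))
print(check_description("probable phosphatase", {"ec_number": "3.1.3.-"}))
```

the first call reports an inconsistency, the second does not; an entry flagged this way needs either a more informative description or empty functional fields.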

these are only general rules. many other issues come up during annotation, and i think that many of them deserve a specific answer.
among the most urgent ones to be addressed are:

- shall we keep the reference to YEAST orthologs in the definition?
- shall we favour mips functional annotation over COG (more general)?
- how can we keep the annotation satisfactorily coherent among different annotators? (this is a major issue)

as to this last point, i strongly encourage all annotators to keep an eye on the annotation going on in other chromosomes.

these questions should be discussed as soon as possible by everyone involved in the annotation of k. lactis




best regards,


Yvan

[ This message was edited on Tue Feb 24 '04 by the author ]


Message 81af694b6aU-5168-1157-00.htm, number 4, was posted on Tue Feb 24 '04 at 19:17:20
Another question Yde (Institute of Biology, Leiden University)

yvan zivanovic
yvan.zivanovic@arch5.igmors.u-psud.fr


I just came across what I think is a mistake in the sequence. ORFs
I1181 to I1203 on chromosome I match for 100% (100% identity)
with ORFs III1308 - III1333 at chromosome III. Since this appears to
me very unlikely, it seems an assembly mistake. Who can I ask for
more information, or better, who will have a look at this?

Best regards
Yde


Message 81af694b6aU-5168-1158+00.htm, number 5, was posted on Tue Feb 24 '04 at 19:18:14
in reply to 81af694b6aU-5168-1157-00.htm

Re: Another question Yde (Institute of Biology, Leiden University)

yvan zivanovic
yvan.zivanovic@arch5.igmors.u-psud.fr


On Tue Feb 24 '04, yvan zivanovic wrote
---------------------------------------
>I just came across what I think is a mistake in the sequence. ORFs
>I1181 to I1203 on chromosome I match for 100% (100% identity)
>with ORFs III1308 - III1333 at chromosome III. Since this appears to
>me very unlikely, it seems an assembly mistake. Who can I ask for
>more information, or better, who will have a look at this?

>Best regards
>Yde

Dear Yde,

there's no reason for the moment to suspect real sequencing problems. the duplication you are referring to is not at the 100% identity level, so this might reflect an actual duplication. IS or other mobile elements could be responsible for that. keep your eyes open for such issues.
it is good practice to keep track of this kind of feature (until we decide they are real problems), as a sequence revision is not scheduled at the moment.
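
for reference, "identity level" here means the percentage of aligned positions that are the same in both copies. a minimal python sketch, assuming the two regions have already been aligned by some external tool with gaps written as "-" (the function name `percent_identity` is hypothetical):

```python
def percent_identity(aligned_a, aligned_b):
    """Percent identity over aligned columns.

    Columns where both sequences have a gap are ignored; a gap opposite
    a residue counts as a mismatch.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("sequences must be aligned to the same length")
    columns = [
        (a, b)
        for a, b in zip(aligned_a.upper(), aligned_b.upper())
        if not (a == "-" and b == "-")
    ]
    if not columns:
        return 0.0
    matches = sum(1 for a, b in columns if a == b)
    return 100.0 * matches / len(columns)


print(percent_identity("ATGCATGC", "ATGCATGC"))
print(percent_identity("ATGCATGC", "ATGAATGC"))
```

anything below 100 over the whole duplicated region is compatible with two diverged copies rather than one sequence assembled twice.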

best,

Yvan


Message 81af694b6aU-5168-1159-00.htm, number 6, was posted on Tue Feb 24 '04 at 19:19:16
A question by Cesira (Cesira_Galeotti@Chiron.it)

yvan zivanovic
yvan.zivanovic@arch5.igmors.u-psud.fr


Dear Yvan,

we have a few questions for you:

* In the Gene NAME field, should we put the name of the gene, when available?
(for example: r_klactIV3360/RAG1)

* What are we supposed to write in the PRODUCT field?

Thanks.

Cesira


Message 81af694b6aU-5168-1161+00.htm, number 7, was posted on Tue Feb 24 '04 at 19:21:05
in reply to 81af694b6aU-5168-1159-00.htm

Re: A question by Cesira (Cesira_Galeotti@Chiron.it)

yvan zivanovic
yvan.zivanovic@arch5.igmors.u-psud.fr


Dear Cesira,


welcome to klactis annotation

On Tue Feb 24 '04, yvan zivanovic wrote
---------------------------------------
>Dear Yvan,

>we have a few questions for you:

>* In the Gene NAME field, should we put the name of the gene, when available?
>(for example: r_klactIV3360/RAG1)

this is correct. in this case you could just type "RAG1" in this field.
there's been some discussion about adding "Kl" before the gene name, but this issue has not been decided yet.
anyway, this can be added afterwards.

>* What are we supposed to write in the PRODUCT field?

>Thanks.

this field is really optional, and comes mainly from an earlier version of the annotation database.
as an example, if the gene in question encodes "Ornithine carbamoyltransferase", then the gene name will be "argF" and the gene product is "OTCASE".
but once more, it's optional

>Cesira

hope this helps

best,

Yvan


Message 81af694b6aU-5169-911-00.htm, number 8, was posted on Wed Feb 25 '04 at 15:10:44
Question by Tiziana = MIPS and COG problems (tlodi@ipruniv.cce.unipr.it)

yvan zivanovic
yvan.zivanovic@arch5.igmors.u-psud.fr


Dear Yvan,
two questions concerning MIPS and COG functional categories.
1) In some cases (e.g. klactIV0107) more than one functional category is indicated in MIPS. In this case, which one should be chosen?

2) In some cases the COG category indicated is not present in the list (and consequently the subcategory is absent). One example is klactIV0138, for which LT (not L or T) is indicated. What should we do?
Thanks a lot.
Tiziana Lodi


Message 81af694b6aU-5169-912+00.htm, number 9, was posted on Wed Feb 25 '04 at 15:12:39
in reply to 81af694b6aU-5169-911-00.htm

Re: Question by Tiziana = MIPS and COG problems (tlodi@ipruniv.cce.unipr.it)

yvan zivanovic
yvan.zivanovic@arch5.igmors.u-psud.fr


On Wed Feb 25 '04, yvan zivanovic wrote
---------------------------------------
>Dear Yvan.
>two questions concerning MIPS and COG functional categories.

Dear Tiziana

i'll answer you directly and also post this on the new forum page (http://www-archbac.u-psud.fr/genomes/r_klactis/r_klact_forum/r_klact_forum.html)

next time, try posting your questions this way and see if it works

>1) In some cases (e.g. klactIV0107) more than one functional category is indicated in MIPS. In this case, which one should be chosen?


you can see that COG for klactIV0107 says "KOG0254 Predicted transporter (major facilitator superfamily)"
so maybe the best choice for consistency is to annotate in MIPS with
"67 TRANSPORT FACILITATION
67.07 C-compound and carbohydrate transporters"

>2) In some cases the COG category indicated is not present in the list (and consequently the subcategory is absent). One example is klactIV0138, for which LT (not L or T) is indicated. What should we do?


if you are patient enough, scroll down the COG subcategory (number) menu to the very end (i know it's a little bit tedious)
at the bottom you'll see:
"LT KOG0133 Deoxyribodipyrimidine photolyase/cryptochrome"

here it is !

>Thanks a lot.
>Tiziana Lodi

Yvan


Message 84e50c15Xmk-5182-921-00.htm, number 10, was posted on Tue Mar 9 '04 at 15:22:13
K.lactis centromeres

Yde Steensma
steensma@rulbim.leidenuniv.nl


Dear co-annotators,

Old love never dies, and I had a look at the centromeres; the results are shown below.
Please have a look at the areas specified for your chromosome to see whether there are indeed also similarities with the low-number (= near-centromeric) S.c. genes/ORFs.

Regards

Yde

Kluyveromyces lactis Centromeres
CENI 760404 to and including 760599
Chromosome I 760400-760600
TTCTTGCACGTGATTTGATGTTTTTAAAGTTATCCATTTTAATATTTTTATTATTTATTTTAAACTTTTCTCTATTTTTAAGAAGAAATATATATTGTTTTAATTTATTTTGAAAAAATCATTTTTAAACTATACTTGAGTATAAGAATTTATTCTAAAACAAAAATTATTTGTTGCTTTATGTTTCCGAAAATTATTTTA

CENII 1168861 to and including 1169059
Chromosome II 1168855-1169080
AAATCAATCACGTGATATTAAAAATATACTTTTTACCAAAATAATATTTTTGTAATTTTATTTGAAAGAAATCAAAAAAATTTCAATTAAAATATTTTTTATTTAATTTAATTTTAGCTTTTAATTAACATATATAATATTTTATAATTTATTCTTTCAGTTTTAATAATTTAATATTACAAAGTTTGTTTCCGAAAATTAAAATATGTTTTGACGAATTGTATAT

CENIII 1638150 to and including 1638348
Chromosome III 1638140-1638350
AATCATAAAACATCATGTGACTAAAATCAAATTTTACATTTTCTCTCTTTTTATATAAAATAAATTTCTAAACTTTTACTTTATTGCAATTAAAAATATTTTTTTATTAATTTTATATTGAAAAAATTTTAATTAAAATAATGTGGAATTTTGAATTTTTTAAAATAAAATATATGTTTTGAAAAATTTTAGTTTCCGAAAATTAATTTAT

CENIV 1187303 to and including 1187501
Chromosome IV 1187300-1187505
AATATCACGTGCATGCTTTTAATAAAAATCATCTTAAAATCTTTTTTTATTTTTTAGTTTTAAACTTTTTAGTTATTTCAGGAATATATATATTTTTTGTAATATTACTTTTATAAAGTACTTTTTTCTTTTATATTTTTCTAAAGCTTAGTTAGAATTTTAGTAAATAAATGTTTTTTAATACAGTTCCGAAAATAAATATTTTT

CENVI 1187015 to and including 1187212
Chromosome VI 1187010-1187220
TATTATGCACGTGACAATAAATAATACATTTTTATAAACTTATAATTTTTATATTTAATTTTTAAAATAACTTTATAATTATCATATTTTAATTATATATATTAATTTTTGTTAACTCTAGGTTATTTATAAAATAGAATATTTTTGATTACTACTCTAAAATTTAAATATTAAAGATGTTTTATTGTTCCGAAAATAAAAATATTAATGA


CENV 1263806 to and including 126402
Chromosome V 1263800-1264005
ATATATATCACGTGCAAATAAAATATAATTTCAAATTAGTTTTAATTCTTCTTTTATTTTTCTAGAGAAAATATCTATTAATTTTCTCATAATAGAAATAAATAAATTTTTACTTTTTTTATAATTATCTATTTAATTTAAATAAAACTTAATTCAATTTATTTTAAAATATTGTATAGTTTTACTATTCCGAAAAATAAATATAT

The red sequences are the CDEI, CDEII and CDEIII elements; CDEI and CDEIII are underlined (unfortunately this formatting got lost during copying).
Centromeres I, II, III, IV and VI were localized by comparison with published sequences (Heus et al., Mol. Gen. Genet. 236 (1993) 355-362).
CENV was found by searching chromosome V with the CDEIII consensus TTCCGAAA in both orientations. Only one of the hits was surrounded by an AT-rich area, and this one, shown above, was assumed to be the centromere as it had all the other characteristics of a K.lactis centromere.
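
For anyone who wants to repeat this kind of search, here is a minimal Python sketch of the idea: find the CDEIII consensus on both strands and keep only hits whose flanks are AT-rich. It is an illustration, not the procedure actually used; the function names, the 150 bp flank and the 0.8 AT threshold are assumptions chosen for the example, and the test sequence is a toy, not chromosome V.

```python
def reverse_complement(seq):
    """Reverse complement of a DNA string (A/C/G/T only)."""
    return seq.translate(str.maketrans("ACGTacgt", "TGCAtgca"))[::-1]


def at_fraction(seq):
    """Fraction of A and T in a sequence; 0.0 for an empty sequence."""
    seq = seq.upper()
    return (seq.count("A") + seq.count("T")) / len(seq) if seq else 0.0


def find_cdeiii_candidates(chromosome, motif="TTCCGAAA", flank=150, min_at=0.8):
    """Return (1-based position, strand, AT fraction of flanks) for motif
    hits whose flanking sequence is at least `min_at` A+T."""
    chromosome = chromosome.upper()
    hits = []
    for strand, pattern in (("+", motif), ("-", reverse_complement(motif))):
        start = chromosome.find(pattern)
        while start != -1:
            left = chromosome[max(0, start - flank):start]
            end = start + len(pattern)
            right = chromosome[end:end + flank]
            at = at_fraction(left + right)
            if at >= min_at:
                hits.append((start + 1, strand, round(at, 2)))
            start = chromosome.find(pattern, start + 1)
    return sorted(hits)


# Toy sequence: one motif inside AT-rich DNA, one inside GC-rich DNA.
toy = "GC" * 50 + "AT" * 40 + "TTCCGAAA" + "AT" * 40 + "GC" * 50 + "TTCCGAAA" + "GC" * 50
print(find_cdeiii_candidates(toy, flank=50))
```

Only the first motif in the toy sequence is reported, because the second one sits in GC-rich surroundings. A real candidate would still need to be checked for CDEI and the other centromere characteristics mentioned above.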

Notes
- Some centromeres are almost subtelomeric (II and III).
- It is also remarkable that all six centromeres have the same orientation: on the chromosomes as published, they run from left to right CDEI, CDEII, CDEIII.
- On chromosome I the centromere is surrounded by ORFs that share similarity with S.cerevisiae genes that lie next to centromeres there as well,
i.e. I2225 (YCL004w) - I2233 (YCL001w) - I2235 (YNL001w) - CENI - I2243 (YGR001c) - I2250 (YAL001c) - I2256 (YAR002w).
How about your chromosome?




Message 9ec30f0900A-5198-758-00.htm, number 11, was posted on Thu Mar 25 '04 at 12:38:23
possible problems

gbelska


we have noticed that ORF2065 and ORF2068 on chromosome V
probably belong to the same gene. Please check
for possible sequencing mistakes.

the important thing to take into account is that we actually cannot
make any correction to the sequence; in fact it's even rather
difficult to check the sequence quality at any specific coordinate.
this is because the sequencing has been performed by genoscope, the
french genome sequencing center, which does not provide easy access
to the primary data (moreover, we would need much bigger computers
here to deal with such a huge quantity of data).

this is why, if you do suspect a sequence error, the only alternative
at the moment is to flag the problem in the annotation itself, eg
"probable sequencing error" in the description field, and to add some more
information in the comment field, eg "this gene is split into two orfs
(orf a & orf b) because of a probable sequencing error". it's also a
good idea to maintain a personal list of all the problems you
encounter, so that when you finish annotating we can have a look
together and see whether some particular region of the genome needs
to be reexamined for sequence quality.

IMPORTANT:
i would like to stress that, BEFORE concluding that a sequence error
is possible or probable, you should have ruled out other
possibilities such as the presence of an intron, a multiple-subunit protein with
a different organization (split/joined domains) in different
species, possible translational recoding or frameshifting, or any other
infrequent situation.
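
the personal list of problems can be as simple as one record per suspected error. a minimal python sketch, assuming nothing about the real database: the class `SuspectedError`, its fields and the "needs review" label are all hypothetical; only the "probable sequencing error" wording and the list of alternatives come from the text above.

```python
from dataclasses import dataclass, field

# Alternatives that should be excluded before blaming the sequence.
ALTERNATIVES = (
    "intron",
    "split/joined domains between species",
    "translational recoding or frameshifting",
)


@dataclass
class SuspectedError:
    orfs: tuple                 # ORF identifiers involved (illustrative format)
    note: str                   # free text for the comment field
    ruled_out: set = field(default_factory=set)

    def description(self):
        """Label for the description field: 'probable sequencing error'
        only once every listed alternative has been ruled out."""
        if set(ALTERNATIVES) <= self.ruled_out:
            return "probable sequencing error"
        return "needs review"


entry = SuspectedError(("V2065", "V2068"), "gene split into two orfs")
print(entry.description())
entry.ruled_out.update(ALTERNATIVES)
print(entry.description())
```

the record starts as "needs review" and only becomes "probable sequencing error" after all alternatives are marked as excluded, which mirrors the order of checks asked for above.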