
K. lactis annotation guidelines

This page is a short "user manual" for the K. lactis annotation database.

1) What has been done

K. lactis chromosome sequences, as available on 05/02/03, were used to build an open reading frame (ORF) database, using ATG as the single start codon and 40 aa as the minimal ORF length.
This gave a set of 30727 ORFs (excluding mitochondrial DNA), which is the starting material for this work.
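
The ORF definition above can be illustrated with a minimal sketch. It is not the program that built the database: the function names, the coordinate convention (0-based, end-exclusive, on the direct strand) and the choice to ignore ORFs without a stop codon are assumptions made for this example. An ORF is taken here as the first ATG after the previous stop, up to and including the next in-frame stop, in each of the 6 frames.

```python
STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def orfs_in_strand(seq, min_aa=40):
    """Yield (start, end) of ORFs in the 3 frames of one strand.

    The ORF runs from the first ATG after the previous stop to the
    next in-frame stop (stop codon included in the coordinates).
    ORFs that reach the end of the sequence without a stop are
    ignored (assumption of this sketch).
    """
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif codon in STOPS:
                # Length in amino acids, stop codon not counted.
                if start is not None and (i - start) // 3 >= min_aa:
                    yield (start, i + 3)
                start = None


def find_orfs(seq, min_aa=40):
    """Return (strand, start, end) for ORFs in all 6 frames."""
    seq = seq.upper()
    n = len(seq)
    found = [("+", s, e) for s, e in orfs_in_strand(seq, min_aa)]
    rc = reverse_complement(seq)
    # Convert reverse-strand coordinates back to the direct strand.
    found += [("-", n - e, n - s) for s, e in orfs_in_strand(rc, min_aa)]
    return found
```

For example, find_orfs("ATG" + "GCT" * 39 + "TAA") reports a single 40 aa ORF on the direct strand, while the same sequence with one codon less falls below the length cutoff.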

This set of 30727 ORFs was annotated by a semi-automatic procedure (see details on this page: 10X auto-annotation).
Basically, 5 criteria were used to perform this auto-annotation:

1. Similarity with known S. cerevisiae genes was used to establish the existence and annotation (definition) of most genes, when both sequences are very similar.

2. The predicted coding probability of each ORF was determined using a hidden Markov model (HMM) and used to confirm the gene's existence.

3. ORFs highly similar to other known genes found in the C. albicans genome and the trEMBL database were used to define the annotation (when S. cerevisiae did not give a good hit).

4. When no good similarity existed for an ORF (Blast expect P ≥ 10^-30 and no good overall alignment in any database), the best hit is proposed as the default annotation for this ORF.

5. Finally, for each K. lactis defined gene, a "similarity level" with the S. cerevisiae gene (ortholog) was added. This was based on rather rigid rules, in order to comply with the 'Genolevure' consortium requirements.

Steps 1-5 were performed automatically, then subjected to an individual evaluation by the primary annotators (hence "semi-automatic"). No alteration of the original (automatic) annotation was possible at this stage, only an assessment of the given choice (good / problem).
Several rounds of adjustment of the automatic annotation parameters were made, and ultimately some manual corrections were performed.

This semi-automatic annotation procedure allowed us to define 5203 genes on chromosomes I-VI.
However, it is very important to take note of several facts:
- First, the annotation was not performed on the completely finished genome, so a few regions were overlooked at the time of this work (mostly sub-telomeric regions, which were missing).
- Second, very limited attention has been given to specific features such as introns, real versus potential start codons, etc.
- Third and most important, no functional annotation has been attempted at this level.

The annotation described above has been used to set up the K. lactis genome annotation database, which is the starting point for the present annotation work.

2) What is to be done

As can be seen from the preceding paragraph, annotation of the K. lactis genome is an effort to extend, correct and complete the current work.
Most of the focus is needed on functional annotation, which is currently missing, and which should be a cornerstone for the genome analysis. Several specific tools have been designed to aid annotators in this regard (see below), as this should constitute the most important part of the work.
Several aspects of annotation have not yet been fully incorporated in the database, including information about RNAs, introns, promoters, and non-transcribed elements in general. This should be done gradually in the near future, maybe after I get a little more feedback on these features and the best way to deal with them.
Annotators are expected to give an assessment of each gene's current annotation (if it exists):
- In case of a problem, they should submit their own annotation after a careful evaluation of all data pertaining to that particular gene (see below); otherwise they can leave the current one untouched.
- If an ORF has been wrongly given the status of a gene (there are cases), it can be deleted.
- If an ORF is obviously a gene and has been overlooked in the automatic annotation stage, it can be added, and the annotator must provide an annotation.
- Any gene that can be identified with good confidence (i.e. the annotation says it is "probably" an already known gene because the similarity is good) should be classified in a function category according to 2 different schemes:

MIPS - CYGD functional classification
This classification is specifically designed for YEAST. Its main advantage is that it is based on the considerable knowledge of yeast genetics and biology. Its main disadvantage is that it is too specific for people working outside the fungi field.

NIH - Koonin KOG : clusters of orthologous genes
This is a functional classification of genes based on orthology groups in completely sequenced genomes within large domains: prokaryotes (archaea + bacteria) or eukaryotes. The main advantages are that it is widely accepted among various communities of biologists, and that a big effort is made to predict the function of so far hypothetical and/or poorly characterized genes. The main disadvantage is that it does not deal specifically with the needs of fungi biologists.

Whatever the respective pros and cons of each system, annotators are requested to assign a functional annotation in BOTH classification schemes. Although the functional categories in each system are different, and sometimes do not overlap in an obvious way, the functional assignments in the two systems should, as far as possible, not contradict each other.

3) How to do the work

For this, you'll need two things:

a) access the information about each ORF
b) submit your annotation after careful evaluation

All links for access and annotation submission can be found on the K. lactis Annotation Page.

ACCESSING INFORMATION:

Every ORF (all 30727 of them) from the K. lactis genomic sequence has been matched against several public databases (trEMBL, SWISSProt, YEAST, C. albicans, Pfam, KOG, self match), and all results have been compiled in a database called "Matches in Public databases".

This information can be accessed via the Matches in Public Databases page.
This is a standard Query Form which allows searching with multiple criteria. Its use is rather straightforward for anyone who has already tried to retrieve valuable information from internet databases.
Results are displayed in such a way as to give access to the maximum amount of information for each ORF (gene maps and sequences, blast and other tool results, direct links to public resources).
This database allows one to work in a "transversal" manner, as opposed to working on chromosome maps (see below), which is strictly dependent on the spatial arrangement of genes on the genome.

Auto-annotated genes (see above) were used to build an Annotation Database.

This information is accessed on the Annotation Database Search Page.
This database will contain all newly added annotations, as well as those already defined in the automatic annotation stage.
Its content is updated dynamically as soon as you add, delete or modify an annotation (see below).
The database is designed to allow multiple-criteria queries on annotated data, to ease genome analysis once the annotation is finished.
This information is also important to give accurate feedback on the ongoing work by other collaborators on other chromosomes, for example.

Chromosome maps display the actual organization of all ORFs in all 6 frames on every chromosome.

These maps are accessible individually for each chromosome on the annotation entry page (K. lactis : Map & Annotation button); they are the main workbench for the annotation process.
Understanding the main parts of the map design is essential for good and efficient usage.
Here follows a brief outline, but every user should practice a little before actually "diving" into annotation.
You have several ways to access chromosome maps:

- Open a map by pushing a 'K. lactis : Map & Annotation' button found on the annotation entry page.
- Jump directly to any ORF's genomic environment:

Type any valid ORF name in the "Jump directly to an item" field found on the annotation entry page. You can optionally adjust the display range for the map, which will be centered around the desired element.

- Access to chromosome maps is also possible from any search result page.

The result is the same as if you had jumped directly to an ORF (see preceding paragraph).

This will bring up a window divided into three parts:

MAP (upper part)
This is the actual 6-frame ORF map; it starts at position 1 on the given chromosome sequence. It is a dynamic map, where each rectangle displays the location of an ORF on the sequence (direct strand above the blue line, reverse strand below the blue line). Boxed rectangles represent ORFs that have been annotated, thus they represent real "GENES". Others remain as shadows, and will never become genes unless someone adds an annotation for them.
Passing over an ORF on the map with your pointer triggers the selection of this particular element. The selected element, which appears in the field labeled "Element Info", can now be subject to any action performed in the CONTROLS part of the window.

CONTROLS (middle part)
This is a "control panel" where every action actually takes place.
Please note that you can type (or paste) any valid ORF name in the "Element Info" field to have it selected. This works only for ORFs belonging to the chromosome whose map is currently displayed.
SetUp: the yellow part is the "SetUp" of the map:
you can move left or right by the offset indicated in the "Offset" field, zoom in or out, change the size of the map (for small screen users), select a specific range to display by entering start / end coordinates in the corresponding fields, and change the coloring scheme of the map. This feature is explained briefly in the first RESULTS page when a map is started.
Get Info: the blue part is where you ask for specific information on the selected element.
Pushing any of the reddish buttons (*) will call up the results for the selected element. In this example, "r_klactIV0097" is the selected element, you have just pushed the "Hits in sprot+trEMBL" button, and the blast result is displayed in the RESULTS part (see below).
(*): Hits in SWISSPROT+trEMBL; Hits in SWISSPROT; Hits in KOG; Hits in HmmPfam; Paralogs; Hits in Yeast; Hits in C. albicans; CYGD:MIPS function_category. Also, you can directly Blast the selected element's sequence against the NCBI Blast database (Run-psi-Blast). Other functions are explained below.
Decide: the green part is the annotation submission form.
Once you have decided to add, delete or modify an annotation (see below for guidelines on this issue), you must use this form to enter the relevant information in the different fields. See below for more details.

RESULTS (lower part)
This is where all results are displayed. Note that you can have results displayed in a separate window instead, by selecting "Display results in New Window", a toggle button located in the CONTROLS part of the window.

DEFINING AND SUBMITTING YOUR ANNOTATION

First, a few rules to keep in mind while annotating:

- K. lactis is very close (phylogenetically speaking at least) to S. cerevisiae. The semi-automatic annotation relies very heavily on K. lactis / S. cerevisiae gene-to-gene comparison. This sets a very strong framework for K. lactis annotation, as S. cerevisiae is itself very thoroughly annotated.
This means that it is not necessary to duplicate the automatic annotation when genes from both species are very close to each other. In this (most frequent) case, you need only add a functional annotation to the existing one: see below.
- Automatic annotation is very convenient, but it is in no way smart!
We have certainly overlooked some obvious errors while checking the complete annotation from this stage, so these errors have to be fixed. More complex situations (multidomain proteins, introns, no clear-cut similarity evidence, etc.) are probably (very) poorly annotated at the present time. This fraction can be estimated at about 20% of all K. lactis genes, and this is where a major effort should be concentrated. Here again, functional annotation should be attempted.
- A common source of error for beginner annotators is to believe that genes can overlap (either on the same strand or on opposite strands). This kind of configuration is extremely unlikely, especially for overlaps on opposite strands. Here is an example of two "genes" overlapping: this was wrongly generated by automatic annotation. What can sometimes be seen on chromosome maps is that 2 good gene candidates (actually ORFs) overlap by a small amount. This is mostly because the second "gene" does not begin at the real start codon. Remember that ORFs are defined as the longest possible open reading frame. Another, more frequent situation occurs when a K. lactis gene is obviously longer than all its orthologs. Selecting the right start codon is a significant part of the job. A specific tool for finding alternate start codons has been designed for this purpose.
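
As an illustration of the overlap rule, here is a small sketch that lists pairs of annotated elements sharing at least one base. It is not part of the annotation site: the function names, the input layout (name mapped to 1-based inclusive start and end on the direct strand) and the element names in the usage note are hypothetical.

```python
from itertools import combinations


def overlap_bp(a, b):
    """Number of bases shared by two (start, end) intervals.

    Coordinates are 1-based and inclusive; strand is ignored, since
    overlaps are suspicious on the same strand and on opposite strands.
    """
    return max(0, min(a[1], b[1]) - max(a[0], b[0]) + 1)


def overlapping_genes(genes):
    """Return (name1, name2, shared_bp) for every overlapping pair.

    `genes` maps an element name to its (start, end) coordinates.
    """
    hits = []
    for (n1, g1), (n2, g2) in combinations(sorted(genes.items()), 2):
        bp = overlap_bp(g1, g2)
        if bp > 0:
            hits.append((n1, n2, bp))
    return hits
```

With three hypothetical elements, orfA at (100, 700), orfB at (680, 1500) and orfC at (2000, 2600), only the pair orfA / orfB is reported, with 21 shared bases. Such a pair is a candidate for checking the start codon of the downstream element.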

You are now ready to start annotating:

Making your decision
Start working on the chromosome that you have chosen. Open the chromosome map (which starts at position 1) and start travelling along the sequence. Select a coloring scheme (see above) that fits your needs. The maps are opened by default with the "SWISSPROT+trEMBL" hits coloring scheme, which is based on the most complete similarity database.
Boxed ORFs are genes already annotated in the automatic annotation stage. Push the "New annotations" button to see all annotations pertaining to any particular element.
Select an element and examine the results from hits in sequence databases by pushing the appropriate button (see CONTROLS above). You can also retrieve the ORF's sequence and use it with your own tools: just click on the ORF in the map. Check the coding probability of the selected ORF by using the "HMM prediction" coloring scheme. The different coloring schemes will help you focus on meaningful ORFs (those with a significant hit in any particular database).
Don't waste your time on white or blue, box-less, small ORFs! They are probably only "shadows".
Don't waste your time on small box-less colored (red, blue or green) ORFs overlapping completely with a longer gene (boxed ORF). They are probably only "shadows".
Pay attention to differences between gene lengths in Blast results: they are often a useful guide to figure out the correct start codon.
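
The reasoning about gene lengths and start codons can be sketched as follows. This is only an illustration, not the alternate start codon tool of the site: the function names are hypothetical, and picking the in-frame ATG that gives the protein length closest to an ortholog is a simplification of what an annotator does when reading the alignments.

```python
STOPS = {"TAA", "TAG", "TGA"}


def alternate_starts(orf_nt):
    """List in-frame ATG codons of an ORF.

    Returns (nucleotide_offset, protein_length_aa) for each ATG,
    where the protein length does not count a terminal stop codon.
    """
    orf_nt = orf_nt.upper()
    n_codons = len(orf_nt) // 3
    # Ignore the terminal stop codon, if the sequence includes it.
    if orf_nt[3 * (n_codons - 1):3 * n_codons] in STOPS:
        n_codons -= 1
    starts = []
    for k in range(n_codons):
        if orf_nt[3 * k:3 * k + 3] == "ATG":
            starts.append((3 * k, n_codons - k))
    return starts


def closest_start(orf_nt, ortholog_length_aa):
    """Pick the ATG whose protein length is closest to the ortholog's."""
    starts = alternate_starts(orf_nt)
    if not starts:
        return None
    return min(starts, key=lambda s: abs(s[1] - ortholog_length_aa))
```

For the toy sequence "ATGGCTATGGCTTAA", alternate_starts returns two candidates, at offsets 0 (4 aa) and 6 (2 aa). The choice must always be confirmed by looking at the alignments themselves.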

Start defining the functional annotation once you have reached good confidence in what the blast alignments say.
There are 3 main tools to help you define the functional annotation:

- Pfam: Protein families database of alignments and HMMs. This is most useful for finding common protein domains and families.

- KOG: This is actually a highly structured and categorized sequence database. A good hit with one of the many "clusters" of sequences gives the functional classification of the query sequence. The interpretation is not completely straightforward, depending on the quality of the alignment, so a "Guess" is made about the functional pathway, category and KOG number confidence level (this is based on a few simple rules: see this page). Be aware, however, that the true fitting of proteins into the COG database is made by the COGNITOR program and members of E. Koonin's group at NCBI.

- MIPS-CYGD: This tool gives the functional classification of the best S. cerevisiae (YEAST) hit with the selected element. It is extremely important to understand that if the best YEAST hit is very weak or the alignment is problematic, then this information is not relevant at all. In order to adopt a YEAST gene's functional annotation for a K. lactis gene, the similarity between both genes must be good (this is where human decision takes place).
The MIPS classification of yeast genes is very detailed, and spans several levels (functions, pathways, cellular localization, phenotypes, EC number, etc.). For this work, we will limit K. lactis functional annotation to a single level of functional annotation.

Submit your annotation:
Your analysis is now done, and you want to submit your annotation for a given ORF or gene (an already annotated ORF).
The annotation submission form is in the green part of the CONTROLS part.
This rather simple form requires only 2 items to be filled in to allow submission: the "Element Info" field must contain a valid element name and the "Description" field must contain some text (this is the definition that appears in the first line of all matches in a blast result file, as well as the definition line in a fasta file). All other fields are optional and left to your personal judgment and wisdom. If you do wish to add functional annotation, you must do it by selecting the relevant categories in the "KOG" and "MIPS" pop-up menus. As you select a category from the "KOG" or "MIPS" categories pop-up menus, the contents of the "KOG number" and "MIPS sub-categories" menus are respectively updated to reflect the main menu choice: see an example here.

Once you have filled in all necessary fields in the submission form, initiate the submission by pressing the "Submit" button. You now have the last opportunity to check your annotation carefully before final confirmation ("Confirm" button).
As soon as your submission has been acknowledged, the content of the Annotation Database is updated. The chromosome map itself is updated as well, but only after you press the "Apply Change" button in the CONTROLS part; just pressing "reload" in your browser window won't work.

You are, of course, invited to review your submission carefully before final confirmation. But in the (very unlikely) case you made a mistake, you can always delete an existing annotation by selecting "Delete" instead of the default "Add" annotation behavior. You will be given an extra warning when attempting to delete an existing annotation. Nevertheless, all annotation additions or deletions are recorded in the Annotation Database, so no information is lost. New actions performed on any element are simply stacked on top of previous information, so everyone can trace back the annotation history of any given element.
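
The "stacked" history described above behaves like an append-only log. The sketch below only illustrates that idea, together with the two required form fields. It does not describe how the Annotation Database is actually implemented, and the class, method and field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    element: str        # content of the "Element Info" field
    action: str         # "Add" or "Delete"
    description: str    # content of the "Description" field
    annotator: str      # who submitted the record


class AnnotationLog:
    """Append-only list of submissions; nothing is ever overwritten."""

    def __init__(self):
        self._records = []

    def submit(self, element, action="Add", description="", annotator=""):
        if not element.strip():
            raise ValueError("Element Info must contain an element name")
        if action not in ("Add", "Delete"):
            raise ValueError("action must be 'Add' or 'Delete'")
        # A description is required to add an annotation, not to delete one.
        if action == "Add" and not description.strip():
            raise ValueError("Description must contain some text")
        record = Record(element, action, description, annotator)
        self._records.append(record)
        return record

    def history(self, element):
        """All records for one element, oldest first."""
        return [r for r in self._records if r.element == element]

    def current(self, element):
        """Latest annotation, or None if absent or deleted."""
        records = self.history(element)
        if not records or records[-1].action == "Delete":
            return None
        return records[-1]
```

In this model a "Delete" does not remove anything: it is one more record on the stack, so the earlier annotation remains visible in the history of the element.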

4) Getting started

Every group of annotators should work on a specific chromosome, as decided in agreement with Monique.

- Go to the annotation entry page, which contains links to all chromosome maps. Click on the chromosome you intend to work on.

- The map opens at the beginning of the chromosome, and ORFs are colored according to the strength of their match in the SWISSPROT+trEMBL database. Notice that most Red and Green ORFs are already boxed with a black line: these are auto-annotated genes.
First focus your attention on this category; then, when you are more familiar with the database usage, you will be able to address more complex problems.

- Select the first boxed ORF in the map (pass the mouse pointer over it). Be careful when moving your pointer over the map: if you pass over an ORF other than the first one, you will change the selected ORF. Check that the "Element info" field in the 'CONTROLS' part of the window contains the correct ORF name.

- Gather any information you need about the selected element (ORF): push any of the 'Reddish' buttons found in the blue part of 'CONTROLS'. Any annotation that exists for the selected ORF can be found by pushing the "New Annotations" button. As stated above, auto-annotation has been performed with close comparison between K. lactis and S. cerevisiae (YEAST) in mind. If you find that the current "DESCRIPTION" is not satisfactory, you may have to change it.

- Make up your mind with all available results: all the database searches can be displayed using the "Reddish" buttons. If you want a really up-to-date BLAST result, push "Run PSI-Blast"; you will be taken to the NCBI psi-blast page, with the query field already filled with the selected ORF's sequence. If you are not very familiar with the interpretation of blast and/or pattern results, try improving your confidence by looking at many other genes, by asking a more expert colleague, or by searching the internet for information related to annotation skills. Unfortunately, it is not possible here to write a course on "the art of annotation", but every biologist today must have performed, at least from time to time, a BLAST search in public databases.
As a rule of thumb, if the best hit for a blast search has an expect value GREATER than 1e-10, then the similarity should be considered as POOR, i.e. the gene should be annotated in most cases as "Hypothetical protein". But even in this case, check the Pfam domain results ("Hits in HmmPfam"). If there is a significant match with a particular domain, you may end up with a "Hypothetical protein, containing xxx domain". You might as well choose to put "Hypothetical protein" in "DESCRIPTION" and put "Containing xxx domain" in the "COMMENT" field.
Pay attention to the lengths and alignment extent of matching sequences. If both sequences are about the same size and aligned over their entire length, then everything is fine. If one sequence is shorter than the other, or they don't align on the same part (head to tail, just in the middle or on a portion of the sequence, one sequence aligned with 2 different ones at different locations, etc.), then you must try to interpret the situation, and knowing biology greatly helps in this case. Many different explanations are possible of course, including an error in the sequence! The result of this analysis will be YOUR NEW ANNOTATION.
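
The rule of thumb and the alignment check above can be written down as a small sketch. It is only a first pass that cannot replace the annotator's judgment. The function names are hypothetical, and the 0.9 coverage cutoff is an assumption of this example (the text only says "aligned over their entire length"); the 1e-10 expect value threshold is the one given above.

```python
def propose_description(best_evalue, best_definition, pfam_domains=(),
                        threshold=1e-10):
    """Return a (DESCRIPTION, COMMENT) pair from the rule of thumb.

    An expect value greater than the threshold (or no hit at all,
    passed as None) is treated as poor similarity.
    """
    if best_evalue is not None and best_evalue <= threshold:
        return best_definition, ""
    comment = ""
    if pfam_domains:
        comment = "Containing " + ", ".join(pfam_domains) + " domain"
    return "Hypothetical protein", comment


def full_length_match(q_len, q_start, q_end, s_len, s_start, s_end,
                      min_coverage=0.9):
    """True if the alignment spans most of both query and subject.

    Coordinates are 1-based and inclusive, as in blast reports.
    The 0.9 default is an arbitrary cutoff chosen for this sketch.
    """
    q_cov = (q_end - q_start + 1) / q_len
    s_cov = (s_end - s_start + 1) / s_len
    return q_cov >= min_coverage and s_cov >= min_coverage
```

A hit with an expect value of 1e-3 and a kinase domain in the Pfam results gives ("Hypothetical protein", "Containing kinase domain"). When full_length_match returns False, the alignment needs the kind of interpretation described above (wrong start codon, multidomain protein, sequence error, etc.).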

- Determine whether functional annotation can be done for the selected ORF.
Obviously, no functional annotation is possible for "orphan" genes, that is, if the gene is at best described as "Hypothetical protein" or "weakly similar to some other gene". In any case, carefully review the results from the KOG and Pfam database searches ("Hits in KOG" and "Hits in HmmPfam"). If the similarity with YEAST is high enough, use the "CYGD:MIPS fun_cat" button, as explained above. This will give you the functional information for the YEAST gene, which in turn can be "captured" and used for annotating K. lactis.

- Submit your annotation:

• If the existing annotation seems to be ok, and there is no possible functional annotation, then just go to the next ORF (no need to change anything).
• If the existing annotation seems to be ok, AND you can propose a functional annotation, then copy the current annotation into the "DESCRIPTION" field in the green part of "CONTROLS"; otherwise, if your annotation is different from the original one, type your new ANNOTATION in this field. Select the correct function category in BOTH the KOG and MIPS systems, if possible (see The annotation submission form above). If the gene is an enzyme, and you have found the right EC number, then type it in the "EC :" field in 'CONTROLS'. Any other relevant information resulting from your analysis can be entered in one of the preexisting fields ("Start Codon", "Gene Name", "EC :", etc.) or, if it does not fit any of these labels, in the "COMMENT" field, in which you can type anything you want (a typical entry in the "COMMENT" field could be: "This gene is split in two separate sub-units in all other Ascomycota" or else: "Probable sequence error, 5' end of the gene is missing").
• If you believe an ORF should not be considered as a gene, select "Delete" instead of "Add" in the small "Annotation" popup menu (in 'CONTROLS'). You don't have to type any further information in that case.
• Press "SUBMIT" (in 'CONTROLS') and carefully review the information you've just typed in. If something is not correct, just correct the wrong item in the 'CONTROLS' part and press "SUBMIT" again.

• If everything is correct, press "Confirm", and you're done with this ORF. Your annotation has been incorporated in the genome annotation database. You can check this by pressing the "New Annotations" button, or by querying the selected gene on the annotation database search page (see above).

- Select the next boxed ORF on the map and start over.

- When you have reached the end of the first 100 kb of your chromosome, move to the next 100 kb part of the chromosome: press the "Right" button in the yellow part of 'CONTROLS'. The map will slide by the amount contained in the "Move offset:" field in the 3' direction (if you've pushed the "Right" button) or the 5' direction ("Left" button).
- Start over on this region as above.
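
Walking along a chromosome in 100 kb steps amounts to visiting consecutive windows. The short sketch below lists them; the function name is hypothetical, and the chromosome length in the usage note is an arbitrary example value, not a real K. lactis chromosome size.

```python
def map_windows(chromosome_length, offset=100_000):
    """Yield (start, end) of consecutive map windows, 1-based inclusive.

    `offset` plays the role of the "Move offset:" field; the last
    window is truncated at the end of the chromosome.
    """
    start = 1
    while start <= chromosome_length:
        yield (start, min(start + offset - 1, chromosome_length))
        start += offset
```

For a sequence of 250000 bp, this gives the windows (1, 100000), (100001, 200000) and (200001, 250000), which can also be entered directly as start / end coordinates in the SetUp part of 'CONTROLS'.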

Question: What about introns ?
Introns are currently analyzed separately by the Lyon and Orsay groups, and will be displayed in the near future in the maps and the annotation database. If you think you have identified one or more introns during a gene analysis, just put the number of introns in the field labeled "Intron", and put all other information in the "COMMENT" field.

Question: What about RNAs and other non-coding elements?
As above, all non-coding RNA species are treated separately by specialized programs and/or expert colleagues in the field, and will be incorporated in the database later. This topic is not a top priority at the moment, and you should not spend your time on this problem (in this annotation framework). If you do wish to contribute to RNA or other genome features, please contact us directly; we will be pleased to benefit from your expertise.

Send questions, problems, etc. about this database to: Yvan

Last modified on: Thu Jan 29 2004 17:32:20