1) What has been done
Steps 1-5 were performed automatically, then subjected to
an individual evaluation by the primary annotators (hence
"semi-automatic"). No alteration of the original
(automatic) annotation was possible at this stage, only an
assessment of the given choice (good / problem).
2) What is to be done
Whatever the respective pros and cons of each system,
annotators are requested to attribute a functional
annotation in BOTH classification schemes.
Although the functional categories in each system are
different, and sometimes do not overlap in an obvious way,
the functional attribution in one system should, as far as
possible, not contradict the attribution in the other.
3) How to do the work
Auto-annotated genes (see above) were used to build an
Annotation Database.
Chromosome maps display the actual organization of all
ORFs in all 6 frames on every chromosome.
- Jump directly to any ORF's genomic environment:
- Access to chromosome maps is also possible from any
search result page.
K. lactis annotation guidelines
This page is a short "user manual" for the K. lactis
annotation database.
This gave a set of 30727 ORFs (excluding mitochondrial
DNA), which is the starting material for this work.
Basically, 5 criteria were used to perform this
auto-annotation:
The parameters of the automatic annotation were adjusted
over several rounds, and some manual corrections were
made at the end.
However, it is very important to keep several facts in
mind:
- First, the annotation was not performed on the
completely finished genome, so a few regions were
overlooked at the time of this work (mostly sub-telomeric
regions, which were missing).
- Second, very limited attention has been given to
specific features such as introns, real versus potential
start codons, etc.
- Third and most important, no functional annotation has
been attempted at this level.
Most of the effort should go into functional annotation,
which is currently missing and which should be a
cornerstone of the genome analysis. Several specific tools
have been designed to help annotators in this regard (see
below), as this should constitute the most important part
of the work.
Several aspects of annotation have not yet been fully
incorporated in the database, including information about
RNAs, introns, promoters, and non-transcribed elements in
general. This should be done gradually in the near
future, maybe after I get a little more feedback on these
features and the best way to deal with them.
Annotators are expected to assess each gene's current
annotation (if it exists):
- In case of a problem, they should submit their own
annotation after a careful evaluation of all data
pertaining to that gene (see below); otherwise they can
leave the current one untouched.
- If an ORF has been wrongly given the status of a gene
(there are such cases), it can be deleted.
- If an ORF is obviously a gene and was overlooked in the
automatic annotation stage, it can be added, and the
annotator must provide an annotation.
- Any gene that can be identified with good confidence
(i.e. the annotation says it is probably an already known
gene because the similarity is good) should be classified
in a functional category according to 2 different
schemes:
This classification is specifically designed for
YEAST. Its main advantage is that it is based on the
considerable knowledge of yeast genetics and biology.
Its main disadvantage is that it is too specific for
people working outside the fungal field.
This is a functional classification of genes based on
orthology groups in completely sequenced genomes
within large domains: prokaryotes (archaea + bacteria)
or eukaryotes. The main advantages are that it is
widely accepted among various communities of
biologists, and that a big effort is made to predict the
function of so far hypothetical and/or poorly
characterized genes. The main disadvantage is that it
does not deal specifically with the needs of fungal
biologists.
b) submit your annotation after careful evaluation
Every ORF (all 30727 of them) from the K. lactis genomic
sequence has been matched against several public
databases (trEMBL, SWISSProt, YEAST, C. albicans, Pfam,
KOG, self match), and all results have been compiled in a
database called "Matches in Public databases".
This is a standard Query Form which allows searching
with multiple criteria. Its use is rather
straightforward for anyone who has already tried to
retrieve information from internet databases.
Results are displayed so as to give access to as much
information as possible for each ORF (gene maps and
sequences, blast and other tool results, direct links
to information in public resources).
This database allows one to work in a "transversal"
manner, as opposed to working on chromosome maps (see
below), which strictly depends on the spatial
arrangement of genes on the genome.
This database will contain all newly added annotations,
as well as those already defined in the automatic
annotation stage.
Its content is dynamically updated as soon as you add,
delete or modify an annotation (see below).
The database is designed to allow multiple-criteria
queries on annotated data, to ease genome analysis
once the annotation is finished.
This information is also important for giving accurate
feedback on the ongoing work of other collaborators on
other chromosomes, for example.
The chromosome maps are the main workbench for the
annotation process.
Understanding the main parts of the map design is
essential for using them well and efficiently.
Here follows a brief outline, but every user should
practice a little before actually "diving" into
annotation.
There are several ways to access chromosome maps:
- Open a map by pushing a 'K. lactis : Map &
Annotation' button found on the annotation entry page.
This is the actual 6-frame ORF map; it starts at
position 1 on the given chromosome sequence. It is a
dynamic map, where each rectangle displays the
location of an ORF on the sequence (direct strand
above the blue line, reverse strand below the blue
line). Boxed rectangles represent ORFs that have been
annotated, thus they represent real "GENES". Others
remain as shadows, and will never become genes unless
someone adds an annotation for them.
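To make the idea of a 6-frame ORF map concrete, here is a
minimal sketch in Python. It is not the program used to
build the maps: it assumes that an ORF is simply a
stop-free stretch in one reading frame, and the names
`six_frame_orfs`, `orfs_in_frame` and `min_codons` are
hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def orfs_in_frame(seq, frame, min_codons):
    """Yield 0-based half-open (start, end) spans of stop-free
    stretches in one reading frame (frame = 0, 1 or 2)."""
    start = frame
    for pos in range(frame, len(seq) - 2, 3):
        if seq[pos:pos + 3] in STOP_CODONS:
            if (pos - start) // 3 >= min_codons:
                yield start, pos
            start = pos + 3
    # Stretch still open at the end of the sequence.
    end = frame + ((len(seq) - frame) // 3) * 3
    if (end - start) // 3 >= min_codons:
        yield start, end


def six_frame_orfs(seq, min_codons=100):
    """Return (strand, frame, start, end) tuples, with 1-based
    inclusive coordinates on the direct strand. Reverse-strand
    frames are counted from the 3' end of the direct strand."""
    seq = seq.upper()
    n = len(seq)
    found = []
    for frame in range(3):
        for s, e in orfs_in_frame(seq, frame, min_codons):
            found.append(("+", frame + 1, s + 1, e))
    rc = reverse_complement(seq)
    for frame in range(3):
        for s, e in orfs_in_frame(rc, frame, min_codons):
            # Map reverse-strand span back to direct-strand coordinates.
            found.append(("-", frame + 1, n - e + 1, n - s))
    return found
```

Each returned tuple corresponds to one rectangle on the
map: "+" entries would be drawn above the blue line, "-"
entries below it.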
Passing over an ORF on the map with your pointer
selects that element. The selected element, which
appears in the field labeled "Element Info", can now
be subject to any action performed in the CONTROLS
part of the window.
This is a "control panel" where all actions actually
take place.
Please note that you can type (or paste) any valid
ORF name in the "Element Info" field to have it
selected. This works only for ORFs belonging to the
chromosome whose map is currently displayed.
SetUp: the yellow part is the "SetUp" of the map:
you can move left or right by the offset indicated
in the "Offset" field, zoom in or out, change the size
of the map (for small-screen users), select a
specific range to display by entering start / end
coordinates in the corresponding fields, and change
the coloring scheme of the map. This feature is
explained briefly in the first RESULT page shown when
a map is started.
Get Info: the blue part is where you ask for specific
information on the selected element.
Pushing any of the reddish buttons (*) will call up the
results for the selected element. In this example,
"r_klactIV0097" is the selected element, you have
just pushed the "Hits in sprot+trEMBL" button, and
the blast result is displayed in the RESULT part
(see below).
(*): Hits in SWISSPROT+trEMBL; Hits in SWISSPROT;
Hits in KOG; Hits in HmmPfam; Paralogs; Hits in
Yeast; Hits in C. albicans; CYGD:MIPS
function_category. Also, you can directly Blast the
selected element's sequence against the NCBI Blast
database (Run-psi-Blast). Other functions are
explained below.
Decide: the green part is the annotation submission
form.
Once you have decided to add, delete or modify an
annotation (see below for guidelines on this issue),
you must use this form to enter the relevant
information in the different fields. See below for
more details.
This is where all results are displayed. Note that you
can have results displayed in a separate window
instead, by selecting "Display results in New Window",
a toggle button located in the CONTROLS part of the
window.
First, a few rules to keep in mind while annotating:
- K. lactis is very close (phylogenetically speaking at
least) to S. cerevisiae. The semi-automatic annotation
relies very heavily on K. lactis / S. cerevisiae
gene-to-gene comparison. This sets a very strong frame
for K. lactis annotation, as S. cerevisiae is itself
very thoroughly annotated.
This means that it is not necessary to duplicate the
automatic annotation when the genes from both species
are very close to each other. In this (most frequent)
case, you only need to add a functional annotation to
the existing one: see below.
- Automatic annotation is very convenient, but it is in
no way smart!
We have certainly overlooked some obvious errors while
checking the complete annotation from this stage, so
these errors have to be fixed. More complex situations
(multidomain proteins, introns, no clear-cut similarity
evidence, etc.) are more likely to be (very) poorly
annotated at present. These cases are estimated at
about 20% of all K. lactis genes, and this is where a
major effort should be concentrated. Here again,
functional annotation should be attempted.
- A common source of error for beginner annotators is
to believe that genes can overlap (either on the same
strand or on opposite strands). This kind of
configuration is extremely unlikely, especially for
overlaps on opposite strands.
Here is an example of two "genes" overlapping: this was
wrongly generated by automatic annotation. What can
sometimes be seen on chromosome maps is that 2 good gene
candidates (in fact ORFs) overlap by a small amount.
This is mostly because the second "gene" does not begin
at the real start codon. Remember that ORFs are defined
as the longest possible open reading frame. Another,
more frequent situation occurs when a K. lactis gene is
obviously longer than all its orthologs. Selecting the
right start codon is a significant part of the job. A
specific tool for finding alternate start codons has
been designed for this purpose.
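If you work with your own tools on an exported list of
annotated genes, overlapping pairs can be spotted with a
few lines of Python. This is only a sketch: the `Gene`
record and its fields are hypothetical and do not
describe the actual database layout.

```python
from typing import NamedTuple


class Gene(NamedTuple):
    name: str
    start: int   # 1-based, inclusive, on the direct strand
    end: int     # 1-based, inclusive
    strand: str  # "+" or "-"


def overlap_length(a, b):
    """Number of positions shared by two genes (0 if none)."""
    return max(0, min(a.end, b.end) - max(a.start, b.start) + 1)


def find_overlaps(genes):
    """Return (name_a, name_b, shared_length) for every overlapping
    pair, whatever the strands."""
    ordered = sorted(genes, key=lambda g: g.start)
    hits = []
    for i, a in enumerate(ordered):
        for b in ordered[i + 1:]:
            if b.start > a.end:
                break  # sorted by start: no later gene can overlap a
            hits.append((a.name, b.name, overlap_length(a, b)))
    return hits
```

Every pair reported this way deserves a look at the start
codon of the downstream "gene" before anything else.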
You are now ready to start annotating:
Start working on the chromosome that you have
chosen. Open the chromosome map (which starts at
position 1) and start travelling along the sequence.
Select a coloring scheme (see above) that fits your
needs. The maps open by default with the
"SWISSPROT+trEMBL" hits coloring scheme, which uses the
most complete similarity database.
Boxed ORFs are genes already annotated in the automatic
annotation stage. Push the "New annotations" button to
see all annotations pertaining to any particular
element.
Select an element and examine results from hits in
sequence databases by pushing the appropriate button (see
CONTROLS above). You can also retrieve the ORF's sequence
and use it with your own tools: just click on the ORF in
the map.
Check the coding probability of the selected ORF by using
the "HMM prediction" coloring scheme. The different
coloring schemes will help you focus on meaningful ORFs
(those with a significant hit in any particular
database).
Don't waste your time on white or blue, box-less, small
ORFs! They are probably only "shadows".
Don't waste your time on small box-less colored (red,
blue or green) ORFs overlapping completely with a longer
gene (boxed ORF). They are probably only "shadows".
Pay attention to differences between gene lengths in
Blast results: they are often a useful guide to figure
out the correct start codon.
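As an illustration of how length differences can guide
the choice, the sketch below lists the in-frame ATG
codons of an ORF and picks the one giving a product
closest in length to an ortholog. It is not the start
codon tool of the database. It assumes the ORF sequence
is given in frame, without its stop codon, and that
length agreement alone is an acceptable first guess; the
function names are hypothetical.

```python
def in_frame_starts(orf_seq):
    """Return the 0-based nucleotide offsets of in-frame ATG codons."""
    seq = orf_seq.upper()
    return [i for i in range(0, len(seq) - 2, 3) if seq[i:i + 3] == "ATG"]


def closest_start(orf_seq, ortholog_aa_length):
    """Return the offset of the in-frame ATG whose product length
    (in codons) is closest to the ortholog length, or None if the
    ORF contains no in-frame ATG."""
    n_codons = len(orf_seq) // 3
    starts = in_frame_starts(orf_seq)
    if not starts:
        return None
    return min(starts,
               key=lambda s: abs((n_codons - s // 3) - ortholog_aa_length))
```

The result is only a candidate: the alignments themselves
should always be checked before changing a start codon.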
Start defining the functional
annotation once you have reached a good
confidence in what the blast alignments say.
There are 3 main tools to help you define the functional
annotation:
- Pfam: Protein families database of alignments and HMMs.
This is most useful to find common protein domains and
families.
- KOG: This is actually a highly structured and
categorized sequence database. A good hit with one of the
many "clusters" of sequences gives the functional
classification of the query sequence. The interpretation
is not completely straightforward, as it depends on the
quality of the alignment, so a "Guess" is made about the
functional pathway, category and KOG number confidence
level (this is based on a few simple rules: see this
page). Be aware, however, that the true fitting of
proteins into the COG database is done with the COGNITOR
program by members of E. Koonin's group at NCBI.
- MIPS-CYGD: This tool gives the functional
classification of the best S. cerevisiae (YEAST) hit for
the selected element. It is extremely important to
understand that if the best YEAST hit is very weak or the
alignment is problematic, then this information is not
relevant at all. In order to adopt a YEAST gene's
functional annotation for a K. lactis gene, the
similarity between both genes must be good (this is where
human decision takes place).
The MIPS classification of yeast genes is very detailed,
and spans several levels (functions, pathways, cellular
localization, phenotypes, EC number, etc.). For this
work, we will limit K. lactis functional annotation to a
single level of functional annotation.
Submit your annotation:
Your analysis is now done, and you want to submit your
annotation for a given ORF or gene (an already annotated
ORF).
The annotation submission form is in the green part of
the CONTROLS part.
This rather simple form requires only 2 items to be
filled in to allow submission: the "Element Info" field
must contain a valid element name and the "Description"
field must contain some text (this is the definition
that appears in the first line of all matches in a blast
result file, as well as the definition line in a fasta
file). All other fields are optional and left to your
personal judgment. If you do wish to add a functional
annotation, you must do it by selecting the relevant
categories in the "KOG" and "MIPS" pop-up menus. As you
select a category from the "KOG" or "MIPS" pop-up menu,
the contents of the "KOG number" and "MIPS
sub-categories" menus, respectively, are updated to
reflect the main menu choice: see an example here.
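The two-field rule can be written down as a small check,
for instance if you prepare annotations in a spreadsheet
before typing them in. This is a sketch only: the
dictionary keys `element_info` and `description` are
hypothetical stand-ins for the form fields, and the check
only tests that the fields are not empty, not that the
element name really exists.

```python
def validate_submission(form):
    """Return a list of problems; an empty list means the two
    required fields are filled in."""
    problems = []
    if not form.get("element_info", "").strip():
        problems.append('"Element Info" must contain an element name')
    if not form.get("description", "").strip():
        problems.append('"Description" must contain some text')
    return problems
```

All other keys are ignored here, since every other field
of the form is optional.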
Once you have filled in all necessary fields in the
submission form, initiate the submission by pressing the
"Submit" button. You now have a last opportunity to
check your annotation carefully before final
confirmation ("Confirm" button).
As soon as your submission has been acknowledged, the
content of the Annotation Database is updated, as well as
the chromosome map itself (but you must press the "Apply
Change" button in the CONTROLS part; just pressing
"reload" in your browser window won't work).
You are, of course, invited to carefully review your
submission before final confirmation. But in the (very
unlikely) case you made a mistake, you can always delete
an existing annotation by selecting "Delete" instead of
the default "Add" annotation behavior. You will get an
extra warning when attempting to delete an existing
annotation. In any case, all annotation additions and
deletions are recorded in the Annotation Database, so no
information is lost. New actions performed on any element
are simply stacked on top of previous information, so
everyone can trace back the annotation history of any
given element.
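The "stacking" behavior described above can be pictured
as an append-only history per element. The sketch below
is an illustration, not the actual database schema: the
class and field names are hypothetical, and it assumes
that the most recent action decides the current state of
an element.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class AnnotationEvent:
    action: str            # "Add" or "Delete"
    annotator: str
    description: str = ""


@dataclass
class ElementHistory:
    element: str
    events: List[AnnotationEvent] = field(default_factory=list)

    def record(self, event):
        """Stack a new action on top of the previous ones;
        nothing is ever removed from the history."""
        self.events.append(event)

    def current(self) -> Optional[AnnotationEvent]:
        """Return the annotation in force, or None if the element
        has no annotation or the last action was a deletion."""
        if not self.events or self.events[-1].action == "Delete":
            return None
        return self.events[-1]
```

A deletion is therefore just one more entry on the stack:
the earlier "Add" stays visible in `events`, which is what
makes the history traceable.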
4) Getting started
Every group of annotators should work on a specific
chromosome, as decided in agreement with Monique.
- Go to the annotation entry page, which contains links
to all chromosome maps. Click on the chromosome you
intend to work on.
- The map opens at the beginning of the chromosome, and
ORFs are colored according to the strength of their match
in the SWISSPROT+trEMBL database. Notice that most Red
and Green ORFs are already boxed with a black line: these
are auto-annotated genes.
First focus your attention on this category; then, when
you are more familiar with using the database, you will
be able to address more complex problems.
- Select the first boxed ORF in the map (pass the mouse
pointer over it). Be careful when moving your pointer
over the map: if you pass over an ORF other than the
first one, you will change the selected ORF. Check that
the "Element info" field in the 'CONTROLS' part of the
window contains the correct ORF name.
- Gather any information you need about the selected
element (ORF): push any of the 'Reddish' buttons found in
the blue part of 'CONTROLS'. Any annotation that exists
for the selected ORF can be found by pushing the "New
Annotation" button. As stated above, auto-annotation was
performed on the basis of close comparison between
K. lactis and S. cerevisiae (YEAST). If you find that the
current "DESCRIPTION" is not satisfactory, you may have
to change it.
- Make up your mind using all available results: all the
database searches can be displayed using the "Reddish"
buttons. If you want a really up-to-date BLAST result,
push "Run PSI-Blast": you will be taken to the NCBI
psi-blast page, with the query field already filled with
the selected ORF's sequence. If you are not very familiar
with the interpretation of blast and/or pattern results,
try improving your confidence by looking at many other
genes, by asking a more expert colleague, or by searching
the internet for information on annotation skills.
Unfortunately, it is not possible to write a course on
"the art of annotation" here, but every biologist today
must have performed a BLAST search in public databases at
least from time to time.
As a rule of thumb, if the best hit for a blast search
has an expect value GREATER than 1e-10, then the
similarity should be considered POOR, i.e. the gene
should be annotated in most cases as "Hypothetical
protein". But even in this case, check the Pfam domain
results ("Hits in HmmPfam"). If there is a significant
match with a particular domain, you may end up with a
"Hypothetical protein, containing xxx domain". You may
also choose to put "Hypothetical protein" in
"DESCRIPTION" and "Containing xxx domain" in the
"COMMENT" field.
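The rule of thumb can be summarized as a small Python
sketch. The 1e-10 cutoff comes from the text above;
everything else (function names, treating "no hit" as
poor, returning None when the similarity is good enough
for the annotator to decide) is an assumption made for
the example.

```python
EVALUE_CUTOFF = 1e-10  # expect values above this count as POOR


def similarity_is_poor(best_evalue):
    """best_evalue is the expect value of the best blast hit,
    or None when there is no hit at all."""
    return best_evalue is None or best_evalue > EVALUE_CUTOFF


def default_description(best_evalue, pfam_domain=None):
    """Suggest a starting "DESCRIPTION" for poorly matching genes.
    Returns None when the similarity is not poor, because the
    description then depends on the annotator's reading of the hit."""
    if not similarity_is_poor(best_evalue):
        return None
    if pfam_domain:
        return "Hypothetical protein, containing %s domain" % pfam_domain
    return "Hypothetical protein"
```

Since the rule applies "in most cases" only, the
suggestion is a default to start from, not a verdict.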
Pay attention to the lengths and alignment extent of
matching sequences. If both sequences are about the same
size and aligned over their entire length, then
everything is fine. If one sequence is shorter than the
other, or they do not align over the same part (head to
tail, only in the middle or over a portion of the
sequence, one sequence aligned with 2 different ones at
different locations, etc.), then you must try to
interpret the situation, and knowing biology greatly
helps in this case. Many different explanations are
possible of course, including an error in the sequence!
The result of this analysis will be YOUR NEW ANNOTATION.
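For those who prefer numbers, the checks above can be
approximated from the coordinates given in a blast
report. In this sketch the thresholds (80% coverage,
length ratio of 1.2) are arbitrary assumptions, not
values used by the database, and the function names are
hypothetical.

```python
def coverage(aln_start, aln_end, seq_length):
    """Fraction of a sequence covered by the alignment
    (1-based inclusive coordinates)."""
    return (aln_end - aln_start + 1) / seq_length


def alignment_flags(q_start, q_end, q_len, s_start, s_end, s_len,
                    min_cov=0.8, max_len_ratio=1.2):
    """Return a list of warnings for one query/subject alignment;
    an empty list means similar sizes and near full-length alignment."""
    flags = []
    if max(q_len, s_len) / min(q_len, s_len) > max_len_ratio:
        flags.append("length mismatch")
    if coverage(q_start, q_end, q_len) < min_cov:
        flags.append("query partially aligned")
    if coverage(s_start, s_end, s_len) < min_cov:
        flags.append("subject partially aligned")
    return flags
```

A flag does not say what is wrong; it only points to the
alignments that need a biological interpretation.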
- Determine whether functional annotation can be done
for the selected ORF.
No functional annotation is possible for "orphan" genes,
that is, if the gene is at best described as
"Hypothetical protein" or "weakly similar to some other
gene". In any case, carefully review the results from
the KOG and Pfam database searches ("Hits in KOG" and
"Hits in HmmPfam"). If the similarity with YEAST is high
enough, use the "CYGD:MIPS fun_cat" button, as explained
above. This will give you the functional information for
the YEAST gene, which in turn can be "captured" and used
for annotating K. lactis.
- Submit your annotation:
If the existing annotation seems to be ok, and there is
no possible functional annotation, then just go to the
next ORF (no need to change anything).
If the existing annotation seems to be ok AND you can
propose a functional annotation, then copy the current
annotation into the "DESCRIPTION" field in the green
part of "CONTROLS"; if your annotation differs from the
original one, type your new ANNOTATION in this field
instead. Select the correct functional category in BOTH
the KOG and MIPS systems, if possible (see the
annotation submission form above). If the gene encodes
an enzyme, and you have found the right EC number, then
type it in the "EC :" field in 'CONTROLS'. Any other
relevant information resulting from your analysis can be
entered either in one of the preexisting fields ("Start
Codon", "Gene Name", "EC :", etc.) or, if it does not fit
any of these labels, in the "COMMENT" field, in which you
can type anything you want (typical entries in the
"COMMENT" field could be: "This gene is split in two
separate sub-units in all other Ascomycota" or
"Probable sequence error, 5' end of the gene is
missing").
If you believe an ORF should not be considered a gene,
select "Delete" instead of "Add" in the small
"Annotation" popup menu (in 'CONTROLS'). You don't have
to type any further information in that case.
Press "SUBMIT" (in 'CONTROLS') and carefully review the
information you have just typed in. If something is not
correct, just correct the wrong item in the 'CONTROLS'
part and press "SUBMIT" again.
If everything is correct, press "Confirm": you are done
with this ORF. Your annotation has been incorporated in
the genome annotation database. You can check this by
pressing the "New Annotations" button, or by querying the
selected gene on the annotation database search page
(see above).
- Select the next boxed ORF on the map and start over.
- When you have reached the end of the first 100 kb of
your chromosome, move to the next 100 kb part of the
chromosome by pressing the "Right" button in the yellow
part of 'CONTROLS'. The map will slide by the amount
contained in the "Move offset:" field, in the 3'
direction (if you pushed the "Right" button) or the 5'
direction ("Left" button).
- Start over on this region as above.
Question: What about introns?
Introns are currently analyzed separately by the Lyon
and Orsay groups, and will be displayed in the maps and
the annotation database in the near future. If you think
you have identified one or more introns during a gene
analysis, just put the number of introns in the field
labeled "Intron", and put all other information in the
"COMMENT" field.
Question: What about RNAs and other non-coding elements?
As above, all non-coding RNA species are treated
separately by specialized programs and/or expert
colleagues in the field, and will be incorporated later
in the database. This topic is not a top priority at the
moment, and you should not spend your time on this
problem (in this annotation framework). If you do wish to
contribute to RNA or other genome features, please
contact us directly; we will be pleased to benefit from
your expertise.