Most scientists working in the biological sciences or an overlapping field have encountered various ways of identifying genes and proteins. There are many different types of identifiers. For example, searching UniProt for the PDB ID 2IW3 (which represents elongation factor 3 in yeast strain S288C) gives us a results column labeled “Gene names” that includes no fewer than six (!) ways to refer to the gene that produces this particular protein. This can be frustrating – it is easy to get into trouble when you think you have a consistent gene naming scheme but do not, especially if you want to cross-reference gene lists.
It is not my intention to attempt to unify these different naming schemes. That would almost certainly just add a seventh convention to the pile and increase the general confusion.
Instead, I will be telling you about my favorite naming convention: the Open Reading Frame (ORF) nomenclature. These seven-character identifiers of the type “YLR249W” provide the address of a gene within the genome of a specific organism.
Character 1: a single letter identifying the organism. In this case, Y tells us that we are referring to yeast.
Character 2: a single letter referring to the chromosome on which this gene can be found. Yeast contains chrI, chrII, …, chrXVI which are denoted A, B, …, P. So, in the example of YLR249W we now know that we have a yeast ORF on chromosome XII.
Character 3: L or R, denoting the left or right arm of the chromosome. Continuing our example, our ORF is on the right arm of yeast chromosome XII.
Characters 4-6: these three digits give the ORF number within this chromosome arm, counted outward from the centromere. We can now say that we are looking at ORF 249 on the right arm of yeast chromosome XII.
Character 7: W or C for Watson or Crick, referring to the chromosome strand on which this ORF is found. By convention, the Watson strand is the one whose 5′ end lies at the left telomere, and the Crick strand is its complement. We now have a complete gene address: the 249th ORF on the right arm of yeast chromosome XII, on the Watson strand.
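The character-by-character reading above can be sketched in a few lines of Python. This is only a minimal illustration, not an official parser: the names `ORF_PATTERN`, `OrfAddress` and `parse_orf` are my own, and the pattern assumes the plain seven-character form described here, so any identifier with extra characters (a suffix, for instance) is rejected.

```python
import re
from typing import NamedTuple

# Assumed pattern for the seven-character form only:
# Y, chromosome letter A-P, arm L/R, three digits, strand W/C.
ORF_PATTERN = re.compile(r"Y([A-P])([LR])(\d{3})([WC])")

# Chromosome letters A..P map onto chrI..chrXVI.
ROMAN = ["I", "II", "III", "IV", "V", "VI", "VII", "VIII",
         "IX", "X", "XI", "XII", "XIII", "XIV", "XV", "XVI"]


class OrfAddress(NamedTuple):
    """Hypothetical container for the decoded parts of an ORF name."""
    chromosome: str  # Roman numeral, e.g. "XII"
    arm: str         # "left" or "right"
    number: int      # ORF number on that arm
    strand: str      # "Watson" or "Crick"


def parse_orf(name: str) -> OrfAddress:
    """Decode an identifier such as 'YLR249W' into its address."""
    match = ORF_PATTERN.fullmatch(name.strip().upper())
    if match is None:
        raise ValueError(f"not a seven-character yeast ORF name: {name!r}")
    chrom_letter, arm, number, strand = match.groups()
    return OrfAddress(
        chromosome=ROMAN[ord(chrom_letter) - ord("A")],
        arm="left" if arm == "L" else "right",
        number=int(number),
        strand="Watson" if strand == "W" else "Crick",
    )


if __name__ == "__main__":
    print(parse_orf("YLR249W"))
```

Because the function raises a ValueError for anything that does not fit the pattern, it can also serve as a quick check that a gene list really uses this naming scheme throughout before you cross-reference it.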
These ORF identifiers pack a surprising amount of information into a compact format, and with some practice they can be read out in plain English. One shortcoming is that they say nothing about the function of the gene, whereas the gene name (YEF3 for our example) sometimes, though not always, contains at least a hint of functional information.