A bit of evolution
What is a tree of life? What is a mutation? What is natural selection? How do you build a tree of life with molecular data? What are 'orthologs'?
Here is a sketch of a tree of life proposed by Darwin in 1837 (First Notebook on Transmutation of Species).
In Darwin's time, species were first compared on the basis of their morphology - for example, analyses of the size, shape and structure of bones, the presence of hair or scales, or, for plants, the position of leaves on a stem.
The evolution of the tree of life...
New technologies (DNA sequencing), access to sequences (DNA & proteins) and the advancement of bioinformatics and statistical techniques have changed the way species are classified. We can now estimate the degree of relatedness between species and build a tree of life by comparing the sequences of their genes and/or proteins.
How does it work?
To understand how it is possible to build a tree of life by comparing genes or protein sequences, it is necessary to (re)dive into some basic notions of evolution.
Here, for example, is the tree of life published by scientists in 2006. They compared 31 proteins from 191 different species.
The molecular basis for evolution
The molecular basis for evolution is the random changes that occur in DNA: mutations.
Mutations and natural selection
Depending on where they are located in the genome, mutations can affect mechanisms which are fundamental to the biology, physiology, and development of organisms. Other mutations have little or no impact:
– Certain mutations have no impact because they are found outside of genes and outside the regions which regulate gene expression.
– Mutations located within introns generally have less impact than those located in exons.
– Some mutations change one or more amino acids without affecting the 3D structure or the functioning of the protein.
If a mutation leads to an amino acid change that modifies the function of a protein, and this change confers an advantage on an individual in a particular environment, that individual will be able to survive better and/or reproduce more: the mutation will then be passed on to the following generations.
And sometimes, a new characteristic, even a new species appears. But if the environment changes again, this new species may not survive!
Note that in multicellular organisms, only the mutations present in the DNA of the cells involved in reproduction (ovules, sperm, pollen, spores, ...) can have an impact on the following generations. The impacts are often visible only after several generations. And as the time between two generations is sometimes very long (25 years for humans or turtles, for example), it is difficult to 'see' evolution!
Natural selection: an example
In the following example, every circle represents an individual. All these individuals belong to the same species.
The red mutation is deleterious: it is possibly the cause of a rare genetic disease. The green mutation gives an advantage, in the environment in which they live, to the individuals who carry it. As a consequence, over many generations, the number of individuals with the green mutation will increase, until the green mutation is the most frequent within the population in the given environment. The red mutation will disappear. In another environment, it is the red mutation that could have been selected!
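The way the frequency of a mutation changes over generations can be illustrated with a small calculation. The sketch below assumes a deliberately simplified model that is not part of the example above: a very large population, one copy of the gene per individual, and a constant advantage or disadvantage in every generation. The function names and the numbers (a starting frequency of 1%, an advantage or disadvantage of 5%) are arbitrary choices made for illustration.

```python
def next_frequency(p, advantage):
    """Frequency of a mutation in the next generation.

    p         -- current frequency of the mutation (between 0 and 1)
    advantage -- relative advantage of carriers: they leave (1 + advantage)
                 offspring for every 1 offspring of a non-carrier
                 (a negative value means a disadvantage)
    """
    mean_fitness = p * (1 + advantage) + (1 - p)
    return p * (1 + advantage) / mean_fitness


def simulate(p0, advantage, generations):
    """Return the list of frequencies from generation 0 to `generations`."""
    freqs = [p0]
    for _ in range(generations):
        freqs.append(next_frequency(freqs[-1], advantage))
    return freqs


# Hypothetical values: both mutations start in 1% of individuals.
green = simulate(p0=0.01, advantage=0.05, generations=400)   # advantageous
red = simulate(p0=0.01, advantage=-0.05, generations=400)    # deleterious

for generation in (0, 100, 200, 400):
    print(generation, round(green[generation], 4), round(red[generation], 6))
```

In this toy model, the green mutation slowly becomes the most frequent one while the red mutation fades away. Swapping the signs of the two `advantage` values mimics another environment in which the red mutation would be selected instead.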
The bird feeder
Beak length in birds is influenced by a number of genes. But one gene in particular attracted the attention of researchers in 2017: COL4A5.
In the populations of great tits studied, this gene had two alleles, T and C. The C allele is associated with a longer beak and is more frequently found in the populations of UK great tits than in the populations of Dutch great tits.
The selection for longer beaks may be specific to the UK: something in the environment in this country has favored the great tits with the C allele and a long beak.
Hypothesis: The long beaks might confer an advantage in the UK, as they could allow the birds to access the food provided in bird feeders more easily, bird feeders which are particularly frequent in the gardens of this country.
Thanks to trackers placed on the birds, the researchers discovered that the tits with the C allele used the bird feeders more often than those with the T allele.
This observation supports the researchers' hypothesis: the availability of food in the bird feeders could give an advantage to birds with longer beaks, which could access food more easily.
The evolution of the length of the beaks of these birds has been observed for 25 years. The genetic analyses have been carried out on more than 2,300 birds: 490,000 mutations have been studied. But validating these results will require additional, meticulous studies on the genetics of these birds and on their environment.
Building a tree of life with molecular data: important concepts
1. Reference genome
Each species is made up of a multitude of individuals. Each individual is unique, and the genome of each individual within a species is unique!
In order to compare species on the basis of their genome, biologists work with a reference genome that has been chosen for each species. For each species whose genome has been sequenced, there is a reference genome sequence and a set of 'reference' gene and protein sequences.
It is thus possible to compare either the sequences of the entire genomes (but this is not simple and does not always make sense), or the sequences of genes or proteins.
And that's not all! We must compare what is comparable!
2. Orthology: another important concept
To classify species, we must compare the 'same' characters! It is crucial to compare the sequence of the same gene or the same protein present in different species. Such genes or proteins are called 'orthologs'.
The quest for these orthologs is a very important step in studying evolution at the molecular level.
Here is the DNA sequence corresponding to the insulin gene in the reference genome for man, chimpanzee, cattle and fish (Danio rerio). In red: the exons.
These 4 genes are ‘orthologous’: they code for a similar protein, which has the same biological function and a common ancestor.
Insulin exists in all vertebrates, from hagfish (myxines, a very ancient taxon whose members probably already lived some 100 million years ago) to humans (a more recent taxon, which appeared about 7 million years ago).
Here are the amino acid sequences of the insulin protein in man, chimpanzee, fish and cow:
Have fun finding the differences!
Building a tree of life with molecular data: the basic concept
It is possible to construct a tree of life by comparing the amino acid sequences of these orthologous proteins.
We can, for instance, compare the amino acid sequences and ‘count’ the differences. The simplified procedure can be carried out manually in the following example:
These observations can be represented by a tree.
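Counting the differences by hand quickly becomes tedious, so here is a minimal sketch of the same idea in Python. The four sequences are short made-up examples, not real insulin sequences, and the species names and function names are hypothetical. The sketch also assumes that the sequences are already aligned and have the same length.

```python
from itertools import combinations

# Toy aligned sequences, invented for illustration (not real protein data).
aligned = {
    "species_A": "MKTLLVAGCS",
    "species_B": "MKTLLVAGCT",
    "species_C": "MKSLLIAGCT",
    "species_D": "MRSLMIAGAT",
}


def count_differences(seq1, seq2):
    """Count the positions at which two aligned sequences differ."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned and of equal length")
    return sum(a != b for a, b in zip(seq1, seq2))


def difference_table(sequences):
    """Return the number of differences for every pair of species."""
    return {
        (name1, name2): count_differences(sequences[name1], sequences[name2])
        for name1, name2 in combinations(sequences, 2)
    }


table = difference_table(aligned)
for (name1, name2), n in table.items():
    print(f"{name1} vs {name2}: {n} difference(s)")
```

The fewer the differences between two species, the closer together they are placed in the tree.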
Build trees with the program Philophylo (in FR)
- Select a protein.
- Philophylo searches for the sequences of this protein in different species in the UniProtKB/Swiss-Prot database.
- Philophylo compares protein sequences… in bioinformatics language, it builds a ‘multiple sequence alignment’.
- Bioinformatics programs evaluate the differences or similarities observed in the alignment. The result is ‘modeled’ as a tree. The nodes (branching points) correspond to hypothetical ancestral organisms.
- Remark: Philophylo does not build a genuine phylogenetic tree (the calculations would be much more complicated and too long!).
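To give an idea of how differences between sequences can be turned into a tree, here is a minimal sketch of one simple, classic approach: repeatedly grouping the two closest clusters and averaging their distances to the others (a method known as UPGMA). The method Philophylo actually uses is not described here, so this sketch should not be read as its algorithm. The species names and difference counts are hypothetical values chosen for illustration.

```python
from itertools import combinations

# Hypothetical numbers of differences between four species (toy data).
toy_distances = {
    ("A", "B"): 1,
    ("A", "C"): 3,
    ("A", "D"): 6,
    ("B", "C"): 2,
    ("B", "D"): 5,
    ("C", "D"): 3,
}


def upgma(distances):
    """Group species step by step, closest first.

    distances -- dict mapping (name1, name2) pairs to a distance
    Returns the tree as nested tuples; each tuple is a branching point.
    """
    # Store each distance under an unordered pair of clusters.
    dist = {frozenset(pair): d for pair, d in distances.items()}
    # Each cluster is mapped to the number of species it contains.
    sizes = {name: 1 for pair in distances for name in pair}
    while len(sizes) > 1:
        # Pick the two clusters separated by the smallest distance.
        a, b = min(combinations(sizes, 2), key=lambda p: dist[frozenset(p)])
        merged = (a, b)
        size_a, size_b = sizes.pop(a), sizes.pop(b)
        # Distance from the new cluster to each remaining cluster:
        # the average of the old distances, weighted by cluster size.
        for other in sizes:
            dist[frozenset((merged, other))] = (
                size_a * dist[frozenset((a, other))]
                + size_b * dist[frozenset((b, other))]
            ) / (size_a + size_b)
        sizes[merged] = size_a + size_b
    return next(iter(sizes))


tree = upgma(toy_distances)
print(tree)
```

In the nested result, species grouped in the innermost tuple are the closest 'cousins'. Genuine phylogenetic methods use more elaborate statistical models of how sequences change over time.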
Who is the ‘cousin’ of the cucumber?
– You can find out who has a common ancestor with the cucumber by building a phylogenetic tree with the Ethylene Receptor sequences.
Who is the ‘cousin’ of the dodo or the mammoth?
– You can find out who has a common ancestor with the dodo or the mammoth by building a phylogenetic tree with the Cytochrome B sequences.
Experts compare tens of thousands of sequences, using complex bioinformatics and statistical programs.
To build this 'new' tree of life in 2016, the researchers compared 16 different proteins from 3,083 species.
The hypothetical common ancestor of all species (located in the center of the tree) is called LUCA (Last Universal Common Ancestor).
It theoretically lived 3.5 to 4 billion years ago and would have been composed of only one cell.
Source: The physiology and habitat of the last universal common ancestor (2016) - Physiology, phylogeny, and LUCA (2016)
These trees of life are continuously updated with new data!
But this history of life will always remain approximate, because we do not have and will never have access to all the sequences of all the organisms that live or have lived on Earth!
The challenges, an overview
The challenges of building a tree of life with molecular data are many:
(1) To have access to the genome sequences of the species of interest (sequencing),
(2) To have access to information about the location of genes within the genome sequences & to determine the corresponding protein sequences (annotation),
(3) To determine which gene(s) or protein(s) are 'orthologs' (‘quest for orthologs’).
And that's where bioinformatics comes in!