So far, each chapter of this book has concentrated on a gene or genes, tacitly assuming that they are the things that matter in the genome. Genes, remember, are stretches of DNA that comprise the recipe for proteins. But ninety-seven per cent of our genome does not consist of true genes at all. It consists of a menagerie of strange entities called pseudogenes, retropseudogenes, satellites, minisatellites, microsatellites, transposons and retrotransposons: all collectively known as 'junk DNA', or sometimes, probably more accurately, as 'selfish DNA'. Some of these are genes of a special kind, but most are just chunks of DNA that are never transcribed into the language of protein. Since the story of this stuff follows naturally from the tale of sexual conflict related in the last chapter, this chapter will be devoted to junk DNA.
Fortunately this is a good place to tell the story, because I have nothing more particular to say about chromosome 8. That is not to imply that it is a boring chromosome, or that it possesses few genes, just that none of the genes yet found on chromosome 8 has caught my rather impatient attention. (For its size, chromosome 8 has been relatively neglected, and is one of the least mapped chromosomes.) Junk DNA is found on every chromosome. Yet, ironically, junk DNA is the first part of the human genome that has found a real, practical, everyday use in the human world. It has led to DNA fingerprinting.
Genes are protein recipes. But not all protein recipes are desirable.
The commonest protein recipe in the entire human genome is the gene for a protein called reverse transcriptase. The reverse transcriptase gene serves no purpose at all as far as the human body is concerned. If every copy of it were carefully and magically removed from the genome of a person at the moment of conception, the person's health, longevity and happiness would be more likely to be improved than damaged. Reverse transcriptase is vital for a certain kind of parasite. Its gene is an extremely useful, nay essential, part of the genome of the AIDS virus: a crucial contributor to its ability to infect and kill its victims. For human beings, in contrast, the gene is a nuisance and a threat. Yet it is one of the commonest genes in the whole genome. There are several hundred copies of it, possibly thousands, spread about the human chromosomes. This is an astonishing fact, akin to discovering that the commonest use of cars is for getting away from crimes. Why is it there?
A clue comes from what reverse transcriptase does. It takes an RNA copy of a gene, copies it back into DNA and stitches it back into the genome. It is a return ticket for a copy of a gene. By this means the AIDS virus can integrate a copy of its own genome into human DNA the better to conceal it, maintain it and get it efficiently copied. A good many of the copies of the reverse transcriptase gene in the human genome are there because recognisable 'retroviruses' put them there, long ago or even relatively recently. There are several thousand nearly complete viral genomes integrated into the human genome, most of them now inert or missing a crucial gene. These 'human endogenous retroviruses', or Hervs, account for 1.3% of the entire genome. That may not sound like much, but 'proper' genes account for only 3%. If you think being descended from apes is bad for your self-esteem, then get used to the idea that you are also descended from viruses.
But why not cut out the middle man? A viral genome could drop most of the virus's genes and keep just the reverse transcriptase gene. Then this streamlined parasite could give up the laborious business of trying to jump from person to person in spit or during sex, and instead just hitchhike down the generations within its victims' genomes. A true genetic parasite. Such 'retrotransposons' are far commoner even than retroviruses. The commonest of all is a sequence of 'letters' known as a LINE-1. This is a 'paragraph' of DNA, between a thousand and six thousand 'letters' long, that includes a complete recipe for reverse transcriptase near the middle.
LINE-1s are not only very common (there may be 100,000 copies of them in each copy of your genome) but they are also gregarious, so that the paragraph may be repeated several times in succession on the chromosome. They account for a staggering 14.6% of the entire genome, that is, they are nearly five times as common as 'proper' genes. The implications of this are terrifying. LINE-1s have their own return tickets. A single LINE-1 can get itself transcribed, make its own reverse transcriptase, use that reverse transcriptase to make a DNA copy of itself and insert that copy anywhere among the genes. This is presumably how there come to be so many copies of LINE-1 in the first place. In other words, this repetitive 'paragraph' of 'text' is there because it is good at getting itself duplicated, and for no other reason.
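The copy-and-paste logic of a LINE-1 can be made concrete with a toy simulation. What follows is only an illustrative sketch in Python: the function name `simulate_copy_and_paste`, the list-of-labels 'genome', the starting size and the per-generation copy chance are all assumptions invented for the example, not measured biology.

```python
import random

def simulate_copy_and_paste(generations, copy_chance=0.5, seed=0):
    """Toy model of a self-copying 'paragraph'.

    The genome is a plain list of labels. Each generation, every
    LINE-1-like element has `copy_chance` (a made-up number) of
    pasting one new copy of itself at a random position.
    Returns the number of elements after each generation.
    """
    rng = random.Random(seed)
    genome = ["gene"] * 20 + ["LINE-1"]  # hypothetical starting genome
    counts = [genome.count("LINE-1")]
    for _ in range(generations):
        # Decide first how many elements copy themselves this round.
        new_copies = sum(
            1 for stretch in genome
            if stretch == "LINE-1" and rng.random() < copy_chance
        )
        for _ in range(new_copies):
            # The copy lands anywhere among the genes.
            genome.insert(rng.randrange(len(genome) + 1), "LINE-1")
        counts.append(genome.count("LINE-1"))
    return counts

counts = simulate_copy_and_paste(generations=10)
print(counts)
```

Because every new copy can itself be copied, the count never falls and can at most double in each generation. Nothing in the model requires the element to do anything useful for its host.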
'A flea hath smaller fleas that on him prey; and these have smaller fleas to bite 'em, and so proceed ad infinitum.' If LINE-1s are about, they, too, can be parasitised by sequences that drop the reverse transcriptase gene and use the ones in LINE-1s. Even commoner than LINE-1s are shorter 'paragraphs' called Alus. Each Alu contains between 180 and 280 'letters', and seems to be especially good at using other people's reverse transcriptase to get itself duplicated. The Alu text may be repeated a million times in the human genome, amounting to perhaps ten per cent of the entire 'book'.2
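The proportions quoted so far can be checked with a few lines of arithmetic. The sketch below, in Python, takes its percentages and copy numbers from the text; the one outside assumption, flagged in the code, is that a single copy of the human genome is roughly three billion 'letters' long.

```python
# Figures quoted in the text (per cent of the whole genome).
PROPER_GENES_PCT = 3.0
LINE1_PCT = 14.6

# Assumption, not stated in this passage: one copy of the human
# genome is roughly three billion 'letters' long.
GENOME_LETTERS = 3_000_000_000

# Alu figures quoted in the text.
ALU_COPIES = 1_000_000
ALU_MIN_LETTERS, ALU_MAX_LETTERS = 180, 280

# How many times commoner LINE-1s are than 'proper' genes.
line1_vs_genes = LINE1_PCT / PROPER_GENES_PCT

# Share of the genome taken up by a million Alus, low and high estimate.
alu_low_pct = 100 * ALU_COPIES * ALU_MIN_LETTERS / GENOME_LETTERS
alu_high_pct = 100 * ALU_COPIES * ALU_MAX_LETTERS / GENOME_LETTERS

print(f"LINE-1s versus proper genes: {line1_vs_genes:.1f} times")
print(f"A million Alus: {alu_low_pct:.1f}% to {alu_high_pct:.1f}% of the genome")
```

On these assumptions LINE-1s come out a little under five times as common as proper genes, and a million Alus at the longer length cover somewhat less than a tenth of the genome, which is in line with 'perhaps ten per cent'.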
For reasons that are not entirely clear, the typical Alu sequence bears a close resemblance to a real gene, the gene for a part of a protein-making machine called the ribosome. This gene, unusually, has what is called an internal promoter, meaning that the message 'READ ME' is written in a sequence in the middle of the gene. It is thus an ideal candidate for proliferation, because it carries the signal for its own transcription and does not rely on landing near another such promoter sequence. As a result, each Alu gene is probably a 'pseudogene'. Pseudogenes are, to follow a common analogy, rusting wrecks of genes that have been holed below the waterline by a serious mutation and sunk. They now lie on the bottom of the genomic ocean, gradually growing rustier (that is, accumulating more mutations) until they no longer even resemble the gene they once were. For example, there is a rather nondescript gene on chromosome 9. If you take a copy of it and then probe the genome for sequences that resemble it, you will find them at fourteen locations on eleven chromosomes: fourteen ghostly hulks that have sunk. They were redundant copies that, one after another, mutated and stopped being used. The same may well be true of most genes: for every working gene, there are a handful of wrecked copies elsewhere in the genome. The interesting thing about this particular set of fourteen is that they have been sought not just in people, but in monkeys, too. Three of the human pseudogenes were sunk after the split between Old-World monkeys and New-World monkeys. That means, say the scientists breathlessly, they were relieved of their coding functions 'only' around thirty-five million years ago.3
Alus have proliferated wildly, but they too have done so in comparatively recent times. Alus are found only in primates, and are divided into five different families, some of which have appeared only since the chimpanzees and we parted company (that is, within the last five million years). Other animals have different short repetitive 'paragraphs'; mice have ones called B1s.