A field guide · est. by schlein-lab

Pseudogenes.
Hidden in plain sight.

GENCODE v49 lists 15 198 of them in the human genome. For half a century we called them junk DNA. They aren't. Some regulate cancer genes. Some confound clinical diagnostics. All of them are part of the story of how genomes evolve.

§ I — What is a pseudogene

Almost a gene.
Not quite.

A pseudogene is a piece of DNA that looks like a gene but doesn't make a working protein. It is the molecular equivalent of a photocopy of a key — recognisable, the right shape, but somewhere along the way the cuts didn't quite line up. Most are relics of duplications that happened tens of millions of years ago, still sitting in our genome long after they stopped doing what their ancestors did.

For decades, biologists treated them as evolutionary debris. The term itself, coined by Jacq, Miller and Brownlee in 1977, was slightly dismissive: pseudo, "false." Generations of textbooks pushed the same line: pseudogenes are mistakes, gene fossils, the parts of the genome that don't matter.

Then, around 2010, a series of papers showed that some pseudogenes produce RNAs that regulate the expression of their parent genes. One of them, PTENP1, turned out to be a tumour suppressor in its own right. The clean line between "gene" and "junk" stopped being clean.

Some pseudogenes do nothing. Some regulate cancer. Some are diagnostic landmines. We mostly don't yet know which is which.

Anatomical illustration: parent gene next to its pseudogene twin with annotated defects
Parent & twin · annotated
§ II — Four ways a gene becomes a ghost

Pseudogenes don't all
arrive the same way.

They can be born from a misfired retrocopy, from a redundant duplicate that drifted, from a gene we no longer need, or from a variant that breaks a gene in some people but not in others. Each origin leaves a different fingerprint and matters differently to the people who study them.

Diagram of retrotransposition
i. processed

The retrocopy

Most pseudogenes in the human genome are processed pseudogenes. A gene's mRNA got reverse-transcribed back into DNA and inserted somewhere new — without its introns, often with a poly-A tail still attached. They are scattered all over the genome, frequently far from the parent.

62% of all human pseudogenes (≈ 9 487 in GENCODE v49)
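
The retrocopy mechanism described above can be sketched in a few lines. This is a toy illustration, not a model of real retrotransposition: the sequence, the exon coordinates, and the function name `make_retrocopy` are all made up for the example. It only shows why a processed pseudogene lacks introns and often ends in a run of A's.

```python
def make_retrocopy(gene_seq: str, exons: list[tuple[int, int]], poly_a_len: int = 12) -> str:
    """Join the exons of a gene and append a poly-A tail.

    exons are 0-based, end-exclusive (start, end) pairs on gene_seq.
    Hypothetical helper for illustration only.
    """
    # Splicing: keep only the exon intervals, in genomic order.
    spliced = "".join(gene_seq[start:end] for start, end in sorted(exons))
    # Polyadenylation: the mature mRNA carries a tail of A's,
    # which is copied into the genome together with the exons.
    return spliced + "A" * poly_a_len


# Toy gene: exons in upper case, introns in lower case (invented sequence).
toy_gene = "ATGGCC" + "gtaagtttcag" + "GATTAC" + "gtgagtccag" + "TGGTAA"
toy_exons = [(0, 6), (17, 23), (33, 39)]

retrocopy = make_retrocopy(toy_gene, toy_exons, poly_a_len=8)
print(retrocopy)
```

The printed sequence contains no lower-case intron bases, which is the fingerprint that separates a processed pseudogene from a duplicated one.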
Diagram of gene duplication producing an unprocessed pseudogene
ii. unprocessed

The duplicate that drifted

A gene was duplicated, a whole copy with introns and all, and then one of the two copies accumulated mutations until it stopped producing a working protein. Unprocessed pseudogenes typically sit close to their parent and keep its exon-intron structure. Where the two copies remain nearly identical, short-read aligners struggle to tell them apart. PMS2/PMS2CL is the classic case.

~ 13% of pseudogenes (≈ 1 949)
Time-axis diagram of a unitary pseudogene losing function
iii. unitary

The functional gene we lost

A unitary pseudogene is the only copy in the genome — and it's broken. There was no duplication; the gene simply stopped working in our lineage. Many olfactory receptors are unitary pseudogenes in humans, fully functional in dogs and rodents. A small molecular record of what we used to be able to smell.

~ 290 in the human genome
Diagram of polymorphic pseudogene status varying between individuals
iv. polymorphic

The pseudogene that isn't always one

Some "pseudogenes" are functional in some people and broken in others. CASP12 is a pseudogene in most humans but functional in some West African populations. FCGR2C's status varies too. Polymorphic pseudogenes blur the line — and they're a reminder that "gene" vs "pseudogene" is sometimes a question of whose genome you happen to look at.

A small but consequential class
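
The class counts quoted in this section can be checked against the GENCODE v49 total of 15 198 with simple arithmetic. The sketch below uses only numbers that appear in this guide; the unitary count is given as an approximation (~290), so its share is approximate too. The remainder falls into categories this guide does not break out.

```python
TOTAL = 15_198  # GENCODE v49 total quoted in this guide

# Counts quoted per class in this section (unitary is approximate).
class_counts = {"processed": 9_487, "unprocessed": 1_949, "unitary": 290}

# Share of each class as a percentage of the total.
shares = {name: 100 * n / TOTAL for name, n in class_counts.items()}

for name, pct in shares.items():
    print(f"{name:12s} {class_counts[name]:6d}  {pct:5.1f}%")

# Pseudogenes not covered by the three itemised classes.
other = TOTAL - sum(class_counts.values())
print(f"{'other':12s} {other:6d}  {100 * other / TOTAL:5.1f}%")
```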
§ III — The junk-DNA story

We were wrong
for half a century.

"Junk DNA" was coined by Susumu Ohno in 1972 — a label for the chunks of the mammalian genome that don't code for proteins. Pseudogenes were the canonical example. Broken copies of working genes, kept around by inertia, doing nothing.

That story held for decades. Then, starting around 2010, the evidence shifted and the consensus collapsed.

  1. 1972
    Junk DNA

    Susumu Ohno argues that most of the mammalian genome carries no function. Pseudogenes become the textbook example.

  2. 1977
    "Pseudogene" coined

    Jacq, Miller & Brownlee describe a 5S rRNA pseudogene in Xenopus. The term (pseudo, "false") sticks.

  3. 2010
    PTENP1 regulates PTEN

    Poliseno et al. show in Nature that PTENP1's mRNA acts as a competing endogenous RNA — soaking up microRNAs that would otherwise suppress PTEN. The first hard evidence that a "junk" pseudogene actively governs a tumour suppressor.

  4. 2012
    GENCODE catalogues them

    Pei et al. publish the first comprehensive pseudogene resource: 11 216 in the human genome at the time. The community now has a shared substrate.

  5. 2020s
    A real research field

    Cheetham, Faulkner & Dinger publish a major Nature Reviews Genetics review framing pseudogenes as a genuine class of regulatory and protein-producing elements. The question is no longer "do they do anything?" but "which ones do, and how?"

§ IV — Why clinicians should care

When two genes
are almost the same,
a wrong call isn't theoretical.

Clinically relevant pseudogenes often share long stretches of very high sequence identity with their parents, typically 88–99% and in places close to 100%. Short-read aligners can't always tell which copy a read actually came from. The failure is silent: a variant gets called on the parent gene when the read came from the pseudogene, and the patient gets a wrong report.
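
A minimal sketch of why this happens, using invented toy sequences and hypothetical function names (`min_mismatches`, `assign_read`). Real aligners are far more sophisticated, but the core problem is the same: a read that does not overlap a position where parent and pseudogene differ fits both equally well.

```python
def min_mismatches(read: str, reference: str) -> int:
    """Fewest mismatches over all ungapped placements of read on reference."""
    n = len(read)
    return min(
        sum(r != c for r, c in zip(read, reference[i:i + n]))
        for i in range(len(reference) - n + 1)
    )


def assign_read(read: str, parent: str, pseudogene: str) -> str:
    """Assign a read to whichever copy it matches better, if either."""
    d_parent = min_mismatches(read, parent)
    d_pseudo = min_mismatches(read, pseudogene)
    if d_parent < d_pseudo:
        return "parent"
    if d_pseudo < d_parent:
        return "pseudogene"
    return "ambiguous"


# Toy 30-nt parent and a pseudogene differing at one position (index 20).
parent = "ACGTTGCAAGGCTTACCGATGCATTGACCT"
pseudogene = parent[:20] + "A" + parent[21:]

identity = 100 * sum(a == b for a, b in zip(parent, pseudogene)) / len(parent)
print(f"identity: {identity:.1f}%")

read_outside = parent[0:12]        # does not cover the differing base
read_parent = parent[14:26]        # covers it, carries the parent base
read_pseudo = pseudogene[14:26]    # covers it, carries the pseudogene base

for read in (read_outside, read_parent, read_pseudo):
    print(read, assign_read(read, parent, pseudogene))
```

Only the two reads that span the differing base can be placed. The first read is ambiguous, and an aligner forced to choose may put it on the wrong copy.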

Anatomical-style figure with disease loci marked at three points
PMS2 / PMS2CL · Lynch syndrome · ~100% identity, exons 11–15
SMN1 / SMN2 · Spinal muscular atrophy · single-nt difference, exon 7
CYP2D6 / CYP2D7 / CYP2D8P · Pharmacogenomics · multi-copy locus, deeply paralogous
STRC / STRCP1 · Non-syndromic hearing loss · 96.81% identity
CYP21A2 / CYP21A1P · Congenital adrenal hyperplasia · ~98% identity
CHEK2 / CHEK2P2 · Hereditary breast cancer risk · paralog confounds NGS calls
IKBKG / IKBKGP1 · Incontinentia pigmenti · X-linked, paralog at distance
NEB / NEB triplicate · Nemaline myopathy · 99.66–99.79%, triplicate exons 82–105

These are the loci where labs traditionally fall back on hand-tuned workarounds — MLPA assays, custom mappability filters, manual review. The general-purpose audit layer that replaces those scripts is pseudocaller, below.

§ V — Open tools

Software for working
on pseudogenes.

schlein-lab builds and maintains a small constellation of open-source tools for clinical and research genomics. The flagship for this field guide is pseudocaller; the rest of the family handles related problems — assembly, visualisation, distributed analysis.

FEATURED · pseudocaller.com

pseudocaller

A paralog-aware audit layer for short-read NGS. Scores every read against 384 692 paralog-specific variants across 6 128 catalogued paralog pairs, corrects depth-based copy-number estimates, and flags variant calls sitting on contaminated reads — across the entire human genome, in a single pipeline pass.
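
To make the idea of scoring reads against paralog-specific variants (PSVs) concrete, here is a minimal sketch. It is not pseudocaller's implementation or API: the `PSV` and `Read` classes, the vote-counting rule, and the 0.2 threshold are all assumptions chosen for illustration, and the reads are invented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PSV:
    """A position where parent and paralog differ (hypothetical structure)."""
    pos: int            # 0-based position in shared alignment coordinates
    parent_base: str
    paralog_base: str


@dataclass(frozen=True)
class Read:
    start: int          # 0-based position of the first base
    seq: str


def score_read(read: Read, psvs: list[PSV]) -> tuple[int, int]:
    """Count PSV positions where the read shows the parent or paralog base."""
    parent_votes = paralog_votes = 0
    for psv in psvs:
        offset = psv.pos - read.start
        if 0 <= offset < len(read.seq):
            base = read.seq[offset]
            if base == psv.parent_base:
                parent_votes += 1
            elif base == psv.paralog_base:
                paralog_votes += 1
    return parent_votes, paralog_votes


def classify(read: Read, psvs: list[PSV]) -> str:
    parent_votes, paralog_votes = score_read(read, psvs)
    if parent_votes > paralog_votes:
        return "parent"
    if paralog_votes > parent_votes:
        return "paralog"
    return "uninformative"


def variant_is_suspect(pos: int, alt: str, reads: list[Read], psvs: list[PSV],
                       max_paralog_fraction: float = 0.2) -> bool:
    """Flag a call if too many alt-supporting reads look paralog-derived."""
    supporting = [
        r for r in reads
        if r.start <= pos < r.start + len(r.seq) and r.seq[pos - r.start] == alt
    ]
    if not supporting:
        return False
    from_paralog = sum(classify(r, psvs) == "paralog" for r in supporting)
    return from_paralog / len(supporting) > max_paralog_fraction


# Toy data: two PSVs and three overlapping reads (all invented).
psvs = [PSV(5, "G", "A"), PSV(12, "C", "T")]
reads = [Read(0, "ACGTAGTTCA"), Read(3, "TAATTTACGT"), Read(6, "TTTACGC")]

for r in reads:
    print(r.start, r.seq, classify(r, psvs))
print("variant at 8 (T) suspect:", variant_is_suspect(8, "T", reads, psvs))
```

In this toy case, one of the two reads supporting the variant carries paralog bases at both PSVs, so the call is flagged for review rather than reported as-is.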

§ VI — A reading list

The literature corner.

A hand-picked, regularly updated reading list. If you only have time for one paper, take Cheetham et al. 2020 — it is the modern-era review and the field's current centre of gravity.

  1. Modern review · start here

    Overcoming challenges and dogmas to understand the functions of pseudogenes

    Cheetham SW, Faulkner GJ, Dinger ME. Nature Reviews Genetics 21(3): 191–201. 2020.

    The authoritative modern review. Synthesises the case for pseudogenes as a real class of regulatory elements, lays out the methodological challenges, and recasts the entire field beyond the junk-DNA frame.

  2. Landmark · regulatory pseudogene

    A coding-independent function of gene and pseudogene mRNAs regulates tumour biology

    Poliseno L, Salmena L, Zhang J, Carver B, Haveman WJ, Pandolfi PP. Nature 465(7301): 1033–1038. 2010.

    The PTEN/PTENP1 paper. The first hard evidence that a pseudogene's transcript can soak up microRNAs and thereby regulate its parent gene — here, a tumour suppressor. This single finding cracked the field open.

  3. Resource · the catalogue

    The GENCODE pseudogene resource

    Pei B, Sisu C, Frankish A, Howald C, Habegger L, Mu X, Harte R, Balasubramanian S, Tanzer A, Diekhans M, Reymond A, Hubbard TJ, Harrow J, Gerstein MB. Genome Biology 13(9): R51. 2012.

    The first comprehensive, manually-curated pseudogene catalogue for the human genome. Gave the community a shared substrate; everything downstream — including the 6 128-pair reference set used by pseudocaller — descends from this work.

  4. Comparative genomics

    Comparative analysis of pseudogenes across three phyla

    Sisu C, Pei B, Leng J, Frankish A, et al. PNAS 111(37): 13361–13366. 2014.

    A cross-species look at pseudogene evolution in human, fly and worm. Shows that pseudogene formation rates and types vary dramatically across phyla — and frames the human's processed-pseudogene-heavy profile as a mammalian peculiarity.

  5. Mechanism · early framing

    Pseudogenes: pseudo-functional or key regulators in health and disease?

    Pink RC, Wicks K, Caley DP, Punch EK, Jacobs L, Carter DR. RNA 17(5): 792–798. 2011.

    An early review that sat on the inflection point, written just after the PTENP1 paper but before the GENCODE catalogue. Useful for understanding how the community's question shifted from "are pseudogenes anything?" to "which are something?"

  6. Methodology · pipeline

    PseudoPipe: an automated pseudogene identification pipeline

    Zhang Z, Carriero N, Zheng D, Karro J, Harrison PM, Gerstein M. Bioinformatics 22(12): 1437–1439. 2006.

    The original automated pseudogene-identification pipeline. The methodology behind most subsequent catalogues; still cited when new pseudogene-mining tools are described.

  7. Pharmacogenomics · CYP

    CYP2D6 / CYP2D7 / CYP2D8P — primary literature collection

    Curated PubMed search · > 200 papers

    A live PubMed query for the most clinically consequential pharmacogenomic paralog locus. CYP2D6 metabolises a quarter of all clinically used drugs; its two pseudogene partners are the textbook example of why short-read NGS struggles in paralog regions.

  8. Live search · everything new

    All pseudogene papers, 2024–present

    PubMed live filter

    For when you want to see what's new. The rate of pseudogene-tagged papers has roughly tripled since 2010, and 2024–2026 has produced a wave of pseudogene-related work in cancer transcriptomics, long-read clinical genomics, and ceRNA networks.

§ VII — Frequently asked

For the curious.

Plain-language answers to the questions we get most often from people outside genetics. If you have a clinical or technical question that isn't here, the literature corner above is the next stop.

i.

Are pseudogenes basically broken genes?

Some are. Many aren't quite. A pseudogene is a stretch of DNA that looks like a gene but doesn't make a working protein the way the parent gene does. Some pseudogenes still get transcribed into RNA, and that RNA can do useful work on its own — regulating other genes, sponging up microRNAs, even (rarely) coding for short proteins. "Broken" is too strong; "different job" is closer to the truth for the interesting ones.

ii.

Can a pseudogene make me sick?

Indirectly, yes — in two ways. First, a few pseudogenes regulate the activity of disease-relevant parent genes (PTENP1 regulates the PTEN tumour suppressor; disrupting it has been linked to cancer). Second, and more commonly relevant in clinical practice, pseudogenes confound DNA tests: short-read sequencing can confuse a pseudogene with its parent gene, and a wrong read assignment can produce a wrong variant call on a clinical report. That's the problem pseudocaller exists to fix.

iii.

Why do pseudogenes exist at all?

Mostly because evolution doesn't bother removing things that aren't actively harmful. Pseudogenes are usually relics of duplications or retrotransposition events — a copy of a gene was made, the original kept doing its job, and the copy was free to drift. Over millions of years some accumulated breaking mutations and stopped producing protein. Removing them wouldn't help; keeping them doesn't hurt; so they stay.

iv.

How many pseudogenes are in my genome?

The current authoritative count, from GENCODE v49, is 15 198. That number drifts as the catalogue gets refined: older estimates put it as high as 19 000; stricter modern criteria have pulled it down. Either way, the count is of the same order as the number of protein-coding genes (≈ 20 000). A bit more than half of the genome's gene-shaped DNA does the day-to-day cellular work; the rest is still being figured out.

v.

Are pseudogenes still evolving?

Yes. New pseudogenes are continually born by retrotransposition — and a small fraction occasionally evolve back into functional genes, a process called pseudogene reactivation. The flow is genuinely two-way. The genome is a living archive, not a fossil.

vi.

Should I be worried about my clinical genetic test?

Not unduly. Most tests don't sit in pseudogene-confounded regions, and for the small set of clinically critical loci that do (PMS2, SMN1, CYP2D6, STRC, CYP21A2 and a few others), labs have known about the pseudogene problem for years and use special workarounds. If your test result feels surprising, it is reasonable to ask whether the lab used a paralog-aware variant-calling pipeline for that specific gene.

vii.

What is the deal with "junk DNA"?

It's a 1972 metaphor that aged poorly. The term covered most of the genome that doesn't code for proteins — including pseudogenes, repetitive elements, regulatory sequences, and non-coding RNAs. We now know much of that "junk" has function (regulation, structure, splicing control). A fair fraction probably is genuinely non-functional. The honest position is: parts of the genome are still uncatalogued, and "junk" was a placeholder that never stopped being a placeholder.