A field guide · est. by schlein-lab

Pseudogenes.
Hidden in plain sight.

About 15 198 of them sit in the human genome. For half a century we called them junk DNA. They aren't. Some regulate cancer. Some confound clinical diagnostics. All of them are part of the story of how genomes evolve.


For the curious

A plain-language guide

For scientists

Curated literature

For clinicians

Diagnostic relevance

Atlas plate of a human chromosome with highlighted paralog regions

§ I — What is a pseudogene

Almost a gene.
Not quite.

A pseudogene is a piece of DNA that looks like a gene but doesn't make a working protein. It is the molecular equivalent of a photocopy of a key — recognisable, the right shape, but somewhere along the way the cuts didn't quite line up. Most are relics of duplications that happened tens of millions of years ago, still sitting in our genome long after they stopped doing what their ancestors did.

For decades, biologists treated them as evolutionary debris. The term itself, coined by Jacq, Miller and Brownlee in 1977, was slightly dismissive: pseudo, "false." Decades of textbooks pushed the same line: pseudogenes are mistakes, gene fossils, the parts of the genome that don't matter.

Then, around 2010, a series of papers showed that some pseudogenes produce RNAs that regulate the expression of their parent genes. One of them, PTENP1, turned out to be a tumour suppressor in its own right. The clean line between "gene" and "junk" stopped being clean.

Some pseudogenes do nothing. Some regulate cancer. Some are diagnostic landmines. We mostly don't yet know which is which.

Anatomical illustration: parent gene next to its pseudogene twin with annotated defects — Parent & twin · annotated

§ II — Four ways a gene becomes a ghost

Pseudogenes don't all
arrive the same way.

They can be born from a misfired retrocopy, from a redundant duplicate that drifted, from a gene we no longer need, or from a variant that breaks a gene in only some of us. Each origin leaves a different fingerprint and matters differently to the people who study them.

i. processed

The retrocopy

Most pseudogenes in the human genome are processed pseudogenes. A gene's mRNA got reverse-transcribed back into DNA and inserted somewhere new — without its introns, often with a poly-A tail still attached. They are scattered all over the genome, frequently far from the parent.

62% of all human pseudogenes (≈ 9 487 in GENCODE v49)

Diagram of gene duplication producing an unprocessed pseudogene

ii. unprocessed

The duplicate that drifted

A gene was duplicated (a whole copy, introns and all), and then one of the two copies accumulated mutations until it stopped producing a working protein. Unprocessed pseudogenes typically sit close to their parent and keep its exon-intron structure, so long stretches stay nearly identical; that similarity is why short-read aligners struggle to tell them apart. PMS2/PMS2CL is the classic case.

~ 13% of pseudogenes (≈ 1 949)

Time-axis diagram of a unitary pseudogene losing function

iii. unitary

The functional gene we lost

A unitary pseudogene is the only copy in the genome — and it's broken. There was no duplication; the gene simply stopped working in our lineage. Many olfactory receptors are unitary pseudogenes in humans, fully functional in dogs and rodents. A small molecular record of what we used to be able to smell.

~ 290 in the human genome

Diagram of polymorphic pseudogene status varying between individuals

iv. polymorphic

The pseudogene that isn't always one

Some "pseudogenes" are functional in some people and broken in others. CASP12 is a pseudogene in most humans but functional in some West African populations. FCGR2C's status varies too. Polymorphic pseudogenes blur the line — and they're a reminder that "gene" vs "pseudogene" is sometimes a question of whose genome you happen to look at.

A small but consequential class
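The shares quoted on the cards above can be checked with a few lines of arithmetic. This is a minimal sketch that uses only the counts given in this guide; the "other" bucket is derived by subtraction (polymorphic pseudogenes plus anything else in the catalogue) and is an assumption of the sketch, not a figure from GENCODE.

```python
# Counts as quoted in this guide (GENCODE v49). "other" is derived, not quoted.
TOTAL = 15_198
CLASS_COUNTS = {"processed": 9_487, "unprocessed": 1_949, "unitary": 290}


def class_shares(counts, total):
    """Return each class's share of the total in percent, plus the remainder."""
    shares = {name: 100 * n / total for name, n in counts.items()}
    shares["other"] = 100 * (total - sum(counts.values())) / total
    return shares


shares = class_shares(CLASS_COUNTS, TOTAL)
for name, pct in shares.items():
    print(f"{name:<12}{pct:5.1f}%")
```

Rounded to whole percentages, the processed and unprocessed shares match the 62% and ~13% shown on the cards.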

§ III — The junk-DNA story

We were wrong
for half a century.

Vintage manuscript page being struck through, suggesting the revision of an old idea

"Junk DNA" was coined by Susumu Ohno in 1972 — a label for the chunks of the mammalian genome that don't code for proteins. Pseudogenes were the canonical example. Broken copies of working genes, kept around by inertia, doing nothing.

That story held until the early 2000s. Then the evidence shifted, and the consensus collapsed.

1972

Junk DNA

Susumu Ohno argues that most of the mammalian genome carries no function. Pseudogenes become the textbook example.

1977

"Pseudogene" coined

Jacq, Miller & Brownlee describe a 5S rRNA pseudogene in Xenopus. The term (pseudo, "false") sticks.

2010

PTENP1 regulates PTEN

Poliseno et al. show in Nature that PTENP1's mRNA acts as a competing endogenous RNA, soaking up microRNAs that would otherwise suppress PTEN. The first hard evidence that a "junk" pseudogene actively governs a tumour suppressor.

2012

GENCODE catalogues them

Pei et al. publish the first comprehensive pseudogene resource: 11 216 in the human genome at the time. The community now has a shared substrate.

2020s

A real research field

Cheetham, Faulkner & Dinger publish a major Nature Reviews Genetics review framing pseudogenes as a genuine class of regulatory and protein-producing elements. The question is no longer "do they do anything?" but "which ones do, and how?"
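The "soaking up microRNAs" idea from the 2010 entry can be made concrete with a toy calculation. The sketch below is not the model used by Poliseno et al.; it assumes that parent and pseudogene transcripts carry equally attractive binding sites and that every microRNA molecule ends up bound somewhere. The function name and all numbers are invented for illustration.

```python
def mirna_load_on_parent(parent_sites, pseudogene_sites, mirna_molecules):
    """Expected microRNA molecules landing on parent transcripts.

    Toy assumption: microRNAs distribute across all binding sites in
    proportion to site abundance, with equal affinity everywhere.
    """
    total_sites = parent_sites + pseudogene_sites
    if total_sites == 0:
        return 0.0
    return mirna_molecules * parent_sites / total_sites


# Hypothetical numbers: equal site counts on parent and decoy transcripts.
with_decoy = mirna_load_on_parent(100, 100, 50)
without_decoy = mirna_load_on_parent(100, 0, 50)
print(with_decoy, without_decoy)
```

Under these assumptions, removing the decoy transcript doubles the microRNA load on the parent, which is the qualitative point of the competing-endogenous-RNA picture: less pseudogene RNA means more repression of the parent gene.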

§ IV — Why clinicians should care

When two genes
are almost the same,
a wrong call isn't theoretical.

Many pseudogenes share stretches of 88–99% sequence identity with their parent genes, often right next to them. Short-read sequencing can't always tell which copy a read actually came from. The result is silent: a variant gets called on the parent gene when the read came from the pseudogene, and the patient gets a wrong report.
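Why a short read can be impossible to place is easiest to see on a toy example. The sketch below uses two invented 24-base sequences that differ at a single position and assigns each read to whichever copy it matches with fewer mismatches. Real aligners are far more sophisticated (gapped alignment, base qualities, mapping quality); the sequences and function names here are hypothetical.

```python
def mismatches(read, reference):
    """Fewest mismatches over all ungapped placements of read on reference."""
    k = len(read)
    return min(
        sum(a != b for a, b in zip(read, reference[i:i + k]))
        for i in range(len(reference) - k + 1)
    )


def assign(read, parent, pseudogene):
    """Name the copy the read matches better, or 'ambiguous' on a tie."""
    d_parent = mismatches(read, parent)
    d_pseudo = mismatches(read, pseudogene)
    if d_parent == d_pseudo:
        return "ambiguous"
    return "parent" if d_parent < d_pseudo else "pseudogene"


# Invented sequences, identical except at index 12 (C in parent, T in pseudogene).
PARENT = "ACGTTGCAAGCTCGATTACGGATC"
PSEUDOGENE = "ACGTTGCAAGCTTGATTACGGATC"

print(assign(PARENT[0:10], PARENT, PSEUDOGENE))      # read misses the differing base
print(assign(PARENT[8:18], PARENT, PSEUDOGENE))      # read covers it, parent version
print(assign(PSEUDOGENE[8:18], PARENT, PSEUDOGENE))  # read covers it, pseudogene version
```

Only reads that overlap a distinguishing base can be placed with confidence. The higher the identity between the two copies, the fewer such bases exist and the more reads fall into the ambiguous bucket.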

Anatomical-style figure with disease loci marked at three points

PMS2 ↔ PMS2CL · Lynch syndrome · ~100% identity, exons 11–15

SMN1 ↔ SMN2 · Spinal muscular atrophy · single-nt difference, exon 7

CYP2D6 ↔ CYP2D7 / CYP2D8P · Pharmacogenomics · multi-copy locus, deeply paralogous

STRC ↔ STRCP1 · Non-syndromic hearing loss · 96.81% identity

CYP21A2 ↔ CYP21A1P · Congenital adrenal hyperplasia · ~98% identity

CHEK2 ↔ CHEK2P2 · Hereditary breast cancer risk · paralog confounds NGS calls

IKBKG ↔ IKBKGP1 · Incontinentia pigmenti · X-linked, paralog at distance

NEB ↔ NEB triplicate · Nemaline myopathy · 99.66–99.79%, triplicate exons 82–105

These are the loci where labs traditionally fall back on hand-tuned workarounds: MLPA assays, custom mappability filters, manual review. The general-purpose audit layer intended to replace those workarounds is pseudocaller, below.
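For a pair like SMN1 and SMN2, where a single exon 7 base separates the copies (C in SMN1, T in SMN2), one common idea behind such workarounds is to count which base the reads show at that diagnostic position. The sketch below illustrates the idea only and is not a clinical method: the function names are hypothetical, the base counts are invented, and the total copy number is assumed to come from a separate depth-based estimate.

```python
from collections import Counter


def paralog_fraction(bases, parent_base, paralog_base):
    """Fraction of informative reads showing the parent base at one site.

    Reads showing any other base (likely sequencing errors) are ignored.
    Returns None when no read is informative.
    """
    counts = Counter(bases)
    informative = counts[parent_base] + counts[paralog_base]
    if informative == 0:
        return None
    return counts[parent_base] / informative


def estimate_parent_copies(fraction, total_copies):
    """Split an assumed total copy number according to the read fraction."""
    return round(fraction * total_copies)


# Invented pileup at the diagnostic site: 15 reads with C, 45 with T, one error.
observed = "C" * 15 + "T" * 45 + "A"
fraction = paralog_fraction(observed, parent_base="C", paralog_base="T")
print(fraction, estimate_parent_copies(fraction, total_copies=4))
```

With these toy numbers a quarter of the informative reads carry the parent base, so an assumed total of four copies splits into one parent copy and three paralog copies. The estimate is only as good as the depth-based total it starts from.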

§ V — Open tools

Software for working
on pseudogenes.

schlein-lab builds and maintains a small constellation of open-source tools for clinical and research genomics. The flagship for this field guide is pseudocaller; the rest of the family handles related problems — assembly, visualisation, distributed analysis.

FEATURED · pseudocaller.com

pseudocaller

A paralog-aware audit layer for short-read NGS. Scores every read against 384 692 paralog-specific variants across 6 128 catalogued paralog pairs, corrects depth-based copy-number estimates, and flags variant calls sitting on contaminated reads — across the entire human genome, in a single pipeline pass.

Open pseudocaller.com →
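To give a feel for what "scoring reads against paralog-specific variants" can mean, here is a minimal sketch. It is not pseudocaller's implementation or API: the `PSV` class, the scoring rule, the 0.2 threshold and the positions are all assumptions made for illustration. Each read is reduced to the bases it shows at known paralog-specific positions, and a variant call is flagged when too many of its supporting reads look as if they came from the paralog.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PSV:
    """A paralog-specific variant: one position where the two copies differ."""
    position: int
    parent_base: str
    paralog_base: str


def read_origin_score(read_bases, psvs):
    """Score from +1 (all covered PSVs match the parent) to -1 (all match the paralog).

    read_bases maps position -> observed base. Returns None if no PSV is covered.
    """
    votes = []
    for psv in psvs:
        base = read_bases.get(psv.position)
        if base == psv.parent_base:
            votes.append(1)
        elif base == psv.paralog_base:
            votes.append(-1)
    return sum(votes) / len(votes) if votes else None


def audit_call(supporting_reads, psvs, max_paralog_fraction=0.2):
    """Return 'pass', 'flag' or 'uninformative' for one variant call."""
    scores = [read_origin_score(read, psvs) for read in supporting_reads]
    informative = [s for s in scores if s is not None]
    if not informative:
        return "uninformative"
    paralog_like = sum(s < 0 for s in informative)
    return "flag" if paralog_like / len(informative) > max_paralog_fraction else "pass"


# Hypothetical PSVs and reads supporting two different variant calls.
psvs = [PSV(100, "C", "T"), PSV(140, "G", "A")]
clean = [{100: "C", 140: "G"}, {100: "C"}, {140: "G"}]
contaminated = [{100: "T", 140: "A"}, {100: "T"}, {100: "C"}]
print(audit_call(clean, psvs), audit_call(contaminated, psvs))
```

The third outcome matters as much as the other two: a call whose supporting reads cover no paralog-specific position cannot be audited this way at all, and saying so is more honest than passing it silently.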

Atlas-style still life of scientific instruments

branch-assembler.com

BRANCH

Breakpoint-Resolved Assembly of Non-diploid Copy-number Heterogeneity. Somatic-mosaic-aware genome assembly for PacBio HiFi with vPCR-based CNV quantification.

Open →

variantpaths.com

variantpaths

Standalone genome-graph viewer for structural-variant paths. Native binary formats (.vpf/.vpz). Rust + egui, runs anywhere.

Open →

nano-zyrkel.com

nano-zyrkel

SDK for autonomous micro-agents that live in GitHub repositories and run on Actions — watch a feed, track a dataset, build a dashboard, all without a server.

Open →

§ VI — A reading list

The literature corner.

A hand-picked, regularly updated reading list. If you only have time for one paper, take Cheetham et al. 2020 — it is the modern-era review and the field's current centre of gravity.

Modern review · start here

Overcoming challenges and dogmas to understand the functions of pseudogenes

Cheetham SW, Faulkner GJ, Dinger ME. Nature Reviews Genetics 21(3): 191–201. 2020.

The authoritative modern review. Synthesises the case for pseudogenes as a real class of regulatory elements, lays out the methodological challenges, and recasts the entire field beyond the junk-DNA frame.

PMID 31848476

Landmark · regulatory pseudogene

A coding-independent function of gene and pseudogene mRNAs regulates tumour biology

Poliseno L, Salmena L, Zhang J, Carver B, Haveman WJ, Pandolfi PP. Nature 465(7301): 1033–1038. 2010.

The PTEN/PTENP1 paper. The first hard evidence that a pseudogene's transcript can soak up microRNAs and thereby regulate its parent gene — here, a tumour suppressor. This single finding cracked the field open.

PMID 20577206

Resource · the catalogue

The GENCODE pseudogene resource

Pei B, Sisu C, Frankish A, Howald C, Habegger L, Mu X, Harte R, Balasubramanian S, Tanzer A, Diekhans M, Reymond A, Hubbard TJ, Harrow J, Gerstein MB. Genome Biology 13(9): R51. 2012.

The first comprehensive, manually-curated pseudogene catalogue for the human genome. Gave the community a shared substrate; everything downstream — including the 6 128-pair reference set used by pseudocaller — descends from this work.

PMID 22951037

Comparative genomics

Comparative analysis of pseudogenes across three phyla

Sisu C, Pei B, Leng J, Frankish A, et al. PNAS 111(37): 13361–13366. 2014.

A cross-species look at pseudogene evolution in human, fly and worm. Shows that pseudogene formation rates and types vary dramatically across phyla, and frames the human genome's processed-pseudogene-heavy profile as a mammalian peculiarity.

PMID 25157146

Mechanism · early framing

Pseudogenes: pseudo-functional or key regulators in health and disease?

Pink RC, Wicks K, Caley DP, Punch EK, Jacobs L, Carter DR. RNA 17(7): 792–798. 2011.

A review that sat on the inflection point, written just after the PTENP1 paper but before the GENCODE catalogue. Useful for understanding how the community's question shifted from "are pseudogenes anything?" to "which are something?"

PMID 21603444

Methodology · pipeline

PseudoPipe: an automated pseudogene identification pipeline

Zhang Z, Carriero N, Zheng D, Karro J, Harrison PM, Gerstein M. Bioinformatics 22(12): 1437–1439. 2006.

The original automated pseudogene-identification pipeline. The methodology behind most subsequent catalogues; still cited when new pseudogene-mining tools are described.

PMID 16574694

Pharmacogenomics · CYP

CYP2D6 / CYP2D7 / CYP2D8P — primary literature collection

Curated PubMed search · > 200 papers

A live PubMed query for the most clinically consequential pharmacogenomic paralog locus. CYP2D6 metabolises a quarter of all clinically used drugs; its two pseudogene partners are the textbook example of why short-read NGS struggles in paralog regions.

PubMed query

Live search · everything new

All pseudogene papers, 2024–present

PubMed live filter

For when you want to see what's new. The rate of pseudogene-tagged papers has roughly tripled since 2010, and 2024–2026 has produced a wave of pseudogene-related work in cancer transcriptomics, long-read clinical genomics, and ceRNA networks.

Open in PubMed

§ VII — Frequently asked

For the curious.

Plain-language answers to the questions we get most often from people outside genetics. If you have a clinical or technical question that isn't here, the literature corner above is the next stop.

i.

Are pseudogenes basically broken genes?

Some are. Many aren't quite. A pseudogene is a stretch of DNA that looks like a gene but doesn't make a working protein the way the parent gene does. Some pseudogenes still get transcribed into RNA, and that RNA can do useful work on its own — regulating other genes, sponging up microRNAs, even (rarely) coding for short proteins. "Broken" is too strong; "different job" is closer to the truth for the interesting ones.

ii.

Can a pseudogene make me sick?

Indirectly, yes — in two ways. First, a few pseudogenes regulate the activity of disease-relevant parent genes (PTENP1 regulates the PTEN tumour suppressor; disrupting it has been linked to cancer). Second, and more commonly relevant in clinical practice, pseudogenes confound DNA tests: short-read sequencing can confuse a pseudogene with its parent gene, and a wrong read assignment can produce a wrong variant call on a clinical report. That's the problem pseudocaller exists to fix.

iii.

Why do pseudogenes exist at all?

Mostly because evolution doesn't bother removing things that aren't actively harmful. Pseudogenes are usually relics of duplications or retrotransposition events — a copy of a gene was made, the original kept doing its job, and the copy was free to drift. Over millions of years some accumulated breaking mutations and stopped producing protein. Removing them wouldn't help; keeping them doesn't hurt; so they stay.

iv.

How many pseudogenes are in my genome?

The current authoritative count, from GENCODE v49, is 15 198. That number drifts as the catalogue gets refined: older estimates put it as high as 19 000, and stricter modern criteria have pulled it down. Either way, the count is roughly comparable to the number of protein-coding genes (≈ 20 000). Roughly half of the genome's gene-shaped DNA does the day-to-day cellular work; the rest is still being figured out.

v.

Are pseudogenes still evolving?

Yes. New pseudogenes are continually born by retrotransposition — and a small fraction occasionally evolve back into functional genes, a process called pseudogene reactivation. The flow is genuinely two-way. The genome is a living archive, not a fossil.

vi.

Should I be worried about my clinical genetic test?

Not unduly — most tests don't sit in pseudogene-confounded regions. But for a small set of clinically critical loci (PMS2, SMN1, CYP2D6, STRC, CYP21A2 and a few others) labs have known about the pseudogene problem for years and use special workarounds. If your test result feels surprising, it is reasonable to ask whether the lab used a paralog-aware variant calling pipeline for that specific gene.

vii.

What is the deal with "junk DNA"?

It's a 1972 metaphor that aged poorly. The term covered most of the genome that doesn't code for proteins — including pseudogenes, repetitive elements, regulatory sequences, and non-coding RNAs. We now know much of that "junk" has function (regulation, structure, splicing control). A fair fraction probably is genuinely non-functional. The honest position is: parts of the genome are still uncatalogued, and "junk" was a placeholder that never stopped being a placeholder.

§ VIII — The wider family

Where this site
fits in the picture.

pseudogenomics.com is a public-education and tool-portal site by schlein-lab. It sits inside a small constellation of related projects — research code, clinical tools, and a parent product called Zyrkel.

schlein-lab.com

schlein-lab

Zyrkel

nano-zyrkel
