---
title: "How Scholarly Identifiers Are Defined"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{How Scholarly Identifiers Are Defined}
  %\VignetteEngine{knitr::rmarkdown}
  \usepackage[utf8]{inputenc}
---

# Introduction

This vignette explains how common scholarly identifiers are formally defined,
what their structural components are, and what it means for them to be *valid*
in a programmatic context.

When working with identifiers in R, it is essential to distinguish between:

- **Structural validity** (does it match the formal grammar?)
- **Checksum validity** (does the control digit verify?)
- **Registry validity** (does the identifier actually exist?)

The functions in `scholid` validate identifiers at the **structural**
level and verify checksums where defined (ORCID, ROR, ISNI, ISBN, ISSN).
They do not check registry or online existence. The regexes in each section
describe the **canonical** form that `is_scholid()` expects; wrapped URLs and
labels should be normalized with `normalize_scholid()` first. Checksum rules
are documented separately where they apply.

## Classification order

`classify_scholid()` and `detect_scholid_type()` walk types in the order
returned by `scholid_types()` (most specific first). The first matching type
wins. This matters when patterns overlap: for example, OpenAlex is checked
before PMID, and six-character UniProt accessions such as `P12345` are not
treated as OpenAlex keys.

PMID is a **fallback** type (`detect_last` in the registry): bare digit strings
are only classified or detected as PMID when no more specific type matches.
During extraction, PMID candidates use 4–9 digits and do not match digits
immediately following `PMC`.

For the authoritative type list and order, call `scholid_types()` in R.
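
The first-match-wins rule can be pictured with a small ordered lookup. The sketch below is not the package registry: `toy_patterns` and `classify_first_match()` are hypothetical names, only three of the regexes from this vignette are included, and their relative order is an assumption apart from PMID coming last.

```r
# Hypothetical, ordered mini-registry: more specific types first, PMID last.
toy_patterns <- c(
  openalex = "^[WASTIKPFG][0-9]{5,}$",
  pmcid    = "^PMC\\d+$",
  pmid     = "^\\d+$"
)

# Return the name of the first pattern that matches, or NA if none does.
classify_first_match <- function(x, patterns = toy_patterns) {
  for (type in names(patterns)) {
    if (grepl(patterns[[type]], x, perl = TRUE)) {
      return(type)
    }
  }
  NA_character_
}

classify_first_match("W2741809807")  # "openalex"
classify_first_match("PMC1234567")   # "pmcid"
classify_first_match("12345678")     # "pmid" (nothing earlier matched)
```

Because the loop returns on the first hit, moving `pmid` to the front of `toy_patterns` would not change these three results, but it would for any type whose canonical form is digits only, which is why the fallback sits last.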

## Supported types (overview)

| Type | Example | Checksum | Notes |
|------|---------|----------|-------|
| `doi` | `10.1000/182` | No | Prefix `10.`; opaque suffix |
| `arxiv` | `2101.00001v2` | No | Modern or legacy archive form |
| `bibcode` | `1992ApJ...400L...1W` | No | Fixed 19 characters |
| `openalex` | `W2741809807` | No | Not UniProt-shaped 6-char accessions |
| `swhid` | `swh:1:cnt:94a9ed02…` | No | Requires `swh:` prefix; optional qualifiers |
| `ark` | `ark:/12148/btv1b8449691v` | No | Requires `ark:` label; 5-digit NAAN |
| `isni` | `000000012146438X` | Yes | Compact 16 characters |
| `orcid` | `0000-0002-1825-0097` | Yes | Hyphenated canonical form |
| `ror` | `01an7q238` | Yes | Lowercase Crockford base32 |
| `rrid` | `RRID:AB_262044` | No | `RRID:` prefix; authority allowlist |
| `uniprot` | `P12345` | No | Uppercase; no version suffix |
| `refseq` | `NM_001744.6` | No | Prefix allowlist; version required |
| `sra` | `SRR1553610` | No | INSDC `S`/`E`/`D` + `R` + entity letter |
| `geo` | `GSE2553` | No | `GSE`, `GSM`, `GPL`, or `GDS` |
| `bioproject` | `PRJNA257197` | No | INSDC `PRJ*` prefixes |
| `assembly` | `GCF_000001405.40` | No | `GCA_` or `GCF_`; nine digits + version |
| `isbn` | `9780306406157` | Yes | ISBN-10 or ISBN-13 |
| `issn` | `2434-561X` | Yes | Hyphenated canonical display |
| `pmcid` | `PMC1234567` | No | Literal `PMC` prefix |
| `pmid` | `12345678` | No | Fallback; excludes valid ISBNs |

Most sections below share the same headings: **Structure**, **Validation in
scholid**, **Checksum** (where applicable), and **Structural Regex**.

---

# DOI (Digital Object Identifier)

**Governing body:** International DOI Foundation  
**Standard:** ISO 26324

## Structure

A DOI has two parts:

```
prefix/suffix
```

### Prefix
- Always begins with `10.`
- Followed by a registrant code (4–9 digits)

Example:
```
10.1000
10.1038
```

### Suffix
- Assigned by the registrant
- May contain almost any printable character
- Has no globally fixed grammar
- Case-insensitive for ASCII letters (per the DOI Handbook), although the
  registered case is usually preserved

Example:
```
10.1000/182
10.1038/s41586-020-2649-2
```

## Validation in scholid

DOI validation is **structural only**. There is no checksum. Registry
existence is not checked. Wrapped forms (`https://doi.org/…`, `doi:` labels)
should be normalized before classification.

## Structural Regex

Canonical form (as enforced by `is_scholid()`):

```
^10\.\d{4,9}/\S+$
```

This checks:
- Prefix starts with `10.`
- 4–9 digits
- A slash
- Non-whitespace suffix
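
As a minimal sketch, the same pattern can be applied with base R. `looks_like_doi()` is a hypothetical helper, not a `scholid` function, and it performs the structural check only.

```r
# Structural DOI check only: no checksum, no registry lookup.
doi_pattern <- "^10\\.\\d{4,9}/\\S+$"

looks_like_doi <- function(x) {
  grepl(doi_pattern, x, perl = TRUE)
}

looks_like_doi(c(
  "10.1000/182",                  # TRUE
  "10.1038/s41586-020-2649-2",    # TRUE
  "https://doi.org/10.1000/182",  # FALSE: wrapped form, normalize first
  "10.12/abc",                    # FALSE: registrant code too short
  "10.1000/has space"             # FALSE: whitespace in suffix
))
```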

---

# ISNI (International Standard Name Identifier)

**Governing body:** ISNI International Agency  
**Standard:** ISO 27729  
**Documentation:** [ISNI](https://isni.org/)

## Structure

An ISNI uniquely identifies public identities of contributors to media
content. The identifier is 16 characters: 15 decimal digits plus a check
character.

Compact canonical form:

```
000000012146438X
```

Human-readable presentation uses an `ISNI` prefix and spaces in blocks of
four:

```
ISNI 0000 0001 2146 438X
```

Preferred resolver URLs include:

```
https://isni.org/isni/000000012146438X
```

ORCID iDs use the same ISO/IEC 7064 MOD 11-2 checksum on 16 characters but
are canonicalized in `scholid` with hyphens. Compact checksum-valid
16-character strings are treated as ISNI; hyphenated strings are treated
as ORCID.

## Validation in scholid

ISNI validation requires a **checksum-valid** compact 16-character string.
Hyphenated ORCID-shaped input is not accepted as ISNI; normalize or classify
as ORCID instead. Registry existence is not checked.

## Checksum

Uses ISO/IEC 7064 MOD 11-2, identical to ORCID. The check character may be
`0`–`9` or `X`.
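
The MOD 11-2 computation is short enough to sketch in base R. The helper names below are hypothetical and the code is an illustration of the algorithm, not the package's implementation. It works on the compact 16-character form, so hyphenated ORCID iDs need their hyphens removed first.

```r
# ISO/IEC 7064 MOD 11-2 check character for a 15-digit base string.
mod11_2_check_char <- function(base15) {
  digits <- as.integer(strsplit(base15, "", fixed = TRUE)[[1]])
  total <- 0
  for (d in digits) {
    total <- (total + d) * 2
  }
  result <- (12 - total %% 11) %% 11
  if (result == 10) "X" else as.character(result)
}

# TRUE when a compact 16-character string carries the expected check character.
has_valid_mod11_2 <- function(compact16) {
  if (!grepl("^\\d{15}[\\dX]$", compact16, perl = TRUE)) {
    return(FALSE)
  }
  mod11_2_check_char(substr(compact16, 1, 15)) == substr(compact16, 16, 16)
}

has_valid_mod11_2("000000012146438X")                       # TRUE
has_valid_mod11_2(gsub("-", "", "0000-0002-1825-0097"))     # TRUE
has_valid_mod11_2("0000000121464380")                       # FALSE
```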

## Structural Regex

Compact canonical form:

```
^\d{15}[\dX]$
```

---

# ORCID

**Governing body:** ORCID, Inc.  
**Standard basis:** ISO/IEC 7064 (checksum algorithm)

## Structure

An ORCID iD consists of 16 characters, shown here in the hyphenated display form:

```
0000-0002-1825-0097
```

### Components

- 16 characters in total: 15 digits plus one check character
- Grouped as 4-4-4-4 for display
- The check character is `0`–`9` or `X`

Internally (without hyphens):

```
0000000218250097
```

## Checksum

Uses the ISO/IEC 7064 MOD 11-2 algorithm.  
A structurally well-formed ORCID iD is still invalid if the check character does not match.

## Validation in scholid

ORCID validation requires a **checksum-valid** hyphenated iD. Unhyphenated
16-character strings are not accepted as ORCID by `is_scholid()`; if they
match the ISNI compact pattern and checksum, they classify as `isni` instead.
Wrapped `https://orcid.org/` URLs should be normalized first.

## Structural Regex

Hyphenated canonical form (used by `is_scholid()`):

```
^\d{4}-\d{4}-\d{4}-\d{3}[\dX]$
```

Unhyphenated internal form:

```
^\d{15}[\dX]$
```

---

# ROR (Research Organization Registry)

**Governing body:** ROR Community  
**Documentation:** [ROR identifier pattern](https://ror.readme.io/docs/identifier)

## Structure

A ROR iD is a 9-character lowercase string:

```
01an7q238
```

Preferred external form is the full URL:

```
https://ror.org/01an7q238
```

## Checksum

The last two characters are a checksum derived from the preceding seven
characters using Crockford base32 encoding and ISO/IEC 7064 MOD 97-10 rules,
matching ROR's identifier generation implementation.
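
A hedged sketch of that computation in base R follows. It assumes the checksum is `98 - ((n * 100) mod 97)`, where `n` is the integer obtained by decoding the first seven characters as Crockford base32; the helper names are hypothetical and this is not the package's code.

```r
# Crockford base32 alphabet (no i, l, o, u), lowercase as in ROR iDs.
crockford <- strsplit("0123456789abcdefghjkmnpqrstvwxyz", "", fixed = TRUE)[[1]]

# Expected two-digit checksum for the first seven characters of a ROR iD.
ror_checksum <- function(base7) {
  chars <- strsplit(base7, "", fixed = TRUE)[[1]]
  values <- match(chars, crockford) - 1
  n <- 0
  for (v in values) {
    n <- n * 32 + v
  }
  sprintf("%02d", as.integer(98 - (n * 100) %% 97))
}

# Structural regex first, then compare the computed and stated checksums.
has_valid_ror_checksum <- function(ror) {
  if (!grepl("^0[a-hjkmnp-tv-z0-9]{6}[0-9]{2}$", ror, perl = TRUE)) {
    return(FALSE)
  }
  ror_checksum(substr(ror, 1, 7)) == substr(ror, 8, 9)
}

has_valid_ror_checksum("01an7q238")  # TRUE
has_valid_ror_checksum("01an7q239")  # FALSE
```

For `01an7q2` the decoded integer is 44736226, `(n * 100) mod 97` is 60, and `98 - 60` gives the trailing `38`.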

## Validation in scholid

ROR validation requires a **checksum-valid** lowercase compact iD.
`https://ror.org/` URLs should be normalized before classification. Registry
existence is not checked.

## Structural Regex

Canonical compact form:

```
^0[a-hjkmnp-tv-z0-9]{6}[0-9]{2}$
```

---

# RRID (Research Resource Identifier)

**Governing body:** Resource Identification Initiative (SciCrunch)  
**Documentation:** [RRID Initiative](https://www.rrids.org/)

## Structure

An RRID cites a research resource such as an antibody, cell line, model
organism, software tool, or plasmid. The canonical form includes the literal
`RRID:` prefix followed by an authority-specific accession:

```
RRID:AB_262044
RRID:CVCL_2260
RRID:SCR_007358
RRID:IMSR_JAX:000664
RRID:MGI:3840442
RRID:Addgene_80088
```

Preferred resolver URLs include:

```
https://scicrunch.org/resolver/RRID:AB_262044
```

## Validation in scholid

RRID validation is **structural only**. There is no checksum algorithm, and
registry existence is not checked.

To limit false positives, `scholid` accepts only canonical `RRID:`-prefixed
forms and validates the accession body against a conservative allowlist of
known RRID authority prefixes (for example `AB`, `CVCL`, `SCR`, `IMSR`,
`MGI`, `Addgene`). Bare local IDs such as `AB_262044` without the `RRID:`
prefix are rejected.

## Structural Regex

Canonical prefix (body matched against an authority allowlist, not `.+`):

```
^RRID:(?:AB_\d+|CVCL_[0-9A-Z]+|SCR_\d+|…)$
```

The full allowlist is defined in the package registry; see the RRID
implementation for the current authority patterns.

---

# UniProt (UniProtKB accession)

**Governing body:** UniProt Consortium  
**Documentation:** [UniProt accession numbers](https://www.uniprot.org/help/accession_numbers)

## Structure

A UniProtKB accession uniquely identifies a protein record. Accessions are
6 or 10 uppercase alphanumeric characters following UniProt-defined patterns.

Examples:

```
P12345
Q9H0H5
A0A022YWF9
```

Preferred resolver URLs include:

```
https://www.uniprot.org/uniprot/P12345
https://identifiers.org/uniprot/P12345
```

## Validation in scholid

UniProt validation is **structural only**. Registry existence is not checked.

Canonical form is the uppercase accession without version suffixes or entry
name qualifiers. Wrapped URLs and lowercase accessions should be normalized
with `normalize_scholid()` before classification.

Six-character accessions such as `P12345` are **not** accepted as OpenAlex keys
(OpenAlex is checked earlier in classification order, but `is_openalex()`
explicitly rejects UniProt-shaped strings).

## Structural Regex

```
^(?:[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2})$
```
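
For illustration, the pattern can be tried against the examples above with base R. `looks_like_uniprot()` is a hypothetical helper and checks structure only.

```r
uniprot_pattern <- paste0(
  "^(?:[OPQ][0-9][A-Z0-9]{3}[0-9]|",
  "[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2})$"
)

looks_like_uniprot <- function(x) grepl(uniprot_pattern, x, perl = TRUE)

looks_like_uniprot(c("P12345", "Q9H0H5", "A0A022YWF9"))     # TRUE TRUE TRUE
looks_like_uniprot(c("p12345", "P12345.2", "W2741809807"))  # FALSE FALSE FALSE
```

The second call shows the three rejection cases discussed in this section: lowercase input, a version suffix, and an OpenAlex-style key.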

---

# RefSeq (NCBI Reference Sequence accession)

**Governing body:** NCBI RefSeq  
**Documentation:** [RefSeq accession prefixes](https://www.ncbi.nlm.nih.gov/books/NBK21091/table/ch18.T.refseq_accession_numbers_and_mole/?report=objectonly)

## Structure

A RefSeq accession uniquely identifies a curated sequence record. The format
is a two-letter molecule-type prefix, an underscore, an alphanumeric accession
body, a period, and a version number.

Examples:

```
NM_001744.6
NP_001735.1
NC_003619.1
NZ_CASIGT010000001.1
```

Preferred resolver URLs include:

```
https://www.ncbi.nlm.nih.gov/nuccore/NM_001744.6
https://www.ncbi.nlm.nih.gov/protein/NP_001735.1
https://identifiers.org/refseq/NM_001744.6
```

## Validation in scholid

RefSeq validation is **structural only**. Registry existence is not checked.

Canonical form is the uppercase accession with version suffix. Known RefSeq
prefixes are allowlisted. Wrapped URLs and lowercase accessions should be
normalized with `normalize_scholid()` before classification.

`GCA_` / `GCF_` genome assembly accessions are a separate type (`assembly`)
and are not matched as RefSeq.

## Structural Regex

```
^(?:AC|AP|NC|NG|NM|NP|NR|NT|NW|NZ|XM|XP|XR|YP|WP)_[A-Z0-9]+\.[0-9]+$
```

---

# SRA (Sequence Read Archive accession)

**Governing body:** INSDC Sequence Read Archive (NCBI, EBI, DDBJ)  
**Documentation:** [Search in SRA Entrez](https://www.ncbi.nlm.nih.gov/sra/docs/srasearch/)

## Structure

An SRA accession identifies a study, sample, experiment, or run in the
INSDC archives. The format is a three-letter prefix (source database plus
entity type) followed by digits.

Examples:

```
SRP006081
SRS123456
SRX1234567
SRR1553610
ERR1234567
DRR1234567
```

Preferred resolver URLs include:

```
https://www.ncbi.nlm.nih.gov/sra/SRR1553610
https://identifiers.org/sra/SRR1553610
```

## Validation in scholid

SRA validation is **structural only**. Registry existence is not checked.

Canonical form is the uppercase accession without version suffix. The first
letter denotes the source archive (`S` NCBI, `E` EBI, `D` DDBJ); the third
letter denotes entity type (`P` study, `S` sample, `X` experiment, `R` run).
Wrapped URLs and lowercase accessions should be normalized with
`normalize_scholid()` before classification.

## Structural Regex

```
^[SED]R[RXSP][0-9]{5,}$
```
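
The prefix letters can be decoded with two lookup vectors. The sketch below uses a hypothetical helper, `parse_sra()`, and assumes canonical uppercase input.

```r
# Decode the three-letter SRA prefix into source archive and entity type.
parse_sra <- function(x) {
  if (!grepl("^[SED]R[RXSP][0-9]{5,}$", x)) {
    return(NULL)
  }
  archives <- c(S = "NCBI", E = "EBI", D = "DDBJ")
  entities <- c(P = "study", S = "sample", X = "experiment", R = "run")
  list(
    archive = archives[[substr(x, 1, 1)]],
    entity  = entities[[substr(x, 3, 3)]],
    number  = substr(x, 4, nchar(x))
  )
}

parse_sra("SRR1553610")  # archive "NCBI", entity "run", number "1553610"
parse_sra("SRP006081")   # archive "NCBI", entity "study", number "006081"
parse_sra("srr1553610")  # NULL: lowercase input is not canonical
```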

---

# GEO (Gene Expression Omnibus accession)

**Governing body:** NCBI GEO  
**Documentation:** [GEO programmatic access](https://www.ncbi.nlm.nih.gov/geo/info/geo_paccess.html)

## Structure

A GEO accession identifies a curated dataset, series, sample, or platform
record. The format is a three-letter entity prefix followed by digits.

Examples:

```
GSE2553
GSM313800
GPL96
GDS505
```

Preferred resolver URLs include:

```
https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE2553
https://identifiers.org/geo/GSE2553
```

## Validation in scholid

GEO validation is **structural only**. Registry existence is not checked.

Canonical form is the uppercase accession. Supported entity prefixes are
`GSE` (series), `GSM` (sample), `GPL` (platform), and `GDS` (dataset).
Wrapped URLs and lowercase accessions should be normalized with
`normalize_scholid()` before classification.

## Structural Regex

```
^(?:GSE|GSM|GPL|GDS)[0-9]{2,}$
```

---

# BioProject (INSDC BioProject accession)

**Governing body:** INSDC BioProject (NCBI, EBI, DDBJ)  
**Documentation:** [BioProject handbook](https://www.ncbi.nlm.nih.gov/books/NBK169438/)

## Structure

A BioProject accession identifies a research project that groups related
sequence and sample records. The format is a five-letter INSDC prefix
followed by digits.

Examples:

```
PRJNA257197
PRJEB12345
PRJDB303
```

Preferred resolver URLs include:

```
https://www.ncbi.nlm.nih.gov/bioproject/PRJNA257197
https://identifiers.org/bioproject/PRJNA257197
```

## Validation in scholid

BioProject validation is **structural only**. Registry existence is not
checked.

Canonical form is the uppercase accession. Known prefixes (`PRJNA`, `PRJEB`,
`PRJDB`, `PRJDA`, `PRJEA`) are allowlisted. Wrapped URLs and lowercase
accessions should be normalized with `normalize_scholid()` before
classification.

## Structural Regex

```
^(?:PRJNA|PRJEB|PRJDB|PRJDA|PRJEA)[0-9]{2,}$
```

---

# Genome assembly (INSDC GCA/GCF accession)

**Governing body:** INSDC / NCBI Assembly  
**Documentation:** [Genome assembly accessions](https://www.ncbi.nlm.nih.gov/datasets/docs/v2/data-processing/policies-annotation/genome-processing/version-status/)

## Structure

A genome assembly accession identifies a collection of sequences comprising
an assembled genome. GenBank assemblies use the `GCA_` prefix; NCBI RefSeq
assembly counterparts use `GCF_`. The accession body is nine digits, followed
by a period and a version number.

Examples:

```
GCF_000001405.40
GCA_000001405.29
GCA_009914755.4
```

Preferred resolver URLs include:

```
https://www.ncbi.nlm.nih.gov/assembly/GCF_000001405.40
https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_000001405.40/
https://identifiers.org/insdc.gcf:GCF_000001405.40
```

## Validation in scholid

Assembly validation is **structural only**. Registry existence is not checked.

Canonical form is the uppercase accession with version suffix. Only `GCA_`
and `GCF_` prefixes are accepted, with exactly nine digits in the accession
body. Wrapped URLs and lowercase accessions should be normalized with
`normalize_scholid()` before classification.

RefSeq gene and protein accessions (`NM_`, `NP_`, …) are validated separately
and are not accepted as `assembly`.

## Structural Regex

```
^GC[AF]_[0-9]{9}\.[0-9]+$
```

---

# ISBN (International Standard Book Number)

**Governing body:** International ISBN Agency  
**Standard:** ISO 2108

## Two Forms

### ISBN-10
- 9 digits + checksum digit
- Check digit may be `X`

Example:
```
0306406152
080442957X
```

### ISBN-13
- 13 digits
- Begins with the prefix `978` or `979`
- EAN-13 checksum algorithm

Example:
```
9780306406157
```

## Validation in scholid

ISBN validation requires a **checksum-valid** ISBN-10 or ISBN-13 in compact
form (no hyphens or spaces in canonical output). Labeled or spaced input
should be normalized first. Registry existence is not checked.
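
The two check-digit rules can be sketched in base R as follows. The helper names are hypothetical, the functions take a single compact string, and the code illustrates the standard ISBN-10 (modulo 11) and ISBN-13 (EAN-13, modulo 10) calculations rather than the package's implementation.

```r
# ISBN-10: weights 10..2 on the first nine digits, modulo 11; 10 is written "X".
isbn10_check_char <- function(base9) {
  digits <- as.integer(strsplit(base9, "", fixed = TRUE)[[1]])
  check <- (11 - sum(digits * 10:2) %% 11) %% 11
  if (check == 10) "X" else as.character(check)
}

# ISBN-13: alternating weights 1 and 3 on the first twelve digits, modulo 10.
isbn13_check_char <- function(base12) {
  digits <- as.integer(strsplit(base12, "", fixed = TRUE)[[1]])
  check <- (10 - sum(digits * rep(c(1, 3), 6)) %% 10) %% 10
  as.character(check)
}

# Pick the rule from the compact length, then compare check characters.
is_valid_isbn <- function(x) {
  if (grepl("^\\d{9}[\\dX]$", x, perl = TRUE)) {
    return(isbn10_check_char(substr(x, 1, 9)) == substr(x, 10, 10))
  }
  if (grepl("^\\d{13}$", x, perl = TRUE)) {
    return(isbn13_check_char(substr(x, 1, 12)) == substr(x, 13, 13))
  }
  FALSE
}

is_valid_isbn("0306406152")     # TRUE
is_valid_isbn("9780306406157")  # TRUE
is_valid_isbn("9780306406158")  # FALSE
```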

## Structural Regex

ISBN-10 (canonical compact):

```
^\d{9}[\dX]$
```

ISBN-13:

```
^\d{13}$
```

---

# ISSN (International Standard Serial Number)

**Governing body:** ISSN International Centre  
**Standard:** ISO 3297

## Structure

An ISSN has 8 characters, displayed as two groups of four:

```
2434-561X
```

### Components

- 7 digits
- 1 check character (`0`–`9` or `X`)
- Canonical display includes a hyphen after the fourth digit

Compact form (no hyphen):

```
2434561X
```

## Validation in scholid

ISSN validation requires a **checksum-valid** ISSN. Canonical form uses a
hyphen after the fourth digit (`2434-561X`). Extraction targets hyphenated
tokens; normalize for compact checks. Registry existence is not checked.
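
A minimal sketch of the ISSN check character in base R, using the standard weights 8 down to 2 and modulo 11. The helper names are hypothetical and the validator accepts the hyphenated form only.

```r
# ISSN check character: weights 8..2 on the first seven digits, modulo 11.
issn_check_char <- function(base7) {
  digits <- as.integer(strsplit(base7, "", fixed = TRUE)[[1]])
  check <- (11 - sum(digits * 8:2) %% 11) %% 11
  if (check == 10) "X" else as.character(check)
}

# Structural regex first, then compare the computed check character.
is_valid_issn <- function(x) {
  if (!grepl("^\\d{4}-\\d{3}[\\dX]$", x, perl = TRUE)) {
    return(FALSE)
  }
  compact <- sub("-", "", x, fixed = TRUE)
  issn_check_char(substr(compact, 1, 7)) == substr(compact, 8, 8)
}

is_valid_issn("2434-561X")  # TRUE
is_valid_issn("2434-5610")  # FALSE: wrong check character
is_valid_issn("2434561X")   # FALSE here: this sketch expects the hyphen
```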

## Structural Regex

Hyphenated canonical form (also the extraction target):

```
^\d{4}-\d{3}[\dX]$
```

Compact form:

```
^\d{7}[\dX]$
```

---

# arXiv Identifier

**Authority:** arXiv (Cornell University)

## Two Formats

### Modern (since April 2007)

```
YYMM.NNNN
YYMM.NNNNN
```

Optional version suffix:

```
YYMM.NNNNvN
```

Components:
- 4-digit year/month
- Dot
- 4–5 digit submission number
- Optional version `vN`

Structural regex:

```
^\d{4}\.\d{4,5}(v\d+)?$
```

---

### Legacy (before April 2007)

```
archive/YYMMNNN
```

Example:
```
hep-th/9901001
```

Structural regex:

```
^[a-z\-]+/\d{7}(v\d+)?$
```

## Validation in scholid

arXiv validation is **structural only**. Both modern (`YYMM.NNNNN`) and legacy
(`archive/YYMMNNN`) forms are accepted. Optional version suffix `vN` is allowed.
Wrapped `arXiv:` labels and `https://arxiv.org/` URLs should be normalized
before classification. No checksum; registry existence is not checked.
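
The two regexes can be combined into a small scheme detector. This is an illustrative sketch with hypothetical helper names; it expects bare identifiers, so `arXiv:` labels and URLs are reported as no match.

```r
arxiv_modern <- "^\\d{4}\\.\\d{4,5}(v\\d+)?$"
arxiv_legacy <- "^[a-z\\-]+/\\d{7}(v\\d+)?$"

# Report which scheme an identifier follows, or NA if neither regex matches.
arxiv_scheme <- function(x) {
  if (grepl(arxiv_modern, x, perl = TRUE)) return("modern")
  if (grepl(arxiv_legacy, x, perl = TRUE)) return("legacy")
  NA_character_
}

# Drop a trailing version suffix such as "v2".
strip_arxiv_version <- function(x) sub("v\\d+$", "", x, perl = TRUE)

arxiv_scheme("2101.00001v2")         # "modern"
arxiv_scheme("hep-th/9901001")       # "legacy"
arxiv_scheme("arXiv:2101.00001")     # NA: wrapped form, normalize first
strip_arxiv_version("2101.00001v2")  # "2101.00001"
```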

---

# ADS Bibcode

**Authority:** SAO/NASA Astrophysics Data System (ADS)  
**Documentation:** [ADS bibliographic codes](https://adsabs.harvard.edu/abs_doc/help_pages/data.html)

## Structure

An ADS bibcode is a fixed **19-character** identifier for bibliographic records
in astronomy and related fields. The format follows SIMBAD/NED conventions:

```
YYYYJJJJJVVVVMPPPPA
```

Where:

- `YYYY` — publication year (four digits)
- `JJJJJ` — journal abbreviation, left-justified, padded with `.`
- `VVVV` — volume, right-justified, padded with `.`
- `M` — qualifier (e.g. `L` for letters)
- `PPPP` — page, right-justified, padded with `.`
- `A` — first letter of the first author's surname

Example:

```
1992ApJ...400L...1W
```

Preferred resolver URLs include:

```
https://ui.adsabs.harvard.edu/abs/1992ApJ...400L...1W
```

## Validation in scholid

Bibcode validation is **structural only**. There is no checksum algorithm, and
ADS existence is not checked.

To limit false positives, `scholid` requires exactly 19 characters, a
letter in the journal field, and a letter as the final author-initial
character. Case is preserved in canonical form.

## Structural Regex

```
^\d{4}[A-Za-z0-9.]{14}[A-Za-z]$
```
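
Because every field has a fixed width, a bibcode can be split by character position. The sketch below uses a hypothetical helper, `parse_bibcode()`, and applies only the regex above, not the extra journal-field check. It also strips all `.` padding, which is a simplification.

```r
# Split a 19-character bibcode into its fixed-width fields.
parse_bibcode <- function(x) {
  if (!grepl("^\\d{4}[A-Za-z0-9.]{14}[A-Za-z]$", x, perl = TRUE)) {
    return(NULL)
  }
  unpad <- function(s) gsub(".", "", s, fixed = TRUE)
  list(
    year      = as.integer(substr(x, 1, 4)),
    journal   = unpad(substr(x, 5, 9)),
    volume    = unpad(substr(x, 10, 13)),
    qualifier = unpad(substr(x, 14, 14)),
    page      = unpad(substr(x, 15, 18)),
    author    = substr(x, 19, 19)
  )
}

parse_bibcode("1992ApJ...400L...1W")
# year 1992, journal "ApJ", volume "400", qualifier "L", page "1", author "W"
```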

---

# OpenAlex ID

**Governing body:** OurResearch (OpenAlex)  
**Documentation:** [OpenAlex key concepts](https://developers.openalex.org/guides/key-concepts)

## Structure

Every OpenAlex entity has a persistent ID. The official form is a URL:

```
https://openalex.org/W2741809807
```

The short key (`W2741809807`) is commonly used in API calls and tabular data.
Keys are case-insensitive; `scholid` canonicalizes them to uppercase.

A key consists of:

- a single letter prefix indicating entity type (`W`, `A`, `S`, `I`, `T`,
  `K`, `P`, `F`, or `G`)
- a numeric tail (at least five digits)

Examples:

```
W2741809807
A5023888391
I97018004
```

## Validation in scholid

OpenAlex validation is **structural only**. There is no checksum algorithm,
and registry existence is not checked.

Deprecated concept IDs (`C` prefix) are not accepted. Bare keys are accepted
only when they match the structural pattern; wrapped URLs should be normalized
with `normalize_scholid()` before classification.

Six-character keys that match the UniProt accession pattern (for example
`P12345`) are **rejected** by `is_openalex()` to avoid overlap with UniProt.

Works, authors, and institutions in OpenAlex often also have DOI, ORCID, or
ROR identifiers respectively; those types are checked earlier during
classification.

## Structural Regex

Canonical uppercase key:

```
^[WASTIKPFG][0-9]{5,}$
```
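
The UniProt exclusion can be pictured as a second regex applied after the OpenAlex pattern. This sketch is not `is_openalex()` itself: `looks_like_openalex()` is a hypothetical helper, and the UniProt pattern is the one given in the UniProt section of this vignette.

```r
openalex_pattern <- "^[WASTIKPFG][0-9]{5,}$"
uniprot_pattern <- paste0(
  "^(?:[OPQ][0-9][A-Z0-9]{3}[0-9]|",
  "[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2})$"
)

# Accept OpenAlex-shaped keys unless they are also UniProt-shaped.
looks_like_openalex <- function(x) {
  grepl(openalex_pattern, x, perl = TRUE) &
    !grepl(uniprot_pattern, x, perl = TRUE)
}

looks_like_openalex(c("W2741809807", "I97018004", "P12345", "C41008148"))
# TRUE TRUE FALSE FALSE
```

`P12345` matches the OpenAlex pattern but is rejected by the exclusion, while `C41008148` fails the prefix class outright.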

---

# ARK (Archival Resource Key)

**Governing body:** ARK Alliance  
**Documentation:** [ARK specification](https://arks.org/)

## Structure

An ARK is a persistent identifier for digital, physical, or abstract objects.
The core identifier has the form:

```
ark:/NAAN/Name[Qualifier]
```

Where:

- `NAAN` — Name Assigning Authority Number (in `scholid`, five digits)
- `Name` — opaque name assigned by the authority
- `Qualifier` — optional hierarchical (`/`) or variant (`.`) suffix

Examples:

```
ark:/12148/btv1b8449691v/f29
ark:/13030/654xz321
```

Resolver URLs often embed the ARK after the host, for example:

```
https://n2t.net/ark:/12148/btv1b8449691v
```

The labels `ark:` and `ark:/` are equivalent; `scholid` canonicalizes to
`ark:/`.

## Validation in scholid

ARK validation is **structural only**. Resolver existence is not checked.

To limit false positives, `scholid` requires an explicit `ark:` label, a
five-digit NAAN, and a non-empty name body. Bare paths without the `ark:`
prefix are rejected.

## Structural Regex

Canonical form:

```
^ark:/[0-9]{5}/[0-9A-Za-z][0-9A-Za-z._/=-]*$
```
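
As a rough sketch, label canonicalization and the structural check might look like this in base R. All three helpers are hypothetical; they do not strip resolver hosts such as `https://n2t.net/`.

```r
ark_pattern <- "^ark:/[0-9]{5}/[0-9A-Za-z][0-9A-Za-z._/=-]*$"

# Rewrite the "ark:" label to the canonical "ark:/" form.
canonical_ark <- function(x) sub("^ark:/?", "ark:/", x)

looks_like_ark <- function(x) grepl(ark_pattern, x, perl = TRUE)

# Extract the five-digit NAAN from a canonical ARK, or NA if not an ARK.
ark_naan <- function(x) {
  if (looks_like_ark(x)) substr(x, 6, 10) else NA_character_
}

canonical_ark("ark:12148/btv1b8449691v")        # "ark:/12148/btv1b8449691v"
looks_like_ark("ark:/12148/btv1b8449691v/f29")  # TRUE
looks_like_ark("12148/btv1b8449691v")           # FALSE: no ark: label
ark_naan("ark:/13030/654xz321")                 # "13030"
```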

---

# SWHID (SoftWare Hash IDentifier)

**Governing body:** Software Heritage  
**Standard:** ISO/IEC 18670  
**Documentation:** [SWHID specification](https://docs.softwareheritage.org/devel/swh-model/persistent-identifiers.html)

## Structure

A SWHID identifies a software artifact archived by Software Heritage. The
core identifier has four colon-separated fields:

```
swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2
```

Where:

- `swh` is the scheme prefix
- `1` is the scheme version
- `cnt` is the object type (`cnt`, `dir`, `rev`, `rel`, or `snp`)
- the final field is a 40-character lowercase hex SHA-1 intrinsic identifier

Optional qualifiers may follow, separated by semicolons:

```
swh:1:cnt:4d99d2d18326621ccdd70f5ea66c2e2ac236ad8b;origin=https://example.org/repo.git;path=/src/main.c;lines=9-15
```

Resolver URLs include:

```
https://archive.softwareheritage.org/swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2
```

## Validation in scholid

SWHID validation is **structural only**. The embedded hash is an intrinsic
content identifier, but verifying that it matches the referenced artifact
requires access to the artifact itself and is not performed by `scholid`.

To limit false positives, `scholid` requires the explicit `swh:` prefix and
rejects bare 40-character hex strings (for example Git commit hashes). Known
qualifier keys (`origin`, `visit`, `anchor`, `path`, `lines`) are validated
conservatively when present.

## Structural Regex

Core form:

```
^swh:1:(cnt|dir|rev|rel|snp):[0-9a-f]{40}$
```
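
A SWHID with qualifiers can be split on `;` and then on `:`. The sketch below uses a hypothetical helper, `parse_swhid()`; it checks the core form with the regex above but does not validate qualifier keys or values.

```r
# Split a SWHID into core fields and optional qualifiers.
parse_swhid <- function(x) {
  parts <- strsplit(x, ";", fixed = TRUE)[[1]]
  core <- parts[1]
  if (!grepl("^swh:1:(cnt|dir|rev|rel|snp):[0-9a-f]{40}$", core)) {
    return(NULL)
  }
  fields <- strsplit(core, ":", fixed = TRUE)[[1]]
  quals <- parts[-1]
  # Split each qualifier on the first "=" only; values may contain "=" or ":".
  keys <- sub("=.*$", "", quals)
  values <- sub("^[^=]*=", "", quals)
  list(
    version     = fields[2],
    object_type = fields[3],
    hash        = fields[4],
    qualifiers  = setNames(as.list(values), keys)
  )
}

id <- paste0(
  "swh:1:cnt:4d99d2d18326621ccdd70f5ea66c2e2ac236ad8b",
  ";origin=https://example.org/repo.git;path=/src/main.c;lines=9-15"
)
parsed <- parse_swhid(id)
parsed$object_type        # "cnt"
parsed$qualifiers$origin  # "https://example.org/repo.git"
parse_swhid("94a9ed024d3859793618152ea559a168bbcbb5e2")  # NULL: bare hash
```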

---

# PMID (PubMed Identifier)

**Authority:** U.S. National Library of Medicine

## Structure

A PMID is a decimal integer assigned by PubMed. There is no checksum.

Example:

```
12345678
```

## Validation in scholid

PMID validation is intentionally **permissive** at the character level: canonical
form is digits only (`^\d+$`), but `is_scholid()` also **rejects values that are
valid ISBNs** to reduce cross-type false positives.

Because bare digit strings are ambiguous, PMID is registered as a **fallback**
type (`detect_last`): `classify_scholid()` and the primary pass of
`detect_scholid_type()` try other types first. Use PMID only when nothing more
specific matches.

For **extraction**, candidates are **4–9 digits** and must not immediately
follow the literal `PMC` (so `PMC12345` does not yield a PMID `12345`).

Wrapped forms such as `PMID: 12345678` should be detected via
`detect_scholid_type()` and normalized before strict validation.

## Structural Regex

Canonical form accepted by `is_scholid()` (after ISBN exclusion):

```
^\d+$
```

Extraction pattern (digit run length and `PMC` boundary):

```
(?<![[:alnum:]_./-]|PMC)\d{4,9}(?![[:alnum:]_]|[-/.][[:alnum:]_])
```
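
The extraction pattern relies on lookbehind and lookahead, so in base R it needs `perl = TRUE`. The sketch below uses a hypothetical helper, `extract_pmid_candidates()`, which applies the pattern only and leaves out the ISBN exclusion described above.

```r
pmid_candidate <- paste0(
  "(?<![[:alnum:]_./-]|PMC)\\d{4,9}",
  "(?![[:alnum:]_]|[-/.][[:alnum:]_])"
)

# Return every digit run in `text` that passes the boundary checks.
extract_pmid_candidates <- function(text) {
  regmatches(text, gregexpr(pmid_candidate, text, perl = TRUE))[[1]]
}

extract_pmid_candidates("See PMID 12345678 and PMC1234567.")  # "12345678"
extract_pmid_candidates("doi:10.1000/182")                    # character(0)
```

In the first call the digits after `PMC` are skipped; in the second, `1000` is skipped because it directly follows a `.`.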

---

# PMCID (PubMed Central Identifier)

**Authority:** PubMed Central

## Structure

```
PMC1234567
```

Components:

- Literal prefix `PMC`
- One or more digits

## Validation in scholid

PMCID validation is **structural only**: canonical form is `PMC` followed by
digits. There is no checksum. Registry existence is not checked.

PMCIDs are checked **before** PMID in classification order, so `PMC1234567` is
never classified as a bare PMID. Extraction uses a dedicated `PMC` prefix
pattern.

## Structural Regex

Canonical form:

```
^PMC\d+$
```