Why don't you scrape the entire United States?

We will. The Colorado scope is intentional — to exercise the parser, link graph, and editorial layer on a single state until coverage is rock-solid. Multi-state expansion is a sequencing decision, not a capacity one. Article 5 in this series covers the roadmap.

Is your crawler the same one CMS uses?

No. CMS runs a [separate validator](https://www.cms.gov/hospital-price-transparency/resources) that scores any URL against the schema. Our crawler discovers and verifies; the CMS validator scores. We surface the validator results on each workspace page alongside our parser results so the two views can be compared.

Do you cache files locally?

We cache the parsed output, not the raw MRF. Every dollar value the site shows is reproducible from the source file the hospital published — the [methodology page](/methodology) documents the schema and the aggregation steps.

What if a hospital's MRF link is technically fine but the data is wrong?

We don't second-guess the published rates. If a number looks anomalous, we'll surface it on the per-hospital page (the per-payer band makes outliers visible) but we don't substitute, infer, or interpolate. Corrections come from the hospital republishing.

How we discover and crawl hospital MRF links

By Ashwin Pingali | Updated May 6, 2026 | 6 min read

No central registry exists for hospital MRF URLs — hospitals self-host, links break, formats vary. How we find, validate, and re-validate the link graph.

There is no central registry of hospital machine-readable file URLs. CMS requires every hospital to publish one (45 CFR §180.50) but does not maintain a master list of where each file lives. Hospitals self-host. Links move. The format varies. To build an aggregated view across hospitals, somebody has to keep an active link graph.

This piece walks through the practical reality of doing that — what the rule says about discoverability, where the link is supposed to be, where it actually is, and the failure modes the pipeline sees most often.

Why this is a hard problem

The CMS rule mandates publication, not central registration. Each hospital posts its own file at its own URL on its own domain. There is no API to query, no registry to subscribe to, and no standardised filename convention. The only way to know where every hospital's MRF lives is to find each one, store the URL, and re-check it on a cadence.

The link graph also decays. Hospitals migrate hosting providers, restructure their websites, rename files when they refresh, switch CDNs, and occasionally rotate the path on every quarterly publish. A URL that works today may not work six months from now.

Where the link is supposed to be

Per 45 CFR §180.50(d)(2), the hospital's homepage must contain a prominent link to the standard-charges file. CMS recommends specific anchor language — "price transparency," "standard charges" — and a location near the top of the page or in the footer. The intent is that any patient could find the file by visiting `hospital.example/`, scanning the homepage, and clicking through.
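The homepage check described above can be sketched in a few lines. This is a minimal illustration, not the pipeline's actual extractor: the class and function names are hypothetical, and matching on a case-insensitive substring of the anchor text is an assumption about how the recommended phrases might be detected.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Phrases the article says CMS recommends; substring matching is an assumption.
ANCHOR_PHRASES = ("price transparency", "standard charges")


class AnchorCollector(HTMLParser):
    """Collects (href, visible text) pairs for every <a> element with an href."""

    def __init__(self):
        super().__init__()
        self.anchors = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text that appears inside an open anchor.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            text = " ".join("".join(self._text).split())  # collapse whitespace
            self.anchors.append((self._href, text))
            self._href = None


def candidate_links(homepage_url, html):
    """Return absolute URLs of anchors whose text contains a recommended phrase."""
    parser = AnchorCollector()
    parser.feed(html)
    parser.close()
    return [
        urljoin(homepage_url, href)
        for href, text in parser.anchors
        if any(phrase in text.lower() for phrase in ANCHOR_PHRASES)
    ]
```

A check like this only covers the case the rule envisions, where the anchor sits on the homepage itself and uses the recommended wording.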

In practice the link is often deep in a financial-assistance section, behind a generic "Billing" header, or inside a regulatory-disclosures index that itself isn't reachable from the homepage. We have seen MRF links nested four pages deep behind menus designed to discourage clicking.

Some hospitals comply technically — there is a homepage anchor — but the anchor points to a redirect chain that finally lands on the file. Others post a ZIP archive containing the MRF rather than the MRF directly. Both satisfy the letter of the rule. Neither makes the file discoverable to a casual reader.

The discovery pipeline

The atlas keeps a per-hospital `mrf_url` field — that one column is what every downstream price aggregation depends on. Three sources fill it:

· A sibling crawler service. A separate compliance-scanner pipeline pulls hospital homepages, extracts MRF candidate links, and verifies they resolve to a parseable file. It is the primary feed.
· Hospital-association datasets. State hospital associations and a handful of third-party sources publish curated lists. We use them as cross-checks rather than as the source of truth — formatting and freshness vary too much.
· Manual overrides. When a hospital corrects a link by email, the override goes into a curated map that supersedes the crawler's most recent guess. An override is typically needed for only one to two refresh cycles, until the crawler picks up the same path on its own.
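The precedence among the three sources can be sketched as follows. The function, the `Resolution` type, and the dictionary-per-feed shape are all hypothetical. The sketch also assumes that the association lists never fill `mrf_url` directly and only raise a review flag when they disagree, which is one reading of "cross-checks" rather than a description of the real pipeline.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Resolution:
    mrf_url: Optional[str]
    source: str          # "override", "crawler", or "none"
    needs_review: bool   # True when the association list disagrees


def resolve_mrf_url(hospital_id, overrides, crawler, association):
    """Pick the mrf_url for one hospital from the three feeds."""
    # A manual override supersedes the crawler's most recent guess.
    if hospital_id in overrides:
        url, source = overrides[hospital_id], "override"
    elif hospital_id in crawler:
        url, source = crawler[hospital_id], "crawler"
    else:
        url, source = None, "none"
    # Association data is a cross-check only (assumption): a mismatch
    # flags the hospital for review but does not change the chosen URL.
    listed = association.get(hospital_id)
    needs_review = listed is not None and listed != url
    return Resolution(url, source, needs_review)
```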

The pipeline re-verifies every link on each refresh cycle. A link that 404s, redirects to an HTML error page, or returns the wrong content type for the declared format gets flagged in `precompute-errors.json` and tracked in the pipeline's internal error log so the next manual-override pass picks it up.
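As an illustration of the re-verification step, the three flagged conditions can be expressed as a small classifier over one fetch result. Everything here is an assumption made for the sketch: the function name, the failure labels, and the set of content types treated as acceptable for each declared format. The record layout of `precompute-errors.json` is not documented in this article, so the sketch stops at producing a label.

```python
from typing import Optional

# Content types accepted per declared format: an assumption, not a CMS list.
EXPECTED_TYPES = {
    "csv": {"text/csv", "application/csv", "application/octet-stream"},
    "json": {"application/json", "application/octet-stream"},
}


def classify_link(status: int, content_type: str, declared_format: str,
                  body_prefix: bytes) -> Optional[str]:
    """Return a failure label for one fetch result, or None if the link looks healthy."""
    if status == 404:
        return "not_found"
    if status >= 400:
        return "http_error"
    # Some servers answer 200 with an HTML error page after a redirect.
    if body_prefix.lstrip().lower().startswith((b"<!doctype html", b"<html")):
        return "html_error_page"
    # Drop parameters such as "; charset=utf-8" before comparing.
    media_type = content_type.split(";")[0].strip().lower()
    if media_type not in EXPECTED_TYPES.get(declared_format, set()):
        return "wrong_content_type"
    return None
```

Sniffing the first bytes of the body separates a genuine HTML error page from a real CSV that the server merely mislabels as `text/html`; the two cases call for different follow-up.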

Failure modes in the wild

The recurring breakages, ranked by frequency in our logs:

· File renamed, homepage anchor unchanged. The MRF moved from `/standardcharges-2025.csv` to `/standardcharges-2026.csv`; the homepage still points at the old name; the old file is gone. This is the single most common failure.
· ZIP wrapper. The MRF is published inside a `.zip`, and the actual CSV has a filename containing non-ASCII characters or sits two folders deep.
· Redirect chains. HTTP → HTTPS → www. → CDN. Most clients follow them; older parsers and some validators don't.
· Requires User-Agent or Cookie. The hospital's CDN blocks requests that look programmatic. We pass a documented identifier so hospitals can recognise our pipeline traffic; some block it anyway.
· Streaming-too-slow timeouts. A 10 GB JSON file streamed over a slow connection exceeds the parser's read budget. The fix is increasing the budget; the cost is longer refresh runs.
· Wrong content type. Server returns `text/html` for what's actually a CSV — a configuration mistake, not a content one.
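Of the failure modes above, the ZIP wrapper is the easiest to show in code. The sketch below uses Python's standard `zipfile` module to list members that look like an MRF at any folder depth. The function name, the suffix list, and the largest-first ordering are assumptions for illustration, not a description of the production pipeline.

```python
import io
import zipfile

# File extensions treated as MRF candidates (assumption).
MRF_SUFFIXES = (".csv", ".json")


def find_mrf_members(zip_bytes: bytes) -> list:
    """List archive members that look like an MRF, at any folder depth."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        members = [
            info for info in archive.infolist()
            if not info.is_dir() and info.filename.lower().endswith(MRF_SUFFIXES)
        ]
    # Largest first: a heuristic, on the guess that the MRF is far bigger
    # than any small helper file shipped beside it.
    members.sort(key=lambda info: info.file_size, reverse=True)
    return [info.filename for info in members]
```

Matching on the suffix rather than the full name sidesteps most filename problems: a name that decodes oddly because the archive lacks the UTF-8 flag can still be opened by whatever string `infolist()` reports.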

Dealing with corrupt files

Even when the URL resolves and the file downloads, the contents can be partly corrupt. The most common case: the Parquet conversion in the upstream ingest hits a ZSTD decompression failure on one or two row groups, and downstream queries then fail on the affected code-prefix slice. The pipeline handles this with a defensive query wrapper: each individual query runs inside a `tryQuery` block that catches failures, records them in `precompute-errors.json`, and lets the rest of the precompute proceed. Partial coverage is better than total failure.
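The wrapper pattern is simple enough to sketch. The version below is a Python rendering of the idea, with `try_query` standing in for the `tryQuery` block named above. The signature, the error-record fields, and the `run_precompute` driver are hypothetical; only the behaviour (catch, record, continue) comes from the description.

```python
import json


def try_query(name, query, errors):
    """Run one query; on failure record it and return None instead of raising."""
    try:
        return query()
    except Exception as exc:  # broad on purpose: one bad row group must not stop the run
        errors.append({"query": name, "error": f"{type(exc).__name__}: {exc}"})
        return None


def run_precompute(queries, errors_path="precompute-errors.json"):
    """Run every query independently and write the collected failures to disk."""
    errors = []
    results = {name: try_query(name, query, errors) for name, query in queries.items()}
    with open(errors_path, "w", encoding="utf-8") as handle:
        json.dump(errors, handle, indent=2)
    return results, errors
```

Each failed query yields `None` in `results` and one entry in the error file, so a consumer can tell "no data" apart from "query failed" by checking the error list.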

Six Colorado hospitals currently fail at least one query phase in our pipeline:

· Pikes Peak Regional Hospital (Woodland Park)
· Rio Grande Hospital (Del Norte)
· Southeast Colorado Hospital (Springfield)
· UCHealth Longs Peak Hospital (Longmont)
· UCHealth Memorial Hospital Central (Colorado Springs)
· UCHealth Memorial Hospital North (Colorado Springs)

The UCHealth-cluster failures share an upstream cause: a Thrift/Parquet error in how the source files were converted, not a problem with the hospitals' published MRFs themselves. Those hospitals appear in the catalog with the data we could parse, and the workspace scorecard flags the partial state. We re-attempt the query phases on each refresh; some have begun resolving on their own as upstream tooling stabilises.

The honest state of the link graph today

At the most recent refresh, all 59 Colorado hospitals in our atlas have an `mrf_url` populated and resolving. 33,902,350 individual rate rows survived the parse and aggregation, across 311,409 distinct procedure codes and 275 distinct payers. Six hospitals' files lose a slice of their query coverage to the upstream Parquet issue described above; the rest parse cleanly enough to power the methodology we publish.

An additional set of small Colorado hospitals (critical-access facilities, recently closed ones, and a handful with no published MRF at all) sits in the broader compliance dataset but not in the atlas. They appear in the workspace scorecard as missing or non-conformant rather than aggregated. (How CMS wrote the rule is in the regulations explainer; what an MRF actually contains is in the MRF primer.)

What you can do as a reader

If a hospital page on this site links to an MRF that no longer resolves, the simplest thing is to email the contact in the footer. Corrections that involve a republished file are picked up at the next refresh cycle; smaller fixes (a homepage path that moved) can be applied to the manual override map sooner.

If you are a hospital pricing or compliance team and want to see exactly what the pipeline sees for your facility, email the contact address in the footer. A per-hospital workspace surface (crawl status, validator results, parse errors) is in development and will be invite-only when it ships.



Numbers and citations on this page trace back to hospitals’ own machine-readable files under 45 CFR §180.50. See the methodology page for how the prices are aggregated, and the editorial policy for what we will and won’t do as a publisher.