A real 404 is fine. A broken link that returns 200 OK is the one that hurts you.
Most teams treat broken links as a tidiness problem: a dead end annoys a visitor, you patch it when someone complains, and you move on. That framing misses the part search engines care about. Every internal link that points nowhere is crawl budget spent on a URL that returns nothing, and ranking signal that flows into a wall and disappears. It is leakage, not just untidiness.
The counterintuitive part is that an honest 404 is the good outcome. It tells a search engine to drop the URL cleanly and stop asking. The expensive failure is the broken link that still answers 200 OK, because nothing tells the crawler to stop, so it keeps coming back to index a page that has nothing on it. This post is about why broken internal links cost you ranking, why the status code matters more than what the page shows a human, and when to actually go looking for them.
What's in this post
- Why broken internal links cost you ranking, not just goodwill
- Soft 404s: the broken link that lies with a 200
- Checking the status code instead of the page
- When links break in bulk, and how to fix them
- Finding all of them in one scan
Why broken internal links cost you ranking, not just goodwill
A broken internal link does three separate kinds of damage, and only one of them is the visitor-facing annoyance everyone notices.
The first is crawl budget. Search engines allocate a finite amount of crawling to your site, and every request spent fetching a URL that returns nothing is a request not spent discovering or refreshing real content. Google's own guidance on managing crawl budget is explicit that low-value URLs, including soft error pages, eat into the crawling your important pages would otherwise get. On a small site this is invisible. On a large or frequently updated one, it is the difference between new pages getting indexed this week or next month.
The second is link equity. Internal links pass ranking signal between pages, and a link pointing at a dead URL pours that signal into a page that no longer exists. Whatever authority would have flowed to a live, relevant page is simply lost on the way. You built that internal link for a reason; when its target 404s, the reason evaporates and the signal strands.
The third is the aggregate quality read. A site littered with links to dead pages reads as neglected. That is not a single-page penalty so much as a slow drag on how the whole domain is assessed, which is exactly why these are worth fixing in bulk rather than one report at a time.
Soft 404s: the broken link that lies with a 200
Here is the distinction that trips up most people. A correct 404 returns the HTTP status code 404 Not Found. That is the page telling the crawler, in the only language it actually reads, "this is gone, drop it." The system works as designed: the URL falls out of the index and stops getting crawled.
A soft 404 is the same missing page wearing a disguise. The page shows a human something like "Sorry, that page could not be found," but the server returns 200 OK in the response header. To a person the two look identical. To a search engine they are opposites. The 200 says "this is a real, valid page, index it and check back later," so the crawler keeps returning to an empty page indefinitely, and that empty page can sit in the index competing with your real content. Google documents this exact failure mode in its guidance on soft 404 errors, and the fix it recommends is to return a true 404 or 410 status for content that is actually gone.
This is why a soft 404 is strictly worse than an honest one. The honest 404 ends the relationship. The soft 404 keeps the crawler on the hook, wasting budget on a page that will never rank for anything, because there is nothing on it to rank.
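To make the distinction concrete, here is a minimal Python sketch that labels a response from its status code and body text. The function name, the labels, and the phrase list are illustrative assumptions rather than any standard, real soft-404 detection is fuzzier than a substring match, and the status passed in is assumed to be the final one after redirects.

```python
# Phrases that suggest a "not found" page. Illustrative only, not a complete list.
NOT_FOUND_HINTS = ("page not found", "could not be found", "no longer exists")


def classify_response(status: int, body: str) -> str:
    """Label a response as 'gone', 'error', 'soft-404', or 'ok'."""
    if status in (404, 410):
        return "gone"  # honest: the crawler is told to drop the URL
    if status >= 400:
        return "error"  # other 4xx and 5xx responses
    text = body.lower()
    if status == 200 and any(hint in text for hint in NOT_FOUND_HINTS):
        return "soft-404"  # says "not found" to a human, 200 to a crawler
    return "ok"
```

Called as classify_response(200, "Sorry, that page could not be found"), this returns "soft-404"; the same body with a 404 status returns "gone".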
Checking the status code instead of the page
The practical consequence: you cannot judge whether a link is broken by looking at the rendered page. You have to look at the status code in the response header, because that is the only thing the crawler acts on.
The fastest way to see the real status of any URL is curl with the headers-only flag:
# -I sends a HEAD request and prints only the response headers
curl -I https://example.com/old-page
# Or print just the final status code, following redirects:
curl -s -o /dev/null -w "%{http_code}\n" -L https://example.com/old-page
The first command shows you the raw status and headers of the URL exactly as requested. The second follows any redirects (-L) and prints the final code, which is what you actually want to know. If that page displays a friendly "not found" message but the command prints 200, you have found a soft 404. If it prints 404 or 410, the link is genuinely broken but at least honestly so, and a search engine will handle it correctly. One caveat: -I sends a HEAD request, and some servers answer HEAD differently from GET, so when a HEAD result looks odd, confirm it with the second command, which sends a normal GET. The full list of what each code means lives in the MDN HTTP response status codes reference, and the 2xx versus 4xx split is the entire ballgame here.
One thing the command line does not surface easily is which of your pages link to the dead URL. The status code tells you a link is broken; it does not tell you where the broken link lives. That is the gap a crawler-based checker fills, and it is why a one-off curl is a spot check rather than an audit.
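If you already have the HTML of your pages on hand, one way to sketch that missing piece is to invert the link graph: collect every link on every page, then index by target so a dead URL can be traced back to the pages that link to it. The class and function names below are hypothetical, the input is assumed to be a dict of page URL to HTML source, and only plain a-tag hrefs are considered.

```python
from collections import defaultdict
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag in a document."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def build_backlink_index(pages):
    """Map each link target to the set of page URLs that link to it.

    `pages` is a dict of {page_url: html_source}. Relative hrefs are
    resolved against the URL of the page they appear on.
    """
    index = defaultdict(set)
    for page_url, html in pages.items():
        collector = LinkCollector()
        collector.feed(html)
        for href in collector.hrefs:
            index[urljoin(page_url, href)].add(page_url)
    return index
```

Looking up a dead URL in the returned index gives the set of pages that need editing, which is the part a lone status code cannot tell you.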
When links break in bulk, and how to fix them
Broken internal links rarely appear one at a time. They arrive in waves, and the waves line up with predictable events:
EVENT                   | WHY LINKS BREAK
------------------------------------------------------------------
Site migration          | URL paths change, old links not updated
Redesign / re-platform  | template links rewritten or dropped
URL structure change    | /blog/x becomes /articles/x site-wide
Bulk content deletion   | linked pages removed, links left behind
Page rename / slug edit | one rename orphans every link to it
The pattern is the same every time: something on the target side changes, and the source side still points at the old address. That is why a single slug edit can break dozens of links at once, and why the slow drip between big events still adds up. The right cadence is to check immediately after any of the events above, and to run a periodic scan (monthly is a reasonable default for an active site) to catch the drip in between.
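The "target changed, source did not" pattern can be sketched as a before-and-after comparison. This assumes you captured a link graph before the change and have a set of URLs that still resolve afterwards; both inputs and the function name are hypothetical, and how you gather them is up to your own tooling.

```python
def links_broken_by_change(link_graph, live_urls_after):
    """Return sorted (source, target) pairs whose target no longer exists.

    link_graph: {source_url: iterable of target_urls}, captured before the change.
    live_urls_after: set of URLs that still resolve after the change.
    """
    broken = []
    for source, targets in link_graph.items():
        for target in targets:
            if target not in live_urls_after:
                broken.append((source, target))
    return sorted(broken)
```

With a graph where both / and /about link to /blog/x, and a post-rename live set that contains /articles/x but not /blog/x, the function reports both source pages against the one renamed target, which is how a single slug edit turns into several broken links.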
Fixing them comes down to three cases:
- The target moved. Update the link to the new URL. If you cannot update every link, 301-redirect the old URL to the new one so the equity follows.
- The target is gone for good. Remove the link, or repoint it at the most relevant page that still exists. Do not leave it dangling.
- The link goes through a redirect chain. Point it directly at the final destination. A chain wastes a round trip and bleeds a little equity at every hop, so linking straight to the end is always better than relying on the redirects to bail you out. (More on that in redirect chains and what they cost.)
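The first and third cases reduce to the same operation: work out where a link finally lands and point at that directly. A minimal sketch, assuming your redirects are available as a plain from-URL to to-URL dict (the function names and the max_hops limit are illustrative choices, not taken from any particular tool):

```python
def resolve_final(url, redirects, max_hops=10):
    """Follow `redirects` (a {from_url: to_url} dict) to the final destination.

    Returns (final_url, hops). Raises ValueError on a loop or an overlong chain.
    """
    seen = {url}
    hops = 0
    while url in redirects:
        url = redirects[url]
        hops += 1
        if url in seen or hops > max_hops:
            raise ValueError(f"redirect loop or chain longer than {max_hops} hops")
        seen.add(url)
    return url, hops


def direct_targets(links, redirects):
    """Map each redirected link to the final URL it should point at instead."""
    fixes = {}
    for link in links:
        final, hops = resolve_final(link, redirects)
        if hops > 0:  # links that already point at a final URL are left out
            fixes[link] = final
    return fixes
```

Given redirects of /blog/x to /articles/x and /articles/x to /guides/x, a link to /blog/x resolves to /guides/x in two hops, so the edit is to link to /guides/x directly.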
Sitemaps deserve the same scrutiny, since a sitemap full of dead URLs hands the crawler a list of pages to waste budget on. If yours might be stale, see what a broken sitemap does to crawling.
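A sitemap audit can be sketched the same way: pull every loc entry out of the XML and flag the ones that do not return 200. The sketch below assumes a standard urlset sitemap (not a sitemap index) and takes the status lookup as a parameter; status_of is a hypothetical callable you would supply, for example a wrapper around the curl command shown earlier.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_urls(xml_text):
    """Return the <loc> values from a sitemap <urlset> document."""
    root = ET.fromstring(xml_text)
    return [
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    ]


def stale_sitemap_entries(xml_text, status_of):
    """List (url, status) for sitemap entries whose status is not 200.

    `status_of` is any callable that returns the HTTP status for a URL.
    """
    stale = []
    for url in sitemap_urls(xml_text):
        status = status_of(url)
        if status != 200:
            stale.append((url, status))
    return stale
```

Anything this returns is a URL the sitemap is actively asking the crawler to fetch even though it is redirected, missing, or erroring.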
Finding all of them in one scan
Everything above is checkable by hand, but checking by hand means running curl against every internal link on every page you care about, then tracing each broken one back to the page it came from. That is fine for a spot check and miserable as a routine.
The LintPage Broken Link Checker does the whole pass in one request: it follows every internal link on a page, checks the live HTTP status of each one, and reports the links returning 404 Not Found, other 4xx client errors like 401 and 403, 5xx server errors, and the ones that time out or cannot be reached at all. Because it reads the actual response code rather than the rendered page, it catches the dead links that a human eye skims past, and it tells you which page they live on so the fix is a single edit rather than a hunt.
Broken links tend to travel with other migration-era debris, so it is worth running the full set of checks on the same page while you are at it.
The 30-second version
An honest 404 is fine: it returns the 404 status code, so search engines drop the URL cleanly and stop crawling it. The expensive failures are broken internal links that strand crawl budget and link equity, and soft 404s, which show "not found" to a human but return 200 OK, so the crawler keeps re-indexing an empty page forever. The status code is the only thing the crawler acts on, so check the response header (curl -I), not what the page displays. Links break in bulk during migrations, redesigns, URL changes, and bulk deletions, so scan after those events and periodically otherwise. Fix by updating the link to the moved URL (or 301-ing it), removing or repointing dead links, and always linking straight to the final destination instead of through a redirect chain.