Digital Snowstorm

Enterprise Technical SEO: Fixing Crawl, Index, and Speed Issues at Scale

At a few hundred pages, technical SEO is hygiene. At millions of pages, it's a systems discipline of ratios, logs, and templates. Here's how to diagnose and fix crawl, index, and speed at scale, and where the real leverage lives.

Illustration of an enterprise crawl and index graph: a site root branching to page nodes marked indexed, crawled, and not-indexed, with a magnifier auditing one node

TL;DR

  • Enterprise technical SEO is a systems discipline. You optimize the templates, architectures, and rules that generate millions of pages, not pages one at a time.
  • Protect crawl capacity by refusing to generate junk. Faceted navigation is the single biggest source of crawl waste; handle it with a per-facet decision matrix.
  • Read the two "not indexed" states correctly: "Discovered" is a crawl/demand problem, "Crawled, not indexed" is a quality/duplication verdict. They have different fixes.
  • Fix speed and structure at the template level so one change touches thousands of pages, and treat quality as the lever that earns more crawling, more indexing, and more trust.

When a site has a few hundred pages, technical SEO is mostly hygiene. You fix your titles, tidy your sitemap, make sure nothing important is blocked, and move on. When a site has hundreds of thousands or millions of pages, the entire job changes. You stop thinking about individual pages and start thinking about systems, ratios, and templates. The questions are no longer "is this page optimized" but "what percentage of my crawl budget is being wasted," "which template is bleeding index coverage," and "if I fix this one thing, how many thousands of pages does it touch."

That shift is the whole game. Almost every mistake I see at enterprise scale comes from someone applying small-site thinking to a big-site problem. So let's walk through how to actually diagnose and fix crawl, index, and speed issues when you're operating at scale, and where the real leverage lives. It's the core of what I do as an enterprise SEO consultant, and the discipline behind every technical SEO engagement.

Crawl Budget Is Real, but It's Misunderstood

Crawl budget gets thrown around a lot, usually by people who don't have a crawl budget problem. Google's own large-site crawl guidance is pretty clear that most sites never need to think about it. You start caring when you've got over a million pages with content that changes regularly, or a smaller site (think 10,000-plus pages) where a lot changes every single day, or when you're watching a growing pile of URLs stuck in "Discovered, currently not indexed." If you're under a few thousand URLs, Google crawls you fine and you should go work on something that matters.

When it does matter, you have to understand what crawl budget actually is, because it's two separate things wearing one name. The first is the crawl capacity limit, which is basically how hard Googlebot is willing to push your server. Respond fast and clean, and that ceiling rises. Start throwing 5xx errors or slowing to a crawl, and Googlebot backs off because it doesn't want to knock your site over. The second is crawl demand, which is how much Google even wants to crawl you in the first place. That's driven by how many URLs Google thinks you have, how popular those URLs are, and how stale they're getting.

This distinction is where most diagnoses go sideways. If your server is fast but Googlebot isn't crawling much, you don't have a capacity problem, you have a demand problem, and no amount of CDN tuning will fix it. Gary Illyes has been blunt about this. Google's stated direction is to crawl the web less, not more, and the way you earn more crawling isn't a setting you toggle. You have to convince Google your stuff is worth fetching. Popularity and quality drive demand. You can't buy your way past that.

So the practitioner's job splits into two tracks: raise capacity by being fast and stable, and stop wasting the capacity you've got. The second track is where the money is.

Crawl Waste Is the Enemy, and Faceted Navigation Is Enemy Number One

Here's the uncomfortable truth about big sites. Googlebot is almost never spending its time on the pages you care about. It's drowning in junk URLs your own platform generated. Sort orders, session IDs, tracking parameters, filter combinations, calendar pages stretching into infinity, soft 404s, redirect chains that bounce three times before landing. Every one of those is a fetch that didn't go toward a page that earns you money.
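To put a number on that waste, you can bucket the URLs Googlebot requested (pulled from your server logs) and compute the share that went to junk. This is a rough sketch, not a finished log pipeline: the parameter names in JUNK_PARAMS and the calendar path pattern are hypothetical placeholders, and it assumes you've already filtered the log down to verified Googlebot requests.

```python
import re
from collections import Counter
from urllib.parse import urlsplit, parse_qs

# Hypothetical junk rules; replace with the patterns your own platform emits.
JUNK_PARAMS = {"sort", "sessionid", "utm_source", "utm_medium", "view"}
CALENDAR_PATH = re.compile(r"^/calendar/\d{4}/\d{2}")

def classify(url: str) -> str:
    """Label one crawled URL as 'junk_param', 'calendar', or 'useful'."""
    parts = urlsplit(url)
    if JUNK_PARAMS & set(parse_qs(parts.query)):
        return "junk_param"
    if CALENDAR_PATH.match(parts.path):
        return "calendar"
    return "useful"

def crawl_waste_ratio(googlebot_urls: list[str]) -> float:
    """Share of Googlebot fetches that landed on junk URLs."""
    counts = Counter(classify(u) for u in googlebot_urls)
    total = sum(counts.values())
    return 0.0 if total == 0 else 1 - counts["useful"] / total

sample = [
    "/shoes/running",
    "/shoes/running?sort=price",
    "/shoes/running?sessionid=abc123",
    "/calendar/2031/07",
]
print(crawl_waste_ratio(sample))  # three of the four sample fetches are junk
```

Tracked per template and per week, that one ratio tells you whether your cleanup work is actually moving Googlebot toward the pages that matter.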

Faceted navigation is the single biggest offender, and it's not close. Google said as much in its 2024 "Crawling December" series on the Search Central blog, calling faceted nav by far the most common source of overcrawl issues people report to them. Gary Illyes has put rough numbers on it: faceted navigation accounts for something like half of all the crawling complaints Google sees, with action parameters like add-to-cart and print making up another quarter. The math is brutal. Ten filters with ten options each can theoretically spin up billions of URL combinations, and Googlebot will happily try to crawl a meaningful chunk of them if you let it.
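The "billions" figure is easy to check. Under the simplifying assumption that each filter is either unset or set to exactly one option, and that parameter order doesn't create extra URLs, the count works out like this:

```python
def facet_url_count(filters: int, options_per_filter: int) -> int:
    """Count filtered URLs when each filter is either unset or set to one option."""
    states_per_filter = options_per_filter + 1  # +1 for "filter not applied"
    return states_per_filter ** filters - 1     # -1 removes the unfiltered category page

print(f"{facet_url_count(10, 10):,}")  # 25,937,424,600
```

Multi-select filters and inconsistent parameter ordering only push the real number higher.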

The fix isn't one switch, it's a decision matrix you apply per facet. Some facets have genuine search demand and deserve to be indexed, like "black running shoes" or "two bedroom apartments in Denver." Those should be clean, crawlable, indexable URLs. Single low-demand filters that don't have their own search intent should canonical back to the parent category. Pure presentation parameters like sort order, pagination view preferences, and session tracking should be blocked in robots.txt outright so Googlebot never wastes a fetch on them. And combinations beyond two or three filters deep almost never have search demand worth the crawl cost, so most large e-commerce sites block those entirely. Wayfair's published approach is a clean example: index the one and two filter paths, block the deeper combinations.
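Written down as code, the matrix is just a few ordered rules. This is a minimal sketch with assumptions of my own: the presentation parameter names are hypothetical, the depth cap of two is one reading of "two or three filters deep," and has_search_demand stands in for whatever keyword research you use to decide that.

```python
from enum import Enum

class FacetAction(Enum):
    INDEX = "index"                        # clean, crawlable, indexable URL
    CANONICAL_TO_PARENT = "canonical"      # crawlable, canonical points at the category
    BLOCK = "robots_disallow"              # never fetched

# Hypothetical inputs: which parameters are presentation-only, and the depth cap.
PRESENTATION_PARAMS = {"sort", "view", "sessionid"}
MAX_INDEXABLE_DEPTH = 2

def decide(active_facets: set[str], has_search_demand: bool) -> FacetAction:
    """Apply the per-facet rules described above to one filtered URL."""
    if active_facets & PRESENTATION_PARAMS:
        return FacetAction.BLOCK
    if len(active_facets) > MAX_INDEXABLE_DEPTH:
        return FacetAction.BLOCK
    if has_search_demand:
        return FacetAction.INDEX
    return FacetAction.CANONICAL_TO_PARENT

print(decide({"color"}, has_search_demand=True))            # FacetAction.INDEX
print(decide({"color", "size", "brand"}, has_search_demand=True))  # FacetAction.BLOCK
```

The order of the checks matters: blocking rules run first so a presentation parameter can never ride along on an otherwise indexable facet.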

One thing worth flagging because people still reach for it: the URL Parameters tool in Search Console is gone. Google retired it back in 2022, saying barely one percent of parameter configurations were even useful and that its crawlers figure parameters out on their own now. There's no drop-in replacement in the UI. The workflow is robots.txt, canonicals, meta robots, and clean URL architecture. That's it.

Indexing Diagnostics: Read the Two "Not Indexed" States Correctly

When you open the Page Indexing report on a large site, two statuses will dominate your attention, and they mean completely different things. Confusing them is the most common analytical error I see.

"Discovered, currently not indexed" means Google knows the URL exists but hasn't bothered to crawl it yet. That's a crawl and demand signal. Google found the link but didn't think the page was worth the trip, usually because it's buried deep in your architecture, your server is too slow to make crawling cheap, or Google has decided that pages matching this pattern tend to be low value so why bother. You fix it by strengthening internal links to those pages, getting them into clean sitemaps, speeding up your server, and pruning the near-duplicate siblings that taught Google to distrust the pattern in the first place.

"Crawled, currently not indexed" is a different animal entirely. Google fetched the page, looked at it, and decided not to index it. That's almost always a quality or duplication verdict. The page is thin, it's a near-copy of something else you already have, or it doesn't match a real search intent. You don't fix that with internal linking. You fix it by making the page genuinely better and more distinct, consolidating cannibalizing pages with canonicals, and slowing down how fast you're firehosing new thin pages into the index.

The broader concept underneath both is index bloat. When you flood Google with low-value URLs, you're not just failing to index those pages, you're taxing the whole site. A big mass of thin or duplicate pages drags down crawl demand and Google's willingness to index everything else you publish. This is why pruning isn't housekeeping, it's an indexation strategy. Removing dead weight can lift the pages you actually care about.

For deciding how to handle any given URL type, you've got four tools and they're not interchangeable. Robots.txt disallow stops crawling but doesn't deindex anything, so it's for infinite spaces and junk parameters you never want fetched. Noindex stops indexing but requires the page to stay crawlable, so never combine it with a robots.txt block or Google can't see the tag. Canonical consolidates signals among near-duplicates that should stay crawlable. And a 404 or 410 is for content that's genuinely gone, with 410 being the slightly stronger "don't come back" signal. Picking the wrong one is how people accidentally tank sections of their site.
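The noindex-plus-robots.txt conflict is easy to catch automatically. Here's a small sketch using Python's standard-library robots parser; the robots.txt rules and URLs are made up for illustration. One caveat: as far as I know the standard-library parser does plain prefix matching and doesn't implement Google's wildcard extensions, so treat it as a first-pass check rather than an exact model of Googlebot.

```python
from urllib.robotparser import RobotFileParser

# Illustrative rules only; load your real robots.txt in practice.
ROBOTS_TXT = """\
User-agent: *
Disallow: /search/
Disallow: /cart/
"""

def find_hidden_noindex(robots_txt: str, noindex_urls: list[str]) -> list[str]:
    """Return noindexed URLs that robots.txt blocks, so the tag can never be read."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [u for u in noindex_urls if not parser.can_fetch("Googlebot", u)]

conflicts = find_hidden_noindex(
    ROBOTS_TXT,
    ["https://example.com/search/red-shoes", "https://example.com/tag/old-promo"],
)
print(conflicts)  # only the /search/ URL is both noindexed and blocked
```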

Internal Linking Is How You Tell Google What Matters

On a million-page site, you cannot manually decide which pages get authority. Your internal linking system does it for you, whether you've designed it intentionally or not. Link equity flows through your architecture, and the pages that sit deep with few internal links are the pages that get crawled rarely and indexed reluctantly.

Click depth correlates hard with indexation. The rough target I work toward is getting around ninety percent of indexable URLs within four clicks of the homepage. Pages within three clicks tend to get crawled and indexed fast. Once you're past depth five, you're relying on deliberate internal linking to keep those pages alive, because Google's default behavior is to deprioritize the stuff that's hard to reach. JetOctopus has a well-known case where revising internal linking moved Googlebot's coverage from forty percent of pages to seventy percent. That's a thirty-point swing from architecture alone, no new content required.
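Click depth is just a breadth-first search over your internal link graph. The sketch below assumes you've exported that graph from a crawler as a mapping of page to outgoing links; the tiny graph and the four-click cutoff are illustrative.

```python
from collections import deque

def click_depths(links: dict[str, list[str]], home: str) -> dict[str, int]:
    """Breadth-first search from the homepage; depth = minimum clicks to reach a URL."""
    depths = {home: 0}
    queue = deque([home])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in depths:
                depths[target] = depths[page] + 1
                queue.append(target)
    return depths

def share_within(depths: dict[str, int], indexable: set[str], max_depth: int) -> float:
    """Fraction of indexable URLs reachable within max_depth clicks (unreachable = miss)."""
    if not indexable:
        return 0.0
    hits = sum(1 for url in indexable if depths.get(url, max_depth + 1) <= max_depth)
    return hits / len(indexable)

graph = {
    "/": ["/a", "/b"],
    "/a": ["/a/1"],
    "/a/1": ["/a/1/x"],
    "/a/1/x": ["/a/1/x/y"],
    "/a/1/x/y": ["/deep"],
}
depths = click_depths(graph, "/")
print(share_within(depths, {"/a", "/b", "/a/1", "/deep"}, max_depth=4))
```

Run it per template and you can see which page types fall outside the target rather than just getting one site-wide number.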

At scale, this has to be programmatic. A hundred-thousand-page site running a couple hundred links per page is dealing with tens of millions of internal links, and you're not placing those by hand. You're building related-content modules, breadcrumb systems, and contextual linking rules into your templates, organized around a hub-and-spoke or pillar-cluster structure so authority concentrates where topical relevance is strongest. Then you hunt for orphans, which are pages that exist and even show up in your logs or analytics but have no internal links pointing to them. Finding orphans means merging your crawl data with your server logs and Search Console data, because the orphan by definition won't appear in your crawler's link graph. It's only visible in the gap between datasets.
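Once the three datasets are normalized to the same URL format, the orphan hunt is a set difference. A minimal sketch, with made-up URLs standing in for your crawl export, log extract, and Search Console export:

```python
def find_orphans(crawled: set[str], log_hits: set[str], gsc_urls: set[str]) -> set[str]:
    """URLs seen in logs or Search Console that the link-following crawl never reached."""
    known_elsewhere = log_hits | gsc_urls
    return known_elsewhere - crawled

crawled = {"/", "/products/a", "/products/b"}
log_hits = {"/products/a", "/products/old-landing"}
gsc_urls = {"/products/b", "/promo/2019"}

print(sorted(find_orphans(crawled, log_hits, gsc_urls)))
```

The hard part in practice is the normalization (protocol, host, trailing slash, case), not the comparison itself.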

One caution: flatter is generally better for discovery, but don't solve depth by dumping five hundred links into a mega-menu or footer. That technically puts everything at depth one while diluting your equity into mush and flattening the hierarchy that helps Google understand what's actually important. Keep your navigation purposeful.

Sitemaps Are a Diagnostic Instrument, Not a Formality

Most people treat the XML sitemap as a box to check. At scale, it's one of your best diagnostic tools, and you're wasting it if you dump every URL into one giant file.

The limits are fifty thousand URLs and fifty megabytes per file, and you nest them under a sitemap index. The real value comes from how you segment. Split your sitemaps by template or section, so you've got one for products, one for articles, one for category pages, and so on. Now submit the index and watch the indexed-to-submitted ratio for each child sitemap separately. If your articles sitemap indexes at ninety-five percent and your products sitemap sits at forty percent, you've just localized your problem to product URLs without crawling a single page. That per-sitemap ratio, tracked weekly, is the single best indexation monitoring instrument I know of for large sites.
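The per-sitemap ratio is simple enough to compute from whatever export you pull out of Search Console. In this sketch the input shape (sitemap name mapped to submitted and indexed counts), the counts themselves, and the eighty percent alert floor are all assumptions for illustration.

```python
def index_ratios(report: dict[str, tuple[int, int]]) -> dict[str, float]:
    """Map sitemap name -> indexed / submitted, given (submitted, indexed) counts."""
    return {
        name: (indexed / submitted if submitted else 0.0)
        for name, (submitted, indexed) in report.items()
    }

def weak_sitemaps(ratios: dict[str, float], floor: float = 0.8) -> list[str]:
    """Names of sitemaps whose index ratio sits below the alert floor."""
    return sorted(name for name, ratio in ratios.items() if ratio < floor)

report = {
    "sitemap-articles.xml": (20000, 19000),
    "sitemap-products.xml": (50000, 20000),
}
ratios = index_ratios(report)
print(ratios)
print(weak_sitemaps(ratios))
```

Store the weekly output and alert on the trend, since a ratio that drops ten points in a week matters more than one that has always been mediocre.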

For this to work the sitemaps have to be clean. Only canonical, indexable, 200-status, stable URLs belong in there. The second you stuff in redirects, noindexed pages, and blocked URLs, you've diluted the signal and your ratios become noise. And handle the lastmod field with integrity. Google does use lastmod as a crawl scheduling hint now, but only if it trusts you, and it stops trusting you the moment it notices every page claims it was modified "just now." John Mueller has been clear that if your CMS stamps the current timestamp on everything, Google eventually stops believing you. Set lastmod only on real content changes. Changefreq and priority, by the way, are ignored, so don't bother.
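One way to keep lastmod honest is to move it only when a fingerprint of the main content changes. This is a sketch under assumptions: it supposes you can extract the meaningful body text of a page separately from boilerplate, and that you store the previous hash and date somewhere. The function names are hypothetical.

```python
import hashlib
from datetime import date

def content_fingerprint(main_content: str) -> str:
    """Hash only the meaningful body text, normalised for whitespace."""
    normalised = " ".join(main_content.split())
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()

def next_lastmod(stored_hash: str, stored_lastmod: date,
                 main_content: str, today: date) -> tuple[str, date]:
    """Return the (hash, lastmod) pair to store; lastmod moves only on a real change."""
    new_hash = content_fingerprint(main_content)
    if new_hash == stored_hash:
        return stored_hash, stored_lastmod
    return new_hash, today

old_hash = content_fingerprint("Price: 10 USD")
# A whitespace-only difference keeps the old date.
print(next_lastmod(old_hash, date(2030, 1, 1), "Price:   10 USD", date(2030, 2, 1))[1])
# A real content change moves it.
print(next_lastmod(old_hash, date(2030, 1, 1), "Price: 12 USD", date(2030, 2, 1))[1])
```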

Quick note on a deprecation people miss: Google killed the sitemap ping endpoint in 2023. The old trick of pinging Google when your sitemap updates returns a 404 now. Google said the vast majority of those pings were spam. Submit through Search Console and the robots.txt Sitemap directive instead.

JavaScript and Rendering: Better Than It Was, Still a Trap

The old story about JavaScript SEO was the "two waves of indexing," where Google crawled your raw HTML first and came back later to render the JavaScript, sometimes much later. That model is fading. Google's own people have said rendering plays less and less of a role as a bottleneck because it's cheaper than everyone assumed, and Google now renders essentially all HTML pages, with a median crawl-to-render delay measured in seconds rather than days.

That's the optimistic read. The realistic read is that rendering still costs real crawl resources, and JavaScript-dependent content still gets indexed slower and less reliably than content that's there in the initial HTML. Onely's research has shown JavaScript content can take dramatically longer to actually make it into the index. Google also only processes the first two megabytes of HTML and ignores oversized resources, so a bloated JavaScript payload can quietly cost you.

The biggest practical shift is that Google no longer recommends dynamic rendering. Its documentation now carries a warning calling it a workaround rather than a long-term solution, and pointing you toward server-side rendering, static rendering, or hydration instead. So the modern playbook is static generation for content-heavy pages, server-side rendering where content is dynamic, and client-side JavaScript reserved for genuinely interactive UI. The framework matters less than the rendering strategy. React, Next.js, Angular, and Vue all work fine at scale as long as your critical content and links exist in server-rendered or static HTML.

There's a newer reason this matters even more, which I'll come back to: most AI crawlers don't render JavaScript at all. If your content only exists after client-side rendering, it's invisible to them. Server-rendering your content is no longer just a Google optimization. It's how you stay visible across the whole retrieval ecosystem.

Diagnosing JavaScript issues comes down to comparing your raw HTML against your rendered HTML. View source shows you what Googlebot gets on the first pass. The URL Inspection tool in Search Console shows you the rendered DOM. When canonicals, hreflang, content, or links appear in one but not the other, you've found your problem.
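That comparison can be scripted. The sketch below uses Python's built-in HTML parser to pull canonicals, hreflang annotations, and anchor links out of two HTML strings and report what exists only after rendering. It assumes you've already captured the raw response and the rendered DOM yourself; the sample markup is invented.

```python
from html.parser import HTMLParser

class SeoSignals(HTMLParser):
    """Collect canonical, hreflang, and anchor hrefs from one HTML document."""
    def __init__(self):
        super().__init__()
        self.canonical, self.hreflang, self.links = set(), set(), set()

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        rel = (a.get("rel") or "").lower()
        if tag == "link" and rel == "canonical":
            self.canonical.add(a.get("href"))
        elif tag == "link" and rel == "alternate" and a.get("hreflang"):
            self.hreflang.add((a["hreflang"], a.get("href")))
        elif tag == "a" and a.get("href"):
            self.links.add(a["href"])

def extract(html: str) -> SeoSignals:
    parser = SeoSignals()
    parser.feed(html)
    return parser

def rendered_only(raw_html: str, rendered_html: str) -> dict[str, set]:
    """Signals present in the rendered DOM but missing from the raw HTML."""
    raw, rendered = extract(raw_html), extract(rendered_html)
    return {
        "canonical": rendered.canonical - raw.canonical,
        "hreflang": rendered.hreflang - raw.hreflang,
        "links": rendered.links - raw.links,
    }

RAW = '<html><head></head><body><div id="app"></div></body></html>'
RENDERED = ('<html><head><link rel="canonical" href="https://example.com/p"></head>'
            '<body><a href="/related">Related</a></body></html>')
print(rendered_only(RAW, RENDERED))
```

Anything that shows up in this diff on a critical template is content or linking that depends on JavaScript execution to exist.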

Core Web Vitals: Fix the Template, Fix Everything

Speed is a confirmed ranking signal, but a modest one. It works as a tiebreaker between otherwise comparable pages, mostly visible on competitive mobile results, and it will never rescue weak content. Don't obsess over chasing a perfect score for its own sake. Do treat it as a site-wide multiplier worth getting into the green.

The metrics you're measured on are LCP for loading, INP for responsiveness, and CLS for visual stability, all judged at the seventy-fifth percentile of real-world field data, not lab scores. The big recent change is that INP replaced FID on March 12, 2024. This matters because INP is genuinely harder to pass. The old metric only looked at the delay on your first interaction. INP looks at every interaction across the whole visit, including processing and rendering time, so JavaScript-heavy sites that skated by on FID often find themselves in trouble on INP. The thresholds: LCP good at 2.5 seconds or under, INP good at 200 milliseconds or under, CLS good at 0.1 or under. You need at least seventy-five percent of visits hitting "good" on all three to earn a passing grade.
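To make the seventy-fifth percentile rule concrete, here's a small sketch that scores raw field samples against the "good" thresholds above. It uses a simple nearest-rank percentile, which is my assumption and may differ slightly from how Google's field datasets aggregate; the sample values are invented, and each metric needs at least one sample.

```python
import math

# "Good" limits as stated in the text above.
THRESHOLDS = {"lcp_ms": 2500, "inp_ms": 200, "cls": 0.1}

def p75(values: list[float]) -> float:
    """Nearest-rank 75th percentile of a non-empty list."""
    ordered = sorted(values)
    rank = math.ceil(0.75 * len(ordered))
    return ordered[rank - 1]

def passes_cwv(field_data: dict[str, list[float]]) -> dict[str, bool]:
    """Per metric: does the 75th percentile sit at or under the 'good' limit?"""
    return {metric: p75(field_data[metric]) <= limit
            for metric, limit in THRESHOLDS.items()}

samples = {
    "lcp_ms": [1800, 2100, 2400, 3900],
    "inp_ms": [90, 150, 260, 400],
    "cls": [0.02, 0.05, 0.08, 0.3],
}
print(passes_cwv(samples))
```

Note how the slowest quarter of visits can be terrible on LCP and CLS without failing the page, while INP fails here because the problem reaches into the third quartile.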

This is the defining principle of technical SEO at scale: you're not optimizing pages, you're optimizing the templates that generate pages. One template fix touches every page built from it.

Here's the leverage that makes this an enterprise problem rather than a tedious one. Your pages are built from templates. Fix the LCP element in the product template and you've fixed it across every product page. Reserve space for your images in the template and CLS disappears site-wide. Trim the JavaScript in your shared header and INP improves everywhere at once.

The priority order for fixes usually starts at the server. Time to first byte under 800 milliseconds, ideally far lower, because no amount of front-end tuning compensates for a slow origin. Get there with a CDN, server-side caching, and database optimization. Then attack LCP with compressed and properly sized hero images in modern formats like WebP or AVIF, with preloading. Then INP by code-splitting, deferring non-critical JavaScript, and reining in third-party tags. Then CLS by setting explicit dimensions on images and embeds and handling font loading cleanly.

Programmatic SEO: Where the Value Line Actually Sits

Programmatic SEO, generating lots of pages from a dataset and a template, is either your biggest growth lever or your fastest route to a penalty, and the difference comes down to one thing: does each page add real value, or is it the same boilerplate with the keywords swapped out.

The companies that win at this have a dataset competitors can't easily copy. Zapier builds a page for every app integration and every tool-to-tool combination off its real integration database. Tripadvisor generates location and category pages backed by live reviews, ratings, and photos. G2 has verified user ratings. Wise has a live currency engine. In every case, the data is the moat, and each page genuinely answers a distinct query with information you can't get elsewhere. John Mueller has called programmatic SEO a fancy banner for spam, and he's right about the bad version. The good version is just databases meeting search intent.

The line got sharper in March 2024 when Google rolled out its scaled content abuse policy alongside a core update. The policy targets generating many pages primarily to manipulate rankings rather than help users, and it explicitly does not care how the pages were made: human, automated, AI, or some hybrid. The core update that came with it was aimed at cutting low-quality, unoriginal content in search results by forty percent, and Google later reported a reduction of forty-five percent. So the test for any programmatic project is simple to state and hard to fake. Start from a dataset that's genuinely yours or genuinely useful. Confirm the page combinations have real search demand. Only publish the variations you can make meaningfully different from each other. And withhold the pages where your data can't differentiate them, because publishing those just invites the index bloat and quality problems we talked about earlier. Then monitor your per-sitemap index ratios and prune the underperformers.
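"Meaningfully different" can be approximated before you publish. One common technique, which is my suggestion rather than anything Google prescribes, is to compare sibling pages by the overlap of their word shingles (Jaccard similarity) and hold back pairs that score too high. The three-word shingle size is an arbitrary choice here.

```python
def shingles(text: str, size: int = 3) -> set[tuple[str, ...]]:
    """All runs of `size` consecutive words in the text."""
    words = text.lower().split()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}

def jaccard(a: str, b: str) -> float:
    """Shared shingles divided by total distinct shingles (1.0 = identical)."""
    sa, sb = shingles(a), shingles(b)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)

page_a = "best hotels in denver with pool and gym"
page_b = "best hotels in austin with pool and gym"
print(round(jaccard(page_a, page_b), 3))
```

On real templates you'd run this on the rendered body text, not the title, and pick a cutoff by looking at pages you already know are indexed versus ignored.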

AI Content at Scale: The Guardrails Are Non-Negotiable

This deserves its own section because it's where I see the most reckless behavior right now. Google's position on AI content is method-neutral and quality-based. Its February 2023 guidance says appropriate use of AI isn't against the guidelines, that AI content is just content, and that if it's useful and original and demonstrates real expertise it can do well. The same guidance says using automation, including AI, to generate content primarily to manipulate rankings is a spam violation. So AI is fine. AI slop at scale is not.

Google's helpful content framework gives you the self-assessment: who made it, how it was made, and why. Make authorship clear where readers would expect it. Disclose AI involvement when it's reasonable to. And be honest that the primary purpose is to help people rather than to rank. The "why" is the one that catches people, because if the honest answer is "to capture traffic at volume," you're already on the wrong side of the line.

The cautionary tales are real and they're severe. After the March 2024 update, a wave of sites that had published AI content at industrial scale got deindexed. The most-cited example published tens of thousands of articles over about six months on unrelated trending topics and went from over a million monthly organic visitors to effectively zero after a manual action. Search Engine Journal, drawing on Ian Nuttall's tracking, reported that the March 2024 manual actions fully deindexed hundreds of sites accounting for over twenty million monthly organic visits combined. Those are third-party attributions inferring penalties from deindexing patterns rather than confirmed notices, but the pattern is unmistakable. Publishing thousands of thin AI pages fast is a flashing signal to Google.

The pipeline that survives looks like this: use AI to draft and assist from structured data, then put a human in the loop for fact-checking, originality, and genuine expertise. Add firsthand experience, expert commentary, or proprietary data that an LLM couldn't generate on its own. Deduplicate against what you already have. Stagger publishing rather than dumping everything at once. And kill the pages that can't clear a real quality bar before they ever go live.

Going Multilingual Without Creating a Mess

International SEO at scale is mostly about implementing hreflang correctly and then keeping it correct, which is harder than it sounds when every page in every language has to stay in sync.

You've got three ways to implement hreflang: tags in the HTML head, HTTP headers, or annotations in your XML sitemaps. For large multilingual sites, sitemap annotations win. They centralize the logic, they don't bloat your HTML with dozens of tags per page, and they're far easier to audit. Whatever you choose, pick one method per page and never mix them.

The rules that break people at scale are the reciprocity and self-reference rules. Every page needs a self-referencing hreflang tag, and every annotation has to be bidirectional. If your English page points to your German page, the German page has to point back, or Google ignores the whole set. With ten locales, every page carries ten annotations and all ten pages have to stay synchronized, which is why you generate hreflang programmatically from a single source of truth and never, ever maintain it by hand. Add an x-default for users who don't match any of your locales.
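Generating from a single source of truth can look like this: one mapping of locale to URL per piece of content, from which every page receives the identical annotation set. That makes self-reference and reciprocity true by construction. The URLs and function names are illustrative.

```python
def hreflang_annotations(cluster: dict[str, str],
                         x_default: str) -> dict[str, list[tuple[str, str]]]:
    """For each URL in a translation cluster, emit the full annotation set.

    Every page gets the same list (itself included), so the set is
    self-referencing and reciprocal by construction.
    """
    full_set = sorted(cluster.items()) + [("x-default", x_default)]
    return {url: list(full_set) for url in cluster.values()}

def is_reciprocal(annotations: dict[str, list[tuple[str, str]]]) -> bool:
    """True when every page A that lists page B is listed back by B."""
    for page, entries in annotations.items():
        for lang, target in entries:
            if lang == "x-default":
                continue
            back_links = {t for _, t in annotations.get(target, [])}
            if page not in back_links:
                return False
    return True

cluster = {
    "en": "https://example.com/en/pricing",
    "de": "https://example.com/de/preise",
}
annotations = hreflang_annotations(cluster, x_default=cluster["en"])
print(is_reciprocal(annotations))
```

The validator is worth keeping even with generated markup, because it's what you run against the live site to catch a locale that was unpublished or redirected after the fact.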

Watch the interaction with canonicals, because this one is subtle. Each language version has to self-canonicalize. Never point your French page's canonical at the English version, because the moment your canonical and hreflang disagree, Google throws out the hreflang. As Mueller has put it, if the canonical isn't part of the hreflang pairs, the markup gets ignored.

For URL structure, subdirectories like /de/ are the common enterprise choice because they consolidate authority on one domain and are the easiest to manage. Country-code domains send the strongest geographic signal but split your authority and cost the most to run. And on translation: machine translation is acceptable to Google if it's reviewed and useful, but dumping raw unreviewed auto-translation across thousands of pages can land you right back in scaled content abuse territory. Localize titles, metas, headings, and content for real. Don't just run them through a translator and ship.

Templates, Titles, and the Elements Google Rewrites Anyway

At scale, every element is templated, and the trick is templating for uniqueness rather than producing ten thousand near-identical title tags. Pull differentiating data into the template, like price, location, model, or count, so each page genuinely differs.

Worth knowing how much effort to spend on titles, because Google rewrites a lot of them. Cyrus Shepard's Zyppy study of over eighty thousand URLs found Google rewrote around sixty percent of titles at least partially, and more recent analysis puts it even higher. The triggers are predictable. Titles over seventy characters get rewritten almost every single time. Bracketed text gets removed often. The sweet spot is fifty to sixty characters, and the most reliable defense against a rewrite is making your title closely match your H1. So at the template level, keep titles in that range and keep them aligned with the on-page H1, and you'll keep more of your titles intact. The rest of the on-page SEO playbook follows the same templated logic.
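In template terms, that means building the title from the H1 and only appending extras when they fit. A minimal sketch, assuming a sixty-character ceiling from the range above; the field names and brand are placeholders.

```python
MAX_TITLE_CHARS = 60

def build_h1(count: int, category: str, city: str) -> str:
    """Pull differentiating data into the heading so each page is distinct."""
    return f"{count} {category} in {city}"

def build_title(h1: str, brand: str) -> str:
    """Start from the H1 so title and heading stay aligned; add brand only if it fits."""
    with_brand = f"{h1} | {brand}"
    return with_brand if len(with_brand) <= MAX_TITLE_CHARS else h1

h1 = build_h1(42, "Two-Bedroom Apartments", "Denver")
print(build_title(h1, "Example Rentals"))
# 42 Two-Bedroom Apartments in Denver | Example Rentals
```

The sketch deliberately avoids brackets and parentheses in the generated string, since bracketed text is one of the rewrite triggers mentioned above.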

For URLs, favor clean static descriptive paths for anything you want indexed, keep them lowercase, pick a trailing-slash policy and hold to it, and keep depth shallow. Reserve query parameters for the low-value stuff you intend to block or canonicalize anyway.

Migrations and Status Codes: Where Careers Get Made or Ended

Two more things that only get scarier with scale. The first is duplicate handling at the infrastructure level. Pick one host, one protocol, one trailing-slash convention, one case convention, and enforce them everywhere. Inconsistent canonical logic across a big site erodes Google's trust in your canonicals generally.
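Those conventions are easiest to enforce when one function defines them and everything else (redirect rules, canonical tags, sitemap generation) calls it. This sketch assumes HTTPS, lowercase paths, and a configurable trailing-slash policy; it keeps the query string and drops fragments, and it doesn't special-case file-like paths such as .pdf, which you'd want to handle in real use.

```python
from urllib.parse import urlsplit, urlunsplit

def normalise_url(url: str, trailing_slash: bool = True) -> str:
    """Return the single canonical form of a URL under the site's conventions."""
    _scheme, host, path, query, _fragment = urlsplit(url)
    path = path.lower() or "/"
    if trailing_slash and not path.endswith("/"):
        path += "/"
    elif not trailing_slash and path != "/":
        path = path.rstrip("/")
    return urlunsplit(("https", host.lower(), path, query, ""))

print(normalise_url("HTTP://Example.com/Shoes/Running"))
# https://example.com/shoes/running/
```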

The second is migrations, where the redirect map is the entire project. A good migration playbook is built on a one-to-one redirect for every indexed URL, using 301 or 308 because Google treats permanent redirects as a strong signal that the target should become the canonical URL, while temporary 302-style redirects are only a weak one. You build and validate that map on a staging environment that's noindexed and locked down, crawl it to confirm coverage, then launch and monitor on a 30, 60, and 90 day cadence. Watch your logs in the first few days to confirm Googlebot is actually crawling the new URLs, and watch Search Console for redirect errors and "crawled, not indexed" spikes. Search Engine Journal's study of nearly nine hundred domain migrations found it took well over a year on average to return to prior traffic levels, with the fastest recoveries landing in a few weeks. The fast ones all shared the same traits: complete redirect coverage, no chains, staged validation, and obsessive early monitoring. For very large sites, phase the migration by section so you can actually diagnose what breaks. And eliminate redirect chains relentlessly, because every extra hop wastes crawl and slows users.
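Coverage and chains are both checkable before launch. Here's a sketch that audits a redirect map held as a simple old-to-new dictionary; the paths are invented, and a real audit would also confirm each final target returns a 200.

```python
def resolve(redirects: dict[str, str], start: str, max_hops: int = 10) -> tuple[str, int]:
    """Follow the map from start; return (final URL, hop count). Raises on loops."""
    seen, url, hops = {start}, start, 0
    while url in redirects:
        url = redirects[url]
        hops += 1
        if url in seen or hops > max_hops:
            raise ValueError(f"redirect loop or runaway chain from {start}")
        seen.add(url)
    return url, hops

def audit(redirects: dict[str, str], indexed_urls: set[str]) -> dict[str, list[str]]:
    """List indexed URLs with no redirect, and sources that take more than one hop."""
    missing = sorted(u for u in indexed_urls if u not in redirects)
    chains = sorted(u for u in redirects if resolve(redirects, u)[1] > 1)
    return {"missing": missing, "chains": chains}

redirect_map = {"/old-a": "/new-a", "/old-b": "/old-a"}
print(audit(redirect_map, {"/old-a", "/old-b", "/old-c"}))
```

Fixing a chain means rewriting the source to point straight at the final destination, which is why resolve returns the final URL as well as the hop count.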

The New Layer: AI Crawlers and Who Gets Access

The last piece is genuinely new and it's now a real decision, not an afterthought. A whole class of AI crawlers is hitting your site: GPTBot, OAI-SearchBot, and ChatGPT-User from OpenAI, ClaudeBot and the Claude search and user agents from Anthropic, plus PerplexityBot, Google-Extended, CCBot, and others. They don't all do the same thing, which is the key insight. Some are pulling content to train models. Some are retrieving content live to answer a user's question and cite you. Those deserve different treatment.

The strategy most enterprises are landing on is selective allow. Let the search and retrieval bots in, because they drive citations and referral traffic, and block the pure training crawlers if your legal or brand stance calls for it. But be careful what you block. If you block OAI-SearchBot, for instance, you remove yourself from ChatGPT's search answers entirely, which is probably not what you want. And remember that some of these bots have been documented ignoring robots.txt, so if you're serious about blocking something, you enforce it at the WAF or edge layer rather than trusting a directive in a text file. Two practical notes that tie back to earlier: Google-Extended only affects Gemini training and has zero effect on your Google Search ranking, so blocking it doesn't hurt your search visibility. And since most of these AI crawlers don't render JavaScript, your server-rendered HTML is what determines whether you exist to them at all. The rendering strategy you chose for Googlebot is now also your GEO / AEO visibility strategy across the entire AI answer ecosystem.
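A selective-allow policy can be generated and sanity-checked in a few lines. The split of bots into retrieval and training groups below is my assumption based on how the operators describe them, so confirm it against each operator's current documentation before you ship it. And as noted above, this only expresses your preference; enforcement still belongs at the WAF or edge.

```python
from urllib.robotparser import RobotFileParser

# Assumed classification; verify against each operator's published docs.
RETRIEVAL_BOTS = ["OAI-SearchBot", "ChatGPT-User", "PerplexityBot"]
TRAINING_BOTS = ["GPTBot", "ClaudeBot", "Google-Extended", "CCBot"]

def build_robots_txt(allow: list[str], block: list[str]) -> str:
    """One group per named bot, then a permissive default for everyone else."""
    groups = [f"User-agent: {bot}\nAllow: /" for bot in allow]
    groups += [f"User-agent: {bot}\nDisallow: /" for bot in block]
    groups.append("User-agent: *\nAllow: /")
    return "\n\n".join(groups) + "\n"

robots_txt = build_robots_txt(RETRIEVAL_BOTS, TRAINING_BOTS)

# Sanity-check the output with the standard-library parser.
checker = RobotFileParser()
checker.parse(robots_txt.splitlines())
for bot in ["OAI-SearchBot", "GPTBot", "Googlebot"]:
    print(bot, checker.can_fetch(bot, "https://example.com/guide"))
```

Generating the file from two lists keeps the decision reviewable by legal and brand teams without anyone having to read robots.txt syntax.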

The Mindset That Ties It All Together

If there's one thing to take away, it's that enterprise technical SEO is a systems discipline, not a page-by-page one. You're not optimizing pages, you're optimizing the templates, architectures, and rules that generate millions of pages. You diagnose with ratios and logs and segmented data rather than spot checks. You protect crawl capacity by refusing to generate junk, you read the two "not indexed" states as the different problems they are, you fix speed and structure at the template level so one change touches thousands of pages, and you treat quality as the lever that earns you everything else: more crawling, more indexing, and more trust.

The basics still matter. They just stop being the work. The work is building the machine that keeps the basics true across millions of URLs at once.

And then watching the right numbers closely enough to catch it the moment the machine drifts.

Frequently Asked Questions

Do I need to worry about crawl budget?

Probably not, unless you're large. Google says most sites never need to think about it. Start caring at over a million regularly-changing pages, at 10,000-plus pages where a lot changes daily, or when "Discovered, currently not indexed" is piling up. Under a few thousand URLs, Google crawls you fine.

What's the difference between "Discovered, currently not indexed" and "Crawled, currently not indexed"?

"Discovered, currently not indexed" means Google knows the URL exists but hasn't crawled it, a crawl and demand signal you fix with internal links, clean sitemaps, a faster server, and pruning near-duplicates. "Crawled, currently not indexed" means Google fetched it and chose not to index it, a quality or duplication verdict you fix by making the page genuinely better and distinct, not with more links.

Is JavaScript bad for SEO?

Not inherently. Google renders nearly all HTML pages now, often within seconds, but rendering still costs crawl resources and JS-dependent content indexes slower and less reliably. Google no longer recommends dynamic rendering; use static generation or server-side rendering for critical content. It matters even more for AI crawlers, most of which don't render JavaScript at all.

Does Google penalize programmatic or AI-generated content?

Not for how it's made; what gets judged is quality. Google's scaled content abuse policy targets generating many pages mainly to manipulate rankings, and it doesn't care whether they're human, automated, or AI-made. Programmatic and AI content works when each page is backed by genuinely useful or proprietary data, meets real search demand, and is meaningfully different from its siblings. Thin pages published fast and at volume are what get penalized.

Should I block AI crawlers?

Be selective. Let search and retrieval bots (like OAI-SearchBot) in because they drive citations and referral traffic; block pure training crawlers only if your legal or brand stance calls for it. Blocking OAI-SearchBot removes you from ChatGPT's search answers, and Google-Extended only affects Gemini training, not your Search ranking. If you must block something, enforce it at the WAF or edge, since some bots ignore robots.txt.
