What does a page content extractor do?

A page content extractor fetches a URL and pulls out the main article body, stripping away navigation, sidebars, ads, footers, comments, and other chrome. The output is usually clean Markdown or plain text representing what a reader would consider "the content," along with structured metadata like the title, author, publish date, and lead image. The simplest extractors use heuristics like Mozilla's open-source Readability.js algorithm, which scores each DOM node on text density and link ratios to identify the main content block. More sophisticated tools like Diffbot use machine learning models trained on millions of pages to recognize layouts at a per-template level. Modern extractors handle JavaScript-rendered single-page apps (SPAs) by spinning up a headless browser, executing the page, and parsing the rendered DOM rather than the static HTML. Use cases include feeding LLMs clean context, archiving content for offline reading, building research datasets, and doing competitive content analysis. The Grigora extractor uses a Readability + heuristic approach optimized for blog posts, news articles, and documentation pages, returning clean text plus the most relevant metadata in under 2 seconds.

How does Readability.js work?

Readability.js is the open-source content extraction library that powers Firefox Reader Mode. It walks the DOM, scores each block-level element on a heuristic that rewards long paragraphs, penalizes high link-to-text ratios, and rewards semantic class names like "content" or "article." After scoring, it picks the highest-scoring node as the main content and pulls its descendants. The algorithm has been refined since 2010 and handles most blog and news layouts well. Where it struggles: highly templated sites where boilerplate has more text than the actual article (think product detail pages with extensive specs), JavaScript-rendered content that has not been hydrated, paywalled articles where the body is replaced by a teaser, and sites that put navigation inside the article DOM. For these, Readability either returns a fragment or grabs the wrong block. The Grigora extractor uses Readability as a baseline and falls back to additional heuristics (semantic HTML5 tags, ARIA roles, schema.org Article markup) when Readability scores below a confidence threshold. Output quality is typically 90%+ accuracy on news sites, 85% on blogs, and 60-70% on ecommerce or app-style pages.
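The scoring idea described above can be illustrated with a toy sketch. This is not Readability.js's actual formula: the weights, the Block type, and the function names are all hypothetical, and the blocks are assumed to be pre-parsed into text, link text, and class name.

```python
from typing import List, NamedTuple

# Class-name hints taken from the description above; the bonus size is an assumption.
POSITIVE_HINTS = ("content", "article")
HINT_BONUS = 25.0


class Block(NamedTuple):
    """A candidate block-level element reduced to the signals we score."""
    name: str             # label for the element, e.g. "div"
    text: str             # all visible text inside the block
    link_text: str        # the part of that text that sits inside links
    class_name: str = ""


def link_density(block: Block) -> float:
    """Share of the block's text that is link text (0.0 to 1.0)."""
    if not block.text:
        return 1.0
    return min(1.0, len(block.link_text) / len(block.text))


def score(block: Block) -> float:
    """Reward long text and hinting class names, then scale down by link density."""
    base = len(block.text) / 100.0
    if any(hint in block.class_name.lower() for hint in POSITIVE_HINTS):
        base += HINT_BONUS
    return base * (1.0 - link_density(block))


def pick_main(blocks: List[Block]) -> Block:
    """Return the highest-scoring block as the main content candidate."""
    return max(blocks, key=score)
```

A navigation block whose text is entirely links scores zero here, which is why link-heavy menus rarely win against a long article body.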

How do I extract content from JavaScript-heavy sites?

Sites built with React, Vue, Angular, or other client-side frameworks render content after the initial HTML loads, which means a simple HTTP fetch often returns a nearly empty shell. To extract content from these, you need a headless browser like Puppeteer (Chrome), Playwright (multi-browser), or Selenium that loads the page, executes JavaScript, waits for rendering to complete, and then extracts the rendered DOM. Wait conditions matter: some SPAs lazy-load content based on scroll position, requiring you to scroll the viewport to the bottom before extraction. Others load content from APIs after a delay, requiring you to wait for network idle. Costs are higher than static fetches: a Puppeteer call typically takes 2-5 seconds and 50-200 MB of memory, versus 100 ms and 5 MB for a static fetch. For high-volume extraction, services like Browserless, ScrapingBee, and Apify offer headless browser pools at $0.001-0.005 per page. The Grigora extractor automatically detects SPA pages (low static HTML token count, framework-specific markers) and switches to headless rendering when needed.
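As a rough illustration of the detection step (few words in the static HTML plus framework-specific markers), here is a minimal sketch. The marker list, the 50-word threshold, and the function names are assumptions for illustration, not the Grigora implementation.

```python
import re

# Hypothetical marker list and threshold; tune both against real pages.
FRAMEWORK_MARKERS = ('id="root"', 'id="app"', "__NEXT_DATA__", "data-reactroot", "ng-version")
MIN_VISIBLE_WORDS = 50


def visible_word_count(html: str) -> int:
    """Count the words left after dropping script/style blocks and all tags."""
    no_code = re.sub(r"(?is)<(script|style)\b.*?</\1\s*>", " ", html)
    text = re.sub(r"(?s)<[^>]+>", " ", no_code)
    return len(text.split())


def looks_like_spa(html: str) -> bool:
    """Guess whether the static HTML is an empty shell that needs a headless render."""
    has_marker = any(marker in html for marker in FRAMEWORK_MARKERS)
    return has_marker and visible_word_count(html) < MIN_VISIBLE_WORDS
```

Requiring both signals avoids sending server-rendered pages that merely contain a framework marker through the slower headless path.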

What is the difference between content and chrome?

In web extraction terminology, content is the unique, page-specific text that a user came to read; chrome is the surrounding template UI that repeats across many pages of the same site. Examples of chrome: site header, main navigation, sidebar with related links, footer with site map, comment sections, ad placements, newsletter signup boxes, social share buttons, breadcrumb trails, and "you might also like" widgets. Distinguishing content from chrome is the central challenge of web extraction. The cleanest signal is HTML5 semantic markup: elements such as <article>, <main>, <nav>, <aside>, and <footer> explicitly mark zones, but only about 60-65% of sites use them correctly per a 2025 W3C survey. Most extractors fall back to heuristics: text density, link ratios, repeated class names across the site, and frequency of common chrome words ("subscribe", "trending", "popular"). Machine learning approaches (like Diffbot's) train on labeled examples to learn site-specific layouts. The Grigora extractor combines all three approaches and exposes a confidence score so you know when an extraction is reliable.
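When a page does use semantic markup, the simplest filter is to drop text nested inside chrome elements. The sketch below uses only the Python standard library; the set of tags treated as chrome is an assumption, and real pages need the heuristic fallbacks described above.

```python
from html.parser import HTMLParser

# Elements treated as chrome in this sketch; the set is an assumption.
CHROME_TAGS = {"nav", "aside", "footer", "header", "script", "style"}


class ContentText(HTMLParser):
    """Collects text that is not nested inside any chrome element."""

    def __init__(self):
        super().__init__()
        self.chrome_depth = 0  # greater than 0 while inside a chrome element
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in CHROME_TAGS:
            self.chrome_depth += 1

    def handle_endtag(self, tag):
        if tag in CHROME_TAGS and self.chrome_depth > 0:
            self.chrome_depth -= 1

    def handle_data(self, data):
        if self.chrome_depth == 0 and data.strip():
            self.parts.append(data.strip())


def strip_chrome(html: str) -> str:
    """Return the page text with chrome-element text removed."""
    parser = ContentText()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)
```

Note that this also drops a header element nested inside the article, so a byline placed there would be lost; a production extractor would treat that case separately.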

Can I extract content from sites I do not own?

Legally, this depends on jurisdiction, terms of service, copyright, and use case. In the US, courts have generally held that scraping publicly accessible content does not violate the CFAA, most notably in the Ninth Circuit's 2022 ruling in hiQ Labs v. LinkedIn. However, copying and republishing extracted content without permission can be copyright infringement. Reading content for personal use, research, or analysis often qualifies as fair use; reposting it on another site usually does not. Many sites prohibit scraping in their terms of service even when it's legally allowed; violating ToS can lead to civil lawsuits or IP bans. Some content types are explicitly protected (databases under EU sui generis rights, copyrighted articles). Best practices: respect robots.txt directives even when not legally required, identify your scraper with a User-Agent that links to a contact page, rate-limit to avoid burdening the source server, and consider reaching out for an API or partnership if you need bulk extraction. The Grigora extractor is designed for personal research and is rate-limited to discourage scraping at scale; for commercial bulk extraction, use a dedicated paid service.
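The best practices above (honor robots.txt, send an identifying User-Agent, rate-limit) can be sketched with the Python standard library. The bot name, contact URL, and RateLimiter class are hypothetical; the robots.txt text is assumed to have been downloaded already.

```python
import time
from urllib.robotparser import RobotFileParser

# Hypothetical scraper identity; point the URL at a real contact page.
USER_AGENT = "example-research-bot/0.1 (+https://example.com/bot-contact)"


def build_robots(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


class RateLimiter:
    """Enforces a minimum gap between consecutive requests to one host."""

    def __init__(self, min_interval_s: float):
        self.min_interval_s = min_interval_s
        self._last = None  # monotonic time of the previous request

    def wait(self) -> float:
        """Sleep if the previous request was too recent; return the time slept."""
        delay = 0.0
        if self._last is not None:
            elapsed = time.monotonic() - self._last
            delay = max(0.0, self.min_interval_s - elapsed)
        if delay:
            time.sleep(delay)
        self._last = time.monotonic()
        return delay
```

Before each fetch, call can_fetch(USER_AGENT, url) on the parsed rules and skip the URL if it returns False, then call wait() on the limiter for that host.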

How do I handle paywalled content?

Paywalled content presents a hard problem for extraction. Three common patterns: hard paywalls (no content visible without subscription, like the Wall Street Journal), soft paywalls (a teaser is visible, full text behind login, like the New York Times), and metered paywalls (a few free articles per month, then blocked, like Medium). Extraction tools that respect ToS can only access what is publicly visible; if a hard paywall blocks the article, the extractor returns the teaser only. Some sites whitelist Googlebot and other search engines to allow indexing while blocking general scrapers; spoofing Googlebot is generally a ToS violation and may be illegal under the CFAA. Legitimate workarounds: use the Wayback Machine archive (which preserves snapshots from when articles were free), use the Outline service (now defunct but inspired similar projects), or contact the publisher for API access. For your own paywalled content where you want to feed it to an LLM, use authenticated scraping with cookies. The Grigora extractor returns a "paywall detected" status and the visible teaser when the full text is not publicly accessible.

What metadata can I extract beyond the article body?

A good content extractor returns more than just text. The most useful metadata fields are: title (from the <title> tag, og:title, or H1), author (from byline patterns or schema.org Person), publish date (from datePublished schema, time elements, or URL patterns), lead image (from og:image or first content image), description (meta description or first paragraph), language (from html lang attribute or content detection), word count, reading time estimate (typically word count / 200), tags or categories (from breadcrumbs, schema.org keywords, or footer tags), and canonical URL (from rel=canonical or og:url). Some extractors also parse table of contents, embedded videos, audio players, code snippets, and image captions. The Grigora extractor returns a JSON object with all standard fields plus the original HTML fragment for downstream processing. For LLM use cases, the cleanest output is Markdown with a YAML frontmatter block containing the metadata, which both human readers and most LLM context windows handle well.
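A few of these fields can be pulled with the standard library alone. The sketch below covers title, description, lead image, canonical URL, language, word count, and reading time; the class name, the output keys, and the fallback order (title tag first, then og:title, then H1) are assumptions rather than a fixed convention.

```python
import math
from html.parser import HTMLParser


class MetaParser(HTMLParser):
    """Collects <title>, the first <h1>, <meta> tags, <html lang>, and rel=canonical."""

    def __init__(self):
        super().__init__()
        self.meta = {}
        self.lang = None
        self.canonical = None
        self.text = {"title": "", "h1": ""}
        self._capture = None   # text field currently being read, if any
        self._seen_h1 = False

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "html":
            self.lang = a.get("lang")
        elif tag == "meta":
            key = a.get("property") or a.get("name")
            if key and a.get("content"):
                self.meta[key.lower()] = a["content"]
        elif tag == "link" and (a.get("rel") or "").lower() == "canonical":
            self.canonical = a.get("href")
        elif tag == "title" or (tag == "h1" and not self._seen_h1):
            self._capture = tag

    def handle_endtag(self, tag):
        if tag == self._capture:
            if tag == "h1":
                self._seen_h1 = True
            self._capture = None

    def handle_data(self, data):
        if self._capture:
            self.text[self._capture] += data


def extract_metadata(html: str, body_text: str) -> dict:
    """Return a metadata dict; body_text is the already-extracted article text."""
    p = MetaParser()
    p.feed(html)
    p.close()
    words = len(body_text.split())
    title = p.text["title"].strip() or p.meta.get("og:title") or p.text["h1"].strip()
    return {
        "title": title or None,
        "description": p.meta.get("description"),
        "lead_image": p.meta.get("og:image"),
        "canonical_url": p.canonical or p.meta.get("og:url"),
        "language": p.lang,
        "word_count": words,
        "reading_time_min": math.ceil(words / 200),  # word count / 200, rounded up
    }
```

Author and publish date are left out because they usually require parsing schema.org JSON-LD or byline patterns, which varies too much by site for a short example.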

How do I extract content from archive.org snapshots?

The Internet Archive's Wayback Machine at web.archive.org preserves snapshots of webpages, often allowing you to access content that's since been removed, paywalled, or moved. The URL pattern is web.archive.org/web/YYYYMMDDhhmmss/originalurl. To extract content, you can either point your scraper at the Wayback URL directly or use the Availability API at archive.org/wayback/available?url=originalurl, which returns the closest available snapshot. The Wayback Machine adds an inline toolbar to each snapshot, so you may need to strip the wb_div, wb-iframe, and similar elements before content extraction. Snapshots can have broken images and CSS because the Wayback Machine archives only the HTML and a subset of assets. For research purposes, archive.org is invaluable for tracking how a page changed over time or recovering deleted content. The Grigora extractor accepts archive.org URLs natively and strips the Wayback toolbar before processing. Note that the Wayback Machine has historically applied robots.txt retroactively, meaning some old snapshots may be blocked if the source domain later disallowed crawling.
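The two URL patterns above, plus parsing of the archive.org/wayback/available response, can be sketched as follows. The function names are hypothetical, and the HTTP request itself is left out so the example stays offline; the response shape shown in the usage note follows the endpoint's documented JSON.

```python
import json
from typing import Optional
from urllib.parse import urlencode

AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"


def availability_url(original_url: str, timestamp: Optional[str] = None) -> str:
    """Build the availability query; timestamp is an optional YYYYMMDDhhmmss prefix."""
    params = {"url": original_url}
    if timestamp:
        params["timestamp"] = timestamp
    return AVAILABILITY_ENDPOINT + "?" + urlencode(params)


def snapshot_url(original_url: str, timestamp: str) -> str:
    """Follow the web.archive.org/web/YYYYMMDDhhmmss/originalurl pattern."""
    return f"https://web.archive.org/web/{timestamp}/{original_url}"


def closest_snapshot(response_body: str) -> Optional[str]:
    """Pull the closest snapshot URL out of an availability response, if any."""
    closest = json.loads(response_body).get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest.get("url")
    return None
```

A response with no archived copy comes back with an empty archived_snapshots object, in which case closest_snapshot returns None and the caller can fall back to the live URL.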

How accurate is Diffbot vs open-source extractors?

Diffbot's commercial Article API is widely treated as the gold standard, claiming around 95% accuracy on a benchmark of 1 million news and blog pages. The accuracy gap with open-source tools (Readability.js, Goose3, Newspaper3k) is roughly 5-15 percentage points: open-source typically lands at 80-90% on the same benchmark. The trade-off is cost: Diffbot starts at $299/month for 25,000 calls, while open-source is free. For low-volume use cases (research, occasional content audits, building one-off datasets), open-source extractors are sufficient. For production use cases at scale (LLM training, news aggregators, content recommendation engines), Diffbot's higher accuracy reduces downstream cleanup costs and the cost-benefit usually favors paid. Middle-ground options include Mercury Web Parser (an open-source alternative to Diffbot, accuracy roughly 85%) and Trafilatura (a Python library for web text extraction, accuracy around 88%). The Grigora extractor uses an optimized Readability.js variant tuned for marketing and SaaS content, achieving roughly 92% accuracy on that domain while remaining free.

How do I extract content for an LLM context window?

Feeding a webpage to an LLM requires preprocessing: strip HTML tags, remove redundant whitespace, normalize Unicode, and ideally convert to Markdown so headings and lists are preserved. Most LLM applications want context under 8,000 tokens; a typical blog post is 1,500-3,000 tokens, so a single article fits comfortably. For longer pages or multi-page extraction, chunk by section headers and embed each chunk separately for retrieval. Avoid truncating at arbitrary character boundaries because it can split sentences or code blocks. Preserve the metadata as a header (title, author, date, source URL) so the LLM can attribute claims correctly. Strip dynamic elements that could leak into the output: cookie banners, "subscribe to our newsletter" prompts, related-posts widgets. The Grigora extractor outputs Markdown by default, ready to feed into Claude, GPT-5, or any LLM. We also expose a JSON output mode for programmatic pipelines and a "raw HTML" mode for custom downstream processing. For RAG pipelines, our companion Grigora text-chunker tool splits content at semantic boundaries with configurable overlap.
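Two of the steps above, keeping the metadata as a header and chunking by section headings, can be sketched in a few lines. The function names are hypothetical, and this is not the Grigora text-chunker; it also assumes headings use the "#" Markdown style.

```python
import json
import re
from typing import Dict, List


def with_frontmatter(metadata: Dict[str, str], markdown_body: str) -> str:
    """Prefix the Markdown body with a YAML frontmatter block."""
    # json.dumps yields double-quoted strings, which are also valid YAML scalars.
    lines = [f"{key}: {json.dumps(value)}" for key, value in metadata.items()]
    return "---\n" + "\n".join(lines) + "\n---\n\n" + markdown_body.strip() + "\n"


def chunk_by_headings(markdown_body: str) -> List[str]:
    """Split before each Markdown heading so no chunk cuts a section in half."""
    parts = re.split(r"(?m)^(?=#{1,6} )", markdown_body)
    return [part.strip() for part in parts if part.strip()]
```

One caveat: a line starting with "# " inside a fenced code block (for example a shell comment) would also trigger a split, so content with code samples needs a fence-aware splitter.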

Why does the extractor sometimes return empty content?

Five common causes. First, the page is JavaScript-rendered and our static fetch did not see content (we fall back to headless rendering automatically, but it can time out on slow sites). Second, the page returns 4xx or 5xx; we display the status code so you know. Third, the site blocks our User-Agent (some sites whitelist specific bots). Fourth, the content is behind authentication, paywall, or geographic restriction. Fifth, the Readability heuristic could not identify a clear content block; this happens on app-style pages, ecommerce category listings, and dashboards. When extraction is empty or low-confidence, we return a status field explaining why. Workarounds: fetch the URL via the Wayback Machine (often archived without paywalls), try a different User-Agent, or manually paste the content if you have access. The Grigora extractor surfaces the raw HTML when extraction fails so you can debug or extract manually. For systematically problematic domains, the answer is usually domain-specific extraction logic or a paid service like Diffbot.

Can I extract content from PDFs and other non-HTML formats?

Yes, but PDF extraction is a different beast from HTML. PDFs lack semantic structure; they are essentially layouts of positioned text and images. Extracting body text requires libraries like pdf-parse (Node), pdfplumber (Python), or PyMuPDF that decode the PDF stream and reconstruct reading order. Tables, figure captions, and footnotes are particularly hard because the visual order does not always match the logical order. For scanned PDFs or images, you also need OCR via Tesseract or cloud services (AWS Textract, Google Cloud Vision). Other formats like DOCX, EPUB, and HTML email require specific libraries. The Grigora content extractor handles HTML pages natively; for PDFs, we recommend our companion PDF-to-text converter or a dedicated tool like Adobe PDF Services. Quality of PDF extraction varies wildly by source: well-tagged PDFs (academic papers, government forms) extract cleanly, while marketing brochures with complex layouts are often unreliable. Where possible, find the original HTML or DOCX source rather than scraping the PDF.

Free Page Content Extractor

Extract clean body text from any URL. Strips nav, ads, scripts. Word count, readability ready. Free, instant.

4.6 on G2

4.8 on Trustpilot

Used by 25,000+ marketers

What this tool does

Page Content Extractor pulls clean body text from any URL. It strips navigation, ads, and scripts, and returns output that is ready for word count and readability analysis.

Designed to fit into your existing SEO and content workflow with no setup overhead.

How to use it

Five steps.

Paste your URL

Enter any public URL; we fetch the page and detect static vs JavaScript rendering automatically.

Get cleaned content

Receive Markdown body text plus title, author, date, and lead image in under 2 seconds.

Adjust extraction options

Override the CSS selector, increase the wait timeout for slow SPAs, or change output format.

Copy or download

Copy the result to clipboard or download as Markdown, JSON, or plain text.

Feed downstream

Paste into Claude/GPT context, archive offline, or pipe into your data workflow.

When teams use it

Six common workflows.

AI engineers feeding context to LLMs

Pull clean Markdown from any URL to use as RAG context, training data, or one-shot prompts without manual HTML cleanup.

Content researchers building competitive datasets

Extract the body text from hundreds of competitor articles to analyze topics, structure, and word counts at scale.

SEO consultants auditing content quality

Strip chrome and analyze raw word count, heading hierarchy, and readability metrics across client pages.

Editors archiving published work

Save clean copies of articles for offline reading, portfolio building, or content backups.

Journalists tracking content changes

Compare current content with archive.org snapshots to detect stealth edits, factual updates, or removed sections.

Product teams scraping documentation

Extract docs from competitor or partner sites to build internal knowledge bases or chatbots.

Platform guides

Integrate with major platforms.

Readability.js (open-source)

Install via npm: @mozilla/readability.
Create a JSDOM instance from the page HTML, passing the page URL as the url option.
Call new Readability(document).parse() to get cleaned content.
Inspect the .textContent and .content properties.

Diffbot Article API

Sign up at diffbot.com (paid, $299/month minimum).
Make a GET request to api.diffbot.com/v3/article?token=&url=.
Parse the JSON response for objects[0].text and metadata.
Use Diffbot's Knowledge Graph for entity extraction.

Trafilatura (Python)

Install via pip: trafilatura.
Call trafilatura.extract(html) to get cleaned text.
Pass include_comments=False, include_tables=True for tuning.
Output formats: txt, markdown, xml, json.

Apify Web Scraper

Sign up at apify.com.
Use the Web Scraper actor or write a custom Puppeteer script.
Define a pageFunction that extracts content with Readability.
Schedule runs and export results as JSON or CSV.

Wayback Machine API

Query archive.org/wayback/available?url=YOURURL to find a snapshot.
Construct the snapshot URL: web.archive.org/web/TIMESTAMP/YOURURL.
Fetch the snapshot HTML.
Strip Wayback toolbar (wb_div, wb-iframe) before extraction.

Grigora vs. alternatives

Side-by-side.

Capability	Grigora	Tool A	Tool B	Free	Manual
Single-URL extraction	Yes	Yes	Paid only	Yes	Manual
JavaScript-rendered SPA support	Yes	No	Yes	Limited	Headless setup
Markdown output	Yes	Yes	JSON only	Yes	Manual
Metadata extraction (author, date)	Yes	Limited	Yes	Yes	Manual
Wayback Machine integration	Yes	No	No	No	Manual
Free without signup	Yes	Yes	$299/mo	Free tier	N/A
Paywall detection signaling	Yes	No	Yes	No	Manual
Custom CSS selector override	Yes	No	Yes	No	Manual

Common errors and fixes

Eight issues users hit.

Empty content returned for SPA page

Cause: The page renders client-side and our static fetch returned the empty shell.

Fix: Switch to headless rendering mode in the tool options or wait 2-3 seconds for our automatic fallback to engage.

403 Forbidden from the source site

Cause: The site blocks our default User-Agent or rate-limited the request.

Fix: Try a different User-Agent string (set via advanced options) or fetch via the Wayback Machine archive URL.

Wrong content block identified

Cause: Readability picked a sidebar or comment section instead of the main article.

Fix: Use the manual selector option to specify a CSS selector targeting the article body, e.g., article.post-content.

Truncated output mid-article

Cause: A large nested element exceeded the parser depth limit.

Fix: Increase the parser depth in advanced options or fall back to raw HTML extraction and post-process manually.

Encoding issues with non-Latin characters

Cause: Source page declared wrong charset or used legacy encoding.

Fix: Specify the encoding manually (UTF-8, ISO-8859-1, etc.) in the advanced options or trust auto-detection.

JavaScript-injected content missing

Cause: Headless render timeout fired before the SPA finished hydrating.

Fix: Increase the wait timeout to 5-10 seconds for slow apps, or use a wait-for-selector option.

Paywall content not bypassed

Cause: The site uses authentication or a hard paywall.

Fix: Try the Wayback Machine snapshot URL or accept the public teaser; bypassing paywalls is a ToS issue we will not engineer around.

Duplicate paragraphs in extraction

Cause: The page renders the article body in multiple containers (e.g., AMP and regular).

Fix: Use the deduplicate-paragraphs option or specify a single CSS selector to disambiguate.

Original data

2026 study.

63%

Web pages with proper HTML5 semantic markup in our 2026 audit

92%

Content extraction accuracy of our Readability variant on SaaS blogs

1.4s

Average time to extract clean Markdown from a 2,000-word article

38%

SPA pages where headless rendering is required for accurate extraction

Frequently asked questions

Twelve answers.

Try Page Content Extractor now

Free, unlimited, no signup.