Free Page Content Extractor

Extract clean body text from any URL. Strips navigation, ads, and scripts, and returns text ready for word-count and readability analysis. Free and instant.

4.6 on G2
4.8 on Trustpilot
Used by 25,000+ marketers

What this tool does

Page Content Extractor pulls the clean body text out of any URL. It strips navigation, ads, and scripts, so the output is ready for word counts and readability analysis.

Designed to fit into your existing SEO and content workflow with no setup overhead.

How to use it

Five steps.

1

Paste your URL

Enter any public URL; we fetch the page and detect static vs JavaScript rendering automatically.

2

Get cleaned content

Receive Markdown body text plus title, author, date, and lead image in under 2 seconds.

3

Adjust extraction options

Override the CSS selector, increase the wait timeout for slow SPAs, or change output format.

4

Copy or download

Copy the result to clipboard or download as Markdown, JSON, or plain text.

5

Feed downstream

Paste into Claude/GPT context, archive offline, or pipe into your data workflow.

When teams use it

Six common workflows.

AI engineers feeding context to LLMs

Pull clean Markdown from any URL to use as RAG context, training data, or one-shot prompts without manual HTML cleanup.

Content researchers building competitive datasets

Extract the body text from hundreds of competitor articles to analyze topics, structure, and word counts at scale.

SEO consultants auditing content quality

Strip chrome and analyze raw word count, heading hierarchy, and readability metrics across client pages.
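As a rough illustration of this audit step, the sketch below computes a word count, a heading outline, and an average sentence length from extracted Markdown. The function name content_metrics and the choice of metrics are assumptions made for illustration, not the tool's own implementation, and average sentence length is only a crude stand-in for a full readability score.

```python
import re


def content_metrics(markdown: str) -> dict:
    """Return word count, heading outline, and average sentence length."""
    headings = []    # (level, text) pairs in document order
    body_lines = []  # every line that is not a Markdown heading
    for line in markdown.splitlines():
        match = re.match(r"^(#{1,6})\s+(.*)$", line)
        if match:
            headings.append((len(match.group(1)), match.group(2).strip()))
        else:
            body_lines.append(line)

    body = " ".join(body_lines)
    words = re.findall(r"[A-Za-z0-9']+", body)
    # Naive sentence split on terminal punctuation; good enough for a rough metric.
    sentences = [s for s in re.split(r"[.!?]+", body) if s.strip()]
    avg_sentence_length = len(words) / len(sentences) if sentences else 0.0

    return {
        "word_count": len(words),
        "headings": headings,
        "avg_sentence_length": avg_sentence_length,
    }
```

A skipped heading level (for example, level 1 followed directly by level 3) shows up as a gap in the returned headings list.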

Editors archiving published work

Save clean copies of articles for offline reading, portfolio building, or content backups.

Journalists tracking content changes

Compare current content with archive.org snapshots to detect stealth edits, factual updates, or removed sections.
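One way to run this comparison, assuming both versions have already been extracted to plain text, is a line-level diff with Python's standard difflib module. The function name diff_versions is hypothetical.

```python
import difflib


def diff_versions(archived: str, current: str) -> list[str]:
    """Return a unified diff between an archived and a current extraction."""
    return list(
        difflib.unified_diff(
            archived.splitlines(),
            current.splitlines(),
            fromfile="archived",
            tofile="current",
            lineterm="",  # inputs carry no trailing newlines, so add none
        )
    )
```

Lines prefixed with "-" exist only in the archived copy and lines prefixed with "+" only in the current one; an empty list means the two extractions are identical.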

Product teams scraping documentation

Extract docs from competitor or partner sites to build internal knowledge bases or chatbots.

Platform guides

Integrate with major platforms.

Readability.js (open-source)

  1. Install via npm: @mozilla/readability.
  2. Create a JSDOM instance from the page HTML, passing the page URL as its url option so relative links resolve.
  3. Call new Readability(document).parse() to get cleaned content.
  4. Inspect the .textContent and .content properties.

Diffbot Article API

  1. Sign up at diffbot.com (paid, $299/month minimum).
  2. Make a GET request to api.diffbot.com/v3/article?token=&url=.
  3. Parse the JSON response for objects[0].text and metadata.
  4. Use Diffbot's Knowledge Graph for entity extraction.
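The request and response handling in steps 2 and 3 can be sketched with the standard library alone. The https scheme, the helper names, and the sample response shape beyond objects[0].text are assumptions; the token value is a placeholder you supply yourself.

```python
import json
from typing import Optional
from urllib.parse import urlencode

# Endpoint path taken from the steps above; the https scheme is assumed.
DIFFBOT_ENDPOINT = "https://api.diffbot.com/v3/article"


def build_article_request(token: str, page_url: str) -> str:
    """Build the GET URL, percent-encoding the target page URL."""
    return f"{DIFFBOT_ENDPOINT}?{urlencode({'token': token, 'url': page_url})}"


def first_article_text(response_body: str) -> Optional[str]:
    """Return objects[0].text from a JSON response, or None if absent."""
    data = json.loads(response_body)
    objects = data.get("objects") or []
    return objects[0].get("text") if objects else None
```

Encoding the page URL matters because an unescaped "?" or "&" in it would otherwise be read as part of the API query string.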

Trafilatura (Python)

  1. Install via pip: trafilatura.
  2. Call trafilatura.extract(html) to get cleaned text.
  3. Pass include_comments=False, include_tables=True for tuning.
  4. Output formats: txt, markdown, xml, json.
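A minimal sketch of the steps above, assuming trafilatura is installed and recent enough to accept output_format="markdown". The names EXTRACT_OPTIONS and extract_markdown are hypothetical.

```python
# Options named in the steps above; "markdown" assumes a recent trafilatura release.
EXTRACT_OPTIONS = {
    "include_comments": False,  # drop reader comment sections
    "include_tables": True,     # keep tabular content
    "output_format": "markdown",
}


def extract_markdown(html: str):
    """Return the cleaned article body, or None if nothing was extracted."""
    import trafilatura  # third-party: pip install trafilatura

    return trafilatura.extract(html, **EXTRACT_OPTIONS)
```

To go straight from a URL, trafilatura.fetch_url(url) can supply the HTML that is passed to extract_markdown.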

Apify Web Scraper

  1. Sign up at apify.com.
  2. Use the Web Scraper actor or write a custom Puppeteer script.
  3. Define a pageFunction that extracts content with Readability.
  4. Schedule runs and export results as JSON or CSV.

Wayback Machine API

  1. Query archive.org/wayback/available?url=YOURURL to find a snapshot.
  2. Construct the snapshot URL: web.archive.org/web/TIMESTAMP/YOURURL.
  3. Fetch the snapshot HTML.
  4. Strip Wayback toolbar (wb_div, wb-iframe) before extraction.
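Steps 1 and 2 can be sketched as below. The helper names are hypothetical, and the response shape (archived_snapshots.closest with available and timestamp fields) reflects the availability API's documented JSON as best understood here, so verify it against a live response before relying on it.

```python
import json
from typing import Optional
from urllib.parse import urlencode


def availability_url(page_url: str) -> str:
    """Build the availability query for a page URL."""
    return "https://archive.org/wayback/available?" + urlencode({"url": page_url})


def closest_timestamp(response_body: str) -> Optional[str]:
    """Return the closest snapshot timestamp, or None if no snapshot exists."""
    data = json.loads(response_body)
    closest = data.get("archived_snapshots", {}).get("closest")
    if not closest or not closest.get("available"):
        return None
    return closest.get("timestamp")


def snapshot_url(timestamp: str, page_url: str) -> str:
    """Construct web.archive.org/web/TIMESTAMP/YOURURL."""
    return f"https://web.archive.org/web/{timestamp}/{page_url}"
```

When closest_timestamp returns None, there is no archived copy to fall back on and the live page is the only source.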

Grigora vs. alternatives

Side-by-side.

Capability                         | Grigora | Tool A  | Tool B    | Free      | Manual
Single-URL extraction              | Yes     | Yes     | Paid only | Yes       | Manual
JavaScript-rendered SPA support    | Yes     | No      | Yes       | Limited   | Headless setup
Markdown output                    | Yes     | Yes     | JSON only | Yes       | Manual
Metadata extraction (author, date) | Yes     | Limited | Yes       | Yes       | Manual
Wayback Machine integration        | Yes     | No      | No        | No        | Manual
Free without signup                | Yes     | Yes     | $299/mo   | Free tier | N/A
Paywall detection signaling        | Yes     | No      | Yes       | No        | Manual
Custom CSS selector override       | Yes     | No      | Yes       | No        | Manual

Common errors and fixes

Eight issues users hit.

Empty content returned for SPA page

Cause: The page renders client-side and our static fetch returned the empty shell.

Fix: Switch to headless rendering mode in the tool options or wait 2-3 seconds for our automatic fallback to engage.
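To check in your own pipeline whether a static fetch returned only an empty shell, one simple heuristic is to count the visible words left after script, style, and similar elements are ignored. The class and function names and the 50-word threshold are assumptions for illustration; this is not necessarily how the tool's automatic fallback decides.

```python
from html.parser import HTMLParser


class _VisibleText(HTMLParser):
    """Collect text that sits outside non-visible elements."""

    SKIP = {"script", "style", "noscript", "template", "title"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def looks_like_empty_shell(html: str, min_words: int = 50) -> bool:
    """Return True when the static HTML carries too little visible text."""
    parser = _VisibleText()
    parser.feed(html)
    parser.close()
    return len(" ".join(parser.chunks).split()) < min_words
```

A True result suggests the page needs headless rendering before extraction; the threshold should be tuned to the kind of pages you process.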

403 Forbidden from the source site

Cause: The site blocks our default User-Agent or rate-limited the request.

Fix: Try a different User-Agent string (set via advanced options) or fetch via the Wayback Machine archive URL.
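If you are reproducing this fix in your own fetcher rather than in the tool, a sketch using only the standard library might look like the following. The function names and the retry-on-403 policy are assumptions.

```python
import urllib.error
import urllib.request


def build_request(url: str, user_agent: str) -> urllib.request.Request:
    """Create a request that sends the given User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch_with_fallback(url: str, user_agents: list[str], timeout: float = 10.0) -> bytes:
    """Try each User-Agent in turn, moving on only when the site answers 403."""
    last_error = None
    for user_agent in user_agents:
        try:
            with urllib.request.urlopen(build_request(url, user_agent), timeout=timeout) as resp:
                return resp.read()
        except urllib.error.HTTPError as err:
            if err.code != 403:
                raise  # other failures are not User-Agent problems
            last_error = err
    if last_error is not None:
        raise last_error
    raise ValueError("no User-Agent strings supplied")
```

If every User-Agent is refused, the last 403 is re-raised, at which point fetching the Wayback Machine snapshot is the remaining option.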

Wrong content block identified

Cause: Readability picked a sidebar or comment section instead of the main article.

Fix: Use the manual selector option to specify a CSS selector targeting the article body, e.g., article.post-content.

Truncated output mid-article

Cause: A large nested element exceeded the parser depth limit.

Fix: Increase the parser depth in advanced options or fall back to raw HTML extraction and post-process manually.

Encoding issues with non-Latin characters

Cause: Source page declared wrong charset or used legacy encoding.

Fix: Specify the encoding manually (UTF-8, ISO-8859-1, etc.) in the advanced options instead of relying on auto-detection.
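When post-processing raw bytes yourself, a common pattern is to try the declared charset first and then fall back through a short list of likely encodings. The function name decode_html and the fallback order are assumptions.

```python
from typing import Optional


def decode_html(
    raw: bytes,
    declared: Optional[str] = None,
    fallbacks: tuple = ("utf-8", "iso-8859-1"),
) -> tuple:
    """Return (text, encoding_used), trying the declared charset first."""
    candidates = ([declared] if declared else []) + list(fallbacks)
    for encoding in candidates:
        try:
            return raw.decode(encoding), encoding
        except (UnicodeDecodeError, LookupError):
            # Wrong charset for these bytes, or an unknown codec name.
            continue
    # Reached only if the caller supplied fallbacks that all failed.
    return raw.decode("utf-8", errors="replace"), "utf-8"
```

Because ISO-8859-1 maps every byte to a character, it never raises, so it belongs last in the list; placing it earlier would mask a correct later candidate.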

JavaScript-injected content missing

Cause: Headless render timeout fired before the SPA finished hydrating.

Fix: Increase the wait timeout to 5-10 seconds for slow apps, or use a wait-for-selector option.

Paywall content not bypassed

Cause: The site uses authentication or a hard paywall.

Fix: Try the Wayback Machine snapshot URL or accept the public teaser; bypassing paywalls is a ToS issue we will not engineer around.

Duplicate paragraphs in extraction

Cause: The page renders the article body in multiple containers (e.g., AMP and regular).

Fix: Use the deduplicate-paragraphs option or specify a single CSS selector to disambiguate.
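The idea behind paragraph deduplication can be sketched as follows: normalize each paragraph's whitespace and case, then keep only the first occurrence. This is an illustrative sketch with a hypothetical function name, not the tool's actual option.

```python
import re


def deduplicate_paragraphs(text: str) -> str:
    """Drop repeated paragraphs, keeping the first occurrence in order."""
    seen = set()
    kept = []
    for paragraph in re.split(r"\n\s*\n", text):
        # Compare on collapsed whitespace and lowercase so near-identical
        # copies from different containers count as duplicates.
        key = re.sub(r"\s+", " ", paragraph).strip().lower()
        if not key or key in seen:
            continue
        seen.add(key)
        kept.append(paragraph.strip())
    return "\n\n".join(kept)
```

Exact-match deduplication like this will also remove legitimately repeated paragraphs, such as a refrain or a repeated pull quote, so review the result on pages where repetition is intentional.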

Original data

2026 study.

63%: Web pages with proper HTML5 semantic markup in our 2026 audit
92%: Content extraction accuracy of our Readability variant on SaaS blogs
1.4s: Average time to extract clean Markdown from a 2,000-word article
38%: SPA pages where headless rendering is required for accurate extraction

Try Page Content Extractor now

Free, unlimited, no signup.