Free Page Content Extractor
Extract clean body text from any URL. Strips nav, ads, scripts. Word count, readability ready. Free, instant.
What this tool does
Page Content Extractor pulls the clean body text from any URL, stripping navigation, ads, and scripts, and returns it with a word count so it is ready for readability analysis.
Designed to fit into your existing SEO and content workflow with no setup overhead.
How to use it
Five steps.
Paste your URL
Enter any public URL; we fetch the page and detect static vs JavaScript rendering automatically.
Get cleaned content
Receive Markdown body text plus title, author, date, and lead image in under 2 seconds.
Adjust extraction options
Override the CSS selector, increase the wait timeout for slow SPAs, or change output format.
Copy or download
Copy the result to clipboard or download as Markdown, JSON, or plain text.
Feed downstream
Paste into Claude/GPT context, archive offline, or pipe into your data workflow.
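The first step mentions detecting static versus JavaScript rendering. How the tool does this is not documented here; the sketch below shows one plausible heuristic, assuming that a page with script tags but very little visible text in its initial HTML is probably rendered client-side. The function name looks_client_rendered and the 200-character threshold are illustrative assumptions, not the tool's actual logic.

```python
from html.parser import HTMLParser


class _TextCounter(HTMLParser):
    """Counts visible text characters and <script> tags in an HTML document."""

    # Text inside these elements is never shown to the reader.
    SKIP = {"script", "style", "noscript", "template"}

    def __init__(self):
        super().__init__()
        self.visible_chars = 0
        self.script_tags = 0
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.script_tags += 1
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.visible_chars += len(data.strip())


def looks_client_rendered(html: str, min_chars: int = 200) -> bool:
    """Hypothetical heuristic: scripts present but almost no visible text."""
    counter = _TextCounter()
    counter.feed(html)
    counter.close()
    return counter.script_tags > 0 and counter.visible_chars < min_chars
```

A caller could run this on the statically fetched HTML and only pay for a headless render when it returns True.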
When teams use it
Six common workflows.
AI engineers feeding context to LLMs
Pull clean Markdown from any URL to use as RAG context, training data, or one-shot prompts without manual HTML cleanup.
Content researchers building competitive datasets
Extract the body text from hundreds of competitor articles to analyze topics, structure, and word counts at scale.
SEO consultants auditing content quality
Strip chrome and analyze raw word count, heading hierarchy, and readability metrics across client pages.
Editors archiving published work
Save clean copies of articles for offline reading, portfolio building, or content backups.
Journalists tracking content changes
Compare current content with archive.org snapshots to detect stealth edits, factual updates, or removed sections.
Product teams scraping documentation
Extract docs from competitor or partner sites to build internal knowledge bases or chatbots.
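For the content-audit workflow above, word count and heading hierarchy can be computed directly from the extracted Markdown. The following is a minimal sketch, assuming ATX-style headings (lines starting with #) and a simple definition of a word; audit_markdown is a hypothetical helper, and it does not try to skip fenced code blocks.

```python
import re

# ATX heading: one to six '#' characters, whitespace, then the heading text.
HEADING_RE = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")
# A word: letters/digits, optionally joined by an apostrophe or hyphen.
WORD_RE = re.compile(r"[A-Za-z0-9]+(?:['’-][A-Za-z0-9]+)*")


def audit_markdown(markdown: str) -> dict:
    """Return the body word count, heading outline, and skipped heading levels."""
    headings = []
    body_lines = []
    for line in markdown.splitlines():
        match = HEADING_RE.match(line)
        if match:
            headings.append((len(match.group(1)), match.group(2)))
        else:
            body_lines.append(line)

    words = WORD_RE.findall(" ".join(body_lines))

    # Flag headings that jump more than one level deeper than the previous one
    # (for example an h3 directly after an h1).
    skipped = [
        text
        for (prev_level, _), (level, text) in zip(headings, headings[1:])
        if level > prev_level + 1
    ]
    return {"word_count": len(words), "headings": headings, "skipped_levels": skipped}
```

Headings are excluded from the word count here; whether that matches a given audit convention is a judgment call.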
Platform guides
Integrate with major platforms.
Readability.js (open-source)
- Install via npm: @mozilla/readability.
- Create a JSDOM instance from the page HTML, passing the page URL as the url option so relative links resolve.
- Call new Readability(document).parse() to get cleaned content.
- Inspect the .textContent and .content properties.
Diffbot Article API
- Sign up at diffbot.com (paid, $299/month minimum).
- Make a GET request to api.diffbot.com/v3/article?token=&url=.
- Parse the JSON response for objects[0].text and metadata.
- Use Diffbot's Knowledge Graph for entity extraction.
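The Diffbot steps above can be sketched with the Python standard library. This assumes the response shape described in the steps (an objects array whose first item carries a text field); the helper names and the placeholder token are illustrative, and error handling for rate limits or API errors is left out.

```python
import json
from typing import Optional
from urllib.parse import urlencode
from urllib.request import urlopen

DIFFBOT_ARTICLE_ENDPOINT = "https://api.diffbot.com/v3/article"


def build_article_request(token: str, page_url: str) -> str:
    """Build the GET URL, percent-encoding the target page URL."""
    query = urlencode({"token": token, "url": page_url})
    return f"{DIFFBOT_ARTICLE_ENDPOINT}?{query}"


def first_article_text(payload: dict) -> Optional[str]:
    """Return objects[0].text from a parsed response, or None if absent."""
    objects = payload.get("objects") or []
    return objects[0].get("text") if objects else None


def fetch_article_text(token: str, page_url: str, timeout: float = 30.0) -> Optional[str]:
    """Call the API and return the article text (performs a network request)."""
    with urlopen(build_article_request(token, page_url), timeout=timeout) as response:
        return first_article_text(json.load(response))
```

Encoding the url parameter matters: a target URL with its own query string would otherwise be split across Diffbot's parameters.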
Trafilatura (Python)
- Install via pip: trafilatura.
- Call trafilatura.extract(html) to get cleaned text.
- Pass include_comments=False, include_tables=True for tuning.
- Output formats: txt, markdown, xml, json.
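A minimal sketch of the Trafilatura steps above, using the tuning flags and Markdown output the list mentions. It assumes a Trafilatura release recent enough to accept output_format="markdown"; older releases may only support the other formats. EXTRACT_OPTIONS and extract_markdown are names chosen for this example.

```python
from typing import Optional

# Options mirrored from the steps above; adjust per site.
EXTRACT_OPTIONS = {
    "include_comments": False,   # drop reader comment sections
    "include_tables": True,      # keep tabular content
    "output_format": "markdown",
}


def extract_markdown(html: str) -> Optional[str]:
    """Return the main content of an HTML string as Markdown, or None if nothing is found."""
    # Imported lazily so this module loads even where the package is absent.
    import trafilatura  # third-party: pip install trafilatura

    return trafilatura.extract(html, **EXTRACT_OPTIONS)
```

Trafilatura also provides trafilatura.fetch_url(url) for downloading a page before extraction, if you do not already have the HTML.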
Apify Web Scraper
- Sign up at apify.com.
- Use the Web Scraper actor or write a custom Puppeteer script.
- Define a pageFunction that extracts content with Readability.
- Schedule runs and export results as JSON or CSV.
Wayback Machine API
- Query archive.org/wayback/available?url=YOURURL to find a snapshot.
- Construct the snapshot URL: web.archive.org/web/TIMESTAMP/YOURURL.
- Fetch the snapshot HTML.
- Strip Wayback toolbar (wb_div, wb-iframe) before extraction.
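The first two Wayback Machine steps can be sketched as follows. The availability endpoint returns JSON with an archived_snapshots.closest object containing available, timestamp, and url fields; the helper names are illustrative, and toolbar stripping is not shown.

```python
import json
from typing import Optional
from urllib.parse import urlencode
from urllib.request import urlopen

AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"


def availability_url(page_url: str, timestamp: Optional[str] = None) -> str:
    """Build the availability query; timestamp (YYYYMMDD...) asks for the nearest capture."""
    params = {"url": page_url}
    if timestamp:
        params["timestamp"] = timestamp
    return f"{AVAILABILITY_ENDPOINT}?{urlencode(params)}"


def closest_snapshot(payload: dict) -> Optional[dict]:
    """Return the closest snapshot record if one is available, else None."""
    closest = payload.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest
    return None


def snapshot_url(timestamp: str, page_url: str) -> str:
    """Construct web.archive.org/web/TIMESTAMP/YOURURL."""
    return f"https://web.archive.org/web/{timestamp}/{page_url}"


def find_snapshot(page_url: str, timeout: float = 30.0) -> Optional[str]:
    """Query the API and return the snapshot URL (performs a network request)."""
    with urlopen(availability_url(page_url), timeout=timeout) as response:
        snapshot = closest_snapshot(json.load(response))
    return snapshot["url"] if snapshot else None
```

In practice the url field of the closest record can be used directly; snapshot_url is useful when you already know the timestamp you want.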
Grigora vs. alternatives
Side-by-side.
| Capability | Grigora | Tool A | Tool B | Free tools | Manual approach |
|---|---|---|---|---|---|
| Single-URL extraction | Yes | Yes | Paid only | Yes | Manual |
| JavaScript-rendered SPA support | Yes | No | Yes | Limited | Headless setup |
| Markdown output | Yes | Yes | JSON only | Yes | Manual |
| Metadata extraction (author, date) | Yes | Limited | Yes | Yes | Manual |
| Wayback Machine integration | Yes | No | No | No | Manual |
| Free without signup | Yes | Yes | $299/mo | Free tier | N/A |
| Paywall detection signaling | Yes | No | Yes | No | Manual |
| Custom CSS selector override | Yes | No | Yes | No | Manual |
Common errors and fixes
Eight issues users hit.
Empty content returned for SPA page
Cause: The page renders client-side and our static fetch returned the empty shell.
Fix: Switch to headless rendering mode in the tool options or wait 2-3 seconds for our automatic fallback to engage.
403 Forbidden from the source site
Cause: The site blocks our default User-Agent or rate-limited the request.
Fix: Try a different User-Agent string (set via advanced options) or fetch via the Wayback Machine archive URL.
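If you are reproducing the fetch yourself rather than using the tool's advanced options, setting a User-Agent looks roughly like this with the standard library. The helper names and the example User-Agent string are assumptions for illustration; respect the target site's robots.txt and terms when choosing one.

```python
from urllib.request import Request, urlopen


def build_request(page_url: str, user_agent: str) -> Request:
    """Create a GET request that identifies itself with the given User-Agent."""
    return Request(page_url, headers={"User-Agent": user_agent})


def fetch_html(page_url: str, user_agent: str, timeout: float = 30.0) -> bytes:
    """Fetch the raw page bytes (performs a network request)."""
    with urlopen(build_request(page_url, user_agent), timeout=timeout) as response:
        return response.read()
```

A 403 that persists across User-Agent strings usually points to IP-based blocking or rate limiting, in which case the archive route is the better fallback.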
Wrong content block identified
Cause: Readability picked a sidebar or comment section instead of the main article.
Fix: Use the manual selector option to specify a CSS selector targeting the article body, e.g., article.post-content.
Truncated output mid-article
Cause: A large nested element exceeded the parser depth limit.
Fix: Increase the parser depth in advanced options or fall back to raw HTML extraction and post-process manually.
Encoding issues with non-Latin characters
Cause: Source page declared wrong charset or used legacy encoding.
Fix: Specify the encoding manually (UTF-8, ISO-8859-1, etc.) in the advanced options or trust auto-detection.
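One way to handle a wrong or legacy charset when post-processing raw bytes yourself is sketched below. It assumes a simple policy: try strict UTF-8 first (which rarely succeeds by accident on non-UTF-8 data), then the declared charset, then a lenient fallback. decode_html is a hypothetical helper, not the tool's auto-detection.

```python
from typing import Optional


def decode_html(raw: bytes, declared: Optional[str] = None,
                fallback: str = "iso-8859-1") -> str:
    """Decode page bytes, tolerating a wrong or unknown declared charset."""
    for encoding in ("utf-8", declared):
        if not encoding:
            continue
        try:
            return raw.decode(encoding)
        except (UnicodeDecodeError, LookupError):
            # Wrong charset for these bytes, or a charset name Python does not know.
            continue
    # Last resort: never raise, substitute undecodable bytes instead.
    return raw.decode(fallback, errors="replace")
```

For example, a page declared as ISO-8859-1 but actually served as UTF-8 decodes correctly because UTF-8 is tried first.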
JavaScript-injected content missing
Cause: Headless render timeout fired before the SPA finished hydrating.
Fix: Increase the wait timeout to 5-10 seconds for slow apps, or use a wait-for-selector option.
Paywall content not bypassed
Cause: The site uses authentication or a hard paywall.
Fix: Try the Wayback Machine snapshot URL or accept the public teaser; bypassing paywalls is a ToS issue we will not engineer around.
Duplicate paragraphs in extraction
Cause: The page renders the article body in multiple containers (e.g., AMP and regular).
Fix: Use the deduplicate-paragraphs option or specify a single CSS selector to disambiguate.
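If duplicates survive in the downloaded output, they can also be removed afterwards. This is a minimal sketch, assuming paragraphs are separated by blank lines and that two paragraphs are duplicates when they match after collapsing whitespace; it is not the tool's own deduplication logic.

```python
import re


def dedupe_paragraphs(markdown: str) -> str:
    """Drop repeated paragraphs, keeping the first occurrence and original order."""
    seen = set()
    kept = []
    for paragraph in re.split(r"\n\s*\n", markdown.strip()):
        # Normalize whitespace so reflowed copies compare equal.
        key = " ".join(paragraph.split())
        if key and key not in seen:
            seen.add(key)
            kept.append(paragraph.strip())
    return "\n\n".join(kept)
```

Exact matching keeps the risk of deleting legitimate content low; near-duplicate detection would need a similarity threshold and more care.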
Original data
2026 study.
Frequently asked questions
Twelve answers.
Related free tools
Other utilities.