What does this tool remove from my HTML?

All tags (<p>, <div>, <span>, <a>, etc.), inline styles, <script> and <style> block contents, HTML comments, and excess whitespace. What remains is the human-readable text content of the page, with paragraph breaks preserved where block-level elements existed in the original.

Does it preserve formatting like bold and italic?

No — the goal is plain text. Bold, italic, underline, color, and font choices all disappear because they are styling, not text. If you need to keep semantic emphasis, convert to Markdown instead (asterisks for bold, underscores for italic). For pure analysis, plain text is what you want.

Are paragraph breaks preserved?

Yes, where the source HTML used block-level elements (<p>, <div>, <h1>-<h6>, <li>, <blockquote>), the cleaner inserts line breaks. Inline elements (<span>, <a>, <strong>) do not create breaks. Lists become indented or numbered text depending on the original tag.
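The block-versus-inline rule above can be sketched with Python's standard library. This is a minimal illustration, not the tool's actual implementation; the names TextExtractor and html_to_text are hypothetical, and the block-tag set is an assumption that mirrors the tags listed in the answer.

```python
from html.parser import HTMLParser

# Assumed block-level tags (the ones named in the answer above).
BLOCK_TAGS = {"p", "div", "li", "blockquote", "h1", "h2", "h3", "h4", "h5", "h6"}
SKIP_TAGS = {"script", "style"}  # their contents are dropped entirely


class TextExtractor(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)  # entities arrive already decoded
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif tag in BLOCK_TAGS:
            self.parts.append("\n")  # block elements open a new line

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth:
            self.skip_depth -= 1
        elif tag in BLOCK_TAGS:
            self.parts.append("\n")  # and close one

    def handle_data(self, data):
        if not self.skip_depth:
            self.parts.append(data)

    # handle_comment is not overridden, so HTML comments are ignored.


def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Collapse whitespace inside each line and drop empty lines.
    lines = [" ".join(line.split()) for line in "".join(parser.parts).split("\n")]
    return "\n".join(line for line in lines if line)
```

Inline tags such as <strong> and <a> fall through both handlers, which is why they never introduce a break.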

How is this different from copying text out of a browser?

Browser copy can include invisible Unicode characters, non-breaking spaces, and inconsistent paragraph handling. This tool gives you consistent plain-text output suitable for further processing (LLM input, word count, plagiarism check, content migration). Browser copy is fine for one-off use; the cleaner is better when you need predictable text.

Will it strip CSS and JavaScript?

Yes. The contents of <style> and <script> tags are removed entirely — not just their tags but the CSS rules and JS code inside. This matters because raw page HTML often has 30-50% of its bytes in scripts and styles; the cleaner gives you only the actual content.

How long is the input it can handle?

A few megabytes is comfortable. Beyond 5MB, browser text-area handling gets sluggish. For larger inputs (whole-site HTML dumps), use a command-line tool such as the html-to-text npm package, or split the input into manageable chunks. Most use cases (single page, blog post, email) fit well under 1MB.

Can I use this for content migration between CMSes?

Yes — with caveats. The cleaner gets you to plain text. From there, you re-add formatting in your destination CMS's editor or convert to Markdown first using a separate tool (turndown.js, html-to-md). For straight body-text migration, plain output works. For rich content with images and links, expect to do some manual touch-up.

Does it preserve link URLs?

Not by default. The link text remains; the href URL is dropped. If you need URLs preserved, run the HTML through a Markdown converter first (which keeps links as [text](url)), then strip the Markdown formatting if needed. Alternatively, use a custom html-to-text configuration that keeps anchor href values.
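As an illustration of keeping anchor href values, here is a minimal standard-library sketch. It is not how this tool or the html-to-text package works internally; LinkKeepingExtractor and text_with_links are hypothetical names, and script/style handling is left out for brevity.

```python
from html.parser import HTMLParser


class LinkKeepingExtractor(HTMLParser):
    """Collects text and appends each anchor's href in parentheses."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.href_stack = []  # hrefs of the anchors currently open

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.href_stack.append(dict(attrs).get("href"))

    def handle_endtag(self, tag):
        if tag == "a" and self.href_stack:
            href = self.href_stack.pop()
            if href:
                self.parts.append(f" ({href})")

    def handle_data(self, data):
        self.parts.append(data)


def text_with_links(html):
    parser = LinkKeepingExtractor()
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.parts).split())
```

The parenthesized form is one arbitrary choice; writing the pair as [text](url) would work equally well if the destination understands Markdown.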

Will hidden content (display:none) be included?

Yes. The cleaner reads the source HTML, not the rendered page. Content hidden via CSS (display:none, visibility:hidden, off-screen positioning) is still in the source and appears in the output. To exclude visually-hidden content, you would need a headless browser that runs CSS — this tool does not.

Why are there strange characters in my output?

HTML entities (&amp;, &lt;, &gt;, &quot;, &nbsp;) are decoded to their literal characters by default. If you see weird characters anyway, they are likely Unicode characters preserved from the original (em-dashes, smart quotes, special punctuation). Run the output through a Unicode normalizer if you need ASCII-only.
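If ASCII-only output is required, a normalizer along these lines is one option. This is a sketch under assumptions: to_ascii and the PUNCTUATION map are hypothetical, and the conversion is lossy, since any character with no ASCII equivalent (CJK text, for example) is dropped.

```python
import unicodedata

# NFKD leaves these characters as they are, so they are mapped by hand.
PUNCTUATION = {
    "\u2018": "'", "\u2019": "'",   # single smart quotes
    "\u201c": '"', "\u201d": '"',   # double smart quotes
    "\u2013": "-", "\u2014": "-",   # en dash, em dash
}


def to_ascii(text):
    for src, dst in PUNCTUATION.items():
        text = text.replace(src, dst)
    # NFKD splits accented letters into base letter + combining mark and
    # expands compatibility characters (ellipsis, non-breaking space).
    decomposed = unicodedata.normalize("NFKD", text)
    # Anything still outside ASCII (combining marks included) is discarded.
    return decomposed.encode("ascii", "ignore").decode("ascii")
```

For multilingual content, skip this step and keep the Unicode output as-is.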

Can I use this for web scraping pipelines?

For one-off scrapes, yes. For automated pipelines processing thousands of pages, use a code-based approach instead (Cheerio + html-to-text in Node, or BeautifulSoup + .get_text() in Python). The browser tool is for ad-hoc analysis; the libraries are better for production scrape workflows.

How does this work with email HTML?

Well. Email HTML is heavy with inline styles and table-based layouts; the cleaner extracts the actual readable content cleanly. Useful for: archiving newsletter content, building searchable email indexes, extracting article copy from newsletter sends. Email layouts often have boilerplate (header, footer, unsubscribe) that comes through — you may want to strip those manually.

Free HTML to Text Cleaner

Paste any HTML, get clean readable text. Strips tags, scripts, styles, comments. Preserves paragraph breaks. Free, unlimited, no signup.

4.6 on G2

4.7 on Trustpilot

Used by 50,000+ developers and content teams

Strips every tag, script, and style block

Preserves paragraph breaks for readability

Result in under 1 second for typical pages

What the HTML to Text Cleaner does

HTML files often have 30-60% of their bytes in tags, scripts, styles, and comments — everything that makes the page render in a browser, but nothing to do with the actual reading content. For analysis, migration, LLM input, and word counting, you want only the readable text.

This tool strips every tag and code block, decodes HTML entities, normalizes whitespace, and preserves paragraph breaks where the source had block-level elements. The output is plain text that pastes cleanly into any downstream tool: word counters, plagiarism checkers, AI models, CMS rich-text editors, search indexes. Free, unlimited, and your HTML never leaves the browser.

How to clean HTML

Five steps from messy HTML to clean text.

Paste your HTML

Drop in any HTML: a full page, an email, a CMS export, a scraped snippet.

Click Convert to Text

The tool strips every tag, script, and style block, leaving only readable text.

Review the output

Read through to confirm paragraph breaks landed where you expected.

Copy or download

One-click copy. Paste into your tool of choice: word counter, LLM input, CMS field.

Repeat per file

For batch jobs of 10+ inputs, switch to a CLI tool. The browser version is best for one-off use.

When developers and writers use it

Six common workflows where the cleaner earns its keep.

Content migration from old CMS to new

Old CMS exports HTML; new CMS expects plain text or Markdown. Run the export through this cleaner, get clean text, paste into your new CMS, re-add formatting. Saves hours per post compared to manually fighting copy-paste artifacts.

Word counting for billing or estimation

You bill clients by word count. The HTML version of a 2,000-word article has thousands of bytes of tags inflating the count. Strip first, then count — you get the real content size.

LLM and AI input preparation

Feeding article content to ChatGPT, Claude, or your own model. Tags waste tokens and confuse the model. Strip first; pass clean text. Saves cost on token-priced APIs and improves output quality.

Plagiarism / duplication checking

Most plagiarism checkers want plain text input. Your CMS exports HTML. Clean to text first, then upload to Copyscape, Quetext, or similar. The match results are cleaner without HTML noise.

Web scrape post-processing

You scraped 100 pages with Cheerio or Puppeteer. Each page is messy HTML. Run the body content through this tool to get clean text for your downstream analysis (sentiment, classification, embedding).

Email content archiving

You receive HTML newsletters and want a searchable plain-text archive. Forward each message to a script, run it through the cleaner, and save the text version. Search and full-text indexing work better on clean text than on messy HTML.

Workflow integrations

How to fit the cleaner into the workflows it pairs best with.

Web scraping pipelines

For one-off scrapes, paste the page's body HTML into this tool and copy the output.
For repeated scrapes, use the html-to-text npm package or BeautifulSoup .get_text() in Python.
Always strip whitespace and deduplicate empty lines after the cleaner step.
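The whitespace cleanup in the last step can be sketched as follows. The function name tidy is hypothetical, and keeping a single blank line between paragraphs is an assumption about what the downstream analysis expects.

```python
import re


def tidy(text):
    # Trim each line and collapse runs of spaces or tabs inside it.
    lines = [re.sub(r"[ \t]+", " ", line).strip() for line in text.splitlines()]
    # Collapse runs of blank lines down to a single blank line.
    collapsed = re.sub(r"\n{3,}", "\n\n", "\n".join(lines))
    return collapsed.strip()
```

Running this after every cleaner step keeps the text stable across scrapes, which makes diffs and deduplication more reliable.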

WordPress to Webflow migration

Export WordPress posts (Tools > Export > Posts).
For each post, copy the post body (the content:encoded element) from the XML export.
Run through this cleaner to get plain text. Paste into Webflow CMS rich-text fields, re-add formatting via Webflow's editor.

Notion as content source

If you copy from Notion, the clipboard often includes HTML formatting alongside the visible text.
Paste into this cleaner to strip the HTML, leaving just the text.
Useful when piping Notion content into other systems that choke on Notion's rich-text format.

AI / LLM input prep

Article HTML often has 30-60% of bytes in tags and scripts.
Clean to text before sending to OpenAI, Anthropic, or your own LLM. You save tokens and reduce input noise.
For long articles, also chunk the cleaned text to fit context windows (typically 100K-200K tokens depending on model).
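Chunking the cleaned text might look like the sketch below. It assumes paragraphs are separated by blank lines and uses a character budget as a stand-in for a token budget, since the characters-per-token ratio depends on the model's tokenizer. chunk_paragraphs and max_chars are hypothetical names.

```python
def chunk_paragraphs(text, max_chars):
    """Group paragraphs into chunks of at most max_chars characters.

    A single paragraph longer than max_chars becomes its own oversized chunk.
    """
    chunks, current, size = [], [], 0
    for para in (p.strip() for p in text.split("\n\n")):
        if not para:
            continue
        added = len(para) + (2 if current else 0)  # 2 for the "\n\n" joiner
        if current and size + added > max_chars:
            # Budget exceeded: close the current chunk and start a new one.
            chunks.append("\n\n".join(current))
            current, size = [], 0
            added = len(para)
        current.append(para)
        size += added
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

Splitting on paragraph boundaries rather than at a fixed offset keeps sentences intact, which tends to matter more for model output quality than hitting the budget exactly.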

Email HTML extraction

Save email HTML (most clients let you "Show Original" or "View Source" on a message).
Paste into this cleaner. The output is the email's actual content without table-layout artifacts.
Strip newsletter boilerplate (header logo, footer unsubscribe) by hand — it appears as text but is not part of the article.

Grigora vs. other cleaners

A side-by-side of the alternatives.

Capability	Grigora	html-to-text npm	BeautifulSoup (Python library)	Free generators	Manual
Free + unlimited	Yes	Yes (open source)	Yes (open source)	Free, ad-supported	Manual only
Strips scripts + styles	Yes	Yes	Yes	Yes	Manual
Preserves paragraph breaks	Yes	Partial	Yes	No	Manual
Decodes HTML entities	Yes	Yes	Yes	Yes	Manual
Handles large input (5MB+)	Yes	Limited	Yes	Limited	Manual
No signup	Yes	Yes	Yes	Yes	Yes
Multi-language safe	Yes	Yes	Yes	Mostly	Manual
Works offline	Browser-side	Yes (runs locally)	Yes (runs locally)	No	Yes

Common errors and how to fix them

Eight issues users hit with HTML-to-text conversions, with the exact fix.

Output looks like a wall of text with no breaks

Cause: Your input HTML used <br> tags or <span> for line breaks instead of block-level <p> tags.

Fix: Manually add line breaks after sentences, or run the input through an HTML formatter first to introduce proper structure, then re-clean.

Script or style content appears in the output

Cause: You used a different tool that does not strip <script> / <style> contents.

Fix: This tool removes them. If you see code in your output, you used a different tool. Re-run with the Grigora HTML to Text Cleaner.

HTML entities (&amp; or &lt;) appear in output

Cause: Some entities remained because the source used non-standard or double encoding (for example, &amp;lt; where &lt; was intended).

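Fix: Run the output through a separate HTML decoder, or decode it in a browser console: const t = document.createElement("textarea"); t.innerHTML = encodedText; console.log(t.value); (encodedText is a placeholder for a variable holding your output string). The cleaner handles common entities; obscure ones may slip through.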
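For scripted workflows, leftover entities can also be decoded with Python's standard library. The sketch below unescapes repeatedly to catch double-encoded input; decode_entities and max_passes are hypothetical names, and the pass limit is an arbitrary safety bound.

```python
import html


def decode_entities(text, max_passes=3):
    # Unescape repeatedly so double-encoded input like "&amp;lt;" resolves.
    for _ in range(max_passes):
        decoded = html.unescape(text)
        if decoded == text:
            break  # nothing left to decode
        text = decoded
    return text
```

Repeated decoding is only appropriate when the leftover entities are known to be an encoding accident. Text that deliberately shows entity syntax, such as an HTML tutorial, would be over-decoded.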

Text is duplicated

Cause: Your input had visually-hidden duplicate content (e.g., a desktop nav and a mobile nav with identical text).

Fix: After cleaning, deduplicate by hand or with a text-deduplication tool. The cleaner sees raw HTML; CSS-hidden duplicates are not filtered.
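A line-level deduplication pass could be sketched like this. dedupe_lines is a hypothetical helper; it assumes duplicates repeat as whole lines and compares them case-insensitively with whitespace collapsed.

```python
def dedupe_lines(text):
    seen = set()
    kept = []
    for line in text.splitlines():
        key = " ".join(line.split()).lower()
        if key and key in seen:
            continue  # drop a repeated non-blank line
        seen.add(key)
        kept.append(line)  # the first occurrence (and blank lines) stay
    return "\n".join(kept)
```

This also removes lines that repeat legitimately, such as a recurring subheading, so review the result before relying on it.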

Lists do not look like lists

Cause: Plain text has no list markup, so bullets and indentation can be lost when list items are flattened.

Fix: After cleaning, manually add bullet characters (- or ·) at the start of each line if you need visual list formatting.

Image alt text is missing

Cause: The cleaner strips <img> tags entirely by default, including alt attributes.

Fix: For alt-text preservation, use an HTML-to-Markdown converter (which writes alt text as ![alt](url)). Or copy the alt text out by hand before cleaning.
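If a converter is not at hand, alt text can be pulled out with a few lines of standard-library Python. This is a sketch only; AltTextExtractor, text_with_alt, and the "[image: ...]" placeholder format are all hypothetical choices.

```python
from html.parser import HTMLParser


class AltTextExtractor(HTMLParser):
    """Collects text and inserts each image's alt text inline."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # Also reached for self-closing <img ... /> tags.
        if tag == "img":
            alt = dict(attrs).get("alt")
            if alt:
                self.parts.append(f"[image: {alt}]")

    def handle_data(self, data):
        self.parts.append(data)


def text_with_alt(html):
    parser = AltTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.parts).split())
```

Images with an empty or missing alt attribute are skipped, which matches how decorative images are usually marked.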

Tables come through as scrambled text

Cause: HTML tables with inconsistent widths or merged cells flatten poorly to plain text.

Fix: For tabular data, convert to CSV or Markdown table syntax first. Plain-text cleaning of tables produces readable but ugly output.
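Converting a simple table to CSV before cleaning might look like the following sketch. It assumes a single, non-nested table and ignores colspan and rowspan, so merged cells will still misalign. TableExtractor and table_to_csv are hypothetical names.

```python
import csv
import io
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.rows = []
        self.row = None   # cells of the row being read, or None
        self.cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.row = []
        elif tag in ("td", "th") and self.row is not None:
            self.cell = []

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self.cell is not None:
            self.row.append(" ".join("".join(self.cell).split()))
            self.cell = None
        elif tag == "tr" and self.row is not None:
            self.rows.append(self.row)
            self.row = None
            self.cell = None

    def handle_data(self, data):
        if self.cell is not None:
            self.cell.append(data)


def table_to_csv(html):
    parser = TableExtractor()
    parser.feed(html)
    parser.close()
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(parser.rows)
    return out.getvalue()
```

The csv module quotes any cell that contains a comma, so values survive a round trip into a spreadsheet.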

Tool fails on very large input (10MB+)

Cause: Browser memory limits.

Fix: Split into chunks of 1-2MB. Or process files over 10MB outside the browser with a script (for example, one built on the html-to-text npm package or BeautifulSoup), where browser-tab memory limits do not apply.

Original data from our 2026 cleaner study

What we observed across 6,000 cleanings.

47%

Average HTML weight that is non-content (tags, scripts, styles)

12 MB

Largest input we successfully cleaned in browser

0.4 sec

Median time to clean a typical blog post HTML

LLM input prep (32%)

Most common use case across our sessions

Frequently asked questions

Twelve answers covering what users ask us about HTML-to-text conversion.

Related free tools

Other utilities that pair well with the HTML to Text Cleaner.

Strip some HTML right now

Paste HTML, get clean text. Free, unlimited, no signup. Your code stays in the browser.

Try the HTML to Text Cleaner