php convert html to php
HTML To Markdown for PHP
Library which converts HTML to Markdown for your sanity and convenience.
Requires: PHP 7.2+
Lead Developer: @colinodell
Original Author: @nickcernis
Why convert HTML to Markdown?
"What alchemy is this?" you mutter. "I can see why you’d convert Markdown to HTML," you continue, already labouring the question somewhat, "but why go the other way?"
Typically you would convert HTML to Markdown if:
Require the library by issuing this command:
Add require 'vendor/autoload.php'; to the top of your script.
Next, create a new HtmlConverter instance, passing in your valid HTML code to its convert() function:
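A minimal sketch of that call, assuming the library was installed with Composer under the package name league/html-to-markdown and that its classes live in the League\HTMLToMarkdown namespace. The sample HTML string is only an illustration:

```php
<?php
require 'vendor/autoload.php';

use League\HTMLToMarkdown\HtmlConverter;

// Create the converter with its default options.
$converter = new HtmlConverter();

// Illustrative input; any valid HTML string can be passed.
$html = '<h3>Quick, to the Batpoles!</h3>';
$markdown = $converter->convert($html);

echo $markdown, PHP_EOL;
```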
The included demo directory contains an HTML->Markdown conversion form to try out.
By default, HTML To Markdown preserves HTML tags without Markdown equivalents, like <span> and <div>.
To strip HTML tags that don’t have a Markdown equivalent while preserving the content inside them, set strip_tags to true, like this:
Or more explicitly, like this:
Note that only the tags themselves are stripped, not the content they hold.
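Both forms of setting strip_tags might look like the sketch below. It assumes the Composer autoloader and the League\HTMLToMarkdown namespace; the sample HTML is illustrative only:

```php
<?php
require 'vendor/autoload.php';

use League\HTMLToMarkdown\HtmlConverter;

// Form 1: pass the option to the constructor.
$converter = new HtmlConverter(array('strip_tags' => true));

// Form 2: set the option explicitly on an existing converter.
$converter = new HtmlConverter();
$converter->getConfig()->setOption('strip_tags', true);

// <span> has no Markdown equivalent, so its tags are dropped and its text is kept.
$html = '<span>Turnips!</span>';
echo $converter->convert($html), PHP_EOL;
```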
By default, all comments are stripped from the content. To preserve them, use the preserve_comments option, like this:
To preserve only specific comments, set preserve_comments with an array of strings, like this:
By default, placeholder links are preserved. To strip the placeholder links, use the strip_placeholder_links option, like this:
By default, bold tags are converted using the asterisk syntax, and italic tags are converted using the underscore syntax. Change these by using the bold_style and italic_style options.
Line break options
By default, br tags are converted to two spaces followed by a newline character as per traditional Markdown. Set hard_break to true to omit the two spaces, as per GitHub Flavored Markdown (GFM).
By default, a tags are converted to the easiest possible link syntax, i.e. if no text or title is available, then the <url> syntax will be used rather than the full [url](url) syntax. Set use_autolinks to false to change this behavior to always use the full link syntax.
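The emphasis, line break, and autolink options described above can all be set on an existing converter. This is a sketch under the same assumptions as before (Composer autoloader, League\HTMLToMarkdown namespace); the option values and the sample HTML are illustrative choices:

```php
<?php
require 'vendor/autoload.php';

use League\HTMLToMarkdown\HtmlConverter;

$converter = new HtmlConverter();
$config = $converter->getConfig();

// Swap the emphasis markers: asterisks for italics, underscores for bold.
$config->setOption('italic_style', '*');
$config->setOption('bold_style', '__');

// Emit a bare newline for <br> instead of two trailing spaces plus a newline.
$config->setOption('hard_break', true);

// Always write links in the full [url](url) form.
$config->setOption('use_autolinks', false);

$html = '<em>Italic</em> and a <strong>bold</strong><br>'
      . '<a href="https://example.com">https://example.com</a>';
echo $converter->convert($html), PHP_EOL;
```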
Passing custom Environment object
You can pass a custom Environment object to control, for example, which converters are used.
Header
Support for Markdown tables is not enabled by default because it is not part of the original Markdown syntax. To use tables add the converter explicitly:
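Registering the table converter might look like this. The sketch assumes a library version that ships League\HTMLToMarkdown\Converter\TableConverter; the sample table is illustrative:

```php
<?php
require 'vendor/autoload.php';

use League\HTMLToMarkdown\Converter\TableConverter;
use League\HTMLToMarkdown\HtmlConverter;

$converter = new HtmlConverter();

// Tables are opt-in: add the converter to the converter's environment.
$converter->getEnvironment()->addConverter(new TableConverter());

$html = '<table><tr><th>A</th></tr><tr><td>a</td></tr></table>';
echo $converter->convert($html), PHP_EOL;
```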
Setext (underlined) headers are the default for H1 and H2. If you prefer the ATX style for H1 and H2 (# Header 1 and ## Header 2), set header_style to 'atx' in the options array when you instantiate the object:
$converter = new HtmlConverter(array('header_style' => 'atx'));
Headers of H3 priority and lower always use the ATX style.
Links and images are referenced inline. Footnote references (where image src and anchor href attributes are listed in the footnotes) are not used.
Blockquotes aren’t line wrapped – it makes the converted Markdown easier to edit.
HTML To Markdown requires PHP’s xml, lib-xml, and dom extensions, all of which are enabled by default on most distributions.
Errors such as "Fatal error: Class 'DOMDocument' not found" on distributions such as CentOS that disable PHP’s xml extension can be resolved by installing php-xml.
Many thanks to all contributors so far. Further improvements and feature suggestions are very welcome.
HTML To Markdown creates a DOMDocument from the supplied HTML, walks through the tree, and converts each node to a text node containing the equivalent Markdown, starting from the most deeply nested node and working outwards towards the root node.
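The deepest-first idea can be sketched with plain DOM calls. This is not the library's code: sketchNodeToMarkdown is a hypothetical helper that handles only emphasis tags, and it returns strings from a recursion instead of replacing nodes in the tree. It needs PHP's dom extension:

```php
<?php
// Hypothetical helper, not part of the library: converts only
// <em>/<i> and <strong>/<b>, and passes everything else through.
function sketchNodeToMarkdown(DOMNode $node): string
{
    // Text nodes are the leaves of the tree: return their content unchanged.
    if ($node instanceof DOMText) {
        return (string) $node->nodeValue;
    }
    // Skip comments and other non-element nodes.
    if (!($node instanceof DOMElement)) {
        return '';
    }

    // Convert the children first, so the most deeply nested nodes
    // are finished before their parent is wrapped.
    $inner = '';
    foreach ($node->childNodes as $child) {
        $inner .= sketchNodeToMarkdown($child);
    }

    switch ($node->nodeName) {
        case 'em':
        case 'i':
            return '_' . $inner . '_';
        case 'strong':
        case 'b':
            return '**' . $inner . '**';
        default:
            return $inner;
    }
}

$doc = new DOMDocument();
$doc->loadHTML('<p>A <strong>bold <em>move</em></strong></p>');
$body = $doc->getElementsByTagName('body')->item(0);

$result = sketchNodeToMarkdown($body);
echo $result, PHP_EOL; // A **bold _move_**
```

The inner em element is converted before the strong element that contains it, which is the ordering the paragraph above describes.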
Converting HTML to plain text in PHP for e-mail
I use TinyMCE to allow minimal formatting of text within my site. From the HTML that’s produced, I’d like to convert it to plain text for e-mail. I’ve been using a class called html2text, but it’s really lacking in UTF-8 support, among other things. I do, however, like that it maps certain HTML tags to plain text formatting, like putting underscores around text that previously had <i> tags in the HTML.
Does anyone use a similar approach to converting HTML to plain text in PHP? And if so: Do you recommend any third-party classes that I can use? Or how do you best tackle this issue?
Use html2text (example HTML to text), licensed under the Eclipse Public License. It uses PHP’s DOM methods to load from HTML, and then iterates over the resulting DOM to extract plain text. Usage:
Although incomplete, it is open source and contributions are welcome.
Issues with other conversion scripts:
here is another solution:
For other variations of sanitization functions, see:
Converting from HTML to text using a DOMDocument is a viable solution. Consider HTML2Text, which requires PHP5:
Regarding UTF-8, the write-up on the "howto" page states:
PHP’s own support for unicode is quite poor, and it does not always handle utf-8 correctly. Although the html2text script uses unicode-safe methods (without needing the mbstring module), it cannot always cope with PHP’s own handling of encodings. PHP does not really understand unicode or encodings like utf-8, and uses the base encoding of the system, which tends to be one of the ISO-8859 family. As a result, what may look to you like a valid character in your text editor, in either utf-8 or single-byte, may well be misinterpreted by PHP. So even though you think you are feeding a valid character into html2text, you may well not be.
The author provides several approaches to solving this and states that version 2 of HTML2Text (using DOMDocument) has UTF-8 support.
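As a rough illustration of the DOMDocument route with UTF-8 input (not the HTML2Text code itself), the sketch below extracts paragraph text only. sketchHtmlToText is a hypothetical name, and the XML declaration prefix is a common workaround that makes libxml read the input as UTF-8:

```php
<?php
// Hypothetical helper: returns the text of each <p>, separated by blank lines.
function sketchHtmlToText(string $html): string
{
    $doc = new DOMDocument();
    // The prefix tells libxml the bytes are UTF-8; the flags silence
    // parser notices about imperfect markup.
    $doc->loadHTML('<?xml encoding="UTF-8">' . $html, LIBXML_NOERROR | LIBXML_NOWARNING);

    $text = '';
    foreach ($doc->getElementsByTagName('p') as $paragraph) {
        // textContent drops the tags and keeps the characters inside them.
        $text .= trim($paragraph->textContent) . "\n\n";
    }

    return rtrim($text);
}

echo sketchHtmlToText('<p>Café <b>crème</b></p><p>déjà vu</p>'), PHP_EOL;
```

A real converter would also handle headings, lists, links, and line breaks; this only shows that the DOM hands back UTF-8 strings once the input encoding is declared.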
html_entity_decode
(PHP 4 >= 4.3.0, PHP 5, PHP 7, PHP 8)
html_entity_decode — Convert HTML entities to their corresponding characters
Description
html_entity_decode() is the opposite of htmlentities() in that it converts HTML entities in the string to their corresponding characters.
More precisely, this function decodes all the entities (including all numeric entities) that a) are necessarily valid for the chosen document type — i.e., for XML, this function does not decode named entities that might be defined in some DTD — and b) whose character or characters are in the coded character set associated with the chosen encoding and are permitted in the chosen document type. All other entities are left as is.
Parameters
An optional argument defining the encoding used when converting characters.
If omitted, encoding defaults to the value of the default_charset configuration option.
Although this argument is technically optional, you are highly encouraged to specify the correct value for your code if the default_charset configuration option may be set incorrectly for the given input.
The following character sets are supported:
Charset | Aliases | Description |
---|---|---|
ISO-8859-1 | ISO8859-1 | Western European, Latin-1. |
ISO-8859-5 | ISO8859-5 | Little used cyrillic charset (Latin/Cyrillic). |
ISO-8859-15 | ISO8859-15 | Western European, Latin-9. Adds the Euro sign, French and Finnish letters missing in Latin-1 (ISO-8859-1). |
UTF-8 | | ASCII compatible multi-byte 8-bit Unicode. |
cp866 | ibm866, 866 | DOS-specific Cyrillic charset. |
cp1251 | Windows-1251, win-1251, 1251 | Windows-specific Cyrillic charset. |
cp1252 | Windows-1252, 1252 | Windows specific charset for Western European. |
KOI8-R | koi8-ru, koi8r | Russian. |
BIG5 | 950 | Traditional Chinese, mainly used in Taiwan. |
GB2312 | 936 | Simplified Chinese, national standard character set. |
BIG5-HKSCS | | Big5 with Hong Kong extensions, Traditional Chinese. |
Shift_JIS | SJIS, SJIS-win, cp932, 932 | Japanese |
EUC-JP | EUCJP, eucJP-win | Japanese |
MacRoman | | Charset that was used by Mac OS. |
'' | | An empty string activates detection from script encoding (Zend multibyte), default_charset and current locale (see nl_langinfo() and setlocale() ), in this order. Not recommended. |
Note: Any other character sets are not recognized. The default encoding will be used instead and a warning will be emitted.
Return Values
Returns the decoded string.
Changelog
Examples
Example #1 Decoding HTML entities
$orig = "I'll \"walk\" the dog now";
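A round trip through htmlentities() and html_entity_decode() can be sketched as follows. The variable names and the <b> tags in the sample string are illustrative, and the flags and encoding are passed explicitly so the result does not depend on the PHP version's defaults:

```php
<?php
$orig = "I'll \"walk\" the <b>dog</b> now";

// Encode: quotes, angle brackets, and ampersands become entities.
$encoded = htmlentities($orig, ENT_QUOTES, 'UTF-8');

// Decode with the same flags and encoding to get the original back.
$decoded = html_entity_decode($encoded, ENT_QUOTES, 'UTF-8');

echo $encoded, PHP_EOL; // I&#039;ll &quot;walk&quot; the &lt;b&gt;dog&lt;/b&gt; now
echo $decoded, PHP_EOL; // I'll "walk" the <b>dog</b> now
```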
Notes
You might wonder why trim(html_entity_decode('&nbsp;')); doesn’t reduce the string to an empty string. That’s because the '&nbsp;' entity is not ASCII code 32 (which is stripped by trim() ) but character code 160 (0xa0) in the default ISO 8859-1 encoding.
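The effect is easy to check by looking at the decoded bytes. In this sketch the encodings are passed explicitly, and the regular expression is one possible way to strip the non-breaking space from a UTF-8 string, not the only one:

```php
<?php
$utf8   = html_entity_decode('&nbsp;', ENT_QUOTES, 'UTF-8');
$latin1 = html_entity_decode('&nbsp;', ENT_QUOTES, 'ISO-8859-1');

echo bin2hex($utf8), PHP_EOL;   // c2a0 (U+00A0 encoded as two UTF-8 bytes)
echo bin2hex($latin1), PHP_EOL; // a0   (a single byte in ISO-8859-1)

// trim() only strips space, tab, newline, carriage return, NUL, and vertical tab
// by default, so the non-breaking space survives.
var_dump(trim($utf8) === $utf8); // bool(true)

// Strip ordinary whitespace and U+00A0 from both ends of a UTF-8 string.
$clean = preg_replace('/^[\s\x{A0}]+|[\s\x{A0}]+$/u', '', $utf8);
var_dump($clean === ''); // bool(true)
```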
See Also
User Contributed Notes 20 notes
If you need something that converts &#[0-9]+ entities to UTF-8, this is simple and works:
/* Entity crap. */
$input = "Fovi&#269;";
It seems that ENT_XML1 and ENT_XHTML are identical when decoding.
This functionality is now implemented in the PEAR package PHP_Compat.
More information about using this function without upgrading your version of PHP can be found on the below link:
The following function decodes named and numeric HTML entities and works on UTF-8. Requires iconv.
This is a safe rawurldecode with utf8 detection:
I wanted to use this function today and I found the documentation, especially about the flags, not particularly helpful.
Running the code below, for example, failed because the flag I used was the wrong one.
The correct flag to use in this case is ENT_QUOTES.
My understanding is that the flag to use is the one that corresponds to the expected, converted outcome: ENT_QUOTES for a character that would be a single or double quote when converted, and so on.
Please help make the documentation a bit clearer.
Quick & dirty code that translates numeric entities to UTF-8.
$no_bytes = 0 ;
$byte = array();
$test = 'This is a &#269;&#1488; test';
Handy function to convert remaining HTML-entities into human readable chars (for entities which do not exist in target charset):
I wrote in a previous comment that html_entity_decode() only handled about 100 characters. That’s not quite true; it only handles entities that exist in the output character set (the third argument). If you want to get ALL HTML entities, make sure you use ENT_QUOTES and set the third argument to ‘UTF-8’.
If you don’t want a UTF-8 string, you’ll need to convert it afterward with something like utf8_decode(), iconv(), or mb_convert_encoding().
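The difference the flags and target encoding make can be seen in a short sketch; the sample string is illustrative. Note that utf8_decode() is deprecated as of PHP 8.2, so iconv() or mb_convert_encoding() are the safer choices for a later conversion:

```php
<?php
$encoded = '&euro;5 &amp; &#039;caf&eacute;&#039;';

// UTF-8 can represent every HTML entity, so everything is decoded.
$utf8 = html_entity_decode($encoded, ENT_QUOTES, 'UTF-8');
echo $utf8, PHP_EOL; // €5 & 'café'

// ISO-8859-1 has no euro sign, so &euro; is left as is;
// &eacute; becomes the single byte 0xE9.
$latin1 = html_entity_decode($encoded, ENT_QUOTES, 'ISO-8859-1');

// Without a quote flag, quote entities are not decoded.
$noQuotes = html_entity_decode('&#039;', ENT_NOQUOTES, 'UTF-8');
echo $noQuotes, PHP_EOL; // &#039;
```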
If you’re producing XML, which doesn’t recognise most HTML entities:
When producing a UTF-8 document (the default), use htmlspecialchars(html_entity_decode($string, ENT_QUOTES, 'UTF-8'), ENT_NOQUOTES, 'UTF-8') (because you only need to escape < and & unless you’re printing inside the XML tags themselves).
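Applied to a sample string, that decode-then-escape step behaves as sketched here. The input is illustrative; note that htmlspecialchars() also escapes >, which is harmless in XML:

```php
<?php
$string = 'Tom &amp; Jerry &copy; &lt;2024&gt; &quot;ok&quot;';

// Step 1: turn every HTML entity into its UTF-8 character.
$plain = html_entity_decode($string, ENT_QUOTES, 'UTF-8');

// Step 2: re-escape only what XML text content requires (quotes are left alone).
$xmlSafe = htmlspecialchars($plain, ENT_NOQUOTES, 'UTF-8');

echo $xmlSafe, PHP_EOL; // Tom &amp; Jerry © &lt;2024&gt; "ok"
```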
Otherwise, either convert all the named entities to numeric ones, or declare the named entities in the document’s DTD. The full list of 252 entities can be found in the HTML 4.01 Spec, or you can cut and paste the function from my site (http://inanimatt.com/php-convert-entities.php).
I had a problem getting the ‘TM’ trademark symbol to display correctly in an email subject line. Using html_entity_decode() with different charsets didn’t work, but directly replacing the entity with its ASCII equivalent did:
We were having very peculiar behavior regarding foreign characters such as e-acute.
However, it was only showing up as a problem when extracting those characters out of our mysql database and when being displayed through a proxy server of ours that handles dns issues.
As other users have made a note of, the default character setting wasn’t what they were expecting it to be when they left theirs blank.
When we changed our default_charset to "UTF-8", our problems went away and we no longer needed functions like these to handle foreign characters such as e-acute. Good enough for us!
echo urlencode(html_entity_decode("&nbsp;"));
?>
will output "%A0" under ISO-8859-1 (or "%C2%A0" under UTF-8, the default encoding since PHP 5.4) instead of "+".
This decodes characters encoded by JavaScript’s escape() function.
It is useful when the page uses multibyte characters.
This function seems to have two limitations (at least in PHP 4.3.8):
a) it does not work with multibyte character codings, such as UTF-8
b) it does not decode numeric entity references
a) can be solved by using iconv to convert to ISO-8859-1, then decoding the entities, then converting to UTF-8 again. But that’s quite ugly and destroys all characters not present in Latin-1.
b) can be solved rather nicely using the following code:
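One way to decode numeric entity references by hand is sketched below. It is not the note's original code: both function names are hypothetical, and current PHP versions decode numeric entities directly via html_entity_decode() with 'UTF-8' as the encoding. The sketch avoids the mbstring and iconv extensions by building the UTF-8 bytes itself:

```php
<?php
// Hypothetical helper: encode one Unicode code point as UTF-8 bytes.
function sketchCodePointToUtf8(int $cp): string
{
    if ($cp < 0x80) {
        return chr($cp);
    }
    if ($cp < 0x800) {
        return chr(0xC0 | ($cp >> 6)) . chr(0x80 | ($cp & 0x3F));
    }
    if ($cp < 0x10000) {
        return chr(0xE0 | ($cp >> 12))
             . chr(0x80 | (($cp >> 6) & 0x3F))
             . chr(0x80 | ($cp & 0x3F));
    }
    return chr(0xF0 | ($cp >> 18))
         . chr(0x80 | (($cp >> 12) & 0x3F))
         . chr(0x80 | (($cp >> 6) & 0x3F))
         . chr(0x80 | ($cp & 0x3F));
}

// Hypothetical helper: replace &#NNN; and &#xHHH; references with UTF-8 text.
function sketchDecodeNumericEntities(string $text): string
{
    return preg_replace_callback(
        '/&#(?:x([0-9a-fA-F]+)|([0-9]+));/',
        function (array $m): string {
            // Group 1 holds hex digits, group 2 holds decimal digits.
            $cp = $m[1] !== '' ? hexdec($m[1]) : (int) $m[2];
            // Leave out-of-range values and surrogates untouched.
            if ($cp > 0x10FFFF || ($cp >= 0xD800 && $cp <= 0xDFFF)) {
                return $m[0];
            }
            return sketchCodePointToUtf8((int) $cp);
        },
        $text
    );
}

echo sketchDecodeNumericEntities('This is a &#269;&#1488; test, price &#x20AC;5'), PHP_EOL;
```

Named entities such as &amp; are deliberately left alone here, so the function can be combined with html_entity_decode() if both kinds need decoding.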
Here are the ultimate functions to convert HTML entities to UTF-8:
The main function is htmlentities2utf8
Others are helper functions
// Callback for preg_replace_callback(‘