HTML Purifier

The problem

I want to allow users to input HTML-formatted text, but I only want them to use certain tags and never any JavaScript. Sometimes users will copy and paste WYSIWYG-formatted HTML, with its associated CSS classes and inline style rules – but I don’t want that to mess up my site design.

A simplistic approach is to attempt to use regular expressions to filter out unwanted HTML tags, but this becomes tedious and is always fraught with risk because it is notoriously difficult to anticipate and catch all the possible permutations of HTML tags and their attributes.
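To make the risk concrete, here is a minimal sketch of a regex-based filter and two inputs that slip past it. The function name naive_strip_scripts and the sample strings are made up for illustration; they are not from any library.

```php
<?php
// Naive filter: remove <script>...</script> blocks with a single regex.
function naive_strip_scripts(string $html): string
{
    return preg_replace('/<script\b[^>]*>.*?<\/script>/is', '', $html);
}

// Handled: a plain script block is removed.
$plain = '<p>Hi</p><script>alert(1)</script>';
$plain_out = naive_strip_scripts($plain);

// Missed: JavaScript in an event-handler attribute is not a <script> tag,
// so the string passes through unchanged.
$handler = '<img src="x" onerror="alert(1)">';
$handler_out = naive_strip_scripts($handler);

// Missed: removing the inner match splices the outer pieces back together
// into a working script tag, and preg_replace does not rescan its result.
$nested = '<scr<script></script>ipt>alert(1)</script>';
$nested_out = naive_strip_scripts($nested);

echo $plain_out, "\n", $handler_out, "\n", $nested_out, "\n";
```

Each bypass could be patched with another pattern, but every patch only covers the cases you have already thought of, which is exactly the problem with a blacklist.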

A more successful approach is to use a pseudo markup language like BBCode or WikiText, but both of these require users to learn another markup language, which is likely to deter them from posting.

Is there a better alternative?

Yes there is! HTML Purifier is a standards-compliant HTML filter library written in PHP. HTML Purifier will not only remove all malicious code (better known as XSS) with a thoroughly audited, secure yet permissive whitelist, it will also make sure your documents are standards compliant, something only achievable with a comprehensive knowledge of W3C’s specifications.

HTML Purifier works by decomposing the whole document into tokens and removing non-whitelisted elements, checking the well-formedness and nesting of tags, and validating all attributes according to their RFCs.
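The tokenize, whitelist and re-nest steps can be sketched in a few lines of plain PHP. This is a toy written for this post, not HTML Purifier's implementation: the function toy_purify is hypothetical, it drops every attribute rather than validating them, it ignores void elements such as br, and it should not be used on real input.

```php
<?php
// Toy illustration only. NOT HTML Purifier's code and NOT safe for production.
function toy_purify(string $html, array $allowed): string
{
    // 1. Tokenize: split the input into tag tokens and the text between them.
    $tokens = preg_split(
        '/(<[^>]*>)/',
        $html,
        -1,
        PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY
    );
    $out  = '';
    $open = []; // stack of whitelisted elements that are currently open

    foreach ($tokens as $token) {
        if ($token[0] !== '<') {
            // Text token: escape it so it can never be read as markup.
            $out .= htmlspecialchars($token, ENT_NOQUOTES, 'UTF-8', false);
            continue;
        }
        // 2. Whitelist: drop anything that is not a recognisable, allowed tag.
        if (!preg_match('/^<(\/?)([a-z][a-z0-9]*)(?=[\s\/>])[^>]*>$/i', $token, $m)) {
            continue;
        }
        $name = strtolower($m[2]);
        if (!in_array($name, $allowed, true)) {
            continue;
        }
        if ($m[1] === '') {
            // Opening tag: re-emit it in lowercase with all attributes removed.
            $out .= "<$name>";
            $open[] = $name;
        } elseif (in_array($name, $open, true)) {
            // 3. Nesting: close everything opened since the matching tag.
            do {
                $top = array_pop($open);
                $out .= "</$top>";
            } while ($top !== $name);
        }
    }
    // Close any elements the user left open.
    while ($open) {
        $out .= '</' . array_pop($open) . '>';
    }
    return $out;
}

$dirty = '<P class=MsoNormal onclick="evil()"><B>Bold <i>both</B> <font color=red>plain</font>';
echo toy_purify($dirty, ['p', 'b', 'i']), "\n";
```

For the sample input this prints <p><b>Bold <i>both</i></b> plain</p>: the attributes and the font tag are gone, the mis-nested i is closed before b, and the unclosed p is closed at the end. The real library does far more than this, which is the reason to use it instead of rolling your own.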

Why HTML Purifier

I’ve used HTML Purifier because it

  • uses a whitelist (e.g. allow only b, p, br, ul, ol and li tags)
  • outputs valid XHTML
  • protects against XSS
  • can remove attributes and classes from tags without removing the tags

Before and after

An example of HTML that a user may enter:

<P style="MARGIN: 0cm 0cm 0pt; mso-margin-top-alt: auto; mso-margin-bottom-alt: auto" class=MsoNormal><st1:Lorem w:st="on"><st1:place w:st="on"><B>LOREM IPSUM</B></st1:place></st1:Lorem><B>LOREM IPSUM</B></P>
<P style="MARGIN: 0cm 0cm 0pt; mso-margin-top-alt: auto; mso-margin-bottom-alt: auto" class=MsoNormal><st1:place w:st="on"><st1:PlaceName w:st="on">Lorem</st1:PlaceName> <st1:PlaceType w:st="on">Ipsum</st1:PlaceType></st1:place>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Quisque at augue vitae nisl sodales interdum. <st1:City w:st="on"><st1:place w:st="on">Lorem </st1:place></st1:City> Pellentesque erat enim, ullamcorper eget vehicula feugiat, auctor non nunc. Quisque vel molestie eros. Cras erat nulla, faucibus eget pretium at, cursus eu enim. <st1:place w:st="on"><st1:PlaceType w:st="on">lorem</st1:PlaceType> <st1:PlaceType w:st="on">Ipsum</st1:PlaceType></st1:place> Integer et eros lorem, eget pharetra justo. Maecenas accumsan eleifend leo, a ullamcorper justo venenatis ut. Vestibulum bibendum diam vel turpis lobortis bibendum.</P>
<P style="MARGIN: 0cm 0cm 0pt; mso-margin-top-alt: auto; mso-margin-bottom-alt: auto" class=MsoNormal><B>Lorem / Ipsum</B> </P>

After it has been passed through the filter:

<p><b>LOREM IPSUM</b><b>LOREM IPSUM</b></p>
<p>Lorem Ipsum Lorem ipsum dolor sit amet, consectetur adipiscing elit. Quisque at augue vitae nisl sodales interdum. Lorem Pellentesque erat enim, ullamcorper eget vehicula feugiat, auctor non nunc. Quisque vel molestie eros. Cras erat nulla, faucibus eget pretium at, cursus eu enim. lorem Ipsum Integer et eros lorem, eget pharetra justo. Maecenas accumsan eleifend leo, a ullamcorper justo venenatis ut. Vestibulum bibendum diam vel turpis lobortis bibendum.</p>
<p><b>Lorem / Ipsum</b></p>

This is the code used to achieve the before/after example:

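// Load HTML Purifier through its bundled autoloader (adjust the path to your install).
require_once '/path_to/HTMLPurifier/HTMLPurifier.auto.php';

$config = HTMLPurifier_Config::createDefault();
// Whitelist of permitted elements; any other tag is removed.
$config->set('HTML.AllowedElements', 'b,i,p,br,ul,ol,li');
// Allow no CSS classes and no attributes, so Word's class and style rules are dropped.
$config->set('Attr.AllowedClasses', '');
$config->set('HTML.AllowedAttributes', '');
// Remove elements that are left empty after filtering.
$config->set('AutoFormat.RemoveEmpty', true);
$purifier = new HTMLPurifier($config);

$remarks = 'the text to be filtered';
// Strip MS Word's <?xml ... /> namespace tags before purifying.
$remarks = preg_replace('/<\?xml[^>]+\/>/im', '', $remarks);
$remarks_cleaned = $purifier->purify($remarks);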

Apart from HTML Purifier itself, the only line of code doing any work here is the regex that removes the <?xml ... /> namespace tags MS Word adds to pasted HTML.
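To show what that regex does and does not match, here is a small standalone check. The sample strings are representative of what Word pastes, not taken from the example above, and the variable names are only for illustration.

```php
<?php
// Same pattern as in the snippet above.
$pattern = '/<\?xml[^>]+\/>/im';

// A namespace tag of the kind Word adds to pasted HTML: it ends in "/>".
$pasted = '<?xml:namespace prefix = o ns = "urn:schemas-microsoft-com:office:office" /><p>Hello</p>';
$stripped = preg_replace($pattern, '', $pasted);

// A standard XML declaration ends in "?>", so this pattern leaves it alone.
$declaration = '<?xml version="1.0"?><p>Hello</p>';
$untouched = preg_replace($pattern, '', $declaration);

echo $stripped, "\n", $untouched, "\n";
```

The first string comes back as just <p>Hello</p>, while the second is returned unchanged because the pattern requires the tag to close with "/>".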
