Filter HTML

Wednesday, October 1st, 2008

Many websites take input from users. If your website is going to take that input and redisplay it somewhere on the site, you really need to filter the HTML. If you are lucky, the requirements for your site will let you strip out all HTML. If you're not lucky, you will have to filter it.

Why Filter

First, why accept HTML at all? Because it is easy enough for users to work with, it is easy for a WYSIWYG editor to work with, and it is what you are going to be displaying. Really, it is best all around in terms of your site's performance and your user experience.

OK, so why filter? Well, there are some hurtful people out there, and some people who don't know what they are doing, and both can make you look bad. First, the people who don't know what they are doing will forget to close their tags (leaving a <b> tag open so the rest of the page turns bold) and will use other markup that ruins your design. Then the hurtful people will come in and add some JavaScript to destructive ends, compromising the security of your site. And keep in mind that script doesn't have to live inside a <script> tag. It can be in many attributes, such as onmouseover. Combine that with some inline style that enlarges and positions text to cover the whole page, and boom! The hacker just got their malicious script to run on your site without a script tag. Not good for your users.
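To make that last point concrete, here is a small Python sketch of a filter that only removes <script> blocks. The function name strip_script_tags and the sample input are made up for illustration; the point is only that the event handler attribute survives this kind of naive filtering.

```python
import re


def strip_script_tags(html):
    """Naive filter: remove <script>...</script> blocks and nothing else."""
    return re.sub(
        r"<script\b.*?</script\s*>",
        "",
        html,
        flags=re.IGNORECASE | re.DOTALL,
    )


# Hypothetical user submission: one script tag, plus a styled <b> tag
# that carries its script in an onmouseover attribute instead.
submitted = (
    "<script>steal()</script>"
    '<b style="position:fixed;top:0;left:0;font-size:999px" '
    'onmouseover="steal()">hello</b>'
)

cleaned = strip_script_tags(submitted)
print("<script" in cleaned)      # False: the script tag is gone
print("onmouseover" in cleaned)  # True: the attribute script is still there
```

So stripping script tags alone is not enough; the attributes have to be filtered too.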

How to Filter

We understand the need, so how do we accomplish the task? There are three main points.

1) Whitelist tags and attributes. Create a whitelist of allowed tags and their allowed attributes. Whitelists are better than blacklists because they are shorter, easier to maintain, and more restrictive. A comprehensive blacklist could take a long time to make, and whenever browsers add support for new tags, your blacklist requires updating. A whitelist won't break as new tags are supported.

2) Balance is needed. Your page can be ruined if the user-submitted code includes a stray </div></div>. Or what if the user opens a tag that they never close, maybe <center>? What will your site look like then? You need to balance your user-submitted HTML: every opened tag gets closed, and every closing tag matches an opening one. Also keep in mind tags that self-close, <img> or <br> for example, and to be XHTML compliant make sure they include the self-closing slash: <br />

3) Proper nesting. Improper nesting can lead to trouble in certain browsers similar to that of unbalanced tags. Check for <b><u>text</b></u>, which should be <b><u>text</u></b>.
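
The three points above could be sketched in Python with the standard library's html.parser module. This is only a minimal sketch under stated assumptions, not a production filter: the ALLOWED whitelist, the VOID set, the Sanitizer class and the sanitize function are all made-up names for illustration, and the check that href and src values start with http:// or https:// is an extra assumption that goes beyond the three points.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical whitelist: allowed tag -> set of allowed attributes.
ALLOWED = {
    "b": set(), "i": set(), "u": set(), "p": set(), "br": set(),
    "a": {"href", "title"},
    "img": {"src", "alt"},
}
VOID = {"br", "img"}          # tags that never take a closing tag
URL_ATTRS = {"href", "src"}   # attributes whose value is a URL


def _safe_value(name, value):
    """Assumption: URL attributes may only point at http(s) addresses."""
    if name in URL_ATTRS:
        return value.lower().startswith(("http://", "https://"))
    return True


class Sanitizer(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.open_tags = []  # stack of whitelisted tags still open

    def handle_starttag(self, tag, attrs):
        if tag not in ALLOWED:
            return  # point 1: drop unknown tags (their text is kept)
        kept = "".join(
            f' {name}="{escape(value or "")}"'
            for name, value in attrs
            if name in ALLOWED[tag] and _safe_value(name, value or "")
        )
        if tag in VOID:
            self.out.append(f"<{tag}{kept} />")  # XHTML-style self close
        else:
            self.out.append(f"<{tag}{kept}>")
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag not in self.open_tags:
            return  # point 2: ignore stray closers such as </div>
        # Point 3: close inner tags first so nesting stays proper.
        while self.open_tags:
            top = self.open_tags.pop()
            self.out.append(f"</{top}>")
            if top == tag:
                break

    def handle_data(self, data):
        self.out.append(escape(data))  # text is always re-escaped


def sanitize(html):
    parser = Sanitizer()
    parser.feed(html)
    parser.close()
    # Point 2 again: close whatever the user left open.
    while parser.open_tags:
        parser.out.append(f"</{parser.open_tags.pop()}>")
    return "".join(parser.out)


print(sanitize("<b><u>text</b></u>"))            # <b><u>text</u></b>
print(sanitize('<b onmouseover="evil()">bold'))  # <b>bold</b>
print(sanitize("</div></div>line<br>break"))     # line<br />break
```

Note that this sketch keeps the text inside dropped tags as plain escaped text, so a submitted <center>hello comes out as just hello. Whether to keep or discard that text is a design choice for your site.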

Here is the Code

So enough talk, here are some links to helpful code to get this done:

Also check out a project called Tidy. It has a lot of this functionality built in and is available for many languages. - HTML Tidy Project Page

So long, and be safe… Filter your HTML.