Friday, 21. January 2005
Written By : Peter de Waard
Category : XML
Location : Freiburg
Richard Schwarz asked how to purify HTML that has been submitted to a blog. Stephan suggested transforming the HTML to proper xHTML first using jTidy and then, since you now have well-formed XML (which can be validated against the xHTML schema), filtering it using XSLT.
We had a quick session and came up with a template you can download in the downloads section of this blog. First we had to decide on the general mode of operation. Either "Everything is OK unless explicitly forbidden" or "Only tags that are explicitly allowed will go through". We opted for the first one. The result was about 4k of XSLT that serves as a template for your specific needs.
The beginning of the template is pretty standard. The only special thing is that the xHTML namespace of the incoming file is bound to the prefix h: in the XSLT. You need that step to achieve proper matching, since an unprefixed name in an XSLT 1.0 pattern only matches elements that are in no namespace!
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:h="http://www.w3.org/1999/xhtml">
<xsl:output method="html" version="1.0" encoding="UTF-8" indent="yes" doctype-public="-//W3C//DTD XHTML 1.0 Transitional//EN" doctype-system="http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd" omit-xml-declaration="yes"/>
The next entry is the master template that processes an arbitrary tag. The trick here: the pattern "*" has a lower default priority (-0.5) than a pattern that names a specific element (0), so the rule is: "take this unless something more specific has been specified". To make sure all attributes get processed you need an apply-templates with the "@*" selector, which stands for "any attribute".
<xsl:template match="*">
<xsl:variable name="curTagName" select="name()"/>
<xsl:element name="{$curTagName}">
<!-- Walk through the attributes -->
<xsl:apply-templates select="@*"/>
<xsl:apply-templates/>
</xsl:element>
</xsl:template>
Note that the name attribute of xsl:element is an attribute value template, so the expression has to appear inside curly braces. We store the name of the current node in a variable first to keep the template readable; writing name="{name()}" directly should work just as well.
Next we create a template that copies all attributes of one element into the new element. We do that to gain the possibility to exclude or alter specific attributes. Nothing fancy there:
<xsl:template match="@*">
<xsl:variable name="curAttName" select="name()"/>
<xsl:attribute name="{$curAttName}"><xsl:value-of select="."/></xsl:attribute>
</xsl:template>
Once your template has got this far you can test it against a valid xHTML file and should get a result whose elements, attributes and text are unaltered. So we managed to follow the first IT commandment: "Do no harm".
To filter out specific elements we take advantage of XSLT's built-in prioritizing mechanism. In a nutshell: the more specific a match pattern is, the higher its priority. Check Michael Kay's XSLT book for the full rules. The second feature that helps us here is the "do nothing" template. When you specify a template with a match condition and close it immediately, the matched part gets ignored. As an example, we removed all script, object and meta tags. Please note that you need to specify the xHTML namespace with the prefix h:
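For such a test run, a small input file along the following lines might do. This is only a sketch: the file content, the URLs and the function names in the event handlers are made up for illustration. At this stage its elements and attributes should all pass through; it already contains the items the later rules are meant to remove (a script tag, event handler attributes, a javascript link and a form).

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical test input, not part of the downloadable template -->
<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <title>Filter test</title>
    <script type="text/javascript">alert('hi');</script>
  </head>
  <body onload="init()">
    <p>A <a href="http://www.example.com/" onclick="track()">normal link</a>
       and a <a href="javascript:alert('x')">script link</a>.</p>
    <form action="/post" method="post">
      <p>Text inside a form <input type="text" name="q"/></p>
    </form>
  </body>
</html>
```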
<xsl:template match="h:script"/>
<xsl:template match="h:object"/>
<xsl:template match="h:meta"/>
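If the list of unwanted tags grows, the same "do nothing" idea can be written as a single rule with a union pattern. The sketch below goes inside the xsl:stylesheet element and relies on the h: prefix declared there. Which elements end up on the list is an assumption on our side, so adjust it to your own policy.

```xml
<!-- Hypothetical extension: one empty template for several unwanted elements.
     The "|" operator builds a union pattern; each alternative keeps
     the default priority of a plain element name. -->
<xsl:template match="h:iframe | h:applet | h:style | h:link"/>
```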
One more tricky endeavour: we don't want anchor links that point to JavaScript. With a little XPath this is also only one line.
<xsl:template match="h:a[substring(@href,1,10)='javascript']"/>
Note: the comparison is case-sensitive, so you might need to handle uppercase/lowercase variants such as "JavaScript:" yourself.
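One way to deal with the case problem, sketched here as an assumption rather than as part of the downloadable template: XPath 1.0 has no function for lower-casing, so the usual workaround is translate() with the two alphabets. The rule below is meant to replace the one-liner above, not to sit next to it, because two matching templates of equal priority would conflict. The normalize-space() call is added on the assumption that browsers ignore leading whitespace in a URL. It is not a complete defence against obfuscated URLs.

```xml
<!-- Sketch: case-insensitive variant of the javascript link filter.
     Lives inside the xsl:stylesheet element that declares the h: prefix. -->
<xsl:template match="h:a[starts-with(
    translate(normalize-space(@href),
              'ABCDEFGHIJKLMNOPQRSTUVWXYZ',
              'abcdefghijklmnopqrstuvwxyz'),
    'javascript:')]"/>
```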
Next step: I don't want forms, but I want the form content (HTML inside the forms):
<xsl:template match="h:form">
<xsl:apply-templates/>
</xsl:template>
<xsl:template match="h:input"/>
<xsl:template match="h:textarea"/>
Now I'd like to make sure that all places where JavaScript could be called are removed. We didn't look into JavaScript in src attributes (but since we handled it in a/href, this shouldn't pose a problem for you). This is the moment where we introduce attribute filtering:
<xsl:template match="@onclick"/>
<xsl:template match="@ondblclick"/>
<xsl:template match="@onkeydown"/>
<xsl:template match="@onkeypress"/>
<xsl:template match="@onkeyup"/>
<xsl:template match="@onmousedown"/>
<xsl:template match="@onmousemove"/>
<xsl:template match="@onmouseout"/>
<xsl:template match="@onmouseover"/>
<xsl:template match="@onmouseup"/>
<xsl:template match="@onload"/>
<xsl:template match="@onunload"/>
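Two optional additions, both sketches under stated assumptions and not part of the downloadable template. The first drops every attribute whose name starts with "on", which covers the list above in one rule as long as none of your legitimate attributes begins with those two letters. The second handles the src case that was left open, using the same kind of predicate as the anchor filter. Both go inside the xsl:stylesheet element.

```xml
<!-- Sketch 1: catch-all for event handler attributes.
     A pattern with a predicate has default priority 0.5,
     so it wins over the generic "@*" copy template. -->
<xsl:template match="@*[starts-with(
    translate(name(),
              'ABCDEFGHIJKLMNOPQRSTUVWXYZ',
              'abcdefghijklmnopqrstuvwxyz'),
    'on')]"/>

<!-- Sketch 2: drop src attributes that point to JavaScript.
     Only the attribute is removed; the element itself stays. -->
<xsl:template match="@src[starts-with(
    translate(normalize-space(.),
              'ABCDEFGHIJKLMNOPQRSTUVWXYZ',
              'abcdefghijklmnopqrstuvwxyz'),
    'javascript:')]"/>
```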
Finally we want, following Google's suggestion, to insert the rel="nofollow" attribute in all anchor links (note that you need to alter the html tag too! We omitted that here):
<!-- Google fashion: add a rel="nofollow" to the link -->
<xsl:template match="h:a">
<xsl:element name="a">
<xsl:apply-templates select="@*"/>
<!-- Set rel after copying the attributes, so a rel supplied in the input cannot override it -->
<xsl:attribute name="rel">nofollow</xsl:attribute>
<xsl:apply-templates/>
</xsl:element>
</xsl:template>
Voila, there goes your universal HTML filter. The advantages of "outsourcing" the rules to an XSLT are obvious: one piece of code for multiple filter rules, a standard rules format, and output in many formats.
And don't forget the closing tag :-)
</xsl:stylesheet>
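For comparison, here is a rough sketch of the other mode of operation mentioned at the start ("only tags that are explicitly allowed will go through"). It is a separate stylesheet, not the one we built above, and the lists of allowed elements and attributes are purely hypothetical. Unknown elements are unwrapped so their text survives, unknown attributes are dropped, and the result is an HTML fragment rather than a full page.

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:h="http://www.w3.org/1999/xhtml">
  <xsl:output method="html" encoding="UTF-8" indent="yes"/>

  <!-- Default: unknown elements are unwrapped, their children are still processed -->
  <xsl:template match="*">
    <xsl:apply-templates/>
  </xsl:template>

  <!-- Default: unknown attributes are dropped -->
  <xsl:template match="@*"/>

  <!-- Elements whose content must vanish entirely -->
  <xsl:template match="h:head | h:script | h:style"/>

  <!-- Hypothetical whitelist of elements; adjust to your needs -->
  <xsl:template match="h:p | h:br | h:b | h:i | h:em | h:strong |
                       h:ul | h:ol | h:li | h:blockquote | h:a">
    <xsl:element name="{local-name()}">
      <xsl:apply-templates select="@*"/>
      <xsl:apply-templates/>
    </xsl:element>
  </xsl:template>

  <!-- Hypothetical whitelist of attributes -->
  <xsl:template match="@href | @title">
    <xsl:attribute name="{local-name()}"><xsl:value-of select="."/></xsl:attribute>
  </xsl:template>

  <!-- Allowed attribute, but not with a javascript: value (predicate gives priority 0.5) -->
  <xsl:template match="@href[starts-with(
      translate(normalize-space(.),
                'ABCDEFGHIJKLMNOPQRSTUVWXYZ',
                'abcdefghijklmnopqrstuvwxyz'),
      'javascript:')]"/>
</xsl:stylesheet>
```

The design trade-off: this variant fails closed, so a tag nobody thought of is stripped rather than passed through, at the price of maintaining the whitelist.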
Let us know how you like it
:-) pdw & stw
Site purpose and disclaimer
You are considering extending or replacing your Domino infrastructure. You found a lot of information about messaging migration. You didn't find much about the applications, other than tool vendors' advertisements. You realized that Domino migration is an emotional minefield. Bookmark this site; we will provide information and discuss the move from Domino to J2EE and other environments (both retaining and replacing Domino). We focus on applications, not on messaging.
The articles on this site mention products and phrases that might be subject to copyright or trademarks. We acknowledge that these belong to the owner of the respective copyright or trademark. The links on this page are provided for convenience and constitute no endorsement of the content of the target site.
So once you're ready to discuss if and/or how to move away from Domino, contact us.