Home > C#, Regex > C#: Html Stripper

C#: Html Stripper

February 13th, 2012
Leave a comment Go to comments

    Some time ago I developed the project comparing the visible content of two html-pages and displaying all found differences. The main goal of project was to detect whether page was updated through the time, so usually comparison was between different versions of one html-page. To fetch the visible content and throw away the hidden one I created the Regex-based Html stripper embodied in the HtmlStripper class. The HtmlStripper follows the next three steps:

  1. to remove all service and auxiliary Html tags. In most cases that means to remove the <head>-tag with all its content, along with the <style>, <script> and other tags residing within <body>;
  2. to remove all the rest Html tags;
  3. to replace escape sequences found in Html with its text analogs. In practice that means that, for example,
    • the &nbsp; sequence should be replaced with the space symbol ‘ ‘;
    • &lt; – with ‘<‘;
    • &gt – with ‘>’ and so on

    There is a huge amount of escape sequences (or HTML codes), so the HtmlStripper operates with the most popular in my opinion;

So, the code of HtmlStripper is shown below:

Applying the HtmlStripper to the simple piece of Html like this

 
<div>text in the first div&nbsp;
     <div><text in the nested div><div/>
     &nbsp;&#8364;text in the first div again
</div>

the following result is got

text in the first div 
    <text in the nested div>
     €text in the first div again

The next code sample demonstrates how to use HtmlStripper in field conditions:

using System;
using System.Text;
using System.Web;
using System.Net;
using System.IO;
using Helpers;

namespace HtmlStripperConsoleApp
{
    class Program
    {
        static void Main(string[] args)
        {
            String responseString = null;

            WebRequest request = HttpWebRequest.Create("http://dotnetfollower.com/wordpress/2011/12/sharepoint-understanding-businessdata-column-bdc-field/");            
        
            using(WebResponse response = request.GetResponse())
                using (Stream stream = response.GetResponseStream())
                {
                    StreamReader reader = new StreamReader(stream, Encoding.UTF8);
                    responseString = reader.ReadToEnd();
                }
            
            string strippedText = HtmlStripper.Strip(responseString);
            File.WriteAllText(@"c:\text.txt", strippedText);            
        }
    }
}
 
Categories: C#, Regex Tags: ,
  1. No comments yet.
  1. No trackbacks yet.