C#: Html Stripper

February 13th, 2012

    Some time ago I developed a project that compared the visible content of two HTML pages and displayed all the differences found. The main goal of the project was to detect whether a page had been updated over time, so the comparison was usually between different versions of the same HTML page. To fetch the visible content and throw away the hidden content, I created a Regex-based HTML stripper, embodied in the HtmlStripper class. The HtmlStripper follows these three steps:

  1. remove all service and auxiliary HTML tags. In most cases that means removing the <head> tag with all its content, along with <style>, <script> and other such tags residing within <body>;
  2. remove all the remaining HTML tags;
  3. replace the escape sequences found in the HTML with their text equivalents. In practice that means that, for example,
    • the &nbsp; sequence should be replaced with the space character ' ';
    • &lt; with '<';
    • &gt; with '>', and so on.

    There is a huge number of escape sequences (or HTML codes), so the HtmlStripper handles only those I consider the most common.
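
As a rough illustration of the order of these steps, here is a minimal sketch. It is not the HtmlStripper class: the name ThreeStepSketch and both simplified patterns are assumptions made for this example, and the framework's WebUtility.HtmlDecode stands in for the hand-written substitution table used in step 3.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical, simplified illustration of the three steps; not the HtmlStripper class itself.
public static class ThreeStepSketch
{
    // Step 1: elements whose content is never rendered (only three of them, for brevity).
    private static readonly Regex InvisibleElements = new Regex(
        @"<(head|style|script)\b[^>]*>.*?</\1\s*>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Step 2: comments and any remaining tag (deliberately naive: it does not
    // cope with a '>' character inside an attribute value).
    private static readonly Regex AnyTag = new Regex(
        @"<!--.*?-->|</?[A-Za-z!][^>]*>",
        RegexOptions.Singleline);

    public static string Strip(string html)
    {
        string text = InvisibleElements.Replace(html, "");
        text = AnyTag.Replace(text, "");
        // Step 3: the framework decoder stands in for a hand-written table.
        return WebUtility.HtmlDecode(text);
    }
}
```

With this sketch, ThreeStepSketch.Strip("<p>a &lt; b</p><script>x()</script>") returns "a < b". Decoding comes last on purpose: if the escape sequences were decoded first, text such as &lt;b&gt; would turn into something that looks like a tag and would be removed in step 2.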

The code of HtmlStripper is shown below:

HtmlStripper Source

using System;
using System.Text.RegularExpressions;

namespace Helpers
{
    public static class HtmlStripper
    {
        #region fields
        /// <summary>
        /// Allows to find the HTML tags hidden from view (style, script code and so on)
        /// </summary>
        private static readonly Regex _findHtmlTagsWithInvisibleTextRegex = new Regex
            (@"<head[^>]*?>.*?</head> | <style[^>]*?>.*?</style> | <script[^>]*?>.*?</script> | 
               <object[^>]*?>.*?</object> | <embed[^>]*?>.*?</embed> | <applet[^>]*?>.*?</applet> |
               <noframes[^>]*?>.*?</noframes> | <noscript[^>]*?>.*?</noscript> | <noembed[^>]*?>.*?</noembed>",
                RegexOptions.Compiled | RegexOptions.IgnoreCase | RegexOptions.IgnorePatternWhitespace | RegexOptions.Singleline);

        /// <summary>
        /// Allows to find all HTML tags
        /// </summary>
        private static readonly Regex _findHtmlTagsRegex = new Regex
            (@"<(?:(!--) |(\?) |(?i:( TITLE  | SCRIPT | APPLET | OBJECT | STYLE )) | ([!/A-Za-z]))(?(4)(?:(?![\s=][""`'])[^>] |[\s=]`[^`]*`
             |[\s=]'[^']*'|[\s=]""[^""]*"")*|.*?)(?(1)(?<=--))(?(2)(?<=\?))(?(3)</(?i:\3)(?:\s[^>]*)?)>",
             RegexOptions.Compiled | RegexOptions.IgnoreCase | RegexOptions.IgnorePatternWhitespace | RegexOptions.Singleline);

        private static string _escapeGroupName = "escSeq";
        /// <summary>
        /// Allows to find escape sequences that are used in HTML
        /// </summary>
        private static readonly Regex _findHtmlEscapeSequencesRegex = new Regex
            (string.Format(@"[&] (([#](?<{0}>\d+)) | ([#](?<{0}>x[\dabcdef]+)) | (?<{0}>\w+));?", _escapeGroupName),
            RegexOptions.Compiled | RegexOptions.IgnoreCase | RegexOptions.IgnorePatternWhitespace | RegexOptions.Singleline);
        #endregion

        public static string Strip(string htmlSrc)
        {
            string xTmpStr = _findHtmlTagsWithInvisibleTextRegex.Replace(htmlSrc, "");
            xTmpStr = _findHtmlTagsRegex.Replace(xTmpStr, "");
            return RemoveHtmlEscapeSequences(xTmpStr);
        }        

        private static string RemoveHtmlEscapeSequences(string htmlStr)
        {
            return _findHtmlEscapeSequencesRegex.Replace(htmlStr, delegate(Match match)
            {
                string replacement = GetHtmlEscapeSequenceSubstitution(match.Groups[_escapeGroupName].Value);
                return replacement == null ? match.Value : replacement;
            });
        }

        private static string GetHtmlEscapeSequenceSubstitution(string htmlEscape)
        {
            bool isNumber = false;
            int num;

            // if it's a hex number, convert it into string containing decimal number;
            // the regex ignores case, so both "x" and "X" prefixes are accepted, and TryParse
            // keeps names that merely start with "x" (e.g. "xml") from throwing
            if (htmlEscape.StartsWith("x", StringComparison.OrdinalIgnoreCase) &&
                Int32.TryParse(htmlEscape.Substring(1), System.Globalization.NumberStyles.HexNumber,
                               System.Globalization.CultureInfo.InvariantCulture, out num))
            {
                htmlEscape = num.ToString();
                isNumber = true;
            }
            // check if it's a decimal number; formatting it again drops leading zeros
            else if (Int32.TryParse(htmlEscape, out num))
            {
                htmlEscape = num.ToString();
                isNumber = true;
            }

            // find substitution for either the peeled number or the original string
            string res = FindSubstitution(htmlEscape);

            // if it's a number, any further attempts with different string variations are senseless
            if (!isNumber && res == null)
            {
                // there are a few Html codes consisting of all capital letters,
                // try to find substitution for them
                htmlEscape = htmlEscape.ToUpperInvariant();
                res = FindSubstitution(htmlEscape);

                if (res == null)
                {
                    // try to find substitution for string with the first capital letter
                    htmlEscape = htmlEscape.Substring(0, 1).ToUpperInvariant() + htmlEscape.Substring(1).ToLowerInvariant();
                    res = FindSubstitution(htmlEscape);
                }

                if (res == null)
                {
                    // try to find substitution for the lower string
                    htmlEscape = htmlEscape.ToLowerInvariant();
                    res = FindSubstitution(htmlEscape);
                }
            }

            return res;
        }

        private static string FindSubstitution(string htmlEscape)
        {
            switch (htmlEscape)
            {
                // All browsers support

                /* space */                             /* zero */
                case "32": return " ";                  case "48": return "0";
                /* exclamation point */                 /* one */
                case "33": return "!";                  case "49": return "1";
                /* double quotes */                     /* two */
                case "34": case "quot": return "\"";    case "50": return "2";
                /* number sign */                       /* three */
                case "35": return "#";                  case "51": return "3";
                /* dollar sign */                       /* four */
                case "36": return "$";                  case "52": return "4";
                /* percent sign */                      /* five */
                case "37": return "%";                  case "53": return "5";
                /* ampersand */                         /* six */
                case "38": case "amp": return "&";      case "54": return "6";
                /* single quote */                      /* seven */
                case "39": return "'";                  case "55": return "7";
                /* opening parenthesis */               /* eight */
                case "40": return "(";                  case "56": return "8";
                /* closing parenthesis */               /* nine */
                case "41": return ")";                  case "57": return "9";
                /* asterisk */                          /* colon */
                case "42": return "*";                  case "58": return ":";
                /* plus sign */                         /* semicolon */
                case "43": return "+";                  case "59": return ";";
                /* comma */                             /* less than sign */
                case "44": return ",";                  case "60": case "lt": return "<";
                /* minus sign - hyphen */               /* equal sign */
                case "45": return "-";                  case "61": return "=";
                /* period */                            /* greater than sign */
                case "46": return ".";                  case "62": case "gt": return ">";
                /* slash */                             /* question mark */
                case "47": return "/";                  case "63" : return "?";
                
                /* at symbol */             case "83": return "S";
                case "64": return "@";      case "84": return "T";
                case "65": return "A";      case "85": return "U";
                case "66": return "B";      case "86": return "V";
                case "67": return "C";      case "87": return "W";
                case "68": return "D";      case "88": return "X";
                case "69": return "E";      case "89": return "Y";
                case "70": return "F";      case "90": return "Z";
                case "71": return "G";      /* opening bracket */
                case "72": return "H";      case "91": return "[";
                case "73": return "I";      /* backslash */
                case "74": return "J";      case "92": return "\\";
                case "75": return "K";      /* closing bracket */
                case "76": return "L";      case "93": return "]";
                case "77": return "M";      /* caret - circumflex */
                case "78": return "N";      case "94": return "^";
                case "79": return "O";      /* underscore */
                case "80": return "P";      case "95": return "_";
                case "81": return "Q";      /* grave accent */
                case "82": return "R";      case "96": return "`";                                            
                         
                case "97": return "a";      case "114": return "r";
                case "98": return "b";      case "115": return "s";
                case "99": return "c";      case "116": return "t";
                case "100": return "d";     case "117": return "u";
                case "101": return "e";     case "118": return "v";
                case "102": return "f";     case "119": return "w";
                case "103": return "g";     case "120": return "x";
                case "104": return "h";     case "121": return "y";
                case "105": return "i";     case "122": return "z";
                case "106": return "j";     /* opening brace */
                case "107": return "k";     case "123": return "{";
                case "108": return "l";     /* vertical bar */
                case "109": return "m";     case "124": return "|";
                case "110": return "n";     /* closing brace */
                case "111": return "o";     case "125": return "}";
                case "112": return "p";     /* equivalency sign - tilde */
                case "113": return "q";     case "126": return "~";

                /* non-breaking space */                    /* degree sign */
                case "160": case "nbsp" : return " ";       case "176": case "deg": return "°";
                /* inverted exclamation mark */             /* plus-or-minus sign */
                case "161": case "iexcl": return "¡";       case "177": case "plusmn": return "±";
                /* cent sign */                             /* superscript two - squared */
                case "162": case "cent" : return "¢";       case "178": case "sup2": return "²";
                /* pound sign */                            /* superscript three - cubed */
                case "163": case "pound": return "£";       case "179": case "sup3": return "³";
                /* currency sign */                         /* acute accent - spacing acute */
                case "164": case "curren": return "¤";      case "180": case "acute": return "´";
                /* yen sign */                              /* micro sign */
                case "165": case "yen": return "¥";         case "181": case "micro": return "µ";
                /* broken vertical bar */                   /* pilcrow sign - paragraph sign */
                case "166": case "brvbar": return "¦";      case "182": case "para": return "¶";
                /* section sign */                          /* middle dot - Georgian comma */
                case "167": case "sect": return "§";        case "183": case "middot": return "·";
                /* spacing diaeresis - umlaut */            /* spacing cedilla */
                case "168": case "uml": return "¨";         case "184": case "cedil": return "¸";
                /* copyright sign */                        /* superscript one */
                case "169": case "copy": return "©";        case "185": case "sup1": return "¹";
                /* feminine ordinal indicator */            /* masculine ordinal indicator */
                case "170": case "ordf": return "ª";        case "186": case "ordm": return "º";
                /* left double angle quotes */              /* right double angle quotes */
                case "171": case "laquo": return "«";       case "187": case "raquo": return "»";
                /* not sign */                              /* fraction one quarter */
                case "172": case "not": return "¬";         case "188": case "frac14": return "¼";
                /* soft hyphen */                           /* fraction one half */
                case "173": case "shy": return "­";          case "189": case "frac12": return "½";
                /* registered trade mark sign */            /* fraction three quarters */
                case "174": case "reg": return "®";         case "190": case "frac34": return "¾";
                /* spacing macron - overline */             /* inverted question mark */
                case "175": case "macr": return "¯";        case "191": case "iquest": return "¿";

                /* latin capital letter A with grave */         /* latin capital letter ETH */
                case "192": case "Agrave": return "À";          case "208": case "ETH": return "Ð";
                /* latin capital letter A with acute */         /* latin capital letter N with tilde */
                case "193": case "Aacute": return "Á";          case "209": case "Ntilde": return "Ñ";
                /* latin capital letter A with circumflex */    /* latin capital letter O with grave */
                case "194": case "Acirc": return "Â";           case "210": case "Ograve": return "Ò";
                /* latin capital letter A with tilde */         /* latin capital letter O with acute */
                case "195": case "Atilde": return "Ã";          case "211": case "Oacute": return "Ó";
                /* latin capital letter A with diaeresis */     /* latin capital letter O with circumflex */
                case "196": case "Auml": return "Ä";            case "212": case "Ocirc": return "Ô";
                /* latin capital letter A with ring above */    /* latin capital letter O with tilde */
                case "197": case "Aring": return "Å";           case "213": case "Otilde": return "Õ";
                /* latin capital letter AE */                   /* latin capital letter O with diaeresis */
                case "198": case "AElig": return "Æ";           case "214": case "Ouml": return "Ö";
                /* latin capital letter C with cedilla */       /* multiplication sign */
                case "199": case "Ccedil": return "Ç";          case "215": case "times": return "×";
                /* latin capital letter E with grave */         /* latin capital letter O with slash */
                case "200": case "Egrave": return "È";          case "216": case "Oslash": return "Ø";
                /* latin capital letter E with acute */         /* latin capital letter U with grave */
                case "201": case "Eacute": return "É";          case "217": case "Ugrave": return "Ù";
                /* latin capital letter E with circumflex */    /* latin capital letter U with acute */
                case "202": case "Ecirc": return "Ê";           case "218": case "Uacute": return "Ú";
                /* latin capital letter E with diaeresis */     /* latin capital letter U with circumflex */
                case "203": case "Euml": return "Ë";            case "219": case "Ucirc": return "Û";
                /* latin capital letter I with grave */         /* latin capital letter U with diaeresis */
                case "204": case "Igrave": return "Ì";          case "220": case "Uuml": return "Ü";
                /* latin capital letter I with acute */         /* latin capital letter Y with acute */
                case "205": case "Iacute": return "Í";          case "221": case "Yacute": return "Ý";
                /* latin capital letter I with circumflex */    /* latin capital letter THORN */
                case "206": case "Icirc": return "Î";           case "222": case "THORN": return "Þ";
                /* latin capital letter I with diaeresis */     /* latin small letter sharp s - ess-zed */
                case "207": case "Iuml": return "Ï";            case "223": case "szlig": return "ß";
                
                /* latin small letter a with grave */           /* latin small letter eth */
                case "224": case "agrave": return "à";          case "240": case "eth": return "ð";
                /* latin small letter a with acute */           /* latin small letter n with tilde */
                case "225": case "aacute": return "á";          case "241": case "ntilde": return "ñ";
                /* latin small letter a with circumflex */      /* latin small letter o with grave */
                case "226": case "acirc": return "â";           case "242": case "ograve": return "ò";
                /* latin small letter a with tilde */           /* latin small letter o with acute */
                case "227": case "atilde": return "ã";          case "243": case "oacute": return "ó";
                /* latin small letter a with diaeresis */       /* latin small letter o with circumflex */
                case "228": case "auml": return "ä";            case "244": case "ocirc": return "ô";
                /* latin small letter a with ring above */      /* latin small letter o with tilde */
                case "229": case "aring": return "å";           case "245": case "otilde": return "õ";
                /* latin small letter ae */                     /* latin small letter o with diaeresis */
                case "230": case "aelig": return "æ";           case "246": case "ouml": return "ö";
                /* latin small letter c with cedilla */         /* division sign */
                case "231": case "ccedil": return "ç";          case "247": case "divide": return "÷";
                /* latin small letter e with grave */           /* latin small letter o with slash */
                case "232": case "egrave": return "è";          case "248": case "oslash": return "ø";
                /* latin small letter e with acute */           /* latin small letter u with grave */
                case "233": case "eacute": return "é";          case "249": case "ugrave": return "ù";
                /* latin small letter e with circumflex */      /* latin small letter u with acute */
                case "234": case "ecirc": return "ê";           case "250": case "uacute": return "ú";
                /* latin small letter e with diaeresis */       /* latin small letter u with circumflex */
                case "235": case "euml": return "ë";            case "251": case "ucirc": return "û";
                /* latin small letter i with grave */           /* latin small letter u with diaeresis */
                case "236": case "igrave": return "ì";          case "252": case "uuml": return "ü";
                /* latin small letter i with acute */           /* latin small letter y with acute */
                case "237": case "iacute": return "í";          case "253": case "yacute": return "ý";
                /* latin small letter i with circumflex */      /* latin small letter thorn */
                case "238": case "icirc": return "î";           case "254": case "thorn": return "þ";
                /* latin small letter i with diaeresis */       /* latin small letter y with diaeresis */
                case "239": case "iuml": return "ï";            case "255": case "yuml": return "ÿ";

                // Browser support: Internet Explorer > 4, Netscape > 4   
                
                /* latin capital letter OE */                   /* single low-9 quotation mark */    
                case "338": return "Œ";                         case "8218": return "‚";             
                /* latin small letter oe */                     /* left double quotation mark */     
                case "339": return "œ";                         case "8220": return "“";             
                /* latin capital letter S with caron */         /* right double quotation mark */
                case "352": return "Š";                         case "8221": return "”";
                /* latin small letter s with caron */           /* double low-9 quotation mark */
                case "353": return "š";                         case "8222": return "„";
                /* latin capital letter Y with diaeresis */     /* dagger */
                case "376": return "Ÿ";                         case "8224": return "†";
                /* latin small f with hook - function */        /* double dagger */
                case "402": return "ƒ";                         case "8225": return "‡";
                                                                /* bullet */
                /* en dash */                                   case "8226": return "•";
                case "8211": return "–";                        /* horizontal ellipsis */
                /* em dash */                                   case "8230": return "…";
                case "8212": return "—";                        /* per thousand sign */
                /* left single quotation mark */                case "8240": return "‰";
                case "8216": return "‘";                        /* euro sign */
                /* right single quotation mark */               case "8364": case "euro": return "€";
                case "8217": return "’";                        /* trade mark sign */
                                                                case "8482": return "™";
            }

            return null;
        }        
    }
}
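
The table above lists every supported code explicitly. One possible way to cover numeric codes that are absent from it is to convert the number straight into a character. The following is only a sketch under that assumption; the class and method names (NumericReferenceSketch, Decode) are hypothetical and not part of HtmlStripper, and named sequences such as &copy; would still need a lookup table.

```csharp
using System;
using System.Globalization;

// Hypothetical helper, shown for illustration only.
public static class NumericReferenceSketch
{
    // Takes the part that follows "&#" (for example "8364" or "x20AC") and returns
    // the corresponding character, or null when the value is not a usable code point.
    public static string Decode(string reference)
    {
        if (string.IsNullOrEmpty(reference))
            return null;

        int codePoint;
        bool isHex = reference[0] == 'x' || reference[0] == 'X';
        bool parsed = isHex
            ? int.TryParse(reference.Substring(1), NumberStyles.AllowHexSpecifier,
                           CultureInfo.InvariantCulture, out codePoint)
            : int.TryParse(reference, NumberStyles.None,
                           CultureInfo.InvariantCulture, out codePoint);

        // char.ConvertFromUtf32 throws for values outside the Unicode range and for
        // surrogates, so those are rejected here (zero and negative values as well).
        if (!parsed || codePoint <= 0 || codePoint > 0x10FFFF ||
            (codePoint >= 0xD800 && codePoint <= 0xDFFF))
            return null;

        return char.ConvertFromUtf32(codePoint);
    }
}
```

For example, both NumericReferenceSketch.Decode("8364") and NumericReferenceSketch.Decode("x20AC") return "€", while Decode("0") returns null so that the caller can leave the original sequence untouched.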

Applying the HtmlStripper to a simple piece of HTML like this

<div>text in the first div&nbsp;
     <div>&lt;text in the nested div&gt;</div>
     &nbsp;&#8364;text in the first div again
</div>

produces the following result:

text in the first div 
    <text in the nested div>
     €text in the first div again

The next code sample demonstrates how to use HtmlStripper in practice:

using System;
using System.Text;
using System.Net;
using System.IO;
using Helpers;

namespace HtmlStripperConsoleApp
{
    class Program
    {
        static void Main(string[] args)
        {
            string responseString = null;

            WebRequest request = WebRequest.Create("http://dotnetfollower.com/wordpress/2011/12/sharepoint-understanding-businessdata-column-bdc-field/");

            // dispose the response, the stream and the reader once the page has been read
            using (WebResponse response = request.GetResponse())
            using (Stream stream = response.GetResponseStream())
            using (StreamReader reader = new StreamReader(stream, Encoding.UTF8))
            {
                responseString = reader.ReadToEnd();
            }
            
            string strippedText = HtmlStripper.Strip(responseString);
            File.WriteAllText(@"c:\text.txt", strippedText);            
        }
    }
}
 