Archive

Archive for the ‘Regex’ Category

Knockout.js: HtmlNoScript binding

February 14th, 2013 No comments

    The Knockout.js is one of the most popular and fast-developing libraries that bring the MVVM pattern into the world of JavaScript development. Knockout.js allows to declaratively bind UI (Document Object Model elements) to an underlying data model. One of the built-in bindings is the html-binding allowing to display an arbitrary Html provided by an associated property/field of the data model. The typical use of this binding may look like this:

<div data-bind="html: comment" />

where the comment property may be specified as shown in the following data model:

...
// include the knockout library
<script src="js/knockout/knockout-2.2.0.debug.js" type="text/javascript"></script>
...
$(function () {
	// define a data model
	var viewModel = function () {
        var self = this;		
		self.comment = ko.observable("some html comment <span>Hello<script type='text/javascript'>alert('Hello!');<\/script></span>");
		...
	}
	...
	// create instance of the data model
	var viewModelInstance = new viewModel();
	// bind the instance to UI (since that moment all of the declarative bindings have worked off and the data are displayed)
	ko.applyBindings(viewModelInstance);
});

Applying a value to a DOM element, the html-binding exploits the jQuery‘s html() method, of course, if jQuery is available on the page; otherwise knockout.js relies on its own logic that ends up with call of the DOM‘s appendChild function. In both cases if the html being applied contains inclusions of JavaScripts, the scripts execute once the html has been added to the DOM. Preventing all unwanted scripts from running is the best practice when displaying random htmls on the page. Using the approach described in my post How to prevent execution of JavaScript within a html being added to the DOM we are able to develop our own knockout binding disabling the scripts. It could look like the following:

 ko.bindingHandlers.htmlNoScripts = {
    init: function () {
        // Prevent binding on the dynamically-injected HTML
        return { 'controlsDescendantBindings': true };
    },
    update: function (element, valueAccessor, allBindingsAccessor) {
        // First get the latest data that we're bound to
        var value = valueAccessor();
        // Next, whether or not the supplied model property is observable, get its current value
        var valueUnwrapped = ko.utils.unwrapObservable(value);
		// disable scripts
        var disarmedHtml = valueUnwrapped.replace(/<script(?=(\s|>))/i, '<script type="text/xml" ');
        ko.utils.setHtml(element, disarmedHtml);
    }
};
 

An alternative way to prevent scripts from running is to use the innerHTML property of a DOM element as follows:

 var html = "<script type='text/javascript'>alert('Hello!');<\/script>";
 var newSpan = document.createElement('span');
 newSpan.innerHTML = str;
 document.getElementById('content').appendChild(newSpan);
 

Some people, however, report that this doesn’t work in Internet Explorer 7 for unknown reasons. So, let’s combine these two methods in one to minimize the risk of unwanted scripts running. My tests show the combined approach successfully prevent JavaScript from being executed in all popular browsers. Even in the worst case, at least one of the methods works off. The resultant knockout binding may look like the following:

ko.bindingHandlers.htmlNoScripts = {
    init: function () {
        // Prevent binding on the dynamically-injected HTML
        return { 'controlsDescendantBindings': true };
    },
    update: function (element, valueAccessor, allBindingsAccessor) {
        // First get the latest data that we're bound to
        var value = valueAccessor();
        // Next, whether or not the supplied model property is observable, get its current value
        var valueUnwrapped = ko.utils.unwrapObservable(value);
        // disable scripts
        var disarmedHtml = valueUnwrapped.replace(/<script(?=(\s|>))/i, '<script type="text/xml" ');        
        // create a wrapping element
        var newSpan = document.createElement('span');
        // safely set internal html of the wrapping element
        newSpan.innerHTML = disarmedHtml;
        // clear the associated node from the previous content
        ko.utils.emptyDomNode(element);
        // add the sanitized html to the DOM
        element.appendChild(newSpan);
    }
};

This htmlNoScripts binding can be used as follows:

<div data-bind="htmlNoScripts: comment" />

Related posts:

JavaScript: How to prevent execution of JavaScript within a html being added to the DOM

February 5th, 2013 No comments

    Getting pieces of html from an external web service, I display them “as is” on a page. To insert the htmls into the document I use handy jQuery methods: append() and html(). The only problem is the htmls contain inclusions of JavaScript that is executed whenever I add the htmls to the DOM (Document Object Model). Having looked for an acceptable solution to prevent untrusted scripts from running I found out that almost all solutions come to removing script-tags from html using the Regular Expressions. I tried a couple of regexes, and they really worked. However, every time I could find such a combination of script-tag, JavaScript inside it and wrapping html when the regexes either didn’t recognize a script block or cut out more than it was needed. And I don’t even mention that parsing an arbitrary HTML with the Regular Expressions is a bad idea in all respects :). Trying to find another solution, I recalled that if we replace the type=”text/javascript” of a script-tag with the type=”text/xml”, the JavaScript inside will not execute. This is quite known fact, and a few JavaScript libraries use this behavior for their purposes. However, two things impeded me to implement the direct replacing of one type with another: the type=”text/javascript” sub-string may encounter outside the <script>, somewhere in the content (like in this article 🙂 ); the script-tag may be without the type attribute at all, and, in this case, the code inside will be run as if the type=”text/javascript” is specified. Taking into account these conditions, I developed a few tricky regexes, that seemed to be fairly complicated and not reliable enough though. After all, this led me to a thought to examine what happens if the script-tag has two type attributes. Something like this:

<script type="text/xml" type="text/javascript">
    alert('Hello!');
</script>

All browsers I tested this in didn’t execute JavaScript. On the contrary, if I change the type attributes over, the script is executed. Only the first type attribute seems to be efficient. So, my resultant solution is based on the following assumptions:

  • JavaScript inside the <script type=”text/xml”> doesn’t execute;
  • JavaScript inside the <script type=”text/xml” type=”text/javascript”> doesn’t execute as well because only the first type is considered;

So, below is a very simple JavaScript function, which prevents scripts from running and seems to avoid most of drawbacks:

function preventJS(html) {
    return html.replace(/<script(?=(\s|>))/i, '<script type="text/xml" ');
}

The solution still relies on the Regular Expressions, but the regex itself is quite simple and unambiguous. The script-tags are preserved in the html, so you can treat them later in a manner you want.

Below is an example of use

<html>
  <head>
    <title>Prevent scripts from running</title>
    <script src="jquery-1.8.3.js" type="text/javascript"></script>
  </head>
  <body>
    <div id="content"></div>
    <script type="text/javascript">
      $(function () {
        var html1 = "<span>Hello 1<script type='text/javascript'>alert('Hello 1!');<\/script></span>"
        var html2 = "<span>Hello 2 &lt;script&gt; alert('Hi') &lt;/script&gt; <script   type='text/javascript'>alert('Hello 2!');<\/script></span>";
        var html3 = "<script src=\"someJs.js\" type='text/javascript'><\/script>";
        var html4 = "<script>alert('Hello 4!');<\/script>";
        var html5 = "<scriptsomeAttr >alert('Hello 5!');";

        $("#content").html(preventJS(html1));
        $("#content").append(preventJS(html2));
        $("#content").append(preventJS(html3));
        $("#content").append(preventJS(html4));
        $("#content").append(preventJS(html5));
      });

      function preventJS(html) {
        return html.replace(/<script(?=(\s|>))/i, '<script type="text/xml" ');
      }
    </script>        
  </body>
</html>

I tested this solution in Google Chrome v. 24.0.1312.56, FireFox v. 18.0.1 and Internet Explorer 9 v.9.0.8112.16421, and it works as directed. However, I still have doubts whether it’s applicable for all browsers and their versions. So, if you have tested it, please don’t hesitate to post a comment here with the result, browser’s name and version you use.

For test purposes you can download a html page and a couple of js-files here. If the preventing works correctly, having opened the page in a browser, you shouldn’t get any alerts.

Related posts:
Categories: JavaScript, jQuery, Regex Tags: , ,

C#: Html Stripper

February 13th, 2012 No comments

    Some time ago I developed the project comparing the visible content of two html-pages and displaying all found differences. The main goal of project was to detect whether page was updated through the time, so usually comparison was between different versions of one html-page. To fetch the visible content and throw away the hidden one I created the Regex-based Html stripper embodied in the HtmlStripper class. The HtmlStripper follows the next three steps:

  1. to remove all service and auxiliary Html tags. In most cases that means to remove the <head>-tag with all its content, along with the <style>, <script> and other tags residing within <body>;
  2. to remove all the rest Html tags;
  3. to replace escape sequences found in Html with its text analogs. In practice that means that, for example,
    • the &nbsp; sequence should be replaced with the space symbol ‘ ‘;
    • &lt; – with ‘<‘;
    • &gt – with ‘>’ and so on

    There is a huge amount of escape sequences (or HTML codes), so the HtmlStripper operates with the most popular in my opinion;

So, the code of HtmlStripper is shown below:

HtmlStripper Source

using System;
using System.Text.RegularExpressions;

namespace Helpers
{
    public static class HtmlStripper
    {
        #region fields
        /// <summary>
        /// Allows to find the HTML tags hidden from view (style, script code and so on)
        /// </summary>
        private static Regex _findHtmlTagsWithInvisibleTextRegex = new Regex
            (@"<head[^>]*?>.*?</head> | <style[^>]*?>.*?</style> | <script[^>]*?.*?</script> | 
               <object[^>]*?.*?</object> | <embed[^>]*?.*?</embed> | <applet[^>]*?.*?</applet> |
               <noframes[^>]*?.*?</noframes> | <noscript[^>]*?.*?</noscript> | <noembed[^>]*?.*?</noembed>",
                RegexOptions.Compiled | RegexOptions.IgnoreCase | RegexOptions.IgnorePatternWhitespace | 

RegexOptions.Singleline);

        /// <summary>
        /// Allows to find all HTML tags
        /// </summary>
        private static Regex _findHtmlTagsRegex = new Regex
            (@"<(?:(!--) |(\?) |(?i:( TITLE  | SCRIPT | APPLET | OBJECT | STYLE )) | ([!/A-Za-z]))(?(4)(?:(?![\s=][""`'])

[^>] |[\s=]`[^`]*`
             |[\s=]'[^']*'|[\s=]""[^""]*"")*|.*?)(?(1)(?<=--))(?(2)(?<=\?))(?(3)</(?i:\3)(?:\s[^>]*)?)>",
             RegexOptions.Compiled | RegexOptions.IgnoreCase | RegexOptions.IgnorePatternWhitespace | 

RegexOptions.Singleline);

        private static string _escapeGroupName = "escSeq";
        /// <summary>
        /// Allows to find escape sequences that are used in HTML
        /// </summary>
        private static Regex _findHtmlEscapeSequencesRegex = new Regex
            (string.Format(@"[&] (([#](?<{0}>\d+)) | ([#](?<{0}>x[\dabcdef]+)) | (?<{0}>\w+));?", _escapeGroupName),
            RegexOptions.Compiled | RegexOptions.IgnoreCase | RegexOptions.IgnorePatternWhitespace | 

RegexOptions.Singleline);
        #endregion

        public static string Strip(string htmlSrc)
        {
            string xTmpStr = _findHtmlTagsWithInvisibleTextRegex.Replace(htmlSrc, "");
            xTmpStr = _findHtmlTagsRegex.Replace(xTmpStr, "");
            return RemoveHtmlEscapeSequences(xTmpStr);
        }        

        private static string RemoveHtmlEscapeSequences(string htmlStr)
        {
            return _findHtmlEscapeSequencesRegex.Replace(htmlStr, delegate(Match match)
            {
                string replacement = GetHtmlEscapeSequenceSubstitution(match.Groups[_escapeGroupName].Value);
                return replacement == null ? match.Value : replacement;
            });
        }

        private static string GetHtmlEscapeSequenceSubstitution(string htmlEscape)
        {
            htmlEscape = htmlEscape.TrimStart('0'); // sometimes number contains leading zeros

            bool? isNumber = null;

            // if it's a hex number, convert it into string containing decimal number
            if (htmlEscape.StartsWith("x"))
            {
                int num = Int32.Parse(htmlEscape.TrimStart('x'), System.Globalization.NumberStyles.HexNumber);
                htmlEscape = num.ToString();
                isNumber = true;
            }

            // check if it's a number
            if(isNumber == null)
            {
                int tmpOut;
                isNumber = int.TryParse(htmlEscape, out tmpOut);
            }

            // find substitution for either the peeled number or the original string
            string res = FindSubstitution(htmlEscape);

            // if it's a number, any further attempts with different string variations are senseless
            if (!isNumber.Value && res == null)
            {                
                // there are a few Html codes consisting of all capital letters,
                // try to find substitution for them
                htmlEscape = htmlEscape.ToUpperInvariant();
                res = FindSubstitution(htmlEscape);               

                if (res == null)
                {
                    // try to find substitution for string with the first capital letter
                    htmlEscape = htmlEscape.Substring(0, 1).ToUpperInvariant() + htmlEscape.Substring(1).ToLowerInvariant();
                    res = FindSubstitution(htmlEscape);
                }

                if (res == null)
                {
                    // try to find substitution for the lower string
                    htmlEscape = htmlEscape.ToLowerInvariant();
                    res = FindSubstitution(htmlEscape);
                }
            }

            return res;
        }

        private static string FindSubstitution(string htmlEscape)
        {
            switch (htmlEscape)
            {
                // All browsers support

                /* space */                             /* zero */
                case "32": return " ";                  case "48": return "0";
                /* exclamation point */                 /* one */
                case "33": return "!";                  case "49": return "1";
                /* double quotes */                     /* two */
                case "34": case "quot": return "\"";    case "50": return "2";
                /* number sign */                       /* three */
                case "35": return "#";                  case "51": return "3";
                /* dollar sign */                       /* four */
                case "36": return "$";                  case "52": return "4";
                /* percent sign */                      /* five */
                case "37": return "%";                  case "53": return "5";
                /* ampersand */                         /* six */
                case "38": case "amp": return "&";      case "54": return "6";
                /* single quote */                      /* seven */
                case "39": return "'";                  case "55": return "7";
                /* opening parenthesis */               /* eight */
                case "40": return "(";                  case "56": return "8";
                /* closing parenthesis */               /* nine */
                case "41": return ")";                  case "57": return "9";
                /* asterisk */                          /* colon */
                case "42": return "*";                  case "58": return ":";
                /* plus sign */                         /* semicolon */
                case "43": return "+";                  case "59": return ";";
                /* comma */                             /* less than sign */
                case "44": return ",";                  case "60": case "lt": return "<";
                /* minus sign - hyphen */               /* equal sign */
                case "45": return "-";                  case "61": return "=";
                /* period */                            /* greater than sign */
                case "46": return ".";                  case "62": case "gt": return ">";
                /* slash */                             /* question mark */
                case "47": return "/";                  case "63" : return "?";
                
                /* at symbol */             case "83": return "S";
                case "64": return "@";      case "84": return "T";
                case "65": return "A";      case "85": return "U";
                case "66": return "B";      case "86": return "V";
                case "67": return "C";      case "87": return "W";
                case "68": return "D";      case "88": return "X";
                case "69": return "E";      case "89": return "Y";
                case "70": return "F";      case "90": return "Z";
                case "71": return "G";      /* opening bracket */
                case "72": return "H";      case "91": return "[";
                case "73": return "I";      /* backslash */
                case "74": return "J";      case "92": return "\\";
                case "75": return "K";      /* closing bracket */
                case "76": return "L";      case "93": return "]";
                case "77": return "M";      /* caret - circumflex */
                case "78": return "N";      case "94": return "^";
                case "79": return "O";      /* underscore */
                case "80": return "P";      case "95": return "_";
                case "81": return "Q";      /* grave accent */
                case "82": return "R";      case "96": return "`";                                            
                         
                case "97": return "a";      case "114": return "r";
                case "98": return "b";      case "115": return "s";
                case "99": return "c";      case "116": return "t";
                case "100": return "d";     case "117": return "u";
                case "101": return "e";     case "118": return "v";
                case "102": return "f";     case "119": return "w";
                case "103": return "g";     case "120": return "x";
                case "104": return "h";     case "121": return "y";
                case "105": return "i";     case "122": return "z";
                case "106": return "j";     /* opening brace */
                case "107": return "k";     case "123": return "{";
                case "108": return "l";     /* vertical bar */
                case "109": return "m";     case "124": return "|";
                case "110": return "n";     /* closing brace */
                case "111": return "o";     case "125": return "}";
                case "112": return "p";     /* equivalency sign - tilde */
                case "113": return "q";     case "126": return "~";

                /* non-breaking space */                    /* degree sign */
                case "160": case "nbsp" : return " ";       case "176": case "deg": return "°";
                /* inverted exclamation mark */             /* plus-or-minus sign */
                case "161": case "iexcl": return "¡";       case "177": case "plusmn": return "±";
                /* cent sign */                             /* superscript two - squared */
                case "162": case "cent" : return "¢";       case "178": case "sup2": return "²";
                /* pound sign */                            /* superscript three - cubed */
                case "163": case "pound": return "£";       case "179": case "sup3": return "³";
                /* currency sign */                         /* acute accent - spacing acute */
                case "164": case "curren": return "¤";      case "180": case "acute": return "´";
                /* yen sign */                              /* micro sign */
                case "165": case "yen": return "¥";         case "181": case "micro": return "µ";
                /* broken vertical bar */                   /* pilcrow sign - paragraph sign */
                case "166": case "brvbar": return "¦";      case "182": case "para": return "¶";
                /* section sign */                          /* middle dot - Georgian comma */
                case "167": case "sect": return "§";        case "183": case "middot": return "·";
                /* spacing diaeresis - umlaut */            /* spacing cedilla */
                case "168": case "uml": return "¨";         case "184": case "cedil": return "¸";
                /* copyright sign */                        /* superscript one */
                case "169": case "copy": return "©";        case "185": case "sup1": return "¹";
                /* feminine ordinal indicator */            /* masculine ordinal indicator */
                case "170": case "ordf": return "ª";        case "186": case "ordm": return "º";
                /* left double angle quotes */              /* right double angle quotes */
                case "171": case "laquo": return "«";       case "187": case "raquo": return "»";
                /* not sign */                              /* fraction one quarter */
                case "172": case "not": return "¬";         case "188": case "frac14": return "¼";
                /* soft hyphen */                           /* fraction one half */
                case "173": case "shy": return "­";          case "189": case "frac12": return "½";
                /* registered trade mark sign */            /* fraction three quarters */
                case "174": case "reg": return "®";         case "190": case "frac34": return "¾";
                /* spacing macron - overline */             /* inverted question mark */
                case "175": case "macr": return "¯";        case "191": case "iquest": return "¿";

                /* latin capital letter A with grave */         /* latin capital letter ETH */
                case "192": case "Agrave": return "À";          case "208": case "ETH": return "Ð";
                /* latin capital letter A with acute */         /* latin capital letter N with tilde */
                case "193": case "Aacute": return "Á";          case "209": case "Ntilde": return "Ñ";
                /* latin capital letter A with circumflex */    /* latin capital letter O with grave */
                case "194": case "Acirc": return "Â";           case "210": case "Ograve": return "Ò";
                /* latin capital letter A with tilde */         /* latin capital letter O with acute */
                case "195": case "Atilde": return "Ã";          case "211": case "Oacute": return "Ó";
                /* latin capital letter A with diaeresis */     /* latin capital letter O with circumflex */
                case "196": case "Auml": return "Ä";            case "212": case "Ocirc": return "Ô";
                /* latin capital letter A with ring above */    /* latin capital letter O with tilde */
                case "197": case "Aring": return "Å";           case "213": case "Otilde": return "Õ";
                /* latin capital letter AE */                   /* latin capital letter O with diaeresis */
                case "198": case "AElig": return "Æ";           case "214": case "Ouml": return "Ö";
                /* latin capital letter C with cedilla */       /* multiplication sign */
                case "199": case "Ccedil": return "Ç";          case "215": case "times": return "×";
                /* latin capital letter E with grave */         /* latin capital letter O with slash */
                case "200": case "Egrave": return "È";          case "216": case "Oslash": return "Ø";
                /* latin capital letter E with acute */         /* latin capital letter U with grave */
                case "201": case "Eacute": return "É";          case "217": case "Ugrave": return "Ù";
                /* latin capital letter E with circumflex */    /* latin capital letter U with acute */
                case "202": case "Ecirc": return "Ê";           case "218": case "Uacute": return "Ú";
                /* latin capital letter E with diaeresis */     /* latin capital letter U with circumflex */
                case "203": case "Euml": return "Ë";            case "219": case "Ucirc": return "Û";
                /* latin capital letter I with grave */         /* latin capital letter U with diaeresis */
                case "204": case "Igrave": return "Ì";          case "220": case "Uuml": return "Ü";
                /* latin capital letter I with acute */         /* latin capital letter Y with acute */
                case "205": case "Iacute": return "Í";          case "221": case "Yacute": return "Ý";
                /* latin capital letter I with circumflex */    /* latin capital letter THORN */
                case "206": case "Icirc": return "Î";           case "222": case "THORN": return "Þ";
                /* latin capital letter I with diaeresis */     /* latin small letter sharp s - ess-zed */
                case "207": case "Iuml": return "Ï";            case "223": case "szlig": return "ß";
                
                /* latin small letter a with grave */           /* latin small letter eth */
                case "224": case "agrave": return "à";          case "240": case "eth": return "ð";
                /* latin small letter a with acute */           /* latin small letter n with tilde */
                case "225": case "aacute": return "á";          case "241": case "ntilde": return "ñ";
                /* latin small letter a with circumflex */      /* latin small letter o with grave */
                case "226": case "acirc": return "â";           case "242": case "ograve": return "ò";
                /* latin small letter a with tilde */           /* latin small letter o with acute */
                case "227": case "atilde": return "ã";          case "243": case "oacute": return "ó";
                /* latin small letter a with diaeresis */       /* latin small letter o with circumflex */
                case "228": case "auml": return "ä";            case "244": case "ocirc": return "ô";
                /* latin small letter a with ring above */      /* latin small letter o with tilde */
                case "229": case "aring": return "å";           case "245": case "otilde": return "õ";
                /* latin small letter ae */                     /* latin small letter o with diaeresis */
                case "230": case "aelig": return "æ";           case "246": case "ouml": return "ö";
                /* latin small letter c with cedilla */         /* division sign */
                case "231": case "ccedil": return "ç";          case "247": case "divide": return "÷";
                /* latin small letter e with grave */           /* latin small letter o with slash */
                case "232": case "egrave": return "è";          case "248": case "oslash": return "ø";
                /* latin small letter e with acute */           /* latin small letter u with grave */
                case "233": case "eacute": return "é";          case "249": case "ugrave": return "ù";
                /* latin small letter e with circumflex */      /* latin small letter u with acute */
                case "234": case "ecirc": return "ê";           case "250": case "uacute": return "ú";
                /* latin small letter e with diaeresis */       /* latin small letter u with circumflex */
                case "235": case "euml": return "ë";            case "251": case "ucirc": return "û";
                /* latin small letter i with grave */           /* latin small letter u with diaeresis */
                case "236": case "igrave": return "ì";          case "252": case "uuml": return "ü";
                /* latin small letter i with acute */           /* latin small letter y with acute */
                case "237": case "iacute": return "í";          case "253": case "yacute": return "ý";
                /* latin small letter i with circumflex */      /* latin small letter thorn */
                case "238": case "icirc": return "î";           case "254": case "thorn": return "þ";
                /* latin small letter i with diaeresis */       /* latin small letter y with diaeresis */
                case "239": case "iuml": return "ï";            case "255": case "yuml": return "ÿ";

                // Browser support: Internet Explorer > 4, Netscape > 4   
                
                /* latin capital letter OE */                   /* single low-9 quotation mark */    
                case "338": return "Œ";                         case "8218": return "‚";             
                /* latin small letter oe */                     /* left double quotation mark */     
                case "339": return "œ";                         case "8220": return "“";             
                /* latin capital letter S with caron */         /* right double quotation mark */
                case "352": return "Š";                         case "8221": return "”";
                /* latin small letter s with caron */           /* double low-9 quotation mark */
                case "353": return "š";                         case "8222": return "„";
                /* latin capital letter Y with diaeresis */     /* dagger */
                case "376": return "Ÿ";                         case "8224": return "†";
                /* latin small f with hook - function */        /* double dagger */
                case "402": return "ƒ";                         case "8225": return "‡";
                                                                /* bullet */
                /* en dash */                                   case "8226": return "•";
                case "8211": return "–";                        /* horizontal ellipsis */
                /* em dash */                                   case "8230": return "…";
                case "8212": return "—";                        /* per thousand sign */
                /* left single quotation mark */                case "8240": return "‰";
                case "8216": return "‘";                        /* euro sign */
                /* right single quotation mark */               case "8364": case "euro": return "€";
                case "8217": return "’";                        /* trade mark sign */
                                                                case "8482": return "™";
            }

            return null;
        }        
    }
}

Applying the HtmlStripper to the simple piece of Html like this

 
<div>text in the first div&nbsp;
     <div><text in the nested div><div/>
     &nbsp;&#8364;text in the first div again
</div>

the following result is got

text in the first div 
    <text in the nested div>
     €text in the first div again

The next code sample demonstrates how to use HtmlStripper in field conditions:

using System;
using System.Text;
using System.Web;
using System.Net;
using System.IO;
using Helpers;

namespace HtmlStripperConsoleApp
{
    class Program
    {
        static void Main(string[] args)
        {
            String responseString = null;

            WebRequest request = HttpWebRequest.Create("http://dotnetfollower.com/wordpress/2011/12/sharepoint-understanding-businessdata-column-bdc-field/");            
        
            using(WebResponse response = request.GetResponse())
                using (Stream stream = response.GetResponseStream())
                {
                    StreamReader reader = new StreamReader(stream, Encoding.UTF8);
                    responseString = reader.ReadToEnd();
                }
            
            string strippedText = HtmlStripper.Strip(responseString);
            File.WriteAllText(@"c:\text.txt", strippedText);            
        }
    }
}
Categories: C#, Regex Tags: ,