12 May 2010

Oracle: Regular expression to remove html tags

There are two ways to strip HTML tags from a string

1) Try to extract everything between end html tag(>)and next starting html tag (<)

2) First approach becomes complicated when you have nested tags. So the solution is to replace everything that is between (<) and (>)

SQL below replaces all the tags with space.

Select Regexp_Replace('<>source string<>bold<> with html tags<>', '<[^>]+>','') from dual

*Note: just put some tags between <> to test, google is not displaying the tags properly. The inner tags (before and after bold ) simulate nested tags.

2 comments:

  1. This is good, but how do you replace special characters of html tags like &nbsp, &quot and many more.

    ReplyDelete
  2. since &nbsp,&quot are constants, you can either use replace or regexp_replace

    regexp_replace(source_string,'[&nbsp]|[&quot]','')

    ReplyDelete