RemoveAllHtmlTags

== How to remove all HTML tags from a file ==

This is something I have bumped into so many times: how do you remove all those annoying HTML tags (for instance: {{{<td>...</td>}}}) from a page downloaded from the Internet?

Turns out, one of the oldest (text-only) browsers is your friend! Try this:

{{{ $ lynx -dump http://www.megacorp.com/path/to/file.html > same_file_without_tags.txt }}}

Once the text file has been created, it becomes very easy to parse it with the standard Linux tools (grep, awk, sed and friends) for the important information - you know, the one you are looking for! ;-)

For instance, without HTML tags, you can do the following:

{{{
$ lynx -dump http://www.megacorp.com/path/to/file.html > same_file_without_tags.txt
$ grep look_for_this_line same_file_without_tags.txt | awk '{print $2}'
}}}

Adapt the above two lines to get to the important information you are looking for. Makes your scripts soooooo much easier to understand!
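If you end up doing this in more than one script, the two steps can be wrapped in small shell functions. What follows is only a rough sketch: the function names {{{dump_page}}} and {{{extract_field}}} are made up for this page, and the {{{-nolist}}} option (which asks lynx to leave out the numbered list of links it normally appends to a dump) is an addition that the commands above do not use.

```shell
#!/bin/sh
# Sketch only: dump_page and extract_field are hypothetical helper names,
# not part of lynx or any standard tool.

# Print the page at URL $1 as plain text on stdout.
# -dump   renders the page and writes it to stdout
# -nolist omits the numbered link list lynx would otherwise append
dump_page() {
    lynx -dump -nolist "$1"
}

# Read text on stdin. For every line containing the fixed string $1,
# print whitespace-separated field number $2 (defaults to field 2).
extract_field() {
    grep -F -- "$1" | awk -v n="${2:-2}" '{ print $n }'
}

# Example usage (needs lynx installed and network access):
#   dump_page http://www.megacorp.com/path/to/file.html | extract_field look_for_this_line
```

Piping the dump straight into {{{extract_field}}} skips the temporary text file; keep the file if you want to run several different searches on the same page.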

Hope this helps!

== See Also: ==