UnHTM() :: Remove HTML formatting from a String [Updated]


22 replies to this topic
SKAN
  • Administrators
  • 9115 posts
  • Last active:
  • Joined: 26 Dec 2005
Please do not expect UnHTM() to unformat a whole HTML file. If you have already parsed out a string, and need to unformat it to plain text, then UnHTM() would be handy.

UnHTM( HTM ) {   ; Remove HTML formatting / Convert to ordinary text   by SKAN 19-Nov-2009
 Static HT,C=";" ; Forum Topic: www.autohotkey.com/forum/topic51342.html  Mod: 16-Sep-2010
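 IfEqual,HT,,   SetEnv,HT, % "&aacuteá&acircâ&acute´&aeligæ&agraveà&amp&aringå&atildeã&au"
 . "mlä&bdquo„&brvbar¦&bull•&ccedilç&cedil¸&cent¢&circˆ&copy©&curren¤&dagger†&dagger‡&deg"
 . "°&divide÷&eacuteé&ecircê&egraveè&ethð&eumlë&euro€&fnofƒ&frac12½&frac14¼&frac34¾&gt>&h"
 . "ellip…&iacuteí&icircî&iexcl¡&igraveì&iquest¿&iumlï&laquo«&ldquo“&lsaquo‹&lsquo‘&lt<&m"
 . "acr¯&mdash—&microµ&middot·&nbsp &ndash–&not¬&ntildeñ&oacuteó&ocircô&oeligœ&ograveò&or"
 . "dfª&ordmº&oslashø&otildeõ&oumlö&para¶&permil‰&plusmn±&pound£&quot""&raquo»&rdquo”&reg"
 . "®&rsaquo›&rsquo’&sbquo‚&scaronš&sect§&shy­&sup1¹&sup2²&sup3³&szligß&thornþ&tilde˜&tim"
 . "es×&trade™&uacuteú&ucircû&ugraveù&uml¨&uumlü&yacuteý&yen¥&yumlÿ"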
 $ := RegExReplace( HTM,"<[^>]+>" )               ; Remove all tags between  "<" and ">"
 Loop, Parse, $, &`;                              ; Create a list of special characters
   L := "&" A_LoopField C, R .= (!(A_Index&1)) ? ( (!InStr(R,L,1)) ? L:"" ) : ""
 StringTrimRight, R, R, 1
 Loop, Parse, R , %C%                               ; Parse Special Characters
  If F := InStr( HT, L := A_LoopField )             ; Lookup HT Data
    StringReplace, $,$, %L%%C%, % SubStr( HT,F+StrLen(L), 1 ), All
  Else If ( SubStr( L,2,1)="#" )
    StringReplace, $, $, %L%%C%, % Chr(((SubStr(L,3,1)="x") ? "0" : "" ) SubStr(L,3)), All
Return RegExReplace( $, "(^\s*|\s*$)")            ; Remove leading/trailing white spaces
}

; Example:
HTM = <a href="/intl/en/ads/">Advertising Programs</a>
MsgBox, % UnHTM( HTM )
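
A second, hedged example that also exercises the entity handling. It assumes a working copy of UnHTM() (with its entity table intact) is part of the same script; the sample string is made up for illustration. It mixes tags, named entities, a decimal entity and a hex entity:

```autohotkey
; Assumption: UnHTM() from this post is defined in (or #Included into) this script.
; The sample text below is hypothetical, chosen only to touch each code path.
HTM = <b>Caf&eacute;</b> &amp; <i>Cr&egrave;me</i> &#169; 2009 &#x22;quoted&#x22;
MsgBox, % UnHTM( HTM )   ; tags are stripped first, then each &...; entity is replaced
```

How the accented characters display depends on the AutoHotkey build (ANSI or Unicode) and on the encoding the script file was saved in.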

Thanks to AGermanUser for NeedleRegEx

:)





; Array of Special Character Entities was created with the following code
Loop % 256-33 {
Transform, F, HTML, % Chr( A := A_Index+33 )
If Strlen(F) > 1 && !Instr( F, "#" )
  list .= "&" SubStr(F,2, StrLen(F)-2) Chr(A )
}
StringLower, List, List
Sort, List, D& U
Clipboard := List
MsgBox, 0, % StrLen( List ), % Clipboard
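
For anyone wondering how the lookup works: the generated list is a flat string in which each "&name" is followed directly by the character it stands for, so UnHTM() finds the name with InStr() and picks up the next character with SubStr(). A minimal sketch of that idea, using a made-up three-entry excerpt rather than the full table (the variable values here are only for illustration):

```autohotkey
HT := "&copy©&deg°&pound£"        ; hypothetical short excerpt, same layout as the real table
L  := "&deg"                      ; entity name without its trailing ";"
F  := InStr( HT, L )              ; position of the name inside the table, 0 if absent
If ( F )
  MsgBox, % SubStr( HT, F + StrLen(L), 1 )   ; the character stored right after the name
```

Note that InStr() is case-insensitive by default, so with a lowercased table an upper-case name such as "&Deg" would match the same entry.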


SoLong&Thx4AllTheFish
  • Members
  • 4999 posts
  • Last active:
  • Joined: 27 May 2007
Hi Skan,

what would be the difference between yours and "[stdlib] unHTML - Strips Tags and Entities from given Source" by derRaphael?
http://www.autohotke...topic38183.html

SKAN
  • Administrators
  • 9115 posts
  • Last active:
  • Joined: 26 Dec 2005

what would be the difference between yours and .... by derRaphael?


Well, there is also StripHTMLItems() posted by Jamie. The functionality of these three is more or less the same, but I wrote mine from scratch for it to be standalone and compact... and to be an accessory to StrX() [a wrapper for SubStr()] that I am about to post. The examples for StrX() will be pointing towards this thread.

SoLong&Thx4AllTheFish
  • Members
  • 4999 posts
  • Last active:
  • Joined: 27 May 2007
Can't wait :D

First Toy Lab
  • Members
  • 21 posts
  • Last active: Jan 09 2010 01:18 PM
  • Joined: 15 Nov 2009
Well done!

SKAN
  • Administrators
  • 9115 posts
  • Last active:
  • Joined: 26 Dec 2005
Thanks "First Toy Lab". :)

Can't wait :D


Hope I do not disappoint you! :)

StrX() : http://www.autohotke...topic51354.html

luetkmeyer
  • Members
  • 38 posts
  • Last active: Jul 01 2011 04:11 PM
  • Joined: 26 Feb 2010
Great!

berban
  • Members
  • 202 posts
  • Last active: Feb 21 2019 06:14 PM
  • Joined: 30 Dec 2009
Well, I guess my code is a bit antiquated then, hehe:

CleanHTML(v)
{
	StringReplace v, v, !, !, All
	StringReplace v, v, ", ", All
	StringReplace v, v, ", ", All
	StringReplace v, v, #, #, All
	StringReplace v, v, $, $, All
	StringReplace v, v, %, `%, All
	StringReplace v, v, &, &, All
	StringReplace v, v, &, &, All
	StringReplace v, v, ', ', All
	StringReplace v, v, (, (, All
	StringReplace v, v, ), ), All
	StringReplace v, v, *, *, All
	StringReplace v, v, +, +, All
	StringReplace v, v, ,, ,, All
	StringReplace v, v, -, -, All
	StringReplace v, v, ., ., All
	StringReplace v, v, /, /, All
	StringReplace v, v, 0, 0, All
	StringReplace v, v, 1, 1, All
	StringReplace v, v, 2, 2, All
	StringReplace v, v, 3, 3, All
	StringReplace v, v, 4, 4, All
	StringReplace v, v, 5, 5, All
	StringReplace v, v, 6, 6, All
	StringReplace v, v, 7, 7, All
	StringReplace v, v, 8, 8, All
	StringReplace v, v, 9, 9, All
	StringReplace v, v, :, :, All
	StringReplace v, v, ;, ;, All
	StringReplace v, v, <, <, All
	StringReplace v, v, <, <, All
	StringReplace v, v, =, =, All
	StringReplace v, v, >, >, All
	StringReplace v, v, >, >, All
	StringReplace v, v, ?, ?, All
	StringReplace v, v, @, @, All
	StringReplace v, v, A, A, All
	StringReplace v, v, B, B, All
	StringReplace v, v, C, C, All
	StringReplace v, v, D, D, All
	StringReplace v, v, E, E, All
	StringReplace v, v, F, F, All
	StringReplace v, v, G, G, All
	StringReplace v, v, H, H, All
	StringReplace v, v, I, I, All
	StringReplace v, v, J, J, All
	StringReplace v, v, K, K, All
	StringReplace v, v, L, L, All
	StringReplace v, v, M, M, All
	StringReplace v, v, N, N, All
	StringReplace v, v, O, O, All
	StringReplace v, v, P, P, All
	StringReplace v, v, Q, Q, All
	StringReplace v, v, R, R, All
	StringReplace v, v, S, S, All
	StringReplace v, v, T, T, All
	StringReplace v, v, U, U, All
	StringReplace v, v, V, V, All
	StringReplace v, v, W, W, All
	StringReplace v, v, X, X, All
	StringReplace v, v, Y, Y, All
	StringReplace v, v, Z, Z, All
	StringReplace v, v, [, [, All
	StringReplace v, v, \, \, All
	StringReplace v, v, ], ], All
	StringReplace v, v, ^, ^, All
	StringReplace v, v, _, _, All
	StringReplace v, v, `, `, All
	StringReplace v, v, a, a, All
	StringReplace v, v, b, b, All
	StringReplace v, v, c, c, All
	StringReplace v, v, d, d, All
	StringReplace v, v, e, e, All
	StringReplace v, v, f, f, All
	StringReplace v, v, g, g, All
	StringReplace v, v, h, h, All
	StringReplace v, v, i, i, All
	StringReplace v, v, j, j, All
	StringReplace v, v, k, k, All
	StringReplace v, v, l, l, All
	StringReplace v, v, m, m, All
	StringReplace v, v, n, n, All
	StringReplace v, v, o, o, All
	StringReplace v, v, p, p, All
	StringReplace v, v, q, q, All
	StringReplace v, v, r, r, All
	StringReplace v, v, s, s, All
	StringReplace v, v, t, t, All
	StringReplace v, v, u, u, All
	StringReplace v, v, v, v, All
	StringReplace v, v, w, w, All
	StringReplace v, v, x, x, All
	StringReplace v, v, y, y, All
	StringReplace v, v, z, z, All
	StringReplace v, v, {, {, All
	StringReplace v, v, |, |, All
	StringReplace v, v, }, }, All
	StringReplace v, v, ˜, ~, All
	StringReplace v, v, ~, ~, All
	StringReplace v, v, , , All
	StringReplace v, v, €, €, All
	StringReplace v, v, , ?, All
	StringReplace v, v, ‚, ‚, All
	StringReplace v, v, ‚, ‚, All
	StringReplace v, v, ƒ, ƒ, All
	StringReplace v, v, &bdquo;, „, All
	StringReplace v, v, „, „, All
	StringReplace v, v, …, …, All
	StringReplace v, v, †, †, All
	StringReplace v, v, †, †, All
	StringReplace v, v, ‡, ‡, All
	StringReplace v, v, ‡, ‡, All
	StringReplace v, v, ˆ, ˆ, All
	StringReplace v, v, ‰, ‰, All
	StringReplace v, v, ‰, ‰, All
	StringReplace v, v, Š, Š, All
	StringReplace v, v, ‹, ‹, All
	StringReplace v, v, ‹, ‹, All
	StringReplace v, v, Œ, Œ, All
	StringReplace v, v, , ?, All
	StringReplace v, v, Ž, Ž, All
	StringReplace v, v, , ?, All
	StringReplace v, v, , ?, All
	StringReplace v, v, ‘, ‘, All
	StringReplace v, v, ‘, ‘, All
	StringReplace v, v, ’, ’, All
	StringReplace v, v, ’, ’, All
	StringReplace v, v, “, “, All
	StringReplace v, v, “, “, All
	StringReplace v, v, ”, ”, All
	StringReplace v, v, ”, ”, All
	StringReplace v, v, •, •, All
	StringReplace v, v, –, –, All
	StringReplace v, v, –, –, All
	StringReplace v, v, —, —, All
	StringReplace v, v, —, —, All
	StringReplace v, v, &tilde;, ˜, All
	StringReplace v, v, ˜, ˜, All
	StringReplace v, v, ™, ™, All
	StringReplace v, v, ™, ™, All
	StringReplace v, v, š, š, All
	StringReplace v, v, ›, ›, All
	StringReplace v, v, ›, ›, All
	StringReplace v, v, œ, œ, All
	StringReplace v, v, , ?, All
	StringReplace v, v, ž, ž, All
	StringReplace v, v, Ÿ, Ÿ, All
	StringReplace v, v, Ÿ, Ÿ, All
	StringReplace v, v,  , %A_Space%, All
	StringReplace v, v,  , %A_Space%, All
	StringReplace v, v, ¡, ¡, All
	StringReplace v, v, ¡, ¡, All
	StringReplace v, v, ¢, ¢, All
	StringReplace v, v, ¢, ¢, All
	StringReplace v, v, £, £, All
	StringReplace v, v, £, £, All
	StringReplace v, v, ¤, ¤, All
	StringReplace v, v, ¤, ¤, All
	StringReplace v, v, ¥, ¥, All
	StringReplace v, v, ¥, ¥, All
	StringReplace v, v, ¦, ¦, All
	StringReplace v, v, ¦, ¦, All
	StringReplace v, v, §, §, All
	StringReplace v, v, §, §, All
	StringReplace v, v, ¨, ¨, All
	StringReplace v, v, ¨, ¨, All
	StringReplace v, v, ©, ©, All
	StringReplace v, v, ©, ©, All
	StringReplace v, v, ª, ª, All
	StringReplace v, v, ª, ª, All
	StringReplace v, v, «, «, All
	StringReplace v, v, «, «, All
	StringReplace v, v, ¬, ¬, All
	StringReplace v, v, ¬, ¬, All
	StringReplace v, v, ­, ­, All
	StringReplace v, v, ­, ­, All
	StringReplace v, v, ®, ®, All
	StringReplace v, v, ®, ®, All
	StringReplace v, v, ¯, ¯, All
	StringReplace v, v, ¯, ¯, All
	StringReplace v, v, °, °, All
	StringReplace v, v, °, °, All
	StringReplace v, v, ±, ±, All
	StringReplace v, v, ±, ±, All
	StringReplace v, v, ², ², All
	StringReplace v, v, ², ², All
	StringReplace v, v, ³, ³, All
	StringReplace v, v, ³, ³, All
	StringReplace v, v, ´, ´, All
	StringReplace v, v, µ, µ, All
	StringReplace v, v, µ, µ, All
	StringReplace v, v, ¶, ¶, All
	StringReplace v, v, ¶, ¶, All
	StringReplace v, v, ·, ·, All
	StringReplace v, v, ·, ·, All
	StringReplace v, v, ¸, ¸, All
	StringReplace v, v, ¸, ¸, All
	StringReplace v, v, ¹, ¹, All
	StringReplace v, v, ¹, ¹, All
	StringReplace v, v, º, º, All
	StringReplace v, v, º, º, All
	StringReplace v, v, », », All
	StringReplace v, v, », », All
	StringReplace v, v, ¼, ¼, All
	StringReplace v, v, ¼, ¼, All
	StringReplace v, v, ½, ½, All
	StringReplace v, v, ½, ½, All
	StringReplace v, v, ¾, ¾, All
	StringReplace v, v, ¾, ¾, All
	StringReplace v, v, ¿, ¿, All
	StringReplace v, v, À, À, All
	StringReplace v, v, À, À, All
	StringReplace v, v, Á, Á, All
	StringReplace v, v, Á, Á, All
	StringReplace v, v, Â, Â, All
	StringReplace v, v, Â, Â, All
	StringReplace v, v, Ã, Ã, All
	StringReplace v, v, Ã, Ã, All
	StringReplace v, v, Ä, Ä, All
	StringReplace v, v, Ä, Ä, All
	StringReplace v, v, Å, Å, All
	StringReplace v, v, Å, Å, All
	StringReplace v, v, Æ, Æ, All
	StringReplace v, v, Æ, Æ, All
	StringReplace v, v, Ç, Ç, All
	StringReplace v, v, Ç, Ç, All
	StringReplace v, v, È, È, All
	StringReplace v, v, È, È, All
	StringReplace v, v, É, É, All
	StringReplace v, v, É, É, All
	StringReplace v, v, Ê, Ê, All
	StringReplace v, v, Ê, Ê, All
	StringReplace v, v, Ë, Ë, All
	StringReplace v, v, Ë, Ë, All
	StringReplace v, v, Ì, Ì, All
	StringReplace v, v, Ì, Ì, All
	StringReplace v, v, Í, Í, All
	StringReplace v, v, Í, Í, All
	StringReplace v, v, Î, Î, All
	StringReplace v, v, Î, Î, All
	StringReplace v, v, Ï, Ï, All
	StringReplace v, v, Ï, Ï, All
	StringReplace v, v, Ð, Ð, All
	StringReplace v, v, Ð, Ð, All
	StringReplace v, v, Ñ, Ñ, All
	StringReplace v, v, Ñ, Ñ, All
	StringReplace v, v, Ò, Ò, All
	StringReplace v, v, Ò, Ò, All
	StringReplace v, v, Ó, Ó, All
	StringReplace v, v, Ó, Ó, All
	StringReplace v, v, Ô, Ô, All
	StringReplace v, v, Ô, Ô, All
	StringReplace v, v, Õ, Õ, All
	StringReplace v, v, Õ, Õ, All
	StringReplace v, v, Ö, Ö, All
	StringReplace v, v, Ö, Ö, All
	StringReplace v, v, ×, ×, All
	StringReplace v, v, ×, ×, All
	StringReplace v, v, Ø, Ø, All
	StringReplace v, v, Ø, Ø, All
	StringReplace v, v, Ù, Ù, All
	StringReplace v, v, Ù, Ù, All
	StringReplace v, v, Ú, Ú, All
	StringReplace v, v, Ú, Ú, All
	StringReplace v, v, Û, Û, All
	StringReplace v, v, Û, Û, All
	StringReplace v, v, Ü, Ü, All
	StringReplace v, v, Ü, Ü, All
	StringReplace v, v, Ý, Ý, All
	StringReplace v, v, Ý, Ý, All
	StringReplace v, v, Þ, Þ, All
	StringReplace v, v, Þ, Þ, All
	StringReplace v, v, ß, ß, All
	StringReplace v, v, ß, ß, All
	StringReplace v, v, à, à, All
	StringReplace v, v, à, à, All
	StringReplace v, v, á, á, All
	StringReplace v, v, á, á, All
	StringReplace v, v, â, â, All
	StringReplace v, v, â, â, All
	StringReplace v, v, ã, ã, All
	StringReplace v, v, ã, ã, All
	StringReplace v, v, ä, ä, All
	StringReplace v, v, ä, ä, All
	StringReplace v, v, å, å, All
	StringReplace v, v, å, å, All
	StringReplace v, v, æ, æ, All
	StringReplace v, v, æ, æ, All
	StringReplace v, v, ç, ç, All
	StringReplace v, v, ç, ç, All
	StringReplace v, v, è, è, All
	StringReplace v, v, è, è, All
	StringReplace v, v, é, é, All
	StringReplace v, v, é, é, All
	StringReplace v, v, ê, ê, All
	StringReplace v, v, ê, ê, All
	StringReplace v, v, ë, ë, All
	StringReplace v, v, ë, ë, All
	StringReplace v, v, ì, ì, All
	StringReplace v, v, ì, ì, All
	StringReplace v, v, í, í, All
	StringReplace v, v, í, í, All
	StringReplace v, v, î, î, All
	StringReplace v, v, î, î, All
	StringReplace v, v, ï, ï, All
	StringReplace v, v, ï, ï, All
	StringReplace v, v, ð, ð, All
	StringReplace v, v, ð, ð, All
	StringReplace v, v, ñ, ñ, All
	StringReplace v, v, ñ, ñ, All
	StringReplace v, v, ò, ò, All
	StringReplace v, v, ò, ò, All
	StringReplace v, v, ó, ó, All
	StringReplace v, v, ó, ó, All
	StringReplace v, v, ô, ô, All
	StringReplace v, v, ô, ô, All
	StringReplace v, v, õ, õ, All
	StringReplace v, v, õ, õ, All
	StringReplace v, v, ö, ö, All
	StringReplace v, v, ö, ö, All
	StringReplace v, v, ÷, ÷, All
	StringReplace v, v, ÷, ÷, All
	StringReplace v, v, ø, ø, All
	StringReplace v, v, ø, ø, All
	StringReplace v, v, ù, ù, All
	StringReplace v, v, ù, ù, All
	StringReplace v, v, ú, ú, All
	StringReplace v, v, ú, ú, All
	StringReplace v, v, û, û, All
	StringReplace v, v, û, û, All
	StringReplace v, v, ü, ü, All
	StringReplace v, v, ü, ü, All
	StringReplace v, v, ý, ý, All
	StringReplace v, v, ý, ý, All
	StringReplace v, v, þ, þ, All
	StringReplace v, v, þ, þ, All
	StringReplace v, v, ÿ, ÿ, All
	StringReplace v, v, ÿ, ÿ, All
	return v
}

There ya go

berban
  • Members
  • 202 posts
  • Last active: Feb 21 2019 06:14 PM
  • Joined: 30 Dec 2009
haha wow yup that doesn't quite show up right IN A HTML WEB BROWSER... whatever

soggos
  • Members
  • 129 posts
  • Last active: Nov 30 2012 10:35 AM
  • Joined: 27 Mar 2008
Thank you SKAN and berban, and thanks for this great language
.....
Does anybody know how to remove only <script>....</script>
with RegEx?
with ahk, all is different!...

%guest%
  • Guests
  • Last active:
  • Joined: --
(not tested)

var := RegExReplace(var, "s)\Q<script>\E(.*?)\Q</script>\E", "$1")


garry
  • Spam Officer
  • 3219 posts
  • Last active: Sep 20 2018 02:47 PM
  • Joined: 19 Apr 2005
another example to remove <xy>
data = Current weather condition: <div class="h2">Partly Cloudy / Windy</div>
MsgBox, % RegExReplace( data, "<.*?>" )


soggos
  • Members
  • 129 posts
  • Last active: Nov 30 2012 10:35 AM
  • Joined: 27 Mar 2008

(not tested)

var := RegExReplace(var, "s)\Q<script>\E(.*?)\Q</script>\E", "$1")

Tested, and it's so cool.
Note: with $1 as the last parameter, only the <script> and </script> tags are removed and the text between them is kept.
To remove that text as well, use an empty replacement:
var := RegExReplace(var, "s)\Q<script>\E(.*?)\Q</script>\E", "")
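
One caveat: the pattern above only matches a bare <script> tag. Pages often write the opening tag with attributes or in upper case, so a slightly looser pattern may be needed. A sketch under that assumption (the sample HTML and variable names are made up, and this is not a full HTML parser):

```autohotkey
; Hypothetical input: a script block whose opening tag carries an attribute.
html = <p>Before</p><script type="text/javascript">alert("hi")</script><p>After</p>

; i) = case-insensitive, s) = let "." match newlines as well
; <script\b[^>]*>  opening tag with optional attributes
; .*?              script body, non-greedy so each block is matched separately
; </script\s*>     closing tag
clean := RegExReplace( html, "is)<script\b[^>]*>.*?</script\s*>" )
MsgBox, % clean   ; the script block and its contents are gone, the two <p> parts remain
```

With the replacement parameter omitted, RegExReplace() substitutes an empty string, which is the same effect as the "" version above.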

And many thanks to garry for the other good example.

another example to remove <xy>

data = Current weather condition: <div class="h2">Partly Cloudy / Windy</div>
MsgBox, % RegExReplace( data, "<.*?>" )


with ahk, all is different!...

SKAN
  • Administrators
  • 9115 posts
  • Last active:
  • Joined: 26 Dec 2005
Update:

The previous version removed numeric entities if they were written in hex.
The updated version replaces them correctly.

Example:
The title tag for http://www.imdb.com/title/tt0068093/ is

<title>&#x22;Kung Fu&#x22; (1972)</title>

The entity number is in hex instead of &#34; or &quot;


Please use the updated code.
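
For the curious, the hex handling comes down to turning the entity into a number that Chr() accepts. A small standalone sketch of that conversion, mirroring the expression used in UnHTM() (the variable names L and N are just for this example):

```autohotkey
L := "&#x22"   ; a numeric entity as UnHTM() sees it, without the trailing ";"

; If the third character is "x", prefix "0" so the text reads as a hex number ("0x22").
; A decimal entity such as "&#34" is left as plain "34".
N := ( SubStr(L,3,1) = "x" ? "0" : "" ) . SubStr(L,3)

MsgBox, % N " -> " Chr( N )   ; Chr() turns the number into its character, here the double quote
```

The "=" comparison is case-insensitive in AutoHotkey v1, so an upper-case "&#X22" should take the same path.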

Moebius
  • Members
  • 39 posts
  • Last active: May 11 2015 08:46 PM
  • Joined: 08 Mar 2009
Hi,
I get the following error:

Error at line 16.
Line Text: ;1
Error: The leftmost character above is illegal in an expression.

Line 16:
L := "&" A_LoopField C, R .= (!(A_Index&1)) ? ( (!InStr(R,L,1)) ? L:"" ) : ""

Anyway, thanks Skan for all of your snippets - they are very useful!