htm to txt, AutoHotkey Help (chm file) to txt Topic is solved

jeeswg
Posts: 5435
Joined: 19 Dec 2016, 01:58
Location: UK

htm to txt, AutoHotkey Help (chm file) to txt

02 Jan 2017, 21:24

Inconsistencies in htm to txt.

I have the AutoHotkey.chm file as a txt file (with a few pages excluded from it), which is really useful for searching. I exploded the chm using the HTML Help command line (using short-form paths), and then tried different methods to convert htm to txt, e.g. opening each file in a GUI/Internet Explorer and copying to the clipboard, or retrieving outerText directly. When I compared the results using WinMerge, I kept getting slightly different formatting, such as lost line breaks or lost bullet points. On the whole, the clipboard method gave the best results.
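
For reference, the 'retrieve the text directly' method mentioned above can be sketched roughly as follows. This is a minimal sketch, not a definitive implementation: the file path is a hypothetical example, and it assumes that an MSHTML document object ("htmlfile") is available and that the body's innerText is close enough to outerText for this purpose.

```autohotkey
;sketch: get a page's text via an MSHTML document object (no visible window)
;assumption: the path below is a hypothetical example
vPath := A_Desktop "\ahk chm\docs\AutoHotkey.htm"

;note: FileRead assumes ANSI unless the file has a BOM,
;a *P65001 option may be needed if the file is UTF-8 without a BOM
FileRead, vHtml, % vPath

oDoc := ComObjCreate("htmlfile")
oDoc.write(vHtml)
oDoc.close()
vText := oDoc.body.innerText
oDoc := ""

Clipboard := vText
MsgBox, % "done"
return
```

The text returned this way depends on how MSHTML lays out the page, which is one reason the results can differ from the clipboard method.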

I would be interested to hear from anyone with experience or tips regarding these issues. If AutoHotkey can parse the text effectively, or if there are already functions for this, that would be useful to me. Has anyone tried NirSoft HTMLAsText or any other tool?
Guest

Re: htm to txt, AutoHotkey Help (chm file) to txt

03 Jan 2017, 08:09

Why not use the source of the CHM directly? https://github.com/Lexikos/AutoHotkey_L-Docs

If you spot a formatting error you can send a pull request :thumbup:
jeeswg

Re: htm to txt, AutoHotkey Help (chm file) to txt

03 Jan 2017, 11:47

Haha, I did think that one logical outcome of what I was saying
would be to ask the htm creators to format the html in such a way
that it would be consistent across all browsing methods.
But 'Mama always said html was like a box of chocolates. You never know what you're gonna get.'
I wouldn't want to ask the creator to do that anyway;
in this case 'beauty is in the eye of the decoder'.

Thanks for this. I've looked at the AHK source a lot, but had forgotten
that the htms are there! I might do some html edits to see what happens.
jeeswg

Re: htm to txt, AutoHotkey Help (chm file) to txt  Topic is solved

18 Feb 2017, 00:31

I have made an attempt at an htm (any htm) to txt converter.
It is essentially complete.

Decompile the AutoHotkey Help chm using HTML Help, in order to get the htm files.
The simplest conversion would just strip html tags, leaving plaintext.
This script makes some additional alterations to make the resulting text more readable, such as adding line breaks, bullet points, and [HDR1] and [COL] tags.
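
The decompile step could be scripted along these lines. This is only a sketch under stated assumptions: the chm is assumed to sit next to AutoHotkey.exe, the output folder is an example, and PathGetShort is a hypothetical helper (not part of the script below) that falls back to the long path when no 8.3 name exists. Short-form paths are used because paths containing spaces may cause problems for hh.exe.

```autohotkey
;sketch: decompile a chm to a folder via HTML Help (hh.exe)
;assumption: the chm is in the same folder as AutoHotkey.exe
SplitPath, A_AhkPath,, vDirAhk
vPathChm := vDirAhk "\AutoHotkey.chm"
vDirOut := A_Desktop "\ahk chm" ;example output folder

if !FileExist(vPathChm)
{
	MsgBox, % "chm not found:`r`n" vPathChm
	return
}
FileCreateDir, % vDirOut

;syntax: hh.exe -decompile <output folder> <chm file>
vCmd := "hh.exe -decompile " PathGetShort(vDirOut) " " PathGetShort(vPathChm)
RunWait, % vCmd
MsgBox, % "done"
return

;hypothetical helper: returns the 8.3 form of a file/folder path,
;or the original path if no short name is available
PathGetShort(vPath)
{
	Loop, Files, % vPath, FD
		return (A_LoopFileShortPath = "") ? vPath : A_LoopFileShortPath
	return vPath
}
```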

Please report any code issues or other problems by commenting below.

Code:

;==================================================

;htm to txt by jeeswg [created: 2017-02-18]
;for use with htms from the AutoHotkey Help chm
;but which can be readily adapted for use with any htm

;the script searches for htms in a folder (with recursion), and converts them to plaintext
;it then puts the text from all the files onto the clipboard

;==================================================

;note: the AutoHotkey version number may appear in:
;docs\AutoHotkey.htm
;docs\AHKL_ChangeLog.htm

;note: it may be preferable to move text from the following htms
;to a separate file as they interfere with searching:
;docs\AHKL_ChangeLog.htm
;docs\ChangeLogHelp.htm

;38	&amp; (handled separately, after the entity loop)
vList = ;continuation section
(
34	&quot;
60	&lt;
62	&gt;
224	&agrave;
129	&#129;
160	&nbsp;
162	&cent;
163	&pound;
164	&curren;
165	&yen;
166	&brvbar;
167	&sect;
169	&copy;
170	&ordf;
171	&laquo;
174	&reg;
176	&deg;
177	&plusmn;
181	&micro;
182	&para;
186	&ordm;
196	&Auml;
212	&Ocirc;
220	&Uuml;
252	&uuml;
8211	&ndash;
8230	&#8230;
8364	&#8364;
8364	&euro;
8593	&uarr;
8594	&rarr;
8734	&#x221e;
65533	&#65533;
)

;==================================================

;STAGE 1 - GET TEXT

vDir1 = %A_Desktop%\ahk chm
vPos := StrLen(vDir1)+2
vUrlPart := "https://autohotkey.com/"
vBarrier := StrReplace(Format("{:50}", ""), " ", "=")

vVersion := ""
vOutput := ""
VarSetCapacity(vOutput, 10000000*2) ;10MB*2
Loop, %vDir1%\*.htm, 0, 1 ;(0/1/2=files/both/folders, 0/1=recurse no/yes)
{
	vPath := A_LoopFileFullPath
	SplitPath, vPath, vName, vDir, vExt, vNameNoExt, vDrive
	FileRead, vText, %vPath%

	;get AutoHotkey version number
	if (vName = "AutoHotkey.htm")
		RegExMatch(vText, "(?<=<!--ver-->).*?(?=<!--/ver-->)", vVersion)

	;get url and webpage title
	vUrl := StrReplace(vUrlPart SubStr(vPath, vPos), "\", "/")
	vPos1 := InStr(vText, "<title>") + 7
	vPos2 := InStr(vText, "</title>", 0, vPos1) - 1
	vWinTitle := SubStr(vText, vPos1, vPos2-vPos1+1)

	vOutput .= vBarrier "`r`n`r`n[TITLE]" vWinTitle "`r`n" "[URL]" vUrl "`r`n`r`n" vText "`r`n`r`n"
}
vOutput .= "`r`n" vBarrier "`r`n"

;==================================================

;STAGE 2 - REPLACEMENTS

;put line breaks after certain elements
vOutput := StrReplace(vOutput, "</h1>", "</h1>`r`n")
vOutput := StrReplace(vOutput, "</h2>", "</h2>`r`n")
vOutput := StrReplace(vOutput, "</h3>", "</h3>`r`n")
vOutput := StrReplace(vOutput, "</h4>", "</h4>`r`n")
vOutput := StrReplace(vOutput, "</p>", "</p>`r`n")
vOutput := StrReplace(vOutput, "</pre>", "</pre>`r`n")
vOutput := StrReplace(vOutput, "</ul>", "</ul>`r`n")

;add bullet points to 'li' elements
vOutput := StrReplace(vOutput, "<li>", "<li>" Chr(8226))

;indicate headers and columns
vOutput := StrReplace(vOutput, "</td", "[COL]</td")
vOutput := StrReplace(vOutput, "<h1", "[HDR1]<h1")
vOutput := StrReplace(vOutput, "<h2", "[HDR2]<h2")
vOutput := StrReplace(vOutput, "<h3", "[HDR3]<h3")
vOutput := StrReplace(vOutput, "<h4", "[HDR4]<h4")

;replace 'br' tags
vOutput := StrReplace(vOutput, "<br>", "`r`n")

;diagnostic: check for unexpected tags
if 0
{
	vListX := "</a>,</b>,</body>,</caption>,</code>,</dd>,</div>,</dl>,</dt>,</em>,</h1>,</h2>,</h3>,</h4>,</head>,</html>,</i>,</li>,</meta>,</ol>,</p>,</pre>,</s>,</script>,</small>,</span>,</strong>,</style>,</sup>,</table>,</td>,</th>,</title>,</tr>,</u>,</ul>"
	Loop, Parse, vListX, `,
		vOutput := StrReplace(vOutput, A_LoopField, "")
	Clipboard := vOutput
	MsgBox
}

;remove tags
vOutput := RegExReplace(vOutput, "s)<title.*?>.+?</title>" , "")
vOutput := RegExReplace(vOutput, "s)<style.*?>.+?</style>" , "")
vOutput := RegExReplace(vOutput, "s)<.+?>" , "")

;replace character entities
;(note: '&object;' appears in the text but it is not a character)
Loop, Parse, vList, `n
{
	vTemp := A_LoopField
	StringSplit, vTemp, vTemp, %A_Tab%
	vOutput := StrReplace(vOutput, vTemp2, Chr(vTemp1))
}
vOutput := StrReplace(vOutput, "&amp;", "&")

;diagnostic: check for unexpected character entities
if 0
{
	Clipboard := vOutput
	MsgBox
}

;replace NBSPs
vOutput := StrReplace(vOutput, Chr(160), " ")
;replace tabs
vOutput := StrReplace(vOutput, "`t", A_Space A_Space)
;trim trailing spaces
vOutput := RegExReplace(vOutput, "m) +$", "")
;trim leading CRLFs
vOutput := RegExReplace(vOutput, "^(`r`n)+", "")
;trim multiple blank lines (CRLFCRLFCRLF to CRLFCRLF)
vOutput := RegExReplace(vOutput, "(`r`n){3,}", "`r`n`r`n")

if (vVersion = "")
	Clipboard := vOutput
else
	Clipboard := vBarrier "`r`n`r`nAutoHotkey v" vVersion "`r`n`r`n" vOutput
MsgBox % "done"
return

;==================================================
Current discrepancies (major):
[numbered lists are shown with bullet points but without numbers]
['li' elements should show bullet points/numbers based on whether they are inside an 'ol'/'ul' element]
[I would be interested in the best approach for this, possibly RegEx]
e.g. docs/Compat.htm

Current discrepancies (minor):
[no differentiation between bullet points and white bullet points]
[• BULLET, Chr(8226)]
[◦ WHITE BULLET, Chr(9702)]
e.g. docs/Compat.htm
[no indent for certain boxes]
e.g. docs/Functions.htm
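
One possible approach to the 'ol'/'ul' discrepancy is sketched below. It uses RegEx to step through the list tags while tracking the open lists in a stack. HtmListMarkersAdd is a hypothetical name and is not part of the script above. The sketch assumes well-formed lists, ignores attributes such as 'start', and treats any 'ul' nested inside another list as using white bullets, which only approximates how a browser renders them.

```autohotkey
;sketch: add list markers to 'li' elements based on the enclosing 'ol'/'ul'
;HtmListMarkersAdd is a hypothetical function name
;assumption: it is called before the tags are stripped,
;and instead of the line that adds Chr(8226) to every '<li>'
HtmListMarkersAdd(vHtml)
{
	vOutput := ""
	oStack := [] ;one item per open list: {type:"ol"/"ul", num:item count}
	vPos := 1
	;find opening/closing 'ol'/'ul'/'li' tags
	while (vPos2 := RegExMatch(vHtml, "i)<(/?)(ol|ul|li)\b[^>]*>", vMatch, vPos))
	{
		;copy the text before the tag, and the tag itself
		vOutput .= SubStr(vHtml, vPos, vPos2-vPos) vMatch
		vPos := vPos2 + StrLen(vMatch)
		vTag := Format("{:L}", vMatch2)
		if (vTag != "li")
		{
			if (vMatch1 = "") ;opening 'ol'/'ul' tag
				oStack.Push({type:vTag, num:0})
			else if (oStack.Length()) ;closing 'ol'/'ul' tag
				oStack.Pop()
		}
		else if ((vMatch1 = "") && oStack.Length()) ;opening 'li' tag inside a list
		{
			oList := oStack[oStack.Length()]
			oList.num := oList.num + 1
			if (oList.type = "ol")
				vOutput .= oList.num ". "
			else
				vOutput .= Chr(oStack.Length() = 1 ? 8226 : 9702) " "
		}
	}
	;copy any text after the last tag
	return vOutput SubStr(vHtml, vPos)
}
```

Usage would be something like vOutput := HtmListMarkersAdd(vOutput) early in stage 2, with the existing '<li>' replacement removed so that items are not marked twice.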

[EDIT:]
In summary:
- Different methods gave slightly different htm to txt results; no single approach had all the best features.
- In the end I used RegEx to remove all html tags, leaving plaintext, and compared the plaintext's appearance with the htm's appearance. Then I added a few code adjustments relating to specific tags, which made the plaintext's appearance more like the htm's.
jeeswg

Re: htm to txt, AutoHotkey Help (chm file) to txt

21 Dec 2017, 19:01

Here's an example where I retrieved the text from the AHK v1 and v2 documentation, and looked for inconsistent capitalisation of mixed case words e.g. 'Lock', 'Cdecl', 'Numpad'.

Code:

q:: ;check AHK v1/v2 documentation for inconsistent capitalisation of mixed case words
;check for any mixed case words, and then check whether they are always capitalised consistently
;assumes that the AutoHotkey.chm files have been decompiled to folders called 'AutoHotkey' (e.g. via 7-Zip or HTML Help) (i.e. the chm file is split into htm files)
vDir1 := A_Desktop "\AutoHotkey_1.1.26.01\AutoHotkey"
vDir2 := A_Desktop "\AutoHotkey_2.0-a081-cad307c\AutoHotkey"

vOutput := ""
VarSetCapacity(vOutput, 1000000*2)
Loop, 2
	Loop, Files, % vDir%A_Index% "\*.htm", FR
	{
		vPath := A_LoopFileFullPath
		FileRead, vText, % vPath
		vText := JEE_StrHtmlToTextCustom(vText)
		vOutput .= vText "`r`n"
	}

;vList := "!""#$%&'()*+,-./0123456789:;<=>?@[\]^_``{|}~€…•–¢£¤¥¦§©ª«®°±µ¶º↑→∞�`r`n"
vList := "!""#$%&'()*+,-./0123456789:;<=>?@[\]^``{|}~€…•–¢£¤¥¦§©ª«®°±µ¶º↑→∞�`r`n" ;_ removed
Loop, Parse, vList
	vOutput := StrReplace(vOutput, A_LoopField, " ")

oArrayX := {}
Loop, Parse, vOutput, % " "
	if (JEE_StrGetCase(A_LoopField) = "X")
		oArrayX["z" A_LoopField] := 1

vSCS := A_StringCaseSense
StringCaseSense, On
oArray := {}
Loop, Parse, vOutput, % " "
{
	if !oArrayX["z" A_LoopField]
		continue
	if A_LoopField in % oArray["z" A_LoopField]
		continue
	oArray["z" A_LoopField] .= "," A_LoopField
}
StringCaseSense, % vSCS

vOutput2 := ""
VarSetCapacity(vOutput2, StrLen(vOutput)*2)
for _, vValue in oArray
	if InStr(SubStr(vValue, 2), ",")
		vOutput2 .= SubStr(vValue, 2) "`r`n"
Clipboard := vOutput2
oArray := oArrayX := ""
MsgBox, % "done"
return

;==================================================

JEE_StrHtmlToTextCustom(vOutput)
{
	;38	&amp; (handled separately, after the entity loop)
	static vList := ""
	. "`n" "34	&quot;"
	. "`n" "60	&lt;"
	. "`n" "62	&gt;"
	. "`n" "224	&agrave;"
	. "`n" "129	&#129;"
	. "`n" "160	&nbsp;"
	. "`n" "162	&cent;"
	. "`n" "163	&pound;"
	. "`n" "164	&curren;"
	. "`n" "165	&yen;"
	. "`n" "166	&brvbar;"
	. "`n" "167	&sect;"
	. "`n" "169	&copy;"
	. "`n" "170	&ordf;"
	. "`n" "171	&laquo;"
	. "`n" "174	&reg;"
	. "`n" "176	&deg;"
	. "`n" "177	&plusmn;"
	. "`n" "181	&micro;"
	. "`n" "182	&para;"
	. "`n" "186	&ordm;"
	. "`n" "196	&Auml;"
	. "`n" "212	&Ocirc;"
	. "`n" "220	&Uuml;"
	. "`n" "252	&uuml;"
	. "`n" "8211	&ndash;"
	. "`n" "8230	&#8230;"
	. "`n" "8364	&#8364;"
	. "`n" "8364	&euro;"
	. "`n" "8593	&uarr;"
	. "`n" "8594	&rarr;"
	. "`n" "8734	&#x221e;"
	. "`n" "65533	&#65533;"

	;put line breaks after certain elements
	vOutput := StrReplace(vOutput, "</h1>", "</h1>`r`n")
	vOutput := StrReplace(vOutput, "</h2>", "</h2>`r`n")
	vOutput := StrReplace(vOutput, "</h3>", "</h3>`r`n")
	vOutput := StrReplace(vOutput, "</h4>", "</h4>`r`n")
	vOutput := StrReplace(vOutput, "</p>", "</p>`r`n")
	vOutput := StrReplace(vOutput, "</pre>", "</pre>`r`n")
	vOutput := StrReplace(vOutput, "</ul>", "</ul>`r`n")

	;add bullet points to 'li' elements
	vOutput := StrReplace(vOutput, "<li>", "<li>" Chr(8226))

	;indicate headers and columns
	vOutput := StrReplace(vOutput, "</td", "[COL]</td")
	vOutput := StrReplace(vOutput, "<h1", "[HDR1]<h1")
	vOutput := StrReplace(vOutput, "<h2", "[HDR2]<h2")
	vOutput := StrReplace(vOutput, "<h3", "[HDR3]<h3")
	vOutput := StrReplace(vOutput, "<h4", "[HDR4]<h4")

	;replace 'br' tags
	vOutput := StrReplace(vOutput, "<br>", "`r`n")

	;diagnostic: check for unexpected tags
	if 0
	{
		vListX := "</a>,</b>,</body>,</caption>,</code>,</dd>,</div>,</dl>,</dt>,</em>,</h1>,</h2>,</h3>,</h4>,</head>,</html>,</i>,</li>,</meta>,</ol>,</p>,</pre>,</s>,</script>,</small>,</span>,</strong>,</style>,</sup>,</table>,</td>,</th>,</title>,</tr>,</u>,</ul>"
		Loop, Parse, vListX, % ","
			vOutput := StrReplace(vOutput, A_LoopField, "")
		Clipboard := vOutput
		;MsgBox()
	}

	;remove tags
	vOutput := RegExReplace(vOutput, "s)<title.*?>.+?</title>" , "")
	vOutput := RegExReplace(vOutput, "s)<style.*?>.+?</style>" , "")
	vOutput := RegExReplace(vOutput, "s)<.+?>" , "")

	;replace character entities
	;(note: '&object;' appears in the text but it is not a character)
	Loop, Parse, vList, `n
	{
		oTemp := StrSplit(A_LoopField, "`t")
		vOutput := StrReplace(vOutput, oTemp.2, Chr(oTemp.1))
	}
	vOutput := StrReplace(vOutput, "&amp;", "&")

	;diagnostic: check for unexpected character entities
	if 0
	{
		Clipboard := vOutput
		;MsgBox()
	}

	;replace NBSPs
	vOutput := StrReplace(vOutput, Chr(160), " ")
	;replace tabs
	vOutput := StrReplace(vOutput, "`t", A_Space A_Space)
	;trim trailing spaces
	vOutput := RegExReplace(vOutput, "m) +$", "")
	;trim leading CRLFs
	vOutput := RegExReplace(vOutput, "^(`r`n)+", "")
	;trim multiple blank lines (CRLFCRLFCRLF to CRLFCRLF)
	vOutput := RegExReplace(vOutput, "(`r`n){3,}", "`r`n`r`n")

	return vOutput
}

;==================================================

JEE_StrGetCase(ByRef vText)
{
	if (vText = "")
		return "Z"
	else if (vText == Format("{:L}", vText))
		return "L"
	else if (vText == Format("{:T}", vText))
		return "T"
	else if (vText == Format("{:U}", vText))
		return "U"
	else
		return "X"
}

;==================================================
And here are the results:

Code:

A_AhkPath,A_AHKPath
A_AhkVersion,A_AHKVersion,a_AhkVersion
A_Index,a_index
A_LoopField,a_loopfield
A_LoopFileName,a_loopfilename
A_LoopRegName,a_LoopRegName
A_LoopRegType,a_LoopRegType
A_PriorHotkey,A_PriorHotKey
A_ScriptName,A_SCRIPTNAME
A_Space,A_SPACE,a_space
A_ThisHotkey,A_ThisHotKey
A_TickCount,A_TICKCOUNT
abc,ABC,abC,AbC
abcXYZ,abcxyz
activeHwnd,ActiveHwnd
ATan,atan
AutoHotkey,autohotkey,AutoHotKey
AutoSize,autosize
ByRef,byref
CapsLock,Capslock
Cdecl,CDecl
checkbox,Checkbox,CheckBox
ClassName,classname
classvar,ClassVar
ClipboardTimeout,ClipboardTimeOut
ComCtl,Comctl
ComSpec,comspec,COMSPEC,Comspec
context,Context,ConTEXT
CurrentDate,currentdate
datetime,DateTime
DBGp,DBGP,dbgp
doubleup,DoubleUp
Driveletter,DriveLetter
DriveSpaceFree,DrivespaceFree
EndKey,endkey
ExitApp,Exitapp
exstyle,ExStyle
extension,eXtension
filename,Filename,FileName
FileSize,filesize
FuncObj,funcobj
GitHub,github
HBRUSH,hBrush
hDC,HDC
hicon,hIcon,HICON
Hn,hN,hn
Hotkey,hotkey,HOTKEY,HotKey
IconN,Iconn
ifEqual,IfEqual
IfMsgBox,ifMsgBox
IfWinExist,ifWinExist
InputBox,Inputbox
InStr,inStr
ipaddress,IPAddress
JoyR,joyr
JoyU,joyu
JoyV,joyv
JoyX,joyx
JoyY,joyy
JoyZ,joyz
keyname,KeyName
ListBox,listbox,LISTBOX
ListView,Listview
loops,LOOPs,Loops
lParam,LPARAM,LParam
LWin,lwin,Lwin
MenuItems,menuitems
MsgBox,msgbox,Msgbox
MSPaint,Mspaint,MsPaint,mspaint
My,my,MY,mY
MyArray,myArray
MyDLL,MyDll
MyGui,myGui
MyObject,myObject
MyScript,myscript
nfile,nFILE
nline,nLine
ntext,nText
numeric,Numeric,NumEric,numEric
Numlock,NumLock
Numpad,numpad,NumPad
NumpadAdd,NumPadAdd
NumpadClear,NumPadClear
NumpadDel,NumPadDel
NumpadDiv,NumPadDiv
NumpadDot,NumPadDot
NumpadDown,NumPadDown
NumpadEnd,NumPadEnd
NumpadHome,NumPadHome
NumpadIns,NumPadIns
NumpadLeft,NumPadLeft
NumpadMult,NumPadMult
NumpadPgDn,NumPadPgDn
NumpadPgUp,NumPadPgUp
NumpadRight,NumPadRight
NumpadSub,NumPadSub
NumpadUp,NumPadUp
objaddref,ObjAddRef
OnOff,ONOFF
OutputVar,Outputvar
overwrite,Overwrite,OverWrite
ParentGui,ParentGUI
RAMDISK,RAMDisk
Readme,ReadMe,README
READONLY,ReadOnly
regex,RegEx
ResponseText,responseText
ReturnValue,returnValue
RWin,Rwin
SafeArray,SAFEARRAY
SB_SetIcon,SB_SETICON
SciTE,scite
ScrollLock,Scrolllock
SendInput,sendinput
SendPlay,sendPlay
SetCapslockState,SetCapsLockState
SetNumlockState,SetNumLockState
someorg,SomeOrg
stdin,StdIn
stdout,StdOut
SubKey,subkey,Subkey
submenu,Submenu,SubMenu
SysMenu,Sysmenu
TextPad,Textpad
ToolTip,tooltip,Tooltip
ToolTips,tooltips
TreeView,Treeview
UInt,uint,Uint
UintP,UIntP
unpause,Unpause,UnPause
URLs,URLS
UserName,Username,username
VarType,VARTYPE
vary,VarY
VKnn,vkNN
vMyCheckbox,vMyCheckBox
vtable,VTable
WebBrowser,webBrowser
WheelDown,wheeldown
WheelUp,wheelup
WinActivate,winactivate
WinLIRC,winlirc
winnt,WinNT
Wn,wN,wn
wordpad,WordPad,Wordpad
wParam,WPARAM
wstr,WStr
xb,xB
xC,XC
xe,xE
xf,xF
xFF,xff
xFFFFFFFF,xffffffff
