Reads MS Word (docx) Fast

IMEime

20 Mar 2017, 12:26

I wrote a short script for fun.

It reads the contents of a docx file
(plain text strings only).

First of all,
I made a Testing.docx file.
It has 100,000 lines (sentences).
Each line has 10 random alphanumeric characters.
For instance, "d4j9415852".
All of the lines are unique (no duplicates).
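
A test file like the one described above could be produced roughly as follows. This is only a sketch under my own assumptions, not how the original file was made: `MakeUniqueLines` is a hypothetical helper name, the character set is lowercase letters plus digits, and the output is a plain `Testing.txt` that would still have to be opened in Word and saved as docx.

```autohotkey
; Hypothetical helper (not from the original post): returns `count` unique
; lines, each made of `length` random lowercase alphanumeric characters.
MakeUniqueLines(count, length := 10)
{
	chars := "abcdefghijklmnopqrstuvwxyz0123456789"
	seen := {}	; lines generated so far, used to reject duplicates
	made := 0
	While (made < count)
	{
		line := ""
		Loop, % length
		{
			Random, idx, 1, % StrLen(chars)
			line .= SubStr(chars, idx, 1)
		}
		if seen.HasKey(line)	; skip duplicates so every line stays unique
			continue
		seen[line] := true
		lines .= line "`n"
		made++
	}
	return lines
}

; Assumed output location: next to the script
FileAppend, % MakeUniqueLines(100000), % A_ScriptDir "\Testing.txt"
```

The uniqueness check costs some memory, but with 36^10 possible lines a collision is rare, so the loop rarely has to retry.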

Next,
I extracted its contents into a txt file.

When I ran the COM (Word.Application) code,
it took 3.4 seconds.

The "Fast" code took only 0.34 seconds.

OMG, it is 10 times faster.
That is good, real good !!!

It is only a rough sketch, but it works.

Regards


Faster !!

myWord := A_ScriptDir "\Testing.docx"	; full path required (assumes the file sits next to the script)
startTime  :=  A_TickCount
tempFolder := RegExReplace( myWord, ".*\K\\.*") "\_Word_UnZip"	; folder of myWord plus a temporary subfolder
tempName := RegExReplace( myWord, "i)\.docx$") ".zip"
FileCopy, % myWord , % tempName	; a docx is a zip archive; the shell opens it once it has a .zip name
FileCreateDir, % tempFolder
tempObject := ComObjCreate("Shell.Application")
tempObject.Namespace(tempFolder).CopyHere( tempObject.Namespace(tempName).items, 4|16)	; 4 = no progress dialog, 16 = answer "Yes to All"
FileDelete, % tempName
FileEncoding, UTF-8
FileRead, wordContents, % tempFolder "\word\document.xml"
; collect the text of every plain <w:t> element, one per line
While @ := RegExMatch( wordContents, "<w:t>(.+?)</w:t>", _, @ ? StrLen(_) + @ : 1 )
	myContent .= _1 "`n"
FileRemoveDir, % tempFolder, 1
resultFile := RegExReplace( myWord, "i)\.docx$") "_Extracted.txt"
FileAppend,% myContent, % resultFile
MsgBox % A_TickCount - startTime	; elapsed milliseconds

Slower..

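startTime  :=  A_TickCount
_ := ComObjCreate( "Word.Application" )
_.Documents.Open( A_ScriptDir "\Testing.docx" )	; Word needs a full path
_.ActiveDocument.SaveAs( A_ScriptDir "\_Extracted.txt", 2 )	; arguments are FileName, FileFormat; 2 = wdFormatText
_.ActiveDocument.Saved := 1	; mark as saved so Quit does not prompt
_.Quit
MsgBox % A_TickCount - startTime	; elapsed milliseconds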

When you use someone else's script, please leave a brief comment saying thank you.
(Translated from Korean:) When you use someone else's script, please leave at least a minimal word of thanks. And stop the shameless stealing..
IMEime

Re: Reads MS Word (docx) Fast

20 May 2017, 04:30

Added another RegEx.
It is far from perfect, but it is my best and final version.

myPath := A_ScriptDir "\Testing.docx"	; full path required (assumes the file sits next to the script)
tmpFolder := RegExReplace( myPath, ".*\K\\.*") "\_Word_UnZip"
tmpName := myPath ".zip"
FileCopy, % myPath , % tmpName
FileCreateDir, % tmpFolder
tmpObject := ComObjCreate("Shell.Application")
tmpObject.Namespace(tmpFolder).CopyHere(tmpObject.Namespace(tmpName).items,4|16)
FileDelete, % tmpName
FileEncoding, UTF-8
FileRead, wordContents, % tmpFolder "\word\document.xml"
FileRemoveDir, % tmpFolder, 1
; <w:p> is a paragraph and <w:t> a text run; both patterns allow attributes on the opening tag
paraPattern := "<w:p(?: [^>]*)?>(.*?)</w:p>"
textPattern := "<w:t(?: [^>]*)?>(.*?)</w:t>"
While singlePara := RegExMatch( wordContents, paraPattern, myPara, singlePara ? StrLen(myPara)+singlePara:1)
{
	; join all text runs of this paragraph, then end the line
	While singleText := RegExMatch( myPara1, textPattern, myText, singleText ? StrLen(myText)+singleText:1)
		finalContents .= myText1
	finalContents .= "`n"
}
MsgBox % finalContents
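
To see what the two patterns do without unzipping anything, they can be tried on a small hand-written XML string. The sketch below is illustrative only: `ExtractText` is a hypothetical wrapper name, and `sampleXml` is a made-up fragment, much simpler than a real `word\document.xml`.

```autohotkey
; Made-up sample: two paragraphs; the first is split into two text runs,
; and some opening tags carry attributes.
sampleXml := "<w:p><w:r><w:t>Hello </w:t></w:r><w:r><w:t xml:space=""preserve"">world</w:t></w:r></w:p>"
sampleXml .= "<w:p w:rsidR=""00A1""><w:r><w:t>Second line</w:t></w:r></w:p>"
MsgBox % ExtractText(sampleXml)

; Hypothetical wrapper around the two patterns from the post above
ExtractText(xml)
{
	paraPattern := "<w:p(?: [^>]*)?>(.*?)</w:p>"
	textPattern := "<w:t(?: [^>]*)?>(.*?)</w:t>"
	paraPos := 1
	While paraPos := RegExMatch(xml, paraPattern, para, paraPos)
	{
		textPos := 1
		While textPos := RegExMatch(para1, textPattern, text, textPos)
		{
			result .= text1	; append the run's text without a separator
			textPos += StrLen(text)	; continue after this match
		}
		result .= "`n"	; one output line per paragraph
		paraPos += StrLen(para)
	}
	return result
}
```

The runs of one paragraph are joined on a single line, so the message box should show "Hello world" and "Second line" on separate lines. XML entities such as `&amp;` are not decoded by either pattern.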
