Pdftk - the pdf toolkit [CMD]


32 replies to this topic
corrupt
  • Members
  • 2558 posts
  • Last active: Nov 01 2014 03:23 PM
  • Joined: 29 Dec 2004
Hi new,
Have a look at Loop, FilePattern in the AutoHotkey Help Documentation. You could use the Loop command to find all *.pdf files and use the Run or RunWait command to automate processing the files one by one.
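For readers working outside AutoHotkey, the same idea (find every *.pdf, then run a tool on each file and wait for it to finish) can be sketched in Python. This is only an illustration: the helper names `pdf_commands`, `run_all` and `dump_data` are made up here, and the per-file action (pdftk's `dump_data` operation) is an assumed example, since the thread does not say which pdftk operation was wanted.

```python
import subprocess
from pathlib import Path


def pdf_commands(folder, tool_args):
    """Return one command (argument list) per *.pdf in folder, sorted by name.

    tool_args is a function that maps a PDF path to an argument list.
    """
    return [tool_args(pdf) for pdf in sorted(Path(folder).glob("*.pdf"))]


def run_all(commands):
    """Run the commands one by one, waiting for each to finish (like RunWait)."""
    for cmd in commands:
        subprocess.run(cmd, check=True)


def dump_data(pdf):
    """Assumed example action: write each PDF's metadata report next to it."""
    return ["pdftk", str(pdf), "dump_data", "output", str(pdf.with_suffix(".txt"))]


# Usage (requires pdftk on the PATH):
#   run_all(pdf_commands(r"C:\docs", dump_data))
```

Building the command list separately from running it makes it easy to print the commands first and check them before anything is executed.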

PinkBears
  • Members
  • 1 posts
  • Last active: May 20 2008 01:29 PM
  • Joined: 20 May 2008
Hi there,

Further to new's problem and corrupt's answer, I was just wondering whether the answer worked for new?

I have the same problem and was wondering whether it is worth trying out or not.
| - PinkBears

coma
  • Guests
  • Last active:
  • Joined: --
Why don't you try it and see?

automaticman
  • Members
  • 658 posts
  • Last active: Nov 20 2012 06:10 PM
  • Joined: 27 Oct 2006
I was looking for a tool to split bigger pdf documents into smaller chunks like e.g. 50 pages each and this might help. Great, thanks for the link.

(For those interested: some applications don't accept larger .pdf documents as input, so the solution is to make smaller chunks and convert each one separately, working around the tool's size limitation.)
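A rough sketch of the chunking step, assuming pdftk's `cat` operation with page ranges is used for the split. The function names and the output naming scheme (`big_001.pdf`, `big_002.pdf`, ...) are invented for illustration, and the total page count has to be supplied by the caller (pdftk's `dump_data` report includes a `NumberOfPages` line that could provide it).

```python
def chunk_ranges(total_pages, chunk_size=50):
    """Return inclusive (first, last) page ranges covering pages 1..total_pages."""
    if total_pages < 1 or chunk_size < 1:
        raise ValueError("total_pages and chunk_size must be positive")
    return [
        (start, min(start + chunk_size - 1, total_pages))
        for start in range(1, total_pages + 1, chunk_size)
    ]


def split_commands(pdf, total_pages, chunk_size=50):
    """Build one 'pdftk <pdf> cat <first>-<last> output <chunk>' command per range."""
    stem = pdf[:-4] if pdf.lower().endswith(".pdf") else pdf
    return [
        ["pdftk", pdf, "cat", f"{first}-{last}", "output", f"{stem}_{i:03d}.pdf"]
        for i, (first, last) in enumerate(chunk_ranges(total_pages, chunk_size), start=1)
    ]


# Usage: print the commands for a 120-page document before running them.
#   for cmd in split_commands("big.pdf", 120):
#       print(" ".join(cmd))
```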

rani
  • Members
  • 217 posts
  • Last active: Jul 21 2016 12:53 PM
  • Joined: 18 Mar 2008
pdf to html

Is there an option in pdftk to convert PDF to HTML?

For PDF to text I found a tool in the xpdf package, and it works OK.

If not, is there another command-line tool that does this?

SoLong&Thx4AllTheFish
  • Members
  • 4999 posts
  • Last active:
  • Joined: 27 May 2007
Based on XPDF,
http://pdftohtml.sourceforge.net/

rani
  • Members
  • 217 posts
  • Last active: Jul 21 2016 12:53 PM
  • Joined: 18 Mar 2008
Hi HugoV

I tried to run pdftohtml but got an error:
Page-1
'gswin32c' is not recognized as an internal or external command,
operable program or batch file.
Error: Failed to launch Ghostscript!

It seems pdftohtml.exe alone is not enough, or some other software is missing.

Do you know of another pdftohtml command-line tool?
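The error above means the shell could not find `gswin32c` on the PATH. A small check like the following can confirm which helper programs are missing before a conversion is attempted. This is a sketch: `missing_tools` is a made-up helper name, and the list of tools to check is whatever the conversion actually depends on.

```python
import shutil


def missing_tools(names):
    """Return the program names that cannot be found on the current PATH."""
    return [name for name in names if shutil.which(name) is None]


# Usage: report what still needs to be installed or added to PATH.
#   print(missing_tools(["pdftohtml", "gswin32c"]))
```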

SoLong&Thx4AllTheFish
  • Members
  • 4999 posts
  • Last active:
  • Joined: 27 May 2007
Ghostscript
http://pages.cs.wisc.edu/~ghost/
(I've used it and it works, but you may have to work at it)

If I recall correctly, a "pdftohtml" is also included with the Google Desktop Search application (at least it was at some point; I don't know if that is still the case, as I don't use it). If you have it installed, look for pdf*.exe in the Google Desktop directories; it should be there somewhere.

Note: if you want to work with pdfs:
- get pdftk
- get xpdf
- get pdftohtml
- get ghostscript
- get PDFCreator

rani
  • Members
  • 217 posts
  • Last active: Jul 21 2016 12:53 PM
  • Joined: 18 Mar 2008
PDF creator

Does PDFCreator convert HTML to PDF as tagged or untagged PDF?
Or does it convert to PDF as an image?

Meaning:
can I extract text from the created PDF?

SoLong&Thx4AllTheFish
  • Members
  • 4999 posts
  • Last active:
  • Joined: 27 May 2007
PDFCreator converts anything you print to PDF. Yes, you can extract text later IF the source wasn't an image to begin with. Not sure what you mean by tagged, but it won't make URLs in Word documents clickable in the PDF, nor does it create PDF bookmarks or anything like that.

rani
  • Members
  • 217 posts
  • Last active: Jul 21 2016 12:53 PM
  • Joined: 18 Mar 2008
Sorry for asking again,

I still haven't found the pdf2html converter.

Is there some Google Docs API from Google
that I can download to convert my own PDFs to HTML?

SoLong&Thx4AllTheFish
  • Members
  • 4999 posts
  • Last active:
  • Joined: 27 May 2007
If you have Google desktop installed:

c:\Program Files\Google\Google Desktop Search\pdftotext.exe
(or wherever you have installed GDS)

usage:

pdftotext -htmlmeta sample.pdf
--> will generate sample.html

Following the link at the URL I gave you before leads to:
http://sourceforge.n...ects/pdftohtml/
Download the Windows binary and unpack the tar.gz file.

usage:

pdftohtml.exe sample.pdf
--> will generate 3 html files (frameset, TOC and content)

read the doc for more options

rani
  • Members
  • 217 posts
  • Last active: Jul 21 2016 12:53 PM
  • Joined: 18 Mar 2008
I tried (from the xpdf package):
pdftotext -htmlmeta sample.pdf -> sample.html

and got the same result as:
pdftotext -layout sample.pdf -> sample.txt

Meaning:
the HTML does not have the same 'look' (more or less) as the original sample.pdf;
there are no frames or colors.

rani
  • Members
  • 217 posts
  • Last active: Jul 21 2016 12:53 PM
  • Joined: 18 Mar 2008
I also tried pdftohtml.exe

and got worse results:
the extracted text is shown one line under another, without keeping any of the original layout of the .pdf

SoLong&Thx4AllTheFish
  • Members
  • 4999 posts
  • Last active:
  • Joined: 27 May 2007
What do you want, the HTML to look the same as the PDF?

Again, read the documentation, look at the options, try them, and
SEE that you can have the HTML look like the PDF, or not,
if you wish. Use the -c option:

pdftohtml -c sample.pdf
-> sample.html will look like sample.pdf (not 100%, but pretty close)
unless you have a very complicated PDF. Again, READ the documentation.
As you can see, even Google uses it, so why isn't it good enough for you :wink:

If you need something even better, or more options, you will have to buy something.

Sourceforge version:

pdftohtml version 0.39 http://pdftohtml.sourceforge.net/,
based on Xpdf version 3.00
Copyright 1999-2003 Gueorgui Ovtcharov and Rainer Dorsch
Copyright 1996-2004 Glyph & Cog, LLC

Usage: pdftohtml [options] <PDF-file> [<html-file> <xml-file>]
-f <int> : first page to convert
-l <int> : last page to convert
-q : don't print any messages or errors
-h : print usage information
-help : print usage information
-p : exchange .pdf links by .html
-c : generate complex document
-i : ignore images
-noframes : generate no frames
-stdout : use standard output
-zoom <fp> : zoom the pdf document (default 1.5)
-xml : output for XML post-processing
-hidden : output hidden text
-nomerge : do not merge paragraphs
-enc <string> : output text encoding name
-dev <string> : output device name for Ghostscript (png16m, jpeg etc)
-v : print copyright and version info
-opw <string> : owner password (for encrypted files)
-upw <string> : user password (for encrypted files)


GOOGLE version:

(Identical output: the same pdftohtml version 0.39 banner and the same option list as the Sourceforge version above.)
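The options listed above can be combined on one command line. As a small sketch, the helper below assembles a pdftohtml call from a few of them (-c, -noframes, -f, -l). The function name and its parameters are invented for illustration, and it assumes -f and -l each take a page number, as their descriptions suggest.

```python
def pdftohtml_command(pdf, complex_layout=True, frames=False, first=None, last=None):
    """Build a pdftohtml argument list from a few of the listed options."""
    cmd = ["pdftohtml"]
    if complex_layout:
        cmd.append("-c")          # generate complex document (closer to the PDF layout)
    if not frames:
        cmd.append("-noframes")   # generate no frames
    if first is not None:
        cmd += ["-f", str(first)]  # first page to convert
    if last is not None:
        cmd += ["-l", str(last)]   # last page to convert
    cmd.append(pdf)
    return cmd


# Usage (requires pdftohtml on the PATH):
#   import subprocess
#   subprocess.run(pdftohtml_command("sample.pdf", first=1, last=3), check=True)
```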