Source link: http://archive.mises.org/9526/help-us-with-the-site/

Help us with the site

March 2, 2009 by

Our greatest continuing need at Mises.org is volunteer help with turning PDF into HTML. As you know, the literature section is packed with great stuff, much of which needs to be run in pieces in HTML. But we have a tiny staff, and this work takes time. You have to pull the PDF, export it to text, clean it up, and generate an HTML file. This process can’t really be automated in any serious way. So in hopes of soliciting help, we’ve created a wiki with our wish list. Have a look if you are interested. And thank you so much!
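For volunteers wondering what the mechanical part of that work looks like, here is a minimal sketch in Python, assuming the text has already been exported from the PDF. The function name text_to_html and the page layout are made up for illustration and are not the site's actual tooling. It only rejoins hyphenated line breaks, splits paragraphs on blank lines, and wraps them in tags; the proofreading still has to be done by a person.

```python
import html
import re


def text_to_html(raw_text, title):
    """Turn exported plain text into a minimal HTML page (illustrative only)."""
    # Normalize line endings, then split into paragraphs on blank lines.
    text = raw_text.replace("\r\n", "\n").strip()
    blocks = re.split(r"\n\s*\n", text)

    paragraphs = []
    for block in blocks:
        # Rejoin words hyphenated across a line break: "purpose-\nful" -> "purposeful".
        # Caution: this also joins genuine compounds that happen to break at the hyphen.
        block = re.sub(r"(\w)-\n(\w)", r"\1\2", block)
        # Collapse remaining line breaks and runs of whitespace into single spaces.
        block = " ".join(block.split())
        if block:
            # Escape &, < and > so the text cannot break the markup.
            paragraphs.append("<p>" + html.escape(block) + "</p>")

    body = "\n".join(paragraphs)
    return (
        "<!DOCTYPE html>\n<html>\n"
        "<head><title>" + html.escape(title) + "</title></head>\n"
        "<body>\n" + body + "\n</body>\n</html>\n"
    )
```

Calling text_to_html(exported_text, "Some Title") returns the page as a string. Headings, footnotes, and tables are not handled here and would still need manual markup.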

{ 9 comments }

Denis March 2, 2009 at 9:32 am

Why can’t this be automated? There are quite a few tools and classes out there that let you translate PDF to HTML. Surely there must be one somewhere that is compatible with an ASP server, no?

JD March 2, 2009 at 10:24 am

Adobe has a conversion tool for this purpose, which can be found here:

http://www.adobe.com/products/acrobat/access_onlinetools.html

I am also curious as to why you stated that the PDF files cannot be converted via a parser. Does it have anything to do with the fact that some of the PDFs are optical scans from books?

hz March 2, 2009 at 10:24 am

Denis – not really speaking for mises.org, but most likely the reason it can’t be fully automated is that they’re trying to digitize scanned documents, and the OCR is never perfect.

Something like pgdp.net is the best you can do, which is not exactly automatic but does distribute the load. Mises.org might benefit from checking out some of their pre-processing scripts.

Having done quite a bit of this for my own use, I have found some consistent OCR errors that can be caught and fixed relatively easily… and something like aspell list will produce a list of “bad” words that you can scan to fix misspelled “keywords” for better search results.

But if you want a perfect document you pretty much have to go through it line by line sooner or later.
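As a rough sketch of that kind of pre-processing, assuming Python and a plain word list in place of aspell: the substitutions in OCR_FIXES are hypothetical examples, and the real ones depend on the scan and the OCR engine. The second function plays the role of the “bad” word list, reporting every word that is not in a known-word set so a person can review it.

```python
import re

# Hypothetical substitutions for illustration; a real table would be built
# from the errors actually observed in a given batch of scans.
OCR_FIXES = [
    (re.compile(r"\btlie\b"), "the"),
    (re.compile(r"\bTlie\b"), "The"),
    (re.compile(r"\bwliich\b"), "which"),
]


def apply_fixes(text):
    """Apply each known OCR substitution to the whole text."""
    for pattern, replacement in OCR_FIXES:
        text = pattern.sub(replacement, text)
    return text


def unknown_words(text, known_words):
    """Return a sorted list of words not in known_words.

    Words are compared in lower case, so known_words should be lower case.
    Only ASCII letters and straight apostrophes are treated as word characters.
    """
    words = re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", text)
    return sorted({w.lower() for w in words} - set(known_words))
```

The list that comes back is only a set of candidates for review. Proper names and technical terms will show up in it alongside genuine OCR errors.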

Diakrisis Logismōn March 2, 2009 at 12:09 pm

I have used Read I.R.I.S. OCR software to convert PDF to html, with very good results :0)

Huang Di March 2, 2009 at 6:26 pm

I missed the explanation for the NEED for HTML conversion of the PDFs …

If a document is more than 5 pages long, converting it to HTML will do no good: people who do not read PDFs won’t read lengthy HTML pages either (plus, staring at a screen is bad for the eyes) …

Maybe it is out of concern about application switching? PDF readers already incorporate, or will soon incorporate, note taking and page bookmarks, making them definitely more usable than HTML pages.

Peter March 2, 2009 at 8:37 pm

Huang Di: one advantage of conversion to HTML is easier conversion from there to epub (which is actually just restricted XHTML plus some additional bits); people who won’t read PDFs won’t read HTML (or epub) either, but people who will read PDFs might well prefer to read in reflowable HTML/epub format on a hand-held eInk device – I know I would…not to mention reducing the download size by a factor of 50 or more.

hz March 2, 2009 at 9:25 pm

While there may not be a need for HTML specifically, I can definitely see the need for getting publications out of the fixed page format and into a structured format that is easy to parse.

Off the top of my head:
1. lower bandwidth
2. device and software independence
3. somewhat easier / more accurate indexing
4. ease of conversion to whatever your needs might be

If it were me, I would use a lightweight markup language as my base format rather than HTML (reStructuredText with Python’s Docutils, or SiSU, is probably the best for this application). From there you can turn out HTML, Palm Doc, nicely formatted PDF with LaTeX, ODF, or simply read the markup as is in plain text.

Peter March 3, 2009 at 6:43 am

Bah. reStructuredText is OK for formatting docstrings without markup, but nowhere near suitable for real work. XML, please (hence XHTML; avoid HTML that doesn’t parse as valid XML, too).
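One way to check that a page parses as XML, sketched with Python's standard library (the function name is_well_formed is made up for illustration). Note that this tests well-formedness only, not validity against the XHTML DTD, and that named entities such as &nbsp; are rejected by this parser because no DTD is loaded.

```python
import xml.etree.ElementTree as ET


def is_well_formed(markup):
    """Return True if the markup parses as XML, False otherwise."""
    try:
        ET.fromstring(markup)
    except ET.ParseError:
        return False
    return True
```

For example, a page that writes a line break as <br/> passes, while the same page with an unclosed <br> fails.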

hz March 3, 2009 at 8:29 am

I dunno, I use reStructuredText quite regularly for these types of documents (i.e. relatively straightforward text with sections and footnotes and a table or two), and it works very well for turning out XHTML and PDF quickly with no hassle. It is definitely easier for humans to read and write than XML. In fact, in my technical writing I used to mark up in XML as I wrote, but it got so distracting that I switched to something that felt more natural.

But I admit that XML might be “better” in that if marked up right it’s more explicit… it’s just uglier (and XSLT for conversion also stinks.) Not saying you’re wrong, just thinking of what is easy for volunteers to do.

Comments on this entry are closed.
