Our greatest continuing need at Mises.org is volunteer help with turning PDFs into HTML. As you know, the literature section is packed with great stuff, so much of which needs to be run in pieces in HTML. But we have a tiny staff, and this work takes time. You have to pull the PDF, export it to text and clean it up, and generate an HTML file. This process can’t really be automated in any serious way. So in hopes of soliciting help, we’ve created a wiki with our wishes up there. Have a look if you are interested. And thank you so much!
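To give a sense of what the clean-up step involves, here is a minimal Python sketch of the text-to-HTML part of the process. It is an illustration only, not the procedure actually used for the site: the function name `text_to_html` and the rules it applies (rejoin words split across line breaks, treat blank lines as paragraph breaks) are assumptions made for the example.

```python
import html
import re


def text_to_html(raw: str, title: str) -> str:
    """Turn text exported from a PDF into a minimal HTML page (hypothetical helper)."""
    # Rejoin words hyphenated across a line break: "purpose-\nful" -> "purposeful".
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", raw)
    # Treat one or more blank lines as a paragraph break.
    paragraphs = re.split(r"\n\s*\n", text)
    body = []
    for para in paragraphs:
        # Collapse hard line wraps and runs of whitespace inside a paragraph.
        para = " ".join(para.split())
        if para:
            # Escape &, <, > and quotes so the text is safe inside HTML.
            body.append(f"<p>{html.escape(para)}</p>")
    return (
        "<!DOCTYPE html>\n"
        f"<html><head><title>{html.escape(title)}</title></head>\n"
        "<body>\n" + "\n".join(body) + "\n</body></html>\n"
    )
```

The de-hyphenation rule also shows why a person still has to check the result: a genuine compound such as "free-market" that happens to break at a line end would come out as "freemarket".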
Source link: http://archive.mises.org/9526/help-us-with-the-site/
Help us with the site
{ 9 comments }
Why can’t this be automated? There are quite a few tools and classes out there that let you translate PDF to HTML. Surely there must be one that is compatible with an ASP server somewhere, no?
Adobe has a conversion tool for this purpose, which can be found here:
http://www.adobe.com/products/acrobat/access_onlinetools.html
I am also curious as to why you stated that the PDF files cannot be converted via a parser. Does it have anything to do with the fact that some of the PDFs are optical scans from books?
Denis – not really speaking for mises.org, but most likely the reason it can’t be fully automated is that they’re trying to digitize scanned documents, and the OCR is never perfect.
Something like pgdp.net is the best you can do, which is not exactly automatic but does distribute the load. Mises.org might benefit from checking out some of their pre-processing scripts.
Having done quite a bit of this for my own personal use, I have found that there are some consistent OCR errors that can be relatively easily caught and fixed… and something like aspell list will produce a list of “bad” words that you can scan to fix “keywords” that are misspelled for better search results.
But if you want a perfect document you pretty much have to go through it line by line sooner or later.
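The two ideas above, fixing consistent OCR errors and listing suspect words for a human to scan, can be sketched in Python as follows. This is only an illustration: the substitution rules in `OCR_FIXES` are made-up examples, the function names are hypothetical, and the `known` word set stands in for a real dictionary such as the one aspell uses.

```python
import re

# Hypothetical examples of recurring OCR confusions; a real list would be
# built up from the errors actually seen in a given scan.
OCR_FIXES = [
    (re.compile(r"\btbe\b"), "the"),
    (re.compile(r"\bm0ney\b"), "money"),
    (re.compile(r"(?<=[a-z])1(?=[a-z])"), "l"),  # digit one inside a word
]


def apply_fixes(text: str) -> str:
    """Apply every substitution rule in order and return the corrected text."""
    for pattern, replacement in OCR_FIXES:
        text = pattern.sub(replacement, text)
    return text


def unknown_words(text: str, known: set) -> list:
    """Return the sorted, de-duplicated words that are not in `known`."""
    # Digits are kept in tokens so that "m0ney" is reported as one word.
    words = re.findall(r"[A-Za-z0-9']+", text)
    return sorted({w for w in words if w.lower() not in known})
```

Running `apply_fixes` first and `unknown_words` second leaves a short list of leftovers, which still have to be checked by eye.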
I have used Read I.R.I.S. OCR software to convert PDF to HTML, with very good results :0)
I missed the explanation for the NEED for HTML conversion of the PDFs …
If a document is more than 5 pages long, converting it to HTML will do it no good: if people do not read PDFs, they won’t read lengthy HTML pages either (plus screen staring is bad for the eyes) …
Maybe it is out of concern for application switching? PDF readers already incorporate, or soon will, note taking and page bookmarks, making them more usable than HTML.
Huang Di: one advantage of conversion to HTML is easier conversion from there to epub (which is actually just restricted XHTML plus some additional bits); people who won’t read PDFs won’t read HTML (or epub) either, but people who will read PDFs might well prefer to read in reflowable HTML/epub format on a hand-held eInk device – I know I would…not to mention reducing the download size by a factor of 50 or more.
While there may not be a need for HTML specifically, I can definitely see the need for getting publications out of the fixed page format into a structured format that is easily parsable.
Off the top of my head:
1. lower bandwidth
2. device and software independence
3. somewhat easier / more accurate indexing
4. ease of conversion to whatever your needs might be
If it were me, I would be using a lightweight markup language (reStructuredText with Python’s Docutils, or SiSU, are probably the best for this application) as my base format, rather than HTML. From there you can turn out HTML, Palm Doc, nicely formatted PDF with LaTeX, ODF, or simply read the markup as is in plain text.
Bah. reStructuredText is OK for formatting docstrings without markup, but nowhere near suitable for real work. XML, please (hence XHTML; avoid HTML that doesn’t parse as valid XML, too).
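As an illustration of the “parses as XML” test, here is a minimal Python sketch using the standard library’s `xml.etree.ElementTree`. The helper name `is_well_formed` is hypothetical, and the check covers well-formedness only, not validation against the XHTML DTD.

```python
import xml.etree.ElementTree as ET


def is_well_formed(markup: str) -> bool:
    """Return True if `markup` parses as XML, False otherwise."""
    try:
        ET.fromstring(markup)
    except ET.ParseError:
        return False
    return True
```

For example, `is_well_formed("<p>One<br/>two</p>")` is true, while the HTML-style `"<p>One<br>two</p>"` fails because `<br>` is never closed. Named entities such as `&nbsp;` are also rejected by this parser unless they are declared, so a fuller check would need the XHTML entity definitions.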
I dunno, I use reStructuredText quite regularly for these types of documents (i.e. relatively straightforward text with sections and footnotes and a table or two), and it works very well for turning out XHTML and PDF quickly with no hassle. It is definitely easier for humans to read and write than XML. In fact, in my technical writing I used to mark up in XML as I wrote, but it got so distracting I switched to something that felt more natural.
But I admit that XML might be “better” in that if marked up right it’s more explicit… it’s just uglier (and XSLT for conversion also stinks). Not saying you’re wrong, just thinking of what is easy for volunteers to do.