Reading scientific articles as ePubs

While reading PDFs on mobile screens is improving, with ‘swipe and pinch’, it’ll probably always be easier to read text at the right font size that reflows to fit the margins, i.e., mobile-optimized HTML or ePub. The comprehensive Calibre suite can convert both the PDF and website HTML versions of scientific articles into ePub, but the results are seldom pretty: two-column PDF layouts are often mangled, as are sidebars and JavaScript in website HTML pages. Few publishers actually offer ePub versions, despite the clear benefit to the reader and the relative simplicity (for the publisher) of making an ePub from the HTML version. However, several publishers do provide XML versions of articles, generally using the NLM Journal Publishing DTD (e.g., PLoS) or a derivative (e.g., TaxPub, as used by Pensoft). Here is a guide to converting these XML files to ePub.

  1. Get the XML version of the article (an example paper from PLoS Biology: page and XML), and save it in a directory.
  2. Save the image files in the same directory. For PLoS, these need to be saved one by one. In PhytoKeys, just view the ‘Large HTML’ version, ‘Save complete webpage’ in your browser, and copy the image files to the XML directory.
  3. Get the XSLT tools from NLM; PLoS uses version 2 of the DTD, so get the ViewNLM-v2.3.zip file.
  4. Run an XSLT engine (Saxon or Xalan) to create an HTML file. Note that you will first need to comment out (or delete) the DOCTYPE declaration in PhytoKeys XML files, since the tax-treatment-NS0.dtd DTD has no URL. In fact, you can delete the DOCTYPE from any XML to speed up the conversion (by default, Xalan downloads the DTD from the web).
  5. In the case of PLoS, the images appear in the XML as DOI URLs, which sadly are not easily fetched, so search the generated HTML for <img src= in an editor and edit each reference to match the name of the corresponding image file already downloaded. An automated conversion from DOI references to local file references would be fairly easy, but I haven’t looked into this yet.
  6. Using the NLM CSS stylesheet (by moving it into the same directory) marginally prettifies the resultant ePub, but is not necessary.
  7. Convert the HTML to ePub with Calibre.

Apart from the manual editing possibly needed in Step 5, all of this takes only a few seconds. Steps 4 onwards:

  $ java -jar ~/lib/java/xalan.jar -IN journal.pbio.1001220.xml \
         -XSL ../ViewNLM-v2.3/ViewNLM-v2.3.xsl > journal.pbio.1001220.html
  $ emacs journal.pbio.1001220.html
  $ cp ../ViewNLM-v2.3/ViewNLM.css .
  $ ebook-convert journal.pbio.1001220.html journal.pbio.1001220.epub
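
The manual edit in Step 5 (the emacs step above) could probably be replaced by a small filter. What follows is only a sketch under stated assumptions: that each image reference in the generated HTML ends in a path segment identifying the figure, and that you saved each image under that segment's name plus a common extension. The function name localize_img_src and its extension argument are hypothetical. It also rewrites every <img> whose src contains a slash, so check the result before converting.

```shell
# localize_img_src: read HTML on stdin and rewrite each <img src="...">
# so it points at a local file named after the last path segment of the
# old value. $1 is the extension of the saved images (assumed: png).
localize_img_src() {
    ext="${1:-png}"
    # Group 1 keeps the tag up to src=", group 2 keeps the last segment;
    # everything before the final slash is dropped.
    sed -E 's#(<img[^>]*src=")[^"]*/([^"/]+)"#\1\2.'"$ext"'"#g'
}

# Example use (placeholder file names):
#   localize_img_src png < article.html > article-local.html
```

References that contain no slash, such as images already pointing at local files, are left unchanged.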

Finally, enjoy the article in your favourite ebook reader. FBReader is excellent, and there is even a MeeGo version for the wonderful N9. Hopefully more publishers will come to post an NLM XML version, and, even better, an ePub version.