Building Book Production Platforms p2.

Amongst the core requirements for a Book Production Platform are the source file format and the editor, and of course they are intimately linked. The development team is usually faced with choosing the format first, then the editor.

Choosing a Format

The choice is pretty much HTML? or not HTML?

Currently, HTML is the ruling choice of format for a web-based book production platform. HTML is native to the browser and has associated standards-compliant support, such as CSS and javascript. Inversely, not choosing HTML puts you in a bit of a hole and can create a lot of overhead.

It might be interesting to look back a little and learn from some others since there have already been projects in this space that started down non-HTML roads and then gave it up for HTML. Kathi Fletcher, originally the project manager and technical director for Connexions (now OpenStax) which built a custom XML editing environment for academic materials, later researched in-browser XML vs HTML editing environments for her Shuttleworth funded OERPUB project. Kathi became convinced HTML was the way to go and did some great work on HTML editor usability with the Aloha HTML editor.

We have chosen to use HTML5 as the canonical format for open textbooks, because developers and tools are more plentiful for web technologies than XML technologies.

http://www.w3.org/2012/12/global-publisher/statements-of-interest/29-oerpub.html

The (closed source) O'Reilly Atlas platform also started with the complex AsciiDoc format (a form of markdown) and eventually awoke to the power of HTML in 2012.

HTML5-based authoring offers a streamlined production workflow for producing both print and digital outputs, facilitates “digital first” content development, and is a perfect fit for creating a WYSIWYG, web-based writing experience.

http://radar.oreilly.com/2013/09/html5-is-the-future-of-book-authorship.html

They then got an extra dose of religion and started a project called HTML Book which is a suggested 'spec' for a subset of HTML elements to be used in books.

So far I have not seen a book production platform travel the reverse direction, from HTML to something else. Instead we are seeing more and more platforms start with, or change to, HTML as a source file format.

Markdown

Markdown is sometimes put forward as the way to go but I'm not going to go into that in too much detail here. I have talked about this elsewhere. The only additional thing I will say is that markdown causes even more issues for book production platforms than those included in that article. Namely, in an in-browser markdown environment, the markdown will most likely be displayed as rendered HTML next to the authoring pane. That is a huge amount of lost screen space and extra UI junk for no apparent gain. Think of the UX cost. If you don't have that rendered display then you will most likely only see pure markdown in a textfield with no rendered display. The user won't really know if their document looks right until it is rendered somewhere down the line, which is also a tremendous cost to the user for no apparent gain. Markdown: all pain, no gain.

NB: There is a possible good use case for markdown as a helpful add-on for HTML WYSI editors but I will cover that later.

LaTeX

There is a more valid use case for LaTeX in the browser since some scientists and academics will never use anything else, and you'll never convince them to adopt HTML regardless of the benefits. You are up against the great Church of Knuth and I don't fancy your chances. If your audience are LaTeX addicts, then I think you have no choice other than support that.

Many times I have talked about remedies for unstructured MS Word documents (for scientific manuscripts) only to have someone earnestly comment that if everyone just learned LaTeX we would be in a much better position... They might be right, but I'm pretty sure it's never going to happen.

The preference for LaTeX is a legacy issue, and problematic, but needs to be dealt with. (Unfortunately, today's markdown heroes are growing legacy issues like this with each passing day, and that is going to cost us down the road).

Recently there has been some interesting work on inbrowser LaTeX editing including the (closed source) Authorea platform and, most notably the (open source) ShareLatex platform. ShareLatex round trips the LaTeX syntax displayed and edited in a text area (in the browser), renders that to a bitmap on the server, and returns it to the browser for a side-by-side 'WYSIWYG'. The effect is that you can see a just-in-time rendered view of the LaTeX as you type. Its a neat trick and effective if you insist on LaTeX in a web based platform. Then you just have to live with the UI costs. However you only need this approach if you wish to support the full LaTeX syntax. If you wish to just support LaTeX equations, you can use an HTML editor with a LaTeX plug in based on MathJax or the Khan Academies KaTeX (and there are some other solutions such as Mathoid).

Incidentally, if you need to support full LaTeX I highly recommend checking out ShareLaTeX over WriteLaTeX. They both have the same approach but WriteLaTeX is proprietary whereas you can pick up the ShareLaTeX code and integrate it straight away. You could even build your own ShareLaTeX-like interface, it's not too tricky - together with a colleague - Rizwan Reza - and I (Riz did all the hard work) we managed to develop a workable prototype in about 2 days, but there are many gotchas setting up the LaTeX compiler correctly.

Not many book projects need LaTeX, so I will leave this as an interesting edge case. There are solutions if you need it, but not many people need it.

XML

I think I will just leave it to the words of the brilliant David Cramer (Hachette Book Group):

So we've chosen to describe our content with HTML, and build our production system around HTML.

When I tell people that, they smile condescendingly, and chuckle a bit. “That's cute. Why don’t you use real XML?”

I then ask them what you can do in Docbook (or TEI, or NLM) that you can't do in XHTML? I haven't heard a good answer to that question yet. XHTML is XML, by definition. Calling something "para" rather than "p" doesn't get you anything, except carpal tunnel syndrome and invoices from consultants

The problem with non-HTML XML is that it is essentially just XML the browser can't use. Hence you lose all that other good stuff like WYSI editors, CSS design tools, cool tricks with JavaScript, and all the cool tools that are being developed for HTML. XML just can't compete, plus you are going to need to convert the XML into HTML anyway. So don't make life more complicated than it already is - continue your love affair with XML as long as it's XHTML!

HTML

HTML is king in the browser and it gives you all you need to make books. I don't want to spend a lot of time arguing the merits of HTML in this post as there is a lot to say and I want to bring that in at other points of the conversation. But in brief:

  • HTML is supported by JS and CSS.
  • The DOM is known natively by the browser.
  • HTML is standards-based.
  • It is straightforward.
  • HTML is easy to read and easy to clean.
  • HTML is the most popular file format on the planet.
  • You can use HTML to build structure in documents with assigned class and id values, or microdata formats.
  • HTML is the native file format for EPUB.
  • PDF can be rendered directly from HTML in the browser (more on this later).
  • HTML can be paginated in the browser.
  • CSS is moving towards supporting more and more page based elements.
  • The browser can act as a design environment.
  • You can create real what-you-see-is (WYSI) production environments.
  • Basic editing is built into the format itself.
  • HTML is supported by an enormous number of tools for conversion (in and out).
  • HTML is supported by an enormous repository of examples (the web).
  • HTML is cheap to develop with.
  • Even book designers are getting used to it.
  • Some schools teach it.
  • There are a million free tutorials online about it.
  • A lot of people know HTML.
  • HTML is supported by a rapidly proliferating body of JavaScripts for typography, graph production, animation, interactions, dynamic rendering etc etc etc etc

The basic idea really comes down to this.

  • HTML is the cheapest format of our time.
  • HTML is the most popular format of our time.
  • HTML is the networked document format of our time.

Increasingly HTML is the way stories are told, whether that is in books or on the web. It's a trite analogy perhaps, but HTML is the paper of our time. As David Cramer says:

why start with something other than HTML, when you have to turn it into HTML anyway?

http://infogridpacific.typepad.com/files/idpf-2012-cramer-smaller.pdf

It should be noted that David Cramer also turns HTML into paper, and the Hachette Book Group have produced many beautiful paper books using HTML as the source format. Many of these books you will now find in the best-selling sections of your local brick and mortar bookstore.

Other print producers are also using HTML as source. Print-on-demand services, used to producing very ugly books by ingesting MS Word and dealing with all that ugly conversion, are also adopting HTML production environments. Books on Demand, Germany's largest Print on Demand service, adopted Booktype so their customers could have an easy in-browser book production environment. The source format is HTML but the users don't know that, and the books look better. That's the beauty of HTML.

Finally, helped a lot by the efforts of David Cramer and the Hachette Book Group, Sourcefabric, the people at O'Reilly, and others adopting HTML, we might be starting to see the very beginning of the changing of the guard.

HTML is the way to go for Book Production Platforms. If you choose another format you will find you inherit a lot of costs and additional overheads and, sadly, you will soon be left behind. There is just no format going forward at the same speed as HTML. Not even close. So, my advice is first ask the question - can HTML do what you need? Push your team to answer that question. Will format X give you anything HTML can't? As an exercise ask your team to prove HTML is a bad choice, and if the answer is not-HTML, then contact me and let me try and talk you into it!