Using html2pdfr

HTML and CSS do a good job of automatically laying out and styling content, particularly in tables, but they are not natively designed for pagination. This library converts HTML content into PDF and PNG formats, within the constraints of a page size, for embedding into LaTeX documents. It allows HTML table layouts from HTML-first libraries such as gt and huxtable to be used within LaTeX documents or presentations, where they appear the same as the HTML versions of those tables. HTML content can grow in width up to the page dimensions, but is prevented from overflowing them, and table layouts are not forced to be wider than they would normally be. This heuristic calculation of an output size that fits within set limits is one of the differentiators between this and other HTML to PDF converters.

PDF images produced by html2pdfr can be included in LaTeX files using an \includegraphics directive in exactly the same way as figures. Although focussed on tabular content, html2pdfr can convert other simple HTML, including SVG and MathML content, with variable success rates.

The R package is a wrapper around the Java openhtmltopdf library (https://github.com/danfickle/openhtmltopdf), and requires a working installation of Java and rJava. All other dependencies are resolved automatically at runtime. It does not require a graphical display and is suited to running on a headless server. The underlying Java library does not support JavaScript, which would be required for D3 content or for rendering shiny apps; for these, webshot or webshot2 would be a better option. The library relies on locally installed fonts, and the paths to local .ttf files must be supplied; this is managed by the systemfonts package. The library can resolve local and remote CSS and image files specified relative to the HTML, and locate them without the need for a web server.

Installation instructions

html2pdfr is based on a Java library and requires a working installation of Java and rJava. The following commands can be used to check that your rJava installation is working.

install.packages("rJava")
rJava::.jinit()
rJava::J("java.lang.System")$getProperty("java.version")
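
In a script it can be more convenient for this check to return a value than to stop with an error. The following is a minimal sketch that wraps the same three calls; the function name `java_version_or_na` is hypothetical and is not part of rJava or html2pdfr.

```r
# Hypothetical helper: returns the Java version string seen by rJava,
# or NA if rJava is not installed or the JVM cannot be started.
java_version_or_na <- function() {
  if (!requireNamespace("rJava", quietly = TRUE)) {
    return(NA_character_)
  }
  tryCatch({
    rJava::.jinit()
    rJava::J("java.lang.System")$getProperty("java.version")
  }, error = function(e) NA_character_)
}

version <- java_version_or_na()
if (is.na(version)) {
  message("rJava could not start a JVM; check your Java installation")
} else {
  message("rJava is using Java ", version)
}
```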

Binary packages of html2pdfr are available on the r-universe for macOS and Windows. html2pdfr can be installed from source on Linux. html2pdfr has been tested on R versions 3.6, 4.0, 4.1 and 4.2.

options(repos = c(
  terminological = 'https://terminological.r-universe.dev',
  CRAN = 'https://cloud.r-project.org'))
# Download and install html2pdfr in R
install.packages('html2pdfr')
# Browse the html2pdfr manual pages
help(package = 'html2pdfr')

Unstable development versions are available from GitHub, but on Windows the build may fail if the multi-arch option is set. Windows users will also need Rtools 4.2:

devtools::install_github("terminological/html2pdfr", args = c("--no-multiarch"))

The Java libraries in html2pdfr total 29 MB, which is too large for CRAN.

Initialising the library

On first use the major Java library dependencies of this project must be downloaded and cached. This can take some time but only needs to be done once. The following basic initialization code sets up the library:

# this produces a verbose output which can be hidden with suppressMessages:
conv = html2pdfr::html_converter()

Once this is complete the conv object provides the useful functions of the package.

Generating a PDF

PDF rendering of HTML can be done direct from a URL, or from a locally stored HTML file. Pulling in a URL and converting it to PDF is done like so:

html2pdfr::url_to_pdf(
  htmlUrl = "https://cran.r-project.org/banner.shtml",
  outFile = out("docs/articles/example-output.pdf")
)
## [1] "/tmp/RtmperVQR2/docs/articles/example-output.pdf"

The resulting PDF is written to the path shown above. Your success in rendering HTML will vary, as complex web pages (including, in this example, frames) are not supported by the underlying engine. The focus of html2pdfr is on simpler static HTML content rather than complex pages, for which alternatives already exist (see webshot2, for example).

Generating a PDF from HTML content

In the following, more usual, example the HTML is generated within R (as you might get from a tabular data library such as huxtable or gt) and passed to the converter with some target page dimensions. The converter will lay out the table within the confines of the maximum space available, overflowing onto new pages wherever required.


# assumes the magrittr pipe (%>%) is available, e.g. via library(magrittr)
irisHtml = iris[c("Species","Sepal.Width")] %>%
  huxtable::as_hux() %>%
  huxtable::theme_article() %>%
  huxtable::to_html()
html2pdfr::html_fragment_to_pdf(
  htmlFragment = irisHtml, 
  maxWidthInches = 8, maxHeightInches = 8, 
  outFile = out("docs/articles/example-output-2.pdf")
)
## [1] "/tmp/RtmperVQR2/docs/articles/example-output-2.pdf"

The resulting PDF of the generated HTML is written to the path shown above. No page of this document should be more than 8 inches high. The width in this case is determined by the content, which is much narrower than the specified maximum of 8 inches. If the content were very wide, the converter would wrap content within cells to stay within the specified bounding box. This bounding box behaviour means that the generated PDF can be inserted into a LaTeX document without risk of overfull boxes.
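
The HTML does not have to come from a table library. The sketch below builds a bare table fragment by hand in base R and passes it to the same function with a smaller bounding box. The helpers `escape_html` and `df_to_html_fragment` are hypothetical names introduced here for illustration, the fragment is unstyled, and the conversion step only runs if html2pdfr is installed.

```r
# Minimal HTML escaping so that cell text cannot break the markup.
escape_html <- function(x) {
  x <- gsub("&", "&amp;", x, fixed = TRUE)
  x <- gsub("<", "&lt;", x, fixed = TRUE)
  gsub(">", "&gt;", x, fixed = TRUE)
}

# Hypothetical helper: turn a data frame into a bare <table> fragment.
df_to_html_fragment <- function(df) {
  header <- paste0("<th>", escape_html(names(df)), "</th>", collapse = "")
  # One character vector of <td> cells per column ...
  cells <- unname(lapply(df, function(col) {
    paste0("<td>", escape_html(as.character(col)), "</td>")
  }))
  # ... pasted together element-wise to give one string per row.
  rows <- do.call(paste0, cells)
  paste0(
    "<table><tr>", header, "</tr>",
    paste0("<tr>", rows, "</tr>", collapse = ""),
    "</table>"
  )
}

fragment <- df_to_html_fragment(head(iris[c("Species", "Sepal.Width")], 3))

# Same call as above, with a 4 x 4 inch bounding box and a temporary output file.
if (requireNamespace("html2pdfr", quietly = TRUE)) {
  html2pdfr::html_fragment_to_pdf(
    htmlFragment = fragment,
    maxWidthInches = 4, maxHeightInches = 4,
    outFile = tempfile(fileext = ".pdf")
  )
}
```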

The layout engine should support simple SVG and MathML content. However, it does not execute JavaScript and so is not able to lay out D3 content. If this is something you need, then webshot2, which wraps a whole Chrome instance, may be a better option.

# JavaScript does not work:
# conv$urlToPdf(
#   htmlUrl = "https://bl.ocks.org/mbostock/raw/1389927/?raw=true",
#   outFile = here::here("docs/articles/example-d3.pdf")
# )

# MathML does work.
html2pdfr::url_to_pdf(
  htmlUrl = "https://fred-wang.github.io/MathFonts/mozilla_mathml_test/",
  outFile = out("docs/articles/example-mathml.pdf")
)
## [1] "/tmp/RtmperVQR2/docs/articles/example-mathml.pdf"

The resulting MathML rendering is written to the path shown above.

if (interactive() || identical(Sys.getenv("IN_PKGDOWN"), "true")) {
  message("as we are running in pkgdown and rendering the site we copy the output files to the correct location to be picked up by the pkgdown site")
  fs::dir_copy(out("docs/articles"), here::here("docs/articles"), overwrite = TRUE)
} 

Multipage output

One likely output of the package, when passed a large amount of tabular data, is a multipage PDF whose pages can be made small enough to fit into the overall flow of a LaTeX document. This can be included in a LaTeX document using the following approach, which includes each page separately. In this way an HTML table can be converted to a multipage PDF and embedded into a parent LaTeX document, possibly even in landscape as here, with consistent page furniture:

\begingroup
\begin{sidewaysfigure}
    \begin{center}
    %\fbox{
    \includegraphics[page=1, width=\linewidth]{multipageTable.pdf}%}
    \end{center}
\end{sidewaysfigure}

\begin{sidewaysfigure}
    \begin{center}
    %\fbox{
    \includegraphics[page=2, width=\linewidth]{multipageTable.pdf}%}
    \captionof{table}{Caption}
    \label{your_label}
    \end{center}
\end{sidewaysfigure}
\endgroup
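
For tables that run to many pages, the repeated blocks above could be generated from R rather than typed by hand. The following is a minimal sketch, assuming the page count is already known; `multipage_latex` is a hypothetical helper and not part of html2pdfr. It emits one sidewaysfigure per page and attaches the caption and label to the last page only, as in the example above.

```r
# Hypothetical helper: build the LaTeX for including every page of a PDF,
# one sidewaysfigure per page, with the caption and label on the last page.
multipage_latex <- function(pdf_file, n_pages,
                            caption = "Caption", label = "your_label") {
  stopifnot(n_pages >= 1)
  blocks <- vapply(seq_len(n_pages), function(page) {
    lines <- c(
      "\\begin{sidewaysfigure}",
      "    \\begin{center}",
      sprintf("    \\includegraphics[page=%d, width=\\linewidth]{%s}",
              page, pdf_file),
      # c() drops the NULL returned for all pages except the last.
      if (page == n_pages) c(
        sprintf("    \\captionof{table}{%s}", caption),
        sprintf("    \\label{%s}", label)
      ),
      "    \\end{center}",
      "\\end{sidewaysfigure}"
    )
    paste(lines, collapse = "\n")
  }, character(1))
  paste(c("\\begingroup", blocks, "\\endgroup"), collapse = "\n\n")
}

tex <- multipage_latex("multipageTable.pdf", n_pages = 2)
cat(tex, "\n")
```

The output can be written to a .tex file and pulled into the parent document with \input. If the pdftools package is available, `pdftools::pdf_info(file)$pages` is one way to obtain the page count.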