--- title: "Using html2pdfr" author: "Rob Challen" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Using html2pdfr} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- HTML and CSS do a good job at automatically laying out and styling content particularly in tables, however it is not natively designed for pagination. This library converts HTML content into PDF and PNG formats for embedding into LaTeX documents, within the constraints of page sizes. It allows use of HTML table layout from HTML first libraries such as `gt` and `huxtable` within latex documents, or presentations, and which appear the same as the HTML versions of those tables. HTML content can grow in width up to the page dimensions, but preventing it from overflowing, and without forcing table layout to be wider than it would normally be. This heurisitic calculation of the output size up to fit within set limits is one of the differentiators between this and other HTML to PDF converters. `html2pdfr` PDF images can be included in LaTeX files using an `includegraphics` directive in exactly the same way as figures. Although focussed on tabular content, `html2pdfr` can convert other simple HTML, including SVG and MathML content, with variable success rates. The R Package is a wrapper around the Java OpenHTML2PDF library (https://github.com/danfickle/openhtmltopdf), and requires a working installation of Java and `rJava`. All other dependencies are resolved automatically at runtime. It does not require a graphical display and would suit running on a headless server. The underlying Java library does not support javascript which would be required for D3 content or rendering shiny apps, and for which `webshot` or `webshot2` would be a better option. The library relies on locally installed fonts, and the paths to local `.ttf` files must be supplied; this is managed by the `systemfonts` package. The library can resolve local and remote CSS and image files, specified relative to the HTML and locate them without the need for a web server. ```{r setup, include=FALSE} knitr::opts_chunk$set( echo = TRUE, collapse = TRUE ) library(html2pdfr) here::i_am("vignettes/html2pdfr.Rmd") out = function(path) { return(fs::path(tempdir(),path)) } ``` ## Installation instructions `html2pdfr` is based on a java library and must have a working version of `Java` and `rJava` installed. The following commands can ensure that your `rJava` installation is working. ```R install.packages("rJava") rJava::.jinit() rJava::J("java.lang.System")$getProperty("java.version") ``` Binary packages of `html2pdfr` are available on the r-universe for `macOS` and `Windows`. `html2pdfr` can be installed from source on Linux. `html2pdfr` has been tested on R versions 3.6, 4.0, 4.1 and 4.2. ```R options(repos = c( terminological = 'https://terminological.r-universe.dev', CRAN = 'https://cloud.r-project.org')) # Download and install html2pdfr in R install.packages('html2pdfr') # Browse the html2pdfr manual pages help(package = 'html2pdfr') ``` Unstable versions are available but on windows build may fail if the multi-arch option is set. Windows users will also need `RTools4.2`: ```R devtools::install_github("terminological/html2pdfr", args = c("--no-multiarch")) ``` The Java libraries in `html2pdfr` are 29 Mb which are too large for CRAN. ## Initialising the library On first use the major Java library dependencies of this project must be downloaded and cached. This can take some time but only needs to be done once. The following basic initialization code sets up the library: ```{r} # this produces a verbose output which can be hidden with suppressMessages: conv = html2pdfr::html_converter() ``` Once this is complete the `conv` object provides the useful functions of the package. ## Generating a PDF PDF rendering of HTML can be done direct from a URL, or from a locally stored HTML file. Pulling in a URL and converting it to PDF is done like so: ```{r} html2pdfr::url_to_pdf( htmlUrl = "https://cran.r-project.org/banner.shtml", outFile = out("docs/articles/example-output.pdf") ) ``` [The resulting pdf](example-output.pdf) is here, Your success rendering HTML will vary as complex web pages (including in this example, frames) are not supported by the underlying engine. The focus of `html2pdfr` is on simpler static html content and not complex pages, for which alternatives already exist (see `webshot2` for example). ## Generating a pdf from HTML content In the following, more usual, example the HTML is generated within R (as you might find from a tabular data library such as `huxtable` or `gtables`) and passed to the converter with some target page dimensions. The converter will lay out the table within the confines of the maximum space available, overflowing to new pages, where-ever required. ```{r} irisHtml = iris[c("Species","Sepal.Width")] %>% huxtable::as_hux() %>% huxtable::theme_article() %>% huxtable::to_html() html2pdfr::html_fragment_to_pdf( htmlFragment = irisHtml, maxWidthInches = 8, maxHeightInches = 8, outFile = out("docs/articles/example-output-2.pdf") ) ``` And [the resulting pdf](example-output-2.pdf) of the generated HTML is here. This document should not have pages any more than 8 inches high. The width in this case is determined by the content, which is much less wide than the maximum specified 8 inches. If there was very wide content, the converter would wrap content within cells to stay within the specified bounding box size. This bounding box behaviour means that we can insert the generated pdf into a latex document simply without risk of overfull boxes. The layout engine should support simple SVG and MathML content. However it does not execute javascript so is not be able to lay out D3 content. If this is something you need then using `webshot2`, which wraps a whole Chrome instance, may be a better option. ```{r} # Javascript Does not work: # conv$urlToPdf( # htmlUrl = "https://bl.ocks.org/mbostock/raw/1389927/?raw=true", # outFile = here::here("docs/articles/example-d3.pdf") # ) # MathML does work. html2pdfr::url_to_pdf( htmlUrl = "https://fred-wang.github.io/MathFonts/mozilla_mathml_test/", outFile = out("docs/articles/example-mathml.pdf") ) ``` [The resulting mathml](example-mathml.pdf) ```{r} if (interactive() || identical(Sys.getenv("IN_PKGDOWN"), "true")) { message("as we are running in pkgdown and rendering the site we copy the output files to the correct location to be picked up by the pkgdown site") fs::dir_copy(out("docs/articles"), here::here("docs/articles"), overwrite = TRUE) } ``` # Multipage output One likely output of the package when passed a large amount of data in a table is a multipage pdf, where the pages can be designed small enough to fit into the overall flow of a latex document. This can be included into a latex document using the following approach which includes each page seperately. I this way an html table can be converted to a multipage pdf which can be embedded into a parent latex document, even possibly in landscape as here, but with consistent page furniture: ```LATEX \begingroup \begin{sidewaysfigure} \begin{center} %\fbox{ \includegraphics[page=1, width=\linewidth]{multipageTable.pdf}%} \end{center} \end{sidewaysfigure} \begin{sidewaysfigure} \begin{center} %\fbox{ \includegraphics[page=2, width=\linewidth]{multipageTable.pdf}%} \captionof{table}{Caption} \label{your_label} \end{center} \end{sidewaysfigure} \endgroup ```