plain text, papers, pandoc

2056

收藏 2016-07-26

Over the past few months, I’ve had several people ask me about the tools I use to put papers together. I maintain a page of resourcessomewhat grandiosely headed “Writing and Presenting Social Science”. Really it just makes public some configuration files and templates for my text editor and related tools. Things have changed a little recently—which led to people asking the questions—so I will try to lay out the current setup here. I will also try to avoid veering off into generalized noodling about the nature of writing or creativity. (That’s fine for Merlin.) This is mostly because although I am not a bad writer, I am an excellent procrastinator, and it is embarrassing to write about how to write papers when you could be actually writing papers. My excuse today is that I have a headcold.

So, first I will say a little bit about the general problem, and then I will tell you something specific: how to take the draft of a scholarly paper, typically including bibliographical references, figures, and the results of some data analysis, and turn it into nice-looking PDF and HTML output. The hopefully redeeming thing about this discussion is that it will help you use the various resources I make available for doing this. If you want to copy what I do, you should be able to.

what’s the problem?

The problem is that doing scholarly work is intrinsically a mess. There’s the annoying business of getting ideas and writing them down, of course, but also everything before, during, and around it: data analysis and all that comes with it, and the tedious but unavoidable machinery of scholarly papers—especially citations and references. There is a lot of keep track of, a lot to get right, and a lot to draw together at the time of writing. Academic papers are by no means the only form of writing subject to constraints of this sort. Consider this extremely sensible discussion by Dr Drang, a consulting engineer and blogger you should be reading:

I don’t write fiction, but I can imagine that a lot of fiction writing can be done without any reference materials whatsoever. Similarly, a lot of editorials and opinion pieces are remarkably fact-free; these also can spring directly from the writer’s head. But the type of writing I typically do—mostly for work, but also here—is loaded with facts. I am constantly referring to photographs, drawings, experimental test results, calculations, reports written by others, textbooks, journal articles, and so on. These are not distractions; they are essential to the writing process.
And it’s not just reference material. Quite often I need to make my own graphs and drawings to include in a report. Because the text and the graphics are all part of a coherent whole, I need to go back and forth between the two; the words inform the pictures and the pictures inform the words. This is not the Platonic ideal of a clean writing environment—a cup of coffee on an empty desk in a white room—that you see in videos for distraction-free editors.
Some of the popularity of these editors is part of the backlash against multitasking, but people are confusing themselves with their computers. When I’m writing a report, that is my single task, and I bring to bear whatever tools are necessary to complete it. That my computer is multitasking by running many programs simultaneously isn’t a source of confusion or distraction, it’s the natural and efficient way for me to get my one task done.

A lot of academic writing is just like this. It is difficult to manage. It’s even worse when you have collaborators and other contributors. So, what to do?

the office model and the engineering model

Let me make a crude distinction. There are “Office Type” solutions to this problem, and there are “Engineering Type” solutions. Please don’t get hung up on the distinction or the labels. Office solutions tend towards a cluster of tools where something like Microsoft Word is at the center of your work. A Word file or set of files is the most “real” thing in your project. Changes to your work are tracked inside that file or files. Citation and reference managers plug into them. The outputs of data analyses—tables, figures—get dropped into them or kept alongside them. The master document may be passed around from person to person or edited and updated in turn. The final output is exported from it, perhaps to PDF or to HTML, but maybe most often the final output just is the .docx file, cleaned up and with the track changes feature turned off.

In the Engineering model, meanwhile, plain text files are at the center of your work. The most “real” thing in your project will either be those files or, more likely, the Git, Mercurial, or SVN repository that controls the project. Changes are tracked outside the files. Data analysis is managed in code that produces outputs in (ideally) a known and reproducible manner. Citation and reference management will likely also be done in plain text, as with a BibTeX .bib file. Final outputs are assembled from the plain text and turned to .tex, .html, or .pdf using some kind of typesetting or conversion tool. Very often, because of some unavoidable facts about the world, the final output of this kind of solution is also a .docx file.

This distinction is meant to capture a tendency in organization, not a rigid divide—still less a sort of personality. Obviously it is possible organize things on the Office model. (Indeed, it’s the dominant way.) Applications like Scrivener, meanwhile, combine elements of the two approaches. Scrivener embraces the “bittyness” of large writing projects in an effective way, and can spit out clean copy in a variety of formats. Scrivener is built for people writing lengthy fiction (or qualitative non-fiction) rather than anything with quantitative data analysis, so I have never used it extensively. Microsoft Word, meanwhile, still rules large swathes of the Humanities and the Social Sciences, and the production process of many publishers. So even if you prefer plain text for other reasons—especially in connection with project management and data analysis—the routine need or obligation to provide a Word document to someone is one of the main reasons to want to be able to easily convert things. HTML is a great lingua franca.

扫码加我拉你入群

请注明：姓名-公司-职位

以便审核进群资格，未注明则拒绝

全部回复

oliyiyi

2016-7-26 16:34:55

The examples directory also includes a sample .Rmd file. The code chunks in the file provide examples of how to generate tables and figures in the document. In particular they show some useful options that can be passed to knitr. Consult the knitr project pagefor extensive documentation and many more examples. To produce output from the article-knitr.Rmd file, launch R in the working directory, load knitr, and process the file. You will also need the ascii, memisc, and ggplot2 libraries to be available.

library(knitr)knit("article-knitr.Rmd")

If things are working properly, then a markdown file called article-knitr.md will be produced, together with some graphics in the figures/ subfolder and some working files in the cache/folder. We set things up in the .Rmd file so that knitr produces both PNG and PDF versions of whatever figures are generated by R. That prepares the way for easy conversion to HTML and LaTeX. Once the article-knitr.md file is produced, HTML, .tex, and PDF versions of it can be produced as before, by typing make at the command line. You can also run the pandoc commands manually, of course, or run pandoc from inside R via knitr’s pandoc() helper function, or set your editor up to run make for you as needed, if it can do that.

Sample PDF output from an Rmd file, with figures and tables generated automatically. Click the image for a PDF version

using marked

In everyday use, I find Brett Terpstra’s Marked.app to be a very useful way of previewing text while writing. Marked shows you your markdown files as HTML, updating on the fly whenever the file is saved. It supports pandoc as a custom processor. Essentially, you tell it to run a pandoc command like the one above to generate its previews, instead of its built-in markdown processor. You do this in the “Behavior” tab of Marked’s preferences.

Marked's Preference dialog

The “Path” field contains the full path to pandoc, and the “Args” field contains all the relevant command switches—in my case, as above,

-r markdown+simple_tables+table_captions+yaml_metadata_block -w html -S --template=/Users/kjhealy/.pandoc/templates/html.template --filter pandoc-citeproc --bibliography=/Users/kjhealy/Documents/bibs/socbib-pandoc.bib

When editing your markdown file in your favorite text editor, you point Marked at the file and get a live preview. Like this:

Previewing in HTML with Marked

You can add the CSS files in the pandoc-templates repo to the list of Custom CSS files Marked knows about, via the “Style” tab in the Preferences window. That way, Marked’s preview will look the same as the HTML file that’s produced.

The upshot of all of this is powerful editing using Emacs, ESS, R, and other tools; flexible conversion using pandoc; quick and easy previewing via HTML and Marked; and high-quality PDF typesetting at the same time (or whenever needed)—all generated directly from plain text and including almost all of what most of the scholarly papers I write need to include. While this may seem quite complex when laid out in this way, from my point of view the result is very straightforward. I just live in my text editor, the various scripts and settings do their work quietly, as they should, and I get the formatted output I want.

扫码加我拉你入群

请注明：姓名-公司-职位

以便审核进群资格，未注明则拒绝

扫码加我 拉你入群

扫码加我 拉你入群

分享

扫码加好友，拉您进群

扫码加我拉你入群

扫码加我拉你入群