Techniques — Science writing
Based on Ubuntu 10.04 (a variety of Debian Linux).
LibreOffice (the major fork of OpenOffice) isn't bad. Zotero itself is excellent. However, the problems are (a) OpenOffice Math remains something of a pain for equations, (b) citation entry/editing isn't as easy as it could be (e.g. this formatting bug and the fact it's hard to add text at the end of a line after inserting a citation), and (c) ridiculously, you can't copy and paste citations. So:
Zotero is superb, particularly the standalone version. LyX is an editor that does a good job of hiding its underlying platform, LaTeX, and has superb maths capabilities (e.g. inline equations appear perfectly, and it's very quick to write in). LyZ is a plugin for Zotero that integrates it very well with LyX.
So, installation:
sudo apt-get install lyx
In use:
Create a file such as my_style_modifications.inc, and then include it by typing \include{my_style_modifications.inc} into Document → Settings → LaTeX preamble. Here's an example of such a file, modifying the section heading styles:
\usepackage{titlesec}
\titleformat{\chapter}
    [display] % puts the chapter title on a separate line
    {\normalfont\sffamily\huge\bfseries}
    {\chaptertitlename\ \thechapter}
    {20pt}
    {\Huge}
\titleformat{\section}
    {\normalfont\sffamily\large\bfseries}
    {\thesection}
    {1em}
    {}
    [{\titlerule}] % with a rule under it
\titleformat{\subsection}
    {\normalfont\sffamily\normalsize\bfseries} % sans serif, bold
    {\thesubsection}
    {1em}
    {}
\titleformat{\subsubsection}
    {\normalfont\sffamily\normalsize\mdseries\itshape} % sans serif, italic
    {\thesubsubsection}
    {1em}
    {}
Many journals provide .bst files to save you effort (see e.g. PLoS Computational Biology for a relatively decent numeric citation style).

Bugs persist (Mar 2012) in Inkscape that mean it's not yet up to Illustrator; specifically, cutting lines with a shape doesn't work.
For example, to batch-fetch a list of sources in a convenient (relatively immutable, PDF) form, given a textfile containing a list of URLs: see below or fetch_multiple_urls_to_pdf.py.
#!/usr/bin/python
# Requires Debian packages: wget wkhtmltopdf
import sys
from subprocess import call

if len(sys.argv) != 2:  # the program name and one other
    sys.exit("Syntax: fetch_multiple_urls_to_pdf.py urllistfile")
f = open(sys.argv[1], "r")
for url in f:
    url = url.strip()
    destfile = url.replace("http://", "")
    if destfile[-1] == "/":
        destfile = destfile[:-1]
    destfile = destfile.replace("/", "_")
    if destfile[-4:] == ".pdf":
        command = "wget " + url + " -O " + destfile
    else:
        destfile += ".pdf"
        command = "wkhtmltopdf " + url + " " + destfile
    print "Executing " + command
    ret = call(command, shell=True)
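The URL-to-filename mangling in that script can be factored out and tested on its own. A sketch (Python 3, whereas the script above is Python 2; `url_to_destfile` is my name, not part of the original):

```python
def url_to_destfile(url):
    """Mirror the script's mangling: strip the scheme, drop a trailing
    slash, replace internal slashes with underscores, and append .pdf
    unless the URL already points at a PDF."""
    destfile = url.replace("http://", "")
    if destfile.endswith("/"):
        destfile = destfile[:-1]
    destfile = destfile.replace("/", "_")
    if not destfile.endswith(".pdf"):
        destfile += ".pdf"
    return destfile

print(url_to_destfile("http://example.com/papers/"))  # example.com_papers.pdf
```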
To generate the URL list from a textfile...
#!/usr/bin/perl
use strict;
my $filename = shift;
if ($filename eq "") {
    die "Syntax: list_urls.pl <filename>\n";
}
open(INFILE, "$filename") or die "Couldn't open $filename for reading.\n";
while (<INFILE>) {
    if (/(http\S*)/) {
        print "$1\n";
    }
}
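If you'd rather stay in Python, a rough equivalent sketch (Python 3; `list_urls` is my name); note that `re.findall` returns every match, whereas the Perl version prints only the first match per line:

```python
import re

def list_urls(text):
    r"""Return every whitespace-delimited run starting with 'http',
    using the same pattern as the Perl regex /(http\S*)/."""
    return re.findall(r"http\S*", text)

sample = "See http://example.com/a and https://example.org/b here."
for url in list_urls(sample):
    print(url)
```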
# The main package: pdftk (does all sorts of things)
sudo apt-get install pdftk
man pdftk
# example:
pdftk in.pdf cat 1-12 14-end output out1.pdf
# Extracting text:
pdftotext in.pdf out.txt
# Processing a PDF file through GhostScript (useful to find PDF corruption, amongst other things)
gs -dBATCH -dNOPAUSE -sDEVICE=pdfwrite -sOutputFile=output.pdf input.pdf
# ... use -sDEVICE=x11 to view, or -sDEVICE=nullpage just to pass through the input, with less restrictive error checks
# When GhostScript can't handle a pdf:
pdftops myfile.pdf
# ... generates myfile.ps
# Now list printers:
lpstat -p -d
# Now print the PostScript:
lpr -P<printer> myfile.ps
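The pdftk calls above are easy to drive from a script. A minimal sketch (assumes pdftk is installed; the function names are mine):

```python
import subprocess

def pdftk_cat_command(infile, pagerange, outfile):
    """Build the pdftk call for copying a page range, e.g. '1-12' or '14-end'."""
    return ["pdftk", infile, "cat", pagerange, "output", outfile]

def extract_pages(infile, pagerange, outfile):
    """Run pdftk to write the extracted pages to outfile."""
    subprocess.check_call(pdftk_cat_command(infile, pagerange, outfile))

# e.g. extract_pages("in.pdf", "1-12", "out1.pdf")
```

Passing the command as a list (rather than a single shell string) avoids any quoting problems with filenames containing spaces.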
psychosis OR psychotic OR delusion OR delusions OR hallucination OR hallucinations OR catatonic OR catatonia OR "thought disorder" "drug information" site:www.rxlist.com more:for_health_professionals
Add &num=100&filter=0 to the query URL (100 results/page, which is the maximum; this also turns off the near-duplicates filter).

Suppose your scanner produces page1.png. Then we'll use convert (part of ImageMagick) and tesseract to do some quick OCR.
convert page1.png page1.tif
tesseract page1.tif page1
The final result is page1.txt.
Probably achievable in a bash script, but this is quicker to write. Having split a PDF (with pdftk input.pdf burst), now renumber the pages so that pages 1–26 are named "...roman1"–"...roman26", but pages 27 onwards are renumbered from 1. See below or rename_proof_pages.py.
#!/usr/bin/python2.7
import sys, getpass, shlex, subprocess, re

def get_external_command_output(command):
    args = shlex.split(command)
    return subprocess.check_output(args)  # this needs Python 2.7 or higher

# need bash to do the wildcard expansion, then split() to make an array of a multiline string
filelist = get_external_command_output("bash -c \"ls *.pdf\"").split()
numberregex = r"\d+"
for infile in filelist:
    result = re.search(numberregex, infile)
    if result is not None:
        ipagenum = int(result.group(0))
        # Now, the first 26 pages are roman-numeralled.
        if ipagenum > 26:
            opagestr = str(ipagenum - 26)
        else:
            opagestr = "roman_" + str(ipagenum)
        outfile = "page_" + opagestr + ".pdf"
        get_external_command_output("cp " + infile + " " + outfile)
Use the following shell script (extract_pdf_text) as e.g. extract_pdf_text *.pdf. All it does is call pdftotext.
#!/bin/bash
# extract_pdf_text
# does very little!
# usage: extract_pdf_text *.pdf
# make sure you always put $f in double quotes to avoid any nasty surprises, i.e. "$f"
for f in "$@"
do
    echo "Processing $f file..."
    pdftotext "$f" "$f.txt"
done
Take a list of words (in a wordlist file); grep each word in turn against a bunch of files specified by the filespec; format the output a little. See below or multigrep.py.
#!/usr/bin/python2.7
import sys, getpass, shlex, subprocess, re

def raw_default(prompt, dflt=None):
    prompt = "%s [%s]: " % (prompt, dflt)
    res = raw_input(prompt)
    if not res and dflt:
        return dflt
    return res

def get_external_command_output(command):
    args = shlex.split(command)
    return subprocess.check_output(args)  # this needs Python 2.7 or higher

wordlistfile = raw_default("Wordlist file", "../../wordlist.txt")
resultsfile = raw_default("Results file", "../../abbreviation_search_results.txt")  # unused; redirect stdout to capture results
infilespec = raw_default("Input filespec", "*.txt")
infile = open(wordlistfile, "r")
wordlist = infile.read().split()
for word in wordlist:
    print "------------------------------------"
    print word
    print "------------------------------------"
    print get_external_command_output("bash -c \"grep " + word + " " + infilespec + "\"")
    print
You could use OpenOffice, but sometimes it crashes. An alternative:
# Prerequisites
sudo apt-get install rpm libgif4
# Now fetch odf-converter. Go to Novell, register (free)/log in, search under "OpenOffice",
# get "OpenOffice.OpenXML Translator 4.0" (e.g. odf-converter-4.0-12.1.i586.rpm)
# Now unpack/install
rpm2cpio odf-converter*rpm | cpio -ivd
sudo cp usr/lib/ooo-2.0/program/OdfConverter /usr/bin
# It wants libtiff.so.3, and we probably have libtiff.so.4, so give it a symlink
cd /usr/lib/
sudo ln -s libtiff.so.4 libtiff.so.3
# Can now use it:
OdfConverter /i example.docx
# That should produce example.odt (suitable for OpenOffice).
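OdfConverter handles one file at a time; batch conversion is a short loop away. A sketch (assumes OdfConverter is installed as above; the function names are mine):

```python
import subprocess

def convert_command(docx):
    """Build the OdfConverter call for one .docx file."""
    return ["OdfConverter", "/i", docx]

def batch_convert(docx_files):
    """Convert each .docx to .odt, reporting progress."""
    for docx in docx_files:
        subprocess.check_call(convert_command(docx))
        print(docx + " -> " + docx[:-len(".docx")] + ".odt")

# e.g.: import glob; batch_convert(glob.glob("*.docx"))
```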