Techniques — Science writing
Based on Ubuntu 10.04 (a variety of Debian Linux).
LibreOffice (the major fork of OpenOffice) isn't bad. Zotero itself is excellent. However, the problems are (a) OpenOffice Math remains something of a pain for equations, (b) citation entry/editing isn't as easy as it could be (e.g. this formatting bug and the fact it's hard to add text at the end of a line after inserting a citation), and (c) ridiculously, you can't copy and paste citations. So:
Zotero is superb, particularly the standalone version. LyX is an editor that does a good job of hiding its underlying platform, LaTeX, and has superb maths capabilities (e.g. inline equations appear perfectly, and it's very quick to write in). LyZ is a plugin for Zotero that integrates it very well with LyX.
So, installation:
sudo apt-get install lyx
In use:
Create a file such as my_style_modifications.inc, and then include it by typing \include{my_style_modifications.inc} into Document → Settings → LaTeX preamble. Here's an example of such a file, modifying the section heading styles:
\usepackage{titlesec}
\titleformat{\chapter}
    [display] % puts the chapter title on a separate line
    {\normalfont\sffamily\huge\bfseries}
    {\chaptertitlename\ \thechapter}
    {20pt}
    {\Huge}
\titleformat{\section}
    {\normalfont\sffamily\large\bfseries}
    {\thesection}
    {1em}
    {}
    [{\titlerule}] % with a rule under it
\titleformat{\subsection}
    {\normalfont\sffamily\normalsize\bfseries} % sans serif, bold
    {\thesubsection}
    {1em}
    {}
\titleformat{\subsubsection}
    {\normalfont\sffamily\normalsize\mdseries\itshape} % sans serif, italic
    {\thesubsubsection}
    {1em}
    {}
Many journals provide .bst files to save you effort (see e.g. PLoS Computational Biology for a relatively decent numeric citation style).

Bugs persist (Mar 2012) in Inkscape that mean it's not yet up to Illustrator; specifically, cutting lines with a shape doesn't work.
For example, to batch-fetch a list of sources in a convenient (relatively immutable, PDF) form, given a textfile containing a list of URLs: see below or fetch_multiple_urls_to_pdf.py.
#!/usr/bin/python
# Requires Debian packages: wget wkhtmltopdf
import sys
from subprocess import call

if len(sys.argv) != 2:  # the program name and one other
    sys.exit("Syntax: fetch_multiple_urls_to_pdf.py urllistfile")
f = open(sys.argv[1], "r")
for url in f:
    url = url.strip()
    destfile = url.replace("http://", "")
    if destfile[-1] == "/":
        destfile = destfile[:-1]
    destfile = destfile.replace("/", "_")
    if destfile[-4:] == ".pdf":
        command = "wget " + url + " -O " + destfile
    else:
        destfile += ".pdf"
        command = "wkhtmltopdf " + url + " " + destfile
    print "Executing " + command
    ret = call(command, shell=True)
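The URL-to-filename mangling in that script can be factored out and tested on its own. A sketch (Python 3, whereas the script above is Python 2; `url_to_destfile` is my name, not part of the original):

```python
def url_to_destfile(url):
    """Mirror the script's mangling: strip the scheme, drop a trailing
    slash, replace internal slashes with underscores, and append .pdf
    unless the URL already points at a PDF."""
    destfile = url.replace("http://", "")
    if destfile.endswith("/"):
        destfile = destfile[:-1]
    destfile = destfile.replace("/", "_")
    if not destfile.endswith(".pdf"):
        destfile += ".pdf"
    return destfile

print(url_to_destfile("http://example.com/papers/"))  # example.com_papers.pdf
```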
To generate the URL list from a textfile...
#!/usr/bin/perl
use strict;
my $filename = shift;
if ($filename eq "") {
    die "Syntax: list_urls.pl <filename>\n";
}
open(INFILE, "$filename") or die "Couldn't open $filename for reading.\n";
while (<INFILE>) {
    if (/(http\S*)/) {
        print "$1\n";
    }
}
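If you'd rather stay in Python, a rough equivalent sketch (Python 3; `list_urls` is my name); note that `re.findall` returns every match, whereas the Perl version prints only the first match per line:

```python
import re

def list_urls(text):
    r"""Return every whitespace-delimited run starting with 'http',
    using the same pattern as the Perl regex /(http\S*)/."""
    return re.findall(r"http\S*", text)

sample = "See http://example.com/a and https://example.org/b here."
for url in list_urls(sample):
    print(url)
```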
# The main package: pdftk (does all sorts of things)
sudo apt-get install pdftk
man pdftk
# example:
pdftk in.pdf cat 1-12 14-end output out1.pdf
# Extracting text:
pdftotext in.pdf out.txt
# Processing a PDF file through GhostScript (useful to find PDF corruption, amongst other things)
gs -dBATCH -dNOPAUSE -sDEVICE=pdfwrite -sOutputFile=output.pdf input.pdf
# ... use -sDEVICE=x11 to view, or -sDEVICE=nullpage just to pass through the input, with less restrictive error checks
# When GhostScript can't handle a pdf:
pdftops myfile.pdf
# ... generates myfile.ps
# Now list printers:
lpstat -p -d
# Now print the PostScript:
lpr -P<printer> myfile.ps
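The pdftk calls above are easy to drive from a script. A minimal sketch (assumes pdftk is installed; the function names are mine):

```python
import subprocess

def pdftk_cat_command(infile, pagerange, outfile):
    """Build the pdftk call for copying a page range, e.g. '1-12' or '14-end'."""
    return ["pdftk", infile, "cat", pagerange, "output", outfile]

def extract_pages(infile, pagerange, outfile):
    """Run pdftk to write the extracted pages to outfile."""
    subprocess.check_call(pdftk_cat_command(infile, pagerange, outfile))

# e.g. extract_pages("in.pdf", "1-12", "out1.pdf")
```

Passing the command as a list (rather than a single shell string) avoids any quoting problems with filenames containing spaces.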
psychosis OR psychotic OR delusion OR delusions OR hallucination OR hallucinations OR catatonic OR catatonia OR "thought disorder" "drug information" site:www.rxlist.com more:for_health_professionals
Add &num=100&filter=0 to the query URL (100 results/page, which is the maximum; this also turns off the near-duplicates filter).

Suppose your scanner produces page1.png. Then we'll use convert (part of ImageMagick) and tesseract to do some quick OCR.
convert page1.png page1.tif
tesseract page1.tif page1
The final result is page1.txt.
Probably achievable in a bash script, but this is quicker to write. Having split a PDF (with pdftk input.pdf burst), now renumber the pages so that pages 1–26 are named "...roman1"–"...roman26", but pages 27 onwards are renumbered from 1. See below or rename_proof_pages.py.
#!/usr/bin/python2.7
import sys, getpass, shlex, subprocess, re

def get_external_command_output(command):
    args = shlex.split(command)
    return subprocess.check_output(args)  # this needs Python 2.7 or higher

# need bash to do the wildcard expansion, then split() to make an array of a multiline string
filelist = get_external_command_output("bash -c \"ls *.pdf\"").split()
numberregex = r"\d+"
for infile in filelist:
    result = re.search(numberregex, infile)
    if result is not None:
        ipagenum = int(result.group(0))
        # Now, the first 26 pages are roman-numeralled.
        if ipagenum > 26:
            opagestr = str(ipagenum - 26)
        else:
            opagestr = "roman_" + str(ipagenum)
        outfile = "page_" + opagestr + ".pdf"
        get_external_command_output("cp " + infile + " " + outfile)
Use the following shell script (extract_pdf_text) as e.g. extract_pdf_text *.pdf. All it does is call pdftotext.
#!/bin/bash
# extract_pdf_text
# does very little!
# usage: extract_pdf_text *.pdf
# make sure you always put $f in double quotes to avoid any nasty surprises, i.e. "$f"
for f in "$@"
do
    echo "Processing $f file..."
    pdftotext "$f" "$f.txt"
done
Take a list of words (in a wordlist file); grep each word in turn against a bunch of files specified by the filespec; format the output a little. See below or multigrep.py.
#!/usr/bin/python2.7
import sys, getpass, shlex, subprocess, re

def raw_default(prompt, dflt=None):
    prompt = "%s [%s]: " % (prompt, dflt)
    res = raw_input(prompt)
    if not res and dflt:
        return dflt
    return res

def get_external_command_output(command):
    args = shlex.split(command)
    return subprocess.check_output(args)  # this needs Python 2.7 or higher

wordlistfile = raw_default("Wordlist file", "../../wordlist.txt")
resultsfile = raw_default("Results file", "../../abbreviation_search_results.txt")  # unused; redirect stdout to capture results
infilespec = raw_default("Input filespec", "*.txt")
infile = open(wordlistfile, "r")
wordlist = infile.read().split()
for word in wordlist:
    print "------------------------------------"
    print word
    print "------------------------------------"
    print get_external_command_output("bash -c \"grep " + word + " " + infilespec + "\"")
    print
You could use OpenOffice, but sometimes it crashes. An alternative:
# Prerequisites
sudo apt-get install rpm libgif4
# Now fetch odf-converter. Go to Novell, register (free)/log in, search under "OpenOffice",
# get "OpenOffice.OpenXML Translator 4.0" (e.g. odf-converter-4.0-12.1.i586.rpm)
# Now unpack/install
rpm2cpio odf-converter*rpm | cpio -ivd
sudo cp usr/lib/ooo-2.0/program/OdfConverter /usr/bin
# It wants libtiff.so.3, and we probably have libtiff.so.4, so give it a symlink
cd /usr/lib/
sudo ln -s libtiff.so.4 libtiff.so.3
# Can now use it:
OdfConverter /i example.docx
# That should produce example.odt (suitable for OpenOffice).
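OdfConverter handles one file at a time; batch conversion is a short loop away. A sketch (assumes OdfConverter is installed as above; the function names are mine):

```python
import subprocess

def convert_command(docx):
    """Build the OdfConverter call for one .docx file."""
    return ["OdfConverter", "/i", docx]

def batch_convert(docx_files):
    """Convert each .docx to .odt, reporting progress."""
    for docx in docx_files:
        subprocess.check_call(convert_command(docx))
        print(docx + " -> " + docx[:-len(".docx")] + ".odt")

# e.g.: import glob; batch_convert(glob.glob("*.docx"))
```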