Paul Ivanov
TL;DR version: "What's easy, won't last. What lasts, won't be easy."
In this talk, I will focus on the how of reproducible research: the specific tools and techniques I have found invaluable in doing research in a reproducible manner. In particular, I will cover the following general topics (with specific examples in parentheses): version control and code provenance (git), code verification (test-driven development, nosetests), data integrity (sha1, md5, git-annex), seed saving (random seed retention), distribution of datasets (mirroring, git-annex, metalinks), and lightweight analysis capture (ttyrec, IPython notebook).
My background: (since this will be autobiographical)
Data is hard to get: 1-2 years training animal on task, "minor brain surgery, every day of data collection" for 4-6 months, every day, 6-10 hours per day.
Data is very rich. It is hoarded. With a very tight lid.
My naive conclusion: Data is precious. Free the DATA!
If data was just more accessible....
But the reality is that having accessible data is not enough...
http://crcns.org
You need the code, and I don't mean a tar ball.
(including this presentation)
show of hands - familiar with some form of version control?
Git specifically?
It's not rocket science! There are sane GUIs for novice users.
I explained the benefits of version control to my biologist friend Sara, and put SmartGit on her machine. No more _v1, _v3_works, etc. "I didn't think it would be this easy".
Back in my home lab, we do computational experiments.
Unsupervised learning of natural signals. "How should the brain encode images given their properties?"
Very popular dataset (camera calibrated, linearized, uncompressed, etc) - the paper to cite it came out in 1998.
As of 2007, it had 336 citations according to Google Scholar (then the 99th most cited paper in the Vision literature).
Today that number is up to 776.
Then in 2010, an email was sent to a vision community mailing list saying:
Does anyone have a copy of van Hateren database? I have been looking
for the 4000 still image database. The links to images
http://hlab.phys.rug.nl/imlib/l1_200/index.html
are broken! And it looks like there is no mirror of the full database
anywhere. I would appreciate your help and suggestions.
So I put up a mirror.
Shortly thereafter, another grad student in a lab in Germany (one of my academic nephews), did the same.
This happened again a year later with another dataset. Luckily, I had downloaded that one as well, and now host the canonical version.
Lesson learned: don't take today's data sources for granted.
multiple resources for the same data (http, ftp, bittorrent) in one container file.
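To make the idea concrete, here is a minimal Python sketch of what a client can do with several sources for one file: try each in turn and accept the first copy whose checksum matches. It is not a Metalink parser, and `fetch_verified` is a hypothetical helper name invented for this illustration; only the standard library is assumed.

```python
import hashlib
import urllib.request


def fetch_verified(urls, expected_sha256):
    """Return the bytes from the first source whose SHA-256 matches."""
    errors = []
    for url in urls:
        try:
            with urllib.request.urlopen(url) as response:
                data = response.read()
        except OSError as exc:  # urllib's URLError is a subclass of OSError
            errors.append((url, repr(exc)))
            continue
        if hashlib.sha256(data).hexdigest() == expected_sha256:
            return data
        # the source answered, but served different bytes than expected
        errors.append((url, "checksum mismatch"))
    raise RuntimeError(f"no source delivered a verified copy: {errors}")
```

The point of the design is that a dead or corrupted mirror only costs one failed attempt, as long as at least one source still serves the bytes the recorded hash describes.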
I have some data, and a problem: I don't want to lose my data.
So I make a copy -- and now I have two problems.
distributed digital hoarding.
Keep track of file signatures (hashes)
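As a rough sketch of what keeping track of signatures can look like without any special tooling, the Python below hashes a file in chunks and checks a recorded manifest against the files on disk. The names `file_digest` and `verify_manifest` and the dict-based manifest are assumptions made for this example.

```python
import hashlib


def file_digest(path, algorithm="sha256", chunk_size=1 << 20):
    """Hash a file in chunks so large files never need to fit in memory."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_manifest(manifest):
    """Return the paths whose current digest differs from the recorded one.

    `manifest` maps a file path to the SHA-256 hex digest recorded earlier.
    """
    return [path for path, recorded in manifest.items()
            if file_digest(path) != recorded]
```

Recording the manifest once, right after data collection, and re-running `verify_manifest` on every copy is enough to notice silent corruption or accidental edits.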
%%bash
# %%bash is an IPython magic that allows me to type shell commands in the rest of the cell
cd ~/cur/siam; chmod +rwx -R tmp fake_usb;
rm -fr tmp fake_usb; # start with a clean slate
mkdir -p tmp
cd tmp
git init .
git annex init "local laptop"
Initialized empty Git repository in /home/pi/code/workspace/everest/siam/tmp/.git/ init local laptop ok (Recording state in git...)
chmod: cannot access ‘fake_usb’: No such file or directory
%cd ~/cur/siam/tmp
/home/pi/code/workspace/everest/siam/tmp
%%bash
# let's just make a file to see how annex works
echo "pretend this is a large file" > original.dus
for x in {1..10000}; do echo GATTACA >> original.dus; done
ls -lh
total 80K -rw-r--r-- 1 pi pi 79K Feb 28 17:37 original.dus
This is a ~80K file, let's check it into git annex
!git annex add original.dus
add original.dus (checksum...) ok (Recording state in git...)
Let's see what happened to it:
!ls -lh
total 4.0K lrwxrwxrwx 1 pi pi 184 Feb 28 17:37 original.dus -> .git/annex/objects/vP/0j/SHA256-s80029--1a8fcd77de8926f8dd5d913d1f212e1c2e49509de1096e295952b556219f1e2a/SHA256-s80029--1a8fcd77de8926f8dd5d913d1f212e1c2e49509de1096e295952b556219f1e2a
So by annexing the file, we've hashed its contents, moved the file under .git/annex/objects with a name derived from that hash, and left a symbolic link in its place (content-based addressing).
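The mechanics can be imitated in a few lines of Python. This is only a sketch of content-based addressing, not git-annex's implementation: the key mimics the `SHA256-s<size>--<digest>` pattern visible in the listing above, while the two-level hash directories and everything else git-annex does are left out. `annex_like_add` and `store_dir` are hypothetical names.

```python
import hashlib
import os
import shutil


def annex_like_add(path, store_dir):
    """Move `path` into `store_dir` under a content-derived key, leave a symlink."""
    with open(path, "rb") as handle:
        data = handle.read()  # fine for a sketch; hash in chunks for big files
    key = f"SHA256-s{len(data)}--{hashlib.sha256(data).hexdigest()}"
    os.makedirs(store_dir, exist_ok=True)
    target = os.path.join(store_dir, key)
    if os.path.exists(target):
        os.remove(path)  # identical content is already stored once
    else:
        shutil.move(path, target)
    # a relative link keeps working if the whole tree is moved
    os.symlink(os.path.relpath(target, os.path.dirname(path) or "."), path)
    return key
```

Because the name is derived from the content, two files with identical bytes end up as two links to a single stored object, and a file that no longer matches its own name has been corrupted.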
It turns out git annex also staged this symbolic link for us in git.
!git status
# On branch master
#
# Initial commit
#
# Changes to be committed:
# (use "git rm --cached <file>..." to unstage)
#
# new file: original.dus
#
Let's check that into git.
!git commit -m"original data of unusual size checked in"
[master (root-commit) fb0751e] original data of unusual size checked in 1 file changed, 1 insertion(+) create mode 120000 original.dus
!git log
commit fb0751ef2b2044e623d855ba0b881d3744115cac
Author: Paul Ivanov <pi@berkeley.edu>
Date: Thu Feb 28 17:39:29 2013 -0500
original data of unusual size checked in
What did we actually check in? just one line - a symbolic link pointing to the contents of original.dus
!git log -p
commit fb0751ef2b2044e623d855ba0b881d3744115cac
Author: Paul Ivanov <pi@berkeley.edu>
Date: Thu Feb 28 17:39:29 2013 -0500
original data of unusual size checked in
diff --git a/original.dus b/original.dus
new file mode 120000
index 0000000..b30551c
--- /dev/null
+++ b/original.dus
@@ -0,0 +1 @@
+.git/annex/objects/vP/0j/SHA256-s80029--1a8fcd77de8926f8dd5d913d1f212e1c2e49509de1096e295952b556219f1e2a/SHA256-s80029--1a8fcd77de8926f8dd5d913d1f212e1c2e49509de1096e295952b556219f1e2a
\ No newline at end of file
!git annex whereis ./original.dus
whereis original.dus (1 copy) 74eda928-81f7-11e2-bfa9-e3a22b8f91cd -- here (local laptop) ok
Let's copy this repository to an external harddrive:
!git clone ./ ../fake_usb
Cloning into '../fake_usb'... done.
cd ../fake_usb/
/home/pi/code/workspace/everest/siam/fake_usb
!git annex init "pi's external harddrive"
init pi's external harddrive ok (Recording state in git...)
ls -al
total 16 drwxr--r-- 3 pi pi 4096 Feb 28 17:40 ./ drwxr--r-- 6 pi pi 4096 Feb 28 17:40 ../ drwxr--r-- 9 pi pi 4096 Feb 28 17:40 .git/ lrwxrwxrwx 1 pi pi 184 Feb 28 17:40 original.dus -> .git/annex/objects/vP/0j/SHA256-s80029--1a8fcd77de8926f8dd5d913d1f212e1c2e49509de1096e295952b556219f1e2a/SHA256-s80029--1a8fcd77de8926f8dd5d913d1f212e1c2e49509de1096e295952b556219f1e2a
!head original.dus
head: cannot open ‘original.dus’ for reading: No such file or directory
On the external harddrive, we only have a catalogue of the annexed files. We can grab them explicitly:
!git annex whereis original.dus
(merging origin/git-annex into git-annex...) whereis original.dus (1 copy) 74eda928-81f7-11e2-bfa9-e3a22b8f91cd -- origin (local laptop) ok
!git annex get original.dus
get original.dus (from origin...) ok (Recording state in git...)
ls -al
total 16 drwxr--r-- 3 pi pi 4096 Feb 28 17:40 ./ drwxr--r-- 6 pi pi 4096 Feb 28 17:40 ../ drwxr--r-- 9 pi pi 4096 Feb 28 17:41 .git/ lrwxrwxrwx 1 pi pi 184 Feb 28 17:40 original.dus -> .git/annex/objects/vP/0j/SHA256-s80029--1a8fcd77de8926f8dd5d913d1f212e1c2e49509de1096e295952b556219f1e2a/SHA256-s80029--1a8fcd77de8926f8dd5d913d1f212e1c2e49509de1096e295952b556219f1e2a
!head original.dus
pretend this is a large file GATTACA GATTACA GATTACA GATTACA GATTACA GATTACA GATTACA GATTACA GATTACA
Here's an example of one of my annexes: the total known annex size is 557 GB, but this laptop only holds 6 GB of it (and it only has a 100 GB SSD).
The key point is that the catalogue is available in a very lightweight manner. Everything in the catalogue is just a git annex get away.
%%bash
# this cell will only run on pi's computer
cd ~/annex
git annex status
supported backends: SHA256 SHA1 SHA512 SHA224 SHA384 SHA256E SHA1E SHA512E SHA224E SHA384E WORM URL supported remote types: git S3 bup directory rsync web hook trusted repositories: 0 semitrusted repositories: 13 00000000-0000-0000-0000-000000000001 -- web 094dfddc-df61-11e1-9750-37e06b8a9271 -- passport 370e23eb-e8a6-4e4e-a541-126e86707d24 -- pirr (pirsquared.org: ~/data) 3a3f810a-bcba-11e1-8c79-abd1241ca1ec -- mybook-baregit (My Book bare git repo) 3ee45d74-bcbb-11e1-9248-134861891c80 -- mybook 82f51036-bb84-11e1-97d3-4b1ce8380653 -- apxrsync (ApxuMed rsync) 9a3d31ac-bb83-11e1-aa4a-07e10779860b -- here (HbIOTOH) ApxuMed: -- ~/data a113d11c-0cdc-45c4-95ec-896f051f58b9 -- apxumed cb0ec9da-bcb9-11e1-ba58-aba4ddc57a0f -- mybook d85bce23-e501-489c-aa3f-af86fac17b14 -- ApxuMed ~/data e8a75f48-d852-11e1-b3b8-771474033e82 -- g2usb1 (16GB DT101 G2) eb874b2a-bbf0-11e1-b174-afbb2a5f838f -- HbIOTOH /home/pi/data untrusted repositories: 0 dead repositories: 1 ce41e790-bcb9-11e1-bcf6-0b61b49dc5c3 -- mybook transfers in progress: none available local disk space: 4 gigabytes (+1 megabyte reserved) temporary directory size: 89 megabytes (clean up with git-annex unused) local annex keys: 699 local annex size: 6 gigabytes known annex keys: 5621 known annex size: 557 gigabytes bloom filter size: 16 mebibytes (0.1% full) backend usage: SHA256: 6300 URL: 20
Ok, now that we have data under control, let's move on to doing something with it... (code)
"Trust, but verify!"
How do you know that a tool is any good?
import numpy as np
np.test()
Running unit tests for numpy NumPy version 1.6.2 NumPy is installed in /usr/lib/pymodules/python2.7/numpy Python version 2.7.3 (default, Jan 2 2013, 13:56:14) [GCC 4.7.2] nose version 1.1.2
.............................................................................................................................................................................................................................................................................................S............................................................................................................................................................................................................................................................................KK...................................................................................................................................SSS.........................................................................................................................................................................................................................................................................................K.....................................................................................................K......................K.......................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................
............................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................
---------------------------------------------------------------------- Ran 3196 tests in 19.538s OK (KNOWNFAIL=5, SKIP=4)
<nose.result.TextTestResult run=3196 errors=0 failures=0>
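The same habit applies to your own analysis code. Below is a small, hypothetical example (the `rms` function is invented for illustration): test functions named `test_*` sit next to the code they check, which is the naming convention that runners such as nose discover. Plain `assert` statements are used so nothing beyond the standard library is assumed.

```python
import math


def rms(values):
    """Root mean square of a non-empty sequence of numbers."""
    if not values:
        raise ValueError("rms() needs at least one value")
    return math.sqrt(sum(v * v for v in values) / len(values))


def test_rms_of_constant_signal():
    # a constant signal has an RMS equal to its amplitude
    assert rms([3.0, 3.0, 3.0]) == 3.0


def test_rms_ignores_sign():
    assert rms([-2.0, 2.0]) == rms([2.0, 2.0])


def test_rms_rejects_empty_input():
    try:
        rms([])
    except ValueError:
        return
    raise AssertionError("expected ValueError for empty input")
```

Writing the test first, watching it fail, and only then writing `rms` is the test-driven variant of the same idea.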
ttyrec is a lightweight capture tool (I use it daily; it helps me account for how I spend my time). It just writes everything you see in the shell to a file, with timing information, which you can later play back.
demo in the shell (ttyplay ~/2012-08-01_2.tty)
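To show how little is involved in capturing output with timing information, here is a hedged Python sketch of a timestamped frame log. It assumes the record layout ttyrec is commonly described as using (a header of seconds, microseconds, and payload length as little-endian 32-bit integers, followed by the raw bytes); treat that layout as my assumption rather than a specification. `write_frame` and `read_frames` are hypothetical names.

```python
import io
import struct

HEADER = struct.Struct("<III")  # seconds, microseconds, payload length


def write_frame(stream, timestamp, payload):
    """Append one chunk of terminal output together with the time it appeared."""
    total_micros = int(round(timestamp * 1_000_000))
    seconds, micros = divmod(total_micros, 1_000_000)
    stream.write(HEADER.pack(seconds, micros, len(payload)))
    stream.write(payload)


def read_frames(stream):
    """Yield (timestamp, payload) pairs until the stream is exhausted."""
    while True:
        header = stream.read(HEADER.size)
        if len(header) < HEADER.size:
            return
        seconds, micros, length = HEADER.unpack(header)
        yield seconds + micros / 1_000_000, stream.read(length)
```

Playback is then just a loop over `read_frames` that sleeps for the difference between consecutive timestamps before printing each payload.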