A merge-based gitflow

So, I’ve never considered myself a particularly adept git user; worse, my last few (collaborative) coding experiences involved janky dropbox protocols so project members didn’t have to use git. Now that I’m a project that doesn’t involve dropbox, and everyone accepts using git (and github) to collaboratively code, the question of workflow came up. In particular, “how should we use git to most effectively write code?” Because I didn’t want to tackle the issue of how each project member should use git for their individual needs (that’s pretty personal, and I’m still working out my own philosophy), I figured it’d be more fruitful to address the question of how the group should collectively manage code development.

I originally wanted to implement a rebase-based workflow, a la The Every Deployable Github Workflow but (long story short) github won’t let you execute that work flow.1 So, I decided to stick with the “integration manager” theme where I’d use my github repo as the “blessed repository” and also play the role of the integration manager. I like the idea of a, senior, integration manager since it places the responsibility of handling merges — an operation that can go awry if handled by someone new to git — into the hands of one person, thus minimizing the chance of any git-related errors. Ideally (and with competent team mates), the integration manager won’t have to do anything other than a cursory code-review, since any pull-requests should be yield conflict-free merges. However, life isn’t perfect, so we want the integration manager to be the guy with the most git-ninja skills in the room.

Because I understand that git can be something of an overwhelming tool for beginners, I figured I’d graphically demonstrate what the development cycle of a feature might look inside this merge-based gitflow. It starts with a few ideas/rules:

  1. the code is the master branch is “pure” in the sense that anyone should be able to fork the repo, and branch off of master and be good-to-go.
  2. topic branches should have meaningful names (a rule the diagrams below violate, unfortunately).
  3. branches named “someone/topic_branch” are private to that someone, and shouldn’t be considered stable to pull without consulting that someone (or a pull-request to the integration manager).
  4. no one works on the same topic branch. If two people want to work on the same topic branch, they create their own branches, appropriately named, and have the integration manager pull their private branches back into the topic branch.

So, let’s see how this would work (somewhat) in practice:

  1. we’ll demo this with a collaboration between Alice and Bob. They’ve decided to develop a (complicated) feature and dedicated branch DEV to it. The repo might look something like:


    Alice’s private branch off of DEV is named Alice (it’s head is indicated by the rectangular tag), Bob’s private branch is named Bob and both have created independent commits. Now,

  2. Alice might decide that her progress by her second commit is stable enough to make it to the DEV branch. At this point, she’d submit a pull-request to the integration manager, noting that he’s to pull Alice into DEV. Since DEV hasn’t progressed since Alice (and Bob) branched off it, this pull is just a fast-forward of DEV‘s HEAD. The resulting history tree may look something like so:
  3. Sometime, and a few lines of code, later, Alice has pushed another commit and Bob has realized that he needs to merge in the updated DEV into his branch. He does so, and makes some more head way, pushing another commit up, as well:
    # on Bob's computer
    git fetch origin
    git checkout Bob
    git merge DEV
    # ... Bob does more work ... #
    git commit
    git push origin Bob

    The repo history, now, might look like this
    The thing to note, here, is that while DEV and Alice were one-and-the-same in the previous figure, Alice has (again) diverged from DEV with the new commit. So, we’re looking at 3 branches. Next,

  4. let’s say that with Bob’s last commit, he’s decided that that’s all the work he’s going to do on DEV, and sends a pull-request to the integration manager (and lets the manager know he’s done). Bob then get’s merged into DEV, which then has Alice merge DEV into her branch (just to make sure her work is still compatible with the code in DEV).


    Alice, is not done hammering at this feature, so

  5. she makes one final commit and issues a pull-request to the integration manager (and let’s him know she’s done). At this point, the integration manager knows that DEV is done being developed and merges DEV into Master for a history like:


Now some people may be inclined to do some clean up, after all this merging, and delete branches. For the sake of reverting changes, or rolling back bad code, I’m encouraging the integration manager to be lazy and not delete any topic branch (private or otherwise).


1 Really, the problem is that github will not let you push local changes up to a remote branch unless the changes would result in a fast-forward of the remote branch’s HEAD. So, say you were tracking a remote branch, origin/dev with a local branch, dev. Once you rebase dev onto another branch, you can’t (simply) git push origin dev back up to origin/dev since that requires something more than a fast-forward of origin/dev‘s HEAD; you can, however, push it up via git push -f origin dev. But, I consider that really sloppy practice, especially for beginners, so I’m trying not to encourage it.

Accessing your iPhone’s HDD the smart way.

By that, I mean via sshfs.

To access your phone’s hard drive, you’ll need to:

  1. jailbreak it. Unfortunately, this means that you’ll need to roll back the operating system to iOS 6.1.2 (or earlier).
  2. Open up Cydia and install OpenSSH.
  3. Connect to a wifi network — let’s say the network is called “home”
  4. Get your phone’s IP address. There are a few ways to do this, but the way that’s common to everyone is by going to Settings > Wi-Fi, then accessing “home”.

Once you’ve done all that, you can SSH into your phone like you would any other device — if you’re on a Mac or Windows box, and have no idea what that even means, check out this link. Just note that the default username and password for your device are root and alpine (respectively).

Now, you could call it quits at this point. But, that’s dumb. Especially if you’re like me and you take pictures/videos and have the (occasional) need to transfer them to your computer. Why is this a problem with plain, old, SSH? Well, tunneling to your phone’s HDD via SSH doesn’t let you see photo (or video) thumbnails (not in Ubuntu’s Nautilus, at least). This is a huge bummer when you have to wait through dozens of photos looking for a particular collection.

If only there was a way we could connect to our phones, and also see thumbnail previews… Oh wait, (if you’re using linux) there is! Mount your phone’s hard drive via sshfs, following the guidelines setup over in my first post about this wonderful technology.

You can follow those directions to a t and be good to go, but I’ve taken one extra step: I added a public key to my iPhone so I don’t need to keep re-entering my password every time I mount and un-mount the phone’s filesystem. Note: the directory ~/.ssh may not exist on your phone when you’re first attempting this, so you may need to ssh in and make it (use the mkdir command). Also, feel free to create an alias for your iPhone so you don’t need to keep typing ssh root@ip.address or scp /some/file root@ip.address:/some/destination at the command line.

As an example, my ~/.ssh/config file contains

host iPhone
User root
IdentityFile ~/.ssh/id_rsa

and consequently, the last line of my /etc/fstab file is

iPhone:/ /home/steven/iPhone fuse.sshfs defaults,user,noauto,idmap=user,allow_other 0 0

I’m pretty sure this should get you up and running, but, if not: drop a note in the comments and maybe I can help you.

A recollection on approximating integrals…

So, I’m getting back in to the groove of school work, and upon a lecture slide I see the following:

If T_1 = t_1, T_2 = t_2, \ldots, T_n = t_n are observed, then we may estimate \Lambda(t) by
\Lambda_{n}(t) := \displaystyle \int_{0}^{t} g(x) \, dF_{n}(x) = \dfrac{1}{n} \sum_{i=1}^{n} g(t_i) I(0 \leq t_i \leq t)

Understandably, my eyes immediately glossed over and my brain went hazy after reading that. I may have even gone partially slack-jawed, and uttered a “what… the…”. So, in an effort to figure out why/how we get the second equality above I went back to the basics: good old measure theory.

Turns out, that was pretty much a waste of time. I mean, you’re never worse for wear after a measure theory refresher; however, the answer for my quandary doesn’t demand anything close to that kind of heavy machinery. What we’re dealing with here is the Weak Law of Large Numbers, plain and simple.

Why?/How? Well, recall that (using fancy notation) we write
\displaystyle E[X] = \int_{\text{supp}(X)} \omega \, dF(\omega)
where F is the cumulative distribution function of our random variable X. Furthermore, if g is some (deterministic) function, then g(X) inherits randomness from X, and this allows us to calculate its expectation via the formula
\displaystyle E[g(X)] = \int_{\text{supp}(X)} g(\omega) \, dF(\omega)

So, what the Weak Law is saying is that given a collection of observations, X_1, X_2, \ldots, X_n (all assumed to be independent of each other and sharing common distribution function, F), we can approximate the integral above with the average
\displaystyle \hat{\mu}_{n} := \frac{1}{n}\sum_{i=1}^{n} g(X_i)

But now, we’re only a hop, skip, and a jump away from reconciling our original problem since we need only remember that the measure dF_{n}(\omega) is just short-hand for “only weight the observed values, and give all of them equal mass”. That is,
dF_{n}(\omega) = \begin{cases} n^{-1} &\text{if } \omega \text{ is an observed value} \\ 0 &\text{otherwise}\end{cases}

\displaystyle \hat{\mu}_{n} = \int_{\text{supp}(X)} g(\omega) \, dF_{n}(\omega)

Now, we’re pretty much done because we have the simple identity
\displaystyle \int_{0}^{t} g(\omega) \, dF(\omega) = \int_{\text{supp}(X)} g(\omega) I(0 \leq \omega \leq t) \, dF(\omega)
and this puts all the pieces together:

\displaystyle \Lambda_{n}(t) = \int_{\text{supp}(X)} g(\omega) I(0 \leq \omega \leq t) \, dF_{n} = \frac{1}{n} \sum_{i=1}^{n} g(t_i) I(t_i \in [0,t])

(Conditional) Expectations — setting the bar low, and keeping it there

For whatever reason, I’ve struggled with the concept of Conditional Expectation my entire probabilistic life. I’ve been consoled that this is a rather subtle topic, but I feel like that encouragement is a bit hollow.

That being said, I was sitting around thinking some more about the general, underlying, idea of a conditional expectation and came up with something. Consider two random variables, X and Y and some set, B. To make things concrete, you can pretend that B = {1,2,3}, though it could anything (e.g. a single number, interval, union of intervals). Now, say I’m interested in the expected value of X given Y takes on values in B. Precisely,

E[X \mid Y \in B]

At first blush, my (stupid) brain wants to do something like:

  1. Find out which events place Y in B (i.e. when is Y equal to 1, 2, or 3). Call these points, C, and then
  2. take a weighted average of X at these points.

In mathematics, that’s written

\displaystyle \int_{\omega \in C} X(\omega) \, dP(\omega)

However, there’s a problem here. This number doesn’t take into account the fact that Y, necessarily, takes values in B. Right now, we’re just computing the weighted average of X wherever Y lands in B. We want to inflate (or deflate) this by the knowledge that Y isn’t as random as it could be. Pedantically, we want to normalize this by the proportion of the time Y lands in B. Fancy speak aside, we need to divide by the probability that Y lands in B.

Again, in mathematics, this translates to

\displaystyle E[X \mid Y \in B] = \dfrac{\displaystyle \int_{\omega \in C} X(\omega) \, dP(\omega)}{P(Y \in B)}

But, for the advanced reader, it turns out that there’s a bit more here. In particular, we can rewrite everything to only include expectations and indicator functions:

\displaystyle E[X \mid Y \in B] = \dfrac{E[ X \, \mathbb{I}(Y \in B)]}{E[\mathbb{I}(Y\in B)]}

Even better still, this formula immediately reveals conditional probability as a special case of conditional expectation. To see this, suppose you have another set A, and then replacing X with the random variable that indicates whether or not X is in A we have:

E[ \mathbb{I}(X \in A) \mid Y \in B] = \dfrac{E [ \mathbb{I}(X \in A) \, \mathbb{I}(Y \in B) ] }{E[\mathbb{I}(Y\in B)]} = \dfrac{E[ \mathbb{I}(X \in A, Y \in B) ]}{ E[\mathbb{I}(Y \in B)] }

Since we recognize the expectation (conditional, or otherwise) of an indicator as a probability, we see that the left-most object above is P(X \in A \mid Y \in B), while the right-most object is P(X \in A, Y\in B)/P(Y \in B). This is exactly the formula you’re taught in Probability 101!

As an application of this, I’ll look at the very question that got me started on this whole thing:

Let T be a non-negative random variable and denote its survival function by S. The mean residual lifetime at t ≥ 0, denoted by m(t), is defined as E( T – t | T > t) and, due to its interpretability, can a very useful measure in practice. Show that m(t) can be calculated as \int_{t}^{\infty} S(u) \, du/S(t) for each t ≥ 0.

This problem is pretty simple, once we use the identity above. Just replace X with T-t, Y with T and B with (t, \infty):

m(t) = \dfrac{E[(T-t)\mathbb{I}(T > t)]}{P(T > t)} = \dfrac{\displaystyle \int_{t}^{\infty} (u-t) \, dF(u)}{S(t)} = \dfrac{\displaystyle \int_{t}^{\infty}(t-u) \, dS(u)}{S(t)}

But, if you use integration by parts, and assume the existence of the first moment for T, you get

m(t) = \dfrac{(t-u) S(u)\biggr|_{u=t}^{\infty} + \displaystyle \int_{t}^{\infty} S(u) \, du}{S(t)} = \dfrac{\displaystyle \int_{t}^{\infty} S(u) \, du}{S(t)}

Neural Networks I: The Backwards-Propagation Algorithm

Consider the situation where you’re using a Neural Network for a multi-class classification problem. Explicitly, you have data (x,t) \in \mathbb{R}^{n} \times \{e_1,e_2,\ldots, e_p\} for p \geq 3 where e_i \in \mathbb{R}^{p} is the unit vector in the pth direction (i.e. e_i = (0,0,\ldots, 1, 0,\ldots, 0)) and you’re going to use the logistic function g(x) = (1+e^{-x})^{-1} as your primitive function (in the network). To make this a learning problem, we need a cost function. Let’s consider the standard cost function taken when doing multi-class logistic regression:

\displaystyle E_{\Theta}(x,t) = -\sum_{i=1}^{p} \ell_{\Theta}(x,t_i) = \sum_{i=1}^{p} t_i \log(h_{\Theta}(x)_{i}) + (1-t_i)\log(1-h_{\Theta}(x)_{i})

While there are lots of ways to minimize this function with respect to \Theta, we’ll use calculus by taking advantage of the fact that our hypothesis, h_{\Theta}(x) is the composition of differentiable functions, and hence is, also, differentiable. Before we do anything, let’s set some conventions:

  1. a^{(i)} = \begin{cases} x &\text{ if } i=1 \\ g(z^{(i)}) &\text{ if } i \geq 2 \end{cases}
  2. z^{(i+1)} = \Theta^{(i)} a^{(i)}
  3. \delta^{(l)}_{i} \stackrel{\Delta}{=} \dfrac{\partial E}{\partial z^{(l)}_{i}}

So, a^{(i)} is the vector of activation values for the i-th (hidden) layer and thus z^{(i)} are the input values for the nodes in the i-th layer. In particular, let’s say that we have L layers, and the j-th layer has s_{j} nodes. Note that s_{L}=p.

Now, if we short-hand a^{(L)} = h_{\Theta}(x), it will be helpful to consider our cost function as a composition of functions, E = f(a^{(L)},t) = f(g(z^{(L)}),t). Thus, via the chain-rule

\displaystyle \delta_{j}^{(L)} = \dfrac{\partial E}{\partial z_{j}^{(L)}} = \sum_{i=1}^{p} \dfrac{\partial E}{\partial a_{i}^{(L)}} \dfrac{\partial a_{i}^{(L)}}{\partial z_{j}^{(L)}} = \sum_{i=1}^{p} \left(- \sum_{s=1}^{p}\dfrac{\partial \ell_{\theta}(a^{(L)},t_s)}{\partial a_{i}^{(L)}}\right) \dfrac{\partial g(z_{i}^{(L)})}{\partial z_{j}^{(L)}} = \dfrac{a_{j}^{(L)}-t_j}{a_{j}^{(L)}(1-a_{j}^{(L)})} \, g'(z_{j}^{(L)})

which simplifies to \delta_{j}^{(L)} = a_{j}^{(L)}-t_j, since g'(x) = g(x)(1-g(x)). Why do we care about this? Well, it turns out that because z^{(l+1)} = \Theta^{(l)} a^{(l)} = \Theta^{(l)} g(z^{(l)}), \delta_{j}^{(l)} can be recursively defined:

\displaystyle \delta_{j}^{(l)} = \dfrac{\partial E}{\partial z_{j}^{(l)} } = \sum_{i=1}^{s_{l+1}} \dfrac{\partial E}{\partial z_{i}^{(l+1)} } \dfrac{ z_{i}^{(l+1)} }{z_{j}^{(l)} } = \sum_{i=1}^{s_{l+1}} \delta_{i}^{(l+1)} \dfrac{\partial}{\partial z_{j}^{(l)}} \sum_{k=1}^{s_{l}} \Theta_{ik}^{(l)} g(z^{(l)}_{k}) = \sum_{i=1}^{s_{l+1}} \delta_{i}^{(l+1)} \Theta_{ij}^{(l)} g'(z^{(l)}_j)

More compactly,

\displaystyle \delta_{j}^{(l)} = \langle g'(z_j^{(l)}) \Theta_{j}^{(l)}, \delta^{(l+1)} \rangle

where \Theta_{j}^{(l)} is the j-th column of \Theta^{(l)}. Hence, we can write out the vector \delta^{(l)} as the element-wise product

\displaystyle \delta^{(l)} = g'(z^{(l)}) * \left[\left(\Theta^{(l)}\right)^{T} \delta^{(l+1)} \right]

Now, we can finally figure out our weight updates for gradient descent (or whatever other method we’re using) since

\displaystyle \dfrac{\partial E}{\partial \Theta_{ij}^{(l)}} = \sum_{k=1}^{s_{l+1}} \dfrac{\partial E}{\partial z_{k}^{(l+1)} } \dfrac{\partial z_{k}^{(l+1)} }{\partial \Theta_{ij}^{(l)}} = \sum_{k=1}^{s_{l+1}} \delta_{k}^{(l+1)} \dfrac{\partial}{\partial \Theta_{ij}^{(l)}} \sum_{m=1}^{s_{l}} \Theta_{km}^{(l)} a_{m}^{(l)} = \delta_{i}^{(l+1)} a_{j}^{(l)}

At this point there are a few things to note:

  1. This derivation was done for the cost of training on a single point. More realistic would be the scenario where you have m data points, and in which case you’d consider the cost function

    \displaystyle J_{\lambda}(\Theta) = \frac{1}{m}\sum_{s=1}^{m} E_{\Theta}(x^{(s)},t^{(s)}) + \dfrac{\lambda}{2m}\sum_{\substack{i,j,l \\ j \neq 0}} \left(\Theta_{ij}^{(l)}\right)^2

    whose “gradient” is just the average of the terms we’ve derived, plus a regularization penalty:

    \displaystyle \dfrac{\partial J_{\lambda}}{\partial \Theta_{ij}^{(l)}} = \dfrac{1}{m}\sum_{s=1}^{m} \dfrac{\partial E(x^{(s)},t^{(s)})}{\partial \Theta_{ij}^{(l)}} + \lambda \Theta_{ij}^{(l)} I(j\neq 0) \equiv \dfrac{1}{m}\sum_{s=1}^{m} \Delta(s) + \lambda \Theta_{ij}^{(l)} I(j \neq 0)

    This gives us a fairly straight-forward algorithm for computing the partials of J:

    Given a training set {(x[1],t[1]),...,(x[m],t[m])}
    initialize Theta[i,j,l]
    set Delta[i,j,l] := 0
    for (s=1 to m):
        set a[1] := x[s]
        do forward-propagation to calculate a[i] for i=2,3,...,L
        set delta[L] := a[L] - t[s]
        do backwards-propagation to calculate delta[i] for i=L-1,...,2
        set Delta[i,j,l] := Delta[i,j,l] + a[l][j]*delta[l+1][i]
    for (j != 0):
        set D[i,j,l] := Delta[i,j,l]/m + lambda*Theta[i,j,l]
    for (j = 0):
        set D[i,j,l] := Delta[i,j,l]/m
  2. Nothing about this derivation of \delta^{(l)}_{i} used properties of g(x), except during the calculation of \delta^{(L)}. Thus, you can perform this algorithm with any differentiable function, f(x), and perform back-propagation starting with

    \delta^{(L)} = \dfrac{ (a^{(L)}-t)f'(z^{(L)}) }{a^{(L)}(1-a^{(L)})}

SSHFS, a how-to

So as I’m soon to write about, I found myself in a situation where I needed to run a simulation study on Grizzly Bear (the Biostat department’s cluster over at Berkeley). Given that this is a remote workstation, the standard interface/workflow (pardon the over-loaded jargon) is going to involve the terminal and at least one ssh command. In reality, my situation involved a rather lengthy study, as well as multiple, graphical, outputs as the simulation progressed. So, in addition to ssh‘ing in and out, one would anticipate some calls to tar (or gzip) and then scp. This work flow is annoying (to say the least) and really makes you wonder why there isn’t a better way.

Turns out, (like all things computer related) smarter people than I have encountered (and worked out a solution to) this problem. Linux users may already know it in a neutered form, but let me present to you sshfs. In a compact soundbite,

sshfs allows you to mount (and unmount) remote directories as if they were local.

Now, you might be saying, “but Steven, if I go to nautilus and type in (s)ftp://user@remote.address:path/to/my/stuff, I can just enter my credentials and I’m already browsing my remote files as if they were on my computer”. While this is fair, there are a few problems with this solution:

  1. It’s annoying to keep having to type (s)ftp://user@remote.address:path/to/my/stuff
  2. This solution doesn’t work when you’re connecting to your remote workstation with a .pem file or some other identification key. (e.g. you want to access files on an Amazon EC2 instance.)

In effect, setting up sshfs will allow you to deal with both of the above problems, and never think about them again. So, what do you need to do?

  1. Go read Ubuntu’s setup guide
  2. Set up an ssh alias so you don’t have to keep typing that annoying ssh command into the terminal. For example, my ~/.ssh/config file is pretty bare bones:
    host grizzlybear
    HostName grizzlybear.berkeley.edu
    User steven.pollack
    host EC2
    User ubuntu
    IdentityFile ~/path/to/my/pem/file

    So, you can see how the second entry lets me shortcut typing

    $ ssh -i ~/path/to/my/pem/file ubuntu@

    with just

    ssh EC2

    As it stands, I can’t find any option in the ssh_config man page that tells me how I can create an ssh alias that automatically calls tmux (or screen) after I connect. However, this blog post seems to suggest the (plain, old) alias

    alias ec2.w.tmux="ssh EC2 -t tmux a"

    will do that job for me, with no issues.

  3. Configure your /etc/fstab file to allow for smooth mounting and unmounting from within nautilus. Warning: improperly editing your /etc/fstab file can nuke your system install. This happened to me and I had to turn my thumb drive into a bootable copy of Ubuntu to get back to my file system and fix the changes. After everything was said and done, I now have an /etc/fstab that looks likefstabTo briefly explain:
    • I’m using ~/grizzlybear as a mounting point for the remote directory,
    • I’ve set my device filesystem type to fuse.sshfs since I was having trouble unmounting my device from nautilus (and this blog post seemed to yield a solution),
    • and my options, defaults,user,noauto,idmap=user,allow_other say that I should set all the default options to this device, however:
      1. let non-root users mount and unmount
      2. do not auto-mount this device
      3. make sure the permissions line up between the remote and local users. So, even though my local user name is steven, and my remote user name is steven.pollack, make sure there is an understanding that in regards to permissions, these are the same user, and
      4. allow other users to access the ftp mount directory

After all of that, you should see your device in the “Devices” section of Nautilus, and (un)mounting should be as smooth as glass:

If you want to learn more about ssh_config files, this blog post has some more in-depth examples.

One more thing (on the Median)

This is a quick extension of the post I made the other day on why we shouldn’t be so quick to conflate the mean and the median of a data set.

Suppose you’re given some sample data that tracked weight change for 15 different people (strangers, really) before some treatment, and then after some treatment. The data may come to you in the form (x_1,y_1), (x_2, y_2), \ldots, (x_{15}, y_{15}), but what you’re really interested in, is the affect of the treatment: d_i = y_i - x_i. In particular, you want to know if the treatment had some affect. You may want to look at the mean of the differences,

\text{mean}(d_1, d_2, \ldots, d_{15})

but you may also want to investigate the median,

\text{median}(d_1, d_2, \ldots, d_{15})

In fact, beyond just looking at the median, you want to consider the hypothesis that the median is zero. The material from the last post lends itself really well towards creating a test for this. It pretty much comes for free from the duality between confidence intervals and hypothesis tests. How? Well, in the derivation of our confidence interval, we considered the indicator

I_{-}(X_i) = \begin{cases} 1 &\text{if } X_i < \eta \\ 0 &\text{if } X_i \geq \eta \end{cases}

So, if H_0: \eta = 0, then replacing X_i with D_i = Y_i - X_i, we have that

P_{0}\left(\eta \in (D_{(j)},D_{(n-j+1)})\right) = 1 - 2P(Y < j)

where Y \sim Bin(n,p=1/2). Thus, if \alpha \in (0,1) is set and j(\alpha) is such that

P(Y < j(\alpha) ) \leq \alpha/2 \text{ and } P(Y < j(\alpha)+1) > \alpha/2


P_{0}\left(\eta \in (D_{(j(\alpha))},D_{(n-j(\alpha)+1)})\right) = (1-\alpha^{*}) \geq 1-\alpha


P_{0}\left(\eta \not\in (D_{(j(\alpha))},D_{(n-j(\alpha)+1)})\right) = \alpha^{*} \leq \alpha

Which means that the decision-rule,

\text{reject \textit{iff} } 0 \not\in (D_{(j(\alpha))},D_{(n-j(\alpha)+1)})

will incorrectly reject the null with probability \alpha^{*} \leq \alpha.

However, in this whole rigamarole we lost track of what Y really counts:

Y = \sum_{i=1}^{n} I_{-}(D_i)

That is, Y is just the number of D_i which are negative. In R,

 Y.obs <- sum(which(D < 0))

And since, under the null hypothesis we know the distribution of Y, getting its p-value (we’re assuming H_{1} : \eta \neq 0) is pretty straight-forward

p.value = 2 * if (Y.obs < (length(D)+1)/2) {
          pbinom(q=Y.obs, size=length(D), prob=0.5)
          } else { # Y.obs is in left tail
          pbinom(q=Y.obs-1, size=length(D), prob=0.5, lower.tail=F)

There are a few pieces of good news, here:

  1. This entire derivation was brutally simple to get through. All the “heavy lifting” was done in a previous post, and I’m fairly hesitant to even call it that.
  2. This test actually goes by an official name: the sign test. This is great because people probably won’t know what you’re talking about when you mention “that test, on pollackphoto.net, that looks at whether the median of differences is zero.”

So go forth, use this knowledge and become more friendly with the medians of your data. I’m sure not many people will appreciate that you’ve added a robust statistic and nonparametric method to your daily repertoire. But the few who do, will probably be impressed.

Bootstrap basics (more confidence intervals!)

Today, I realized that I’ve been doing something wrong for months, and never thought about it. Today, I think I’ve finally learned how to correctly extract a confidence interval through the bootstrap.

To understand what I was doing wrong, we need to first understand what is the correct way to do things.

Say you’re trying to estimate a parameter \theta, with an estimator, \hat{\theta}. If you knew the distribution of \hat{\theta} - \theta then you could find the quantiles q_{-}, q^{+} such that

F_{\hat{\theta}-\theta}(q_{-}) = P( \hat{\theta} - \theta \leq q_{-}) = 0.025 \text{ and } F_{\hat{\theta} - \theta}(q^{+}) = 0.975

which would imply that

P(q_{-} \leq \hat{\theta} - \theta \leq q^{+}) = 0.95

Or, through some basic algebra,

P( \hat{\theta} - q_{-} \geq \theta \geq \hat{\theta} + q^{+} ) = 0.95.

That is, we’d be able to obtain a 95% confindence interval for \theta that was centered about our estimator.

However, we’re seldom fortunate enough to know the parameter we’re estimating (hence why we’re endeavoring to estimate it in the first place). Worse, the distribution of our estimate, F_{\hat{\theta}}, might, also, be unknown to us (or incredibly difficult to figure out). So what do we do? We use the bootstrap to simulate \hat{\theta}‘s distribution, and we use \hat{\theta} as our best guess for our original parameter, \theta. Why? Well, we’d like to think that if we have enough data then we might be able to say that \hat{\theta} \approx \theta, and if we take enough bootstrap replicates, then we can write

\displaystyle \hat{F}_{\hat{\theta}^{*} - \hat{\theta}}(x) = \dfrac{1}{B}\sum_{i=1}^{B} I( \hat{\theta}^{*}_{j} - \hat{\theta} \leq x ) \approx F_{\hat{\theta} - \theta}(x)

At this point, we’re banking that we’ve let B be large enough that our sample quantiles,

\displaystyle \hat{F}_{\hat{\theta}^{*} - \hat{\theta}}(\hat{q}_{-}) = 0.025 \text{ and } \hat{F}_{\hat{\theta}^{*} - \hat{\theta}}(\hat{q}^{+}) = 0.975

are accurate enough. That is, we want to write

\displaystyle \left(\hat{\theta} - \hat{q}^{+}, \hat{\theta} - \hat{q}_{-}\right) \approx (\hat{\theta} - q^{+}, \hat{\theta} - q_{-})

Here’s where I’ve been screwing up, though. I’ve been collecting my bootstrap’d distribution just fine, however, I haven’t been taking my quantiles from the centered distribution. Instead, I’ve been setting my approximate quantiles through

\hat{F}_{\hat{\theta}^*}(\hat{q}_{-}) = 0.025 \text{ and } hat{F}_{\hat{\theta}^*}(\hat{q}^{+}) = 0.975

and then to make things worse, reporting

\displaystyle \left(\hat{q}_{-}, \hat{q}^{+}\right)

as my approximate 95% confidence interval. In R, my folly looked like:

bootstrap.CI <- quantiles(x=bootstrap.sampling.dist, probs=c(0.025,0.975))

Turns out, my mistake is not irreparable, however. If you recognize that for any constant, c, F_{X}(q) = F_{X+c}(q+c), then you can still do your analysis on the uncentered bootstrap results. Why? Because, at the end of the day, we want quantiles such that

0.025 = F_{\hat{\theta}-\theta}(q_{-}) = F_{\hat{\theta}}(q_{-}+\theta) = F_{\hat{\theta}}(r_{-})

So, if we can approximate r_{-}, then we just need to subtract \hat{\theta} from it to get an approximation to q_{-}. That is, we’re assuming that

q_{-} \approx \hat{q}_{-} = \hat{r}_{-} - \hat{\theta}

But don’t forget, the upper bound of our confidence interval has the form \hat{\theta} - q_{-}. Hence,

\hat{\theta} - q_{-} \approx \hat{\theta} - \left(\hat{r}_{-} - \hat{\theta}\right) = 2\hat{\theta} - \hat{r}_{-}

and similarly, the lower bound is

\hat{\theta} - q^{+} \approx 2 \hat{\theta} - \hat{r}^{+}

This means that calculating the confidence interval in R can be done elegantly:

bootstrap.CI <- 2*theta.hat - quantiles(x=bootstrap.sampling.dist, probs=c(0.975,0.025))

I’m sure this is old news to most users, but I figure I’d write it here for my own sake later down the road when I suffer from more serious brain flatulence.

The Median: the “good twin”

The basic stats you learn in grade 4 are the Mean, Median, Mode and Range. You remember the Mode since, “the Mode comes up the most” (it’s the most frequent number in a set). The Range is just the “width of the set” (largest value minus smallest). The Mean is the average of the values (we take Means all the time). But the Median is trickier to deal with. You could say the Median is “the middle value” of a set, but (for most people) that muddies the distinction between the Median and the Mean; we’re pretty used to thinking of the Mean as (at once) the average and the middle (because why shouldn’t the average be in the middle).  This conflation makes people write off the Median as anything other than a synonym for Mean.

To see why they are truly different, consider the following “data set”: {0,1,2,3,4,20,100} (values conveniently come in increasing order). The average of these values is 120/7 (or about 17.14), however the median is the middle value, 3, which is certainly no where near 17.14. While it may be good enough to stop here, let’s see what happens when we add “extreme” values to our data set. For instance, what if our data set was now {-100,0,1,2,3,4,20,120,1000}? The median still hasn’t changed. The mean, however, is now around 116.67 (adding two values increased our mean nearly seven-fold). We describe what’s happening here by saying

The Median is robust to outliers; whereas the Mean is not.

So now that we’ve gotten that out the way, the title of this post might make more sense. To most people, the Mean and the Median are these easily confused twins. Well, if you insist on looking at it this way, let’s beg to the most tired of twin stereotypes and label one the “good twin” and another the “evil twin”. As far as I’m concerned, when I’m interested in figuring out the middle point of my data, I don’t want my analysis to be distorted by the presence of outliers (or “extreme values”). Hence, I’ll take the “good twin”, Mr. Median. However, there are more reasons to consider the median as a statistic to report. For one, under pretty weak conditions you can form exact confidence intervals for it.

(Warning, this is where this blog post takes a turn for the technical. There’s nothing wrong with leaving now if you’ve had your fill.) So, assume you have a collection of data points (x_1, x_2, \ldots, x_n) and you’ve decided to assume each point is an independent realization of a random variable X \sim F, where F is some continuous distribution. A real life example of this kind of data could be a day’s caloric consumption from n strangers who lived in the same city and admitted to having similar dietary restrictions. Under this model, we can consider our data a realization of the random vector (X_1, X_2, \ldots, X_n), wherein we may want to investigate properties of the interval (X_{(j)}, X_{(k)}) bounded by the and k-th order statistics. In particular, lets check out the chance that this interval contains the median, \eta.

$latex P(X_{(j)} \leq \eta \leq X_{(k)}) = 1 – P( \eta \not \in (X_{(j)}, X_{(k)}) ) = 1 – P(\eta < X_{(j)} \text{ or } \eta > X_{(k)})$.

Now, because $latex \eta < X_{(j)}$ and $latex \eta > X_{(k)}$ are mutually exclusive events (we’re implicitly assuming that $latex j < k$), we can consider the probability of each event separately. For the first situation, consider the following (shitty, ascii) picture

<---|--------|--- o o o---|------|---- o o o ---|-----|---o o o ---|------>
   x_(1)   x_(2)       x_(j-1) x_(j)         x_(k) x_(k+1)       x_(n)

With this picture in mind, you can translate the statement $latex \eta < X_{(j)}$ into $latex \eta \in (-\infty, X_{(1)}) \text{ or } \eta \in (X_{(1)}, X_{(2)}) \text{ or } \cdots \text{ or } \eta \in (X_{(j-1)}, X_{(j)})$. (Note that I use the convention that for collection of n random variables, X_{(0)} = -\infty and X_{(n+1)} = +\infty.)

So, if \eta \in (X_{(i-1)}, X_{(i)}) then, certainly i-1 observations are less than \eta. Hence,

$latex \displaystyle P(\eta < X_{(j)}) = \sum_{i=1}^{j} P(\eta \in (X_{(i-1)}, X_{(i)}) = \sum_{i=0}^{j-1} P( i \text{ observations } < \eta) $ A nearly identical argument (again, use the picture) has that $latex \displaystyle P(\eta > X_{(k)}) = \sum_{i=k}^{n} P(\eta \in (X_{(i)}, X_{(i+1)}) = \sum_{i=k}^{n} P( i \text{ observations } < \eta) $ So, we just need figure out what $latex P(i \text{ observations } < \eta)$ is. We can start by introducing the indicator $latex I_{-}(X_i)$ where $latex \displaystyle I_{-}(X_i) = \begin{cases} 1 &\text{if } X_i < \eta \\ 0 &\text{if } X_i \geq \eta \end{cases}$ Next, we consider $latex E[I_{-}(X_i)] = P( X_i < \eta )$, and recall that the general definition of the median is such that $latex P(X_i > \eta ) = P(X_i < \eta) = 0.5$ for any $latex X_i$. Hence, $latex E[I_{-}(X_i)] = 1/2$ which, along with the independence of the random variables, implies that $latex Y_{-} = \sum_{k=1}^{n} I_{-}(X_k) \sim Bin(n,p=1/2)$ Thus, $latex \displaystyle P( i \text{ observations } < \eta) = P\left( \sum_{k=1}^{n} I_{-}(X_k) = i \right) = P( Y_{-} = i ) $ From here, we're pretty much done, we just need to put the pieces together: $latex \displaystyle P(\eta < X_{(j)}) = \sum_{i=0}^{j-1} P( i \text{ observations } < \eta) = \sum_{i=0}^{j-1} P(Y_- = i) = P(Y_{i} \leq j-1) $ and similarly, $latex \displaystyle P(\eta > X_{(k)}) = \sum_{i=k}^{n} P( i \text{ observations } < \eta) = \sum_{i=k}^{n} P(Y_{-} = i) = \\ = P(k \leq Y_{-} \leq n) $ So adding these guys up $latex 1 - P(\eta < X_{(j)} \text{ or } \eta > X_{(k)}) = 1 – \left( P(\eta < X_{(j)}) + P(\eta > X_{(k)}) \right) = \\ = 1-P( k \leq Y_{-} \leq n) – P ( Y_{-} \leq j-1 ) =
P(Y_{-} < k ) - P(Y_{-} < j) \\ = P(j \leq Y_{-} \leq k-1) $ which demonstrates that $latex \displaystyle P\left(\eta \in (X_{(j)},X_{(k)})\right) = P(j \leq Y_{-} \leq k-1)$ which can be calculated in R, quite simply, via

pbinom(q=k-1,size=n,prob=0.5) – pbinom(q=j,size=n,prob=0.5)

It should be noted that if we chose k = n-j+1, then

$latex \displaystyle P\left( \eta \in (X_{(j)}, X_{(n-j+1)}) \right) = P( j \leq Y_{-} \leq n-j ) = 1 – 2P( Y_{-} < j)$ (by the symmetry of the binomial distribution). This yields the simple coverage probability formula $latex \displaystyle P(\eta \in (X_{(j)}, X_{(n-j+1)})) = 1 - 2\sum_{i=0}^{j-1}\binom{n}{i} \left(\dfrac{1}{2}\right)^{n} $ Again, easily calculated in R:

 1 – 2*pbinom(q=j-1, size=n, prob=0.5) 

So there we have it. An exact, nonparametric procedure for creating a confidence interval about the median that only took the (weak) assumptions of a continuous distribution function, and independent observations. Pretty neat!

Testing for Conditional Independence (Part 1)

So, if this ends up according to plan, this will end up being a multi-part blog post.

Background: I’m currently TA’ing an introductory statistics class at Berkeley (STAT135: Concepts in Statistics), and we just went over the various Chi-Squared, \chi^2 , tests. I.e. \chi^2-test for goodness-of-fit, homogeneity, and independence. However, these are all for two-dimensional contingency tables. This may all be needless jargon to you, but just look at the following table:

 Gender  Yes  No
 Male  10  30
 Female  24  16

These are (fake) results from an office survey of 40 men and 40 women which asked if the employees were dieting. Anyway, this is formally known as a two-way contingency table. (In fact, it’s a 2×2 contingency table.)

That aside, we may want to ask a few simple questions about this table. For example, is there a relationship between a person’s gender and their response? This is where a test for independence would come in handy, and where the Pearson’s Chi-Squared test for independence might be used — link provided for thoroughness, you don’t need to bother with it, though — (watch out for small cell counts, as they may throw your inference off). Using this test, we get a statistic who’s p-value is 0.1543814% (i.e. 0.001543814 for my percent-hating friends). Meaning,

IF gender and response were independent of each other, the chance we’d see a statistic with as much, if not more, extreme of than the one we calculated is slightly less than 16-in-10,000.

Effectively, you can conclude there’s some relationship between gender and response and not lose any sleep at night. So, that’s a pretty light amount of background.

Now, consider what would happen if you were able to collect each employee’s age (at the time of the survey). But, to keep things simple, you just noted which employees were over 30 and which were not. You’d now have 2 sets of 2×2 contingency tables. They might look like

Employees under 30
 Gender  Yes  No
 Male  4  15
 Female  12  8


Employees over 30
 Gender  Yes  No
 Male  6  15
 Female  12  8

These two tables are one representation of a three-way contingency table. Explicitly, you’re looking at 2x2x2 contingency table. The important thing to note about these tables is that we can slice-and-dice them however we’d like. For instance, want to look at what happens when we “slice” according to gender? No problem:

Male Employees
Age Yes No
Under 30 4 15
Over 30 6 15


Female Employees
Age Yes No
Under 30 12 8
Over 30 12 8

From this view, we may be poised to ask questions like: is there a relationship between age and whether a respondent is dieting, when the respondent is a guy? Likewise for women. From a quick view of the latter table, you’d probably be inclined to think that age isn’t really a factor in a woman’s choice to diet (since the proportion of dieters to non-dieters doesn’t change across age groups). However, for the men you have more men above thirty dieting, than below. This could be cause to ask whether the difference is statistically significant.

There are a few ways to test this (recognizing the small cell counts, one could be lead to use Fisher’s Exact test) however I’ll keep consistent and  (despite knowing better) use Pearson’s Chi-Squared test for independence on the first of the latter two tables. This yields a p-value of 58.34%, which is high enough for most people to say that they don’t have a reason to suspect an association between a male employee’s age and his being on a diet.

As an aside, it should be known that if we perform Fisher’s Exact test with test statistic being the number of male’s over 30 who responded “Yes” to dieting, our statistic has distribution:

x P(X =x)
0 0.0001089799
1 0.0022885789
2 0.0187247365
3 0.0790599984
4 0.1915684577
5 0.2791426098
6 0.2481267643
7 0.1329250523
8 0.0410503838
9 0.0065883332
10 0.0004161053

Hence, our (one-sided) p-value is P(X \geq 6) = 42.9\%. So, even using Fisher’s exact test, we don’t have enough evidence to reject the idea that the age and diet of a male employee are independent.

This does not mean we accept that they’re dependent (or associated). It just means that we shouldn’t be so quick to reject the idea that they are independent.

So, this probably brings you to the question: “why are you telling me all of this?”. Because we’ve just taken a rather long warm-up jog through Conditional Independence county, and I wanted to make sure you could appreciate a problem I just finished working on.

The question of investigating conditional independence in a 2x2x2 contingency table came up during one of my office-hours. It was clear to me that we can do this by performing 2, individual, tests for independence (one for each separate 2×2 table), but I figured there must be one, all-encompassing test. Turns out my hunch was correct:

For an IxJxK (three-way) contingency table, testing for conditional independence between the rows and columns can be done by calculating Pearson’s Chi-squared statistic (as if you were testing for regular independence) for each IxJ table, and then summing all K statistics. (This statistic ends up having (I-1) \times (J-1) \times K degrees of freedom.)

The technical aspects (proof of the degrees of freedom, as well as a simulation study) of the above statement will be the focus of the second part of this post. For now, the thing to take away from this post is that

  1. Contingency tables are cool,
  2. We shouldn’t be afraid to ask questions about (the nature of) their content, and
  3. We have tools to go about investigating (a certain amount of) said questions.