Friday, October 24, 2014

Let's learn R: Confidence intervals, and the "So what?" question for history and literary studies

To catch up with the last two posts (one, two): In 2007, Goran Proot and Leo Egghe published an article in Papers of the Bibliographical Society of America (102: 149-74) in which they suggested a method for estimating the number of missing editions of printed works based on surviving copies. In 2008, Quentin Burrell (in Journal of Informetrics 2:101-5) showed that their approach was a special case of the unseen species problem, which has been widely studied in statistics, and suggested a more robust approach based on the statistical literature. The problem is that applying Proot and Egghe's method is within the abilities of most book historians, while applying Burrell's is not. The last few posts have been dedicated to implementing Burrell's approach in R and bringing it within reach of the non-statistician.

The specific estimate that Burrell's method offers is not substantially different from Proot and Egghe's. Burrell's approach does offer two important advantages, however. The first is a confidence interval: within what range are we 95% confident that the actual number of lost or total editions lies? Burrell offers the following formula (his equation 4), where n is the number of editions (804 in Proot and Egghe's data) and i(λ) is the Fisher information defined below:

λ ± 1.96 / √(n · i(λ))

This formula in turn relies on the Fisher information i(λ) of a truncated Poisson distribution, which Burrell derives for us (his equation 5):

i(λ) = (1 − (1 + λ)e^(−λ)) / (λ · (1 − e^(−λ))²)

This again looks fearsome, but all we have to do is plug the right numbers into R, with e = Euler's number and λ = .2461 (as we found last time).

Let's define i according to Burrell's equation (5):
> i=(1-(1+.2461)*exp(1)^-.2461)/(.2461*(1-exp(1)^-.2461)^2)
> i
[1] 2.198025


So according to Burrell's equation (4), the confidence interval is .2461 plus or minus the result of this:
> 1.96/(sqrt(804*i))
[1] 0.04662423


Or in other words,
> .2461-1.96/(sqrt(804*i))
[1] 0.1994758
> .2461+1.96/(sqrt(804*i))
[1] 0.2927242


In other words, we are 95% confident that the actual fraction of missing editions lies somewhere between .7462 and .8192; we find these values in the same way as in the last post, and they also let us estimate a range for the total number of editions:

> exp(-.1994758)
[1] 0.81916
> exp(-.2927242)
[1] 0.7462279


> 804/(1-exp(-.1994758))
[1] 4445.92
> 804/(1-exp(-.2927242))
[1] 3168.197


The second advantage of using Burrell's method is of fundamental importance: It forces us to think about when we can apply it, and when we can't. We can observe both numerically and graphically that Proot and Egghe's data are a very good fit for a truncated Poisson distribution, and therefore plug numbers into Burrell's equations with a clean conscience. (NB: I have stated in print that a truncated Poisson distribution is a very poor fit for modelling incunable survival, and I think it will usually be a poor model for most situations involving book survival. What to do about that is a problem that still needs to be addressed.)

In addition, Burrell offers a statistical test of how well the truncated Poisson distribution fits the observed data. Burrell's Table 1 compares the observed and the expected number of editions surviving in a given number of copies, to which he applies a chi-square test for goodness of fit. Note, however, that a common rule of thumb for chi-square tests is that every category should have an expected count of at least five, and so Burrell combines the 3-, 4-, and 5-copy editions into a single category.

Can R do this? Of course it can. R was made to do this.

> observed <- c(714,82,8)
> expected <- c(709.11,87.27,7.62)

> chisq.test (observed, p=expected, rescale.p=TRUE)

        Chi-squared test for given probabilities

data:  observed
X-squared = 0.3709, df = 2, p-value = 0.8307


So we derive nearly the same chi-squared value as Burrell does, .371 compared to his .370. R, however, reports 2 degrees of freedom (df) and a p-value of .831, where Burrell gives 1 df and a p-value of .89. The difference in df is presumably because a further degree of freedom is conventionally subtracted for each parameter estimated from the data (here λ), something R's chisq.test has no way of knowing; even so, I can't reconcile the two p-values. To review: the chi-squared value can be anything from zero (for a perfect fit) on up (for increasingly poor fits); the degrees of freedom are the number of categories (here the 1-copy, 2-copy, and 3/4/5-copy editions) minus one, less one more for each estimated parameter; and the p-value ranges from very small (for a terrible fit) up to 1 (for a perfect fit). There is a great deal written about how to apply and interpret the results of chi-square and other statistical tests, but the values above support the visual impression, as Burrell notes, that Proot and Egghe's data are a very close fit to a truncated Poisson distribution.
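Where do Burrell's expected values come from? They can be reproduced, at least approximately, from the fitted truncated Poisson distribution itself, so that nothing needs to be copied out of his Table 1. The sketch below uses the λ we estimated earlier; the variable names are my own, not Burrell's, and the final barplot is just one way of making the same comparison graphically.

# Reproducing Burrell's expected counts from the fitted truncated Poisson
# (a sketch; variable names are mine)
lambda <- .2461        # estimated in the last post
n <- 804               # surviving editions in Proot and Egghe's data

# Probability of surviving in exactly 1 or 2 copies, given survival at all
p <- dpois(1:2, lambda) / (1 - exp(-lambda))
# Lump 3 or more copies into a single category, as Burrell does
p <- c(p, 1 - sum(p))

expected <- n * p      # roughly 709.1, 87.3, 7.6 -- close to Burrell's Table 1
observed <- c(714, 82, 8)

# The same comparison graphically
barplot(rbind(observed, expected), beside = TRUE,
        names.arg = c("1 copy", "2 copies", "3-5 copies"),
        legend.text = c("observed", "expected"))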

* * *

So why should a book historian or scholar of older literature care about any of this? Stated very briefly, the answer is:

For studying older literature, we often implicitly assume that what we can find in libraries today reflects what could have been found in the fifteenth or sixteenth centuries, and that manuscript catalogs and bibliographies of early printing are a reasonably accurate map of the past. But we need to have a clearer idea of what we don't know. We need to understand what kinds of books are most likely to disappear without a trace.

For book history, the question of incunable survival has been debated for over a century, with the consensus opinion holding until recently that only a small fraction of editions has been lost without any surviving trace. It now seems clear that the consensus view is not correct.

For over seventy years, the field of statistics has been working on ways to provide estimates of unobserved cases, the "unseen species problem." The information that the statistical methods require - the number of editions and copies - is information that can be found, with some effort, in bibliographic databases. The ISTC, GW, VD16, and others are coming closer to providing a complete census and a usable copy count for each edition.

Attempts to estimate lost editions from within the field of book history have taken place independently of the statistical literature. This has only recently begun to change. Quentin Burrell's 2008 article made an important contribution to moving the discussion forward, and he challenged those studying lost books to make use of the method he outlined.

Statistical arguments are difficult for scholars in the humanities to follow, however, and the statistical methods Burrell suggested are difficult for scholars in the humanities to implement. The algebraic formula offered by Proot and Egghe is much more accessible - but limited. We have a statistical problem on our hands, and making progress requires engaging with the statistical arguments.

We can do it. We have to understand the concepts, but humanists are good at grappling with abstractions. We can use R to handle the calculations. These three posts on R and Burrell provide all we need in order to turn copy counts into an estimate of lost editions, along with a confidence interval and a test of goodness of fit. Every part of the more robust approach suggested by Burrell is now implemented in R. We could even write a script to automate the whole process, as sketched below. This doesn't solve all our problems - there's still the problem that most kinds of book survival are not a good fit for a Poisson distribution - but it at least gets us caught up to 2008.
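As a proof of concept, here is one way such a script might look. It is only a sketch: the function name estimate_lost_editions and everything inside it are my own, it assumes the copy counts really do follow a truncated Poisson distribution, and it expects a vector giving the number of surviving copies for each known edition.

# A sketch of the whole pipeline: estimate lambda, its confidence interval,
# the fraction of lost editions, and the total number of editions.
# Assumes a truncated Poisson model and mean(copies) > 1.
estimate_lost_editions <- function(copies, conf = 0.95) {
  n <- length(copies)
  m <- mean(copies)
  # Maximum likelihood estimate of lambda for a zero-truncated Poisson:
  # solve lambda / (1 - exp(-lambda)) = mean(copies) numerically
  lambda <- uniroot(function(l) l / (1 - exp(-l)) - m,
                    interval = c(1e-8, max(copies) + 10))$root
  # Fisher information of the truncated Poisson (Burrell's equation 5)
  i <- (1 - (1 + lambda) * exp(-lambda)) / (lambda * (1 - exp(-lambda))^2)
  # Confidence interval for lambda (Burrell's equation 4)
  half <- qnorm(1 - (1 - conf) / 2) / sqrt(n * i)
  lo <- lambda - half
  hi <- lambda + half
  list(lambda            = lambda,
       lambda.ci         = c(lo, hi),
       lost.fraction     = exp(-lambda),
       lost.fraction.ci  = c(exp(-hi), exp(-lo)),
       total.editions    = n / (1 - exp(-lambda)),
       total.editions.ci = c(n / (1 - exp(-hi)), n / (1 - exp(-lo))))
}

Fed the copy counts behind Proot and Egghe's data, it should return a λ close to the .2461 we have been using and the same ranges we computed by hand above; whether the truncated Poisson model is appropriate in the first place still has to be checked separately, for instance with the goodness-of-fit test shown earlier.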
