Hwæt’s Up With Statistics

Not too long ago a paper on Old English linguistics swept the internet, garnering write-ups in major British newspapers and popping up on numerous websites. “We’ve been getting the first word of Beowulf wrong!” blasted the headlines, which of course led many non-Anglo-Saxonists to ask, “Have we been getting the first word wrong?” and many Anglo-Saxonists to answer, “We’ve never really known what the first word is…so maybe?”

The paper, for those who don’t recall, was George Walkden’s “The status of hwæt in Old English,” published in English Language and Linguistics and available on Walkden’s website. There was a fair amount of skepticism as well, since newspapers are not exactly the most reliable sources for academic claims. For those who dug into the story a little bit, it began to seem like it was, perhaps, not such a big deal. Instead of being an interjection, hwæt was now being considered an exclamative. What exactly an exclamative is and how it differs from an interjection is, I think it’s fair to say, not immediately clear to most people.

I think Walkden’s article is of great importance. It’s the kind of work that all future scholarship on the subject will have to take account of, and (presuming that I am able to get a job that will allow me to teach Beowulf) it will change the way that I teach the poem’s famous opening lines. Most scholars could only dream of publishing something that influential. What’s also remarkable is that Walkden draws on very current research in linguistics to make his argument, most of which I find compelling. Time has only improved my opinion of the overall work and its importance.

The basic claim is that we’ve been misreading Hwæt, translating it as a stand-alone interjection, when it should be more properly understood as a part of the following clause. This exclamative sense would be rendered in English as “How much we have heard of the might of the nation-kings in the ancient times of the Spear-Danes.” To take possibly the other most famous Anglo-Saxon hwæt, in “The Dream of the Rood” it becomes in Walkden’s reading, “How I want to tell you of the best of dreams.” Key to this understanding of hwæt is the idea of gradability, that there needs to be an element in the following clause that can be understood in terms of degree. We have heard so much about those kings; I so want to tell you about my dreams. The genius of this approach is that it allows hwæt to be quite flexible in its uses while also being consistent. I think this is a really great reading. How I want someone to translate Beowulf with this in mind!

That said, I don’t think that Walkden’s argument is completely convincing. For one thing, I don’t think the objections to the status quo raised by Eric Stanley are as significant as Walkden does. But the main thing I take issue with is Walkden’s statistical analysis, which I think has some pretty major flaws in it.

Let me pause for just a second to mention my familiarity with statistics. If you read my dissertation or hear me give a paper at a conference, you may be left with the impression that I have very little to do with math. It’s all about questions like, “How is a saint’s life like a swamp?” “How is a manuscript like Christ?” and other things in a similar vein. However, before I went all touchy-feely (quite literally), I was primarily a math person. I started college studying Mathematics and Chemistry, and as a first/second year student did graduate coursework in probability and statistics. I have forgotten a lot since then, so I also talked over Walkden’s statistics with my dad, who is a professor of Animal Science who specializes in population genetics. Basically, he’s Gregor Mendel with beef cattle.

Walkden’s statistical analysis uses Fisher’s exact test to examine word order in clauses preceded by hwæt (or huat in the case of Old Saxon). Before I go on I want to explain what this test is, since my default assumption is that most people who study Old English do not take classes in statistics, and understanding what a statistical test does is important to interpreting its results. This gets a bit long, so the explanation can be skipped without too much harm done, as I’m going to use a made-up example to illustrate how the test works. I’ll put a big “End of Fisher’s exact test explanation” marker down below so that the non-mathematically inclined reader will know where to skip to.

BEGIN FISHER’S EXACT TEST EXPLANATION

Fisher’s exact test is a tool for evaluating contingency tables. As a very silly example, let’s imagine that twenty people watch the Norwegian zombie movie Dead Snow. Five out of twenty like it and the other fifteen dislike it. Furthermore, eight out of the twenty have curly hair and the other twelve have straight hair. We could construct a table for this data that looks like this:

Table 0: Have seen Dead Snow

                  have curly hair   have straight hair   Total
liked it          a                 b                     5
didn't like it    c                 d                     15
Total             8                 12                    20

The totals set up constraints, but there are a whole bunch of possible ways the table could be filled out with the group of twenty people, and for the moment I’ve left the cells blank. If you pick a certain value for a, it determines the other three values, because everything has to add up to the totals. For any given set of cell values, there are multiple combinations of people that could produce it. It’s like rolling dice: you can roll a seven with any of the following combinations: 1+6, 2+5, 3+4, 4+3, 5+2, and 6+1. By contrast, there is only one way to roll a 2: 1+1. The more ways there are of getting to a total, the more likely that total is to be rolled, as anyone who has played Settlers of Catan knows.
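To make the point about constraints concrete, here is a minimal Python sketch (the variable names are mine, not from any source) that fills in the rest of the table once you pick a value for a:

```python
# Marginal totals from the Dead Snow example: 5 people liked it, 15 didn't;
# 8 have curly hair, 12 have straight hair (20 people in all).
liked_total, curly_total, straight_total = 5, 8, 12

def fill_table(a):
    """Given cell a (curly-haired people who liked it), the totals fix b, c, and d."""
    b = liked_total - a       # straight-haired people who liked it
    c = curly_total - a       # curly-haired people who didn't like it
    d = straight_total - b    # straight-haired people who didn't like it
    return a, b, c, d

print(fill_table(2))  # (2, 3, 6, 9)  -- Table 1 below
print(fill_table(5))  # (5, 0, 3, 12) -- Table 2 below
```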

Contingency tables are a bit more complicated than dice; however, the basic idea is pretty easy to grasp. Cell values that are close to the proportions of the totals are more likely than values that aren’t. In the following table, the values are exactly proportional.

Table 1: Have seen Dead Snow

                  have curly hair   have straight hair   Total
liked it          2                 3                     5
didn't like it    6                 9                     15
Total             8                 12                    20

One fourth of the people like the movie (5/20), and this matches up with one fourth of the curly-haired people liking it (2/8) and one fourth of the straight-haired people liking it (3/12).

Now, contrast that with this table:

Table 2: Have seen Dead Snow

                  have curly hair   have straight hair   Total
liked it          5                 0                     5
didn't like it    3                 12                    15
Total             8                 12                    20

Here no straight-haired people like the movie, a result that would seem kind of strange if we went along with the null hypothesis that hair type and love of Norwegian zombie movies have nothing to do with each other. It makes sense, then, that the probability of this result would be pretty low.

Of course, when doing statistics, gut feelings (especially gut feelings about made-up examples) don’t count. An online Fisher’s exact test calculator gives a fairly simple tool for evaluating these kinds of problems, and I would highly recommend trying one out for yourself if you want to understand where these numbers are coming from.

To do the calculations for Table 1, input a=2, b=3, c=6, and d=9.

The calculator returns several values as well as a graph. The first value is the hypergeometric probability, 0.3973. This is the probability of producing this particular table with the given totals. The column graph on the right shows the probability of different values of a. The column for a=2 is the highest one, indicating that this is the result with the highest probability.
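If you want to check that first number yourself rather than trusting the calculator, here is a minimal Python sketch using scipy’s hypergeometric distribution (the variable names and the check are mine); it reproduces the 0.3973 for Table 1:

```python
from scipy.stats import hypergeom

# Table 1: a = 2 of the 8 curly-haired people liked the movie,
# out of 20 people total, 5 of whom liked it.
M = 20        # total number of people
n_liked = 5   # people who liked the movie
N_curly = 8   # people with curly hair
a = 2         # curly-haired people who liked it

prob = hypergeom.pmf(a, M, n_liked, N_curly)
print(round(prob, 4))  # should print 0.3973
```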

This tells us how rare something is, which is what we’re interested in with probability, but it doesn’t tell us how weird something is, which is what we’re interested in with statistics (these are not technical terms). If I had a die with one million sides, getting a 227 on my first roll would be very rare, but it wouldn’t be especially weird, because every number has an equal probability of being rolled.

Instead, we’re interested in the second value, to the left of the graph: the two-sided (or two-tailed) p-value. This tells us how likely it is that we would get a result as weird as, or weirder than, the result for the selected value of a. The two tails refer to the tapering on either side of the distribution. To calculate the p-value, you take the probability of the observed a and then add to it the probability of every other outcome that is less probable. Since a=2 has the highest probability, you have to add every other probability, which gives a value of 1.000. Statistical significance is conventionally set at p<0.05, so we fail to reject the null hypothesis, which is that the values in the contingency table are the result of chance.

You can also see that a=5, which would be Table 2, has the lowest hypergeometric probability at 0.00361 (hover over the columns in the graph to get the exact values, although it’s a little bit tricky for this column because of how small it is). Since no other probability is smaller, the two-sided p-value will just be equal to the hypergeometric probability, which you can verify by inputting the values from Table 2 into the calculator.
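To see where these two-sided p-values come from, you can enumerate every possible value of a and add up all the probabilities that are no larger than the observed one, which is the same sum-of-small-probabilities method the calculator uses. A minimal Python sketch (my own naming):

```python
from scipy.stats import hypergeom

M, n_liked, N_curly = 20, 5, 8  # totals shared by all the Dead Snow tables

def two_sided_p(a_observed):
    """Add up the probabilities of every table as probable as, or less probable than, the observed one."""
    p_obs = hypergeom.pmf(a_observed, M, n_liked, N_curly)
    p_all = [hypergeom.pmf(a, M, n_liked, N_curly) for a in range(0, 6)]  # a can only be 0 through 5
    return sum(p for p in p_all if p <= p_obs + 1e-12)  # small tolerance for floating-point comparison

print(round(two_sided_p(2), 4))  # 1.0     (Table 1: every outcome counts)
print(round(two_sided_p(5), 5))  # 0.00361 (Table 2: nothing is less probable)
```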

My two example tables represent the most and the least probable results, because I artificially constructed them to show these extremes, but most of the time you are dealing with something in between. Let’s say that I used a=3, which would give the following table:

Table 3: Have seen Dead Snow

                  have curly hair   have straight hair   Total
liked it          3                 2                     5
didn't like it    5                 10                    15
Total             8                 12                    20

Inputting these values into the web calculator gives a hypergeometric probability of 0.2384 and a two-sided p-value of 0.3473. The graph of hypergeometric probabilities shows that a=1 and a=2 are both more likely than a=3, which means that the two-sided p-value is found by adding up all the remaining probabilities. The 0.3473 comes, then, from adding the probabilities of a=0, 3, 4, and 5. The two-sidedness of this value comes from the fact that you have to take account of probabilities at both extremes of the graph. We would still fail to reject the null hypothesis, because random chance would produce a contingency table at least as weird as this one 34.73% of the time.
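scipy also has a built-in Fisher’s exact test whose two-sided p-value is computed by the same sum-of-small-probabilities method, so it can stand in for the web calculator. A minimal sketch for Table 3 (my own check, using the counts above):

```python
from scipy.stats import fisher_exact

# Table 3: rows are liked it / didn't like it, columns are curly / straight hair.
table3 = [[3, 2],
          [5, 10]]

odds_ratio, p_value = fisher_exact(table3, alternative='two-sided')
print(round(p_value, 4))  # should print 0.3473
```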

END OF FISHER’S EXACT TEST EXPLANATION

When Walkden calculates statistical significance, he uses tables that compare verb placement in hwæt-clauses to verb placement in main clauses and subordinate clauses, running a separate Fisher’s exact test for each comparison. The null hypothesis is that the ratio of verb placements is the same in both types of clause, and the null hypothesis is rejected for p<0.05.

One misconception that is easy for non-statisticians to fall into is that failing to reject the null hypothesis is equivalent to accepting it, but this is not the case. In my dad’s words,

 A lack of significance does not mean that the null hypothesis is accepted; it just means that you fail to reject the null hypothesis.  The null hypothesis in this case is that the ratios are similar for both situations. It is a subtle point, but an important one. The conclusion is not “no difference,” the conclusion is “not enough evidence to suggest that there is a difference.”  In that context, describing the reason that there is no difference can be somewhat dicey.

On the other side of this, rejecting the null hypothesis does not mean accepting whatever alternative hypothesis is offered. It is very important to consider multiple possible explanations for results, and jumping to a particular cause can be very misleading. For example, someone might conclude that ice cream consumption causes shark attacks at beaches, when in fact both are simply more likely in hot weather.

Walkden conducts analyses on three texts: the Old Saxon Heliand, the Old English Bede, and Ælfric’s Lives of Saints.

1)  He begins by comparing verb placement in main clauses and huat clauses in the Heliand, and Fisher’s exact test shows the differences to be statistically significant. This leads Walkden to assert:

For anyone who takes huat to be clause-external, this result must surely be a mystery: if huat influences the constituent order of the clause that follows it, it must be a part of that clause, and hence not an ‘interjection’ (472).

However, this demonstrates a real lack of imagination when it comes to developing alternative explanations for statistically significant differences. Given that verbs appear in multiple positions in Old Saxon, it seems reasonable to think that stylistic considerations, whether unconsciously expressed or consciously chosen, could influence verb placement, and the rarity of huat clauses certainly makes them stylistically significant. It is not at all clear to me that anything that influences word order within a clause must be a part of that clause. It seems like a kind of arbitrary rule, and I could with just as much justification assert a rule that clauses following huat used as an interjection will tend to have verbs in later positions.

2)  Continuing on, Walkden compares huat clauses in the Heliand to subordinate clauses, and Fisher’s exact test reveals that the difference is not statistically significant, with p=0.2545. Walkden then says,

This suggests that we should hypothesize that these two types of clause pattern together; in other words, clauses introduced by huat have the word order of subordinate clauses.

As mentioned above, a failure to reject the null hypothesis is not sufficient grounds to accept it. I also feel like Walkden lapses into some very imprecise phrasing. All the test allows us to say is “clauses introduced by huat have a word order that is not different in a statistically significant way from the word order of subordinate clauses.” It does not actually say that they are the same.

3)  Walkden then moves on to a consideration of his two Old English sources. He begins his discussion by stating, “Similar results are found for Old English,” and echoes this sentiment by introducing the contingency tables with, “The results of contingency tests based on these data are clear.” The problem is that the results are only similar for one of the two Old English texts.

The results from Bede are consistent with the results from the Heliand. The null hypothesis is rejected for main clauses and fails to be rejected for subordinate clauses.

The issue is with the results from Lives of Saints, in which hwæt-clauses differ significantly from both main and subordinate clauses rather than just main clauses. This means that Walkden has to reject the null hypothesis with regard to subordinate clauses as well, which in turn means that hwæt-clauses and subordinate clauses do not pattern together. Walkden acknowledges this, briefly, but as soon as the paragraph is over he seems to erase all memory of it. He begins the next paragraph by asserting that “broadly the same results are obtained for Old English and Old Saxon” and concludes the statistical analysis section with the following paragraph:

To recapitulate: in terms of constituent order, clauses introduced by hwæt in Old English and Old Saxon generally pattern statistically with subordinate clauses (including dependent questions and free relatives), rather than with root clauses as would be expected if hwæt were a free-standing interjection. The constituent order data presented in this section therefore give us strong reason to doubt that hwæt had such a syntactic role or status.

The problem here is that Walkden is using the adverbs broadly and generally to paper over inconsistencies in his statistics. Clauses introduced by hwæt don’t “generally pattern statistically with subordinate clauses”; you can’t generally statistically pattern. It’s not actually too far off from Anchorman: “Sixty percent of the time it works every time,” except here it’s “67% of the time it works every time.” Someone who throws down the gauntlet with rigorous statistical analyses and challenges critics to explain the results (as Walkden clearly did when he said that the results in the Heliand must be a mystery to interjectionists) should cast the same critical eye on his own argument when the statistics do not support it, rather than retreating into hand-wavy qualifiers like “generally” and “broadly.” He compounds the problem by later writing, “Rett’s claim that exclamatives pattern morphosyntactically with free relatives rather than questions fits perfectly with an account of Old English (and Old Saxon) hwæt-clauses as exclamatives, since, as I demonstrated in section 3, hwæt-clauses pattern with embedded clauses in terms of verb position.” As I hope is now clear, Walkden did not demonstrate this, and ignoring the results from Lives of Saints doesn’t make them go away.

The key thing that I find missing in Walkden’s article is a serious consideration of alternative explanations for his statistical results. I see three possibilities (although it is certainly possible there are more):

1)  Walkden is right that hwæt-clauses pattern with subordinate/embedded clauses, in which case the results from Lives of Saints are anomalous. This makes it vital to test more texts in order to determine whether or not Lives of Saints really is anomalous and, if so, to offer an explanation as to why it is, especially as Walkden introduces it as a good example of Old English prose.

2)  Hwæt-clauses have their own pattern, characterized by a preference for later verb position, that sometimes, but not always, appears similar to the verb placement in subordinate clauses. Again, it would be important to examine more texts. If Walkden had argued for this position rather than the first, I would find it much less problematic.

3)  Hwæt is an interjection which exerts a stylistic pressure pushing verbs later in the clauses that follow it. Walkden takes it as a given that anything that influences word order must be a part of the clause, but this is not at all clear to me, especially in a language like Old English where multiple verb placements are perfectly grammatical and may be influenced by stylistic considerations. Examining more texts would again be useful.

The key to distinguishing between explanations 2 and 3 is the idea of gradability that Walkden takes from Jessica Rett. If every instance of hwæt appears before a sentence containing a gradable element, it would be powerful evidence for hwæt as an exclamative rather than an interjection.

In any event, Walkden’s statistical results need much more testing. I’m especially curious about whether or not Ælfric’s other texts behave in a similar fashion. The main issue is that three texts amount to only three experimental units, which is not really enough for secure conclusions. As my dad says,

It would be akin to trying to compare varieties of apple trees when you have only two trees.  Measuring 40 apples from each tree does not eliminate the fact that you only have two trees.

The tests Walkden runs allow him to reject the null hypothesis for verb placement within each text, but this is not enough to generalize about verb placement in Old English, nor is it actually sufficient for drawing conclusions between texts, as Walkden does not run any tests to determine how verb placement compares across texts. In order to rectify this, I used Walkden’s data to compare root verb placement for the pairings Heliand/Bede, Heliand/Ælfric, and Bede/Ælfric, and also subordinate verb placement in the same pairings. The null hypothesis, then, is that any differences in verb placement are due to chance. In every single case, the null hypothesis is rejected: there are statistically significant differences in both main-clause verb placement and subordinate-clause verb placement for every pairing.

This is not especially surprising for the Heliand comparisons, since Old Saxon is a different language, but it does raise some questions about how well we currently understand the factors that influence verb placement in Old English. Walkden chose Bede and Ælfric because he felt that they were not overly dependent on Latin word order, but the differences in verb placement between the two texts must have other explanations, such as conscious style, dialectal differences in verb placement frequency, individual preference in different types of clauses, or something else entirely. I’ve included the contingency tables for each test below.

I like statistics and think they can be a powerful tool in historical linguistics, provided that scholars make a serious effort to understand them. I also really like Walkden’s argument in favor of the exclamative hwæt, although I am not entirely convinced of it, and I hope the criticisms brought up here don’t detract from the arguments made in the rest of the article. However, I think his small sample makes it very difficult to generalize to Old English as a whole, which is something he is clearly interested in doing, and the results he obtains don’t actually support his claim that hwæt-clauses pattern with subordinate clauses in Old English, although they do suggest a possibility I find more exciting: hwæt-clauses with their own unique word-order behavior.

Heliand/Bede Root V1/V2 V-later Total
Heliand Root 2078 270 2348
Bede Root 1898 819 2717
Total 3976 1089 5065

p=9.718×10^(-61)

Heliand/Ælfric Root V1/V2 V-later Total
Heliand Root 2078 270 2348
Ælfric Root 3204 969 4173
Total 5282 1239 6521

p=9.051×10^(-33)

Bede/Ælfric Root V1/V2 V-later Total
Bede Root 1898 819 2717
Ælfric Root 3204 969 4173
Total 5102 1788 6890

p=1.998×10^(-10)

Heliand/Bede Sub V1/V2 V-later Total
Heliand Sub 567 1629 2196
Bede Sub 1863 3067 4930
Total 2430 4696 7126

p=2.376×10^(-23)

Heliand/Ælfric Sub V1/V2 V-later Total
Heliand Sub 567 1629 2196
Ælfric Sub 3467 2168 5635
Total 4034 3797 7831

p=1.728×10^(-182)

Bede/Ælfric Sub V1/V2 V-later Total
Bede Sub 1863 3067 4930
Ælfric Sub 3467 2168 5635
Total 5330 5235 10565

p=1.946×10^(-132)
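For anyone who would like to re-run these cross-text comparisons, here is a minimal Python sketch that feeds the counts from the tables above into scipy’s Fisher’s exact test; I would expect the p-values it prints to match the ones reported above up to rounding:

```python
from scipy.stats import fisher_exact

# Each table is [[V1/V2, V-later], [V1/V2, V-later]] for the two texts being compared,
# with counts taken from the contingency tables above.
pairings = {
    "Heliand/Bede root":    [[2078, 270], [1898, 819]],
    "Heliand/Ælfric root":  [[2078, 270], [3204, 969]],
    "Bede/Ælfric root":     [[1898, 819], [3204, 969]],
    "Heliand/Bede sub":     [[567, 1629], [1863, 3067]],
    "Heliand/Ælfric sub":   [[567, 1629], [3467, 2168]],
    "Bede/Ælfric sub":      [[1863, 3067], [3467, 2168]],
}

for name, table in pairings.items():
    _, p = fisher_exact(table, alternative='two-sided')
    print(f"{name}: p = {p:.3e}")
```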


