χ² Goodness of Fit Test for the QCS Letters
Let Pi = proportion of words of length i in Twain's writings.
Brinegar selected seven letters (11,000 words) that are indisputably Mark Twain's and calculated Pi from them. The counting was done by the tally method, with the count limited to text words. He arbitrarily omitted headings, proper names, direct quotes, foreign words, abbreviations, and involved dialectical spellings. Hyphenated words were counted as separate words. These exclusions were made in an effort to eliminate any characteristics of the subject matter. The resulting proportions of words of length i are as follows:
P1 = 0.048, P2 = 0.181, P3 = 0.230, P4 = 0.193, P5 = 0.114,
P6 = 0.074, P7 = 0.060, P8 = 0.040, P9 = 0.029, P10+ (10 or more) = 0.031
As a check on consistency over time, two additional samples of Twain's work were counted and compared. Brinegar noted that the word-length frequency table maintained a high degree of consistency over a span of 40 years.
The ten QCS letters had a total of 13,175 words, and the number of words of length i is presented in the table below. The third row gives the expected number of words of length i if the QCS letters were written by Mark Twain:
For instance, the value 632.4 in the third row was obtained as 13,175 × 0.048. Particularly noticeable are the differences in the numbers of words of one, two, three, and four letters, as well as Twain's tendency to use relatively fewer words of seven or more letters. A χ² goodness of fit test was conducted to determine whether the same person wrote the two sets of writings, with the null hypothesis H0 that the QCS word lengths follow Twain's proportions Pi.
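The expected counts can be reproduced directly from the proportions Pi and the QCS word total. The following is a minimal Python sketch, not Brinegar's own procedure: the names `TWAIN_PROPORTIONS`, `expected_counts`, and `chi_square_statistic` are illustrative assumptions. The observed QCS counts are not reproduced in this text, so the statistic is defined but not evaluated on them here.

```python
# Word-length proportions from the known Twain letters (values from the text).
# Index 0 is length 1; the last entry pools words of 10 or more letters.
TWAIN_PROPORTIONS = [0.048, 0.181, 0.230, 0.193, 0.114,
                     0.074, 0.060, 0.040, 0.029, 0.031]
QCS_TOTAL_WORDS = 13175  # total words in the ten QCS letters


def expected_counts(proportions, total):
    """Expected count per word-length category under H0: total * P_i."""
    return [total * p for p in proportions]


def chi_square_statistic(observed, expected):
    """Pearson statistic: sum over categories of (O_i - E_i)^2 / E_i."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


expected = expected_counts(TWAIN_PROPORTIONS, QCS_TOTAL_WORDS)
for length, e in enumerate(expected, start=1):
    label = "10+" if length == 10 else str(length)
    print(f"length {label:>3}: expected {e:8.1f}")
```

Passing the observed row of the table as `observed` to `chi_square_statistic(observed, expected)` would give the test statistic that is compared with the critical value.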
Since the test statistic 294.7 is much greater than the critical value 21.7, H0 cannot be accepted: the discrepancy in the numbers of words of length i is far too large to be attributed to random fluctuations, i.e., the letters do not seem to have been written by Mark Twain. Based on the results of applying this statistical test of authorship both to the ten QCS letters and to known contemporary Twain writing, Brinegar concluded that Twain was not the author of the disputed letters.
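For reference, the comparison of 294.7 with 21.7 follows the usual Pearson goodness of fit rule, sketched here under stated assumptions. The symbols O_i (observed QCS count of words of length i) and E_i (expected count) are notation introduced for illustration. The significance level is not legible in the text; 21.7 agrees with the tabulated upper 1% point of the χ² distribution with 10 − 1 = 9 degrees of freedom, so that level is assumed.

```latex
\chi^2 = \sum_{i=1}^{10} \frac{(O_i - E_i)^2}{E_i},
\qquad E_i = 13{,}175 \, P_i

\text{Reject } H_0 \text{ if } \chi^2 > \chi^2_{0.01,\,9} \approx 21.7;
\qquad \text{here } \chi^2 = 294.7 .
```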
Notes
Brinegar, C., "Mark Twain and the Quintus Curtius Snodgrass Letters: A Statistical Test of Authorship", Journal of the American Statistical Association, 1963, 58 (301): 85-96.