Wednesday, March 12, 2014


Over the course of the last few years, I've been using statistical data that I've compiled manually on the works of Bernard Shaw. The links below give you access to a number of spreadsheets containing the full list of words of each of Shaw's plays, prefaces, and novels. There are a couple of missing items (Immaturity being one of them), but for the most part everything else is there. 

The spreadsheets contain six columns, distributed as follows: 

N: The number (in descending order of frequency) the word occupies; i.e., word 1 is the most frequent. 
Word: The word itself, always spelled in capitals. 
Freq: The absolute frequency of the word; i.e., how many times it occurs in the text. 
%: The word's frequency as a percentage of the total number of words in the text. The cell is left blank if the percentage is lower than 0.01%.
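
For readers who want to build a comparable list from a plain-text edition, here is a minimal Python sketch of the four columns described above. It is not the procedure used to compile the spreadsheets: the tokenizer (runs of letters, with internal apostrophes or hyphens allowed), the alphabetical tie-breaking, and the names `TOKEN_PATTERN`, `build_frequency_table`, and `min_percent` are all assumptions made for illustration.

```python
import re
from collections import Counter

# Assumption: a "word" is a run of letters, optionally joined by internal
# apostrophes or hyphens. The tokenizer behind the spreadsheets is not documented.
TOKEN_PATTERN = re.compile(r"[A-Za-z]+(?:['-][A-Za-z]+)*")


def build_frequency_table(text, min_percent=0.01):
    """Return rows of (N, Word, Freq, %) in descending order of frequency."""
    words = [w.upper() for w in TOKEN_PATTERN.findall(text)]  # Word is in capitals
    total = len(words)
    counts = Counter(words)
    # Sort by descending frequency; break ties alphabetically (an assumption).
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    rows = []
    for rank, (word, freq) in enumerate(ranked, start=1):
        percent = 100 * freq / total
        # Mirror the blank cell: no percentage is reported below the threshold.
        rows.append((rank, word, freq, percent if percent >= min_percent else None))
    return rows


sample = "The play is the thing, and the thing is new."
for row in build_frequency_table(sample):
    print(row)
# First row printed: (1, 'THE', 3, 30.0)
```

A different tokenizer will give different counts, especially for contractions and hyphenated forms, so the numbers from a sketch like this may not match the spreadsheets exactly.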

The last two columns, Text and the second %, are set to 1 and 100 by default, respectively. These two sets of data would only be relevant if we were studying more than one text at a time and wanted to know in how many of them the word appears, regardless of its aggregate frequency. 
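
To see what the last two columns would hold in a multi-text study, here is a small Python sketch that counts, for each word, the number of texts it appears in and the share of texts that represents. The function name `text_range` and the sample word lists are hypothetical, and reading the second % as a percentage of texts is my assumption based on the defaults of 1 and 100.

```python
from collections import Counter


def text_range(word_lists):
    """Map each word to (number of texts containing it, percentage of texts)."""
    n_texts = len(word_lists)
    text_counts = Counter()
    for words in word_lists.values():
        # set() counts a word once per text, however often it occurs there.
        text_counts.update(set(words))
    return {word: (n, 100 * n / n_texts) for word, n in text_counts.items()}


# Hypothetical word lists, invented for illustration only.
texts = {
    "text_a": ["THE", "LIFE", "FORCE", "THE"],
    "text_b": ["THE", "FLOWER", "GIRL"],
}
ranges = text_range(texts)
print(ranges["THE"])   # (2, 100.0)
print(ranges["LIFE"])  # (1, 50.0)
```

With a single text, every word comes out as (1, 100.0), which matches the default values in the spreadsheets.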

From my experience as a corpus linguist, there are a couple of caveats you may want to consider. First of all, word processors have a tendency to abhor Shaw's spelling, so if you are going to analyze his use of contractions, you may be in for a hard time. Also, some examples of eye dialect (I'm thinking of John Bull's Other Island or Pygmalion, for example) count as separate words even though they may actually be phonetic spellings of a given word. So the safest policy if you want to study the dialectological varieties in Shaw's plays is to read the source texts carefully rather than plunge into statistical analysis head first. 
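
The eye-dialect caveat can be shown with a short Python sketch that merges variant spellings into a standard form before comparing counts. Everything here is hypothetical: the mapping, the spellings, and the frequencies are invented, not taken from the spreadsheets. In practice such a mapping can only be built by reading the source texts.

```python
from collections import Counter

# Hypothetical mapping from eye-dialect spellings to standard forms.
# These entries are invented examples, not data from the spreadsheets.
NORMALIZATION = {"WOT": "WHAT", "WIV": "WITH"}


def merged_counts(counts, mapping):
    """Add each variant spelling's frequency to that of its standard form."""
    merged = Counter()
    for word, freq in counts.items():
        merged[mapping.get(word, word)] += freq
    return merged


# Invented frequencies: the raw list treats WHAT and WOT as unrelated words.
raw = Counter({"WHAT": 40, "WOT": 7, "WITH": 55, "WIV": 3})
merged = merged_counts(raw, NORMALIZATION)
print(raw["WHAT"], merged["WHAT"])  # 40 47
```

Merging in this way throws away the very dialect information a study of the plays' varieties would need, so it is probably best kept as a second view alongside the raw counts rather than a replacement for them.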

I hope you enjoy this, and if you find it useful, you might want to join the International Shaw Society and support this and other initiatives. 
