new drinking laws in the UK. The article from "The Sun" is titled "Sup all night". The article from "Daily Mail" is titled "Police braced for the great British binge" (see references for more detail).
The research consists of the following steps. First, I select a sample of 100 words from each article. I count the length of each word and the frequency of words of the same length, put the results into a summary table and analyze the findings. Then I repeat the procedure for 200 words and 400 words. I split the analysis into these three consecutive steps in order to see any changes in my statistical indicators (mean, median and mode). They should not vary drastically for either article when moving to a larger sample, but they should become more accurate, because random differences smooth out in larger samples.
As noted above, for each sample size I calculate the mean, median and mode. The mean shows the average word length in the sample and is obtained by dividing the total number of letters in the sample by the total number of words. It can therefore be any decimal number, such as 4.53. It does not give the exact number of letters in any word (there are no words with 4.53 letters), but it gives a good indication of how letters are distributed across the words.
However, the mean can be somewhat misleading if the distribution of the data is skewed to the left or right. In that case the outliers carry too much weight in the mean and distort the real picture. To deal with this I also calculate the median, which shows the length of the word in the middle of the sample. If I arrange all 100 words in the sample in order from 1-letter words to 14-letter words, the median is the length of the word in the centre (with an even number of words, the average of the two central lengths).
The mode is the word length that occurs most frequently in the sample.
I made the following assumptions when selecting the samples and calculating the word lengths:
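The three indicators described above can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the function name `summarize` and the ten sample lengths are hypothetical and are not taken from either article. The standard `statistics` module is used; `multimode` returns a list because a sample can have more than one most frequent length.

```python
from statistics import mean, median, multimode

def summarize(lengths):
    """Return the mean, median and mode(s) of a list of word lengths."""
    return {
        "mean": mean(lengths),        # total letters / total words
        "median": median(lengths),    # middle value; averages the two central values if the count is even
        "mode": multimode(lengths),   # most frequent length(s), as a list
    }

# Hypothetical word lengths, NOT data from the articles.
lengths = [3, 4, 4, 2, 7, 3, 4, 5, 1, 10]
stats = summarize(lengths)
print(stats)
```

In this made-up sample the single 10-letter word pulls the mean above the median, which is the skew effect described above.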
1) I do not count punctuation marks (i.e. commas, periods, question marks, quotes, etc. are not taken into account).
2) A word with a hyphen is counted as if it is a word without a hyphen (e.g. shake-up is regarded as a seven-letter word).
3) The apostrophe in a word is not counted (e.g. labour's is regarded as a seven-letter word).
4) When I encounter a number, a date or a time, I take it into account. The number of symbols in it becomes the word length (e.g. 10500 is regarded as a five-letter word, 11pm is regarded as a four-letter word).
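The four counting rules above can be sketched as a small Python helper. This is a minimal sketch, not the procedure I actually used (the counting was done by hand). The names `word_length` and `sample_lengths` are hypothetical, and I assume here that words are separated by whitespace and that only letters and digits count towards the length, so a thousands separator inside a number would be dropped like any other punctuation mark.

```python
def word_length(token):
    """Length of one word: letters and digits only.

    Punctuation, hyphens and apostrophes are skipped (rules 1-3);
    digits count as symbols of the word (rule 4).
    """
    return sum(1 for ch in token if ch.isalnum())

def sample_lengths(text, n):
    """Word lengths of the first n words of text (assumes whitespace-separated words)."""
    lengths = [word_length(token) for token in text.split()]
    lengths = [length for length in lengths if length > 0]  # drop tokens that are pure punctuation
    return lengths[:n]

print(word_length("shake-up"))   # hyphen is not counted
print(word_length("labour's"))   # apostrophe is not counted
print(word_length("11pm"))       # digits are counted
```

Calling `sample_lengths(text, 100)`, then with 200 and 400, would give the three nested samples used in the analysis.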
1) Based on the sample of 100 words from each article, I obtained the following data:
I can conclude that there are some differences here. The mean, median and mode are all slightly higher for "Daily Mail" than for "The Sun". However, these differences are small. In each article there is a similar tendency towards a high frequency of short words. Words of up to 5 letters (including 5) account for 75% of all words for "The Sun" and for 61% of all words for "Daily Mail". So, based on this sample, we can argue that the author of the "Daily Mail" article on average uses longer words than the author of the article in "The Sun".
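The summary table and the share of short words can be computed as in the following sketch. The function names `frequency_table` and `share_up_to` are hypothetical, and the sample lengths are made up for illustration; they are not the data from either article.

```python
from collections import Counter

def frequency_table(lengths):
    """Map each word length to (count, share of all words), in order of length."""
    counts = Counter(lengths)
    total = len(lengths)
    return {length: (counts[length], counts[length] / total)
            for length in sorted(counts)}

def share_up_to(lengths, max_len):
    """Fraction of words whose length is at most max_len (inclusive)."""
    return sum(1 for length in lengths if length <= max_len) / len(lengths)

# Hypothetical word lengths, NOT data from the articles.
lengths = [3, 4, 4, 2, 7, 3, 4, 5, 1, 10]
table = frequency_table(lengths)
short_share = share_up_to(lengths, 5)

for length, (count, share) in table.items():
    print(f"{length:>2} letters: {count} words ({share:.0%})")
print(f"Words of up to 5 letters: {short_share:.0%}")
```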
The histograms below show the distribution of frequencies for each article.
Words with 3 and 4 letters each account for more than 20% of all the words in the article.
No category accounts for more than 20% of all the words (unlike in "The Sun"). The 2-, 3-, 4-, 5- and 7-letter words each have more than a 12% share of the total