Descriptive statistics
Specification: Descriptive statistics: measures of central tendency – mean, median, mode; calculation of mean, median and mode; measures of dispersion – range and standard deviation; calculation of range; calculation of percentages.
Once quantitative data has been collected, it is important to summarise this data numerically. This quantitative summary is called descriptive statistics, and allows researchers to view the data as a whole. It also helps the reader to get an understanding of the data and saves them from needing to navigate through lots of results to get a basic understanding of the data. Descriptive statistics typically include a measure of central tendency and a measure of dispersion (which will have been selected based on the type of data collected), and can also include percentages.
Measures of central tendency
Measures of central tendency tell us about the central, most typical, value in a data set and are calculated in different ways.
Mean
Perhaps the most widely used measure of central tendency is the mean. The mean is what most people are referring to when they say ‘average’: it is the arithmetic average of a set of data. It is the most sensitive of all the measures of central tendency as it takes into consideration all values in the data set. Whilst this is a strength, as it means that all the data is being taken into consideration, the sensitivity of the mean is something that must be considered when deciding which measure of central tendency to use. It can be very unrepresentative of the data set if there are extreme scores present.
The mean is calculated by adding all of the data together, and dividing the sum by how many values there are in total. The value that is then given should be a value that lies somewhere between the maximum and minimum values of that dataset. If it isn’t, then there is a human error with the calculations!
Example: a student sits five mock A‐level psychology exams, and gets 65%, 72%, 71%, 67% and 79%. To calculate their mean score, you would add all the scores together (65+72+71+67+79 = 354) and then divide by the number of scores there are (354÷5 = 70.8). This gives a mean score of 71%. Looking at the data set, a mean of 71% looks quite accurate, as all of the scores are quite close to this value.
However, if we now imagine that the student got 12% instead of 65% in one of their mocks, then this completely alters our mean value and the student’s typical score! (12+72+71+67+79 = 301, 301÷5 = 60.2). This gives a mean score of 60%, which is lower than four of their five scores and so does not provide a very fair view of how well the student normally performs.
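The two calculations above can be checked with a short Python sketch. The function name mean and the variable names are chosen for illustration only; the scores are the ones used in the examples above.

```python
def mean(scores):
    """Arithmetic mean: add all the values, then divide by how many there are."""
    return sum(scores) / len(scores)

original_scores = [65, 72, 71, 67, 79]
with_extreme_score = [12, 72, 71, 67, 79]  # 65% replaced by 12%

print(mean(original_scores))     # 70.8
print(mean(with_extreme_score))  # 60.2

# Sanity check described in the text: the mean must lie between the
# lowest and highest values of the data set.
assert min(original_scores) <= mean(original_scores) <= max(original_scores)
```

A single extreme score moves the mean by more than ten percentage points, which is the sensitivity described above.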
Median
In cases where there are extreme values in a data set, thus making it difficult to get a true representation of the data through using the mean, the median can be used instead. The median is not affected by extreme scores, so is ideal when considering a data set that is heavily skewed. It is also easy to calculate, as the median takes the middle value within the data set.
Example: If there is an odd number of scores, then the median is the number which lies directly in the middle when you arrange the scores from lowest to highest. Using the previous data set as an example, there are five values that would be placed in the following order: 12%, 67%, 71%, 72%, and 79%. Therefore, the median value would be the third value, which is 71%.
Interestingly, the median score for this data set is 71%, yet the mean score was 60.2%. It is apparent from the data that the median is the more representative score, as it has not been distorted by the extreme score of 12%, unlike the mean.
If there is an even number of values within the data set, there will be two values that fall directly in the middle. In this case, the midpoint between these two values is calculated. To do this, the two middle scores are added together and then divided by two. This value will then be the median score.
Example: If the above data set included a sixth score, e.g. 12%, 34%, 67%, 71%, 72%, and 79%, then the median score would be 69% ((67% + 71%) ÷ 2 = 69%).
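As a rough illustration, the odd and even cases can be handled in one small Python function. This is a minimal sketch; the function name median is chosen for illustration, and the scores are those from the examples above.

```python
def median(scores):
    """Middle value of the ordered scores; midpoint of the two middle values if the count is even."""
    ordered = sorted(scores)          # arrange from lowest to highest
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:                    # odd count: one value sits in the middle
        return ordered[middle]
    # even count: add the two middle values together and divide by two
    return (ordered[middle - 1] + ordered[middle]) / 2

print(median([12, 72, 71, 67, 79]))      # 71
print(median([12, 34, 67, 71, 72, 79]))  # 69.0
```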
Mode
The third measure of central tendency is the mode. This refers to the value or score that appears most frequently within the data set. Whilst easy to calculate, it can be quite unrepresentative of the data set. Imagine if the lowest value in the example data set (12%) appeared twice. This would be the mode, even though it would not be truly representative of the whole data set.
A strength of using the mode is that it can be used on categorical data, whilst the mean and median cannot. For example, if participants were asked to identify the way that they travelled to work each day, and gave answers such as ‘car’, ‘bus’, or ‘walk’, then a mode could still be identified for this set of data, as it is simply the response that was given most often.
It is possible that a data set can have more than one mode. If there are two, then the data set is called bi‐modal. If there are more than two, then the data set is considered multi‐modal. However, it is also possible that there is no mode for a data set, if all of the values are different.
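The cases described above (one mode, several modes, no mode, and categorical data) can be sketched in Python as follows. The function name modes is illustrative, and the travel responses and the bi-modal list are made-up data, not results from a real study.

```python
from collections import Counter

def modes(values):
    """Return every value that occurs most often; an empty list if all values are different."""
    counts = Counter(values)
    if not counts:
        return []
    highest = max(counts.values())
    if highest == 1:
        return []  # every value appears once, so there is no mode
    return [value for value, count in counts.items() if count == highest]

print(modes([12, 12, 72, 71, 67, 79]))         # [12]
print(modes(["car", "bus", "car", "walk"]))    # ['car']  (hypothetical responses)
print(modes([1, 1, 2, 2, 3]))                  # [1, 2]   (bi-modal, hypothetical data)
print(modes([65, 72, 71, 67, 79]))             # []       (no mode)
```

Because the function only counts how often each value appears, it works on words such as ‘car’ just as well as on numbers.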
Exam Hint: Whenever you are asked to calculate any measure of central tendency, make sure you show your calculations. Often, the question will be worth two or three marks; so, it is important to show how you reached your final answer for maximum marks!
Measures of dispersion
Measures of dispersion are descriptive statistics that describe the spread of data around a central value (mean or median). Two measures of dispersion are covered here: range and standard deviation (SD).
Range
The range is calculated by subtracting the lowest score in the data set from the highest score in the data set and (usually) adding 1. The addition of 1 to the calculation is a mathematical correction which allows for the fact that some of the scores in the data set will have been rounded up or down.
Referring to the earlier example, the lowest value was 12 and the highest was 79, resulting in a range of 67 (79 – 12 = 67), or 68 if 1 is added (79 – 12 + 1 = 68). This value is very straightforward to calculate, which is a clear strength of using the range. However, it is important to recognise that a data set with a strong negative skew can have a similar range to a data set with a strong positive skew, in which case the range provides only a very limited insight into the data set. Equally, it only takes into consideration the two most extreme scores, which may not be an accurate representation of the data set as a whole.
Students often ask “Why do you add 1 to the range?” and the answer is a simple one which is best illustrated with an example: If the lowest score is 5 and the highest score is 9, the possible scores are 5, 6, 7, 8 and 9. There are five possible scores, but 9 – 5 = 4. The simple calculation ignores the fact that you have to include the lowest score in the range, so you add 1.
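Both versions of the range calculation can be sketched in Python. The function name value_range and the add_one parameter are illustrative choices, not standard terminology.

```python
def value_range(scores, add_one=False):
    """Highest score minus lowest score, optionally adding 1 as described in the text."""
    spread = max(scores) - min(scores)
    return spread + 1 if add_one else spread

scores = [12, 72, 71, 67, 79]
print(value_range(scores))                # 67
print(value_range(scores, add_one=True))  # 68

# The 5-to-9 illustration: five possible scores, but 9 - 5 = 4.
print(value_range([5, 9], add_one=True))  # 5
```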
Standard deviation
A much more informative measure of dispersion is the standard deviation. However, the increased level of detail comes at the cost of a slightly more complicated calculation in comparison to the range. The standard deviation looks at how far the scores deviate from the mean. If the standard deviation is large, this suggests that the data is very dispersed around the mean and, for example, that the participants scored very differently. However, if the standard deviation is quite small, this suggests that the values are very concentrated around the mean, and that everyone scored relatively similarly to one another.
The standard deviation score takes into consideration all of the values within the data set, and is a very precise measurement. However, in the same way as the mean, the fact that it takes into account every value means that it can be easily distorted by an extreme value, which could in turn mean that it misrepresents the data.
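The effect of an extreme value on the standard deviation can be seen in a short Python sketch. The text above does not give a formula, so this assumes the sample standard deviation (dividing by n – 1), which is what statistics.stdev in Python's standard library calculates; a population version (statistics.pstdev) would give slightly smaller values. The scores are the two sets used in the mean examples.

```python
from statistics import stdev  # sample standard deviation (assumed version)

original_scores = [65, 72, 71, 67, 79]
with_extreme_score = [12, 72, 71, 67, 79]  # 65% replaced by 12%

sd_original = stdev(original_scores)
sd_extreme = stdev(with_extreme_score)

print(round(sd_original, 1))  # 5.4
print(round(sd_extreme, 1))   # 27.3
```

The small value for the first set shows that the scores are concentrated around the mean, whereas the single score of 12% makes the second set look far more dispersed.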
Exam Hint: Questions regarding interpretation of standard deviation values are often worth several marks, so it is important to make sure you link your answer back to the question, rather than just pointing out how they are different. Make sure you tell the examiner what these scores actually tell you about the data!
Calculation of percentages
Providing percentages in the summary of a dataset can help the reader get a feel for the data at a glance, without needing to read all of the results. For example, if there are two conditions comparing the effects of revision vs. no revision on test scores, a psychologist could provide the percentage of participants who performed better having revised, to give a rough idea of the findings of the study. Let’s imagine that out of a total of 45 participants, 37 improved their score by revising.
In order to calculate a percentage, the following calculation would be used:
Percentage = (number meeting the criteria ÷ total number) × 100
The bottom number in the formula should always be the total number in question (such as total number of participants, or total possible score), with the top number being the number that meets the specific criteria (such as participants who improved, or a particular score achieved). This answer is then multiplied by 100 to provide the percentage.
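Applied to the revision example (37 of 45 participants improved), the calculation can be sketched in Python. The function and parameter names are illustrative.

```python
def percentage(number_meeting_criteria, total):
    """Top number divided by bottom number (the total), multiplied by 100."""
    return number_meeting_criteria / total * 100

improved = percentage(37, 45)
print(round(improved, 1))  # 82.2
```

So roughly 82% of participants improved their score by revising.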
Calculation of percentage increase
In order to calculate a percentage increase, firstly the difference, i.e. increase, between the two numbers being compared must be calculated. Then, the increase should be divided by the original figure and multiplied by 100 (see example calculation below).
For example: A researcher was interested in investigating the effect of listening to music on time taken to read a passage of text. When participants were asked to read with music playing in the background, the average time to complete the activity was 90 seconds. When participants undertook the activity without any music, the average time taken to complete the reading was 68 seconds.
Calculate the percentage increase in the average (mean) time taken to read a passage of text when listening to music. Show your calculations. (4 marks)
Increase = new number – original number
Increase = 90 – 68 = 22
% increase = increase ÷ original number × 100
% increase = 22 ÷ 68 × 100
22/68 = 0.3235
0.3235 × 100 = 32.35%
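The same steps can be checked with a minimal Python sketch. The function name percentage_increase is illustrative; the figures are those from the reading example.

```python
def percentage_increase(original, new):
    """Increase (new minus original) divided by the original figure, multiplied by 100."""
    increase = new - original
    return increase / original * 100

# Reading time: 68 seconds without music (original), 90 seconds with music (new).
result = percentage_increase(68, 90)
print(round(result, 2))  # 32.35
```

Note that the original figure is the time without music, because the question asks about the increase when listening to music.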
Calculation of percentage decrease
In order to calculate a percentage decrease, firstly the decrease between the two numbers being compared must be calculated. Then, the decrease should be divided by the original figure and multiplied by 100 (see example calculation below).
For example: A researcher was interested in investigating the effect of chewing gum on time taken to tie shoelaces. When participants were asked to tie a pair of shoelaces in trainers whilst chewing a piece of gum, the average time to complete the activity was 20 seconds. When participants undertook the activity without chewing any gum, the average time taken to tie the shoelaces was 17 seconds.
Calculate the percentage decrease in the average (mean) time taken to tie shoelaces when not chewing gum. Show your calculations. (4 marks)
Decrease = original number – new number
Decrease = 20 – 17 = 3
% decrease = decrease ÷ original number × 100
% decrease = 3 ÷ 20 × 100
3/20 = 0.15
0.15 × 100 = 15%
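A matching Python sketch for the decrease is given below. The function name percentage_decrease is illustrative; the figures are those from the shoelace example.

```python
def percentage_decrease(original, new):
    """Decrease (original minus new) divided by the original figure, multiplied by 100."""
    decrease = original - new
    # Multiplying before dividing gives the same answer mathematically
    # and avoids small floating-point rounding artefacts.
    return decrease * 100 / original

# Shoelace time: 20 seconds with gum (original), 17 seconds without gum (new).
result = percentage_decrease(20, 17)
print(result)  # 15.0
```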
Possible exam questions
Name one measure of central tendency. (1 mark)
Which of the following is a measure of dispersion? (1 mark)
A Mean
B Median
C Mode
D Range
Calculate the mode for the following data set. (1 mark) 10, 2, 7, 6, 9, 10, 11, 13, 12, 6, 28, 10
Calculate the mean from the following data set. Show your workings. (2 marks) 2, 8, 10, 5, 9, 11, 15, 4, 16, 20
Explain the meaning of standard deviation as a measure of dispersion. (2 marks)
Other than the mean, name one measure of central tendency and explain how you would apply this to a data set. (3 marks)
Explain why the mode is sometimes a more appropriate measure of central tendency in comparison to the mean. (3 marks)
Explain one strength and one limitation of the range as a measure of dispersion. (4 marks)
Evaluate the use of the mean as a measure of central tendency. You may refer to strengths and/or limitations in your response. (4 marks)