Graphs and Numerical Summaries
1 Data Types
2 Distributions and Graphs
3 Measures of Center
4 Measures of Spread
Definitions:
Variable
• Variable: any characteristic that takes different values for different individuals
• Categorical (qualitative) variables place an individual into one of several groups
• Examples: gender, race
• Categorical variables can be string (alphanumeric) data or numeric variables that use numeric codes to represent categories (for
example, 0 = Unmarried and 1 = Married).
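• There are two basic types of categorical data:
– Nominal. Categorical data where there is no inherent order to the categories. For example, a job category of sales is not higher or lower than a job category of marketing or research.
– Ordinal. Categorical data where the categories have a meaningful order but no measurable distance between them. For example, satisfaction rated as low, medium, or high.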
• Quantitative variables (also referred to as scale data) take on numerical values (mostly continuous).
• Examples: height, age, wages
• Quantitative Data is measured on an interval or ratio scale, where the data values indicate both the order of values and the distance
between values.
• For example, a salary of $72,195 is higher than a salary of $52,398, and the distance between the two values is $19,797.
•
Interval Scale:
Definition: Distribution
• A distribution describes what values a variable takes and how often it takes them
Distribution
1- Frequency Table
• A frequency table has a column of classes that cover the data and a column giving the frequency of each class
Steps to form the frequency table for qualitative data (categories) or a small number of distinct quantitative values:
- Each value is itself a class
Steps to form the frequency table for a quantitative variable:
- With a large number of quantitative values, each class is a range of data with a minimum and a maximum value
1. Determine the highest and the lowest values of the data
2. Calculate the range = (the highest value – the lowest value)
3. Determine the number of classes (a relative choice that depends on the data)
4. Determine the length of each class: L = Range / (number of classes), rounded up to a convenient value
5. Form the table with the columns: classes, Freq.
Example: Find the frequency table for the marks of the statistics class (8 classes)
56 65 70 65 55 60 66 70 75 56
60 70 61 67 61 71 67 62 71 66
68 72 57 68 72 69 57 71 69 75
72 62 67 73 58 63 66 73 63 65
58 73 74 76 74 80 81 60 74 58
76 82 77 83 77 80 91 78 94 72
79 64 57 79 55 87 64 88 78 62
Example: Frequency table for a quantitative variable
• Each class is defined by two values, and the distance between them is the class length (5): the first value is called the lower limit and the second the upper limit.
• Each class begins where the previous class ends, so that:
• Each observation falls inside exactly one class
Example:
• The first class:
– Lower limit = lowest mark = 55
– Upper limit = lower limit + class length = 55 + 5 = 60
– So the first class is "from 55 to less than 60"
– Note that the mark 60 does not fall inside the first class
classes   frequency (f)   Relative frequency
55 - 60   10              0.143*
60 - 65   12              0.171
65 - 70   13              0.186
70 - 75   16              0.229
75 - 80   10              0.143
80 - 85   5               0.071
85 - 90   2               0.029
90 - 95   2               0.029
sum       70              1
* relative frequency = f / 70, for example 10/70 ≈ 0.143
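The steps above can be sketched in Python. This is a minimal illustration, not a standard routine: the function name `frequency_table` is hypothetical, and the class length is chosen as the smallest whole number strictly larger than range / (number of classes), which is an assumption made here so that the highest mark falls inside the last half-open class.

```python
import math

# Marks of the statistics class (70 values), copied from the example.
marks = [
    56, 65, 70, 65, 55, 60, 66, 70, 75, 56,
    60, 70, 61, 67, 61, 71, 67, 62, 71, 66,
    68, 72, 57, 68, 72, 69, 57, 71, 69, 75,
    72, 62, 67, 73, 58, 63, 66, 73, 63, 65,
    58, 73, 74, 76, 74, 80, 81, 60, 74, 58,
    76, 82, 77, 83, 77, 80, 91, 78, 94, 72,
    79, 64, 57, 79, 55, 87, 64, 88, 78, 62,
]


def frequency_table(data, n_classes):
    """Return rows of (lower, upper, frequency, relative frequency).

    Each class is half-open: lower <= value < upper.
    """
    lowest, highest = min(data), max(data)           # step 1
    data_range = highest - lowest                    # step 2
    # Steps 3-4: length strictly above range / n_classes (assumption),
    # so the classes cover every value including the maximum.
    class_length = math.floor(data_range / n_classes) + 1
    rows = []
    for k in range(n_classes):                       # step 5
        lower = lowest + k * class_length
        upper = lower + class_length
        freq = sum(1 for x in data if lower <= x < upper)
        rows.append((lower, upper, freq, freq / len(data)))
    return rows


rows = frequency_table(marks, 8)
for lower, upper, freq, rel in rows:
    print(f"{lower} - {upper}  f={freq:2d}  rel={rel:.3f}")
```

The counts are computed directly from the 70 marks listed in the example, so the printed rows can be checked against the table by hand.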
Distribution: 2- Graphs
• Qualitative Data: we can graph the distribution using bar plots and pie charts
• Quantitative Data: histograms, frequency polygons, and frequency curves
Barplots and Pie Charts (Qualitative Data)
• Pie charts are generally not as useful as barplots
• Need to have all categories to make a pie chart
• Harder to compare subsets of categories
[Figure: pie chart titled "Sales" and bar plot (frequency axis from 0 to 16) for the categories car, ship, airplane, bus]
classes    frequency
car        12
ship       10
airplane   15
bus        5
sum        42
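As a rough illustration of what the two graphs encode, the sketch below computes each category's share of the total (the size of its pie slice) and prints a text bar whose length is the frequency. The category names are the English translations car, ship, airplane, bus; the text-bar layout is only an assumption standing in for a real plotting library.

```python
# Frequencies from the table above (category names translated to English).
counts = {"car": 12, "ship": 10, "airplane": 15, "bus": 5}

total = sum(counts.values())
# Share of each category: the fraction of the pie it would occupy.
shares = {name: freq / total for name, freq in counts.items()}

for name, freq in counts.items():
    bar = "#" * freq  # bar length equals the frequency
    print(f"{name:<9}{freq:>3} {shares[name]:6.1%} {bar}")
```

The bars can be compared pairwise at a glance, while the shares only make sense when every category is present, which matches the remarks about pie charts above.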
Histograms (For distribution of Quantitative Data)
• Histograms emphasize the frequency of different values in the distribution
• X-axis: Values are divided into intervals (each the length of a class)
• Y-axis: Height of each class is the frequency with which values fall in that interval
Histogram (1)
• A graphical representation of the simple frequency table for quantitative data
• It consists of adjacent bars that touch one another
• Frequencies go on the vertical axis, and the values of the variable (the class limits) go on the horizontal axis
• Each class is represented by a bar whose height is the class frequency and whose base is the class length.
• Example: we have the following frequency distribution of the weights, in grams, of a sample of 100 chickens:
– (Graph the Histogram).
Classes Frequency (f)
600- 10
620- 15
640- 20
660- 25
680- 20
700-720 10
Sum 100
Relative frequency table and its relative histogram
• Find the relative histogram for the previous example
Classes Frequency(f)
600- 10
620- 15
640- 20
660- 25
680- 20
700-720 10
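A minimal sketch of the relative frequency column for this example, assuming each class "600-" runs from its lower limit to the next lower limit (class length 20). The text bars stand in for the bars of the relative histogram: same shape as the histogram, with heights rescaled to sum to 1.

```python
# Chicken-weight classes from the example: lower limits and frequencies.
lower_limits = [600, 620, 640, 660, 680, 700]
frequencies = [10, 15, 20, 25, 20, 10]
class_length = 20  # assumption: every class spans 20 grams

n = sum(frequencies)                      # sample size
relative = [f / n for f in frequencies]   # relative frequency = f / n

for lower, f, rel in zip(lower_limits, frequencies, relative):
    upper = lower + class_length
    bar = "*" * round(rel * 100)          # bar height proportional to rel
    print(f"{lower}-{upper}  f={f:3d}  rel={rel:.2f}  {bar}")
```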
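Frequency Polygon (F-Polygon)
• Also a graphical representation of the simple frequency table
• Frequencies are plotted on the vertical axis and class centers on the horizontal axis
• The points are joined with straight lines, and then the two ends of the polygon are connected to the horizontal axis.
• The class center is computed as follows: center = (lower limit + upper limit) / 2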
Graph the F-Polygon
• Example: find F-polygon for the previous example
classes   Frequency (f)   Center
600-      10              (600+620)/2 = 610
620-      15              (620+640)/2 = 630
640-      20              650
660-      25              670
680-      20              690
700-720   10              710
Sum       100
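The class centers and the points of the polygon can be sketched as follows. Closing the polygon one class length before the first center and one after the last (at frequency 0) is a common convention assumed here; the slides only say that the two ends are connected to the horizontal axis.

```python
# Classes of the chicken-weight example.
lower_limits = [600, 620, 640, 660, 680, 700]
frequencies = [10, 15, 20, 25, 20, 10]
class_length = 20

# Class center = (lower limit + upper limit) / 2.
centers = [(lower + (lower + class_length)) / 2 for lower in lower_limits]

# Polygon points (center, frequency), closed on the horizontal axis at
# the centers of the empty neighbouring classes (assumed convention).
polygon = (
    [(centers[0] - class_length, 0)]
    + list(zip(centers, frequencies))
    + [(centers[-1] + class_length, 0)]
)

for x, y in polygon:
    print(x, y)
```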
F-Curve
• Similar to the F-Polygon, but we replace the straight lines with a smooth curve
• The distribution of a variable can be described graphically and numerically in terms of:
• Center: where are most of the values located?
• Spread: how variable are the values?
• Shape: is the distribution symmetric or skewed? Are there multiple peaks or just one?
3- Measures of Center
• Simple examples:
• Numbers: 1, 2, 6, 2, 4, 2, 5
Mean = Median= Mode=
• Numbers: 5.8, 5.7, 5.9, 5.7, 5.5, 5.7, 5.7, 5.7, 5.6
Mean = Median= Mode=
• Throw out the number 5.5 and again find the mean, median, and mode
Example: Suppose we have 5 students. Four of them have the following marks: 1.8, 1.72, 1.5, and 1.8. The average of all five marks is 1.7. Compute the mark of the fifth student.
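The exercises above can be checked with Python's standard `statistics` module. This is a sketch under two assumptions: the entry "5,7" in the second list is read as 5.7, and the average 1.7 in the last example refers to all five students.

```python
from statistics import mean, median, mode

first = [1, 2, 6, 2, 4, 2, 5]
second = [5.8, 5.7, 5.9, 5.7, 5.5, 5.7, 5.7, 5.7, 5.6]

for data in (first, second):
    print("mean:", mean(data), "median:", median(data), "mode:", mode(data))

# Throw out 5.5 and recompute the mean of the remaining eight values.
trimmed = [x for x in second if x != 5.5]
print("mean without 5.5:", mean(trimmed))

# Fifth student: the five marks must add up to 5 * 1.7.
known_marks = [1.8, 1.72, 1.5, 1.8]
fifth_mark = 5 * 1.7 - sum(known_marks)
print("fifth mark:", round(fifth_mark, 2))
```

Dropping the low value 5.5 raises the mean but leaves the median and mode at 5.7, which shows how the mean reacts to extreme values more than the other two measures.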
Measures of Center
• The mean of a sample of n values: $\text{mean} = \frac{\sum x_i}{n}$
• In case there is more than one sample and each sample has its own mean, the weighted mean of these samples is:
$\text{Mean}_w = \frac{n_1\,\text{mean}_1 + n_2\,\text{mean}_2 + \dots}{n_1 + n_2 + \dots}$
Example:
Two groups of Statistics classes: the first group consists of 50 students and the mean of their grades is 15; the second group consists of 40 students and the mean is 10. Compute the weighted mean.
The weighted mean
• Example: we have two samples with the following results: find the weighted mean.
• Example: A student has taken three courses which have the
credits: 4, 3, and 5 hours. At the end of the semester, the final marks were as follows: 68, 72 and 81, respectively. Find the average mark of this semester.
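Both weighted-mean examples follow the same pattern: multiply each value by its weight, add up, and divide by the total weight. A minimal sketch, where the helper name `weighted_mean` is hypothetical:

```python
def weighted_mean(values, weights):
    """Sum of weight * value divided by the sum of the weights."""
    return sum(w * v for v, w in zip(values, weights)) / sum(weights)


# Two groups: means 15 and 10, weighted by group sizes 50 and 40.
groups_mean = weighted_mean([15, 10], [50, 40])

# Three courses: marks 68, 72, 81, weighted by credits 4, 3, 5.
semester_mean = weighted_mean([68, 72, 81], [4, 3, 5])

print(round(groups_mean, 2), round(semester_mean, 2))
```

In the first call the weights are sample sizes and in the second they are credit hours; the formula is the same in both cases.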
Properties of the mean
1- If the sample consists of n values that are all equal to a, then the mean is a:
$\text{mean} = \frac{a + a + \dots + a}{n} = \frac{na}{n} = a$
2- The sum of the deviations of the values from their mean equals zero:
$\sum_{i=1}^{n} (x_i - \text{mean}) = 0$
3- If a constant a is added to each value in the sample, then:
$\text{mean}_{new} = \text{mean}_{old} + a$
4- If each value in the sample is multiplied by a constant a, then:
$\text{mean}_{new} = a \cdot \text{mean}_{old}$
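The four properties of the mean can be checked numerically. The sample and the constant below are hypothetical values chosen only for illustration; any other numbers should behave the same way.

```python
import math
from statistics import mean

sample = [2.0, 4.0, 7.0, 11.0]  # hypothetical sample
a = 3.0                         # hypothetical constant
m = mean(sample)

# 1- n identical values equal to a have mean a.
prop1 = mean([a] * 5) == a

# 2- Deviations from the mean sum to zero.
prop2 = math.isclose(sum(x - m for x in sample), 0.0, abs_tol=1e-12)

# 3- Adding a to every value adds a to the mean.
prop3 = math.isclose(mean([x + a for x in sample]), m + a)

# 4- Multiplying every value by a multiplies the mean by a.
prop4 = math.isclose(mean([a * x for x in sample]), a * m)

print(prop1, prop2, prop3, prop4)
```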
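Relations between the mean, mode and median:
• 1- For a single-peaked, moderately skewed distribution: $\text{mean} - \text{mode} \approx 3\,(\text{mean} - \text{median})$
• 2- A constant a acts on the median as it does on the mean: $\text{Median}(x + a) = \text{Median}(x) + a$ and $\text{Median}(a\,x) = a \cdot \text{Median}(x)$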
• Variance = (square of the standard deviation): $\text{variance} = s^2$
4- Measures of Spread
Properties of the Standard Deviation
1- If the sample consists of n values that are all equal to a, then $S = 0$
2- If a constant a is added to each value in the sample, the standard deviation does not change
3- If each value in the sample is multiplied by a constant a, then $S_{new} = |a| \cdot S_{old}$ (the absolute value matters when a is negative)
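A quick numerical check of the three properties, using the sample standard deviation (divisor n − 1) from the standard library. The sample and the constant are hypothetical; a negative constant is used on purpose to show that scaling changes the standard deviation by the absolute value of the constant.

```python
import math
from statistics import stdev

sample = [2.0, 4.0, 7.0, 11.0]  # hypothetical sample
a = -3.0                        # hypothetical (negative) constant
s = stdev(sample)

# 1- Identical values have no spread.
prop1 = stdev([5.0] * 4) == 0

# 2- Shifting every value by a leaves the standard deviation unchanged.
prop2 = math.isclose(stdev([x + a for x in sample]), s)

# 3- Scaling every value by a scales the standard deviation by |a|.
prop3 = math.isclose(stdev([a * x for x in sample]), abs(a) * s)

print(prop1, prop2, prop3)
```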
4- Measures of variation
Coefficient of Variation
• It measures the degree of spread of the data relative to the mean
• It is useful for comparing the degree of spread of two or more groups of data that have different units.
$C.V. = \frac{s}{\bar{X}} \times 100$
Example: find the coefficient of variation of the following data: $\bar{X} = 36.6$, $s = 15.3$
Example: which are more homogeneous (have a smaller degree of spread), the weights or the lengths?
[Table: data for the weights and the lengths]
Standard deviation
$S^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{X})^2$, or equivalently $S^2 = \frac{1}{n-1}\left(\sum_{i=1}^{n} x_i^2 - n\bar{X}^2\right)$; the standard deviation is $S = \sqrt{S^2}$
Example: A group of 20 workers have the following data about their salaries X and their work hours Y:
$\bar{X} = 2000$, $S_X = 16$, $\sum Y = 184$, $\sum Y^2 = 3982$
1- Find the standard deviation of the number of work hours.
2- Find the sum of the salaries of the workers.
3- Which has less spread, X or Y?
Example:
Two different samples have been taken from a specific population, and the results are given in the table below.
1- Which of these two samples is more homogeneous?
2- If we merge the two samples, find the standard deviation of the new sample.
First sample (n = 60): $\sum_{i=1}^{60} X_i = 660$, $\sum_{i=1}^{60} X_i^2 = 7560$
Second sample (n = 30): $\sum_{i=1}^{30} Y_i = 300$, $\sum_{i=1}^{30} Y_i^2 = 3200$
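One way to work this example is sketched below. It assumes the table is read as n = 60, ΣX = 660, ΣX² = 7560 for the first sample and n = 30, ΣY = 300, ΣY² = 3200 for the second, and it uses the sample variance with divisor n − 1. The helper name `summary` is hypothetical.

```python
import math


def summary(n, total, total_sq):
    """Mean, standard deviation, and C.V. (%) from n, sum x, and sum x^2."""
    mean = total / n
    variance = (total_sq - n * mean ** 2) / (n - 1)  # divisor n - 1
    sd = math.sqrt(variance)
    cv = 100 * sd / mean
    return mean, sd, cv


first = summary(60, 660, 7560)
second = summary(30, 300, 3200)

# Merging the samples: sizes, sums, and sums of squares simply add up.
merged = summary(60 + 30, 660 + 300, 7560 + 3200)

for name, (mean, sd, cv) in [("first", first), ("second", second),
                             ("merged", merged)]:
    print(f"{name}: mean={mean:.3f} sd={sd:.3f} cv={cv:.1f}%")
```

The sample with the smaller coefficient of variation is the more homogeneous one. For the merged sample the standard deviation is recomputed from the combined sums; averaging the two standard deviations would not give the right answer.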
Quartiles
• Outliers
• Almost all values are between 5 and 13
• 50% of values are between 7.5 and 10
• Center (Median) is around 8.5
• Couple of suspected outliers: 14 and 14.5
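The slides do not say how the suspected outliers were flagged. A common boxplot convention, assumed here, marks any value more than 1.5 × IQR beyond the quartiles. Reading the quartiles off the description above as Q1 = 7.5 and Q3 = 10 gives the following sketch:

```python
# Quartiles read from the boxplot description (middle 50% of values).
q1, q3 = 7.5, 10.0
iqr = q3 - q1  # interquartile range

# Fences under the 1.5 * IQR rule (an assumed convention).
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr


def is_suspected_outlier(x):
    """True if x lies outside the fences."""
    return x < lower_fence or x > upper_fence


for value in (5, 13, 14, 14.5):
    print(value, is_suspected_outlier(value))
```

Under this rule the values 14 and 14.5 fall above the upper fence while 5 and 13 stay inside, which agrees with the reading of the plot given above.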
Histograms versus Boxplots
• Both graphs give a good idea of the spread
• Boxplots may be a little clearer in terms of the center and outliers in a distribution
[Figure: annotated with center, outliers, and spread of likely values]
Associations between Variables
• Two variables are positively associated if increased values of one variable tend to occur with increased values of the other
• Two variables are negatively associated if increased values of one variable tend to occur with decreased values of the other
• Old Faithful: eruption duration is positively associated with the interval between eruptions
• Remember that association is not proof of causation
Correlation
• Correlation is a measure of the strength of the linear relationship between variables X and Y
• Correlation has a range between -1 and 1
• r = 1 means the relationship between X and Y is exactly positive linear
• r = -1 means the relationship between X and Y is exactly negative linear
• r = 0 means that there is no linear relationship between X and Y
Measure of Strength: Pearson Correlation
• Correlation of two variables:
$r = \frac{1}{n-1}\sum_{i=1}^{n}\frac{(x_i - \bar{x})(y_i - \bar{y})}{s_x\, s_y}$
• We divide by the standard deviations of both X and Y, so correlation has no units
• Find the Pearson correlation:
X   4 5 3 4 4
Y   3 2 4 4 2
• Find the Pearson correlation:
X   4 5 3 4 2
Y   3 2 4 4 1
Working columns for the calculation: X, Y, XY, X², Y², with their sums ΣX, ΣY, ΣXY, ΣX², ΣY²
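The Pearson formula can be applied to both small data sets directly. A minimal sketch, where the function name `pearson_r` is hypothetical and the standard deviations use the divisor n − 1, matching the 1/(n − 1) factor in the formula:

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation: sum of products of deviations, divided by
    (n - 1) and by both sample standard deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / (n - 1))
    s_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / (n - 1))
    products = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return products / ((n - 1) * s_x * s_y)


r_first = pearson_r([4, 5, 3, 4, 4], [3, 2, 4, 4, 2])
r_second = pearson_r([4, 5, 3, 4, 2], [3, 2, 4, 4, 1])

print(round(r_first, 3), round(r_second, 3))
```

The first data set gives a negative r (Y tends to fall as X rises) and the second a weak positive r. In both cases the result has no units, as noted above.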