Histograms

=HISTOGRAMS= ~Taylor & Simir


**Definition:** a graphical summary of data in the form of bars; each bar covers a range of values called a bin, and the height of the bar shows the frequency of the data in that bin
 * similar to a bar graph, but a bar graph shows an absolute amount for each category on the y-axis --> a histogram shows the __frequency of events__ on the y-axis
 * a boxplot also summarizes data and is often paired with a histogram; boxplots include information about the median and range of the data --> for more in-depth information on boxplots see this webpage http://killianhpre2010-11.wikispaces.com/Five+Number+Summary%2C+Box+Plots

Example: let's say we took a poll of how many men and women there are in a class--men would answer A and women would answer B; a histogram would work well to show //how many// men and women there are (the frequency of people who answered A and B)

Still not quite sure what to use a histogram for? For all you nature lovers out there, here's one about black cherry trees.

Due to the great rainfall this past spring, the growth of cherry trees was tremendous. An environmentalist (and cherry pie lover) measured the heights of the cherry trees. The table and histogram below show the number of cherry trees falling within given ranges of height.

The numerical ranges would be the **bins** and the number of trees in those ranges would be the frequency of the data. So the histogram would look like this:
|| Height (feet) || Number of Black Cherry Trees ||
|| 60-65 || 3 ||
|| 65-70 || 3 ||
|| 70-75 || 8 ||
|| 75-80 || 10 ||
|| 80-85 || 5 ||
|| 85-90 || 2 ||


 * The bins are reflected on the x-axis and the frequencies are included on the y-axis
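As a rough sketch of how the cherry tree table turns into a histogram, the Python snippet below prints one row of `#` marks per bin. The counts are copied from the table; the names `tree_counts` and `text_histogram` are made up for this illustration. Note that this text version is turned on its side, so the bins run down the page instead of along the x-axis.

```python
# Bin labels and frequencies copied from the black cherry tree table.
tree_counts = [
    ("60-65", 3),
    ("65-70", 3),
    ("70-75", 8),
    ("75-80", 10),
    ("80-85", 5),
    ("85-90", 2),
]


def text_histogram(counts):
    """Return one line per bin: the bin label, then one '#' per tree."""
    return [f"{label} | {'#' * freq}" for label, freq in counts]


for line in text_histogram(tree_counts):
    print(line)
```

The longest row is the 75-80 bin, which matches the tallest bar you would draw on paper.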

**Important terms:**
 * bins (classes): these are the different data groups on the graph; you can think of the bins as containers that collect data and fill up to a height equal to the frequency of that data group
 * unimodal: having one maximum
 * bimodal: having two maxima
 * multi-modal: having multiple maxima

**Bin width:**
 * the bin width needs to be chosen with the data distribution and the purpose of the analysis in mind --> bins that are too wide lump different values together, while bins that are too narrow leave only a few data points in each bin. In order to be statistically accurate/useful, all the bins in one histogram need to be the same width.

 * you should test different bin widths and compare the results to see which bin width gives the clearest picture of the shape of the data
 * usually the bin width is chosen so that there are between 5 and 20 classes (data groups)
 * one thing to take into account is whether the data is unimodal, bimodal, or multi-modal: bin widths that are too narrow or too wide can make it difficult to tell that the graph has more than one maximum
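Here is a small sketch of why it helps to try more than one bin width. The scores are invented for this example, and `bin_counts` and `count_peaks` are hypothetical helper names, not functions from any library. With narrow bins the data shows two maxima; with bins that are too wide the two maxima merge into one.

```python
# Made-up test scores with two clusters (mid 50s and mid 60s).
scores = [51, 53, 54, 55, 55, 56, 57, 60, 63, 64, 65, 65, 66, 67, 69]


def bin_counts(data, start, width, n_bins):
    """Count how many values fall in each equal-width bin.

    Bin i covers start + i*width up to (but not including)
    start + (i+1)*width. Values outside all bins are ignored.
    """
    counts = [0] * n_bins
    for x in data:
        index = (x - start) // width
        if 0 <= index < n_bins:
            counts[index] += 1
    return counts


def count_peaks(counts):
    """Count bins that are strictly taller than both neighbours."""
    padded = [0] + counts + [0]  # treat the outside of the graph as empty
    return sum(
        1
        for i in range(1, len(padded) - 1)
        if padded[i] > padded[i - 1] and padded[i] > padded[i + 1]
    )


narrow = bin_counts(scores, start=50, width=4, n_bins=5)
wide = bin_counts(scores, start=30, width=20, n_bins=3)

print("width 4: ", narrow, "peaks:", count_peaks(narrow))
print("width 20:", wide, "peaks:", count_peaks(wide))
```

The same 15 scores look bimodal with a width of 4 and unimodal with a width of 20, which is exactly the problem described in the list above.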

**Cumulative histogram:**
 * a cumulative histogram shows the running total of the frequencies: each bin counts the events in that bin plus the events in all the bins (data groups) before it

Example: The two histograms below are for a test given to 300 students. The first one is a regular histogram. As described above, the first bin shows that about 30 students received between a 41 and a 50 on the test, the fourth bin shows that about 90 students received between a 71 and an 80, and so on.

This is a cumulative histogram. The first bin now shows that about 30 students received AT MOST a 50 on the test. The second bin shows that about 70 students received AT MOST a 60 on the test. Thus, the bins now encompass everything up to and including the range stated, hence cumulative. Whereas in a regular histogram the bins only include their set range, the bins in a cumulative histogram also encompass the entire range before them.
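The step from a regular histogram to a cumulative one is just a running total. The sketch below uses `itertools.accumulate` from the Python standard library. Only the counts for 41-50 (about 30), 51-60 (about 40, since the running total is about 70), and 71-80 (about 90), plus the total of 300 students, come from the example; the other three counts are assumptions filled in so the numbers add up.

```python
from itertools import accumulate

bins = ["41-50", "51-60", "61-70", "71-80", "81-90", "91-100"]

# 30, 40 and 90 follow the example; 70, 50 and 20 are assumed values
# chosen only so that the total comes to 300 students.
frequencies = [30, 40, 70, 90, 50, 20]

# Each cumulative value is the bin's own count plus all earlier counts.
cumulative = list(accumulate(frequencies))

for label, freq, running in zip(bins, frequencies, cumulative):
    print(f"{label}: {freq} in this bin, {running} in this bin or below")
```

The last cumulative bar always equals the total number of data points, here 300.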

**Range:**
 * biggest value minus the smallest value

EXAMPLES:
range = 23 - 19
range = 4

range = 11 - 7
range = 4

For a more detailed explanation of range, see this page http://killianhpre2010-11.wikispaces.com/Range+and+Interquartile+Range

**Skewness:**
 * a histogram can be symmetrical, skewed to the left, or skewed to the right --> the skewness of a histogram tells you on which side the data trails off in a longer tail

The distribution of this histogram is skewed to the left, meaning (in this case) that the number of individuals with a body mass between 175 and 190 is high. The skewness of a histogram is important in telling where the bulk of the data, including the mode, lies. To learn more about mode, see this page http://killianhpre2010-11.wikispaces.com/Mean%2C+Median%2C+and+Mode

Positive and Negative Skew
 * asymmetrical histograms are positively or negatively skewed
 * a distribution that is positively skewed has its maximum towards the left (its tail trails off longer to the right) like the histogram below
 * a distribution that is negatively skewed has its maximum towards the right (its tail trails off longer to the left) like the histogram below
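If you want a number to go with the picture, one common measure (not covered on this page, so treat it as an extra) is the moment coefficient of skewness: positive for a longer right tail, negative for a longer left tail. The sketch below uses two invented data sets and a hypothetical `skewness` helper. It also shows a handy rule of thumb: in a positively skewed data set like this one, the mean gets pulled above the median.

```python
from statistics import mean, median


def skewness(data):
    """Moment coefficient of skewness: m3 / m2**1.5 (population form)."""
    n = len(data)
    avg = mean(data)
    m2 = sum((x - avg) ** 2 for x in data) / n  # second central moment
    m3 = sum((x - avg) ** 3 for x in data) / n  # third central moment
    return m3 / m2 ** 1.5


# Invented data: most values are small, one value far out to the right.
right_tail = [1, 2, 2, 3, 3, 3, 4, 10]
# Mirror image of the same data, so the long tail is on the left.
left_tail = [11 - x for x in right_tail]

for name, data in [("right_tail", right_tail), ("left_tail", left_tail)]:
    print(name, "skewness:", round(skewness(data), 2),
          "mean:", mean(data), "median:", median(data))
```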

=What can we learn from a histogram?=

1) type of frequency distribution
 * there are various shapes that can describe the frequency distribution; the most basic shape is shown in the histogram below
 * normal distribution
 * Depending on the bin width, the data can appear normally distributed. This histogram is symmetric with a classic bell-curve shape. The frequency counts in such a graph bunch in the middle and taper off at the tails. If a histogram displays such characteristics as seen below, the next step is to make a normal probability plot in order to check whether the data really follow a normal distribution.
 * Probability plots can also be used prior to creating a histogram to see if data is normally distributed. (See Normal Distribution)

2) symmetry of the histogram
 * symmetrical [[image:normah01.gif width="127" height="110"]][[image:norbih01.gif width="127" height="110"]]
 * symmetrical distributions can be unimodal, bimodal, or multi-modal
 * asymmetrical [[image:asym1o.gif width="127" height="110"]]
 * asymmetrical distributions can be unimodal, bimodal, or multi-modal

3) whether the data is unimodal, bimodal, or multi-modal
 * unimodal: one maximum [[image:normah01.gif width="127" height="110"]]
 * bimodal: two maxima [[image:norbih01.gif width="126" height="109"]]

4) probability distribution
 * A histogram shows a frequency distribution --> you can convert that to a probability distribution by dividing the frequency of the data in each bin by the total number of data points
 * This gives the probability that a randomly chosen data point falls in a given bin. (See Probability Distribution)
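As a quick worked example of that division, the sketch below reuses the black cherry tree counts from earlier on this page (31 trees in total). The variable names are only for illustration.

```python
bins = ["60-65", "65-70", "70-75", "75-80", "80-85", "85-90"]
frequencies = [3, 3, 8, 10, 5, 2]  # from the black cherry tree table

total = sum(frequencies)

# Probability for a bin = frequency in that bin / total number of trees.
probabilities = [freq / total for freq in frequencies]

for label, p in zip(bins, probabilities):
    print(f"{label} feet: {p:.3f}")
```

The probabilities always add up to 1, because every tree lands in exactly one bin.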

5) the range or variation of the data
 * Standard deviation, the variability/diversity of a data set (roughly, the typical distance of the data from the mean), is connected to histograms (See Standard Deviation). Histograms whose data are spread more widely along the x-axis will have a higher standard deviation and range. For example, in the two histograms below, histogram 1 has a greater standard deviation than histogram 2. Standard deviation also relates to the bell-shaped curve of the normal distribution; this can be seen in histograms as well.
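To see spread as a number, here is a minimal sketch using `statistics.pstdev` (population standard deviation) from the Python standard library. The two data sets are invented and are not the data behind the histograms on this page; both have the same mean of 5, but one is bunched up and the other is spread out.

```python
from statistics import mean, pstdev

# Invented data sets with the same mean but different spread.
bunched = [4, 5, 5, 5, 6]
spread_out = [1, 3, 5, 7, 9]

for name, data in [("bunched", bunched), ("spread_out", spread_out)]:
    data_range = max(data) - min(data)  # biggest value minus smallest value
    print(name, "mean:", mean(data),
          "range:", data_range,
          "standard deviation:", round(pstdev(data), 2))
```

The spread-out data set has both the larger range and the larger standard deviation, and its histogram would be the wider, flatter one.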



6) outliers in data set
As is shown in the image below, outliers are easy to spot when a data set is drawn as a histogram. The bar to the left that does not match up with the pattern of the rest of the data is an outlier. http://www.itl.nist.gov/div898/handbook/eda/section3/histogr8.htm

=How do you read a histogram?=
 * The bins are labeled on the x-axis with ranges of a quantitative variable.
 * The height of the bin (corresponding with a point on the y-axis) shows the size of the data class defined by the bin label.

=Practice SAT Questions=

1) This histogram is:

A) Bimodal, Symmetric B) Bimodal, Asymmetric C) Unimodal, Asymmetric D) Multi-modal, Asymmetric

Answer: C because the graph does not display symmetry (see high frequency of the second bin) and has one maximum

2) How many people paid more than $2?

A) 5 B) 6 C) 7 D) 8

Answer: B because if you find the frequencies of the fourth bin (3), sixth bin (2), and the seventh bin (1) the total is six



3) Approximately how many more people worked between 20-22 hours than 10-12 hours?

A) 50 B) 35 C) 40 D) 25

Answer: A because the 20-22 bin has a frequency of about 95 and the 10-12 bin has a frequency of about 42. (95-42=53)

4) What is the range of hours that the minimum number of people worked?

A) 10-12 B) 12-14 C) 16-18 D) 22-24

Answer: D because bin 22-24 has the lowest frequency.



5) In which years were the maintenance costs above $150?

A) 1998-1999 B) 1996-1997 C) 1994-1995 D) 1995-1996

Answer: A because if you look at the corresponding values for each bin ($150 for 1996-1997 and $200 for 1998-1999), it is clear that only 1998-1999 has a maintenance cost above $150.

__Sources__
 * http://www.netmba.com/statistics/histogram/
 * https://www.msu.edu/user/sw/statrev/strv112.htm
 * http://stattrek.com/statistics/dictionary.aspx
 * http://pirate.shu.edu/~wachsmut/Teaching/MATH1101/Graphs/histograms.html
 * http://www.itl.nist.gov/div898/handbook/eda/section3/normprpl.htm
 * http://en.wikipedia.org/wiki/Histogram
 * http://worksheets.tutorvista.com/histogram-interpretation-worksheet.html?page=1
 * http://www.spcforexcel.com/explaining-standard-deviation