o Univariate data
o Bivariate data
o Multivariate data
o Learn why a knowledge of statistics is important and helpful
o Recognize and understand the meaning of various key terms
Introduction to Statistics
Statistics is a subject that has earned a certain amount of notoriety because of its misuse in various contexts. Nevertheless, statistics is a tool that, if used properly, can be of tremendous help in math, science, engineering, history, politics, and numerous other fields. As you study this subject, always keep in mind that statistics is more than just math: it is not simply manipulation of numbers through addition, subtraction, multiplication, division, and other mathematical operations. Statistics also involves language and units: when a statistician (or layman) provides a statistic, it involves a number and a label of some sort. For instance, the number 5.3 is not in and of itself a statistical value; "an average age of 5.3 years," however, is a statistical value. This linguistic aspect of statistics sometimes allows a certain amount of ambiguity that can be misleading. By studying statistics, you will equip yourself to identify and understand both uses and abuses of this tool.
Statistics is used for quantifying sets of data, such as attributes of a group of people or measurements taken in a laboratory. Consider, for instance, the population of a particular country. The people who reside in that country have varying heights: some are short, some are tall, some are in between. If we wanted to compare the height of this population with that of some other population in a convenient manner, we would not want to compare individual people. Such a task would be burdensome (the number of people in a country might be in the millions or billions) and would not necessarily be helpful as a means of comparing populations as a whole. Instead, we can use an average or median height as the basis for our comparison. These statistical values are single numbers that summarize the data (the heights of a country's population) and that provide a convenient way to express and compare certain characteristics of those data. Part of the goal of studying statistics is to learn how to select and use statistical tools like averages and medians, as well as a host of others, in assessing and comparing data.
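As a rough illustration of the comparison described above, the following Python sketch computes the average (mean) and median height for two small groups. The height values and the names heights_a and heights_b are made up for this example and do not describe any real population.

```python
from statistics import mean, median

# Hypothetical heights in centimeters (invented for illustration only)
heights_a = [150, 160, 165, 170, 180]
heights_b = [155, 158, 162, 175, 200]

# A single number summarizes each group, so the groups can be compared
# without comparing individual people
mean_a, median_a = mean(heights_a), median(heights_a)
mean_b, median_b = mean(heights_b), median(heights_b)

print("Group A: mean", mean_a, "median", median_a)
print("Group B: mean", mean_b, "median", median_b)
```

In this made-up data, group B has the larger mean but the smaller median, because one unusually tall member pulls the mean upward. Which statistic is the better basis for comparison depends on the question being asked.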
Simply defined, statistics (sometimes colloquially termed "stats") is the study of collecting, analyzing, interpreting, and representing sets of numerical data. Thus, virtually any field of study that uses numbers can, at least occasionally, involve statistics. Because statistics makes extensive use of numbers, it is math-intensive, and a decent grasp of basic arithmetic and algebra is required to study this field.
Types of Data
A set of data can involve a single variable or multiple variables. For our purposes, we will only consider data sets that involve one variable (univariate data) or two variables (bivariate data). The height of persons in a particular country is an example of univariate data, since there is only one variable: height. An example of bivariate data is a person's height with respect to his age; in this case, there are two variables: age and height. Much of how we handle univariate data can also be extended to multivariate data (two, three, or more variables).
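One way to picture the difference is to look at how each kind of data might be stored. The sketch below is only an illustration: the values are invented, and the helper variables_per_observation is a hypothetical name introduced here, not a standard function.

```python
# Univariate data: one value (height in cm) per observation
heights_cm = [152.0, 167.5, 171.2, 180.3]

# Bivariate data: two values (age in years, height in cm) per observation
age_height = [(8, 128.0), (12, 150.5), (16, 170.0), (20, 176.5)]


def variables_per_observation(records):
    """Return how many variables each observation carries.

    Assumes every observation has the same shape as the first one.
    """
    first = records[0]
    return len(first) if isinstance(first, tuple) else 1


print(variables_per_observation(heights_cm))   # one variable: univariate
print(variables_per_observation(age_height))   # two variables: bivariate
```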
Practice Problem: A pollster wants to find out the relationship between age and income for a certain segment of the population. How many variables are involved in the data that the pollster must analyze?
Solution: The pollster is looking at two different aspects of the population: age and income. Thus, he is dealing with two different variables (bivariate data).
Populations versus Samples
Obviously, it is not always possible for a scientist, pollster, anthropologist, journalist, or other professional (or non-professional) to measure or consider every last member of a particular set to calculate a statistic. Strictly speaking, however, an exact value requires measuring every member. Thus, if you wanted to calculate the average gas mileage of all vehicles currently on the road, you would need to measure (or simply record) the gas mileage of every last vehicle--a daunting task that would almost certainly not be worth the time and trouble. Instead of taking this exhaustive approach, you might measure or record the mileage of a representative subset of the vehicles on the road and use this subset to calculate the average. Such an approach offers obvious benefits, but it also requires more care, since assumptions must be made concerning what constitutes a "representative" subset. The set of all vehicles would be considered a population--the totality of all members of a particular set. A subset of vehicles used for the purposes of statistical analysis is a sample--some portion of the population. Whether or not a particular sample is actually representative may be debatable (inappropriately skewing a sample is one potential way in which statistics can be abused).
It is important to note that the term population need not necessarily refer to people. A population can be the set of all vehicles, the set of all potential outcomes of an event or series of events, or the set of all entities of a given type (for instance, the set of all stars in the universe). The population in a given context is simply the set of all instances from which we might choose a sample for statistical analysis.
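The gas mileage example can be sketched in Python under some loudly stated assumptions: the "population" below is simulated rather than measured, the numbers 25 and 5 (center and spread, in miles per gallon) are arbitrary, and a simple random sample is used as one possible way of choosing a subset. The text does not prescribe any particular sampling method.

```python
import random
from statistics import mean

random.seed(0)  # fixed seed so the sketch gives the same result on each run

# Hypothetical population: gas mileage (mpg) of 10,000 simulated vehicles
population = [random.gauss(25, 5) for _ in range(10_000)]

# Sample: 100 vehicles chosen at random, without replacement
sample = random.sample(population, 100)

population_mean = mean(population)  # exact value, requires every member
sample_mean = mean(sample)          # estimate from a small subset

print(f"population mean: {population_mean:.2f}")
print(f"sample mean:     {sample_mean:.2f}")
```

The sample mean will generally be close to the population mean but not equal to it. If the sample were chosen in a skewed way (for instance, only the most efficient vehicles), the two values could differ substantially.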
To differentiate between statistical calculations for populations versus those for samples, Greek letters are used for the former (for instance, σ for the population standard deviation) and Latin letters are used for the latter (for instance, s for the sample standard deviation). Although these choices are purely conventional, they are helpful in avoiding confusion and ambiguity.
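As a small illustration of the two symbols, Python's standard statistics module provides pstdev for a population standard deviation and stdev for a sample standard deviation. The sample version divides by n - 1 rather than n, so it comes out slightly larger for the same numbers. The data values below are made up, and treating the same list first as a population and then as a sample is done only to show the difference.

```python
from statistics import pstdev, stdev

# Hypothetical measurements (invented for illustration only)
data = [2, 4, 4, 4, 5, 5, 7, 9]

# sigma: standard deviation if the list is the entire population
sigma = pstdev(data)

# s: standard deviation if the list is a sample drawn from a larger population
s = stdev(data)

print(f"sigma (population) = {sigma:.3f}")
print(f"s (sample)         = {s:.3f}")
```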
Practice Problem: An investor wants to determine how to diversify his portfolio. To quantify each potential investment area (technology, financial, manufacturing, and so on), he calculates the average growth of the top 20 companies in each area. Is the average growth in each area that of the population or a sample?
Solution: Each area of investment could have numerous companies--probably far more than 20. Thus, the investor is calculating the growth of a sample of companies from each area rather than that of the population.