Opinion Polling ... some of the Math

I keep hearing the results of opinion polls, on TV, and thought it'd be neat to study the methods and ...

>And generate a few formulas?
Well ... uh, yes.
In particular, you ask a question of a thousand people and get what pollsters think is the opinion of the entire country!

>With some error, right?
Yes, with a 3% error.
So here's what we'll do ... as an example to illustrate:

• We ask a question which has a YES or NO answer. (Example: "Do you think the moon is made of green cheese?")
• Let's assume that, were we to ask the question of the entire country, we'd get 25% answering YES.
However, we'll only ask m people this question.
• What are the chances that, of these m people, 25% answer YES?
>But that's the result if we asked the whole country, right? So that's not likely, eh?
I agree. It's like tossing a coin m times. We'd expect half would be heads and half tails, but it's unlikely we'd get exactly half.
In fact, there's a probability distribution associated with tossing coins ... or asking our question.
Do you remember when we talked about the Binomial Distribution?

>No.
Okay, we ask the question of 1000 people and we do this many, many times.
What would we expect to be the percentage of YES answers?
>I'd say 25% of 1000. That's 250 YES answers.
But there will be variations in this percentage. We could, in fact, get all 100% answering YES ... or maybe 0% answer YES.
>Like getting 1000 heads or no heads ... when tossing a coin, right?
Right. We'd expect a distribution of percentages, with a probability associated with each.
>So 1000 heads, the probability is small. For 250 heads, the probability is high. Am I right?
Yes. In fact, the distribution is the ...
>The binomial distribution!
Exactly, and if we ask a lot of people (like 1000 or more) the Binomial distribution is very like our ol' friend the Normal distribution (as seen in Figure 1).

Now the nice thing about the Normal distribution (or the Binomial distribution with m large, like 1000 or more) is that we'd expect 95% of the responses to be within two Standard Deviations of the Mean and, for the Binomial distribution we saw that:
 Mean = mp       Standard Deviation = [mpq]^(1/2) = [mp(1-p)]^(1/2)

Figure 1

>So, for m = 1000 and p = 25% we'd get ... uh ...
We'd expect an average number of YES responses of: (1000)(0.25) = 250 and ...
>That's what I said!
Yes, and we'd expect that the Standard Deviation of the number of YES responses would be [(1000)(0.25)(0.75)]^(1/2), which is about 13.7, so two Standard Deviations from that 250 Mean gives an interval from 250 - 2(13.7) to 250 + 2(13.7), or 223 to 277, so ...
>So if you asked 1000 people you'd expect that 95% of them would answer YES?
No! You'd expect that between 223 and 277 would answer YES.
>Can you be sure of that?
No, of course not. You could get all 1000 answering YES. However, if you were to repeat this questioning many, many times you'd expect that, 95% of the time, you'd get between 223 and 277 YES responses to your question.
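
Here's a minimal sketch of that arithmetic in Python, assuming m = 1000 and p = 0.25 as in the example. The function names (binomial_summary, binomial_pmf) are made up for illustration. It also adds up the exact Binomial probabilities between 223 and 277, to see how close to 95% the two-Standard-Deviation rule really is.

```python
import math

def binomial_summary(m, p):
    """Mean, Standard Deviation and two-SD interval for the number of YES answers."""
    mean = m * p
    sd = math.sqrt(m * p * (1 - p))
    return mean, sd, mean - 2 * sd, mean + 2 * sd

def binomial_pmf(k, m, p):
    """Exact Binomial probability of k YES answers out of m (logs avoid overflow)."""
    log_coeff = math.lgamma(m + 1) - math.lgamma(k + 1) - math.lgamma(m - k + 1)
    return math.exp(log_coeff + k * math.log(p) + (m - k) * math.log(1 - p))

m, p = 1000, 0.25
mean, sd, low, high = binomial_summary(m, p)

# Whole numbers of YES answers that fall inside the two-SD interval
first, last = math.ceil(low), math.floor(high)

# Exact probability that the YES count lands in that range
coverage = sum(binomial_pmf(k, m, p) for k in range(first, last + 1))

print(f"Mean = {mean:.0f}, Standard Deviation = {sd:.1f}")
print(f"Interval = {first} to {last}, exact probability = {coverage:.3f}")
```

The exact probability should come out near, but not exactly at, 95%, since the Binomial is only approximately Normal.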

>But we knew the answer before we started. I mean, we knew that about 25% would answer YES. I mean ...
Okay, we knew p = 25%. Our problem now is to estimate p from the responses we get from our question.

>What if you got, say 270 YES answers? That's 27% of your 1000 people, right? What could you say about ...
About the whole country? We could say that there's a 95% probability that the number 270 lies within two Standard Deviations of the Mean.

>But you don't know the Mean!
Pay attention! This is what we can say:

1. Two Standard Deviations from the Mean means the number of YES responses that we got lies in the interval from
mp - 2[mp(1-p)]^(1/2)       to       mp + 2[mp(1-p)]^(1/2)       with a 95% probability.

2. Hence the fraction answering YES lies in the interval from
p - 2[p(1-p)/m]^(1/2)       to       p + 2[p(1-p)/m]^(1/2)       with a 95% probability.
Here we divided by m to get the fraction answering YES.
3. Since p is between 0 and 1, the largest value that 2[p(1-p)]^(1/2) can have is "1" (and this occurs at p = 1/2).
4. Hence (with 95% probability) we can expect our fraction of YES responses to lie between p - 1/m^(1/2)     and     p + 1/m^(1/2)
5. Since p is the country-wide fraction, if we want to be within say 3% of p we'd want 1/m^(1/2) = 0.03 (that's our 3%).
That'd give us a value for m, namely m = ...

>That's m = (1/.03)^2, right?
>I get m = (1/.03)^2 = 1111.
Yes. Well, the Binomial distribution isn't exactly the Normal distribution, so we can't expect 95% within exactly two Standard Deviations. We should really use the Binomial rather than the Normal and ...
>So we're talking "ballpark", eh?
Well, yes ... but George Gallup would do it properly. Besides, my aim was just to illustrate that, in opinion polls, we need only ask about a thousand people to get an estimate of how the entire country would answer.
>With a 95% probability of being right?
Well, a 95% probability of being within 3% of the country-wide response ... or, as the pollsters say, 19 times out of 20.
>And you believe all this stuff?
Of course! It's statistics and statistics is never wrong!
>Somebuddy told me that a statistician can draw a straight line from an unwarranted assumption to a foregone conclusion.
Go back to sleep.
>zzzZZZ
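
For anyone still awake, here's a rough Python sketch of steps 3 to 5 above. It checks on a grid of p values that 2[p(1-p)]^(1/2) never exceeds 1, then solves 1/m^(1/2) = 0.03 for m. The helper names (half_width, worst_case_half_width) are invented for this illustration, and the grid check is a sanity check, not a proof.

```python
import math

def half_width(m, p):
    """Two-Standard-Deviation half-width for the fraction answering YES."""
    return 2 * math.sqrt(p * (1 - p) / m)

def worst_case_half_width(m):
    """The bound 1/sqrt(m), reached when p = 1/2."""
    return 1 / math.sqrt(m)

# Step 3: the largest value of 2*sqrt(p(1-p)) for p = 0.00, 0.01, ..., 1.00
grid = [i / 100 for i in range(101)]
largest = max(2 * math.sqrt(p * (1 - p)) for p in grid)
print(f"largest value of 2*sqrt(p(1-p)) on the grid: {largest}")

# Step 4: for p = 25% the interval is narrower than the worst case
print(f"half-width at p = 0.25: {half_width(1000, 0.25):.4f}")
print(f"worst case (p = 0.5):   {worst_case_half_width(1000):.4f}")

# Step 5: solve 1/sqrt(m) = 0.03 for m
people_needed = (1 / 0.03) ** 2
print(f"people needed for a 3% error: {people_needed:.0f}")
```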

Note:
Above, we said that 95% of the poll results would be expected to lie within the interval Mean +/- 2 Standard Deviations.
Actually, for a Normal distribution, 95% would lie in Mean +/- 1.96 Standard Deviations so we could (to be fussy about it) change 4, above, to read:

4. Hence (with 95% probability) we can expect our fraction of YES responses to lie between p - (1/2)(1.96)/m^(1/2)     and     p + (1/2)(1.96)/m^(1/2)

Conclusion?
The number of people that must be polled in order to get a Margin of Error (MoE) of X% is given (roughly!) by (0.98/X)^2
(where, for a 3% Margin of Error, we'd put X = 0.03) ... and that's shown in this chart:

Number of People: N = (0.98/X)^2

If you wanted a 3% Margin of Error,
meaning that X = 0.03 (that's equivalent to X% = 3%),
then you'd poll (0.98/0.03)^2 = 1067 people
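
The same formula can be run in both directions. Here's a small Python sketch, assuming the rough 0.98 factor (that's (1/2)(1.96)) and a 95% confidence level. The function names number_to_poll and margin_of_error are just labels chosen for this example.

```python
import math

HALF_Z = 0.98  # (1/2)(1.96): worst case p = 1/2 at the 95% confidence level

def number_to_poll(moe):
    """Rough number of people to poll for a Margin of Error 'moe' (0.03 means 3%)."""
    return (HALF_Z / moe) ** 2

def margin_of_error(n):
    """Rough Margin of Error (as a fraction) when n people are polled."""
    return HALF_Z / math.sqrt(n)

for moe in (0.01, 0.02, 0.03, 0.05):
    print(f"Margin of Error {moe:.0%}: poll about {number_to_poll(moe):.0f} people")

for n in (500, 1000, 2000):
    print(f"Number Polled {n}: Margin of Error about {margin_of_error(n):.1%}")
```

Notice that halving the Margin of Error means polling four times as many people.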


 the Bush-Kerry Polls ... October, 2004
Here's something interesting:
Suppose:

• There are k polls, with the number of people polled equal to   n1, n2, ... nk   where
n1+ n2+ ... +nk = N   (where N is the total number polled, in ALL k polls)
• In these polls the percentages of YES votes are:
p1, p2, ... pk.
• Then the number of YES votes in each of these polls is:
p1 n1, p2 n2, ... pk nk.
• Then the total number of YES votes for all N people polled is:
p1 n1+ p2 n2+ ... +pk nk.
• Hence, percentage of YES votes (for all N people polled) is:
[ p1 n1+ p2 n2+ ... +pk nk ] / N = [ p1 n1+ p2 n2+ ... +pk nk ] / [n1+ n2+ ... +nk ]
which is a weighted average of the various polls' percentages p1, p2, etc.
• This gives a polling result that involves not just n1 or n2 (etc.) people, but N people!
... and the Margin of Error is then 1/SQRT(N), not just 1/SQRT(n1) or 1/SQRT(n2) etc.
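
The bullets above can be sketched in a few lines of Python. The three polls in this example are hypothetical (made-up sizes and fractions, chosen only to show the arithmetic), and pooled_result is an illustrative name. Like the text, the sketch treats all N people as one big poll, which assumes the polls asked the same question of comparable samples.

```python
import math

def pooled_result(sizes, fractions):
    """Combine k polls into one: weighted average of the YES fractions,
    total number polled, and the rough 1/sqrt(N) Margin of Error."""
    n_total = sum(sizes)
    yes_total = sum(p * n for p, n in zip(fractions, sizes))
    return yes_total / n_total, n_total, 1 / math.sqrt(n_total)

# Hypothetical polls: NOT real data
sizes = [1000, 500, 1500]
fractions = [0.50, 0.44, 0.48]

pooled, n_total, moe = pooled_result(sizes, fractions)
simple_average = sum(fractions) / len(fractions)

print(f"weighted average = {pooled:.4f} over N = {n_total} people")
print(f"simple average   = {simple_average:.4f} (ignores the poll sizes)")
print(f"Margin of Error  = {moe:.1%}")
```

The weighted and simple averages differ whenever the polls are of different sizes, which is why the weighting matters.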

>So?

So here's some recent stuff (where people are asked which fellow they'd vote for):

The Margin of Error runs from about 2.4% to 3.3% (depending upon how many people were polled).
But the TOTAL number of people polled was:
n1+n2+n3+n4 = 1666 + 943 + 1195 + 881 = 4685.

The total number of people who say "Bush" (from among these 4685) was then:
p1 n1+p2 n2+p3 n3+p4 n4 = 2284.

Hence the percentage of those who say "Bush" is:
2284 / 4685 = 49%.

>Yeah, so?
Aaah, but now the Margin of Error is something like 1 / SQRT(4685) = 1.5%

>Hey! Neat! What about Kerry?
That's left as an exercise.
>And does anybuddy actually do this?
Yes. I hope that's what CNN's Poll of Polls does.
(CNN describes it as "... an average of selected national public polling data ...")
(The underlining is mine. I hope they don't just take a garden-variety "average" but rather the weighted average !)
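
As a check on the October numbers, here's a short Python sketch that uses only the figures quoted above: the four poll sizes and the 2284 "Bush" responses. The comparison of individual margins uses the rough 0.98/SQRT(n) formula, which is an assumption about how the quoted 2.4% to 3.3% range was obtained.

```python
import math

sizes = [1666, 943, 1195, 881]   # people polled in each of the four polls
bush_total = 2284                # total "Bush" responses quoted above

n_total = sum(sizes)
bush_fraction = bush_total / n_total
pooled_moe = 1 / math.sqrt(n_total)

print(f"N = {n_total}")
print(f"Bush: {bush_fraction:.0%}")
print(f"pooled Margin of Error: about {pooled_moe:.1%}")

# Assumption: individual margins computed as 0.98/sqrt(n)
largest_poll_moe = 0.98 / math.sqrt(max(sizes))
smallest_poll_moe = 0.98 / math.sqrt(min(sizes))
print(f"individual margins: {largest_poll_moe:.1%} to {smallest_poll_moe:.1%}")
```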

 the Bush-Kerry Polls ... November, 2004

A day before the 2004 election, CNN released their Poll of Polls (for the national vote):

>And the actual result?

That was   Bush 51%