As an applied mathematician I found the issue of contradictory polls fascinating, and by using the variation between polls I was able to extract the most likely voting distribution of those not being polled. A key point in that analysis is the realization that many rapid polls can skew the demographic of those polled, giving the false impression that views held by a precious few are applicable throughout Toronto.

The recent successful prediction of the US presidential election, in contradiction to many pundits, is a testament to the predictive power of a well-posed model and the luxury of a vast number of polls that can be systematically weighted according to their historically proven reliability. Unfortunately, the mathematical theory of this approach falters when applied to the mayoral race in the city of Toronto due to a lack of data. With a significantly smaller number of polls, reconstruction of the true voting distribution is still possible, but it must be done in a smarter way.

In my quest to build a prediction model for the mayoral race I have made some progress and gained some insight into the components that would be required. With respect to municipal politics in Toronto, one must contend with 44 virtually independent wards, each with its own unique set of issues. Prediction schemes that do not take this into account will simply not capture the multifaceted viewpoints presented at city council. If we also assume that voters are reasonable and change their allegiance only at isolated times, then we nearly have a well-posed problem. What remains is to model a mechanism that instills changes in voting patterns. For this, inferring voter agency is key, and it is nuanced through how well an individual believes their issues are represented at city council, contrasted with the bloc voting patterns of city councillors. The challenge is to treat the voting prediction as a hidden distribution that is simultaneously able to optimally recapture polling results while remaining faithful to the social-political reality of Toronto.

Just how contentious the voting public can be was made abundantly clear to me by watching a poll about policies be subverted into a conspiracy: basically a textbook example of the politics of paranoia.

Searching for solutions that optimally resolve seemingly contradictory information, rather than focussing on the contradictions directly, is a common theme in mathematics. All mathematical models are only as good as the quality of the data they hope to model. By being well-informed of the issues and open to all sides of the debate, the true voting distribution of Toronto can be revealed.

Please comment, I’d like to hear your thoughts on these issues.

The initial question I had this morning was to try to explain how polling firms can unintentionally influence voters through their proprietary weighting schemes. From a mathematical standpoint, the complexity introduced into the voting process when strategic voting is modelled destabilizes it. One ends up with a situation where “the tail of polling” begins to “wag the dog of the voter”. With this analogy, it is safe to infer that the polling infrastructure itself is in crisis.

What I tried to do this morning was to get a sense of how people felt about the policies that they were seeing developed by each of the four front-running candidates for mayor. They were asked to consider the following:

## Ignoring all polls and simply based on the content of the candidates’ platforms, who would you vote for in Toronto’s mayoral election?

Candidates were listed in alphabetical order by last name, and I asked everyone I could to spread the word and express what they thought. It was posted to Facebook, Reddit and Twitter, and with each posting I asked that it be reposted, shared, and re-tweeted. The poll remained open from 11:15 a.m. to 1:15 p.m. and used only a simple safeguard based on IP address to prevent repeat voting.

What was exceedingly interesting to me was not the final result, but how the voting distribution developed, and I learned that it is not analysis or quantification of results that drives voters. It is sociology.

In the first fifteen minutes of the vote Chow and Soknacki garnered nearly all of the votes, but with only about 30 votes cast (Chow 6, Ford 3, Tory 2, Soknacki 21, Undecided 1), this was not statistically significant (with so few votes, the proportions are accurate only to roughly 1 in 5). It took about 30 minutes for the internet to realize that this poll was out there and for people to start taking notice. I was struck by how stable the distribution was up to this point, with Soknacki’s platform the clear leader, followed distantly by a smoothly decreasing distribution of Chow, then Ford, then Tory, and finally Undecided. This is when things started to get interesting.

After posting my poll on various websites I was contacted by an ardent and publicly well-known anti-Ford activist. His concern was that other candidates were not included in my poll. Now, keep in mind that I had not yet expressed the purpose of the poll, and I disregarded the question. This tweeter, who was the leader of the anti-Ford “Shirtless Horde,” then went on to attack my poll as Soknacki’s numbers rose. After about 90 minutes, when it was apparent that Soknacki was maintaining this lead, something truly fascinating occurred.

The next tweet from this person sent a link to the poll to a vocal and active **pro-Ford group**, and at this very point Ford’s numbers began to rise. Let me repeat this for emphasis.

## An anti-Ford leader joined forces with a pro-Ford organization!

What I was witnessing was a voter, feeling first disenfranchised by a process which was purposely made unclear, who then reacted by ensuring **at any cost** that another candidate would not benefit from this process.

The fear of one person being in the lead, regardless of the fact that the numbers in my poll were inconsequential, was all that it took for this individual to work toward the retention of a mayor, and in this action to inadvertently unravel months of hard work spent preventing exactly this situation!

This single reactionary tweet caused a cascade in the polling where fear and a sense of disenfranchisement replaced a reflective comparison of the various platforms. Putting this into context: the candidates that were not included in my poll are also excluded from mainstream polling and most debates, so what I had done was not dissimilar to recent polling efforts.

Rather than have another candidate they do not hold in high esteem lead a pointless poll, this person was willing to turn to a candidate whose politics, judging by their actions, are vastly different from their own. That reaction, that path to a strategic voting plan, changed the outcome of the poll.

The rest of the voting distribution remained invariant while Ford’s numbers started to rise. Concurrently, word of the poll started to gain traction on the Twitter stream I was monitoring, and as the tone became more divisive, Ford slowly closed on the substantial lead that Soknacki had built up over the previous hour. By 12:45 pm (90 minutes elapsed) it was essentially a dead heat. This time coincided with a link to the poll being posted to a pro-Rob Ford site. Soon after this, Ford overtook Soknacki and never looked back. At 12:57 pm the counts were Chow 36, Ford 106, Tory 23, Soknacki 81 and Undecided 8, and the final distribution at 1:15 pm is displayed below.

At the end of the day, what my experiment revealed is that for democracy to truly work, we need to allow ourselves to be active participants in it. This poll was presented to voters via social media with no true description of its purpose. Once the link was posted, it automatically triggered a strategic voting response to counteract a fear, or sense, of loss.

Democracy is not a reaction to fear. Democracy is not a reaction to the supposed outcome as promoted. What democracy is, is an opportunity for you to ask elected leaders to represent your principles, to represent your vision of the future. To simply vote in reaction to polls, which are malleable to various interpretations, is to step away from your democratic voice.

There is a very important historical reference for this. Rousseau, the last of the social contract theorists, believed that a democracy such as that in England in his time led the English people to believe they were free. He disagreed, feeling they were greatly mistaken and that they were only truly free during an election of the members of Parliament. Once officials were elected, the populace was effectively enslaved by their choice. The English people made only paltry use of their moral civil freedom through politics, enacting it only in the brief moments of elections, and Rousseau believed this squandering of liberty warranted their ultimate loss of it.

Rousseau is asking us to use our democratic freedom, not simply in the act of voting, but through the act of being civically engaged. The person whom you vote for is not simply an \(X\) you put on a ballot; they are the person with whom you will work for the next four years to build the city, the province, and the nation in which you live. That person must be someone you can work with, and someone that will represent your voice in those instances when you cannot actively participate yourself.

Perhaps Rousseau is right, that we have a citizenry who only invoke their voice in the general assembly during elections, and are then enslaved through a loss of moral civil freedoms. Perhaps, making such little use of our liberties, we have effectively lost them. Personally, I would like to believe that with a resurgence of activism and of protest, both physically and in the virtual realm, the voice of our moral freedom is on the rise. Through civil action, and through bringing together a multitude of voices into the public sphere, we are finding our liberty and moral freedom as we find our voice.

In a June 29, 2013 article in the Globe and Mail, Naheed Nenshi, mayor of Calgary, summed this up succinctly:

We as citizens have the power to take people from devastation to hope.

The story does not really end here for the poll this afternoon. It seems that in the wake of the tipping point experienced in the polling exercise, and in the afterglow of slaying imaginary dragons, a conspiracy has taken root. As this is evolving by the minute, I leave it as an amusing piece of homework (I’m a prof, I can’t help myself) to go check it out for yourself.

If I could just take one final moment of your attention. Reflect on my initial question in the poll. This is important, this will be on the test…

## Ignoring all polls and simply based on the content of the candidates’ platforms, who would you vote for in Toronto’s mayoral election?

Repeat it, repeat it to your kids, repeat it to your significant others, repeat it to yourself. Only when this is the question you ask yourself while staring at that blank voting sheet will you be practicing democracy. Think on that.

For more on the social and political analysis of what transpired this afternoon refer to the Philosopher of Write.

I have attempted here to hide the underlying mathematics since, let’s be honest, I’m in a minority for the joy I feel with respect to this. At any rate, there is a rich mathematical structure at play here that predicts that this poll would have ended up as a two-way race no matter how the participants behaved. If you would like to read more about this, then consider the following extra readings.

1) David P. Myatt (2001), *Strategic Voting Incentives in a Three Party System*

2) Mark Frey (2007), *Duverger’s Law Without Strategic Voting*

3) Ken Kawai and Yasutora Watanabe (2012), *Inferring Strategic Voting*

When we attempt to find a solution that matches the polling results exactly, a solution is possible only if \(p=0\). To include other values of \(p\) we have to relax this notion and instead consider solutions that match as closely as possible in some sense. That is, we look for \(\vec{\alpha}\) such that some measure of distance between the Ipsos Reid and Forum Research results is minimized. Mathematically, we are looking for\[\vec{\alpha}^* = \underset{\vec{\alpha}}{\operatorname{argmin}}\| \vec{b}_{\textrm{I}}-(p\vec{b}_{\textrm{F}}+(1-p)\vec{\alpha})\|,\]where \(\vec{b}_{\textrm{I}}=(0.36,0.20,0.28,0.13,0.03,0)^\top\) and \(\vec{b}_{\textrm{F}}=(0.31,0.31,0.27,0.06,0.02,0.03)^\top\) are the Ipsos Reid and Forum Research data from the table. The double vertical bars mean that we are taking a norm, and there are a number of choices that could be made. For convenience we take the Euclidean norm. Other notions of distance, such as the Manhattan norm or the max norm, could be used; in a finite-dimensional vector space (in our case 6-dimensional, for the 6 lines in the table) all of these norms are equivalent, but the alternatives are typically more computationally intensive and may require sub-differential algorithms (which quantify change at points of non-differentiability). Finally, the notation \(\operatorname{argmin}\) denotes that we want to minimize something and that, rather than the minimum value itself, we are interested in extracting where the minimum occurs.

However, there are also a number of constraints. In particular, each \(\alpha_i\) is a proportion and, taken together, they should give 100%. This translates into the constraints \(0 \le \alpha_i\le 1 \ \forall i\) and \(\alpha_1+\alpha_2+\alpha_3+\alpha_4+\alpha_5+\alpha_6=1\). Each of these constraints, all 13 of them, gives a separate condition. Defining \begin{align*}f(\vec{\alpha};p) &= \| \vec{b}_{\textrm{I}}-(p\vec{b}_{\textrm{F}}+(1-p)\vec{\alpha})\|_2^2,\end{align*}and\begin{align*}g_{1,i}(\vec{\alpha}) &= -\alpha_i \le 0,&g_{2,i}(\vec{\alpha}) &= \alpha_i - 1 \le 0,&h(\vec{\alpha}) &= \sum_{i=1}^6\alpha_i - 1 = 0\end{align*}allows us to state the problem in a standard form.

For any \(0\le p\le 1\) find \[\vec{\alpha}^* = \underset{\vec{\alpha}}{\operatorname{argmin}}f(\vec{\alpha};p)\] subject to\begin{align*}g_{1,i}(\vec{\alpha})&\le 0, & g_{2,i}(\vec{\alpha})&\le 0, & h(\vec{\alpha})&=0.\end{align*}This is known as a nonlinear programming problem in the optimization literature and there are a number of algorithms to efficiently solve this problem numerically. In our case the function \(f\) is strictly convex (being a squared norm), each \(g_{i,j}\) is linear and therefore convex, and \(h\) is affine. These conditions ensure that for each value of \(p\) there is a unique optimal distribution that can be found.
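Before turning to the analytic solution, a short numerical sketch may help. Because the upper bounds \(\alpha_i \le 1\) are implied by nonnegativity together with the sum constraint, minimizing \(f\) over the feasible set is the same as projecting \(\vec{c} = (\vec{b}_{\textrm{I}} - p\vec{b}_{\textrm{F}})/(1-p)\) onto the probability simplex in the Euclidean sense. The script below is my own illustration of that idea (not the original code used for this post), using the standard sort-based projection algorithm and the poll values from the table:

```python
import numpy as np

b_I = np.array([0.36, 0.20, 0.28, 0.13, 0.03, 0.00])  # Ipsos Reid
b_F = np.array([0.31, 0.31, 0.27, 0.06, 0.02, 0.03])  # Forum Research

def project_to_simplex(c):
    """Euclidean projection of c onto {a : a_i >= 0, sum(a) = 1}."""
    u = np.sort(c)[::-1]                      # sort descending
    css = np.cumsum(u)
    ks = np.arange(1, len(c) + 1)
    # largest k (0-based) with u_k + (1 - cumsum_k)/k > 0
    k = np.nonzero(u + (1.0 - css) / ks > 0)[0][-1]
    tau = (css[k] - 1.0) / (k + 1)            # uniform shift
    return np.maximum(c - tau, 0.0)           # clamp negatives to zero

def alpha_star(p):
    """Constrained minimizer of f(alpha; p) for 0 <= p < 1."""
    c = (b_I - p * b_F) / (1.0 - p)
    return project_to_simplex(c)

print(np.round(alpha_star(0.633), 4))
```

At \(p=0\) this returns the Ipsos Reid distribution itself, and as \(p\) grows, components are successively clamped to zero.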

The necessary conditions for optimality are known as the Karush-Kuhn-Tucker (KKT) conditions and they take the following form. If \(\vec{\alpha}\) is a nonsingular optimal solution of our problem, then there exist multipliers \(\mu_{1,i}, \mu_{2,i}, \lambda\) such that \begin{align*}\nabla_{\vec{\alpha}}f(\vec{\alpha};p) + \sum_{i=1}^6\mu_{1,i}\nabla_{\vec{\alpha}}g_{1,i}(\vec{\alpha}) + \sum_{i=1}^6\mu_{2,i}\nabla_{\vec{\alpha}}g_{2,i}(\vec{\alpha}) + \lambda\nabla_{\vec{\alpha}}h(\vec{\alpha})&=0,\\ \textrm{for} \ i = 1,2,\ldots,6: \quad \mu_{1,i}g_{1,i}(\vec{\alpha})=0,\quad \mu_{2,i}g_{2,i}(\vec{\alpha})=0,\\ \textrm{for} \ i = 1,2 \ \textrm{and} \ j = 1,2,\ldots,6: \quad \mu_{i,j}\ge 0, \quad g_{i,j}(\vec{\alpha}) \le 0, \quad h(\vec{\alpha})&=0.\end{align*}The first line picks up any candidate minima and maxima, the second line (complementary slackness) chooses which of the constraints are active, and the third line filters out only those solutions that are feasible. One of the roadblocks to a solution is the number of possible combinations of active constraints. In this case there are \(2^{12} = 4096\) possibilities, although this can be reduced by using the structure of the problem. For example, if one of the \(\alpha_i = 1\) then all the other \(\alpha_i\) must be zero. A naive method would be to try all 4096 possibilities and then patch them together as \(p\) increases from 0 to 1.

Rather than sacrificing myself on this altar of 4096 possibilities, I first attempted to find a solution with \(0 < \alpha_i < 1\) (no active inequalities) so that \(\mu_{1,i}=\mu_{2,i}=0 \ \forall i\), but as was found earlier, this solution is only feasible if \(p=0\). What it also revealed is that for small positive \(p\) it was \(\alpha_6\) that became negative, so I activated the constraint \(g_{1,6}(\vec{\alpha})=0\) to force \(\alpha_6=0\) and deflate the problem to finding the remaining 5 \(\alpha_i\) values. This yields the partial solution \begin{align*}\vec{\alpha}(p) &= \frac{1}{1-p}\begin{pmatrix}0.36-0.316p\\0.20-0.316p\\0.28-0.276p\\0.13-0.066p\\0.03-0.026p\\0\end{pmatrix}, & 0\le p&\le \frac{0.20}{0.316}\simeq 0.6329,\end{align*} with \(\alpha_2=0\) being the terminating condition. This behaviour implied that the next patch should result from also activating \(g_{1,2}(\vec{\alpha})=0\) to force \(\alpha_2=\alpha_6=0\) and continuing the deflation process. Continuing,\begin{align*}\vec{\alpha}(p) &= \frac{1}{1-p}\begin{pmatrix}0.41-0.395p\\0\\0.33-0.355p\\0.18-0.145p\\0.08-0.105p\\0\end{pmatrix}, & 0.6329\le p&\le \frac{0.08}{0.105}\simeq 0.7619,\end{align*} with \(\alpha_5=0\) defining the upper extent of the domain,\begin{align*}\vec{\alpha}(p) &= \frac{1}{3(1-p)}\begin{pmatrix}1.31-1.29p\\0\\1.07-1.17p\\0.62-0.54p\\0\\0\end{pmatrix}, & 0.7619\le p&\le \frac{1.07}{1.17}\simeq 0.9145,\end{align*} with \(\alpha_3=0\) at the upper limit, \begin{align*}\vec{\alpha}(p) &= \frac{1}{1-p}\begin{pmatrix}0.615-0.625p\\0\\0\\0.385-0.375p\\0\\0\end{pmatrix}, & 0.9145\le p&\le \frac{0.615}{0.625}= 0.984,\end{align*} terminated by \(\alpha_1 \to 0\), and finally\begin{align*}\vec{\alpha}(p) &= \begin{pmatrix}0\\0\\0\\1\\0\\0\end{pmatrix}, & 0.984\le p&\le 1.\end{align*}
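As a sanity check on the deflation, one can verify numerically that adjacent patches agree at the breakpoint where a component of \(\vec{\alpha}\) hits zero. Here is my own quick check for the first two patches, using their coefficients:

```python
import numpy as np

def patch1(p):
    """First patch: alpha_6 = 0, valid for 0 <= p <= 0.20/0.316."""
    return np.array([0.36 - 0.316 * p, 0.20 - 0.316 * p, 0.28 - 0.276 * p,
                     0.13 - 0.066 * p, 0.03 - 0.026 * p, 0.0]) / (1.0 - p)

def patch2(p):
    """Second patch: alpha_2 = alpha_6 = 0, valid up to 0.08/0.105."""
    return np.array([0.41 - 0.395 * p, 0.0, 0.33 - 0.355 * p,
                     0.18 - 0.145 * p, 0.08 - 0.105 * p, 0.0]) / (1.0 - p)

p_star = 0.20 / 0.316   # breakpoint where alpha_2 -> 0

# the two formulas should meet continuously and each should sum to 1
print(np.allclose(patch1(p_star), patch2(p_star)))
print(patch1(p_star).round(4))
```

Both patches are probability distributions for every admissible \(p\), and the piecewise solution is continuous across the breakpoint.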

Concatenating all these cases together results in the figure displayed below. This mathematical technique is commonly used in inverse problems concerning deblurring, tomography and super-resolution. Essentially, we can think of this as taking an x-ray of the total voting public, and not just those that appear on the surface through their landline.

A mid-November Ipsos Reid poll has Chow at 36 percent, Tory 28 percent, Ford 20 percent, Stintz 13 percent, Soknacki 3 percent, and undecided 0 percent, while a late-February Forum Research poll has Chow at 31 percent, Tory 27 percent, Ford 31 percent, Stintz 6 percent, Soknacki 2 percent, and undecided 3 percent. Summarized in a table we have the following data.

Individual | Ipsos Reid | Forum Research |
---|---|---|
Olivia Chow | 36% | 31% |
Rob Ford | 20% | 31% |
John Tory | 28% | 27% |
Karen Stintz | 13% | 6% |
David Soknacki | 3% | 2% |
Undecided | 0% | 3% |

In part 1, I mentioned that what is really being measured in each one of these polls is not the support of the candidate by the whole population, but rather the support of the candidate by those that have been polled. If we suppose that the Ipsos Reid values represent the true distribution, and that the Forum Research values are uncorrected due to an insufficient representation of individuals without a landline, then a very interesting question to pose is: what is the most likely voting distribution of those without a landline?

Although much has happened in the intervening period from mid-November to late February, I will assume that the underlying voting distribution of the population has remained essentially the same. Recall the recipe from part 1 for finding the true proportion for a given candidate:

\[
P(\textrm{candidate}) = P(\textrm{candidate}|\textrm{has landline})P(\textrm{landline})+P(\textrm{candidate}|\textrm{no landline})P(\textrm{no landline}).
\]
(English translation: The probability of support for a candidate is the probability they are supported by an individual with a landline, weighted by the probability of having a landline, together with the support by an individual without a landline, weighted by the probability of not having a landline.) With our suppositions, the ingredients of the recipe differ from what we had in part 1. For part 2 they are as follows:

- \(P(\textrm{candidate}|\textrm{has landline})\) is the uncorrected (Forum Research) values in the table for a given candidate;
- \(P(\textrm{candidate}|\textrm{no landline})\) is the unknown proportion that we are attempting to extract;
- \(P(\textrm{landline}) = p\) is taken as a variable \( 0 \le p \le 1\) and from the discussion in part 1, \(p\) is becoming smaller over time and is expected to be less than 0.67 in 2014;
- \(P(\textrm{no landline}) = 1-p\) depends on the value of \(p\).

This gives 6 equations (one for each candidate) in seven variables, so the solution has a single free parameter, which we choose to be \(p\). Denoting by \(\alpha_1, \alpha_2, \alpha_3, \alpha_4, \alpha_5, \alpha_6\) the probabilities of support, among those without a landline, for each of the candidates Chow, Ford, Tory, Stintz, Soknacki, and undecided respectively, a naive attempt is to simultaneously solve

\begin{align*}
0.36 &= 0.31p + \alpha_1(1-p)\\
0.20 &= 0.31p + \alpha_2(1-p)\\
0.28 &= 0.27p + \alpha_3(1-p)\\
0.13 &= 0.06p + \alpha_4(1-p)\\
0.03 &= 0.02p + \alpha_5(1-p)\\
0 &= 0.03p + \alpha_6(1-p)
\end{align*}for each \(\alpha_i\). Not all solutions are valid since we would like \(0\le p\le 1\) and \(0 \le \alpha_i\le 1\) for \(i = 1,2,\ldots,6\). With these constraints in place, the only solution is to set \(p=0\), corresponding to everyone having only a cell phone. In this degenerate case, the unobserved distribution is simply what the Ipsos Reid data indicates and the Forum Research results play no role. Essentially, this degenerate solution is a result of asking for an exact match between the two polls.
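The degeneracy is easy to see by solving each equation directly for \(\alpha_i\); this short script (my own, for illustration) shows the last component going negative for any \(p>0\):

```python
# Per-equation exact solve: alpha_i = (bI_i - p * bF_i) / (1 - p)
b_I = [0.36, 0.20, 0.28, 0.13, 0.03, 0.00]   # Ipsos Reid
b_F = [0.31, 0.31, 0.27, 0.06, 0.02, 0.03]   # Forum Research

def exact_alpha(p):
    return [(bi - p * bf) / (1.0 - p) for bi, bf in zip(b_I, b_F)]

def feasible(alpha):
    return all(0.0 <= a <= 1.0 for a in alpha)

for p in (0.0, 0.25, 0.5):
    a = exact_alpha(p)
    # the undecided row gives alpha_6 = -0.03p/(1-p), negative whenever p > 0
    print(p, [round(x, 4) for x in a], feasible(a))
```

Only \(p=0\) survives the feasibility check, which is exactly the degenerate case described above.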

The details of how I solved this problem (for the mathies) can be found elsewhere. For everyone else, let me tell you the solution. First, we move from looking for an exact match to looking for a result that most closely matches the two polls while still observing all the constraints. The figure summarizes the results, and there are two very interesting scenarios that show up.

Individual | Ipsos Reid | Forum Research | \(p=76.2\%\) | \(p=63.3\%\) |
---|---|---|---|---|
Olivia Chow | 36% | 31% | 45.8% | 43.6% |
Rob Ford | 20% | 31% | 0% | 0% |
John Tory | 28% | 27% | 25.0% | 28.7% |
Karen Stintz | 13% | 6% | 29.2% | 24.0% |
David Soknacki | 3% | 2% | 0% | 3.7% |
Undecided | 0% | 3% | 0% | 0% |

If we suppose that \(p=76.2\%\) of people are represented by a landline then the voting distribution of the non-polled that best approximates the Ipsos Reid results when combined with the Forum Research data is: Chow at 45.8 percent, Tory 25.0 percent, Ford 0 percent, Stintz 29.2 percent, Soknacki 0 percent, Undecided 0 percent. We suspect though that \(p\) is actually lower than this and taking \(p=63.3\%\) gives a slightly different result of: Chow at 43.6 percent, Tory 28.7 percent, Ford 0 percent, Stintz 24.0 percent, Soknacki 3.7 percent, Undecided 0 percent.

I was personally very surprised that there is no support at all for Rob Ford from the non-polled provided that \(p \ge 63.3\%\). I would have expected there to be some small residual, but this is simply not borne out by the analysis. Of course, for lower values of \(p\) (\(p < 63.3\%\)) support is found for Rob Ford, but it is curious that this support is systematically included in the Forum Research data and not found at all within the optimal distribution of the non-polled until \(p\) is quite low. Furthermore, a low value of \(p\) simply confirms an inappropriate bias towards those with landlines in the Forum Research values. What does this mean? First, take the polling results with a grain of salt; second, it is fairly clear that the voting distributions of those with and without a landline are significantly different, especially concerning Rob Ford, Karen Stintz and Olivia Chow. The reduction in the Ford proportion among the non-polled is effectively evenly split between Stintz and Chow. Rob Ford in particular may have a Karl Rove moment where the numbers simply do not support the hype within the campaign bubble. To those in the non-polled: get out and vote, your voice is clearly not being represented and is an integral part of the future of Toronto. Now let’s see how this optimal distribution could be found.

One of the issues that has been bothering me recently is how many of these stories simply quote the numbers without question and appeal to the firms that generated them as being accurate. In fact, the latest poll referred to above states:

The Forum Research automated voice response telephone poll of 1310 residents is considered accurate within 3 percentage points, 19 times out of 20.

This statement on its own should not cause much concern but at the end of the article is the following statement.

Ford has fared worse in polls conducted by Ipsos Reid than polls conducted by Forum. A mid-November Ipsos Reid poll had Chow at 36 per cent, Tory 28 per cent, Ford 20 per cent, Stintz 13 per cent, Soknacki 3 per cent. In a mid-December Ipsos Reid poll, 61 per cent said they would not consider voting for Ford.

At this point I became curious and wanted to see just how reliable the numbers being quoted are and how it is possible that a poll that claims to be so accurate (within 3 percent) could be so at odds with other polling firms. Let’s find out!

**Question #1: Where does this 3% come from?**

For any poll, the population is asked a question. For example: “Will you vote for Rob Ford in the next Toronto mayoral election?” Say that there were 1310 people that you asked this question of and 406 of them said “Yes, yes I will vote for Mr Ford”. In that particular poll, the proportion of responses that said “Yes” is 406/1310 or 0.3099. Let’s just call that 31 percent (percent is just a fancy way to say “divide by 100”). This is not really the true proportion, since there is no possible way that every voter can be asked this question, and so we must estimate the true proportion with a smaller sample and hope that this sample is large enough to give a reasonable estimate. Now suppose you do the poll three more times and get 32 percent, then 30 percent and finally 25 percent. Each time you poll, the randomness in selecting who is asked the question is reflected in the variability of the observed proportion for that particular poll. However, all is not lost, and in fact doing the poll over and over will reveal a structure in the observed percentages. They will be centred at the true proportion but have some spread about that centre value. As it turns out, most of the variability in the observed 406 is between 406 - 36 = 370 and 406 + 36 = 442, where 36 is roughly the square root of the sample size (\(\sqrt{1310}\approx 36\)). So the proportion can be anything from 370/1310 = 0.2824 to 442/1310 = 0.3374, or about plus or minus 3% = 0.03 from the reported value of 31%.

For a poll with \(n\) people being sampled, the percentage error is simply \(100/\sqrt{n}\) so for a required accuracy of 1% one would need \(n = 10000\).
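In code, the rule of thumb looks like this (a deliberate simplification of the 95% margin of error, which is tightest for proportions near 50%):

```python
import math

def margin_percent(n):
    """Rule-of-thumb margin of error, in percentage points, for n respondents."""
    return 100.0 / math.sqrt(n)

print(round(margin_percent(1310), 2))   # close to the quoted 3 percentage points
print(margin_percent(10000))            # a 1% margin needs n = 10000
```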

**Question #2: How can polls differ by more than this “accuracy”?**

There are a number of ways that variability in polling results can occur, and primarily these are due to sampling from a group that is not a faithful representation of the population as a whole. This is becoming an increasing concern with telephone polls that only sample from homes with landlines. In fact there is an alarming trend of individuals replacing their landline in favour of a cell phone. A bit of digging reveals that in 2000 about 97% of all households in Canada had a landline, and this has been steadily decreasing (2006: 91%; 2010: 67%). Households that only had a cell phone have correspondingly been on the increase (2008: 8%; 2010: 13%). An even more pronounced situation has been observed in the US and there have been some efforts to account for this effect. What is of concern is that this opting out of having a landline is falling along demographic lines, with data from 2010 indicating that only 7% of households with a landline were in the age group 18-29. For those who consider numbers amongst their close and personal friends, the full report of Secondary Research into Cell Phones and Telephone Surveys is available.

Consider first a toy model with just six people that are polled: \(Y_1, Y_2, Y_3\) are young people, only one of them having a landline (say \(Y_1\)), and \(O_1, O_2, O_3\) are older, all of them having a landline. Consider a candidate for an election that is favoured by the older voters, with support from \(Y_1, O_1\) and \(O_2\). Some other candidate is supported by \(Y_2, Y_3\) and \(O_3\). The table summarizes this situation.

Individual | Has a landline | Supports candidate |
---|---|---|
\(Y_1\) | Yes | Yes |
\(Y_2\) | No | No |
\(Y_3\) | No | No |
\(O_1\) | Yes | Yes |
\(O_2\) | Yes | Yes |
\(O_3\) | Yes | No |

The question is, how much support does this candidate have? If you were able to ask each of the six, then three of them support the candidate, giving 3/6 = 50% support. What would a landline-based telephone survey reveal? Only four of the six individuals can be polled, and of these four ‘poll-able’ individuals, three would support the candidate, giving 3/4 = 75% support. The difference here lies in the coverage of the sampling and the fact that those not being sampled may have significantly different views than those being sampled.
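The toy model can be tabulated and tallied directly; the dictionary layout below is my own, with the values taken from the table:

```python
# The six-person toy model: who is reachable by landline, and who supports
# the older-skewing candidate
voters = {
    "Y1": {"landline": True,  "supports": True},
    "Y2": {"landline": False, "supports": False},
    "Y3": {"landline": False, "supports": False},
    "O1": {"landline": True,  "supports": True},
    "O2": {"landline": True,  "supports": True},
    "O3": {"landline": True,  "supports": False},
}

# support across the whole population
true_support = sum(v["supports"] for v in voters.values()) / len(voters)

# support among only those reachable by a landline poll
polled = [v for v in voters.values() if v["landline"]]
polled_support = sum(v["supports"] for v in polled) / len(polled)

print(true_support, polled_support)   # 0.5 versus 0.75
```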

What is really being measured is not the support of the candidate by the whole population, but rather the support of the candidate by that part of the population that has a landline. We could correct for this effect if two extra pieces of data are known. First, the proportion of people that do not have a landline and second, among these non-landline individuals, what proportion would support the candidate.

The probability of the candidate being supported comes from two sources, those that support them and have a landline and those that support them without a landline, each weighted by the probability of having a landline and not having a landline respectively. Suppose we use the 2010 estimate of 67% of households having a landline (it is most likely less than this in 2014) and let \(\alpha\) denote the probability of a candidate being supported by the individuals without a landline. In the most recent survey mentioned at the start of the blog, Rob Ford was said to command 31% of the support of the people polled. How does all this combine to find the true support of the candidate? By conditioning on having a landline,

\[
P(\textrm{candidate}) = P(\textrm{candidate}|\textrm{has landline})P(\textrm{landline})+P(\textrm{candidate}|\textrm{no landline})P(\textrm{no landline})
\]
or in English: “The probability of support for a candidate is the probability they are supported by an individual with a landline weighted by the probability of having a landline together with the support by an individual without a landline weighted by the probability of not having a landline.”

Each of the ingredients of this recipe, on the right-hand side of the expression, is as follows:

- \(P(\textrm{candidate}|\textrm{has landline}) = 0.31\) in this case and represents the proportion observed in its unfiltered form;
- \(P(\textrm{candidate}|\textrm{no landline}) = \alpha\) is most likely unknown but could be significantly different than the 0.31 in the previous item;
- \(P(\textrm{landline}) = 0.67\) is taken as 67% and is most likely less than this in 2014;
- \(P(\textrm{no landline}) = 1-0.67 = 0.33\) is rising as individuals opt-out of having a landline.

Putting all of this information together gives the expression

\[
P(\textrm{candidate}) = 0.2077 + 0.33\alpha.
\]

There are three extreme cases that allow one to see just how much the reported numbers could vary.

- If no one without a landline supports the candidate then \(\alpha = 0\). In this case, \(P(\textrm{candidate}) = 0.2077\), or about 21% support.
- If everyone without a landline supports the candidate then \(\alpha = 1\). In this case, \(P(\textrm{candidate}) = 0.2077 + 0.33 = 0.5377\), or about 54% support.
- If those without a landline are just as likely to support the candidate then \(\alpha = 0.31\) and \(P(\textrm{candidate}) = 0.2077 + 0.1023 = 0.31\), reproducing the reported 31%.

What are the takeaways here? If you believe that those individuals without a landline have the same average viewpoint as those with a landline, then the raw telephone poll surveys will suffice. If you think there may be a difference, then this effect will swamp the results. The simple example here gives a range of 21% to 54% depending on the unobserved probability \(\alpha\). With probability it is not only what you can measure that is important; recognizing what you are not measuring is what yields accurate results.
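The conditioning formula is easy to experiment with; here is a minimal sketch (the function name `corrected_support` is mine, not from the post):

```python
# Total support after conditioning on landline ownership, using the
# poll value 0.31 and the 2010 landline estimate 0.67 quoted above.
def corrected_support(p_observed, p_landline, alpha):
    """P(candidate) = P(cand|landline)P(landline) + alpha * P(no landline)."""
    return p_observed * p_landline + alpha * (1.0 - p_landline)

# The three extreme cases from the post.
low  = corrected_support(0.31, 0.67, 0.0)   # nobody without a landline supports
high = corrected_support(0.31, 0.67, 1.0)   # everyone without a landline supports
same = corrected_support(0.31, 0.67, 0.31)  # same support either way

print(low, high, same)
```

Varying \(\alpha\) over \([0,1]\) sweeps out the full 21% to 54% range discussed above.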

If you got this far then I have a surprise. Since we know what was reported in the Ipsos Reid poll we can compare it to the Research Forum values to extract the most likely voting demographic of this unobserved group. That in itself will prove to be quite interesting and moreover, it can be used to help compensate future polls from the Research Forum. Stay tuned, and hey, isn’t math fun?

Our first stop is the mathematics of zero and a radio episode by Alex Bellos who travels to India in search of absolutely nothing, well the origins of zero in actual fact. I continue to find it fascinating that the ancient religions of Jainism, Hinduism and Buddhism contributed so much to mathematics and yet much of this history remains unattributed and not taught in mathematics classes in the western world.

If I’ve still held your interest then you may also consider looking at the talk given by Robin Wilson of Gresham College entitled ‘Early Mathematics’ which reviews the time period from 2700BCE to 1100CE. This covers results from ancient Egypt, Mesopotamia, Greece, China, India, the Mayans, Islam and early results in Europe leading into the Renaissance period.

As I mentioned above, this is to be the first in a series of blogs that provide a window into what I find fascinating and where I see interesting cross-overs between mathematics and other disciplines. There are many blog entries under construction with topics as varied as “the mathematics of kid toys and carnival rides”, “predicting elections”, and “an insider view on what it is like as a mathematician to work with industry”.

I look forward to posting them soon.

Update: In the same theme of mathematics that has been lost and rediscovered, A Prayer for Archimedes describes a long-lost text by the ancient Greek mathematician showing that he had begun to discover the principles of calculus long before it was developed by Leibniz and Newton many centuries later.

\(A\) is \(m \times n\) with \(m > n\). In this case there are more equations than unknowns, \(A^{\top}A\) is \(n\times n\) and \(AA^{\top}\) is \(m\times m\). The connection with the pseudo-inverse is that

\[

x = (A^{\top}A)^{-1}A^{\top}b = A^+b

\]
is the particular \(x\) that minimizes \(\|Ax-b\|\).

**Example:** Solve the system \(\begin{pmatrix}2\\ 3\\ 4\\ 6\end{pmatrix}x_1 = \begin{pmatrix}4\\ 6\\ 8\\ 10\end{pmatrix}.\)

Using \(A^{\top}A = (2\ 3\ 4\ 6)(2\ 3\ 4\ 6)^\top=65\),

\[

x = (A^{\top}A)^{-1}A^{\top}b = \frac{1}{65}\begin{pmatrix}2 &3 &4 &6\end{pmatrix}\begin{pmatrix}4\\6\\8\\10\end{pmatrix} = \frac{118}{65}.

\]
Considering the least squares problem, \(\|Ax - b\|^2 = (2x-4)^2+(3x-6)^2+(4x-8)^2+(6x-10)^2\), which has a minimum where \(2(2x-4)\cdot 2+2(3x-6)\cdot 3+2(4x-8)\cdot 4+2(6x-10)\cdot 6 = 0\) so that

\[

x = \frac{8+18+32+60}{4+9+16+36} = \frac{118}{65}.

\]
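For readers who want to experiment, this overdetermined example can be checked with NumPy (a sketch; the post itself contains no code):

```python
import numpy as np

# The overdetermined system from the example: more equations than unknowns.
A = np.array([[2.0], [3.0], [4.0], [6.0]])   # 4x1
b = np.array([4.0, 6.0, 8.0, 10.0])

# Normal equations x = (A^T A)^{-1} A^T b, written out for this scalar case.
AtA = (A.T @ A)[0, 0]                 # 65
Atb = (A.T @ b)[0]                    # 118
x_normal = Atb / AtA

# The same answer through the Moore-Penrose pseudo-inverse.
x_pinv = (np.linalg.pinv(A) @ b)[0]

print(x_normal, x_pinv)               # both equal 118/65
```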

\(A\) is \(m \times n\) with \(m < n\). In this case there are more unknowns than equations, and the connection with the pseudo-inverse is that
\[
x = A^{\top}(AA^{\top})^{-1}b = A^+b
\]
is the particular \(x\) that minimizes \(\|x\|\) amongst all of the possible solutions.
**Example:** Solve the system \((1\ -1\ 0)(x_1\ x_2\ x_3)^\top = (2).\)

In this case we find \(AA^{\top} = (1\ -1\ 0)(1\ -1\ 0)^\top = 2\) and

\[x = A^{\top}(AA^{\top})^{-1}b = \begin{pmatrix}1\\-1\\0\end{pmatrix}\frac{1}{2}2 = \begin{pmatrix}1\\-1\\0\end{pmatrix}.

\]

To see the connection with the least squares problem, the system is row reduced (it is already row reduced) and we can choose \(x_1\) as the pivot with \(x_2, x_3\) as free parameters. Letting \(x_2 = s\) and \(x_3 = t\) we have \(x_1 = 2+s\) so that the general solution is

\[x = \begin{pmatrix}2+s\\ s\\ t\end{pmatrix} = \begin{pmatrix}2\\0\\0\end{pmatrix} + s\begin{pmatrix}1\\1\\0\end{pmatrix} + t\begin{pmatrix}0\\0\\1\end{pmatrix}

\]
for any \(s,t\in\mathbb{R}\). Considering \(\|x\|\), we have \(\|x\|^2 = (2+s)^2 + s^2 + t^2\) which has a minimum at \(2(2+s)+2s = 0\) and \(2t = 0\) or \(t = 0, s = -1\) giving \(x = (1\ -1\ 0)^\top\) as before.
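The same minimum-norm solution falls out of NumPy's pseudo-inverse; a small sketch (assuming NumPy, not code from the post):

```python
import numpy as np

# The underdetermined system from the example: one equation, three unknowns.
A = np.array([[1.0, -1.0, 0.0]])
b = np.array([2.0])

# Minimum-norm solution x = A^T (A A^T)^{-1} b via the pseudo-inverse.
x = np.linalg.pinv(A) @ b
print(x)                              # the vector (1, -1, 0)

# Any other solution, e.g. (2, 0, 0), has a larger Euclidean norm.
other = np.array([2.0, 0.0, 0.0])
print(np.linalg.norm(x), np.linalg.norm(other))
```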

Decomposing the structure of the matrix \(A\) can help in understanding the resulting solutions. In the case of a symmetric matrix, the eigenvectors form an orthogonal set, which allows one to expand \(A = UDU^\top\) where \(U\) is the orthogonal matrix \((U^{-1}=U^\top)\) with the eigenvectors as columns and \(D\) is a diagonal matrix with the corresponding eigenvalues as the diagonal elements. The eigenvalues could be anything, but if we require a non-negative definite matrix then the eigenvalues must be greater than or equal to zero.

**Example:** Write \(A_1 = \begin{pmatrix}2 &1\\ 1 &2\end{pmatrix}\) in the form \(A_1 = UDU^\top\).

A quick calculation gives \(\lambda_1 = 3\) with corresponding eigenvector \(\mathbf{\xi}^{(1)} = \frac{1}{\sqrt{2}}(1\ 1)^\top\) and a second pair \(\lambda_2 = 1, \mathbf{\xi}^{(2)} = \frac{1}{\sqrt{2}}(1\ -1)^\top.\) This gives the decomposition \[A_1 = \frac{1}{\sqrt{2}}\begin{pmatrix}1 &1\\ 1 &-1\end{pmatrix}\begin{pmatrix}3 &0\\ 0 &1\end{pmatrix}\frac{1}{\sqrt{2}}\begin{pmatrix}1 &1\\ 1 &-1\end{pmatrix}^\top.\]
Using this decomposition the inverse of a matrix is easily computed by replacing the diagonal elements of \(D\) with their reciprocals so that for example \[A_1^{-1} = \frac{1}{3}\begin{pmatrix}2& -1\\ -1& 2\end{pmatrix} = \frac{1}{\sqrt{2}}\begin{pmatrix}1 &1\\ 1 &-1\end{pmatrix}\begin{pmatrix}{\small\frac{1}{3}} &0\\ 0 &1\end{pmatrix}\frac{1}{\sqrt{2}}\begin{pmatrix}1 &1\\ 1 &-1\end{pmatrix}^\top.\]

What happens if one of the eigenvalues is zero? This will not affect the decomposition of \(A\). In fact, if we decompose \(A_2 = \begin{pmatrix}1&-1\\-1&1\end{pmatrix}\) with eigenvalues \(\lambda_1 = 2, \lambda_2 = 0\) and corresponding eigenvectors \(\xi^{(1)} = \frac{1}{\sqrt{2}}(1\ -1)^\top\), \(\xi^{(2)} = \frac{1}{\sqrt{2}}(1\ 1)^\top\) then \[ A_2 = \frac{1}{\sqrt{2}}\begin{pmatrix}1 &1\\ -1 &1\end{pmatrix}\begin{pmatrix}2 &0\\ 0 &0\end{pmatrix}\frac{1}{\sqrt{2}}\begin{pmatrix}1 &1\\ -1 &1\end{pmatrix}^\top.\] Notice that this matrix is not invertible since one of its eigenvalues is zero. But what if we took the reciprocal of only the nonzero diagonal elements to form \[A_3 = \frac{1}{\sqrt{2}}\begin{pmatrix}1 &1\\ -1 &1\end{pmatrix}\begin{pmatrix}{\small\frac{1}{2}} &0\\ 0 &0\end{pmatrix}\frac{1}{\sqrt{2}}\begin{pmatrix}1 &1\\ -1 &1\end{pmatrix}^\top = \frac{1}{4}\begin{pmatrix}1& -1\\ -1 &1\end{pmatrix}?\] Sadly, \(A_3\) is not the inverse of \(A_2\); it would be very surprising if it were, since \(A_2\) is not invertible. So what then is \(A_3\)? Well, \(A_2\) and \(A_3\) are pseudo-inverses of each other. This idea can be generalized later to non-square matrices by constructing the SVD of a matrix. As a refresher please watch the following video:
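The construction of \(A_3\) from \(A_2\), inverting only the nonzero eigenvalue, can be reproduced numerically; a minimal NumPy sketch (an illustration, not code from the post):

```python
import numpy as np

# Build the pseudo-inverse of the singular symmetric matrix A2 by
# taking reciprocals of only the nonzero eigenvalues.
A2 = np.array([[1.0, -1.0], [-1.0, 1.0]])

vals, U = np.linalg.eigh(A2)                       # orthonormal eigenvectors in columns
d_plus = [1.0 / v if v > 1e-12 else 0.0 for v in vals]
A3 = U @ np.diag(d_plus) @ U.T

print(A3)                                          # (1/4) [[1, -1], [-1, 1]]
```

The result agrees with NumPy's own `np.linalg.pinv(A2)`, which uses the SVD generalization mentioned above.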

Here is how the pseudo-inverse is connected to the solution of a least-squares problem. If the linear system \(Ax = b\) has any solutions, then they have the form\[x = A^+ b + \left(I - A^+ A\right)\xi\] for some arbitrary vector \(\xi\). Multiplying by \(A\) on the left, and using the property \(A = A A^+ A\) of the pseudo-inverse, gives\[Ax = A A^+ b + \left(A - A A^+ A\right)\xi = A A^+ b = b.\] So for any solution to exist we need the admissibility condition \(AA^+ b = b\). For linear systems \(Ax = b\) with non-unique solutions, as in the underdetermined case, the pseudo-inverse may be used to construct the solution of minimum Euclidean norm \(\|x\|\) among all solutions: if \(Ax = b\) is admissible (\(AA^+ b = b\)), then \(y = A^+b\) is a solution and satisfies \(\|y\| \le \|x\|\) for every solution \(x\).

One final example should tie this all together.

**Example:** Consider finding the solution to \(A_2x = b\) that minimizes \(\|x\|\).

Row reducing \(A_2x = b, b = (b_1\ b_2)^\top\) reveals that \(b_2=-b_1\) is required to ensure a solution, and

\[

\begin{pmatrix}x_1\\x_2\end{pmatrix} = \begin{pmatrix}b_1\\0\end{pmatrix} + s\begin{pmatrix}1\\1\end{pmatrix}

\]
for any \(s\in\mathbb{R}\). Continuing, \(\|x\|^2 = (b_1+s)^2+s^2\) which is minimized when \(2(b_1+s)+2s=0\) or when \(s = -\frac{b_1}{2}\) so that the solution is

\[

\begin{pmatrix}x_1\\x_2\end{pmatrix} = \begin{pmatrix}b_1\\0\end{pmatrix} - \frac{b_1}{2}\begin{pmatrix}1\\1\end{pmatrix}=\frac{b_1}{2}\begin{pmatrix}1\\-1\end{pmatrix}.

\]
Using the pseudo-inverse of \(A_2\), \(A_3 = A_2^+\), the admissibility condition \(A_2A_2^+b = b\) simplifies to \(b_2 = -b_1\), and the solution is

\[

\begin{pmatrix}x_1\\x_2\end{pmatrix} = A_2^+b = \frac{1}{4}\begin{pmatrix}1&-1\\-1&1\end{pmatrix}\begin{pmatrix}b_1\\b_2\end{pmatrix}=\frac{1}{4}\begin{pmatrix}b_1-b_2\\-b_1+b_2\end{pmatrix}=\frac{b_1}{2}\begin{pmatrix}1\\-1\end{pmatrix}.

\]
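As a sanity check (a NumPy sketch, not code from the post), we can verify both the admissibility condition and the minimum-norm solution numerically:

```python
import numpy as np

# With A2 singular, A2 x = b is solvable only when b2 = -b1,
# and A2^+ b is then the minimum-norm solution.
A2 = np.array([[1.0, -1.0], [-1.0, 1.0]])
A2_plus = np.linalg.pinv(A2)              # equals (1/4)[[1,-1],[-1,1]]

b_good = np.array([3.0, -3.0])            # satisfies b2 = -b1
b_bad  = np.array([3.0, 1.0])             # violates the admissibility condition

print(np.allclose(A2 @ A2_plus @ b_good, b_good))   # admissible
print(np.allclose(A2 @ A2_plus @ b_bad, b_bad))     # not admissible

x = A2_plus @ b_good
print(x)                                  # (b1/2)(1, -1) with b1 = 3
```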

At the beginning of this post, the Moore-Penrose pseudo-inverse generalized the idea of an inverse to non-square matrices, and a second notion of pseudo-inverse arose for symmetric matrices with at least one zero eigenvalue. This second notion can be generalized (using the SVD) to non-square matrices and to matrices that are not symmetric, where the eigenvectors are not guaranteed to form an orthonormal set. In all cases, the pseudo-inverse is implicitly tied to the notion of finding solutions with minimal norm.

Let's start with the basic statement of the theorem.

If \(f(x)\) is continuous on a closed interval \([a,b]\) and \(N\) is any number with \(f(a) < N < f(b)\), then there exists a value \(c \in (a,b)\) such that \(f(c) = N\).

The illustration corresponding to the theorem is to the right and indicates that there may be more than one possible value for \(c\). The important restrictions are that

- \(f(x)\) be continuous and
- the interval \([a,b]\) is closed.

The primary purpose of this theorem is to indicate when numbers with various properties exist.

**1.** Make sure the function, \(f(x)\) is continuous.

**2.** Create a new function \(g(x)=f(x)-N\). Replacing the function in this manner makes the \(N\) of the theorem, with respect to \(g\), equal to zero, so that \(g(c)=0\) when a correct value for \(c\) is found.

**3.** Using 0 rather than the general \(N\), we need to find an \(a\) and a \(b\) so that either \(g(a)>0\) and \(g(b)<0\) or \(g(a)<0\) and \(g(b)>0\). The point is that the signs need to change.

**4.** Finding a change of sign confirms that there is a number \(c \in (a,b)\) that allows \(g(c)=0\) or \(f(c)=N\).

**A.** Suppose we have the function \(f(x) = x^2 – 4x\) and we wish to show there is a number \(x_*\) such that \(f(x_*) = 1\).

1. Notice that since \(f(x)\) is continuous, the intermediate value theorem can be used.

2. Let \(g(x) = f(x) – 1 = x^2 – 4x – 1\) so that \(g(x) = 0\) when the correct \(x_*\) is determined.

3. Choosing \(a = 4\) gives \(g(a) = -1 < 0\) and choosing \(b=5\) gives \(g(b) = 4 > 0\). There are of course many other possible values of \(a\) and \(b\). Note that \(a<b\).

4. Since a change in sign was found, there is a number \(c \in (4,5)\) such that \(g(c) = 0\) or equivalently, \(f(c) = 1\).
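Once the sign change is found, repeatedly halving the interval locates \(c\) numerically. A minimal bisection sketch for Example A (the helper names `g` and `bisect` are mine):

```python
# g(x) = x^2 - 4x - 1 changes sign on (4, 5), so the intermediate
# value theorem guarantees a root there.
def g(x):
    return x**2 - 4*x - 1

def bisect(f, a, b, tol=1e-10):
    """Halve the bracketing interval until it is smaller than tol."""
    assert f(a) * f(b) < 0, "need a sign change on [a, b]"
    while b - a > tol:
        m = (a + b) / 2
        if f(a) * f(m) <= 0:
            b = m
        else:
            a = m
    return (a + b) / 2

c = bisect(g, 4.0, 5.0)
print(c)          # the root 2 + sqrt(5), about 4.23607
```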

**B.** If \(f(x) = x^3-8x+10\), show there is at least one value of \(c\) for which \(f(c) = -\sqrt{3}\).

Since \(f(x)\) is continuous we just need to redefine the function (to make \(N = 0\)) and find values for \(a\) and \(b\). The new function is

\[

g(x) = f(x) + \sqrt{3} = x^3-8x+10+\sqrt{3}.

\]
We need to find \(a\) and \(b\) so that \(g(x)\) changes sign. Let \(a=-4\) so that \(g(a) = -22+\sqrt{3} < 0\) and \(b=-3\) so that \(g(b) = 7+\sqrt{3} > 0\). These choices for \(a\) and \(b\) are found by simply trying different values in the function.

At any rate, using the intermediate value theorem we can conclude that there is a value \(c \in (-4,-3)\) such that \(g(c) = 0\) or \(f(c) = -\sqrt{3}\).

A function \(f(x)\) is said to be *continuous* at a point \(a\) in its domain if the following three properties hold.

- \(\displaystyle \lim_{x \to a} f(x)\) exists (this takes three steps to show in itself);
- \(f(a)\) exists;
- \(\displaystyle \lim_{x \to a} f(x) = f(a)\).

Continuity connects the behaviour of a function in a neighbourhood of a point with the value at the point.

If the domain of the function is bounded say \([a,b]\) then each end point of the interval can only be approached in one way. The left end point is at \(x = a\) so \(f(x)\) is said to be continuous at the left end point ‘\(a\)’ if \(\displaystyle \lim_{x \to a^+} f(x) = f(a)\). In a similar fashion, \(f(x)\) is said to be continuous at the right end point ‘\(b\)’ if \(\displaystyle \lim_{x \to b^-} f(x) = f(b)\).

Since there are only a few ways that the limit of a function can fail to exist at a point, there are only a few ways that a function can fail to be continuous. These are classified into four types.

**1. Jump Discontinuity** (also known as a simple discontinuity)

**2. Removable Discontinuity**

**3. Infinite Discontinuity**

**4. Oscillatory Discontinuity**

An explicit example of each type of discontinuity follows next.

**1. Jump Discontinuity:**

Is \(f(x) = \begin{cases} \displaystyle x-1, & 1 \le x \le 2 \\ -1, & -2 \le x < 1 \end{cases}\) continuous at \(x = 1\)? Where is \(f(x)\) continuous?

To answer this we go back to the definition. By computing \(L^+ = 0\) and \(L^- = -1\) (for \(x = 1\)) we see that they are not equal and consequently, \(\displaystyle \lim_{x \to 1} f(x)\) DNE. Therefore \(f(x)\) is not continuous at \(x=1\). As to where \(f(x)\) is continuous, this is everywhere else in the domain \((-2,1) \cup (1,2)\). For the endpoints, in this case we say that \(f(x)\) is continuous from the right at \(x=-2\) and continuous from the left at \(x=2\).

A graph of the function appears to the right.

**2. Removable Discontinuity:**

Consider \(g(x) = \begin{cases} \displaystyle x, & x \ne 2 \\ 5, & x = 2. \end{cases}\) Is \(g(x)\) continuous at \(x = 2\)? No, because even though \(\displaystyle \lim_{x\to 2}g(x) = 2\) exists, \(\displaystyle \lim_{x\to 2}g(x) \ne g(2) = 5\). Since this function can be made continuous by redefining it at \(x=2\), we call this type of discontinuity removable.

To illustrate how a function can be *fixed*, notice that \(f_1(x) = \displaystyle\frac{\sin x}{x}\) is not continuous for all \(x\in {\Bbb R}\) since \(f_1(0)\) DNE. However, \(f_2(x) = \begin{cases} \displaystyle \frac{\sin x}{x}, & x \ne 0 \\[3mm] 1, & x = 0 \end{cases}\) is continuous for all \(x\in {\Bbb R}\).

**3. Infinite Discontinuity:**

Consider the function \(y = 1/x\) on any domain that includes \(x=0\). Since the function becomes unbounded, the limit does not exist at \(x=0\) and continuity fails there. The domain of the function is very important: the same function on a different domain (\(\displaystyle y = 1/x, 1 \le x \le 2\)) does not have an infinite discontinuity because it does not become unbounded on that domain.

**4. Oscillatory Discontinuity:**

This type of discontinuity occurs when a function oscillates too much, as in the case of \(y = \sin(1/x)\). As \(x\to 0\), \(f(x)\) does not approach a single value.

We leave it as an exercise to the student to show that

\( f(x) = \begin{cases} \displaystyle x\sin\left(\frac{1}{x}\right), & x \ne 0 \\[3mm] 0, & x = 0 \end{cases} \) is a continuous function for all \(x\in {\Bbb R}\).
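For a numerical nudge toward the exercise, note that \(|x\sin(1/x)| \le |x|\), which is exactly the bound the squeeze theorem needs. A quick check (illustrative only, not a proof):

```python
import math

# f(x) = x sin(1/x) for x != 0, and f(0) = 0; the squeeze bound
# |f(x)| <= |x| forces f(x) -> 0 = f(0) as x -> 0.
def f(x):
    return x * math.sin(1.0 / x) if x != 0 else 0.0

for x in [0.1, 0.01, 0.001, 1e-6]:
    print(x, f(x), abs(f(x)) <= abs(x))
```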

**A.** Try to plug the value of \(a\) directly into the function.

- If we get a number or the limit ‘blows up’ then we are done!
- You *should* be so lucky. Typically the value is undefined, having the form \(\displaystyle \frac{0}{0}\) or \(\displaystyle \frac{\infty}{\infty}\).

**B.** If we do not get a number then we need to simplify the expression.

- Use the definition of the limit;
- Use of the limit rules;
- Factoring;
- Multiplying by the conjugate;
- Finding a common denominator;
- Using the Squeeze Theorem;
- Applying some memorized limit such as \(\displaystyle \lim_{x \to 0} \frac{\sin x}{x} = 1.\)

**C.** Try to plug the number into the function once again. If we get a number or \(\pm \infty\) then we are done. Otherwise go back to step **B**.

A quick word of warning: l'Hôpital's rule should not be used at this point since it involves taking derivatives, and most instructors will not give credit for limits found with it at this stage.

Like anything else, the best way to get proficient at finding limits is with practice. We conclude with a few examples.

**Examples:**

**1.** Find \(\displaystyle \lim_{x\to 1} f(x)\) where \(f(x) = \displaystyle \begin{cases}x, & x < 1 \\ 0, & x = 1 \\ -x+2, & x > 1. \end{cases}\)

This one requires the definition of the limit. From the left one has\[

\lim_{x\to 1^-}f(x) = \lim_{x\to 1^-} x = 1 = L^-

\]and from the right,\[

\lim_{x\to 1^+}f(x) = \lim_{x\to 1^+} (-x+2) = 1 = L^+

\]Since \(L^+=L^- = 1\), we conclude \(\displaystyle \lim_{x\to 1} f(x)=1\).

**2.** Find \(\displaystyle \lim_{x\to -1} \frac{\sqrt{x^2+8}}{2x+4}\).

For this one we just plug in to find \(\sqrt{9}/2 = 3/2\).

**3.** Find \(\displaystyle \lim_{x\to -\infty} \frac{x^4+1}{x^3-1}\).

If we try to plug in we get the indeterminate form \(\infty/\infty\). The trick here is to divide the top and bottom by the highest power in the denominator.\[

\lim_{x\to -\infty} \frac{x^4+1}{x^3-1} =

\lim_{x\to -\infty} \frac{x^4+1}{x^3-1} \frac{(1/x^3)}{(1/x^3)} =

\lim_{x\to -\infty} \frac{x+1/x^3}{1-1/x^3} \to -\infty

\]so the limit does not exist.

**4.** Find \(\displaystyle \lim_{x\to 4} \frac{x-4}{\sqrt{x}-2}\).

This limit requires that we multiply by the conjugate.\[

\lim_{x\to 4} \frac{x-4}{\sqrt{x}-2} \frac{(\sqrt{x}+2)}{(\sqrt{x}+2)} =

\lim_{x\to 4} \frac{(x-4)(\sqrt{x}+2)}{x-4} =

\lim_{x\to 4} \frac{\sqrt{x}+2}{1} = 2+2 = 4

\]allowing us to conclude that the limit is 4.

**5.** Find \(\displaystyle \lim_{x\to 0} \frac{\sec x - 1}{x^2}\).

This one is a bit of a challenge. We start with multiplying by the conjugate, then use the trigonometric identity \(\sec^2 x – 1 = \tan^2 x\) and finally we separate into three pieces.

\begin{align*}

\lim_{x\to 0} \frac{\sec x - 1}{x^2} &=

\lim_{x\to 0} \frac{\sec x - 1}{x^2}

\frac{(\sec x + 1)}{(\sec x + 1)} =

\lim_{x\to 0} \frac{\tan^2 x}{x^2(\sec x + 1)} \\ &=

\lim_{x\to 0} \left(\frac{\sin x}{x}\right)^2

\lim_{x\to 0} \left(\frac{1}{\cos^2 x}\right)

\lim_{x\to 0} \left(\frac{1}{\sec x+1}\right) \\ &=

1^2 \cdot \frac{1}{1} \cdot \frac{1}{2} = \frac{1}{2}.

\end{align*}
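The value can be sanity-checked numerically (an illustration, not part of the original derivation): evaluating \((\sec x - 1)/x^2\) at ever smaller \(x\) approaches \(1/2\).

```python
import math

# Numerical check of Example 5: (sec x - 1)/x^2 -> 1/2 as x -> 0.
# sec x = 1/cos x, so we can evaluate directly with the math module.
def h(x):
    return (1.0 / math.cos(x) - 1.0) / x**2

for x in [0.1, 0.01, 0.001]:
    print(x, h(x))       # values settle toward 0.5
```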