By Professor Xiao-Li Meng
Whipple V.N. Jones Professor of Statistics and Department Chair
1. “I keep saying the sexy job in the next ten years will be statisticians.”
Hal Varian, Google’s chief economist, recently was interviewed by McKinsey Quarterly, and was quoted (see www.mckinseyquarterly.com/Strategy/Innovation/):
“I keep saying the sexy job in the next ten years will be statisticians. People think I’m joking, but who would’ve guessed that computer engineers would’ve been the sexy job of the 1990s? The ability to take data—to be able to understand it, to process it, to extract value from it, to visualize it, to communicate it—that’s going to be a hugely important skill in the next decades, not only at the professional level but even at the educational level for elementary school kids, for high school kids, for college kids. Because now we really do have essentially free and ubiquitous data. So the complimentary scarce factor is the ability to understand that data and extract value from it.”
As a professor of statistics, you guessed it, I of course cannot disagree less (just to check if you have had enough coffee!). But as a statistician, I am obligated to remind you that a professor of any subject can find quotes–tons of them–to demonstrate the importance of his or her beloved subject.
Wait! Does the “reminder” have anything to do with being a statistician? Well, let’s label this question as Puzzle One, and read on. And while we are at it, let me throw in another quote, this time from a recruiter representing Wall Street–yes, they are still hiring–but read this carefully:
“Now more than ever, they are looking for the best and brightest to help get an understanding as to what caused the housing bubble, and how to properly forecast those prices based on all the variables involved (e.g., interest rates, inventories, short sales, foreclosures, delinquencies etc.). I am actively seeking those individuals who have the background and desire to apply their Stat/analytical skills specifically in the Real Estate medium. …… The trend in these unique economic times is that companies want the more scientific/mathematical/engineering backgrounds to help them back solve [sic] these very new and volatile markets. My clients these days are actually shying away from MBA-types because today’s equity markets have much more to do with randomness and psychology than business fundamentals.”
Here, the word randomness is what brings statistics and statisticians into the picture. Statistics, in a nutshell, is a discipline that studies the best ways of dealing with randomness, or more precisely and broadly, variation. As human beings, we tend to love information but hate uncertainty, especially when we need to make decisions. But information and uncertainty actually are two sides of the same coin. If I ask you to go to the airport to pick up a new student you have never met, my description of her is information only because there are variations – if everyone at the airport looks identical, then my description has no value. On the other hand, the same variation causes uncertainty. If all I tell you is to pick up a Chinese female student by the name of Xiao-Li (meaning “Little Beauty” (小丽) in Chinese, not “Plough at Dawn” (晓犁) as in my Chinese name – an example of uncertainty in translation, or lost in translation!), then my description is not informative enough precisely because it still allows too many “variations”–there may be a substantial number of individuals at the airport who look like a “Chinese female student.” You then need to do something creative on your own in order to pick up the right one, such as making a name sign.
Then again, the name sign is useful for her to identify that you are the one who is picking her up, only because there is variation among names. Indeed, if it happens that there are two “Xiao-Li” name signs outside the terminal, she will need to do something creative on her own in order to find the right one. This is of course a trivial fact, and any of us would recognize and deal with the situation when we encounter it. But we may or may not recognize the deeper principle behind it, that is, information is there for the same reason that uncertainty is there.
While we are at the airport, let me throw in this almost well-known joke. Mr. Skerry needs to take a flight, but he is terrified by the possibility, however small, that someone could bring a bomb onto his plane. So he decides to pack a bomb himself, as he reasons that the chance of two individuals bringing bombs onto the same plane is much smaller than that of one individual bringing a bomb.
You, of course, are chuckling at this. However, which probabilistic/statistical principle is he trying to use, or rather violating? Can you easily explain to your fellow students why Mr. Skerry’s argument is ridiculous? If you cannot, then let’s label this as Puzzle Two.
Regardless of whether you can or cannot, I hope the discussion above has helped you to see more clearly, and fundamentally, why Google and Wall Street, among many others, are increasingly interested in hiring statisticians. We are now squarely in the information age, with almost everything digitized. Each of us is trying to see what all the data (which don’t have to be numerical) out there are telling us, on issues from personal health to the global economic crisis. There is so much variation in almost everything we want to know or study, so what is real information and what is just noise? Mr. Skerry’s reasoning surely is ridiculous, but how many of us have realized that the many “small probabilities” reported in the media and even in scientific publications, such as probabilities of DNA evidence, were based on exactly the same ridiculous reasoning, that is, multiplying probabilities inappropriately?
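For readers who like to see the multiplication spelled out, here is a minimal sketch of Mr. Skerry's arithmetic. The one-in-a-million figure is a made-up number, the function names are hypothetical, and the sketch assumes that a stranger's decision to carry a bomb is independent of whatever Mr. Skerry packs. It is an illustration under those assumptions, not a full answer to Puzzle Two.

```python
from fractions import Fraction

# Hypothetical chance that some other passenger brings a bomb (made-up number).
P_STRANGER = Fraction(1, 1_000_000)


def joint(p_skerry, p_stranger):
    """Joint distribution of (Skerry has bomb, stranger has bomb).

    Assumes the two events are independent, so probabilities multiply.
    """
    dist = {}
    for skerry in (True, False):
        for stranger in (True, False):
            p_s = p_skerry if skerry else 1 - p_skerry
            p_t = p_stranger if stranger else 1 - p_stranger
            dist[(skerry, stranger)] = p_s * p_t
    return dist


def stranger_given_skerry(dist):
    """P(stranger has a bomb | Skerry has a bomb)."""
    skerry_has_one = dist[(True, True)] + dist[(True, False)]
    return dist[(True, True)] / skerry_has_one


# The number Mr. Skerry has in mind: two independent, random bombers.
two_random_bombers = joint(P_STRANGER, P_STRANGER)[(True, True)]

# The number that matters once he has packed his own bomb (probability 1).
risk_after_packing = stranger_given_skerry(joint(Fraction(1), P_STRANGER))

print(two_random_bombers)   # 1/1000000000000
print(risk_after_packing)   # 1/1000000
```

Under these assumptions, the tiny product describes two bombs arriving by chance, while the risk Mr. Skerry actually faces from a stranger is exactly what it was before he packed.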
2. “AP Statistics was the most boring course I took in high school!”
As a professor of statistics, I hear this almost every time I tell someone that I teach statistics: “Oh, that was really a hard course for me!” or, “I really didn’t like my stat course!” And for nearly every one of you (i.e., undergraduates) I have spoken with, the number one reason that you did not even consider majoring (or concentrating, to be true to the Harvard spirit!) in statistics is that the AP statistics you took convinced you that statistics is the most boring subject. We statisticians, of course, are to blame for this unfortunate situation. Statistics is an urgently demanded but vastly underappreciated field; urgently demanded for reasons discussed above, and vastly underappreciated because too few statisticians, relatively speaking, have effectively conveyed the excitement of statistics, as a way of scientific thinking for whatever you do, instead of a collection of tools you may or may not need one day. Tremendous efforts have been made, for example, by the Consortium for the Advancement of Undergraduate Statistics Education (CAUSE, http://www.causeweb.org/). But clearly more is needed, as surely any successful educational program requires on-going effort.
At Harvard Statistics, we are fortunate to have several first-class statistical educators who are at the forefront of teaching introductory statistical courses. For example, my colleague, Ken Stanley, who teaches Stat 104, Introduction to Quantitative Methods for Economics, has been so effective that one student wrote in his/her CUE evaluation, “It is like taking a course in Christianity, and Jesus himself is teaching.” (If you can come up with more impressive praise than this, email me at email@example.com!) Another colleague, Joe Blitzstein, has single-handedly doubled the enrollment of Stat 110, Introduction to Probability, from 90 students when he took over in 2005-2006, to 188 students this past fall. He is now an international sensation, so to speak – a student was telling her friend in Germany that she was taking this cool stat course with Joe, and her friend responded “Oh, you mean that YouTube stat professor?” (You can satisfy your curiosity by googling “Stat 110 at Harvard.”)
Last year, we also launched Stat 105, Real Life Statistics: Your Chance for Happiness (or Misery), and I am teaching it again this semester. This course was designed by my Happy Team, which consisted of eight master’s and Ph.D. students from the statistics department, over a period of two years and many happy dinners (not happy meals!) at the best restaurants Boston can offer. The course aims at introducing students to the wonderland of statistics, by showcasing how it is used (and misused) in real-life situations every student should be able to relate to, either happily or miserably!
Unlike many traditional statistical courses, which arrange the material by statistical topics in the approximate order of their complexity, Stat 105 arranges the material by what we call “Real-Life Modules.” For last year’s offering, the five modules were (1) Finance (e.g., stock market), (2) Romance (e.g., on-line dating models; not dating on-line models!), (3) Medical Sciences (e.g., Viagra trial; not trying Viagra!), (4) Law (e.g., OJ Simpson trial), and (5) Wine and Chocolate Tasting (depending on your age!). This semester, we are replacing the Law module by an Election module, given the historic election we all just witnessed (and now that OJ is behind bars). More information about the first offering can be found in the Harvard Gazette http://www.news.harvard.edu/gazette/2008/02.14/11-stats.html. For the current offering, check the Stat 105 course website (open to anyone with a Harvard ID) and view the video for the first-day introductory lecture to enjoy a virtual chocolate tasting, with or without wine!
All these efforts are aimed at making “statistics not just palatable, but delicious” (the title of the aforementioned Gazette article) to all of you, who, I am 98% sure (that is the highest assurance any professional statistician would give!), will need statistics not only in your own research, regardless of the subject, but also in your life. Our happiness or misery often literally depends on (but of course is not necessarily determined by) our understanding of statistics, whether we realize it or not. Statistics or, more generally, quantitative evidence is being used everywhere in the media, scientific publications, etc., to persuade us to buy a product, an argument, a theory, etc. Some of the claims are statistically and scientifically sound, and many others are not. A good percentage of them are even deliberate lies, intended to deceive the public in order to make a profit. If you have been one of those flipping channels in the wee hours and have given your credit card number over the phone because of those convincing “infomercial statistics,” the chances are that you would have been much more satisfied by trying out the chocolates or wine offered by our Stat 105! (And of course if you have a relative who was convinced by the dazzling “return statistics” of Mr. “Made-Off”, then no amount of chocolates or wine could compensate!)
3. “Honey, I know you are in excruciating pain, but which treatment do you want?”
Here is another real-life scenario in which your happiness or misery literally depends on your understanding of statistics, if you, like me, unfortunately suffer from kidney stones. Two treatments for kidney stones were evaluated in a medical study. Treatment A has a success rate of 78% and treatment B, 83%. Which one should you choose? Surely treatment B, right? Well, what if I tell you that when treatment A and treatment B are applied to those who suffer small stones, the success rates become, respectively, 93% and 87%, and when they are applied to those who carry large stones, the success rate for treatment A is 73% and for treatment B it is 69%? That is, regardless of the sizes of the stones, treatment A has a higher success rate. Surely you then should choose treatment A, right?
Confused? You should be, if you don’t understand Simpson’s Paradox (no relationship with OJ, though there could be a paradox with him too, if he is still looking for himself), one of the most fundamental statistical phenomena, which is responsible for a vast quantity of misinformation in the literature and in the public. There is actually no paradox at all in the mathematical sense. The numbers I reported above are from an actual study (Charig et al., British Medical Journal (Clinical Research Ed), March 1986, 292 (6524): 879–882), and you can verify them yourself: for treatment A, there were 350 patients, 87 carrying small stones, among which treatment A was successful for 81 patients; for the remaining 263 patients with large stones, treatment A was successful for 192 of them. For treatment B, there were also 350 patients, with 270 suffering small stones, and among them 234 were successfully treated by treatment B; for the remaining 80 with large stones, treatment B was found successful for 55 of them.
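If you would rather let a computer do the division, the counts quoted above can be checked with a short script. This is only a sketch of the arithmetic: the counts come from the text, while the names `counts`, `rate`, and `summarize` are hypothetical ones chosen for illustration.

```python
# (successes, patients) for each treatment and stone size, as quoted in the text.
counts = {
    "A": {"small": (81, 87), "large": (192, 263)},
    "B": {"small": (234, 270), "large": (55, 80)},
}


def rate(successes, patients):
    """Success rate as a fraction between 0 and 1."""
    return successes / patients


def summarize(counts):
    """Return per-subgroup and overall success rates for each treatment."""
    summary = {}
    for treatment, groups in counts.items():
        total_successes = sum(s for s, _ in groups.values())
        total_patients = sum(n for _, n in groups.values())
        summary[treatment] = {
            size: rate(s, n) for size, (s, n) in groups.items()
        }
        summary[treatment]["overall"] = rate(total_successes, total_patients)
    return summary


summary = summarize(counts)
for treatment, rates in summary.items():
    print(treatment, {name: f"{value:.0%}" for name, value in rates.items()})
```

The script reproduces the percentages quoted above: treatment A comes out ahead within each stone size, while treatment B comes out ahead once the two sizes are pooled. Why that can happen is still yours to work out.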
Now you do the math! And then think statistically–how could this happen? That is, how could treatment A have a worse success rate overall than treatment B, and yet a better rate in each subgroup defined by the stone size? What caused such a “paradox”? What are its general implications? Did it actually happen with some studies you have done or read? Let’s label this Puzzle Three and read on, unless you really suffer from kidney stones, in which case let me distract you by telling you how I was treated by Dr. Coe from The University of Chicago (where I taught from 1991 to 2001), a world-renowned nephrologist, who treats his patients with statistical principles!
Once Dr. Coe learned that I was a statistician, he said, as I recall, “Well, you then should understand this well. The kidney stones are actually formed by a Poisson process, with those crystals bumping into each other. So what you need to do is to drink a lot of liquid, any kind of liquid, water, juice, coffee, even beer and wine, anything that helps to reduce the Poisson rate for crystals to bond with each other.” He was obviously pleased to finally find a patient who understood “Poisson Process,” and surely the feeling was mutual as I was pleased that I was treated by a doctor who understood statistics! Of course I have followed his advice closely, and have not had any episode of kidney stones for the past 15 years or so. And I have never had any surgery for kidney stones, nor am I on any other treatment now other than a lot of drinking–so next time you see me pouring myself a glass of wine, I may be just trying to reduce my Poisson rate for crystals bonding!
Here is another example where Dr. Coe saved me much trouble and worry because of his–and my–understanding of statistics. While I was at The University of Chicago, I suffered for a long period from fatigue and various pains of unknown cause. So my primary-care physician did all sorts of tests on me. One of them was checking my thyroid function. One result came back on the “borderline”–I don’t recall which test and what the exact values were, but for the sake of the story, let’s say my value was 1.1 and the normal range listed was (1.0 – 2.0). Most people would consider this interval (1.0 – 2.0) to imply that values close to 1.5 are “normal” and that a test result close to the boundaries, either 1 or 2, suggests something to watch for. Indeed, my physician asked me to schedule an appointment with an endocrinologist for further studies. This of course is a rational suggestion, given the “normal” interpretation of my test result and the fact that I was having various symptoms, which could have been due to a thyroid disorder of some sort.
Since the quality of doctors matters (obviously!), and it happened that I had a regular follow-up visit with Dr. Coe shortly after that test, I asked him if he could recommend a colleague who is an endocrinologist. He naturally asked me why, and I showed him the test results. He laughed and sighed at the same time: “Well, these doctors really don’t know anything,” (I assume it’s OK for a well-known doctor to say that!) he continued, “for years I have told them that they shouldn’t provide ‘normal limits’ as such when the distribution is highly skewed! You actually have the most typical value in the population! You of course understand that they should have taken a log or something.” As a statistician, I was both happy and sad. I was happy of course that I had no reason to worry about my thyroid (and I still don’t to this date). I was sad to think how many other people had unnecessarily worried and gone through additional tests, simply because of an elementary statistical mistake in setting the “normal limits.” Incidentally, I was told by a medical student that when a patient’s list of test results comes back from the lab, the abbreviation WNL after the name of a test indicates that the result was “within normal limits.” The inside joke is that it really stands for “we never looked.” Having incorrectly set normal limits could be even worse than “we never looked”!
I hope by now I have distracted you enough from your kidney-stone suffering, and that you understand what Dr. Coe was laughing and sighing about. If not, let’s label this as Puzzle Four, and read on again.
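For the numerically curious, here is a rough sketch of the kind of situation Dr. Coe was describing. Everything in it is an assumption made for illustration: I do not know the actual distribution of that thyroid test, so the sketch simply supposes a log-normal measurement with made-up parameters `MU` and `SIGMA`, and supposes that the “normal limits” are the central 95% range.

```python
from math import exp, log
from statistics import NormalDist

# Hypothetical log-scale mean and standard deviation (not real lab values).
MU, SIGMA = 0.0, 1.0

# Standard normal quantiles that bracket the central 95% of the population.
z_low = NormalDist().inv_cdf(0.025)
z_high = NormalDist().inv_cdf(0.975)

# "Normal limits" on the original scale, and the midpoint people tend to read
# as the most normal value.
lower = exp(MU + SIGMA * z_low)
upper = exp(MU + SIGMA * z_high)
midpoint = (lower + upper) / 2

# The most common value (mode) of a log-normal variable.
mode = exp(MU - SIGMA ** 2)

# How far along the interval the mode sits: 0 is the lower limit, 1 the upper.
position = (mode - lower) / (upper - lower)

# On the log scale, the same limits are symmetric around MU.
log_center = (log(lower) + log(upper)) / 2

print(f"limits: ({lower:.2f}, {upper:.2f}), midpoint: {midpoint:.2f}")
print(f"mode: {mode:.2f}, position within limits: {position:.1%}")
print(f"center of the limits on the log scale: {log_center:.2f}")
```

With these made-up parameters the most common value sits only a few percent of the way up from the lower limit, far from the midpoint, whereas on the log scale the limits are centered on the middle of the distribution.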
4. “The best thing about being a statistician is that you get to play in everyone’s backyard.”
This quote is attributed to John Tukey, a statistical giant who also coined the terms “software” and “bit” (see http://www.princeton.edu/pr/news/00/q3/0727-tukey.htm or The New York Times, July 28, 2000). This is literally true, as many statisticians, myself included, can personally testify. Other than teaching the delicious Stat 105 and other courses (e.g., I also co-teach, with Joe Blitzstein, Stat 303, The Art and Practice of Teaching Statistics, aimed at training more and better future statistical educators), I am currently conducting–together with researchers from the Harvard-Smithsonian Observatory–a workshop on AstroStat for dealing with astronomical amounts of data from astrophysics; working with a group of geophysicists from the University of Illinois and the National Weather Service on climate change; writing papers with a team of psychiatrists from the Harvard Medical School and Columbia University on estimating disparities in mental health services; collaborating with researchers from Harvard’s engineering school on signal processing, particularly for digital cameras, via wavelet methods; publishing articles with statistical geneticists at The University of Chicago and deCode Genetics in Iceland on how to measure information in genetic studies; and preparing reports with my ex-postdoc at the University of Chicago on AIDS reporting delay to the CDC (Centers for Disease Control and Prevention). I of course also play in statistics’ own backyard, or perhaps I should say front yard, investigating statistical foundational issues, such as to what extent size matters–do more data automatically imply more accurate results? (This one will take more thinking, so let’s consider it the last Puzzle of the Day.)
If you find the range of my “backyard” activities impressive, check out our webpage (stat.harvard.edu), and prepare to be dazzled by the wide range of “front yard” research my colleagues are conducting, such as Sam Kou’s absolutely pioneering work on statistical models for nano-biochemical experiments.
I hope the quotes and stories have provided a snapshot of how practically useful and intellectually fulfilling it is to be a statistician, or at least to be able to reason with good statistical insights. I am certainly having great fun, both professionally and personally, as a statistics professor, and I hope you will be able to share some of the fun by taking at least one statistics course, no matter how much you hated that idea before. You will then, among many other benefits, easily find out the answers to all five puzzles listed above. If you want to think hard about them now to challenge yourself, of course that is part of the fun! But if you start to lose sleep over any of them and feel miserable, email me (firstname.lastname@example.org)–remember, I promised you both happiness and misery!