Economic Policy Group co-organizers, John Cochrane and Valerie Ramey, hosted a talk on “Quantifying Non-Sampling Variation: College Quality and the Garden of Forking Paths.”
Presenter: Jeffrey Smith, the Paul T. Heyne Distinguished Chair in Economics and the Richard Meese Chair in Applied Econometrics at the University of Wisconsin–Madison.
Moderator: John Cochrane, the Rose-Marie and Jack Anderson Senior Fellow at the Hoover Institution.
SUMMARY
Empirical economics papers report standard errors that quantify the uncertainty associated with sampling variation but rarely consider non-sampling variation in a systematic way. Non-sampling variation arises from researcher study design choices related to measurement of key variables, functional form, tuning parameters, model selection procedures, and so on. This paper documents the current state of play regarding non-sampling variation and describes approaches from inside and outside of economics to quantify such variation more systematically. We provide a worked example in the form of an analysis of the effects of college quality on degree completion and earnings using data from the National Longitudinal Survey of Youth 1997 cohort (NLSY-97). In our analysis, we consider multiple ways to create our college quality index, multiple ways to deal with item non-response in our conditioning variables, multiple ways to code our earnings outcome measures, and multiple ways to choose a specification for our conditioning variables. We find that in this context, in which sampling variation matters a lot due to the relatively small sample size of the NLSY-97 and the relatively high residual variance of our outcomes, these dimensions of non-sampling variability imply uncertainty in our parameter estimates on a par with the sampling variation.
WATCH THE SEMINAR
Topic: Quantifying Non-Sampling Variation: College Quality and the Garden of Forking Paths
Start Time: October 29, 2025, 12:15 PM PT
Speaker 1:
Jeff, it's wonderful to have you here. Let's go.
Speaker 2:
Wonderful to be here. Thank you, John. So the topic today is quantifying non-sampling variation and college quality in the garden of forking paths. This paper has two co-authors. The local co-author is Lois Miller, a former student of mine who now labors at the University of South Carolina, and Heather Little is a current student of mine at Wisconsin. Not everybody here knows me, so I have a brief introductory slide. I did my undergraduate degree at Washington, where I was heavily influenced by Paul Heyne; some older folks in the room may know that name. I'm now the Paul Heyne Professor of Economics at Wisconsin. I wrote my dissertation at Chicago; that was my committee up there. Pretty good committee. I've had four jobs: Western Ontario, Maryland, Michigan, and now Wisconsin. And I want to shout out here especially to the last three names (the first one's kind of obvious), because they're not cited in the slides, but they have influenced my thinking about these issues, and I think that will be plain to folks who are familiar with their work.
So I read The Rhetoric of Economics in graduate school. It came out, I think, my first or second year, and it was the hot thing among the students at the time, and it kind of stuck with me. I read it again recently and actually didn't like it quite as much; I didn't like the execution quite as much, but I still liked the ideas. Andrew Gelman is a statistician who has kind of a public face: he has a blog where he critiques dubious studies and things like that. And he is actually, I believe, the originator of the garden of forking paths analogy that I'll be relying on here. And Chuck Manski, of course, has thought a lot, especially in recent years, but even going back to his work on bounds, before they were elevated to set identification, about uncertainty in estimates that results from something other than sampling.
And that's going to be our theme today. Now, I want to be clear at the outset: the Grumpy Economist is over there, and you may think after you've heard this talk that I am a grumpy economist, but I'm not as grumpy as this talk. This is an above-average grumpy talk, but it's the most general-interest paper I had, and so I thought I should present it here. There's a bunch of epigrams to start. I played a lot of Civilization in my wasted youth, and Sid Meier, the designer of Civilization, wrote a memoir, which I would recommend if you're into Civilization. He says in that memoir, "A game is a series of interesting decisions." And we say: so is an empirical economics paper. And a paper in this space in a different discipline says that empirical results hinge on analytical decisions that are defensible, arbitrary, and motivated.
And that's kind of important to how we want to think about this garden of forking paths. Finally, a former PhD student from our department, who unfortunately dropped out of the program but who had a Twitter presence for a while, said: they're not standard errors, they're fabulous errors. How dare you insult such an icon. And so I will say a little bit more towards the end about fabulous errors. Alright, the outline is pretty standard stuff. Three substantive questions for us today. How do current studies characterize the uncertainty due to non-sampling variation? And don't worry, I'm going to give you a bunch of examples of that. How should researchers characterize the uncertainty arising from it? Oh, somehow I've lost one of the questions. Oh no. How do current studies do it, and how should we do it? I'm going to distinguish those two things, because I don't think we do a good job of it at the moment.
And then we're going to do a worked example, and that's where the college quality part comes in. We're going to kind of go after my own papers here and demonstrate that in an application where the sampling variation actually looms kind of large, because it uses the NLSY-97, which is not a big sample, and has a high-variance outcome, which is earnings, it is still the case that the non-sampling variation, even on a small and finite number of dimensions, is going to rival the sampling variation in magnitude. So it's kind of a proof of concept, I guess.
This seminar has the word policy in the title, and I thought that, even though last week there was no government in the model and thus I guess no policy, I should add a couple of slides about policy, and I think they'll stick. Actually, I think they're a good addition to the talk. Fifteen years ago or more, I would have said that the US has no evaluation policy. I want to distinguish evaluation policy from policy evaluation. Evaluation policy I think of as the set of decisions that we make about what data we're going to collect, who we're going to allow to use it, and when we're going to allow it to be combined, by whom, and in what ways. So that's kind of the first thing. So they're talking about a new cohort of the NLSY. It'll be the NLSY27, and it amazingly has not been killed yet, and there's a lot of discussion about which administrative data sets administrative effort should be put into linking to the NLSY, and consenting the respondents.
And then, if we link in UI earnings, should we still ask the respondents survey questions about their earnings too, or should we say that the UI earnings are a fine substitute and we can make the survey shorter or ask about something else? And I think the argument in this paper is going to push towards saying it's good to have two measures of earnings, because then you can get a sense of that dimension of non-sampling variability in every study you do that uses earnings as a dependent variable, or as an independent variable for that matter. Proposal review, right? This is something that is partly set by policy, partly set by norms. How much should we reward people for going farther in thinking about and documenting the sensitivity of their estimates to non-sampling variation than we presently do, which I think is not very much? The federal government runs, or ran, a bunch of quantitative evidence compendia, like the What Works Clearinghouse, the WWC, colloquially known as the Nothing Works Clearinghouse, which grades and documents studies of educational interventions. CLEAR is the Clearinghouse for Labor Evaluation and Research; DOL runs that. In the grading schemes for those types of sites, and just in the data that is presented about each study, non-sampling variation is not emphasized. Maybe it should be.
So there's some policy. There are also what I call professional policies. Should we require pre-analysis plans for journal publication, for example? At the moment there's a norm, especially in development economics land but to some degree in other places that do RCTs, that you need to have a pre-analysis plan for an RCT: you need to pre-register the RCT and specify all the design choices that you plan to make, or a data-driven algorithm that will guide those choices that you can specify in advance. So: if we see this, we'll do this; if we see that, we'll do that; that kind of stuff. These are not routine for non-experimental studies, which are still the vast majority of studies in applied economics. Should we require pre-analysis plans for those, and should editors and reviewers reward studies more than they presently do for the way that they handle non-sampling variation? How should we think about sensitivity analyses? I'm going to make some criticisms of the way that we do sensitivity analyses in empirical papers these days.
We don't talk about it much; we're not sociologists, I guess, but there are norms in how we do empirical work. If you look at empirical studies that use regression discontinuity designs, partly I think because that's a pretty new literature, there are very strong norms about certain things you have to do. You have to do a balance test, you have to do the McCrary density test, and if you have a lumpy or discrete running variable, then there's stuff you have to do; there's a paper by David Card you have to cite, and you have to do that stuff. And maybe we should develop some norms around sensitivity analysis more broadly.
Speaker 1:
Perhaps not just norms requiring internet appendices D, E, F, G, and H before you get anywhere, but norms that you're allowed to look at existing studies that turned out to be interesting, see whether they pass this kind of analysis, and maybe publish the result somewhere.
Speaker 2:
Well, actually, that's a good point. I'll return to that at a later slide, but yes, that's a very good point. Alright, examples of non-sampling variation. One is different ways of measuring the same variable, and the variable that we're going to use in our empirical example is college quality; we're going to talk about different ways to estimate college quality. Earnings is another one. As I was mentioning to Ken beforehand, in the data from the National JTPA Study, the Job Training Partnership Act study that's used in all the Heckman et al. papers, there are four different earnings measures for various subsets of the data: two administrative, UI and Social Security, and two survey measures. One is CPS-style, how much did you earn last year? And another, cribbed from the NLSY because NORC did the surveys for the study, builds up earnings from information on individual job spells. And those measures are correlated, which is good, and reasonably highly correlated, positively, like 0.6.
But the correlation is not one by any means. And the mean is quite different for one of the four, the one that's built up from the job spells: it has a higher mean. There's an unpublished chapter of my dissertation on this, if you want to learn more. Item non-response: there's a whole bunch of different ways to deal with item non-response, as opposed to survey non-response. Survey non-response means somebody just says, nope, I'm not doing your survey. Item non-response says, yep, I'm doing your survey, but I don't answer certain questions. Maybe they're sensitive. Back in the old days, when these things were done on paper, it may just be that the enumerator had written down something that wasn't typeable by the person entering it into the machine. There's a bunch of different ways to deal with that; we'll talk about it in the empirical implementation. Functional forms, right? Logit, probit, linear probability model. Most of the time it doesn't matter. Some people get very exercised about this particular thing; I could point you to an entire seminar-length presentation by Bill Greene about the evils of the linear probability model relative to logit and probit. I'm not sure it warrants that much attention, but it can make a difference, and you have to make a choice.
Speaker 3:
I just want to say, in macro all these other things come up, because many things have exponential trends. How do you think about normalizing, and should you use first differences or levels, and all those sorts of things.
Speaker 2:
I actually have a whole slide; I was anticipating more macroeconomists than I have received, but I have a slide of macro non-sampling variation after this one. Variance estimators: I'm going to be focusing on point estimates here, but there's a choice about variance estimators in many contexts. There's a choice about who you study, right? I'm old enough to remember when papers about labor supply, or papers about wages, would all be about men, because women were complicated if they didn't all work. This is really dating me here. And you can choose a data set. Sometimes you have a given question: maybe it could be answered with the CPS, maybe it could be answered with the SIPP, and maybe there'd be different issues with each one. You can think about different identifying assumptions. So I was just at Natalie Miller's practice job talk.
She has three different identification strategies in her paper, and we had a discussion of which one we might like best. Here are my macro examples. I'm not really a macroeconomist, not at all a macroeconomist, but I have a couple of papers that use search models. Choice of values for calibration: in my paper with Jeremy Lise and Shannon Seitz, we sort of agonized about this and about the sensitivity to the calibration choices. You have to make functional form choices. You have to decide what goes in the model. I feel like at every seminar on kind of macro-labor that I go to at Wisconsin, somebody asks, do you have savings in the model? And they're like, no, it's hard. And, well, okay, but is that a good reason not to have savings in the model?
Speaker 1:
Two comments, one on micro. You left out some of the elephants in the room: what goes on the left-hand side, what goes on the right-hand side, what are all these controls doing here? Controlling for industry in a wage equation: number one thing you should not do. What are all these stupid fixed effects doing here? Why in the world are the instruments even vaguely plausibly instruments? Could you please tell me what the source of variation in the error term is and why it's orthogonal to everything? Those seem like the elephants in the room of every micro paper I go to. In macro, you've got a deeper problem, which is how do you evaluate a quantitative parable, which is what macro is.
Speaker 2:
A quantitative parable. I like that phrase.
Speaker 1:
That's what our theories are. They are not "here is what Newtonian mechanics tells you the flow of air around the wing looks like." They're sort of, well, here's a cute little economy and how it behaves. We don't even pretend. Yeah, you said it: most of the models don't have capital. So, well, but they still kind of see output and consumption go together; isn't that cute? I don't have an answer to that one, but that's helpful.
Speaker 4:
My point of view, but feel free to postpone it: in model-based, structure-based applied work, even if the ideal would be a non-parametric identification and estimation approach, the reality is that it's often infeasible. So you're necessarily bound to make a choice of specification. And I see that as an inescapable practical problem.
Speaker 2:
I think it is. I guess my,
Speaker 4:
But then, let me say, we all learned Heckman and Honoré 1990, and we know the perils. Is there a feasible way of answering this question? If the answer to the referee is, well, try very different functional forms... anyway, if that's the conversation we're going to have, I'm looking forward to it.
Speaker 2:
I feel like, as part of my role at Wisconsin (I think I was a diversity hire there because I'm not a structural person, but I am sympathetic, a fellow traveler as it were of the structural program, and I have a couple of structural papers and structural friends), I will say: well, have you examined the sensitivity of your estimates to variations in model space? And the usual answer is no, followed by: the model takes three weeks to estimate, so I can't do that, because I need to send my job market paper out. And so the model is sort of packed with stuff up to the point where it's barely feasible to estimate it once. And I sometimes wonder whether we should think about the trade-off: well, what if the model had one less thing in it, and that allowed me to do these perturbations in model space and get a sense of the uncertainty? It is also true that I have a sense, particularly for well-developed classes of models like search models, that there's lore: the key researchers in the field will have some idea of how sensitive things are to perturbations in model space, because of stuff that's left on the cutting room floor, because they've just read the literature a lot. But I don't know that, and it's not quantified anywhere; it's just in the lore.
So this is going to be a paper that is framed around design-based applied micro. One could certainly write, and I'm enamored with the idea of writing, though we haven't actually finished the draft of this one, a structural labor version of this. I think that would be interesting, and some of the issues are the same. You still face the issue about the multiple earnings measures and such, but some of the issues are different.
Speaker 4:
Thank you.
Speaker 2:
So the examples I just listed are very different, and there's a sort of fuzzy boundary between substantive and design choices. In fact, the earnings concept that underlies unemployment insurance earnings records is different from the earnings concept that underlies the CPS earnings question. And so it's not just that they are two different ways of measuring the same thing; they are two ways of measuring slightly different conceptual ideas. That fuzziness comes up a bunch in this stuff, and I think it's part of why we don't always go very far, or at all, down the road of trying to quantify the non-sampling variation. There's also a broader issue about which things should vary across studies and which things should vary within studies. What should we expect people to do? If you could do your study with both the CPS and the SIPP, should we expect for publication that you do both, or is it okay that there's a paper with the CPS and then your student writes the paper with the SIPP, or whatever? And then, I don't know, papers are getting long.
We're in the season of job market papers now, so we're all downloading these hundred-page PDF files where the student is trying to prove that they know every possible skill and did every possible thing.
This is partly just a question-raising talk rather than a question-answering talk, although I'm going to try to answer questions a little bit. So what is the baseline? The baseline is sampling variation, right? Sampling variation results from the use of a simple random or stratified random sample, some sort of sample from a population rather than the population itself. And the standard errors that we report underneath our coefficient estimates are supposed to tell us something about the variability of the estimator under repeated random samples. And we care about that; we devote a huge amount of attention to these numbers that go in parentheses. So there's a huge literature, which started I guess late in my time in graduate school, on heteroskedasticity and off-diagonal stuff, so clustering and robust standard errors and all that. Then there's a huge literature on bootstrapping. I thought bootstrapping was one of the coolest things I learned in graduate school.
I worry that we're not equating margins here, in terms of the amount of intellectual effort that we devote to the standard errors, the epicycles on epicycles on epicycles of getting those right, versus the amount of attention that we devote to the non-sampling variation and attempting to quantify it in some systematic way. We are so excited about sampling variation that we report standard errors even when we don't have a sample. I'm old enough to remember papers that worried about what happened to the standard error estimator as the size of the sample got close to the size of the population, because things go a little bit wrong if your sample is 99% of the population. In many studies that we do now using administrative data, or studies that use cross-state variation, state-level data, that kind of stuff, you have the population.
If you took the sampling theory literally, there'd be no number in parentheses. But you're not going to publish that; I can tell you that's not going to happen. So I have this nice quote here. I'm not a real econometrician; Alberto Abadie is a real econometrician. He says this is common even in applications where it's difficult to articulate what the population of interest is and how it differs from the sample. If you press people, the students often have never even heard of this point, but people who are older will mumble something about super-populations.
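For reference, the standard survey-sampling result behind that worry: under simple random sampling without replacement from a finite population of size $N$, the variance of the sample mean carries a finite-population correction,

$$
\operatorname{Var}(\bar{y}) = \frac{S^2}{n}\left(1 - \frac{n}{N}\right),
$$

so as $n \to N$ the sampling variance goes to zero, and a literal reading leaves nothing to put in the parentheses.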
There's this real demand for this number in parentheses, and this real worry about sampling variation, even when we don't have a sample. A related point here: I'm going to focus mainly on variability, but there's a related point about point estimates. Most applied micro papers generate multiple point estimates, right? It's framed as: here's my preferred path through the garden, and then I have a section towards the end of the paper where I do a few one-step-away changes in the path that I took through the garden, and I get a different number from those exercises. And I conclude, without ever having defined what I mean by the word robust, that my results are robust to whatever that one-step-away change was. And then I continue to focus on the one number that I got out of my preferred path through the garden.
Now a Bayesian (and I like to say I'm a casual Bayesian; Andrew Gelman would say a spiritual Bayesian, which means I don't want to spend time fully specifying my multivariate prior, because that's hard, but I sort of try to think that way) would say: well, here you've got five different estimates, let's say, from your preferred path and your four one-step-away estimates. Why is the right summary number that goes in the conclusion the number from the preferred path, and not some weighted average of the five numbers? And there are people in the literature who are worried about this. My almost-colleague Steven Durlauf, he left as I was showing up, has written papers with my intellectual sibling Salvador Navarro and with David Rivers on Bayesian model averaging. People do meta-analysis, blah blah blah. Economists tend not to do these things. We really like to have the one number that comes from our preferred path through the garden.
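To fix ideas, here is a minimal sketch of that weighted-average alternative. The numbers are invented, inverse-variance weights are just one defensible choice, and the within/between decomposition is in the spirit of standard model-averaging formulas, not anything from the talk:

```python
import numpy as np

# Hypothetical estimates: preferred path plus four one-step-away variants
estimates = np.array([0.22, 0.18, 0.25, 0.15, 0.20])
std_errors = np.array([0.06, 0.07, 0.06, 0.08, 0.07])

w = 1.0 / std_errors**2          # inverse-variance weights
w /= w.sum()

pooled = np.sum(w * estimates)
within = np.sum(w * std_errors**2)             # average sampling variance
between = np.sum(w * (estimates - pooled)**2)  # spread across paths

print(f"pooled estimate: {pooled:.3f}")
print(f"total SE (sampling + path): {np.sqrt(within + between):.3f}")
```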
Speaker 3:
First, John... this idea of model averaging is something that I heard from Clive Granger, who was my colleague years ago: it turns out the best forecast is the one that averages a bunch of other people's forecasts. Even if some of them are weak, you actually get more from averaging than from picking what you think might be the best at the time.
Speaker 1:
So let's not forget the lesson of clustered standard errors: a hundred different studies can all come to the same result if they're all cheating in the same way, so you don't get much better by averaging. I think one way of putting your point is: every month, ten top-five journals come into our mailboxes, each one of them with ten papers in it, with three stars on every article. How much does any of that shift our beliefs about whether what they say is right, on, say, whether minimum wages raise or lower employment? 0.01. Whereas of course every three-star result would be a miracle if it were wrong. So obviously nobody believes this stuff, and something else is going on.
Speaker 2:
So there was a book that Heckman put me onto when I was a student called Scientific Reasoning: The Bayesian Approach, by Howson and Urbach. And they make the argument that you were making: that basically even people who would self-identify as frequentists act like casual Bayesians when they interpret evidence. And I think that's exactly right.
Speaker 1:
But even for a Bayesian, just reinterpret the standard error as how much my prior should shift. Enormously: one econometric article with three stars, and whatever I had before should go boom, over here. It doesn't. Why? Because I know all the skullduggery, and how many fixed effects and controls and all the rest of it went...
Speaker 2:
In some very informal and uncertain way.
Speaker 1:
In an informal and uncertain way.
Speaker 2:
Yeah, agreed.
Speaker 3:
Actually, the difference in forecasting is that the objective function is different. Professional forecasters are trying to get it right, whereas here there's too much p-hacking and all the other business going on.
Speaker 1:
Well, our hard job is to distinguish cause and effect, not just forecast.
Speaker 3:
No, but I'm saying the good properties you get from this averaging of forecasts... you made a good point: implicitly, these people are actually trying to forecast well. They don't have some other agenda, like getting a top-five publication by p-hacking.
Speaker 1:
That's a very good point.
Speaker 4:
But then the question, Jeff, is: if you go the Bayesian way, casually speaking, then you're going all the more parametric way.
Speaker 2:
When I say casual Bayesian, I am thinking more that I'm integrating over all this evidence, some of which is not in the paper. I know what people do; I do it. I know there's all this stuff on the cutting room floor that isn't mentioned in the paper. And so it's not really that...
Speaker 4:
Or reported among the controls,
Speaker 2:
Right? And I think hopefully not too many people still pick specifications based on the stars. But back in the day people did stepwise regression and stuff, which was sort of picking the specification based on the stars.
Speaker 1:
It's called machine learning now; it's very popular. Given that the Federal Reserve conducts policy based on the stars, it's all going to head downhill in a hurry.
Speaker 4:
But seriously, the question is also: there's an undeniable numerical convenience in Bayesian routines, but from the point of view of being Bayesian, I am already imposing a specified structure. I maybe don't want to go that extra step.
Speaker 1:
Well, Bayesianism sounds beautiful, but nobody keeps the posterior from the last 10,000 papers and then updates. We do Bayesian one-shot. Well, that didn't help us much.
Speaker 4:
There's a beautiful paper by, I'm forgetting the authors, Buchinsky and coauthors, ReStud 2010, about another debate, on the returns to tenure and experience, showing that what they proposed would shift a prior concentrated on the bulk of the existing estimates. The modified estimator that they proposed, which accounts for mobility decisions across jobs, and the stars that they report, change your prior about what the returns should be back to the old Topel numbers.
Speaker 2:
Okay, so as I mentioned, I'm continuing, I guess, with current practice: what do people currently do? As I said, in most papers that I read, there's one path through the garden and then some one-step-away sensitivity analysis, and then we declare robustness. In the student slides now, there's usually a whole slide of sensitivity analyses, and there's a click-through for every one, but we're already behind schedule in the seminar by that time, and so we don't actually click through on any of them. We're just assured by the speaker that it is robust to all those things, whatever that might mean. So this is just a pretty slide that Heather made to try to illustrate this business about the one-step-away. This is something that we're going to try to flesh out more in the next round of empirical work, when the government shutdown is over; the NLSY is not available right now,
it turns out. And here's the catch: we're using the restricted-use data, because we have to know where people went to college, and that's part of the restricted-use data. So here is a garden with four forks, each of which is binary, and illustrated in green is the researcher's preferred path. If you think about one-step-away sensitivity analysis: at the top, you chose to go to the right instead of to the left, but then you stick with the same choices on the other three. There are huge numbers of endpoints, or exits from the garden, that you're going to miss if you do that. And so one of the things that we want to look at is, at least in our application, how much does it matter that you only do the one-step-away?
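To make the counting concrete, here is a small sketch (mirroring the four-binary-fork figure, not the paper's actual garden) of how few exits one-step-away analysis visits:

```python
from itertools import product

preferred = (0, 0, 0, 0)  # hypothetical preferred choice at each fork
all_paths = list(product([0, 1], repeat=4))

# One-step-away: the preferred path plus every single-deviation path
one_step = [p for p in all_paths
            if sum(a != b for a, b in zip(p, preferred)) <= 1]

print(len(all_paths), len(one_step))  # 16 exits total, only 5 visited
```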
Speaker 1:
There are still elephants in the room. I'm reminded of Tom Kuhn's example. We're discussing the theory of heat: is it caloric or is it work that causes heat? And one side says, well, look at cannons: the more you rotate them, the more they get hot, without limit; it must be the work theory. The other side says, well, dung piles are hot; what do you have to say about that? I don't know what you say about that. There's no F-test or p-test there; those are perfectly established facts. And much of what we do seems to have that character. We're doing Keynesianism versus monetarism; nobody's p-value on some test or whatever is ever going to get anywhere on that one. Or certainly, historically, that's not what convinced people.
Speaker 2:
I am sympathetic to what you
Speaker 1:
Said. I'm just asking for more generalization, and you're going to actually do something, and I'm just going...
Speaker 2:
Well, no. And you write in a public space right now, and I think arguments like that carry more weight in policy than they always do... I mean, it's not that they receive no weight in the seminar room, but I think their relative weight is larger out in the broader public discussion. We have implicit variation across studies, but it's not ideal, because typically, if you think about two studies that are going to go through this garden, they're going to make a bunch of different choices, and usually there are a lot more than four decision points. And so you're varying a whole lot of stuff at once, and oftentimes you're not explaining why you're varying it.
So that doesn't entirely solve the problem. And as was suggested before, I think we're not very good at... oh, the mixed findings point. This is something I also observe at student presentations a lot. In particular, they'll say: I'm really interested in looking at the effect of X on Y. Here's a list of papers that have looked at the effect of X on Y. Some of them find a positive effect, some of them find a null effect, some of them find a negative effect. It's a mixed literature, so I'm going to add to the mix. And it doesn't seem to even occur to them that maybe it would be interesting to write a paper that tried to reconcile the existing results and figure out where the variation across the studies comes from, instead of just adding to the cacophony. But we don't reward that, right? I think that's why it doesn't happen. It's also hard, but I think we don't reward it enough, and so we just end up piling on more mix instead of trying to figure it out.
Speaker 5:
You would reward it if you could do it. I mean most of the times you can't actually reconcile.
Speaker 2:
Well, I have an anecdote for that, a story I like to tell. Robert Moffitt was writing a literature survey, this was some 15 or 20 years ago, about the effect of AFDC benefit levels, welfare benefit levels, on... I don't remember exactly what outcome. And there was one study that was kind of an outlier, and so he actually went and got the data and redid the one study that was the outlier. It turned out it was a programming error, and it wasn't a mixed literature after you got rid of the study that had the programming error. We could do that more than we do.
Speaker 3:
David Hendry used to make this argument about encompassing. He said that anytime you come up with a new study that contributes to a literature where there are other estimates out there, it should be your responsibility to figure out why you're getting different estimates. Now, I know that some of the econometricians weren't happy with what he did, but I think philosophically that's a good idea. And my research over the last few years has been reconciling what micro parameters imply about the macro, and why it's not showing up in the macro data. So some people are listening.
Speaker 1:
It's praiseworthy, but suppose I figured out how to do something right. Do I really have to spend three years going through 77 papers that all screwed up? Well, Paolo screwed up this way, and Valerie screwed up that way. It takes forever. I just want to show how I did it right.
Speaker 3:
Well, except that if referees, rather than making you do ten robustness checks, instead said: tell us why your estimate is different, at least from this leading estimate.
Speaker 1:
Like I said, I'm arguing for knowledge in smaller chunks. So why did Valerie get her result, and why is Valerie's result different from Paolo's result? Those are two different chunks. They don't have to all be in the same paper.
Speaker 3:
But see, then you get this thing... no, and I see all these people that say, oh, well, so-and-so got this result and Raymond Berry got this result, dah. And I say, but we have this long section saying why those other results weren't right. But nobody reads the papers. And problem number...
Speaker 1:
One: nobody reads the papers.
Speaker 2:
I think this is the right grumpy group to present a grumpy paper to. One thing that people have done in the literature: I was kind of entranced by this Guggenberger paper, which he presented as a job talk at Michigan, although he ended up not coming, that tried to actually incorporate the model selection process into the standard errors. He's writing in the context of what in Heckman land we call Durbin-Wu-Hausman tests; I think at MIT they're just called Hausman tests. And if you remember, back in the day people would say: okay, I am interested in the effect of X on Y. I think X might be endogenous, and I'm going to do a statistical test; I have a candidate instrument. And so the DWH test is going to tell me, relative to the null of exogeneity, which is a very friendly null, whether I need to use the instrument or not, based on how different the OLS and IV estimates are. And then if I reject the null of exogeneity, I report the IV estimates; if I do not reject it, I report the OLS estimates. And what Guggenberger wants to do is to draw a circle around that whole process, call the whole thing the estimator, and figure out its sampling distribution. And that's a mess, right? It includes the model selection step.
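A minimal simulation sketch of that idea, as I read it (a toy data-generating process and a crude plug-in test statistic, not Guggenberger's actual setup): treat the pretest-then-report rule as a single estimator and trace out its sampling distribution.

```python
import numpy as np

rng = np.random.default_rng(0)
beta, n, reps = 1.0, 500, 2000
pretest_estimates = []

for _ in range(reps):
    z = rng.normal(size=n)                 # instrument
    u = rng.normal(size=n)                 # unobservable
    x = z + 0.3 * u + rng.normal(size=n)   # mildly endogenous regressor
    y = beta * x + u
    b_ols = (x @ y) / (x @ x)
    b_iv = (z @ y) / (z @ x)
    # Hausman contrast: Var(b_iv - b_ols) = Var(b_iv) - Var(b_ols)
    # under homoskedasticity; crude plug-in residual variance here
    sigma2 = np.var(y - b_iv * x)
    var_diff = sigma2 * ((z @ z) / (z @ x) ** 2 - 1.0 / (x @ x))
    t_stat = (b_iv - b_ols) / np.sqrt(max(var_diff, 1e-12))
    pretest_estimates.append(b_iv if abs(t_stat) > 1.96 else b_ols)

# The distribution of the whole procedure, selection step included
print(np.mean(pretest_estimates), np.std(pretest_estimates))
```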
Speaker 1:
A lot of this happened in macro: there was a fashion for pretesting for unit roots and then imposing that functional form. And some of us wrote essays on why that's a terrible idea; at least put it all in one big thing, which nobody does. But I thought: don't econometricians know that pretesting and then imposing a functional form is a bad idea?
Speaker 3:
This is why we always run everything in log levels now, and so we encompass cointegration, unit roots, everything, because of this pretesting issue.
Speaker 1:
Well, we don't always do that, but yes,
Speaker 3:
My handbook chapter tells people they should be,
Speaker 1:
Valerie is always right.
Speaker 2:
I just chose that as an example. In the machine learning literature, this is a thing that people are trying to do, because machine learning, I like to call it algorithmic model selection, because I think that's a more descriptive way of talking about it. If you can incorporate the model selection algorithm's activities into the ultimate standard error, then you have partly solved this non-sampling variation problem.
Speaker 1:
That's what bootstraps are great for.
Speaker 2:
Yeah. There's a little literature, bigger outside of economics, but there is a little literature in economics, where people do the following exercise: they take a data set and send it to a bunch of different teams of researchers who are asked not to communicate with one another, and they're told to come up with an estimate of the effect of X on Y and then report back. And sometimes they'll also be asked to record their decision process about how they made all the non-sampling choices. There are three examples of papers here that do that. I think that's pretty cool. And I think we learn from that that people make different choices, that everybody's prior about which way to go in the garden is not the same. And indeed, you could think of these surveys as giving you weights for the paths through the garden, sort of empirically generated weights.
What choices do people make at the different forks in the garden? There are papers, and this is the direction we're going, as you've probably figured out: we're going to do all the paths in the garden, or at least a large number of paths in the garden, in our empirical application. There are some other papers that do that; again, this is also more common outside of economics. And a part of this paper that is incomplete is trying to find all the bits of literature in different places where people have thought about these ideas and tried to solve this problem, because they're scattered all around in different literatures. Some studies call this a multiverse analysis; there's a Wikipedia page on multiverse analysis, and we can have a discussion about whether we like gardens or multiverses better. And then they plot, typically, and I'm going to show you, a specification curve.
I'm going to show you a PDF of estimates from doing things a whole bunch of different ways; that's basically what a specification curve is. And you can kind of get a sense of: are there some outlier estimates over here and the rest over there? Is it smooth? Blah blah blah. So this is not a new idea, and there are issues with it that have probably already come to your mind, but it's very rare in economics to think about it this way. And so partly this is a paper that's saying: some folks in other places have thought about this harder than we have, and we should start to think and do more about it. Alright, let me now turn to... you can ask your question. Yes, absolutely.
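For concreteness, a minimal sketch of the specification-curve computation (synthetic data and invented design choices, purely to fix ideas): enumerate every path through a small garden, estimate each one, and look at the sorted distribution of estimates.

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(1)
n = 1000
quality = rng.normal(size=n)
ability = 0.5 * quality + rng.normal(size=n)    # confounder
earnings = 0.2 * quality + 0.5 * ability + rng.normal(size=n)

def estimate(trim_outliers, log_outcome, control_ability):
    y = earnings.copy()
    keep = (np.abs(y) < np.quantile(np.abs(y), 0.99)
            if trim_outliers else np.ones(n, dtype=bool))   # choice 1
    if log_outcome:                                          # choice 2
        y = np.log1p(y - y.min())
    cols = [np.ones(n), quality]
    if control_ability:                                      # choice 3
        cols.append(ability)
    X = np.column_stack(cols)
    coef, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
    return coef[1]                        # coefficient on quality

curve = sorted(estimate(*path) for path in product([False, True], repeat=3))
print(curve)  # 8 estimates; plotted sorted, this is the "curve"
```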
Speaker 6:
To what extent is that strategy... I think Susan at one point had this machine learning method to do robust sensitivity analysis. They were sort of using the machine, the computer, rather than the RA. Her idea was very much like that: the RA makes limited choices, but in principle you could imagine having all the choices. Is that the same thing? I mean, except she was suggesting how to do it practically.
Speaker 4:
Yeah,
Speaker 6:
That's the same thing.
Speaker 4:
Yeah. Can I ask one more question?
Speaker 2:
And you should.
Speaker 4:
About approaches that are now very popular, these clustering approaches, or hidden model selection built into canned routines: if you want to fit a curve with multivariate adaptive regression splines, or with multivariable fractional polynomials, which are not polynomials after all because the powers are not integers, what Stata does is fit 30,000 models, the combinatorics of all the possible chosen powers and variables, but then it reports standard errors for the estimates of the chosen specification. Nothing against Stata, but it reports the estimates of the chosen specification without taking into account that there was a model...
Speaker 2:
The model selection process.
Speaker 4:
So even the practice of the commonly used, fantastic software that we all use really skews the inference. I have no idea how to interpret the number.
Speaker 2:
Well, knowing where it came from, that helps. But yeah, it's a standard error after a whole bunch of model selection. And most of our papers are filled with standard errors after a whole bunch of model selection where we don't actually know the process that led to the selected model.
Speaker 4:
And where does this type of concern fit within all of the concerns that you have? Would you call it a specification or model selection concern?
Speaker 2:
I think that's one dimension of non-sampling variation: how you decide. Suppose you have 50 variables that you think collectively suffice for conditional independence, in a study that's going to presume conditional independence, but you want to use an algorithm to figure out the right functional form for those 50 variables, or in some cases, whether you even use a variable at all, in a kind of lasso sense. There's a whole bunch of algorithms out there to do that: there are lasso-based algorithms, there are arboreal algorithms, forests and trees and yada yada. Which one you pick is itself a choice, and then it's going to pick the specification, and that choice is also going to differ across samples. So there's both the algorithm choice and the sampling variation in the algorithm's choice, which is what Guggenberger is trying to get at with the Durbin-Wu-Hausman test: in repeated random samples, if you're at the margin of passing or failing the test, sometimes you're going to report the IV and sometimes you're going to report the OLS. And that's sampling variation that is generating the non-sampling variation, because the non-sampling choice is based on the data. If that makes sense.
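One way to draw the circle around the whole procedure, sketched here under strong simplifying assumptions (toy data; lasso standing in for whatever selection algorithm is actually used), is to put the data-driven selection step inside each bootstrap replication, so the reported variability reflects both estimation and selection:

```python
import numpy as np
from sklearn.linear_model import LassoCV, LinearRegression

rng = np.random.default_rng(2)
n, k = 400, 20
X = rng.normal(size=(n, k))                  # candidate controls
treat = X[:, 0] + rng.normal(size=n)         # "treatment" of interest
y = 1.0 * treat + X[:, 0] - X[:, 1] + rng.normal(size=n)

boot_effects = []
for _ in range(200):
    idx = rng.integers(0, n, size=n)         # resample rows
    sel = LassoCV(cv=5).fit(X[idx], y[idx])  # selection INSIDE the loop
    keep = np.flatnonzero(sel.coef_ != 0)
    design = np.column_stack([treat[idx], X[idx][:, keep]])
    fit = LinearRegression().fit(design, y[idx])
    boot_effects.append(fit.coef_[0])        # coefficient on treatment

print(np.std(boot_effects))  # an SE that includes the selection step
```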
Speaker 4:
No, just to conclude: if I think about what BLM are doing, they are pushing the frontier econometrically. So I don't think we know how to account for that type... you assume that these types of firms or units are known. I guess what I'm saying is that the econometrics frontier is always moving, but the practice, software-wise, of what we often do is not totally...
Speaker 2:
Kosher, yeah, agreed. So I'm going to give a couple of examples now of empirical things that made me interested in this topic. The first one goes way back. I worked on the Job Training Partnership Act in graduate school, and the Job Training Partnership Act was the big federal training program in the eighties and nineties. It had an experimental evaluation in the late eighties. And before that... so I've listed the genealogy of federal training programs up there. The CETA program was the second one; the original one was MDTA, which was in the sixties and was generated by concerns about the robots taking all our jobs, 65 years ago. Yeah. Anyway, for the CETA study, DOL had this idea, which is a reasonable idea: they created this amazing data set of surveys of CETA participants and a bunch of people drawn from the CPS, matched SSA administrative data to both groups, and called that the Continuous Longitudinal Manpower Survey.
And then they hired a bunch of different research consulting firms to evaluate CETA using the same data set, and they came out with wildly different answers. And because it's the same data set, the wildly different answers cannot be due to sampling variation. In fact, there's a survey of those estimates by Barnow, and there's this other paper commissioned by DOL, after they got the round of estimates that didn't all look alike, by Dickinson, Johnson, and West, that identifies specific non-sampling choices that mattered. So, for example, the SSA earnings come in calendar-year chunks. If somebody participates in CETA in July, is that calendar year in the before period or the after period, or should it just be thrown out, or what? And it turns out different firms made different choices about that, and it matters for the estimates, because of the well-known Ashenfelter dip.
So I was impressed by that as a youth. Heckman and I wrote this paper in 2000; it was kind of buried in one of those yellow Bureau books, back before they were posted on the internet. The idea behind it was that no one had ever really evaluated the same program with an RCT more than once, right? There are a couple of examples of similar programs: the NIT experiments and the UI bonus experiments are similar programs evaluated by RCTs. But when you design an RCT... even in an RCT, you think, oh, well, there are limited design choices, right? You're not picking a non-experimental strategy or anything like that. But there's still a bunch of design choices. And so the idea behind the paper we wrote was, using the data we had from the one JTPA experiment, to vary some of the design choices, like which earnings variable you use,
which sites you include, how you deal with outliers in the earnings. And we show, for example... you may recall that the earnings estimates for the male youth in that experiment were negative and statistically significant, which did not please DOL. It's easy to make them go away by choosing a different method of dealing with outliers. Perhaps it is to Abt Associates' credit that they didn't do that; perhaps it's not. I'll leave that question to you, blah blah blah. So that impressed me: we could get the numbers to move around a lot, we could get stars to come and go, by doing these fairly simple things, the variation you would have observed if you had done multiple RCTs and handed them out to different firms to evaluate the JTPA program. Finally, and this is getting us closer to our empirical example: back in the day I wrote some papers on the effects of college quality with my friend Dan Black, who's at the Harris School, and my graduate school buddy Kermit Daniel.
And we used the NLSY79 to do that. Fast forward 20 years, and now I'm writing papers about college match with my Michigan student, Eleanor Dillon. And she did an exercise that I thought was super interesting, because it turned out that we had made a bunch of different design choices: relatively recent me, and relatively less recent me and co-authors, had made different design choices. For example, I had completely forgotten that the Black, Daniel, and Smith paper looks at the effect of the college that you last attend, whereas Dillon and Smith look at the effect of the college that you first attend. And that's because Black, Daniel, and Smith are thinking about it in a sort of Mincerian-equation way, and Dillon and Smith are thinking about it in a treatment-effects kind of way, where the college you end up at is partly affected by the college you start at. And that actually makes a difference to the estimates, it turns out. Anyway, this impressed me, this exercise of going one step at a time from my past self to my current self and our different design choices. Alright, let me tell you about our worked example here. We're going to look at the effects...
Speaker 7:
Yes, please. I guess I'm thinking: you're motivating it as, there are all these different choices you can make along the way, and what's the kind of variation in the result you get because of that. I'm wondering how this interacts with publication bias and the p-hacking version of that. Why shouldn't I think of it as: I'm a researcher, I sit down with this data set, I have these choices I could make, and I'm searching for the one that makes publication the most likely. And so what if we thought about the variation in that kind of best single result, which in turn could be thought of as a consequence of the sampling variation?
Speaker 2:
Hold that thought for about five slides. Alright, so, worked example: simple stuff. We're going to look at the effect of college quality. You may not believe that conditional independence holds here; that's fine with me. When I presented the Dillon and Smith paper, the 2020 paper, at Heckman's seminar at Chicago in the fall of 2017, afterwards he said, those are really interesting conditional means. And I said, okay, I'll take that. For the purposes here, it doesn't really matter whether you buy the identification strategy or not. So this is going to be a selection-on-observed-variables, or conditional independence, exercise. We're going to regress two outcomes on quality (I'll talk about how we measure that in a second) and a whole bunch of conditioning variables. And then we're going to interpret the coefficient on college quality as the causal effect of college quality.
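In equation form, the baseline specification described here is (my notation, not necessarily the paper's):

$$
Y_i = \beta_0 + \beta_1 Q_i + X_i'\gamma + \varepsilon_i,
$$

where $Y_i$ is six-year graduation or earnings, $Q_i$ is the college quality index, $X_i$ collects the conditioning variables, and $\beta_1$ carries the causal interpretation under conditional independence.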
We're going to use the NLSY97, so this is basically the same data used in the Dillon and Smith 2020 paper on college match. That means that we're going to restrict the sample to people who have a high school diploma or a GED, who start college right away rather than going away for five years and then coming back, who are interviewed at least five years after they start college (they don't have to be five adjacent years), who have a valid college quality index, and who have a valid ability measure. So this is the NLSY: one of the features here is that we have the Armed Services Vocational Aptitude Battery, the ASVAB, which is the test that the military uses both to let people in and to decide what they do once they get in. The military used both the NLSY79 cohort and the 97 cohort to norm the ASVAB, so they could see how it went with a random sample from the population, rather than just the people who show up at a recruiting office.
We're not going to focus on college match here. The Dillon and Smith 2020 paper is all about the interaction between student ability and college quality, trying to estimate a surface and learn something about the potentially good or bad effects of mismatch of various sorts. We're not doing that here; we're just estimating the main effect of college quality. Just two outcomes: graduation within six years, since most people don't finish in four years now, and six years is actually the most common measure that people use in the literature, and earnings. In the 2020 paper and in other work, we show that college quality makes you more likely to finish in four; the effect of quality is stronger for finishing in four than it is for finishing in six. If you take the path through the garden that Dillon and Smith choose, but don't include the stuff about match, then what you get are the numbers here.
So if you go from the very worst college, thinking about percentiles of college quality, if you go from the 1st percentile to the 100th percentile, your probability of graduating within six years increases by 22 percentage points, which is pretty big but not gigantic. I think people don't always realize... we spent a lot of time in the Obama administration, I think reasonably, worrying about fly-by-night private colleges that were exploiting the design of the student loan system, but there are a lot of public four-year directional and vocational schools that have 30 and 40 percent completion rates that maybe we should worry about too. And then for earnings, you can see that if you go from the bottom to the top, you get $16,000 extra per year, we estimate. And you can see the standard errors: they're not huge, but they're not small either. So we're going to vary, time permitting, and maybe we'll have time for all four...
We're going to think about four different design choices. The first design choice is how we construct the college quality index. This Black and Smith paper, I call it the multiple proxies paper, that's the term in the title, thinks about constructing a college quality index using measures of inputs. College quality is hard, and it's harder than estimating elementary school quality, because in elementary school everybody's kind of supposed to learn the same thing, and you can give them a test at the beginning and a test at the end and kind of get a sense. You could define
Speaker 1:
It by how much they raise the incomes of the people that go to it, but I think then
Speaker 2:
That would be a little circular, yeah. So we're not going to do that. That's a central problem. And then there's this narrower problem that people go to college to learn different things, so it's not clear... people have tried to do a test at the end of college, and you can like that or not; that's not what we're going to do. We're going to argue that all these variables from the Integrated Postsecondary Education Data System, IPEDS, and the U.S. News & World Report data, which in our data are those two together, are proxies, measurement-error-ridden proxies, for some underlying latent quality. Now, we can have a discussion about whether universities have more than one dimension; I think they do. The literature treats them as having one dimension, and so we're starting there, but trying to do it better than the literature. Most of the literature on college quality just uses the average SAT score of the entering class. And it's actually not the average; we call it the pseudo-median, because what you get from IPEDS is the 25th and 75th percentiles, and people take the average of those. Yes, I know, it's kind of a wacky number.
Speaker 1:
It matters because the question could be: does attending a college of higher measured quality, by X measure, raise income? Done. It doesn't matter if the measure is an imperfect measure of something else.
Speaker 2:
Well, but if you just use the SAT score and then interpret it as this underlying one-dimensional thing, then you have a lot of attenuation bias, because there's a bunch of measurement error in the SAT score relative to the one-dimensional thing. And so the argument that we make in the 2006 paper is that you want to combine them.
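The attenuation logic is the classical errors-in-variables result: with a single noisy proxy $q_i = Q_i + e_i$, classical error, and (for simplicity) no controls, the OLS slope satisfies

$$
\operatorname{plim} \hat{\beta}_1 = \beta_1 \cdot \frac{\sigma_Q^2}{\sigma_Q^2 + \sigma_e^2},
$$

so the estimate is shrunk toward zero by the noise share; combining several proxies into an index raises the signal share and reduces the bias.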
Speaker 1:
What's the effect on my income of attending a college where the average SAT is, say, 750? Bingo, I'm done. I don't care what it's a measure of.
Speaker 2:
I think you do.
Speaker 8:
But
Speaker 1:
Yes,
Speaker 8:
A quick question on the share of faculty who are tenured or tenure-track. Why are you looking at the share of faculty rather than the number of courses taught by each type, or even the number of students? Because normally tenure-track professors don't teach introductory courses (here being a big exception), so the interaction between the students and the faculty tends to be weighted much more heavily toward the people who are not tenure-track than toward the ones who are.
Speaker 1:
Obviously the tenure track ones are better at teaching undergraduates.
Speaker 8:
They may be better; I never said so. I just said it would be interesting to see.
Speaker 2:
So these are a bunch of measurement-error-ridden measures of college quality; that is the argument that we make. And you have just discussed the measurement error inherent in one of the measures, and I'm on board with that. These are what's in IPEDS, which is collected by the feds, and/or in the U.S. News & World Report data. And the argument made in the Black and Smith 2006 paper is that if that's how you want to think about these things, as proxies with classical measurement error for the thing you care about, then what you want to do is combine them, right? If you have multiple proxies (the title of that paper), the efficient thing, in a sort of statistical, informational sense, is to combine them together. And so we're going to construct indices using principal components of combinations of three, four, and five of the proxies, for all possible combinations.
And then we're also going to look at SAT alone and expenditures alone; those are used in the literature by themselves. That gives us 184 indices. Now you might say, why don't you construct an index with all of them? The reason we don't do that is that there is item non-response in IPEDS. Suppose I wanted to use all eight or nine measures: with listwise deletion I'd throw out a bunch of colleges, because there's a whole bunch of colleges that don't have all of them. That's why we've done this compromise of three to five.
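A sketch of what that index construction might look like, with placeholder proxy names and simulated data standing in for the IPEDS and U.S. News measures: the first principal component of every 3-, 4-, and 5-proxy combination, plus SAT and expenditures alone, yields 184 indices when there are eight proxies.

```python
# Sketch of the index construction described above. Proxy names are
# placeholders, not the paper's exact variable list; data are simulated.
from itertools import combinations
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
proxy_names = ["sat", "expend", "fac_salary", "rej_rate", "tenure_share",
               "fac_student", "retention", "lib_spend"]   # 8 stand-in proxies
X = rng.normal(size=(500, len(proxy_names)))  # stand-in for college-level data

indices = {}
for k in (3, 4, 5):
    for combo in combinations(range(len(proxy_names)), k):
        Z = StandardScaler().fit_transform(X[:, list(combo)])
        # First principal component serves as the quality index
        indices[tuple(proxy_names[i] for i in combo)] = (
            PCA(n_components=1).fit_transform(Z).ravel()
        )

# Two single-proxy "indices" used on their own in the literature
indices[("sat",)] = X[:, 0]
indices[("expend",)] = X[:, 1]

print(len(indices))  # C(8,3) + C(8,4) + C(8,5) + 2 = 56 + 70 + 56 + 2 = 184
```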
Speaker 9:
Isn't that a design choice?
Speaker 2:
It's a design choice, that's right. And that illustrates the point that, unlike that pretty garden figure that Heather made for this talk, oftentimes (not always, but oftentimes) it's not a simple binary choice at a given fork. It's actually a sort of pseudo-continuous choice.
Speaker 9:
I guess I'm struggling a little bit. It seems like you're taking all of the power away from reasonable judgment. You could add "is the school's mascot a wolverine" to this list as well. We obviously wouldn't do that; we don't think of that as a good measure of college quality. Some of my former students who attended that university would say it's a very good measure of college quality. But again, I'm struggling a little bit to think about what we are gaining from this exercise, when I think we expect and ask researchers to use a reasonable amount of fair discretion.
Speaker 2:
So this is what makes this hard, and this is part of why I think historically we have focused so much attention on standard errors: putting aside the issue of what the population is, it's always clear what you mean. I draw repeated random samples, the numbers move around, and I want to quantify that. Here it's a bit less well-defined. There's a bunch of design choices. Some of them are binary; some of them are like logit, probit, linear probability model. But you could say, oh wait, what about Manski's maximum score estimator? That should be in there too. And there was a whole genre of semiparametric binary choice estimators that came out around 1990, when that was the hot thing to do. So I agree. Go ahead, John.
Speaker 1:
I still think you're focusing on one tiny bit of the big issue. The big issue in the equation you just put down is: do the controls make any sense, or do they completely ruin the causal estimate? The classic is that you don't put industry in a wage equation; that's the whole point. The error term is unobserved individual heterogeneity. To what extent are smart people choosing to go to better colleges? Those are the elephants in the room in this equation, next to which how I measured good college seems to be a tiny issue.
Speaker 2:
Well, let's see. Let's see. We're going to look at 184 estimates, and we can see how important it is.
Speaker 4:
I have a low level of, oh, sorry.
Speaker 1:
But some of those aren't a question of the number; it's a question of whether you've destroyed the meaning of the estimate. If I run left shoe sales on price and hold right shoe sales constant, that's a great regression. The R-squared is wonderful, the t-statistics are high, but it's just meaningless.
Speaker 2:
Yes.
Speaker 1:
And that's not just kind of a random choice of specification here
Speaker 2:
Is whether or not the researcher knows what they're doing a design choice? I don't know. Well,
Speaker 1:
Nobody knows what they're doing. You've read these papers.
Speaker 5:
Leaving shoes aside, when you talk about college quality, what do you mean by that? I mean, you could think of the value added of an institution, or you could think of the total output, what the people coming out look like, which is some combination of the inputs they hire and the value added. Or you can think that individual choice necessarily means that smarter people choose better colleges. How do you want to define the concept that you're after?
Speaker 2:
I think of a higher-quality college as a college that produces more of the outcomes that we like. And so we're going to look at two outcomes here that we like: earnings and degree completion. And we're going to argue, following the literature, that there's this one-dimensional thing. I don't think that's really right. If there are students in the room looking for topics, multidimensional college quality is a topic you could do. It'd be hard; there's a reason people haven't done it. And I guess when Dan and I first started our papers back in the eighties, the original thing we wanted to try to do was estimate separate effects of expenditures per student and student quality as measured by average test scores. Because that seemed to us a really good question for a university administrator, right? You get some money. At the margin, do you want to spend it on facilities and faculty, or do you want to spend it recruiting higher-SAT-score students?
Speaker 5:
That's exactly the question I mean.
Speaker 2:
And the problem
Speaker 5:
For most purposes, if you're talking about college quality, you'd want to distinguish between pure selection and sort of production in some sense.
Speaker 2:
Well, but I guess I think of there being an underlying production function for student outcomes, for which faculty quality and facilities and peer quality are all inputs, right? I mean, the discussion in the literature sort of says college is a weird industry in the sense that the customers are also an input into the production function.
Speaker 5:
So implicitly you're saying that people with higher SAT scores are better able to choose quality faculty?
Speaker 2:
I don't think I'm saying that. I think I'm saying people work harder and learn more when they're surrounded by smart people. Yeah, exactly.
Speaker 4:
Can I ask very quickly, because econometrically this looks like it has a natural factor structure to it. So I'm thinking about the measurement-error type of model, not the deconvolution methods. But the idea is that this has a polychoric quality to it: some of those measurements are discrete. So is the single-index representation the most natural to you? I just wanted to ask how you thought about it. Otherwise I would think, okay, I specify a model, then fit factor models and go that way. But of course the measurements are typically continuous.
Speaker 2:
One could think of the multiple proxies paper as prior to some of that stuff.
Speaker 4:
That's why I'm asking.
Speaker 2:
And a lot of our cites come from that stuff. They sort of say, oh, we're doing something in the spirit of this. This is not fancy relative to some of the measurement models that have come along since, right? It's a very simple model: we've got a bunch of proxies that have classical measurement error, and we're going to combine them and see if we do better. And in the Black and Smith paper, we show that you do do better, that there is attenuation bias. If you use the preferred index in that paper, relative to just using the average SAT score, the effect of college quality is like 20% higher. That's not huge, but it's not trivial either. And the SAT score is, we argue, the least measurement-error-ridden of the proxies. So in that sense the literature is doing the right thing, if you're only going to pick one of them, by picking the average SAT score. But there are some gains in reduced attenuation bias from using the index instead, which is consistent with the measurement error story.
Speaker 3:
Do you want to go? She had a question.
Speaker 10:
Yeah. I'm wondering if, to a degree, our ability to derive estimates from a circuit of specifications has kind of exceeded our ability to interpret what it is that we are deriving from these estimates. We don't have 184 different causal mechanisms that we can think of, or 184 different interpretations, or certainly 184 different policy recommendations that derive from each of these estimates. And to that end, I'm kind of thinking about whether this helps or hurts credibility, in the public-facing sense and in the sort of domain-internal sense.
Speaker 2:
Well, I was actually thinking for a moment there, like 10 minutes ago, that I would get through all four of the dimensions. That's really not going to be the case, but that's fine. Let me get through this one, and I think you'll see whether what I say about it helps to answer the question, because that's a good question. Okay, here is a specification curve. This is a density of the 184 estimates, using the 184 different indices (including two of the individual proxies by themselves), for the effect of college quality on graduation within six years. I want to highlight three aspects of this curve. The first aspect is that it has a non-trivial variance, right? Just changing this one thing, in ways that are not unreasonable, I would argue, moves the estimate from 0.06 to 0.33. And again, that 0.33 means the effect of going from the 1st percentile college to the 99th percentile college is to raise the probability of six-year graduation by 33 percentage points.
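As an illustration of the kind of figure being described, one common way to draw a specification curve is to sort the estimates and plot them with their confidence intervals. The numbers below are simulated placeholders, not the paper's estimates:

```python
# Sketch of a specification curve: sort the 184 index-specific estimates and
# plot each with its confidence interval. Estimates here are simulated.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(2)
est = rng.normal(loc=0.22, scale=0.05, size=184)  # stand-in estimates
se = np.full(184, 0.04)                           # stand-in standard errors

order = np.argsort(est)
x = np.arange(184)

fig, ax = plt.subplots(figsize=(8, 4))
ax.errorbar(x, est[order], yerr=1.96 * se[order], fmt=".", alpha=0.5)
ax.axhline(0, color="k", lw=0.8)
ax.set_xlabel("specification (sorted by estimate)")
ax.set_ylabel("estimated effect on 6-year graduation")
plt.show()
```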
Speaker 1:
Your quality measure was not the fraction of students who graduate?
Speaker 2:
No, no, no. You saw the list there, John, right on the slide. None of those. Those are all inputs, not outcomes. The second feature that I want to highlight: if you remember, going back here, our estimate was 0.223 from our preferred path through the garden. So 0.223 is a little bit to the right of the center of that distribution, but not far. This is kind of getting to the question that was just asked. One way people use this kind of information is they say: I'm redoing somebody's paper, I do this multiverse analysis of it, and I find that their estimate is way out in the tail. Then I might be inclined to say, oh, I think they peeked. Whereas if it's nice and in the middle, or towards the lower side, then you're kind of happy, or you think they didn't peek, blah, blah, blah.
Speaker 1:
We could take that not as a measurement question but as: oh, now we've found the dimensions of college quality that matter for graduation.
Speaker 2:
You could think about it that way too. Yeah, I agree. Oh, the third point I wanted to make, and this is nice for our paper and, I think, my past papers, is that even the lowest estimate is still substantively and statistically different from zero, right? So in that sense, the first-order qualitative conclusion from the literature holds: if you're willing to buy the conditional independence assumption, blah, blah, blah, college quality has a positive effect on graduation probabilities. It may be small and it may be pretty big. Does that help answer your question? Alright, so here's the distribution of estimates for earnings. This is a higher-variance outcome, right? It has a big residual variance. It goes from 16,000 to 18,000. Once again, even the lowest estimate is different from zero in a substantively meaningful way. Again, pretty big heterogeneity.
Speaker 6:
Did you find that the indices that imply a higher increase in one outcome and the ones that imply a higher increase in the other are the same?
Speaker 2:
I know that in the sense that we produced those estimates, but not in the sense that I remember it. Okay.
Speaker 9:
Yes. This exercise treats each of the 184 combinations of inputs equally, which means you could also throw junk onto the list and it would get treated equally. But I think the concern that I have is: wouldn't the proper exercise be to assemble a team of experts to pick the three, four, or five inputs for their indices, and ask how close your chosen estimate is to the distribution of what a plausible combination would be?
Speaker 1:
Those experts are going to pick things that are correlated with graduation rates and income, just as these previous ones are.
Speaker 2:
We have a five-minute rule at Wisconsin. I'm going to invoke that, even though we're not at Wisconsin. We'll talk about it over dinner or something. This does a little bit of what you wanted: this also shows the summary statistics if we only use indices that have at least four proxies. That does raise the mean, as you'd expect, because they should have less measurement error on average, and it reduces the standard deviation. So there you go. Alright. We do three other things that I'm just going to breeze by. We look at item non-response, and there's a bunch of different ways to do that in the literature; we don't do them all. The main thing you learn is that listwise deletion is evil, which is the working-paper title of a paper by Gary King. The published title is much less entertaining, but if you go download the file from his website, it's called evil.pdf.
So that's what you learn from that. Then we talk about how to deal with the earnings variables: we do winsorizing, we do levels, we include the zeros, we drop the zeros, we do logs, blah, blah, blah. That moves things around. And then, in the spirit of the times, we do some algorithmic model selection for how to include the set of conditioning variables that we argue suffices for conditional independence. That also moves the numbers around, although, to be honest, not as much as I was expecting. And then this is our four-dimensional multiverse, I guess. So if we take all the paths involving those four choices (this is not all the paths involving the four choices; it's all the individual estimates, but not all the paths), then you can see, and this is understating what you would get from the full multiverse analysis, that the standard deviation of those estimates from the non-sampling variation is almost the same as the standard error from the preferred path from Dillon and Smith.
And for the earnings outcome, the standard deviation from the non-sampling variation is a little bit smaller, but of the same order of magnitude. And so one of the central points I wanted you to take away from the worked example using college quality is that even in an environment where sampling variation is going to be really big, because the sample is small and the outcome, in the case of earnings, is a high-variance outcome, the non-sampling variation, even with the very limited subset of it that we're displaying here, produces variability of the same magnitude.
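A minimal sketch of that headline comparison, with placeholder numbers rather than the paper's: enumerate the paths across the four design dimensions, collect one estimate per path, and compare the standard deviation of those estimates (non-sampling variation) to the preferred path's standard error (sampling variation):

```python
# Sketch of the headline comparison: estimates from many paths through the
# garden (random placeholders here), with their standard deviation compared
# to the preferred path's standard error. All numbers are illustrative.
from itertools import product
import numpy as np

rng = np.random.default_rng(3)

# Four design dimensions, each with a few alternatives (placeholders)
paths = list(product(range(184),   # quality index
                     range(3),     # item non-response handling
                     range(4),     # earnings coding (levels/logs/zeros/winsorize)
                     range(2)))    # conditioning-variable selection

# Stand-in for "estimate the model along this path"
estimates = np.array([0.22 + 0.05 * rng.standard_normal() for _ in paths])

nonsampling_sd = estimates.std()
preferred_se = 0.05  # placeholder for the preferred path's standard error

print(f"paths: {len(paths)}")
print(f"non-sampling SD: {nonsampling_sd:.3f}  vs  sampling SE: {preferred_se:.3f}")
```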
Speaker 1:
All with the same controls and fixed effects and all the rest?
Speaker 2:
Yeah, except for that last one.
Speaker 1:
Okay.
Speaker 2:
Yeah. Well, I'm going to end now, actually; I don't have time for more reflections. We haven't answered this question yet about the one-step-away stuff; we're going to do that. And full disclosure: this paper does not have a pre-analysis plan, and in some ways I feel badly about that. Maybe we should have one. And this has come up in some of the questions, like the question over here from Jacob: it's easy to draw that pretty picture with all the binary forks, but that's not how it is. There are a lot of alternatives. Maybe we don't like them all the same. Maybe we don't like some of them at all. How do we figure out a kind of norm to get people to do better than we're presently doing? So not just doing four or five one-step-away things, or maybe a hundred and only reporting four or five, but something that's more systematic and more quantitative, when it's just fundamentally not as well-defined as the standard error for the sampling variation. I don't have an answer to that, but it seems like something we should be thinking about. So I guess I will just end with the last bullet here, which is step one of the twelve. Maybe some people here are not familiar with twelve-step programs: the first step of the twelve is to admit that you have a problem, and if I have done nothing else today, I hope to have convinced you to admit that we have a problem. Thank you very much for your attention.
PARTICIPANTS
Jeffrey Smith, Valerie Ramey, John Cochrane, John Taylor, Annelise Anderson, Richard Coillot, Camille DeJarnett, Nick Gebbia, Siddarth Gundapaneni, Eric Hanushek, Ken Judd, Chris Karbownik, Evan Koenig, David Laidler, Charles Leung, Jacob Light, Megan Liu, Elena Pastorino, Armando Perez-Gea, Rocío Sánchez Mangas, Paola Sapienza, Richard Sousa, Bebel Vieira, Mike Wu