[–] Integralds (I am the rep agent AMA) 9 points (5 children)

WARNING: tl;dr post incoming (700 words)

I want to talk about replication. We've talked about replication before, but my views have evolved slightly.

Suppose that some researcher X has written a paper. What does it mean for you to "replicate" the results of that paper in economics?

Assume that the paper is using observational public-use data like the Penn World Table, or World Development Indicators, or FRED, or something like that. Something that is maintained, easily accessible, updated over time, and possibly subject to revisions.

Some public official (or your advisor) asks you to "replicate" the results in their Figures and Tables. What does that mean? What should it mean?

Here are the levels of replication I would distinguish among. They're not quite nested, but there is a clear theme as you go down the line.

  1. The author supplies you with their data. The author supplies you with their code. You run their do-file.

    • This is mere verification. Do the author's code and data produce the tables they say they do? It's important, but it's not replication.
  2. The author supplies you with their data. You write your own code, based on instructions in the paper plus any print or online appendices. You run your code on their data to attempt to match their tables.

    • This is verification+. Does the paper contain sufficient instructions to compute the numbers found in the authors' tables?
  3. The author supplies you with their code. You construct the dataset based on instructions in the paper plus any print or online appendices. You run their code on your data.

    • This is verification+. Does the paper contain sufficient instructions to construct the dataset the author used?
  4. The author dies and can't give you anything. You construct the dataset based on instructions in the paper plus any print or online appendices. You write code based on the instructions in the paper plus any print or online appendices.

    • You try to get as close to the author's data as possible. So if they use GDP from 1960 to 1995, and use the 1997 NIPA tables, you go to ALFRED and get GDP from 1960-1995, using the 1997 vintage.
    • This is replication. You've followed all the instructions in the paper; can you get the same numbers they did?
  5. Same as (4), but:

    • You use the most recent vintage of the data the author used. So if they use GDP from 1960 to 1995, obtained from the 1997 NIPA tables, you go to FRED and get GDP from 1960-1995, using the current vintage.
    • This is replication+. Do the author's results survive data revisions?
  6. Same as (5), but:

    • You use the most recent vintage of the data the author used, plus any more recent data. So if they use GDP from 1960 to 1995, obtained from the 1997 NIPA tables, you go to FRED and get GDP from 1960-2016, using the current vintage.
    • This is replication+. Do the author's results survive data revisions and extending the sample?
    • You could even run the authors' regressions with 1960-1995 data, then 1995-2016 data, then 1960-2016 data, and perform standard Chow tests for parameter stability (a minimal sketch of such a test appears just after this list).
  7. Same as (6), but:

    • You use as close an analogue as possible to the author's data, but not the same source. So if they use GDP across countries, 1960 to 1985, from the Penn World Table 3.0, you use GDP across countries, 1960 to 1985, from the 2016 edition of the World Development Indicators.
    • This is replication++. Do the author's results survive data revisions and extending the sample and using different measurements of the same underlying data?
    • You could do the same things the authors do, but with the new data source, and perform Chow tests across the two data sources.
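
The thread talks in terms of do-files, but as a language-agnostic illustration of the Chow test mentioned in (6) and (7), here is a minimal Python sketch. It is not anyone's actual code: the CSV path, the column names, the specification, and the 1995 break date are all placeholders. It runs the same regression on the pooled sample and on each subsample, then forms the standard Chow F-statistic for parameter stability.

```python
# Minimal Chow-test sketch for the "replication+" checks in (6).
# Assumes a tidy CSV with columns year, y, x1, x2 -- all placeholder names.
import pandas as pd
import statsmodels.formula.api as smf
from scipy import stats

df = pd.read_csv("replication_data.csv")   # hypothetical dataset
formula = "y ~ x1 + x2"                    # stand-in for the paper's specification

pre  = df[df.year <= 1995]
post = df[df.year > 1995]

fit_all  = smf.ols(formula, data=df).fit()
fit_pre  = smf.ols(formula, data=pre).fit()
fit_post = smf.ols(formula, data=post).fit()

k = fit_all.df_model + 1                   # estimated coefficients, including the intercept
ssr_pooled = fit_all.ssr
ssr_split  = fit_pre.ssr + fit_post.ssr
n = len(pre) + len(post)

# Chow F-statistic: does splitting the sample at 1995 change the coefficients?
F = ((ssr_pooled - ssr_split) / k) / (ssr_split / (n - 2 * k))
p_value = stats.f.sf(F, k, n - 2 * k)
print(f"Chow F = {F:.3f}, p = {p_value:.3f}")
```

The same pattern, run with the two datasets stacked and split by source rather than by year, covers the cross-source comparison in (7).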

Clearly there is a gap between (3) and (4). For (1) to (3), the author gives you something; from (4) onwards, you construct everything based on instructions in the paper and any print or online appendices. And as you go from (4) to (7), you move into a fuzzy region between "replication" and "robustness checks."
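
To make the (4)-versus-(5) distinction concrete, here is a rough sketch of "same vintage" versus "current vintage" retrieval against the FRED/ALFRED observations endpoint. The series ID, observation window, vintage date, and API key are placeholders, and the realtime_start/realtime_end parameter names should be checked against the current API documentation rather than taken from this sketch.

```python
# Sketch: pull a fixed historical vintage (level 4) vs. the current vintage (level 5).
# Series ID, observation window, vintage date, and API key are all placeholders.
import requests

BASE = "https://api.stlouisfed.org/fred/series/observations"
API_KEY = "YOUR_FRED_API_KEY"              # hypothetical key

common = {
    "series_id": "GDPC1",                  # swap in whatever series the paper actually used
    "observation_start": "1960-01-01",
    "observation_end": "1995-12-31",
    "api_key": API_KEY,
    "file_type": "json",
}

# Level 4: the data as the author would have seen it, at a hypothetical 1997 release date.
vintage_1997 = requests.get(BASE, params={**common,
                                          "realtime_start": "1997-07-01",
                                          "realtime_end": "1997-07-01"}).json()

# Level 5: whatever the series looks like today, over the same observation window.
current = requests.get(BASE, params=common).json()
```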

Usually when I'm replicating, I mean something like 5 and 6. Say Mankiw, Romer, and Weil (1992) uses cross-country data, 1960-1985, from the PWT 1.0. My first hunch at a replication would be to go to the current PWT, extract the 1960-1985 data, and run regressions on that data. Then I try 1960-2015 to see if anything changes. Then I may try 1985-2015 to see if the sample split is important.
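
For what it's worth, that hunch translates into very little code once a PWT extract is in hand. Below is a minimal sketch, assuming a local CSV of the current PWT with placeholder column names; the specification is a textbook MRW-style one, not necessarily the exact published regression.

```python
# Sketch of the level (5)/(6) workflow for an MRW-style cross-country regression.
# pwt_current.csv and its column names are placeholders for a real PWT extract.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

pwt = pd.read_csv("pwt_current.csv")       # hypothetical current-vintage PWT extract

def run(start, end, label):
    """Cross-sectional regression on country averages over [start, end]."""
    window = pwt[(pwt.year >= start) & (pwt.year <= end)]
    cross = window.groupby("country")[["gdp_per_worker", "inv_share", "n_g_d"]].mean()
    fit = smf.ols("np.log(gdp_per_worker) ~ np.log(inv_share) + np.log(n_g_d)",
                  data=cross).fit()
    print(label, fit.params.round(3).to_dict())
    return fit

run(1960, 1985, "Original window, current vintage (level 5):")
run(1960, 2015, "Extended window (level 6):")
run(1985, 2015, "Post-1985 only (sample-split check):")
```

Eyeballing how the coefficients move across the three windows is the informal version; the Chow test sketched above is the formal one.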

What am I missing from this schema?

How should I modify the schema for experimental or RCT-derived datasets?

[–] ivansml (hotshot with a theory) 1 point (0 children)

IMO, one should explicitly distinguish between replication and robustness checks (cf. Clemens, "The Meaning of Failed Replications"). In (1)-(4) you're checking whether the authors made any errors in their analysis; in (5)-(7) you're extending the analysis and reporting new results. If your results don't agree with the originals, the interpretation (and the authors' reaction) should obviously depend on which of the two you're reporting. There's a difference between "the authors slipped when copying Excel cells and thus their results are garbage" and "the original results seem fine, but don't generalize out of sample."

It gets even more complicated when discussing experiments, where replication often means redoing the experiment from scratch. My understanding is that the discussion about replicability in psychology or medical science really has more to do with publication bias and p-hacking than with posting do-files on the web.

[–] besttrousers 2 points (0 children)

> How should I modify the schema for experimental or RCT-derived datasets?

Mostly 7, right? For example, I think that the Duflo et al. Microfinance stuff falls into this.

It's possible that it could be number 8, especially when you start thinking about Mechanism Experiments. There are ways you could be testing the same theory or hypothesis, but using a very different experimental design.

Perhaps "8" is more about experiments that aren't close analogues to the author's data? ie, looking at external validity issues.

Blattman's Impact Evaluation 2.0 might be a good read: http://chrisblattman.com/documents/policy/2008.ImpactEvaluation2.DFID_talk.pdf

As is Blattman's Impact Evaluation 3.0: http://www.chrisblattman.com/documents/policy/2011.ImpactEvaluation3.DFID_talk.pdf

One of the big changes in the last few years that Blattman captures is the movement from "M+E" to "R+D". Hamilton captures it in Smarter, Better, Faster: The Potential for Predictive Analytics and Rapid-Cycle Evaluation to Improve Program Development and Outcomes.

[–] gorbachev 2 points (1 child)

I think robustness checks are a perfectly valid part of replication. If it turns out your results, say, only hold given a very precise set of controls -- and your paper hides this fact -- that's a problem. I bring this up because the most interesting replication efforts I've seen were of this variety. Don't really care if #2 or whatever doesn't quite work out -- I'd give the benefit of the doubt and wait for someone else to find Mike LaCour. But if it turns out your results were p-hacked real hard? I care a lot!

Edit: Don't forget literal replications - as in, literally re-running the experiment they ran. That's relevant in an RCT setting among others.
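
As a rough illustration of the point above about results that only survive one precise set of controls, a sketch like the following (placeholder file, outcome, treatment, and control names) just re-runs the headline regression over every subset of the candidate controls and reports the range of the coefficient of interest:

```python
# Sketch: check whether the headline coefficient survives across control sets.
# File name, variable names, and the candidate-control list are placeholders.
from itertools import combinations

import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("study_data.csv")                # hypothetical replication dataset
controls = ["age", "educ", "income", "region"]    # hypothetical candidate controls

estimates = []
for r in range(len(controls) + 1):
    for subset in combinations(controls, r):
        rhs = " + ".join(("treatment",) + subset)
        fit = smf.ols(f"outcome ~ {rhs}", data=df).fit()
        estimates.append((subset, fit.params["treatment"], fit.pvalues["treatment"]))

betas = [beta for _, beta, _ in estimates]
print(f"treatment coefficient ranges from {min(betas):.3f} to {max(betas):.3f} "
      f"across {len(estimates)} specifications")
```

If the sign or significance of the treatment coefficient flips depending on which controls go in, and the paper never shows that, that is exactly the kind of hidden fragility described above.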

[–] Ponderay (Follows an AR(1) process) 1 point (0 children)

Is that a qualitatively different thing? Saying a paper doesn't replicate due to a coding error is different than saying a paper has a flawed identification strategy.

[–] Integralds (I am the rep agent AMA) 2 points (0 children)

Followup because the post was getting too long.

A paper is said to be verifiable if (1) can be performed.

A paper is said to be weakly replicable if a graduate student can perform (4) and get the same numbers as the authors. That is, an interested, competent grad student, working alone, can reproduce the tables using the instructions in the paper.

A paper is said to be replicable if a grad student can perform (5) and get the same numbers as the authors.

A paper is said to be strongly replicable if a grad student can perform (6) or (7) and get reasonably close numbers. "Reasonably" may be defined statistically via Chow tests or economically via eyeball tests.