Should you trust your experiment results?

A talk by Amer Diwan

If the scope of the experiment is narrower than the scope of the claim, the experiment is unsound:

    scope of experiment < scope of claim  =>  unsound experiment

A sound experiment satisfies:

    scope of claim <= scope of experiment

i.e., the experiment is sound with respect to the claim.

There are two options for going from unsound to sound: (1) reduce the claim, or (2) extend the experiment. But reducing the claim can make the research boring.


Four fatal sins cause unsound experiments.

1. Ignorance (ignoring components necessary for the claim)

Example: ignoring the diversity or differences of subjects/benchmarks.

The ignored components systematically bias the experimental result.

And it is not obvious. Examples: ignoring Linux environment variables; ignoring heap size (when evaluating a garbage collector); ignoring profiler bias (a biased sampling method).
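The heap-size pitfall can be sketched with a toy simulation (the numbers below are invented, not real GC measurements): which collector "wins" depends on a factor the experiment ignored.

```python
# Toy model (invented numbers): run time of two hypothetical garbage
# collectors as a function of heap size, in milliseconds.
def runtime_a(heap_mb):
    return 100 + 0.05 * heap_mb   # low fixed cost, degrades with heap size

def runtime_b(heap_mb):
    return 120 - 0.01 * heap_mb   # high fixed cost, scales better

# An "ignorant" experiment fixes the heap at a single size...
print("A faster at   64 MB:", runtime_a(64) < runtime_b(64))

# ...but the conclusion flips at a heap size the experiment never tried.
print("A faster at 2048 MB:", runtime_a(2048) < runtime_b(2048))
```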

2. Inappropriateness (using components irrelevant to the claim)

This, too, is not obvious.

Inappropriate statistics (choosing the best 30 runs instead of reporting a confidence interval; being fooled by lucky outliers).
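A small sketch of why "best 30 runs" is an inappropriate statistic (run times simulated here with assumed Gaussian noise): the filtered mean is systematically lower than the honest mean with a confidence interval.

```python
import random
import statistics

random.seed(0)
# Simulated run times: true mean 100 ms with 5 ms Gaussian noise (assumed).
runs = [random.gauss(100, 5) for _ in range(200)]

# The sin: keep only the 30 fastest runs -- biased low by construction.
best30 = statistics.mean(sorted(runs)[:30])

# Sounder: mean over all runs with an approximate 95% confidence interval.
m = statistics.mean(runs)
half = 1.96 * statistics.stdev(runs) / len(runs) ** 0.5
print(f"best-30 mean:  {best30:.1f} ms")
print(f"all-runs mean: {m:.1f} ± {half:.1f} ms")
```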

Inappropriate data analysis: two distributions can have the same mean while one hides long-tail latency. The mean time of a layered system can be meaningless because most hits are much faster (cache hits) or much slower (cache misses) than the mean. Lesson learned: check the shape of the data before deciding which analysis to apply.
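The cache-hit/cache-miss point can be made concrete with invented latencies: the mean lands between the two modes, where almost no real request lies.

```python
import statistics

# Invented latencies for a layered system: 90% cache hits (~1 ms),
# 10% cache misses (~50 ms).
latencies = [1.0] * 90 + [50.0] * 10

mean = statistics.mean(latencies)      # 5.9 ms -- matches no actual request
median = statistics.median(latencies)  # 1.0 ms -- a typical cache hit
print(f"mean = {mean} ms, median = {median} ms, max = {max(latencies)} ms")
# A histogram would show two modes; the mean alone hides that shape.
```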

Inappropriate metric: extra nop instructions make instructions/cycle much higher, but a nop does nothing. Pick metrics that are ends-based. (Average points-to set size is used to evaluate pointer analyses, but people might prefer an analysis that is accurate on 2/3 of queries and bad on 1/3 over one that is mediocre on all of them.) Metrics should be consistent with "better".
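The nop pitfall in back-of-the-envelope numbers (all counts invented): padding with nops raises instructions/cycle while the ends-based metric, total cycles, gets worse.

```python
# Invented counts: a program retires 1e9 instructions in 1e9 cycles.
insns = 1_000_000_000
cycles = 1_000_000_000
ipc = insns / cycles                         # 1.00

# Pad with 5e8 nops, assuming a superscalar core retires them cheaply
# (~0.25 cycles each). The ratio metric "improves"...
nops = 500_000_000
padded_cycles = cycles + nops * 0.25
ipc_padded = (insns + nops) / padded_cycles  # ~1.33

# ...but the program is strictly slower by the metric users care about.
print(f"IPC: {ipc:.2f} -> {ipc_padded:.2f}")
print(f"cycles: {cycles:.0f} -> {padded_cycles:.0f}")
```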

3. Inconsistency

The experiment compares A with B in different contexts.

(e.g., comparing two systems using different benchmark suites)

Inconsistent workload (evaluating a Gmail optimization over two equal-length time periods, but the workloads in the two periods differ).

Inconsistent metrics (e.g., using issued instructions for one system and retired instructions for the other when evaluating performance).
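Why mixing those two counters is inconsistent (numbers invented): with speculative execution, issued counts exceed retired counts even for identical work, so comparing one system's issued count against another's retired count manufactures a difference.

```python
# Invented counters for one and the same run: speculative execution
# issues instructions that never retire.
retired = 1_000_000_000
issued = int(retired * 1.3)   # assume 30% wasted speculative work

# Same run, two metrics: an inconsistent comparison (A's issued vs
# B's retired) would report a 30% "difference" out of thin air.
print("issued: ", issued)
print("retired:", retired)
print("apparent difference:", (issued - retired) / retired)
```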

4. Irreproducibility

Others cannot reproduce your result.

Write down everything that may have biased your result.

It is very hard to capture and characterize everything that may affect your result.

Always look your gift horse in the mouth: use a back-of-the-envelope evaluation to sanity-check surprisingly good results.
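One such back-of-the-envelope check (a sketch; the Amdahl's-law framing is mine, not the talk's): if an optimization touches only a fraction f of total run time, the overall speedup is bounded, so a bigger claimed speedup deserves a hard look.

```python
def max_speedup(f):
    """Amdahl bound: best possible overall speedup if a fraction f of
    run time is optimized away entirely."""
    return 1.0 / (1.0 - f)

# Invented claim: a 3x overall speedup from optimizing 40% of run time.
claimed, f = 3.0, 0.4
bound = max_speedup(f)                 # ~1.67x
print(f"claimed {claimed:.1f}x vs bound {bound:.2f}x ->",
      "suspicious" if claimed > bound else "plausible")
```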


Suggestion: both the novel algorithm and the sound experiment should stand on their own.