> I think your observations are correct, but at least my experience as a
> gardener/sheriff leads me to a different conclusion. Namely, when I'm
> looking at a newly failing test, it is difficult if not impossible for
> me to know if the existing baseline was previously believed to be
> correct or not, and thus it's hard for me to tell if the new baseline
> should be considered worse, better, or different.

How does the proposal solve this problem? Right now gardeners have two
options:

   - Rebaseline
   - Add a test expectation

Once we implement the proposal, we will have at least three options:

   - Rebaseline correct.png
   - Add/rebaseline expected.png
   - Add/rebaseline failure.png
   - (Optionally) Add a test expectation.

And that's a lot of options to choose from. The more options we have, the
more likely people are to make mistakes. We're already inconsistent in how
-expected.png is used because some people make mistakes. I'm afraid that
adding another set of expected results will result in even more mistakes
and an unrecoverable mess.

> This is why I want to test this theory :). It seems like if we got
> experience with this on one (or more) ports for a couple of months we
> would have a much more well-informed opinion, and I'm not seeing a
> huge downside to at least trying this idea out.

Sure. But if we're doing this experiment on trunk, thereby imposing
significant cognitive load on every other port, then I'd like to see us
setting *both* an exact date at which we decide whether this approach is
good or not and criteria by which we decide this *before* the experiment
starts.

Like Filip, I'm *extremely* concerned about the prospect of us introducing
yet-another-way-of-doing-things, and not being able to get rid of it later.

