Discussion:
[Haskell-cafe] Testing of GHC extensions & optimizations
Rodrigo Stevaux
2018-08-31 17:53:26 UTC
Hi,

For those familiar with GHC source code & internals, how are extensions &
optimizations tested? And what are the quality policies for accepting new
code into GHC?

I am interested in testing compilers in general using random testing. Is it
used for GHC?
Ömer Sinan Ağacan
2018-09-01 11:31:32 UTC
Hi,

Here are a few things we do regarding compiler/runtime performance:

- Each commit goes through a set of tests, some of which also check the max.
residency, total allocations etc. of the compiler or the compiled program,
and fail if those numbers exceed the allowed thresholds. See [1] for an
example.

- There's https://perf.haskell.org/ghc/, which runs some tests on every
commit. I don't know exactly what it does (it's hard to tell from the web
page, but I guess it only runs a few selected tests/benchmarks?). I've
personally never used it, I just know that it exists.

- Most of the time, if a patch is expected to change compiler or runtime
performance, the author submits nofib results and updates the perf tests in
the test suite with the new numbers. This process is manual, and contributors
are sometimes asked by reviewers for nofib numbers. See [2,3] for nofib.

We currently don't use random testing.

[1]: https://github.com/ghc/ghc/blob/565ef4cc036905f9f9801c1e775236bb007b026c/testsuite/tests/perf/compiler/all.T#L30
[2]: https://github.com/ghc/nofib
[3]: https://ghc.haskell.org/trac/ghc/wiki/Building/RunningNoFib

Ömer
Rodrigo Stevaux
2018-09-02 18:05:04 UTC
Hi Omer, thanks for the reply. The tests you describe are regression tests
for non-functional aspects (performance), is my understanding right? What
about testing that optimizations and extensions are functionally correct?
Sven Panne
2018-09-02 19:58:42 UTC
Post by Rodrigo Stevaux
Hi Omer, thanks for the reply. The tests you describe are regression tests
for non-functional aspects (performance), is my understanding right? [...]
Quite the opposite, the usual steps are:

* A bug is reported.
* A regression test is added to GHC's test suite, reproducing the bug (
https://ghc.haskell.org/trac/ghc/wiki/Building/RunningTests/Adding).
* The bug is fixed.

This way you make sure the bug doesn't come back later. Do this for
a few decades, and you have a very comprehensive test suite for functional
aspects. :-) The reasoning behind this: blindly adding tests is wasted
effort most of the time, because you often end up testing things which only
very rarely break. Bugs, on the other hand, point you very concretely at the
problematic/tricky/complicated parts of your software.
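
To make this concrete: a test in GHC's test suite is typically just a small
program with its expected output, plus an entry in the directory's all.T
file. The module below is a made-up sketch of such a minimal reproducer (the
module name, the bug and the numbers are invented for illustration):

    -- T99999.hs: hypothetical minimal reproducer for a (made-up)
    -- pattern-match bug, cut down from the original bug report. The
    -- expected stdout is stored in a file next to it, and the test is
    -- registered in the directory's all.T.
    module Main where

    data Colour = Red | Green | Blue deriving Show

    classify :: Int -> Colour
    classify 0 = Red
    classify 1 = Green
    classify _ = Blue

    main :: IO ()
    main = mapM_ (print . classify) [0, 1, 2]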

Catching increases in runtime/memory consumption is a slightly different
story, because you have to come up with "typical" scenarios to make useful
comparisons. You can have synthetic scenarios for very specific parts of
the compiler, too, like pattern matching with tons of constructors, or
using gigantic literals, or type checking deeply nested tricky things,
etc., but I am not sure if such things are usually called "regression
tests".

Cheers,
S.
Joachim Durchholz
2018-09-02 20:43:51 UTC
   * A bug is reported.
   * A regression test is added to GHC's test suite, reproducing the
bug (https://ghc.haskell.org/trac/ghc/wiki/Building/RunningTests/Adding).
   * The bug is fixed.
This way you make sure the bug doesn't come back later.
That's just the... non-thinking aspect, and more embarrassment
avoidance. The first level of automated testing.
Do this
for a few decades, and you have a very comprehensive test suite for
functional aspects. :-) The reasoning behind this: blindly adding tests
is wasted effort most of the time, because you often end up testing things
which only very rarely break. Bugs, on the other hand, point you very
concretely at the problematic/tricky/complicated parts of your software.
Well, you have to *think*.
You can't just blindly add tests for every bug that was ever reported;
you get an ever-growing pile of test code, and if the spec changes you
need to change the tests. So you need a strategy for curating the test
code, and you very much prefer to test for the thing that actually went
wrong, not the thing that was reported.

I'm pretty sure the GHC guys do, actually; I'm just speaking up so that
people don't take this "just add a test whenever a bug occurs" at face
value; there's much more to it.
Catching increases in runtime/memory consumption is a slightly different
story, because you have to come up with "typical" scenarios to make
useful comparisons.
It's just a case where you cannot blindly add a test for every
performance regression you see; you have to set up testing beforehand.
Which is the exact opposite of what you recommend, so maybe the
recommendation shouldn't be taken at face value ;-P
You can have synthetic scenarios for very specific
parts of the compiler, too, like pattern matching with tons of
constructors, or using gigantic literals, or type checking deeply nested
tricky things, etc., but I am not sure if such things are usually called
"regression tests".
It's a matter of definition and common usage, but indeed many people
associate the term "regression testing" with "let's write a test case
whenever we see a bug".

This is one of the reasons why I prefer the term "automated testing".
It's both more general and encompasses all the things that one does.

Oh, and sometimes you even add a test blindly due to a bug report. It's
still a good first line of defense, it's just not what you should always
do, and never without thinking about an alternative.

Regards,
Jo
Sven Panne
2018-09-03 06:29:54 UTC
On Sun, 2 Sep 2018 at 22:44, Joachim Durchholz wrote:
Post by Joachim Durchholz
That's just the... non-thinking aspect, and more embarrassment
avoidance. The first level of automated testing.
Well, even avoiding embarrassing bugs is extremely valuable. The vast
majority of bugs in real-world software *is* actually highly embarrassing,
and even worse: similar bugs have probably been introduced before. Getting
some tricky algorithm wrong is the exception, for at least two reasons: the
majority of code is typically very mundane and boring, and people are
usually more awake and concentrated when they know that they are writing
non-trivial stuff. Of course your mileage may vary, depending on the domain,
the experience of the programmers, deadline pressure, etc.
Post by Joachim Durchholz
Post by Sven Panne
Do this
for a few decades, and you have a very comprehensive test suite for
functional aspects. :-) The reasoning behind this: blindly adding tests
is wasted effort most of the time, because you often end up testing things
which only very rarely break. Bugs, on the other hand, point you very
concretely at the problematic/tricky/complicated parts of your software.
Well, you have to *think*.
You can't just blindly add tests for every bug that was ever reported;
you get an ever-growing pile of test code, and if the spec changes you
need to change the tests. So you need a strategy for curating the test
code, and you very much prefer to test for the thing that actually went
wrong, not the thing that was reported.
Two things here: I never proposed adding the exact code from the bug report
to a test suite. Bug reports are usually too big and too unspecific, so of
course you add a minimal, focused test triggering the buggy behavior.
Furthermore: if the spec changes, your tests *must* break, by all means;
otherwise, what are the tests actually testing, if not the spec? Of
course only those tests should break which test the changed part of the
spec.
Post by Joachim Durchholz
It's just a case where you cannot blindly add a test for every
performance regression you see, you have to set up testing beforehand.
Which is the exact opposite of what you recommend, so maybe the
recommendation shouldn't be taken at face value ;-P
This is exactly why I said that these tests are a different story. For
performance measurements there is no binary "failed" or "correct" outcome,
because typically many tradeoffs are involved (space vs. time etc.).
Therefore you have to define what you consider important, measure that, and
guard it against regressions.

Post by Joachim Durchholz
It's a matter of definition and common usage, but indeed many people
associate the term "regression testing" with "let's write a test case
whenever we see a bug". [...]
This sounds far too disparaging, and quite a few companies have a rule
like "no bug fix gets committed without an accompanying regression test"
for a good reason. People usually have no real clue where their most
problematic code is (just like they have no clue where the most
performance-critical part is), so having *some* hint (a bug report) is far
better than guessing without any hint.

Cheers,
S.
Rodrigo Stevaux
2018-09-03 01:40:19 UTC
Thanks for the clarification.

What I am hinting at is that the Csmith project caught many bugs in C
compilers by using random testing -- feeding compilers random programs and
checking whether the optimizations preserved program behavior.

GHC, with its dozens of optimizations, could be a good candidate for the
same technique.

I have no familiarity with GHC or with compilers in general; I am just
looking for something to study.

My question, in its most direct form, is: in your view, could GHC
optimizations hide bugs that could potentially be revealed by exploring
the program space?
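
For concreteness, here is a rough sketch of the kind of differential testing
I have in mind. Everything here is an assumption on my part: it needs 'ghc'
on the PATH and QuickCheck installed, it uses a toy expression generator
(generating interesting, well-typed programs is the actual hard part), and
it only compares the stdout of -O0 vs -O2 builds:

    -- Sketch of Csmith-style differential testing for GHC (illustrative only).
    import System.Process (callProcess, readProcess)
    import Test.QuickCheck

    -- Toy generator: random arithmetic over Int. A serious generator would
    -- produce higher-order, lazy, possibly-diverging programs.
    data Expr = Lit Int | Add Expr Expr | Mul Expr Expr deriving Show

    instance Arbitrary Expr where
      arbitrary = sized go
        where
          go 0 = Lit <$> arbitrary
          go n = oneof [ Lit <$> arbitrary
                       , Add <$> go (n `div` 2) <*> go (n `div` 2)
                       , Mul <$> go (n `div` 2) <*> go (n `div` 2) ]

    -- Turn an expression into a complete program that prints its value.
    render :: Expr -> String
    render e = unlines
      [ "module Main where"
      , "main :: IO ()"
      , "main = print (" ++ go e ++ " :: Int)"
      ]
      where
        go (Lit n)   = "(" ++ show n ++ ")"
        go (Add a b) = "(" ++ go a ++ " + " ++ go b ++ ")"
        go (Mul a b) = "(" ++ go a ++ " * " ++ go b ++ ")"

    -- Compile the generated program with the given optimization flag, run it.
    compileAndRun :: String -> String -> IO String
    compileAndRun opt src = do
      writeFile "Gen.hs" src
      callProcess "ghc" [opt, "-fforce-recomp", "-o", "gen", "Gen.hs"]
      readProcess "./gen" [] ""

    -- Property: -O0 and -O2 builds of the same program print the same thing.
    prop_optPreservesBehaviour :: Expr -> Property
    prop_optPreservesBehaviour e = ioProperty $ do
      out0 <- compileAndRun "-O0" (render e)
      out2 <- compileAndRun "-O2" (render e)
      pure (out0 == out2)

    main :: IO ()
    main = quickCheckWith stdArgs { maxSuccess = 20 } prop_optPreservesBehaviour

For plain arithmetic this will of course never find anything; the interesting
question is what happens once the generator covers laziness, exceptions,
type classes and so on.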
Emil Axelsson
2018-09-03 07:08:12 UTC
Have a look at Michal Palka's Ph.D. thesis:

https://research.chalmers.se/publication/195849

IIRC, his testing revealed several strictness bugs in GHC when compiling
with optimization.
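
To give a flavour of what such a strictness bug looks like (a made-up
illustration, not one of the bugs actually found):

    -- 'constTrue' must not evaluate its argument, so this program must
    -- print True. If an (incorrect) optimization made the function strict,
    -- the program would instead crash on the 'undefined' -- exactly the
    -- kind of behaviour change that testing against a reference semantics
    -- can detect.
    constTrue :: a -> Bool
    constTrue _ = True

    main :: IO ()
    main = print (constTrue undefined)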

/ Emil
Rodrigo Stevaux
2018-09-03 13:11:51 UTC
OK, this is the kind of stuff I'm looking for. This is great. Many thanks
for the insight.