Discussion: [Haskell-cafe] static analysis of a C-like language
Olaf Klinke
2018-11-21 21:01:08 UTC
Dear Cafe,

I want to build a linter for a C-like language. My company uses a UI system that ships with a C-like language used to attach functionality to UI elements. The language is interpreted at runtime. The user, however, has no access to the built-in parser and checking functions; only a basic syntax check comes with their editor. Therefore many stupid mistakes only show up at runtime and cannot be caught statically.
The language is in many respects even simpler than C: no structs, no enumerations, only a handful of basic data types and arrays thereof, plus pointers and the usual control-flow structures. On the other hand, there are some features that make static analysis nearly impossible, e.g. the possibility of dynamic registration of variables. Higher-order functions are emulated by passing the names of other functions as strings. There is explicit referencing of other sources (#include statements) as well as implicit referencing within UI elements and within the project hierarchy.

We want to catch basic programming mistakes: misspelled variables, missing return statements, use of variables before assignment, and wrong number or types of function arguments. A project's code is scattered across multiple files and layers, and a function declaration can mask a declaration from the layer below; therefore a per-file linter will not suffice.
As a side effect this may also yield a documentation generator (boy, do I miss Haddock while working with that), which is also missing from the system.

I have never done something like this and will seek help from local CS departments (if you are a CS student in Paderborn or Bielefeld, Germany, get in touch!). My company would be willing to sponsor a Bachelor's or Master's thesis project. Ultimately this will be commercialized as an add-on, so anyone who helps will also profit from the result financially. The user community is not large, but it includes big names such as Siemens, London Heathrow, the metros in NYC and Bilbao, and CERN.

What kind of expertise should I look for? Compiler theorists? What information should I request from the language author? I feel that Haskell might be a good language to represent the AST in, but I am overwhelmed by the numerous AST libraries on Hackage.
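
My naive starting point would be plain data types, roughly like the sketch below (all node and type names are invented and would have to follow the real grammar):

data Ty = TInt | TFloat | TString | TArray Ty | TPtr Ty
  deriving (Eq, Show)

data Expr
  = Var String
  | IntLit Integer
  | StrLit String
  | Call String [Expr]        -- callee referenced by name, as in the language
  | Index Expr Expr           -- array indexing
  | BinOp String Expr Expr
  deriving (Eq, Show)

data Stmt
  = Declare Ty String (Maybe Expr)
  | Assign String Expr
  | If Expr [Stmt] [Stmt]
  | While Expr [Stmt]
  | Return (Maybe Expr)
  | ExprStmt Expr
  deriving (Eq, Show)

data FunDef = FunDef
  { funName   :: String
  , funParams :: [(Ty, String)]
  , funReturn :: Maybe Ty
  , funBody   :: [Stmt]
  } deriving (Eq, Show)

-- Example: function f(int x) { return x + 1; }
exampleFun :: FunDef
exampleFun = FunDef "f" [(TInt, "x")] (Just TInt)
  [Return (Just (BinOp "+" (Var "x") (IntLit 1)))]

main :: IO ()
main = print exampleFun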

How similar is my problem to what is covered by, say, haskell-src-exts, hlint, alex or syntactic? Can I re-use code from these projects?

What solutions are out there for dependency resolution between source files?
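
For the coarse ordering of files, maybe Data.Graph from the containers package already suffices; a sketch with made-up file names, where an include cycle would show up as a CyclicSCC:

import Data.Graph (SCC (..), stronglyConnComp)

-- (file, its key, keys of the files it #includes)
deps :: [(FilePath, FilePath, [FilePath])]
deps =
  [ ("util.src",   "util.src",   [])
  , ("button.src", "button.src", ["util.src"])
  , ("main.src",   "main.src",   ["button.src", "util.src"])
  ]

main :: IO ()
main = mapM_ report (stronglyConnComp deps)  -- SCCs come out with dependencies before dependents
  where
    report (AcyclicSCC f) = putStrLn ("analyze " ++ f)
    report (CyclicSCC fs) = putStrLn ("include cycle: " ++ show fs)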

I suppose for informative error messages I need to decorate each semantic bit with information about
- source file and location
- what is in scope at that position, plus the source of the symbols in scope
What data structures can be used to build and represent this information?
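
I could imagine something along the following lines: a tree parameterised over an annotation, so that every pass can enrich it (all names below are made up):

{-# LANGUAGE DeriveFunctor #-}
import qualified Data.Map as M

-- Where a piece of syntax came from.
data SrcLoc = SrcLoc { srcFile :: FilePath, srcLine :: Int, srcCol :: Int }
  deriving (Eq, Show)

-- What is visible at that point, and where each symbol was declared
-- (i.e. the declaration that wins under masking).
newtype ScopeInfo = ScopeInfo { inScope :: M.Map String SrcLoc }
  deriving Show

-- The tree carries an annotation at every node ...
data Expr a
  = Var a String
  | Call a String [Expr a]
  deriving (Show, Functor)

-- ... so the parser produces Expr SrcLoc and the scope checker can refine
-- that to Expr (SrcLoc, ScopeInfo) without changing the shape of the tree.
type Parsed = Expr SrcLoc
type Scoped = Expr (SrcLoc, ScopeInfo)

example :: Parsed
example = Call (SrcLoc "button.gui" 12 3) "setLabel"
               [Var (SrcLoc "button.gui" 12 17) "title"]

main :: IO ()
main = print example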

Any pointers welcome.
Olaf
Siddharth Bhat
2018-11-22 13:26:38 UTC
What is the canonical way to lift an interpreter to an abstract interpreter while ensuring convergence, i.e. figuring out how and when to widen?
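
By widening I mean the usual trick of jumping to a bound that cannot grow any further, roughly like this on an interval domain (a textbook sketch, not tied to any library):

data Bound = NegInf | Fin Integer | PosInf deriving (Eq, Ord, Show)

data Interval = Interval Bound Bound deriving (Eq, Show)  -- [lo, hi]

-- Ordinary join: may keep growing forever around a loop.
lub :: Interval -> Interval -> Interval
lub (Interval a b) (Interval c d) = Interval (min a c) (max b d)

-- Widening: any bound that grew is pushed straight to infinity, so the
-- ascending chain stabilises after finitely many steps.
widen :: Interval -> Interval -> Interval
widen (Interval a b) (Interval c d) =
  Interval (if c < a then NegInf else a) (if d > b then PosInf else b)

main :: IO ()
main = do
  print (lub   (Interval (Fin 0) (Fin 1)) (Interval (Fin 0) (Fin 2)))  -- [0, 2]
  print (widen (Interval (Fin 0) (Fin 1)) (Interval (Fin 0) (Fin 2)))  -- [0, +inf]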

Thanks
Siddharth

On Thu, 22 Nov 2018, 18:52, Holger Siegel via Haskell-Cafe wrote:
Hello Olaf,
to me that sounds as if you want to do an abstract interpretation with a
forward collecting semantics that employs non-relational abstract
domains for the primitive data types and summarizes the dimensions of
arrays. That is what the Java compiler does when it analyzes whether a
variable is 'effectively final' (except that it doesn't even have to
summarize arrays for that), and it should suffice for the kind of bugs
you want to find.
I would start by writing a simple interpreter for the language to be
analyzed. That way you fix messy details before they bite you, e.g. the
order in which submodules are loaded and initialized. You also get a
clear formalization of the semantics, and you can annotate the syntax
tree and implement informative error messages for the bugs you want to
detect. You don't even need to implement every primitive function - it
suffices that they return some random value, provided that the intended
return value is one of the possible values. Lifting the interpreter to
an abstract interpreter can then be done in a canonical way.
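
To illustrate the last point with a toy expression language (not yours): the very same evaluator can run over a concrete value domain and over an abstract one, and the abstract run is the analysis. A rough sketch:

-- Toy expression language.
data Expr = Lit Int | Add Expr Expr | IfZero Expr Expr Expr

-- The interpreter is written once, against an abstract interface ...
class Domain v where
  lit    :: Int -> v
  add    :: v -> v -> v
  ifZero :: v -> v -> v -> v   -- scrutinee, then-branch, else-branch

eval :: Domain v => Expr -> v
eval (Lit n)        = lit n
eval (Add a b)      = add (eval a) (eval b)
eval (IfZero c t e) = ifZero (eval c) (eval t) (eval e)

-- ... the concrete semantics is one instance ...
newtype Concrete = Concrete Int deriving Show

instance Domain Concrete where
  lit = Concrete
  add (Concrete a) (Concrete b) = Concrete (a + b)
  ifZero (Concrete 0) t _ = t
  ifZero _            _ e = e

-- ... and a (non-relational) sign domain is another.
data Sign = Neg | Zero | Pos | AnySign deriving (Eq, Show)

instance Domain Sign where
  lit n
    | n < 0     = Neg
    | n == 0    = Zero
    | otherwise = Pos
  add Zero x   = x
  add x   Zero = x
  add Pos Pos  = Pos
  add Neg Neg  = Neg
  add _   _    = AnySign
  ifZero Zero    t _ = t
  ifZero Pos     _ e = e
  ifZero Neg     _ e = e
  ifZero AnySign t e = if t == e then t else AnySign  -- crude join of both branches

main :: IO ()
main = do
  print (eval (Add (Lit 1) (Lit 2)) :: Concrete)  -- Concrete 3
  print (eval (Add (Lit 1) (Lit 2)) :: Sign)      -- Pos
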
Regards,
Holger
--
Sending this from my phone, please excuse any typos!
Olaf Klinke
2018-11-26 20:37:10 UTC
Post by Holger Siegel via Haskell-Cafe
Hello Olaf,
to me that sounds as if you want to do an abstract interpretation with a
forward collecting semantics that employs non-relational abstract
domains for the primitive data types and summarizes the dimensions of
arrays.
...
I would start by writing a simple interpreter for the language to be
analyzed. That way you fix messy details before they bite you, e.g. the
order in which submodules are loaded and initialized.
I was hoping not to have to write an interpreter (the language author has already written a translation to C++), but if that is the way to go, I'll do it. As I understand it, the Haskell semantics should contain just enough substance that the errors I am after will cause hiccups in the Haskell compiler? That is indeed a compelling approach.
What this does not address is the question of error reporting: how could a translation to Haskell preserve information about scope, source position and masking? Can I leverage the GHC API for that?

Regards,
Olaf
Carter Schonwald
2018-11-29 14:31:15 UTC
Most parser libraries in Haskell provide facilities for including source
information.

And then you include in your syntax tree extra fields for those positions.

Simple as that
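
For instance with parsec (other parser libraries are analogous), a little wrapper can record the position at which each node starts:

import Text.Parsec
import Text.Parsec.String (Parser)

-- Extra field carrying the source position of the node.
data Located a = Located SourcePos a
  deriving Show

-- Record where the wrapped parser started.
located :: Parser a -> Parser (Located a)
located p = Located <$> getPosition <*> p

-- Example: identifiers that remember where they were parsed.
identifier :: Parser (Located String)
identifier = located (many1 letter) <* spaces

main :: IO ()
main = parseTest (many identifier) "foo bar baz"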


The paper
http://david.darais.com/assets/papers/abstracting-definitional-interpreters/adi.pdf
is a great reference for how to add a bunch of program analysis features
after you have a working interpreter. Also, its citations show a bunch of
ways you can add different flavors of analyses.
Paul Brauner
2018-12-04 13:14:25 UTC
If you're interested, I have a draft implementation of that paper
<https://github.com/polux/abstract-interpreters/blob/master/src/Main.hs> in
Haskell that uses monad transformers. There's another one
<https://github.com/josefs/AbstractingDefinitionalInterpreters> online that
uses effect handlers. The original Scheme implementation
<https://github.com/plum-umd/abstracting-definitional-interpreters> is also
available.

A caveat: while playing with my implementation I found an issue with the
GC as described in the paper and was able to reproduce it on the reference
implementation:
https://github.com/plum-umd/abstracting-definitional-interpreters/issues/82
I'm not knowledgeable enough to tell whether this is a flaw of the
approach or something that can easily be fixed.

Paul