Helping to find the usefulness of a proposal

Thu Apr 2 20:59:32 PDT 2009

Comments inline

Bruce

Quoting Joe Darcy <Joe.Darcy at Sun.COM>:

> brucechapman at paradise.net.nz wrote:
> > Good idea,
> > 
> > Should they each be evaluated against the same corpus and what would 
> > be a suitable corpus?
> > 
> > http://en.wikipedia.org/wiki/Corpus_linguistics
> 
> On that front, Alex sent me the following:
> 
> > Analysis of a micro-corpus of your own or your company's code is
> > unscientific.
> > 
> > Ewan Tempero and his colleagues at the University of Auckland have 
> > done excellent, peer-reviewed work on how Java language features are 
> > used in real-world code. Their "Qualitas Corpus" consists of over 
> > 100,000 classes - see http://www.cs.auckland.ac.nz/~ewan/corpus/
> > 

OK,

I was aware of that one because the last JUG meeting here was about visualising
code, and the talk was working with that corpus.

The problem then is that the corpus is HUGE. Even the 20090202r version, which
only has the latest release of each system (and is probably the appropriate one
for Coin use), is 1.2GB (my monthly broadband limit is 1GB - I'd probably
sneakernet it from someone locally).

Maybe this is one of those cases where it is better to take the analysis tools
to the corpus rather than the other way around. For that we'd need someone to
host the corpus and run jobs submitted against it.

Let me do some further research on that.

Bruce

> > If someone showed that, say, a null check occurs on average every 15 
> > lines of code in this corpus, and that null-safe operators could 
> > remove those lines without adverse side effects, then that would be a
> > real contribution to Project Coin.
> 
> I agree that using a standard, large corpus to empirically examine the 
> utility of the Project Coin proposals would be a fine component of their
> evaluation.
> 
> -Joe
>
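
On the null check measurement suggested above, here is a minimal sketch of the
sort of job that could be run against a local copy of a corpus's source tree.
It is only a crude textual heuristic, not anything taken from the Qualitas
tooling: the class name NullCheckCounter, its methods, and the regular
expression are all made up for illustration, and it will also count matches
inside comments and string literals. It assumes the corpus is available as a
directory of .java files.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper: estimates how often null checks occur in Java source.
public class NullCheckCounter {

    // Matches "== null", "!= null", "null ==" and "null !=" as plain text.
    // Assumption: a textual match is a good enough proxy for a null check.
    private static final Pattern NULL_CHECK =
            Pattern.compile("[!=]=\\s*null\\b|\\bnull\\s*[!=]=");

    // Counts the null comparisons that appear on a single source line.
    public static int countNullChecks(String line) {
        Matcher m = NULL_CHECK.matcher(line);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    // Walks every .java file under root; returns {nonBlankLines, nullChecks}.
    public static long[] scan(Path root) throws IOException {
        List<Path> files;
        try (Stream<Path> walk = Files.walk(root)) {
            files = walk.filter(Files::isRegularFile)
                        .filter(p -> p.toString().endsWith(".java"))
                        .collect(Collectors.toList());
        }
        long lines = 0;
        long checks = 0;
        for (Path file : files) {
            // ISO-8859-1 decodes any byte sequence, so files in an unknown
            // encoding cannot abort the scan with a decoding error.
            for (String line : Files.readAllLines(file, StandardCharsets.ISO_8859_1)) {
                if (line.trim().isEmpty()) {
                    continue; // blank lines are not counted as lines of code
                }
                lines++;
                checks += countNullChecks(line);
            }
        }
        return new long[] {lines, checks};
    }

    public static void main(String[] args) throws IOException {
        Path root = Paths.get(args.length > 0 ? args[0] : ".");
        long[] result = scan(root);
        System.out.printf("%d non-blank lines, %d null checks%n", result[0], result[1]);
        if (result[1] > 0) {
            System.out.printf("roughly one null check every %.1f lines%n",
                    (double) result[0] / result[1]);
        }
    }
}
```

Running it with the root directory of an unpacked source tree as the only
argument prints the two totals and their ratio. A serious study would want a
real parser rather than a regular expression, and would still have to show
that a null-safe operator could replace each check without changing behaviour.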