With the amount of work needed to even attempt such a study, and to get to a point where one can precisely define which variable is actually being measured, compromises are indeed necessary.
But please feel free to replicate the study for your preferred language. I am happy to discuss more about why we made certain choices.
We report results both with and without just-in-time compilation.
The specific focus of this work was pure interpreter performance in the context of metacompilation systems, i.e., before compilation has had a chance to kick in.
For both RPython and Truffle/Graal, it's possible to disable the JIT compilers and measure pure interpreter speed.
So the "baseline" is Java - is that Java compiled or interpreted? And if the latter, is the non-JIT-ted Graal interpreter compiled (as Java) and interpreting the script, or is it interpreted itself?
The figure with the JIT-compiled numbers uses a standard HotSpot JVM with JIT compilation enabled.
The figure with the interpreter numbers uses a standard HotSpot JVM with the -Xint flag, i.e., only the Java bytecode interpreter.
The TruffleSOM interpreter is AOT-compiled, so it is a native binary, which then interprets the SOM code.
You are free to disagree with the specific design choices of the AST and bytecode interpreters for SOM, but we put quite some effort into making the comparison as fair as possible.
But the bigger ones are not yet (so, no results yet).
One of the important questions to answer first is how the mapping should be done. A naive version using new/delete/smart pointers is going to have performance issues. Another option would be to use arena allocators and remove memory management overhead from the equation entirely. Depending on which comparison/C++ usage scenario is desired, both options would be useful.
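To make that trade-off concrete, here is a minimal C++ sketch (not code from any existing SOM port; `Arena`, `AddNode`, and all other names are invented for illustration) contrasting per-node heap allocation with a simple bump-pointer arena for AST nodes:

```cpp
#include <cstddef>
#include <memory>
#include <new>
#include <utility>
#include <vector>

// A minimal bump-pointer arena: nodes are carved out of large blocks and
// all freed at once when the arena is destroyed (no per-node delete).
class Arena {
  static constexpr std::size_t kBlockSize = 1 << 20;   // 1 MiB per block
  std::vector<std::unique_ptr<std::byte[]>> blocks_;
  std::size_t offset_ = kBlockSize;                     // forces allocation of the first block

public:
  void* allocate(std::size_t size, std::size_t align) {
    offset_ = (offset_ + align - 1) & ~(align - 1);     // bump to the required alignment
    if (offset_ + size > kBlockSize) {
      blocks_.push_back(std::make_unique<std::byte[]>(kBlockSize));
      offset_ = 0;
    }
    void* p = blocks_.back().get() + offset_;
    offset_ += size;
    return p;
  }

  // Placement-new a node into the arena; only suitable for trivially
  // destructible node types, since destructors are never run.
  template <typename T, typename... Args>
  T* make(Args&&... args) {
    return new (allocate(sizeof(T), alignof(T))) T(std::forward<Args>(args)...);
  }
};

// Hypothetical AST node type, purely for illustration.
struct AddNode {
  AddNode* left;
  AddNode* right;
  AddNode(AddNode* l, AddNode* r) : left(l), right(r) {}
};

int main() {
  // Naive version: each node is an individual heap allocation with
  // ownership bookkeeping (malloc/free traffic, pointer-chasing destructors).
  auto heap_node = std::make_unique<AddNode>(nullptr, nullptr);

  // Arena version: allocation is a pointer bump, deallocation is wholesale.
  Arena arena;
  AddNode* arena_node = arena.make<AddNode>(nullptr, nullptr);
  (void)heap_node;
  (void)arena_node;
}
```

The arena variant removes allocator calls and destructor walks from the measured interpreter loop, which is exactly why the choice matters for a fairness-focused comparison.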
I tried Pandoc before reverting back to tex4ht. Unfortunately, it models a rather small subset of the things I was interested in, specifically around the typesetting of citations and listings, as far as I remember. So, tex4ht plus HTML post-processing it was.
It is built on top of tex4ht. It merely provides a few settings for tex4ht and post-processing scripts that beautify the generated HTML. You might ask: why post-processing? Well, because it was simpler for me than figuring out how to get tex4ht to do the desired thing. I just don't find TeX/LaTeX pleasant to use as a programming language, but that's personal taste.
If the post-processing stage is useful to others, perhaps it could be upstreamed into tex4ht?
(I sometimes think that GitHub's user interface puts too much emphasis on cloning and not enough on cooperation. Many useful tools end up in a dozen forks, all with slightly different features, all equally inactive.)
Yes, that's indeed a non-obvious issue, and it looks rather strange on the graph.
That's not the 'partial evaluation' per se. Instead, it is the difference between RPython and HotSpot that surfaces here. RPython currently generates single-threaded VMs, where neither the GC nor the compilation is done in a separate thread. The HotSpot JVM, however, can do those things in parallel and additionally has other infrastructure threads running. In the end, this increases the likelihood that the OS reschedules the application thread, which becomes visible as jitter.
Oh, makes sense :) Is that mentioned in the paper and I missed it? If not, it might deserve a footnote, as the difference is glaring.
BTW, would Graal really become available as a stock-HotSpot plugin in Java 9 thanks to JEP 243? I see things are ready on Graal's end[1], but are they on HotSpot's?
You are right, with a tracer you get most of that for free. For a meta-object protocol, however, you'll still need a little bit of help to avoid creating too many guards. And the PICs, or `dispatch chains', are very useful in the interpreter, where they avoid enormous warmup penalties.
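For readers who haven't come across the term: a `dispatch chain' is essentially a polymorphic inline cache held directly at the send site in the interpreter. Below is a rough C++ sketch of the idea (all names are invented; this is not code from TruffleSOM or the paper): each link caches one receiver class and its looked-up method, so repeated sends only need a class comparison instead of a full lookup, even before any JIT compilation happens.

```cpp
#include <memory>
#include <string>
#include <unordered_map>

// Illustrative stand-ins for interpreter-level classes; hypothetical names.
struct Method { const char* name; };
struct ObjClass {
  std::unordered_map<std::string, Method*> methods;  // slow-path lookup table
};
struct Object { ObjClass* cls; };

// One link in a dispatch chain: caches a receiver class observed at this
// call site together with the method that the lookup produced for it.
struct DispatchNode {
  ObjClass* cached_class = nullptr;
  Method* cached_method = nullptr;
  std::unique_ptr<DispatchNode> next;
};

// A message-send site in the AST. A cache miss performs the full lookup once
// and prepends a new link, so later sends with the same receiver class are
// resolved by a pointer comparison, even while running purely interpreted.
struct SendSite {
  std::string selector;
  std::unique_ptr<DispatchNode> chain;
  static constexpr int kMaxChainLength = 6;  // beyond this, stay megamorphic

  Method* dispatch(Object* receiver) {
    int length = 0;
    for (DispatchNode* n = chain.get(); n != nullptr; n = n->next.get()) {
      if (n->cached_class == receiver->cls) {
        return n->cached_method;                          // fast path: class check only
      }
      ++length;
    }
    Method* target = receiver->cls->methods.at(selector); // slow path: hash lookup
    if (length < kMaxChainLength) {
      auto node = std::make_unique<DispatchNode>();
      node->cached_class = receiver->cls;
      node->cached_method = target;
      node->next = std::move(chain);                      // prepend new cache entry
      chain = std::move(node);
    }
    return target;
  }
};

int main() {
  Method print{"print"};
  ObjClass string_class{{{"print", &print}}};
  Object obj{&string_class};

  SendSite site{"print", nullptr};
  site.dispatch(&obj);  // miss: full lookup, chain grows to length 1
  site.dispatch(&obj);  // hit: resolved by a single class comparison
}
```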