Skip to main content.

Thursday, March 27, 2008

One of the things that needs a review in 0.9 version is the VM stack. To code that fast and safe, I recently used a GenericVector?, but this is suboptimal by far. First of all, representing the stack as a long linear vector poses relocation problems when the depth of the stack grows too much; secondly, the access and copy of single elements is dead slow with respect to the two mov eXX asms that an item copy requires.

Since there was also the problem that last call parameter count was limited to one byte in the stack frame structure, I took that occasion to do a bit of experimentations. An efficient call stack is a must for a functional language, so I took a bit of time to think how to make it better.

I tried a per-frame structure which holds only the local stack frame, and uses a list of frames, one per call. That seemed a good idea, but creating an instance (also, not a light one) for each call is definitely not a way to go.

Also, I took some time to do some experiment with the longtests:
Falcon unit test package.

Version 0.8.9 (piuma)
2g: success. ( 11.749 secs, 2553408.801 ops/sec)
2j: success. ( 15.906 secs, 1886080.724 ops/sec)
2h: success. ( 11.938 secs, 2512983.749 ops/sec)

2g, 2j and 2h tests calls functions respectively with depth of 100, 1000 and 10000 recursions; the total calls in each of them is the same, only the maximum depth of the calls varies. This experiment shows that there is a big performance hit when depth of the stack goes beyond a critical mass; then the application (i.e. faltest) is given some extra memory, and the it even recovers performance in the following steps.

So, while embedded scripts won't probably never break the 128 pre-allocated items in the linear stack, this performance (2.5M calls per seconds, plus some opcodes in each call), seems adequate even for functional based heavier applications in this 0.8 step.

So, experiments on the VM stack are closed till 0.9, where we'll perform massive optimizations. In that occasion, the stack will be turned into a specific wrapped Item*, and we'll experiment again for optimal growth strategies.

I suspect both the call sequence and the access to stack performance can be easily doubled (each item is pushed in the stack through a memcpy instead of a "=" operator that would be optimized in two movs), and reads require extra checks, that is branches, that an Item* won't need).


No comments yet

Add Comment

This item is closed, it's not possible to add new comments to it or to vote on it