Feature #12628
Closed
change block/env structs
Description
I will change the block/env structures for performance.
I'm not sure who is interested in this area, but it will be a big change.
Issues
Now, MRI has several problems.
(1) We need to clear rb_control_frame_t::block_iseq for every frame setup. It costs space (a VALUE for each frame) and initialization time.
(2) There are several ways to pass a block: by ISeq (iter{...}), Proc (iter(&pr)), and Symbol (iter(&:sym)). However, they are not optimized (for Symbol blocks, there is only ad-hoc check code).
(3) Env (and Proc, Binding) objects are not WB-protected ([Bug #10212]).
Proposal
To solve them, I wrote a big patch.
https://github.com/ruby/ruby/compare/trunk...ko1:block_code
Introduce Block Handler (BH)
For Issues (1) and (2), I introduced a concept "Block Handler" (BH).
Current implementation
Now, rb_block_t pointers are passed to represent given blocks. rb_block_t has the following types.
(1) A part of the current control frame (with block_iseq = iseq) (iter{...})
(2) Proc body (iter(&pr))
(3) A part of the current control frame (with block_iseq = :sym) (iter(&:sym))
(Internally, there is also (4) ifunc, for blocks implemented in C.)
They are placed on the frame of the called method (as a local variable (ep[0])).
To mark the Proc on GC for (2), we prepare rb_block_t::proc (== rb_control_frame_t::block_iseq).
Using BH
To remove rb_block_t::proc (== rb_control_frame_t::block_iseq), we introduce BH to put a Proc or Symbol directly as the given block (it is stored as a special local variable). Proc and Symbol are normal objects, so we can store them without any concern.
We need to think about the iseq and ifunc types ((1) and (4)). To make it clear, I introduced struct rb_captured_block to represent a set of self, local variables (ep) and iseq (or ifunc). (Now rb_block_t represents the same set.)
Passed blocks with iseq (iter{...}) are represented by a pointer to rb_captured_block. Such pointers are not managed VALUEs, so we add a tag to them.
- ptr | 0x01 -> pointer to captured_block that contains iseq
- ptr | 0x03 -> pointer to captured_block that contains ifunc (for internal use)
Tagged pointers are recognized as Fixnum by GC.
(Note that the current implementation uses this tagged pointer to represent the "local frame" (no previous Env) flag. Instead of tagged information, we introduce VM_ENV_FLAG_LOCAL as a frame flag for this purpose. See the next chapter about "ENV_FLAG"s.)
We can recognize the type of a passed BH with the following rules:
(0) BH == VM_BLOCK_HANDLER_NONE (== 0) -> no block given
(1) (BH & 0x03) == 0x01 -> pointer to captured_block that contains iseq
(2) (BH & 0x03) == 0x03 -> pointer to captured_block that contains ifunc
(3) SYMBOL_P(BH) -> Symbol
(4) Otherwise -> Proc
This is what vm_block_handler_type(VALUE block_handler) does.
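The rules above can be sketched as a small standalone C program. This is only an illustration under assumptions: every toy_ name is hypothetical, VALUE is a stand-in typedef, and toy_symbol_p fakes SYMBOL_P by treating values whose low byte is 0x0c as symbols. It is not the code in the patch.

```c
#include <stdint.h>

/* Toy stand-ins for illustration; these are NOT the MRI definitions. */
typedef uintptr_t VALUE;

struct toy_captured_block {
    VALUE self;
    const VALUE *ep;
    const void *code;   /* would be an iseq or an ifunc */
};

enum toy_bh_type {
    TOY_BH_NONE, TOY_BH_ISEQ, TOY_BH_IFUNC, TOY_BH_SYMBOL, TOY_BH_PROC
};

#define TOY_BLOCK_HANDLER_NONE ((VALUE)0)

/* Assumption: symbols are faked as values whose low byte is 0x0c. */
static int
toy_symbol_p(VALUE v)
{
    return (v & 0xff) == 0x0c;
}

/* Tag a captured block pointer; the struct is at least 4-byte aligned,
 * so the low two bits of the pointer are free to carry the tag. */
static VALUE
toy_bh_from_iseq_block(const struct toy_captured_block *cb)
{
    return (VALUE)cb | 0x01;
}

static VALUE
toy_bh_from_ifunc_block(const struct toy_captured_block *cb)
{
    return (VALUE)cb | 0x03;
}

/* Strip the tag to get the original pointer back. */
static const struct toy_captured_block *
toy_bh_to_captured(VALUE bh)
{
    return (const struct toy_captured_block *)(bh & ~(VALUE)0x03);
}

/* Classification in the same order as rules (0) to (4). */
static enum toy_bh_type
toy_block_handler_type(VALUE bh)
{
    if (bh == TOY_BLOCK_HANDLER_NONE) return TOY_BH_NONE;
    if ((bh & 0x03) == 0x01) return TOY_BH_ISEQ;
    if ((bh & 0x03) == 0x03) return TOY_BH_IFUNC;
    if (toy_symbol_p(bh)) return TOY_BH_SYMBOL;
    return TOY_BH_PROC;
}
```

Both tags have the lowest bit set, which is why a tagged pointer looks like a Fixnum to the GC and is not followed.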
To invoke a passed block represented by BH, we need to check the type of each BH with vm_block_handler_type(VALUE block_handler). This adds some overhead, because the current implementation only needs to check rb_block_t::iseq (which can contain an iseq, ifunc or Symbol). However, I believe it is simpler and more readable.
In fact, the "invoke block" benchmark (vm1_yield) is faster.
I renamed rb_block_t to struct rb_block to represent an escaped block which is stored by Proc or Binding.
We introduce rb_block::type to represent the block type, corresponding to the BH's type.
rb_block::as is a union that represents the block body specified by type.
We can convert between rb_block and BH in both directions.
struct rb_block {
    union {
        struct rb_captured_block captured;
        VALUE symbol;
        VALUE proc;
    } as;
    enum rb_block_type type;
};
To check the type of a block, we should use vm_block_type() instead of checking rb_block::type directly, because it contains several assertions (when VM_CHECK_MODE > 0).
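As a rough sketch of why a checked accessor is preferable to reading the type field directly, the following standalone C code wraps a tagged union in an accessor that asserts basic consistency. All toy_ names and the specific checks are assumptions for illustration; the real assertions in the patch may differ.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy stand-ins for illustration; these are NOT the MRI definitions. */
typedef uintptr_t VALUE;

struct toy_captured_block {
    VALUE self;
    const VALUE *ep;
    const void *code;   /* would be an iseq or an ifunc */
};

enum toy_block_type {
    toy_block_type_iseq,
    toy_block_type_ifunc,
    toy_block_type_symbol,
    toy_block_type_proc
};

struct toy_block {
    union {
        struct toy_captured_block captured;
        VALUE symbol;
        VALUE proc;
    } as;
    enum toy_block_type type;
};

/* Return the type after sanity checks, so that every reader of the
 * union goes through the same validation (hypothetical checks). */
static enum toy_block_type
toy_block_type_checked(const struct toy_block *block)
{
    switch (block->type) {
      case toy_block_type_iseq:
      case toy_block_type_ifunc:
        /* a captured block must carry code to run */
        assert(block->as.captured.code != NULL);
        break;
      case toy_block_type_symbol:
      case toy_block_type_proc:
        /* a Symbol or Proc body must be a non-zero object reference */
        assert(block->as.symbol != 0);
        break;
      default:
        assert(0 && "unknown block type");
    }
    return block->type;
}
```

Callers then switch on the returned value and read only the matching member of the union.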
Short summary
(1) Introduce struct rb_captured_block to represent a set of self, variables (ep), and code (iseq or ifunc). Usually the storage for this type is part of the caller's control frame.
(2) Methods called with a block receive a "Block Handler" (BH) that represents the passed block. It is a tagged struct rb_captured_block pointer (which looks like a Fixnum), a Proc object or a Symbol object.
(3) The method that calls the block (== iterator) invokes it by checking the type of the given BH. We can check the BH type with vm_block_handler_type().
(4) To make a Proc, convert the BH to struct rb_block.
Introduce WB for Env objects
WB is important for generational and incremental GC (for issue (3)). We can run MRI without WBs for all objects thanks to RGenGC's "wb-unprotected" technique. In fact, we have not introduced WBs for RubyVM::Env (Env) objects so far, because doing so has a performance impact: every assignment to a local variable would have to check whether a WB is needed.
However, leaving them unprotected causes several performance regressions. For example, if an application creates many Proc objects, the corresponding Env objects are created and must be marked at each minor GC (because they are wb-unprotected). This is what the ticket [Bug #10212] shows.
So we need to achieve a "low-latency WB (for Env objects)".
Current MRI's local variable assignment:
/* actual assignment in insns.def, setlocal instruction */
*(ep - idx) = val;
Naive implementation with WB will be:
#define VM_EP_IN_HEAP_P(th, ep) (!((th)->stack <= (ep) && (ep) < ((th)->stack + (th)->stack_size)))
if (VM_EP_IN_HEAP_P(th, ep)) {
    RB_OBJ_WRITE(VM_ENV_EP_ENVVAL(ep), ep-idx, val);
}
else {
    *(ep - idx) = val;
}
It is correct, but not very fast (in fact, it is too slow when the Env is in the heap (== escaped)).
Approach
First, we need to check whether the local variables are located on (1) the VM stack or (2) an Env. We don't need WB protection for (1) because VM stacks are roots for every GC.
To make it simple, we move rb_control_frame_t::flags to ep[0] (as a special local variable) and introduce VM_ENV_FLAG_ESCAPED. We can easily check "on stack" ((flags & VM_ENV_FLAG_ESCAPED) == 0) or "escaped" (== on Env) ((flags & VM_ENV_FLAG_ESCAPED) != 0). We don't need to compare against the VM stack range.
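A minimal sketch of the flag-based check, assuming the flags word sits at ep[0]. The bit value chosen for TOY_ENV_FLAG_ESCAPED and the toy_ names are hypothetical. Note the parentheses around the mask: in C, == binds tighter than &.

```c
#include <stdint.h>

/* Toy stand-ins for illustration; these are NOT the MRI definitions. */
typedef uintptr_t VALUE;

#define TOY_ENV_DATA_INDEX_FLAGS 0              /* flags live at ep[0] */
#define TOY_ENV_FLAG_ESCAPED     ((VALUE)0x04)  /* hypothetical bit value */

/* One load and one mask replace the two pointer comparisons
 * against the VM stack range. */
static int
toy_env_escaped_p(const VALUE *ep)
{
    return (ep[TOY_ENV_DATA_INDEX_FLAGS] & TOY_ENV_FLAG_ESCAPED) != 0;
}
```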
To place the flags in ep (local variables), I cleaned up the managed data area in the local variables.
#define VM_ENV_DATA_SIZE             ( 3)
#define VM_ENV_DATA_INDEX_ME_CREF    (-2) /* ep[-2] */
#define VM_ENV_DATA_INDEX_SPECVAL    (-1) /* ep[-1] */
#define VM_ENV_DATA_INDEX_FLAGS      ( 0) /* ep[ 0] */
#define VM_ENV_DATA_INDEX_ENV        ( 1) /* ep[ 1] */
#define VM_ENV_DATA_INDEX_ENV_PROC   ( 2) /* ep[ 2] */
This means that 3 (== VM_ENV_DATA_SIZE) special local variables are allocated for each frame (indexes -2 to 0).
(Note that indexes 1 and 2 are used only by an escaped Env.)
Current MRI already has 2 special local variables (me_cref and special).
I introduced macro names to avoid magic numbers.
To respect this local variable layout, compile.c requires several fixes, and rb_iseq_t::local_size is no longer needed (we can calculate the number of local variables from local_table_size and VM_ENV_DATA_SIZE).
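The slot arithmetic can be sketched as follows. This assumes that ordinary local variables sit directly below the special slots, as the *(ep - idx) addressing suggests; the toy_ helper names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Toy stand-ins for illustration; these are NOT the MRI definitions. */
typedef uintptr_t VALUE;

#define VM_ENV_DATA_SIZE            ( 3)
#define VM_ENV_DATA_INDEX_ME_CREF   (-2)
#define VM_ENV_DATA_INDEX_SPECVAL   (-1)
#define VM_ENV_DATA_INDEX_FLAGS     ( 0)

/* Total slots a frame needs: ordinary locals plus the special slots. */
static size_t
toy_env_slot_count(size_t local_table_size)
{
    return local_table_size + VM_ENV_DATA_SIZE;
}

/* Assumption: locals come first, then me_cref, specval and flags,
 * so ep points at the last slot (the flags). */
static VALUE *
toy_env_ep_from_base(VALUE *base, size_t local_table_size)
{
    return base + toy_env_slot_count(local_table_size) - 1;
}
```

For a frame with 2 local variables this gives 5 slots, with ep[0] on the last one and ep[-2] on the third.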
Another optimization is the introduction of the VM_ENV_FLAG_WB_REQUIRED flag.
It is a very tricky and dangerous method, so we should not use this hack in other places.
This flag is tightly coupled to the current GC implementation.
We need WB protection for "non-remembered old objects (or gray objects in incremental GC)". Once an old object is remembered, we don't need WB protection any more until the next marking. VM_ENV_FLAG_WB_REQUIRED represents this status.
(1) When an Env object is initialized, VM_ENV_FLAG_WB_REQUIRED is true.
(2) At the first local variable assignment, VM_ENV_FLAG_WB_REQUIRED is true, so we apply WB protection for this Env object and turn off the flag.
(3) At the next local variable assignment, VM_ENV_FLAG_WB_REQUIRED is false, so we can skip WB protection.
(4) At GC marking of this Env object, we turn on VM_ENV_FLAG_WB_REQUIRED again and go to (2).
The time between (2) and (4) can be long enough that only a few WB protections are needed.
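The life cycle of the flag can be simulated with a toy model. This sketch assumes that GC marking sets the flag again, which is what "goto (2)" implies. The barrier is replaced by a counter, and all toy_ names and the bit value are hypothetical.

```c
#include <stdint.h>

/* Toy stand-ins for illustration; these are NOT the MRI definitions. */
typedef uintptr_t VALUE;

#define TOY_ENV_FLAG_WB_REQUIRED ((VALUE)0x01)  /* hypothetical bit value */

struct toy_env {
    VALUE flags;
    VALUE slot;              /* one local variable */
    unsigned barrier_count;  /* how often the slow path ran */
};

/* (1) A fresh Env starts with the flag set. */
static void
toy_env_init(struct toy_env *env)
{
    env->flags = TOY_ENV_FLAG_WB_REQUIRED;
    env->slot = 0;
    env->barrier_count = 0;
}

/* (2)/(3) Only a write that sees the flag pays for the barrier. */
static void
toy_env_write(struct toy_env *env, VALUE v)
{
    if (env->flags & TOY_ENV_FLAG_WB_REQUIRED) {
        env->barrier_count++;  /* stand-in for remembering the Env in the GC */
        env->flags &= ~TOY_ENV_FLAG_WB_REQUIRED;
    }
    env->slot = v;
}

/* (4) Marking re-arms the flag for the next write. */
static void
toy_env_gc_mark(struct toy_env *env)
{
    env->flags |= TOY_ENV_FLAG_WB_REQUIRED;
}
```

In this model, any number of writes between two markings costs exactly one barrier, which is the property the flag is meant to provide.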
Finally, the local variable assignment code looks like the following.
NOINLINE(static void vm_env_write_slowpath(const VALUE *ep, int index, VALUE v));
static void
vm_env_write_slowpath(const VALUE *ep, int index, VALUE v)
{
    /* remember env value forcibly */
    rb_gc_writebarrier_remember(VM_ENV_ENVVAL(ep));
    VM_FORCE_WRITE(&ep[index], v);
    VM_ENV_FLAGS_UNSET(ep, VM_ENV_FLAG_WB_REQUIRED);
}
static inline void
vm_env_write(const VALUE *ep, int index, VALUE v)
{
    VALUE flags = ep[VM_ENV_DATA_INDEX_FLAGS];
    if (LIKELY((flags & VM_ENV_FLAG_WB_REQUIRED) == 0)) {
        VM_STACK_ENV_WRITE(ep, index, v); /* write lvar directly */
    }
    else {
        vm_env_write_slowpath(ep, index, v);
    }
}
With these techniques, RubyVM::Env objects are now WB-protected without a big performance impact.
Proc and Binding objects are now WB-protected as well.
Short summary
To make Env objects WB-protected, I implemented a low-overhead WB technique.
(1) Move frame flags from rb_control_frame_t::flags to ep[0] (as a special local variable) and introduce VM_ENV_FLAG_ESCAPED to represent an escaped Env.
(2) Introduce VM_ENV_FLAG_WB_REQUIRED to check the necessity of WB protection, which is tightly coupled with the GC implementation.
(3) With this technique and other hacks, RubyVM::Env, Proc and Binding objects are now WB-protected.
Evaluation
By introducing WBs for Env/Proc objects, we can improve the throughput of the app_lc_fizzbuzz benchmark.
Method and block invocations are also faster.
Several results:
                      trunk  modified
app_lc_fizzbuzz      58.277    41.729 (sec) (x 1.397 faster)
vm1_simplereturn*     0.660     0.638 (sec) (x 1.035 faster)
vm1_yield*            0.738     0.650 (sec) (x 1.135 faster)
Several programs are slower.
                      trunk  modified
app_pentomino        14.096    15.241 (sec) (x 0.925 faster == slower)
vm1_lvar_set*         1.893     1.916 (sec) (x 0.988 faster == slower)
lvar_set sets local variables many times, but the impact is not so big.
I'm not sure why the pentomino puzzle is slower.
All benchmark results are here:
https://gist.github.com/ko1/c741cd4b2a5a5012364c0686703052b3
Summary
I made a patch to solve issues (1) to (3).
https://github.com/ruby/ruby/compare/trunk...ko1:block_code
The patch is rather big, but it is difficult for me to separate it into small pieces,
so I'll commit it all at once soon, sorry.