Bug #15316: rb_postponed_job_register not thread-safe - Ruby - Ruby Issue Tracking System

Actions

Copy link

Bug #15316

closed

rb_postponed_job_register not thread-safe

Added by normalperson (Eric Wong) over 6 years ago. Updated over 6 years ago.

Status:

Closed

Assignee:

k0kubun (Takashi Kokubun)

Target version:

ruby -v:

Backport:

2.3: UNKNOWN, 2.4: UNKNOWN, 2.5: UNKNOWN

[ruby-core:89848]

Description

Consider following execution timeline for two threads, t1 and t2.
(should be 2 columns, I only have fixed-width fonts for display)

t1                                      t2 (mjit-worker)
-----------------------------------------------------------------------------

 vm->postponed_job_index increment #1
 pjob[0]->func = ...
 pjob[0]->data = ...
                                        vm->postponed_job_index increment #2

 RUBY_VM_SET_POSTPONED_JOB_INTERRUPT

 rb_postponed_job_flush
 sees result of increment #2 from t2
 tries to access pjob[1] => CRASH
                                       pjob[1]->func = ...
                                       pjob[1]->data = ...

                                       RUBY_VM_SET_POSTPONED_JOB_INTERRUPT

So it looks like we need a THREAD-SAFE (not necessarily async-safe)
version of rb_postponed_job_register. Or a one-off API for MJIT...
job registration for MJIT.

Related issues 1 (0 open — 1 closed)

Actions

Copy link

#1 [ruby-core:89851]

Updated by normalperson (Eric Wong) over 6 years ago

https://bugs.ruby-lang.org/issues/15316

So it looks like we need a THREAD-SAFE (not necessarily async-safe)
version of rb_postponed_job_register. Or a one-off API for MJIT...
job registration for MJIT.

I think I will introduce a new "workqueue" API dedicated to
thread-safety (but not async-signal-safety). We can use the
same interrupt flag, but it will use a separate data structure.

My goal is also to support MJIT in forked processes,
because (AFAIK) some production Ruby instances use fork
with long-running processes.

Actions

Copy link

#2 [ruby-core:89852]

Updated by k0kubun (Takashi Kokubun) over 6 years ago

Thanks to report that.

My goal is also to support MJIT in forked processes

Oh by the way, I have a patch which was created yesterday but I was too busy to commit that yet. (I'll commit that shortly.)

I think thread-safety of postponed_job and supporting MJIT in forked Ruby process are independent issues, but maybe your "workqueue" will be helpful for forked child somehow?

Actions

Copy link

#3 [ruby-core:89853]

Updated by normalperson (Eric Wong) over 6 years ago

takashikkbn@gmail.com wrote:

Oh by the way, I have a patch which was created yesterday but
I was too busy to commit that yet. (I'll commit that shortly.)

OK, please do.

I am also testing a minor cleanup:
https://80x24.org/spew/20181118081847.16293-1-e@80x24.org/raw
(but will probably be afk next 12 hours or so)

I think thread-safety of postponed_job and supporting MJIT in
forked Ruby process are independent issues, but maybe your
"workqueue" will be helpful for forked child somehow?

Somewhat related. We need to ensure a consistent state of the
queue (either postponed_job or workqueue).

Right now, we cannot guarantee a consistent state after forking;
either.

Actions

Copy link

#4 [ruby-core:89854]

Updated by k0kubun (Takashi Kokubun) over 6 years ago

OK, please do.

Done, in r65785. Comments from your point of view would be much appreciated.

I am also testing a minor cleanup

looks fine.

Somewhat related. We need to ensure a consistent state of the queue (either postponed_job or workqueue).

In a sense that postponed_job could be in an unexpected state by creating a child Ruby process which suddenly lost MJIT worker thread modifying postponed_job queue, agreed.

Right now, we cannot guarantee a consistent state after forking; either.

Yeah, I found that issue when I was creating r65785. In r65785, I added mjit_pause before fork and mjit_resumes after fork (in different places depending on whether it's child or parent). As long as MJIT worker thread is absent when Ruby is forked, I think the situation became a little better.

Actions

Copy link

#5 [ruby-core:89858]

Updated by k0kubun (Takashi Kokubun) over 6 years ago

Well, I seem to have introduced at least two deadlocks by that. I already fixed the first one, and I'm going to fix the second one http://ci.rvm.jp/results/trunk-mjit@silicon-docker/1467162.

I'll take a deeper look when I come home today, but do you know who wakes up rb_sigwait_sleep and/or how the hang there could happen?

Actions

Copy link

#6 [ruby-core:89859]

Updated by normalperson (Eric Wong) over 6 years ago

takashikkbn@gmail.com wrote:

Well, I seem to have introduced at least two deadlocks by
that. I already fixed the first one, and I'm going to fix the
second one
http://ci.rvm.jp/results/trunk-mjit@silicon-docker/1467162.

I'll take a deeper look when I come home today, but do you
know who wakes up rb_sigwait_sleep and/or how the hang there
could happen?

Any signal should wake it up...

If you can't solve it, can you wait until after the 11/22
developers meeting?

It's a small chance, but I really want to get Thread::Light
(auto-fiber) ready and into 2.6 (at least for Linux/FreeBSD),
and I barely have time for Ruby as it is.

It's not looking like I'll be able to afford to hack on Ruby
next year :<

Actions

Copy link

#7 [ruby-core:89860]