Project

General

Profile

Actions

Bug #18900

closed

Missing events(?)/unexpected results using GVL instrumentation API

Added by ivoanjo (Ivo Anjo) about 1 month ago. Updated about 1 month ago.

Status:
Closed
Priority:
Normal
Assignee:
-
Target version:
-
ruby -v:
ruby 3.2.0dev (2022-07-06T09:08:31Z master 5ef3c7ea2d) [x86_64-linux]
[ruby-core:109157]

Description

I'd like to start by thanking @byroot (Jean Boussier) for his amazing work on the GVL Instrumentation API. This new API is going to be really useful for those of us doing performance analysis of Ruby apps :)

For context, I'm running a recent ruby master build: ruby 3.2.0dev (2022-07-06T09:08:31Z master 5ef3c7ea2d) [x86_64-linux]

The issue

I'm working at Datadog on the ddtrace gem, and I've been prototyping the use of the GVL Instrumentation API on our Profiler (still an experiment for now).

I've built a trivial GVL event handler ( https://github.com/DataDog/dd-trace-rb/blob/9f98c371f114a037d5c73b1d8efbc1220797c350/ext/ddtrace_profiling_native_extension/gvl_tracing.c ) that prints 3 things to a file for every event: thread id, timestamp, and event.

I've noticed that there are events "missing" that I would've expected to have received. For instance, consider the following code:

Datadog::Profiling::GvlTracing._native_start

def fib(n)
  return n if n <= 1
  fib(n-1) + fib(n-2)
end

1.times.map { Thread.new { fib(35) } }.map(&:join)

Datadog::Profiling::GvlTracing._native_stop

and here's the resulting events received from the Ruby VM:

# thread id..........timestamp............event
  [139927701786432, 1657188684855568392, "suspended"],
  [139927701786432, 1657188684855602550, "ready"],
  [139927701786432, 1657188684855603318, "resumed"],
  [139927633524480, 1657188684855761222, "started"],
  [139927633524480, 1657188684855763974, "ready"],
  [139927633524480, 1657188684855764431, "resumed"],
  [139927633524480, 1657188685610511252, "suspended"],
  [139927633524480, 1657188685610517376, "exited"],
  [139927701786432, 1657188685610615925, "ready"],
  [139927701786432, 1657188685610644531, "resumed"],

This seems surprising to me: the main thread that got blocked on #join was never marked as suspended, even though there's a resumed event from a different thread.

When looking at it from a per-thread timeline point-of-view, the events for thread 139927701786432 are "suspended" -> "ready" -> "resumed" -> "ready" which isn't what I was expecting.

This issue doesn't seem to be connected to the #join specifically.

For instance if I add a second thread:

+1.times.map { Thread.new { fib(35) } }.map(&:join)
-2.times.map { Thread.new { fib(35) } }.map(&:join)

here's what I get:

# thread id..........timestamp............event
  [140298252205888, 1657190331866960767, "suspended"],
  [140298252205888, 1657190331866997030, "ready"],
  [140298252205888, 1657190331866997882, "resumed"],
  [140298181732096, 1657190331867188792, "started"],
  [140298181732096, 1657190331867190524, "ready"],
  [140298181732096, 1657190331867190813, "resumed"], # <--- first fibonacci thread started working without main suspending
  [140298183833344, 1657190331867198615, "started"],
  [140298183833344, 1657190331867201182, "ready"],
  [140298183833344, 1657190331967683212, "resumed"], # <-- second fibonacci thread resumed without a suspend from the first
  [140298181732096, 1657190331967765804, "ready"],
  [140298181732096, 1657190332067699266, "resumed"], # <--- first fibonacci again, missing suspend event
  [140298183833344, 1657190332067722234, "ready"],
  [140298183833344, 1657190332167930765, "resumed"],
  [140298181732096, 1657190332167949981, "ready"],
  [140298181732096, 1657190332268129230, "resumed"],
  [140298183833344, 1657190332268153149, "ready"],
  [140298183833344, 1657190332368425743, "resumed"],
  [140298181732096, 1657190332368450799, "ready"],
  [140298181732096, 1657190332468822338, "resumed"],
  [140298183833344, 1657190332468908400, "ready"],
  [140298183833344, 1657190332568760756, "resumed"],
  [140298181732096, 1657190332568807475, "ready"],
  [140298181732096, 1657190332669003133, "resumed"],
  [140298183833344, 1657190332669041747, "ready"],
  [140298183833344, 1657190332769014200, "resumed"],
  [140298181732096, 1657190332769035727, "ready"],
  [140298181732096, 1657190332869415660, "resumed"],
  [140298183833344, 1657190332869531747, "ready"],
  [140298183833344, 1657190332969455281, "resumed"],
  [140298181732096, 1657190332969512054, "ready"],
  [140298181732096, 1657190333069937592, "resumed"],
  [140298183833344, 1657190333069992165, "ready"],
  [140298183833344, 1657190333169832772, "resumed"],
  [140298181732096, 1657190333169860028, "ready"],
  [140298181732096, 1657190333270139615, "resumed"],
  [140298183833344, 1657190333270239955, "ready"],
  [140298181732096, 1657190333323601109, "suspended"],
  [140298181732096, 1657190333323642507, "exited"],
  [140298183833344, 1657190333323736093, "resumed"],
  [140298183833344, 1657190333375825514, "suspended"],
  [140298183833344, 1657190333375832829, "exited"],
  [140298252205888, 1657190333375867753, "ready"],
  [140298252205888, 1657190333375870165, "resumed"],

On my prototype, for instance in https://github.com/DataDog/dd-trace-rb/blob/ivoanjo/experiment-gvl-tracing/gvl_tracing_example1.rb, I've taken to simulating a suspend event when I find this kind of gap in the event stream.

How to reproduce

Here's how to use my tracing experiment code, from scratch:

$ git clone https://github.com/DataDog/dd-trace-rb.git
Cloning into 'dd-trace-rb'...
remote: Enumerating objects: 78965, done.
remote: Counting objects: 100% (1621/1621), done.
remote: Compressing objects: 100% (857/857), done.
remote: Total 78965 (delta 952), reused 1336 (delta 746), pack-reused 77344
Receiving objects: 100% (78965/78965), 50.54 MiB | 10.98 MiB/s, done.
Resolving deltas: 100% (55760/55760), done.
$ cd dd-trace-rb/
/dd-trace-rb$ git checkout ivoanjo/experiment-gvl-tracing
branch 'ivoanjo/experiment-gvl-tracing' set up to track 'origin/ivoanjo/experiment-gvl-tracing'.
Switched to a new branch 'ivoanjo/experiment-gvl-tracing'
/dd-trace-rb$ docker run --network=host -ti -v `pwd`:/working rubocophq/ruby-snapshot:latest /bin/bash
root@docker-desktop:/# ruby -v
ruby 3.2.0dev (2022-07-06T09:08:31Z master 5ef3c7ea2d) [x86_64-linux]
root@docker-desktop:/# cd working/
root@docker-desktop:/working# bundle install
# ...
Bundle complete! 35 Gemfile dependencies, 70 gems now installed.
# ...
root@docker-desktop:/working# bundle exec rake compile
# ...
root@docker-desktop:/working# bundle exec ruby gvl_tracing_example1.rb
Started!
Stopped!
root@docker-desktop:/working# head gvl_tracing_out.json
[
  [140451049381696, 1657190910088942000, "started_tracing"],
  [140451049381696, 1657190910088957670, "resumed"],
  [140451049381696, 1657190910089018233, "suspended"],
  [140450981082880, 1657190910089076610, "resumed"],
  [140451049381696, 1657190910089085748, "ready"],
  [140451049381696, 1657190910089169091, "resumed"],
  [140450981082880, 1657190910143208013, "ready"],
  [140450981082880, 1657190910143257050, "resumed"],
  [140451049381696, 1657190910143547517, "ready"],
root@docker-desktop:/working# head gvl_tracing_ux_2022-07-07T10\:48\:40Z.json
{
  "threads": {
    "140451049381696": [
      {
        "startNs": 1657190910088942000,
        "endNs": 1657190910088957670,
        "label": "started_tracing",
        "durationSeconds": 1.567e-05
      },
      {

Updated by byroot (Jean Boussier) about 1 month ago

This seems surprising to me: the main thread that got blocked on #join was never marked as suspended, even though there's a resumed event from a different thread.

I may have missed some kind of fastpath in which we should trigger that event. I'll have a look.

Actions #3

Updated by byroot (Jean Boussier) about 1 month ago

  • Status changed from Open to Closed

Applied in changeset git|587d2d199b3f783d03266d42d066949f8a4824d3.


thread_pthread.c: call SUSPENDED event when entering native_sleep

[Bug #18900]

Thread#join and a few other codepaths are using native sleep as
a way to suspend the current thread. So we should call the relevant
hook when this happen, otherwise some thread may transition
directly from RESUMED to READY.

Actions

Also available in: Atom PDF