Misc #16234

Enabling ARM 64/32-bit cases by Drone CI

Added by jaruga (Jun Aruga) 9 months ago. Updated 8 months ago.

Status:
Closed
Priority:
Normal
Assignee:
-
[ruby-core:95201]

Description

Currently the Ruby project has 4 CIs on GitHub.

  1. Travis CI: linux cases with flags and compilers.
  2. GitHub Actions: macOS, windows, ubuntu
  3. Wercker: Ruby JIT cases
  4. Appveyor: windows

I would like to suggest a 5th CI: Drone CI for ARM 64/32-bit cases.
Drone CI natively supports ARM 64/32-bit environments.
Have you used Drone CI?

I tried both Drone CI and Shippable CI, which support ARM.
My impression of Drone CI is quite good: great user experience and user interface.
Shippable CI was not as good, for several reasons.

Drone CI has not only linux ARM 64/32-bit environments in DockerRunner mode (= using a container for CI, like Wercker), but also freebsd, netbsd, openbsd, dragonfly (?) and solaris environments in ExecRunner mode (= maybe running commands directly without a container), according to the following documents.

It is exciting, isn't it?
We can check ARM issues at pull-request time.

Here is an example. The content is almost the same as wercker.yml except for the JIT option.
"ruby/3" failed on the latest master branch, but the "ruby/2" arm64 case succeeded on an old master branch.
https://cloud.drone.io/junaruga/ruby/3
https://github.com/junaruga/ruby/blob/feature/ci-arm/.drone.yml
https://cloud.drone.io/junaruga/ruby/2
Here is the pull-request as an example.
https://github.com/ruby/ruby/pull/2520

.drone.yml is the file to manage the CI cases.
But as you can see, most of the YAML is the same between the ARM 64-bit and 32-bit cases in .drone.yml. In .travis.yml, we use the YAML anchor (&) and alias (*) feature effectively. But in .drone.yml, I am not sure we can still use it across the "---" separator. Luckily, Drone CI has started providing an alternative: a .drone.star file written in the Starlark language.
https://docs.drone.io/starlark/overview/
https://blog.drone.io/create-pipelines-using-starlark/
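
As a rough check of the concern about the "---" separator: in YAML, an anchor is scoped to a single document, so an alias placed after "---" cannot see an anchor defined before it. The following is a minimal Ruby sketch using the standard yaml library. The keys in the sample stream are only illustrative and are not taken from the real .drone.yml.

```ruby
require "yaml"

# Two YAML documents in one stream. The anchor &steps is defined in the
# first document and referenced with *steps in both documents.
STREAM = <<~YAML
  name: arm64
  steps: &steps
    - make
    - make test
  again: *steps
  ---
  name: arm32
  steps: *steps
YAML

# Returns true when every alias in the stream resolves, false otherwise.
def aliases_resolve?(stream)
  YAML.load_stream(stream)
  true
rescue Psych::BadAlias
  false
end

puts aliases_resolve?(STREAM)                    # alias used after "---"
puts aliases_resolve?(STREAM.split("---").first) # alias used in the same document
```

If this holds for Drone CI's YAML parser too (not verified here), the shared part would have to be repeated in each document, which is the duplication that .drone.star avoids.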

Enabling Drone CI is quite simple.
Just go to https://drone.io/ , then register and enable target repository. UI is quite good.

Pros

  • We can check ARM 64/32-bit cases, and possibly freebsd and solaris cases too.
  • It's for free.
  • Each developer can debug ARM cases on their forked repository.
  • Customize easily. I see .travis.yml is used effectively.

Cons

  • Have to manage an additional file, .drone.yml or .drone.star.

But first, I want to ask you. Are you interested in using Drone CI for Ruby project?

Updated by jaruga (Jun Aruga) 9 months ago

Now ARM 64-bit: success, ARM 32-bit: failed on the latest master branch again. :)
https://cloud.drone.io/junaruga/ruby/4

Updated by k0kubun (Takashi Kokubun) 9 months ago

I've never tried that, but the capability looks good. If the Solaris environment has Oracle Developer Studio, that'd be really nice. Even if not, ARM / FreeBSD / OpenBSD / NetBSD CIs would be nice to have, in addition to RubyCI's ones.

However, Drone CI seems to require owner access to ruby organization to enable CI. I'll wait for ruby organization's owner users to enable it.

Updated by k0kubun (Takashi Kokubun) 9 months ago

  • Status changed from Open to Closed

Naruse enabled the Drone CI and merged your PR bdbf8de4980ef54f466809ee27a9f2a00614b0f0.

Updated by jaruga (Jun Aruga) 9 months ago

Thanks for merging quickly!
I am looking forward to seeing ARM / Solaris / FreeBSD / OpenBSD / NetBSD CIs in .drone.yml or .drone.star :)

Updated by mame (Yusuke Endoh) 9 months ago

Just FYI: There are other CIs that are created and maintained by ourselves.

  • http://ci.rvm.jp/
    • It tests Ruby under a variety of configurations (JIT, assertions enabled, parallel testing, etc.)
    • It is the fastest CI: a notification is in about 30 seconds at the fastest
  • https://rubyci.org/
    • There are many voluntary CI hosts, and this site gathers each result.
    • ARM / Solaris (x86 and SPARC) / FreeBSD / OpenBSD are included.

BTW, I feel that we have too many CIs currently. Too many CIs brings too many notifications. Their formats are not uniform, which makes it harder to grasp the status.
I know that they have pros and cons, so it would be difficult to integrate all of them into one CI. (If Drone CI can do it, it is really great.)
It would be great if anyone creates a "curation" site to gather all the CI results into one site, like RubyCI.

Updated by jaruga (Jun Aruga) 9 months ago

There is "Azure pipelines" CI too.

BTW, I feel that we have too many CIs currently. Too many CIs brings too many notifications. Their formats are not uniform, which makes it harder to grasp the status.

Agreed, there are many CIs. There are no duplicated CI cases between them, right?
For example, it seems that the windows case is in GitHub Actions, Appveyor and Azure pipelines.

I know that they have pros and cons, so it would be difficult to integrate all of them into one CI. (If Drone CI can do it, it is really great.)
It would be great if anyone creates a "curation" site to gather all the CI results into one site, like RubyCI.

One possibility for Drone CI to integrate some of them into one CI is to use SSH Runner.
For example, it might be possible to integrate RubyCI's Solaris server into Drone CI using SSH Runner.
https://ssh-runner.docs.drone.io/configuration/

When I chatted with people from Drone CI in the public chat room (very useful!), they suggested it to me.
https://ssh-runner.docs.drone.io/support/

I was told Exec Runner was not available for the cloud. ;<
Instead, they recommend using SSH Runner for the purpose.

By the way, I also sent new PR to replace .drone.yml with .drone.star .
It's just refactoring. The content is same.
https://github.com/ruby/ruby/pull/2536

Maybe the owner of the ruby/ruby can see Setting - Main - Configuration - .drone.yml at https://cloud.drone.io/ruby/ruby/settings page.
They can rename the .drone.yml to .drone.star in the page before merging the PR.

Then the .drone.star becomes available to run for ruby/ruby repository.

Let me introduce Starlark (.star) and some tips about .drone.star that I investigated today.
Here is the Starlark language information. Its syntax is very similar to Python's.
https://docs.bazel.build/versions/master/skylark/language.html

I installed the Go implementation below.
https://github.com/google/starlark-go/

Here is the drone command line tool to check the result of .drone.star locally.
https://github.com/drone/drone-cli

The following command outputs JSON from the .drone.star in the current directory.

$ drone starlark --stdout --format=false

The following command outputs YAML from the .drone.star in the current directory.

$ drone starlark --stdout

I heard in the public chat room that .drone.star is converted to YAML internally in Drone CI, and that the YAML data is then executed.

Drone CI Bug Tracking System.
https://discourse.drone.io

Updated by jaruga (Jun Aruga) 9 months ago

Just for your information.

This is today's blog from Travis CI. It's about supporting ARM 64-bit.
https://blog.travis-ci.com/2019-10-07-multi-cpu-architecture-support

The syntax is "arch: arm64", unlike the existing "os: linux-ppc64le".

Updated by k0kubun (Takashi Kokubun) 9 months ago

There is "Azure pipelines" CI too.

We already dropped that in favor of GitHub Actions.

This is today's blog from Travis CI. It's about supporting ARM 64-bit.
https://blog.travis-ci.com/2019-10-07-multi-cpu-architecture-support

That's a nice feature. Possibly we can drop Drone to simplify the situation if we can give up arm32 support (not sure).

By the way, I also sent new PR to replace .drone.yml with .drone.star .

I'm neutral about introducing the Starlark. While it simplifies duplications in the build config, obviously current maintainers are not familiar with the language. Other CI systems are achieving the same thing by either YAML alias or a built-in matrix syntax. Doesn't Drone support any of them?

Updated by jaruga (Jun Aruga) 9 months ago

That's a nice feature. Possibly we can drop Drone to simplify the situation if we can give up arm32 support (not sure).

Yeah, it's a nice feature. We can drop Drone CI, if Travis works well with "arch: arm64". We might be able to run arm32 (ARM 32-bit) using multilib on arm64 (ARM 64-bit) on Travis too, as we have already been running i686 (Intel 32-bit) case on x86_64 (Intel 64-bit) on Travis.
I am still thinking about the possibility of checking Solaris, FreeBSD, etc. at pull-request time with SSH Runner on Drone CI.

if we can give up arm32 support (not sure).

I think arm32 is still popular for users, at least more than i686 (Intel 32-bit).

That is because of the architectures supported by Linux distributions, the Raspberry Pi, and the market share. Let me explain them one by one.

Linux distributions

For example Ubuntu is supporting arm32, providing the container image.

https://hub.docker.com/_/ubuntu
Supported architectures: (more info)
amd64, arm32v7, arm64v8, i386, ppc64le, s390x

Fedora project is supporting arm32 too.

https://hub.docker.com/_/fedora
Supported architectures: (more info)
amd64, arm32v7, arm64v8, ppc64le, s390x

Raspberry Pi

https://en.wikipedia.org/wiki/Raspberry_Pi
https://www.raspberrypi.org/blog/raspberry-pi-2-on-sale/

According to the Raspberry Pi wikipedia page, the Raspberry Pi version 1.1 is the last model for 32-bit, and the announcement was 5 August 2015.

When I discussed use cases of ARM 32-bit in the Fedora project, someone said "the Raspberry Pi performs quite badly as a 64-bit device for the moment, I've used it with Fedora armv7hl instead of aarch64." according to the email thread: https://lists.fedoraproject.org/archives/list/devel@lists.fedoraproject.org/message/Q742AVVBR6W6RTSVRYDSSGVKFOM3XTEF/

Market share

In 2017, armv7 (ARM 32-bit) was 98.1% in total Android device market share.
https://android.stackexchange.com/questions/186334/what-percentage-of-android-devices-runs-on-x86-architecture/202022#202022

I would like to hear other people's opinion about how much ARM 32-bit CPU is used currently.
I assume people working at ARM are in Ruby project, and they have current market share data about ARM 32-bit.

I'm neutral about introducing the Starlark. While it simplifies duplications in the build config, obviously current maintainers are not familiar with the language. Other CI systems are achieving the same thing by either YAML alias or a built-in matrix syntax. Doesn't Drone support any of them?

You are right. The Starlark way is not great for Ruby project. I would close the pull-request.
I have not investigated YAML aliases in Drone CI yet, and I could not find a matrix syntax in it.

Updated by k0kubun (Takashi Kokubun) 9 months ago

We might be able to run arm32 (ARM 32-bit) using multilib on arm64 (ARM 64-bit) on Travis too, as we have already been running i686 (Intel 32-bit) case on x86_64 (Intel 64-bit) on Travis.

Good call :)
For Arm CI, let's try it when we have a chance, to unify what tools we use.

I am still thinking about the possibility to check Solaris and FreeBSD and etc case at a pull-request timing with SSH Runner on Drone CI.

My original interest to Drone CI mainly comes from the Solaris environment capability. Because of [Misc #15347], having GitHub-integrated CI of Oracle Developer Studio 12.x would be helpful despite the RubyCI existence, compared to other missing environments like FreeBSD / OpenBSD / NetBSD (which might be still worth consideration if we decided to use Drone CI for a long term).

Comparing the maintenance cost and benefits, if Arm on Travis works fine and we cannot easily use Oracle Developer Studio 12 on Drone CI, maybe we should drop Drone CI, given that most of their coverage is well-maintained in RubyCI anyway.

Updated by jaruga (Jun Aruga) 9 months ago

Good call :)
For Arm CI, let's try it when we have a chance, to unify what tools we use.

Yes, first let's try it.

My original interest to Drone CI mainly comes from the Solaris environment capability. Because of [Misc #15347], having GitHub-integrated CI of Oracle Developer Studio 12.x would be helpful despite the RubyCI existence, compared to other missing environments like FreeBSD / OpenBSD / NetBSD (which might be still worth consideration if we decided to use Drone CI for a long term).

Comparing the maintenance cost and benefits, if Arm on Travis works fine and we cannot easily use Oracle Developer Studio 12 on Drone CI, maybe we should drop Drone CI, given that most of their coverage is well-maintained in RubyCI anyway.

Sure, agree. First let's try ARM 64/32-bit cases on Travis. After it works, let's drop Drone CI.

Updated by jaruga (Jun Aruga) 9 months ago

I sent a PR: https://github.com/ruby/ruby/pull/2559 adding a native arm64 environment on Travis.

I also tried to add an arm32 case to Travis by running gcc-multilib on the native arm64, like the "i686-linux" case. But so far, I have not succeeded.

I am still stuck on that, so I would just like to share some information.

It seems that gcc-8-multilib:armhf is in ubuntu-toolchain-r-test repository.
https://launchpad.net/~ubuntu-toolchain-r/+archive/ubuntu/test/+packages?field.name_filter=gcc-8&field.status_filter=published&field.series_filter=

There is gcc-multilib package on xenial, but there is no arm64 package for that in the standard repository.
https://packages.ubuntu.com/xenial/gcc-multilib

Following document was helpful for me.
https://askubuntu.com/questions/928249/how-to-run-armhf-executables-on-an-arm64-system

Updated by jaruga (Jun Aruga) 8 months ago

I would like to share my investigation of the issue below (a random error) at "make install" in the arm64 environment.

https://travis-ci.org/ruby/ruby/jobs/606916890#L2412

cp: cannot create regular file '../../.ext/common/json.rb': Permission denied
...
No output has been received in the last 10m0s, this potentially indicates a stalled build or something wrong with the build itself.

Check the details on how to adjust your build configuration on: https://docs.travis-ci.com/user/common-build-problems/#build-times-out-because-no-output-was-received

I sent a pull-request that I do not intend to be merged, but I think it can be a clue to fix the issue.
https://github.com/ruby/ruby/pull/2642

First, the "No output ..." issue comes from the "exit 1" message in the result of Travis CI. We can replace it with the "false" command, for example, because "exit" does not work on arm64 according to the following page.
https://travis-ci.community/t/exit-0-cannot-exit-successfully-on-arm/5731

Second, cp: cannot create regular file '../../.ext/common/json.rb': Permission denied.

I have not found the root cause of the issue. But I found some things.

Comparing the results of the id command, there is a difference: the travis user's user ID and group ID are different.

On x86_64

$ id
uid=2000(travis) gid=2000(travis) groups=2000(travis),999(docker)

On arm64

$ id
uid=1000(travis) gid=1000(travis) groups=1000(travis),111(docker)

And, seeing the actual error,
https://travis-ci.org/junaruga/ruby/jobs/607138014#L2421

After failing to copy, .ext/common/fiddle.rb exists at the destination path, right? I do not know why. But I think it is a clue for the issue.

cp: cannot create regular file '../../.ext/common/fiddle.rb': Permission denied
...
{:mtime=>1572874916.7814617, :ctime=>1572874916.7814617, :mode=>"100444", :uid=>1000, :gid=>1000, :path=>".ext/common/fiddle.rb"}
...
{:mtime=>1572874920.841432, :ctime=>1572874920.841432, :mode=>"40755", :uid=>1000, :gid=>1000, :path=>".ext/common"}
...

And I think make COPY='cp -f' install in .travis.yml might be wrong. It is make CP='cp -f', isn't it?

Updated by jaruga (Jun Aruga) 8 months ago

I sent a pull-request to fix following failure on Travis arm64.
https://github.com/ruby/ruby/pull/2653

1)
Process.spawn joins the specified process group if pgroup: pgid FAILED
Expected (STDOUT): "0"
          but got: "34876"

The reason is that the value of Process.getpgid(Process.pid) in the spec/ruby/core/process/spawn_spec.rb file is 0 in the environment.

https://ruby-doc.org/core-2.7.0.preview2/Process.html#method-c-getpgid
The document says "Returns the process group ID for the given process id. Not available on all platforms.".

The process group ID (the 5th field of /proc/[pid]/stat) is 0 in the Travis arm64 environment.
I do not know whether this counts as "Not available".

Here is the result I captured on my local (x86_64) environment.

$ cat /proc/4543/stat
4543 (ruby) S 4525 4525 1384 34818 4525 4194304 37443 1754841 0 0 366 105 2291 391 20 0 3 0 1381328 1428127744 11475 18446744073709551615 94195983785984 94195986670225 140728933833312 0 0 0 0 0 1107394127 0 0 0 17 2 0 0 1 0 0 94195987686512 94195987708942 94196017770496 140728933835483 140728933835595 140728933835595 140728933842904 0

Here is the result I captured on Travis arm64 environment. You see the 5th field value is 0.

$ cat /proc/19179/stat
19179 (ruby) S 19160 0 0 0 -1 4194560 37618 1710547 313 163 770 665 5206 1439 20 0 2 0 17529566 1196347392 10319 18446744073709551615 187650811428864 187650815023116 281474602721280 0 0 0 0 4096 1107390031 0 0 0 17 22 0 0 0 0 0 187650815091456 187650815114064 187651414974464 281474602725080 281474602725211 281474602725211 281474602729420 0
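
To make the comparison of the 5th field less error-prone, here is a small Ruby sketch that extracts it from a stat line. The helper name is hypothetical, and the two sample lines are shortened copies of the captures above.

```ruby
# Extract the process group ID (5th field) from one line of /proc/[pid]/stat.
# The 2nd field (comm) is wrapped in parentheses and may itself contain
# spaces, so cut at the LAST ")" before splitting on whitespace.
def pgrp_from_stat(stat_line)
  rest = stat_line[(stat_line.rindex(")") + 1)..-1]
  # rest starts with: state ppid pgrp session ...
  _state, _ppid, pgrp = rest.split
  Integer(pgrp)
end

X86_64_STAT = "4543 (ruby) S 4525 4525 1384 34818 4525 4194304"
ARM64_STAT  = "19179 (ruby) S 19160 0 0 0 -1 4194560"

puts pgrp_from_stat(X86_64_STAT)
puts pgrp_from_stat(ARM64_STAT)
```

On a live system the input would be File.read("/proc/#{Process.pid}/stat"), and the result could be compared with Process.getpgid(Process.pid).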

Updated by jaruga (Jun Aruga) 8 months ago

Shall we add debug code to investigate a random "make install" error on arm64 to .travis.yml?
At least we can remove current existing debugging code for Mac OSX after "make install" in .travis.yml.

This pull-request is the suggestion.
https://github.com/ruby/ruby/pull/2642

Updated by jaruga (Jun Aruga) 8 months ago

As another issue, there are 2 test failures in spec/ruby/core/process/clock_getres_spec.rb on Drone CI arm32 case.

https://cloud.drone.io/ruby/ruby/619/1/2

1)
Process.clock_getres matches the clock in practice for Process::CLOCK_PROCESS_CPUTIME_ID FAILED
Expected 10000 == 10000
to be falsy but was true
/drone/src/spec/ruby/core/process/clock_getres_spec.rb:30:in `block (4 levels) in <top (required)>'
/drone/src/spec/ruby/core/process/clock_getres_spec.rb:4:in `<top (required)>'

2)
Process.clock_getres matches the clock in practice for Process::CLOCK_THREAD_CPUTIME_ID FAILED
Expected 10000 == 10000
to be falsy but was true
/drone/src/spec/ruby/core/process/clock_getres_spec.rb:30:in `block (4 levels) in <top (required)>'
/drone/src/spec/ruby/core/process/clock_getres_spec.rb:4:in `<top (required)>'

The failed test is here.

https://github.com/ruby/ruby/blob/master/spec/ruby/core/process/clock_getres_spec.rb#L27-L30

# The clock should not be less accurate than reported (times should
# not all be a multiple of the next precision up, assuming precisions
# are multiples of ten.)
times.select { |t| t % (reported * 10) == 0 }.size.should_not == times.size

Does anyone know what "times should not all be a multiple of the next precision up, assuming precisions are multiples of ten." means?

I ask because, after adding a debug log and comparing the result on my local machine with the one from the Drone CI arm32 environment, each value of times is a "multiple of ten" on the arm32 environment.

https://github.com/junaruga/ruby/blob/hotfix/drone-ci-arm32/spec/ruby/core/process/clock_getres_spec.rb#L16

On my local x86_64

$ make -s test-spec MSPECOPT=-ff
...
"[DEBUG] name: CLOCK_PROCESS_CPUTIME_ID, value: 2, reported: 1, times0: 3639283628, times1: 3639287782"
...
"[DEBUG] name: CLOCK_THREAD_CPUTIME_ID, value: 3, reported: 1, times0: 3188148716, times1: 3188150159"
...

On Drone CI arm32.

https://cloud.drone.io/junaruga/ruby/18

...
"[DEBUG] name: CLOCK_PROCESS_CPUTIME_ID, value: 2, reported: 1, times0: 9551962040, times1: 9551966600"
  => "multiples of ten"
...
"[DEBUG] name: CLOCK_THREAD_CPUTIME_ID, value: 3, reported: 1, times0: 8280880000, times1: 8280884060"
  => "multiples of ten"
...
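
One reading of the spec comment, sketched in Ruby: if the clock reports a resolution of `reported`, then its samples should not all be multiples of `reported * 10`, because that would suggest the real resolution is at least ten times coarser. The helper below is a simplified stand-in for the spec's check, not the spec itself, and it uses the values from the debug logs above.

```ruby
# True when every sample is a multiple of the next precision up
# (reported * 10), i.e. the clock looks coarser than it reports.
def coarser_than_reported?(times, reported)
  times.all? { |t| (t % (reported * 10)).zero? }
end

REPORTED     = 1 # as printed in the debug log
X86_64_TIMES = [3639283628, 3639287782]
ARM32_TIMES  = [9551962040, 9551966600]

puts coarser_than_reported?(X86_64_TIMES, REPORTED)
puts coarser_than_reported?(ARM32_TIMES, REPORTED)
```

With only two samples per clock this is just an illustration; the spec makes the same comparison over all collected times.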

Thanks.

References for CLOCK_PROCESS_CPUTIME_ID and CLOCK_THREAD_CPUTIME_ID.

Updated by Eregon (Benoit Daloze) 8 months ago

About the clock_getres() specs: I guess they fail because, on Drone CI, clock_getres(CLOCK_PROCESS_CPUTIME_ID) returns 1ns, while the CLOCK_PROCESS_CPUTIME_ID results all seem to be in units of 10ns (the last digit is always 0).

Probably we can just ignore that test on ARM, in spec/ruby/core/process/fixtures/clocks.rb.
In fact there is already:

    # These clocks in practice on ARM on Linux do not seem to match their reported resolution.
    platform_is :armv7, :aarch64 do
      clocks = clocks.reject { |clock, value|
        [:CLOCK_PROCESS_CPUTIME_ID, :CLOCK_THREAD_CPUTIME_ID, :CLOCK_MONOTONIC_RAW].include?(clock)
      }
    end

What's the value of RUBY_PLATFORM there?

Updated by Eregon (Benoit Daloze) 8 months ago

RUBY_PLATFORM is armv8l-linux-eabi there, I'll adapt the platform_is guard.
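
A small Ruby sketch of why the existing guard does not apply here, assuming the guard matches its names as substrings of RUBY_PLATFORM (that matching rule is an assumption, and `platform_matches?` is a hypothetical helper, not part of mspec):

```ruby
# Hypothetical helper: true when the platform string contains any of the
# given names. This only approximates a platform guard.
def platform_matches?(platform, *names)
  names.any? { |name| platform.include?(name.to_s) }
end

DRONE_ARM32_PLATFORM = "armv8l-linux-eabi"

puts platform_matches?(DRONE_ARM32_PLATFORM, :armv7, :aarch64)
puts platform_matches?(DRONE_ARM32_PLATFORM, :armv7, :armv8, :aarch64)
```

The extra :armv8 name is only one possible adaptation and is not necessarily what the actual fix uses.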

Updated by Eregon (Benoit Daloze) 8 months ago

Fixed in 40e161a61238625e1ef021311759b2159be5b50a

Updated by jaruga (Jun Aruga) 8 months ago

Ah I see. Thank you for your help, Eregon!

Updated by jaruga (Jun Aruga) 8 months ago

jaruga (Jun Aruga) wrote:

Shall we add debug code to investigate a random "make install" error on arm64 to .travis.yml?
At least we can remove current existing debugging code for Mac OSX after "make install" in .travis.yml.

This pull-request is the suggestion.
https://github.com/ruby/ruby/pull/2642

For the permission issue cp: cannot create regular file '../../.ext/common/json.rb': Permission denied in make install, I found a clue.

I ran the following commands on both Travis x86_64 and arm64.
The issue happens in the ext/*/Makefile files. The COPY and V arguments are used in those Makefiles.

ls -l --full-time -t $(pwd)/.ext/.timestamp/.RUBYCOMMONDIR.time $(pwd)/.ext/common/*.rb || true
make --debug --trace V=1 COPY='pwd; (ls -l --full-time -t ../../.ext/.timestamp/.RUBYCOMMONDIR.time ../../.ext/common/*.rb || true); cp -v' install

Then here is the result in Travis x86_64.
The .ext/.timestamp/.RUBYCOMMONDIR.time file has to be older than the .ext/common/*.rb files.
That is because, when the .RUBYCOMMONDIR.time file is newer than a .ext/common/foo.rb file, COPY (= cp by default) from ext/foo/lib/foo.rb to .ext/common/foo.rb is executed in ext/foo/Makefile. But since the destination .ext/common/foo.rb is read-only (mode 444, -r--r--r--), the copy fails. With COPY='cp -f', it succeeds.

$ ls -l --full-time -t $(pwd)/.ext/.timestamp/.RUBYCOMMONDIR.time $(pwd)/.ext/common/*.rb
Thu Nov  7 19:56:50 UTC 2019
-r--r--r-- 1 travis travis  2894 2019-11-07 19:56:48.937972000 +0000 /home/travis/build/junaruga/ruby/build/.ext/common/digest.rb
-r--r--r-- 1 travis travis 16562 2019-11-07 19:56:48.901990000 +0000 /home/travis/build/junaruga/ruby/build/.ext/common/pathname.rb
-r--r--r-- 1 travis travis  2494 2019-11-07 19:56:46.523179999 +0000 /home/travis/build/junaruga/ruby/build/.ext/common/ripper.rb
-r--r--r-- 1 travis travis 44702 2019-11-07 19:56:46.399242000 +0000 /home/travis/build/junaruga/ruby/build/.ext/common/socket.rb
-r--r--r-- 1 travis travis  1722 2019-11-07 19:56:46.323280000 +0000 /home/travis/build/junaruga/ruby/build/.ext/common/fiddle.rb
-r--r--r-- 1 travis travis  6706 2019-11-07 19:56:46.095393999 +0000 /home/travis/build/junaruga/ruby/build/.ext/common/monitor.rb
-r--r--r-- 1 travis travis  1036 2019-11-07 19:56:46.027427999 +0000 /home/travis/build/junaruga/ruby/build/.ext/common/date.rb
-r--r--r-- 1 travis travis    24 2019-11-07 19:56:45.807538000 +0000 /home/travis/build/junaruga/ruby/build/.ext/common/bigdecimal.rb
-r--r--r-- 1 travis travis   368 2019-11-07 19:56:45.507688000 +0000 /home/travis/build/junaruga/ruby/build/.ext/common/coverage.rb
-r--r--r-- 1 travis travis  1809 2019-11-07 19:56:45.491696000 +0000 /home/travis/build/junaruga/ruby/build/.ext/common/json.rb
-r--r--r-- 1 travis travis   469 2019-11-07 19:56:45.247817999 +0000 /home/travis/build/junaruga/ruby/build/.ext/common/openssl.rb
-r--r--r-- 1 travis travis  5861 2019-11-07 19:56:44.947967999 +0000 /home/travis/build/junaruga/ruby/build/.ext/common/kconv.rb
-r--r--r-- 1 travis travis 21533 2019-11-07 19:56:44.947967999 +0000 /home/travis/build/junaruga/ruby/build/.ext/common/psych.rb
-r--r--r-- 1 travis travis  2217 2019-11-07 19:56:44.452216000 +0000 /home/travis/build/junaruga/ruby/build/.ext/common/expect.rb
-rw-rw-r-- 1 travis travis     0 2019-11-07 19:56:44.364259999 +0000 /home/travis/build/junaruga/ruby/build/.ext/.timestamp/.RUBYCOMMONDIR.time

In Travis arm64,

Some .ext/common/*.rb files are newer than .ext/.timestamp/.RUBYCOMMONDIR.time file.
That is the reason of the error.
But the problem is why some .ext/common/*.rb files are newer than .ext/.timestamp/.RUBYCOMMONDIR.time file.
This situation does not happen in the 1st CI build, triggered by a commit changing .travis.yml. It happens from the 2nd CI build on, triggered by a commit that does not change .travis.yml.

$ ls -l --full-time -t $(pwd)/.ext/.timestamp/.RUBYCOMMONDIR.time $(pwd)/.ext/common/*.rb
Thu Nov  7 19:59:01 UTC 2019
-r--r--r-- 1 travis travis  5861 2019-11-07 19:58:46.029993853 +0000 /home/travis/build/junaruga/ruby/build/.ext/common/kconv.rb
-r--r--r-- 1 travis travis 16562 2019-11-07 19:58:45.737996933 +0000 /home/travis/build/junaruga/ruby/build/.ext/common/pathname.rb
-r--r--r-- 1 travis travis   469 2019-11-07 19:58:44.746007400 +0000 /home/travis/build/junaruga/ruby/build/.ext/common/openssl.rb
-r--r--r-- 1 travis travis 44702 2019-11-07 19:58:43.606019428 +0000 /home/travis/build/junaruga/ruby/build/.ext/common/socket.rb
-r--r--r-- 1 travis travis  2494 2019-11-07 19:58:43.530020230 +0000 /home/travis/build/junaruga/ruby/build/.ext/common/ripper.rb
-r--r--r-- 1 travis travis 21533 2019-11-07 19:58:43.470020863 +0000 /home/travis/build/junaruga/ruby/build/.ext/common/psych.rb
-r--r--r-- 1 travis travis  1036 2019-11-07 19:58:43.294022720 +0000 /home/travis/build/junaruga/ruby/build/.ext/common/date.rb
-r--r--r-- 1 travis travis  1722 2019-11-07 19:58:42.994025885 +0000 /home/travis/build/junaruga/ruby/build/.ext/common/fiddle.rb
-r--r--r-- 1 travis travis    24 2019-11-07 19:58:42.422031920 +0000 /home/travis/build/junaruga/ruby/build/.ext/common/bigdecimal.rb
-r--r--r-- 1 travis travis  2894 2019-11-07 19:58:42.370032469 +0000 /home/travis/build/junaruga/ruby/build/.ext/common/digest.rb
-rw-r--r-- 1 travis travis     0 2019-11-07 19:58:42.326032933 +0000 /home/travis/build/junaruga/ruby/build/.ext/.timestamp/.RUBYCOMMONDIR.time
-r--r--r-- 1 travis travis  2217 2019-11-07 19:58:42.174034537 +0000 /home/travis/build/junaruga/ruby/build/.ext/common/expect.rb
-r--r--r-- 1 travis travis  6706 2019-11-07 19:58:42.130035001 +0000 /home/travis/build/junaruga/ruby/build/.ext/common/monitor.rb
-r--r--r-- 1 travis travis   368 2019-11-07 19:58:41.638040192 +0000 /home/travis/build/junaruga/ruby/build/.ext/common/coverage.rb
-r--r--r-- 1 travis travis  1809 2019-11-07 19:58:41.266044117 +0000 /home/travis/build/junaruga/ruby/build/.ext/common/json.rb

Updated by jaruga (Jun Aruga) 8 months ago

Some .ext/common/*.rb files are newer than .ext/.timestamp/.RUBYCOMMONDIR.time file.
That is the reason of the error.

Sorry, the above is a typo.

To be correct, some .ext/common/*.rb files can be "older" than the .ext/.timestamp/.RUBYCOMMONDIR.time file. That is the reason for the error.

I investigated this further, and I may have found the root cause of the error.

First, the .ext/common/*.rb and .ext/.timestamp/.RUBYCOMMONDIR.time files are created by the command $SETARCH make -s $JOBS.

I checked the behavior with $SETARCH make --debug --trace V=1 $JOBS. It runs exts.mk (maybe generated from template/exts.mk.tmpl?).
And exts.mk builds the extensions through ext/*/Makefile in parallel with the given $JOBS. The default value is -j3 on Travis x86_64 and -j33 on Travis arm64. Yes, it is "-j33" on arm64.

exts.mk

extensions = ext/psych/. ext/monitor/. ext/rbconfig/sizeof/. ext/pty/. \
         ext/objspace/. ext/etc/. ext/openssl/. ext/ripper/. \
...
all static: ruby
...
ruby: $(extensions:/.=/all)

Each of the following Makefiles is executed in parallel without an exclusive lock.
As a result, it seems the touch .RUBYCOMMONDIR.time command is executed 2 or more times.

ext/*/Makefile

$(TIMESTAMP_DIR)/.RUBYCOMMONDIR.time:
    $(Q) $(MAKEDIRS) $(@D) $(RUBYLIBDIR)
    $(Q) $(TOUCH) $@

Following steps can happen.

  1. touch .ext/.timestamp/.RUBYCOMMONDIR.time in ext/foo/Makefile
  2. Create .ext/common/foo.rb in ext/foo/Makefile
  3. touch .ext/.timestamp/.RUBYCOMMONDIR.time in ext/bar/Makefile
  4. Create .ext/common/bar.rb in ext/bar/Makefile

As a result, some .ext/common/*.rb files (In this case .ext/common/foo.rb) are older than .ext/.timestamp/.RUBYCOMMONDIR.time file. That is the reason of the error.
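
The four steps can be replayed in a small Ruby sketch with explicit mtimes. The file names follow the foo/bar example above, and the one-second spacing and the temporary directory are assumptions made only to keep the result deterministic.

```ruby
require "fileutils"
require "tmpdir"

# Returns the files whose mtime is older than the timestamp file's mtime.
# For make, such a target is out of date, so its copy rule runs again.
def older_than_stamp(stamp, files)
  stamp_mtime = File.mtime(stamp)
  files.select { |f| File.mtime(f) < stamp_mtime }
end

# Replays steps 1-4 and returns the base names of the files that end up
# older than the timestamp file.
def replay_race
  Dir.mktmpdir do |dir|
    stamp = File.join(dir, ".RUBYCOMMONDIR.time")
    foo   = File.join(dir, "foo.rb")
    bar   = File.join(dir, "bar.rb")
    t0    = Time.at(1_573_156_720)

    FileUtils.touch(stamp, mtime: t0)      # 1. touch from ext/foo/Makefile
    FileUtils.touch(foo,   mtime: t0 + 1)  # 2. create foo.rb
    FileUtils.touch(stamp, mtime: t0 + 2)  # 3. touch again from ext/bar/Makefile
    FileUtils.touch(bar,   mtime: t0 + 3)  # 4. create bar.rb

    older_than_stamp(stamp, [foo, bar]).map { |f| File.basename(f) }
  end
end

p replay_race
```

Only foo.rb is reported, which matches the case where a later make install tries to copy over the read-only foo.rb again.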

Possible solution

  1. Add an exclusive lock logic to ext/*/Makefile $(TIMESTAMP_DIR)/.RUBYCOMMONDIR.time task.
  2. Run ruby: $(extensions:/.=/all) in exts.mk without parallel. But the performance is slower.
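
A minimal sketch of idea 1, written in Ruby rather than in make, only to show the intended behavior: take an exclusive lock, and touch the timestamp file only if it does not exist yet. The lock file name and both helper names are hypothetical, and this is not claimed to be how any pull request implements it.

```ruby
require "fileutils"
require "tmpdir"

# Touches `stamp` at most once, even when several processes call this at the
# same time. `lock_path` is a hypothetical lock file used only for flock.
def touch_once(stamp, lock_path)
  File.open(lock_path, File::RDWR | File::CREAT, 0o644) do |lock|
    lock.flock(File::LOCK_EX)            # blocks until no one else holds it
    FileUtils.touch(stamp) unless File.exist?(stamp)
  end                                    # closing the file releases the lock
end

# Usage check: a second call must leave the existing mtime untouched.
def stamp_preserved?
  Dir.mktmpdir do |dir|
    stamp = File.join(dir, ".RUBYCOMMONDIR.time")
    lock  = File.join(dir, ".RUBYCOMMONDIR.lock")
    touch_once(stamp, lock)
    old = Time.at(1_573_156_720)
    File.utime(old, old, stamp)          # pretend the first touch was long ago
    touch_once(stamp, lock)
    File.mtime(stamp) == old
  end
end

puts stamp_preserved?
```

With the timestamp never refreshed after the first touch, every .ext/common/*.rb created later stays newer than it, so the copy rule is not triggered again.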

Right now I am working to run with JOBS= (empty) for Travis arm64 case here.
https://github.com/ruby/ruby/pull/2642

What do you think? I think solution "1." is better, if we can implement it.

Updated by jaruga (Jun Aruga) 8 months ago

I sent a PR.
https://github.com/ruby/ruby/pull/2669

This fixes the root cause of the arm64 error. Maybe. I tested it 3 times on my forked repository's Travis, and it was okay.

Updated by jaruga (Jun Aruga) 8 months ago

I sent a PR to add arm32 case on Travis CI now.
https://github.com/ruby/ruby/pull/2673

Let me explain the summary.

First, the gcc-8-multilib deb package does not exist in the xenial (and bionic) arm64 toolchain repository.

Ref:

$ wget http://ppa.launchpad.net/ubuntu-toolchain-r/test/ubuntu/dists/xenial/main/binary-arm64/Packages.gz
$ gunzip Packages.gz

So, I tried to create an ARM 32-bit case on ARM 64-bit without using that toolchain repository.
I think it is fine for the initial ARM 32-bit test case.

I used this deb package.
https://packages.ubuntu.com/xenial/crossbuild-essential-armhf

As a note, there is no "-m32" flag for aarch64 gcc.
http://gcc.gnu.org/onlinedocs/gcc/AArch64-Options.html

In Travis x86_64 environment, here is the initial setting for architectures.

$ dpkg --print-architecture
amd64

$ dpkg --print-foreign-architectures
i386

In Travis arm64 environment, here is the initial setting for architectures.

$ dpkg --print-architecture
aarch64

$ dpkg --print-foreign-architectures
  => empty

So, I needed to run sudo dpkg --add-architecture armhf manually.

Here are the available setarch values in Travis arm64.

$ setarch --list
uname26
linux32
linux64
aarch64

The uname output shows armv8l. It looks good.

$ $SETARCH uname -a
Switching on ADDR_LIMIT_32BIT.
Linux travis-job-junaruga-ruby-610971033 5.3.0-19-generic #20~18.04.2-Ubuntu SMP Tue Oct 22 18:10:05 UTC 2019 armv8l armv8l armv8l GNU/Linux

long: 4 bytes and void*: 4 bytes look correct.
I am not sure whether off_t: 8 bytes is correct, because I saw off_t: 4 bytes in the arm32v7/ubuntu container environment with QEMU emulation.

$ $SETARCH ../configure -C --disable-install-doc --prefix=$RUBY_PREFIX $CONFIG_FLAG
...
checking size of long... 4
...
checking size of off_t... 8
checking size of void*... 4
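
The same sizes can also be checked from the built Ruby itself. The sketch below reads them from RbConfig::SIZEOF and cross-checks two of them with Array#pack. The helper names are hypothetical, and it is an assumption that the "off_t" entry is present in every build, so a missing entry simply shows up as nil.

```ruby
require "rbconfig/sizeof"

# Sizes (in bytes) of a few C types as seen by the running Ruby.
# RbConfig::SIZEOF maps C type names to their sizes; a type that is not
# recorded comes back as nil.
def c_type_sizes
  %w[long void* off_t].to_h { |type| [type, RbConfig::SIZEOF[type]] }
end

# Cross-check two of them without RbConfig: "l!" packs a native long and
# "p" packs a pointer.
def packed_sizes
  { "long" => [0].pack("l!").bytesize, "void*" => [nil].pack("p").bytesize }
end

p c_type_sizes
p packed_sizes
```

An 8-byte off_t on a 32-bit build is what one would expect when large-file support (_FILE_OFFSET_BITS=64) is enabled, though whether that explains the difference from the QEMU container is not verified here.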

There is one issue on the Travis arm32 environment. I added it as allow_failures.

$ travis_wait 50 $SETARCH make -s test-all -o exts TESTOPTS="${TESTOPTS} ${TEST_ALL_OPTS}" RUBYOPT="-w"
...
/home/travis/build/junaruga/ruby/build/.ext/common/fiddle/import.rb:299:in `import_function': cannot find the function: strcpy() (Fiddle::DLError)
    /home/travis/build/junaruga/ruby/build/.ext/common/fiddle/import.rb:172:in `extern'
    /home/travis/build/junaruga/ruby/test/fiddle/test_import.rb:17:in `<module:LIBC>'
    /home/travis/build/junaruga/ruby/test/fiddle/test_import.rb:10:in `<module:Fiddle>'
    /home/travis/build/junaruga/ruby/test/fiddle/test_import.rb:9:in `<top (required)>'
    /home/travis/build/junaruga/ruby/lib/rubygems/core_ext/kernel_require.rb:92:in `require'
    /home/travis/build/junaruga/ruby/lib/rubygems/core_ext/kernel_require.rb:92:in `require'
    /home/travis/build/junaruga/ruby/tool/lib/test/unit/parallel.rb:121:in `run'
    /home/travis/build/junaruga/ruby/tool/lib/test/unit/parallel.rb:208:in `<main>'
running file: /home/travis/build/junaruga/ruby/test/fiddle/test_import.rb
...

Updated by jaruga (Jun Aruga) 8 months ago

There is one issue on the Travis arm32 environment. I added it as allow_failures.

I sent a PR to fix the issue on the Travis arm32 environment.
https://github.com/ruby/ruby/pull/2686

Let me explain the issue and why it happens.
The issue happens at the following line.

ruby/test/fiddle/test_import.rb:17:in `<module:LIBC>'

...
begin
  require_relative 'helper'
  require 'fiddle/import'
rescue LoadError
end
...
    extern "void *strcpy(char*, char*)" <= This line.

In helper.rb, Fiddle::LIBC_SO and Fiddle::LIBM_SO were still nil in the Travis arm32 case.

test/fiddle/helper.rb

Fiddle::LIBC_SO = libc_so
Fiddle::LIBM_SO = libm_so
...
    def setup
      @libc = Fiddle.dlopen(LIBC_SO)
      @libm = Fiddle.dlopen(LIBM_SO)
    end 
...

That is because, according to the following page with libc deb package information for each CPU architecture in Ubuntu, for i686 (Intel 32-bit) the package installs into /lib32, but for armhf (ARM 32-bit) it installs only into /lib/arm-linux-gnueabihf, not into /lib32.

Ubuntu: libc deb packages information for each CPU architecture
https://packages.ubuntu.com/search?mode=filename&suite=xenial&section=all&arch=any&keywords=libc.so.6&searchon=contents

Also, ldd #{ruby} prints a "not a dynamic executable" message instead of the shared library dependencies.

$ ldd ruby
not a dynamic executable

$ setarch linux32 --verbose --32bit ldd ruby
Switching on ADDR_LIMIT_32BIT.
not a dynamic executable

That is because the ldd command can print this message when the checked binary is 32-bit and the host is 64-bit. In the Travis arm32 case, #{ruby} is a 32-bit binary and the host is 64-bit.

Ref: https://unix.stackexchange.com/questions/75054/ldd-tells-me-my-app-is-not-a-dynamic-executable

So, libc_so and libm_so are not set in this step either.

I also checked the Drone CI arm32 environment (Debian) for this issue.
In the Drone arm32 environment, the ldd command succeeds, because both the host and the checked binary are 32-bit.

We can check the libc document here.

Debian: libc deb packages information for each CPU architecture
https://packages.debian.org/search?searchon=contents&keywords=libc.so.6

So, summarizing it.

  • Travis i686 case (host: x86_64)

    • OS: Ubuntu
    • RUBY_PLATFORM: i686-linux
    • Used libc: libc6:i386 (I have not checked it)
    • Installed directory: /lib32
  • Travis arm32 case (host: arm64)

    • OS: Ubuntu
    • RUBY_PLATFORM: armv8l-linux-eabihf
    • Used libc: libc6:armhf
    • Installed directory: /lib/arm-linux-gnueabihf
  • Drone arm32

    • OS: Debian
    • RUBY_PLATFORM: armv8l-linux-eabi
    • Used libc: libc6:armel
    • Installed directory: /lib/arm-linux-gnueabi

I think we can refactor test/fiddle/helper.rb later.

  • If Fiddle::LIBC_SO or Fiddle::LIBM_SO is still nil before Fiddle.dlopen is executed in helper.rb, can we raise an error or output a warning?
  • Can we check the command exit status for ldd #{ruby}?
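
Both ideas could look roughly like the Ruby sketch below. The helper names are hypothetical and do not exist in test/fiddle/helper.rb, and the sample ldd output is illustrative rather than captured from the CI logs.

```ruby
# Take ldd's output together with its exit status, and return the libc
# path, or nil when ldd failed or printed no libc line.
def libc_path_from_ldd(output, success)
  return nil unless success
  line = output.each_line.find { |l| l =~ /\blibc\.so\.\d+\s+=>\s+\S+/ }
  line && line[/=>\s+(\S+)/, 1]
end

# Raises instead of silently continuing with a nil library path.
def require_libc!(path)
  raise LoadError, "could not detect libc; set it explicitly" if path.nil?
  path
end

# Illustrative ldd output, not captured from the CI logs.
SAMPLE_OK = <<~OUT
  linux-vdso.so.1 (0x00007ffd0b5f2000)
  libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f1c2a000000)
  libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f1c29c00000)
OUT
SAMPLE_FAIL = "not a dynamic executable\n"

puts require_libc!(libc_path_from_ldd(SAMPLE_OK, true))
p libc_path_from_ldd(SAMPLE_FAIL, false)
```

In a real helper, the output and status would come from running ldd on the ruby binary and reading $?.success? afterwards.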
