Feature #21800
closed`Dir.scan` to list directory entires with their type
Added by byroot (Jean Boussier) 2 months ago. Updated 10 days ago.
Description
When listing a directory, it's very common to need to know the type of each children, generally because you want to scan recursively.
The naive way to do this is to call stat(2) for each children, but this is quite costly.
This use case is common enough that readdir on most modern platforms do expose struct dirent.d_type, which allows to know the type of the child without an extra syscall:
From the scandir manpage:
d_type: This field contains a value indicating the file type,
making it possible to avoid the expense of calling lstat(2)
I wrote a quick prototype, and relying on dirent.d_type instead of stat(2) allows to recursively scan Ruby's repository twice as fast on my machine: https://github.com/ruby/ruby/pull/15667
Given that recursively scanning directories is a common task across many popular ruby tools (zeitwerk, rubocop, etc), I think it would be very valuable to provide this more efficient interface.
In addition, @nobu (Nobuyoshi Nakada) noticed my prototype, and implemented a nicer version of it, where a File::Stat is yielded: https://github.com/ruby/ruby/commit/9acf67057b9bc6f855b2c37e41c1a2f91eae643a
In that case the File::Stat is lazy, it's only if you access something other than file type, that the actual stat(2) call is emitted.
I think this API is both more efficient and more convenient.
Proposed API¶
Dir.foreach(path) { |name| }
Dir.foreach(path) { |name, stat| }
Dir.each_child(path) { |name| }
Dir.each_child(path) { |name, stat| }
Dir.new(path).each_child { |name| }
Dir.new(path).each_child { |name, stat| }
Dir.new(path).each { |name| }
Dir.new(path).each { |name, stat| }
Also important to note, the File::Stat is expected to be equivalent to a lstat(2) call, as to be able to chose to follow symlinks or not.
Basic use case:
def count_ruby_files(root)
count = 0
queue = [root]
while dir = queue.pop
Dir.each_child(dir) do |name, stat|
next if name.start_with?(".")
if stat.directory?
queue << File.join(dir, name)
elsif stat.file?
count += 1 if name.end_with?(".rb")
end
end
end
count
end
Updated by byroot (Jean Boussier) 2 months ago
Actions
#1
- Related to Feature #17001: [Feature] Dir.scan to yield dirent for efficient and composable recursive directory scaning added
Updated by byroot (Jean Boussier) 2 months ago
Actions
#2
- Description updated (diff)
Updated by byroot (Jean Boussier) 2 months ago
Actions
#3
- Description updated (diff)
Updated by byroot (Jean Boussier) 2 months ago
Actions
#4
- Description updated (diff)
Updated by fxn (Xavier Noria) about 2 months ago
· Edited
Actions
#5
Hi there!
@byroot (Jean Boussier) knows this, but for anyone reading, Zeitwerk essentially scans the file system and sets autoloads for the expected constants.
This feature is going to be great. As in the general case described in the issue description, loaders scan their project trees, and perform an additional system call per directory entry to figure out their file type (mod details). Scanning is centralized in this ls helper. For consistency across platforms, the helper fetches all the entries with Dir.children first, and sorts them prior to yielding them to the caller.
With this, I would build a collection of pairs (entry, ftype) for sorting using one of those iterators.
This API would save lots of system calls.
Updated by matz (Yukihiro Matsumoto) about 1 month ago
Actions
#6
[ruby-core:124523]
The point is dirent.d_type is not defined in POSIX, so the API may be unreliable.
Matz.
Updated by byroot (Jean Boussier) about 1 month ago
Actions
#7
[ruby-core:124527]
@matz (Yukihiro Matsumoto) yes, but not we're not exposing dirent.d_type to users, but a lazily initialized File::Stat. This means that even on platform where d_type is missing, the Ruby level API is identical, just less performant.
Updated by mame (Yusuke Endoh) about 1 month ago
Actions
#8
[ruby-core:124533]
Discussed at the dev meeting, several points were raised.
- The lack of portability is concerning (as @matz (Yukihiro Matsumoto) already mentioned).
- The lazy
File::Statis concerning about the timing: the file type reflects information at the time of thereaddircall, while other information likemtimereflects the time#mtimemethod was called, which could lead to inconsistencies. - It's unclear whether the lazy
File::Statshould uselstatorstat. - Considering the above, it might be better to simply pass the information returned by readdir (i.e.,
f_type) instead ofFile::Stat. - Yielding
f_typerequires explicitly writing code to callFile.stat(orFile.lstat) forDT_UNKNOWNcases, which isn't very convenient (though it might be acceptable since this method isn't for casual use?). - About the API, checking the block's arity to yield differently is not great. It would be better to simply yield the
f_typeinformation in a separate method (such asDir.each_child_with_file_type(path) {|name, f_type| ... }) or when keyword arguments are provided (such asDir.each_child(path, with_file_type: true) {|name, f_type| ... }. - There are some options how to represent
f_type, including (1) using the raw integer (e.g., comparing against constants likeDir::DT_UNKNOWN), or (2) using a symbol like:UNKNOWN.
Updated by byroot (Jean Boussier) about 1 month ago
Actions
#9
[ruby-core:124534]
The lazy File::Stat
My draft was just yielding a symbol: https://github.com/ruby/ruby/pull/15667, @nobu (Nobuyoshi Nakada) did the File::Stat version, I thought it was elegant, and degraded better on platforms where d_type is not supported.
The lazy File::Stat is concerning about the timing
That doesn't concern me too much to be honest, given all file operation are subject to this sort of race conditions. But I understand why it concerns you.
Yielding f_type requires explicitly writing code to call File.stat (or File.lstat) for DT_UNKNOWN cases, which isn't very convenient (though it might be acceptable since this method isn't for casual use?).
There is also the option of Ruby doing the lstat call to translate DT_UNKNOWN into the actual type.
If we expose DT_UNKNOWN I'm worried users won't handle it correctly. From my understanding all the popular systems do support f_type, so it would be very hard to test for DT_UNKNOWN and would likely be broken.
It's unclear whether the lazy File::Stat should use lstat or stat.
In my opinion it should be lstat because for this sort of code, circular symlinks, and following symlinks in general, is a concern.
But I can be convinced otherwise.
Also if we go toward a new method, we can always make it a keyword argument.
About the API, checking the block's arity to yield differently is not great.
Agreed. I did it for simplicity and to avoid a discussion about naming, but I think it would be OK to make it a new method, e.g.
Dir.scan(path) { |name, type| }Dir.scan(path) [[name, type], ...]
here are some options how to represent f_type, including (1) using the raw integer (e.g., comparing against constants like Dir::DT_UNKNOWN), or (2) using a symbol like :UNKNOWN.
I don't have any strong opinion here.
We a precedent with File::Stat#ftype that returns a String, but I don't think that makes for a very usable API: https://docs.ruby-lang.org/en/master/File/Stat.html#method-i-ftype
On one hand, symbols are easier to discover when playing with the API in irb and such. On the other constants are easier to discover when searching the documentation and are typo-proof.
I may have a slight preference for constants.
Updated by Eregon (Benoit Daloze) about 1 month ago
· Edited
Actions
#10
[ruby-core:124557]
I think yielding Symbols are nicer, returning a plain Integer feels too low level IMO and people might misuse that e.g. DT_UNKNOWN is 0 on some platforms.
It's also more verbose and less clear (due to unclear C naming) for the Ruby caller to case type in File::DT_REG vs case type in :file.
I think we should match File::Stat#ftype for the naming for consistency.
byroot (Jean Boussier) wrote in #note-9:
degraded better on platforms where
d_typeis not supported.
I'm not following this, wouldn't it be the same if d_type is not supported or returns DT_UNKNOWN to then call lstat to find out, so the Ruby caller always gets the type?
I'm +1 on adding such functionality.
FWIW TruffleRuby already has an internal method to achieve the same thing and it's used to implement Dir.glob efficiently with Ruby code.
Updated by byroot (Jean Boussier) about 1 month ago
Actions
#11
[ruby-core:124559]
I think we should match File::Stat#ftype for the naming for consistency.
In principle in make sense, but in practice that other method returns strings, and weird ones at that:
Identifies the type of stat. The return string is one of:
“file”,“directory”,“characterSpecial”,“blockSpecial”,“fifo”,“link”,“socket”, or“unknown”.
If we yield symbols instead, it's already not super consistent, but also I think it would be weird to yield camel cased symbols like :characterSpecial or :blockSpecial.
I'm not following this, wouldn't it be the same if d_type is not supported or returns DT_UNKNOWN to then call lstat to find out, so the Ruby caller always gets the type?
That is what I'm suggesting. for a symlink the d_type will be :link, so in case d_type isn't supported or is UNKNOWN, we should use lstat to return the equivalent type.
Updated by Eregon (Benoit Daloze) about 1 month ago
· Edited
Actions
#12
[ruby-core:124560]
byroot (Jean Boussier) wrote in #note-11:
In principle in make sense, but in practice that other method returns strings, and weird ones at that:
Identifies the type of stat. The return string is one of:
“file”,“directory”,“characterSpecial”,“blockSpecial”,“fifo”,“link”,“socket”, or“unknown”.If we yield symbols instead, it's already not super consistent, but also I think it would be weird to yield camel cased symbols like
:characterSpecialor:blockSpecial.
I noticed as well but those are pretty "special" files so I think it's not a big deal to use camelCase there.
Even with these 2 camelCase names I do think they are much better and clearer than DT_ constants (DT_BLK, DT_CHR, DT_DIR, DT_FIFO, DT_LNK, DT_REG, DT_SOCK, DT_UNKNOWN).
We could do :character_special and :block_special, I think that's fine too.
My point is the DT_ constants are unreadable/cryptic so we shouldn't use that, and the names returned by File::Stat#ftype are much better (sorry, I'm repeating myself, just trying to be clear).
Updated by byroot (Jean Boussier) about 1 month ago
Actions
#13
[ruby-core:124561]
those are pretty "special" files so I think it's not a big deal to use camelCase there.
Yes, they're very unlikely to be used.
My point is the DT_ constants are unreadable/cryptic so we shouldn't use that,
I'm fine either way. My initial prototype was using symbols: https://github.com/ruby/ruby/pull/15667
Updated by Eregon (Benoit Daloze) 18 days ago
Actions
#14
[ruby-core:124679]
I think the spec in https://bugs.ruby-lang.org/issues/21839#note-3 is perfect and IMO good to go as-is:
- I propose:
*Dir.scan(path) { |entry_name, entry_type| }
*Dir.scan(path) # => [[entry_name, entry_type], ...]- The type is just a symbol, similar to
File::Stat#ftype- In case of
DT_UNKNOWN, Ruby issue alstatto obtain the real type (important for portability).
It's simple, doesn't add overhead or complexity in File::Stat and serves its purpose well.
We want to avoid extra syscalls for the case that struct dirent.d_type is available so we should not auto-resolve symlinks.
Also auto-resolving symlinks is a common mistake e.g. when traversing a directories hierarchy as it could go "back up" (and loop infinitely).
So lstat() is the simplest and safest choice there.
People can always resolve the symlink themselves if that's what they want, but typically they'd want to do some validation first anyway.
Updated by byroot (Jean Boussier) 11 days ago
Actions
#15
- Status changed from Open to Closed
Applied in changeset git|3b5ee7488c8064cd3b1afc41bc43afdb907fdf16.
Dir.scan: return or yield children along with their type
[Feature #21800]
There are numerous ruby tools that need to recursively scan
the project directory, such as Zeitwerk, rubocop, etc.
All of them end up listing childs of a directory then for each child
emit a stat call to check if it's a directory or not.
This is common enough for a pattern that on most operating
systems, struct dirent include a dtype member that allows to
check the file type without issuing a any extra system calls.
By yielding that type, we can make these routines twice as fast.
$ hyperfine './miniruby --disable-all --yjit ../test.rb' 'OPT=1 ./miniruby --disable-all --yjit ../test.rb'
Benchmark 1: ./miniruby --disable-all --yjit ../test.rb
Time (mean ± σ): 1.428 s ± 0.062 s [User: 0.342 s, System: 1.070 s]
Range (min … max): 1.396 s … 1.601 s 10 runs
Benchmark 2: OPT=1 ./miniruby --disable-all --yjit ../test.rb
Time (mean ± σ): 673.8 ms ± 5.8 ms [User: 146.0 ms, System: 527.3 ms]
Range (min … max): 659.7 ms … 679.6 ms 10 runs
Summary
OPT=1 ./miniruby --disable-all --yjit ../test.rb ran
2.12 ± 0.09 times faster than ./miniruby --disable-all --yjit ../test.rb
if ENV['OPT']
def count_ruby_files
count = 0
queue = [File.expand_path(__dir__)]
while dir = queue.pop
Dir.scan(dir) do |name, type|
next if name.start_with?(".")
case type
when :directory
queue << File.join(dir, name)
when :file
count += 1 if name.end_with?(".rb")
end
end
end
count
end
else
def count_ruby_files
count = 0
queue = [File.expand_path(__dir__)]
while dir = queue.pop
Dir.each_child(dir) do |name|
next if name.start_with?(".")
abspath = File.join(dir, name)
if File.directory?(abspath)
queue << abspath
else
count += 1 if name.end_with?(".rb")
end
end
end
count
end
end
10.times do
count_ruby_files
end
Updated by byroot (Jean Boussier) 11 days ago
Actions
#16
- Subject changed from `Dir.foreach` and `Dir.each_child` to optionally yield `File::Stat` object alongside the children name to `Dir.scan` to list directory entires with their type
Updated by Dan0042 (Daniel DeLorme) 10 days ago
· Edited
Actions
#17
[ruby-core:124815]
I like the feature, but it would be nice to provide an example that is not simply to reimplement Dir["**/*.rb"].size with 16x the lines of code. Something that demonstrates how it's more versatile than globbing.