Misc #10278

Updated by nobu (Nobuyoshi Nakada) over 9 years ago

Mainly posting this for documentation purposes, because it seems like
an obvious thing to try given we have ccan/list nowadays.

Having shorter code along with branchless insert/delete, and using a common
linked-list API, is very appealing.
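
To illustrate why that is appealing: insert and delete on a circular list
with a sentinel head never need NULL checks. Below is a minimal sketch of
that pattern in plain C; the names are illustrative only and are not the
actual ccan/list API.

~~~
/* Minimal sketch of a circular doubly-linked list with a sentinel head,
 * the same pattern ccan/list uses.  Names here are illustrative only. */
struct node {
    struct node *prev, *next;
};

/* An empty list has the head pointing at itself, so there is never a
 * NULL neighbor to special-case. */
static void node_init(struct node *head)
{
    head->prev = head->next = head;
}

/* Branchless insert after `pos` (e.g. the head for push-front). */
static void node_add(struct node *pos, struct node *n)
{
    n->next = pos->next;
    n->prev = pos;
    pos->next->prev = n;
    pos->next = n;
}

/* Branchless delete: the neighbors are always valid, even when removing
 * the last entry. */
static void node_del(struct node *n)
{
    n->prev->next = n->next;
    n->next->prev = n->prev;
}
~~~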

 On the other hand, benchmark results are a mixed bag: 

	 http://80x24.org/bmlog-20140922-032221.13002 

 Also, I may have introduced new bugs the tests didn't catch. 
 The st_foreach* functions get a bit strange when dealing with 
 packed-to-unpacked transitions while iterating. 

Great thing: bighash is faster (as expected) because of the branchless
linked-list insertion.  However, the major speedup in bighash probably
isn't too important; most hashes are small and users will never notice.

	 vm2_bighash* 	 1.222 

Also, we could introduce rb_hash_new_with_size() for use in insns.def
(newhash) if people really care about the static bighash case (I don't
think many do).
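
The st layer can already allocate a table with a size hint
(st_init_table_with_size() and friends), so such a constructor would just
pass the literal's element count through. A very rough, hypothetical sketch
of the shape it could take; the body is a placeholder, not the real hash.c
allocation path:

~~~
/* Hypothetical sketch: a pre-sized Hash constructor for newhash.
 * The st layer can already allocate with a size hint, e.g.
 *
 *     st_table *tbl = st_init_numtable_with_size(expected_entries);
 *
 * so rb_hash_new_with_size() would do the same for the Hash's own table
 * instead of letting it grow (and rehash) while the literal is filled.
 * The body below is a placeholder, not the real hash.c internals. */
VALUE
rb_hash_new_with_size(st_index_t size)
{
    VALUE hash = rb_hash_new();   /* ordinary empty Hash */
    (void)size;                   /* would be passed through to a pre-sized st_table */
    return hash;
}
~~~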

Real regressions: iteration gets slower because the loop conditions
are more complex :< (a sketch of the difference follows the numbers below)

	 hash_keys 	 0.978 
	 hash_values 	 0.941 

 However, hash_keys/values regressions are pretty small. 
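
For a rough idea of what "more complex loop conditions" means (illustrative
C, not the actual st.c iterators): walking a flat array is a simple counted
loop, while walking a sentinel list means chasing pointers and comparing
against the head on every step.

~~~
/* Illustrative only, not the actual st.c iterators. */
struct entry  { long key, value; };
struct lnode  { struct lnode *prev, *next; };
struct lentry { struct lnode node; long key, value; };

/* Array-style iteration: one induction variable, a trivial exit test,
 * and the next entry sits at a predictable address. */
static void each_packed(struct entry *entries, long num, void (*func)(long, long))
{
    for (long i = 0; i < num; i++)
        func(entries[i].key, entries[i].value);
}

/* Sentinel-list iteration: the exit test compares against the head and
 * every step is a dependent pointer load, which is harder on the CPU. */
static void each_listed(struct lnode *head, void (*func)(long, long))
{
    for (struct lnode *n = head->next; n != head; n = n->next) {
        struct lentry *e = (struct lentry *)n;  /* entry embeds its list node */
        func(e->key, e->value);
    }
}
~~~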

 Things that worry me: 

	 vm1_attr_ivar* 	 0.736 
	 vm1_attr_ivar_set* 	 0.845 

 WTF?    I reran the attr_ivar tests, and the numbers got slightly better: 

~~~
  ["vm1_attr_ivar", 
   [[1.851297842, 
     1.549076322, 
     1.623306027, 
     1.956916541, 
     1.533218607, 
     1.554089054, 
     1.702590516, 
     1.789863782, 
     1.711815018, 
     1.851260599], 
    [1.825423191, 
     1.824934062, 
     1.542471471, 
     1.868502091, 
     1.79106375, 
     1.884568825, 
     1.850712387, 
     1.797538962, 
     2.165696827, 
     1.866671482]]], 
  ["vm1_attr_ivar_set", 
   [[1.926496052, 
     2.04742421, 
     2.025571131, 
     2.047656291, 
     2.043747069, 
     2.099586827, 
     1.953769267, 
     2.017580504, 
     2.440432603, 
     2.111254634], 
    [2.365839125, 
     2.076282818, 
     2.112784977, 
     2.118754445, 
     2.091752673, 
     2.161164561, 
     2.107439445, 
     2.128147747, 
     2.945295069, 
     2.131679632]]]] 

 Elapsed time: 91.963235593 (sec) 
 ----------------------------------------------------------- 
 benchmark results: 
 minimum results in each 10 measurements. 
 Execution time (sec) 
 name 	 orig 	 stll 
 loop_whileloop 	 0.672 	 0.670 
 vm1_attr_ivar* 	 0.861 	 0.872 
 vm1_attr_ivar_set* 	 1.255 	 1.406 

 Speedup ratio: compare with the result of `orig' (greater is better) 
 name 	 stll 
 loop_whileloop 	 1.002 
 vm1_attr_ivar* 	 0.987 
 vm1_attr_ivar_set* 	 0.892 
~~~

Note: these tests do not even hit st, and even if they did, these are
tiny tables which are packed, so the linked-list implementation has no
impact (especially not on lookup tests).
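
For context, an illustrative sketch of the packed small-table idea (not the
real st.c code): the entries live in a flat array and lookup is a short
linear scan, so no bins and no list pointers are involved at all.

~~~
/* Illustrative sketch of a packed small table, not the real st.c layout:
 * entries live in a flat array and lookup is a short linear scan. */
struct packed_entry { unsigned long hash; long key, value; };

static int packed_lookup(const struct packed_entry *entries, long num,
                         unsigned long hash, long key, long *value)
{
    for (long i = 0; i < num; i++) {
        if (entries[i].hash == hash && entries[i].key == key) {
            *value = entries[i].value;
            return 1;   /* found */
        }
    }
    return 0;           /* not found */
}
~~~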

So yeah, it's probably something messy with the CPU caches.
I always benchmark with the performance CPU governor, and the
rerun ivar numbers were taken with the CPU pinned to a single core.
CPU: AMD FX-8320.  Maybe I can try my other systems later.
