Writing Performance Sensitive OCaml Code
1) OCaml Performance
The notes fall into the following categories:
- notes about the cost of commonly used OCaml constructs and
- general recommendations that might help speed up OCaml code
- notes on reading OCaml assembly code so that you may understand what your own OCaml programs do on the CPU.
When possible we have tried to motivate these with snippets of assembly code generated by ocamlopt.
As a caveat, performance optimizations are sometimes at odds with desirable programming styles such as preserving abstraction boundaries and higher order functional patterns. Hand inlining code across abstraction boundaries and avoiding closure allocations buy you some performance, however these also make code brittle and inflexible. One should do these sorts of optimizations sparingly and only in the heart of tight loops.
Possibly the most useful performance tuning tip is to identify the places where your program spends most time and see if there are high level algorithmic approaches and changes in representation of data that can speed up things. These high level approaches will usually buy you more than machine level optimizations. Resort to low level tuning described in this document only after you have tried the former.
1.1) Compiling
Most of the discussion below assumes a 64-bit platform. Assembly code
presented here is x86_64 asm generated by using ocamlopt with the
compiler flags -S
, -inline 20
and -nodynlink
on a 64-bit Linux
platform.
You can add the -S
flag, which instructs the compiler to generate
assembly code, by adding
OCAMLOPTFLAGS += -S
to your local OMakefile
.
It is best to generate assembly code with full inlining and other
optimizations that the compiler supports. Even though inlining and
optimization might make the code a bit harder to read, it will give
you a more accurate picture of what executes on the CPU. Profiling
tool like gprof
where the inlining and such only serves to confuse
the profiler.
2) Reading OCaml Assembly
This section introduces the minimal assembly instruction patterns that one must know to follow some of the cost arguments later. This is not a general introduction to x86 assembly.
2.1) Named functions are mangled but recognizable.
In the file called test_asm.ml the function f
let f a = a+1
The generated asm looks like the following:
camlTest_asm__f_33700:
.L119:
addq $2, %rax
ret
Note that the file/module name and the name of the function appear in the generated label. An anonymous function however will just have some numeric label making it difficult to locate.
2.2) Memory allocation
When OCaml allocates memory it decrements the “heap top” which is typically in %r15 followed by an overflow check. If there is an overflow it calls the GC and the allocation is retried.
The snippet below allocates 24 bytes (three 64-bit words).
.L121: subq $24, %r15
cmpq caml_young_limit(%rip), %r15
jb .L122
This is the code at label .L122 that has the call to the GC followed by the jump to retry allocation with the new %r15 and young limit value after garbage collection.
.L122: call caml_call_gc
.L123: jmp .L121
2.3) Boxed Floats and Immediate ints
OCaml uses IEEE 64-bit floats and are usually heap allocated (ie. boxed). The main reason for this is that OCaml needs to tag its values such that the GC can identify them at runtime. OCaml allocates floats on the heap using two words - the first word is a tag that says that the next is a float.
In asm the following snippet indicates a typical boxing operation where the actual float value is in the register %xmm0. Looking out for these when trying to track down boxed floats in your code.
- First allocate the 2 words.
.L120: subq $16, %r15
cmpq caml_young_limit(%rip), %r15
jb .L121
- Assign the tag value (the first word).
leaq 8(%r15), %rax
movq $1277, -8(%rax)
- Assign the actual value (the second word).
movlpd %xmm0, (%rax)-
NOTE: An important thing to note is that a pointer to a heap allocated value does not point to the tag word but the actual first value. This is why the “load-effective-address” (lea) instruction adds the offset of 8 bytes. This is true for all heap allocated values, not just floats.
Unlike floats, ints are represented as immediates, i.e., they are not allocated on the heap. To tag ints OCaml steals the LSB bit of an int and sets it to 1 always. This means that we get 63 bits of precision in OCaml ints and any int value n will appear in the assembly as 2*n+1 (the LSB is always 1 and the bits of the number shifted one place in).
2.4) Tuples
Tuples are heap allocated. A pair will cause the allocation of 3 words: one word to indicate that it is a pair and two words for the actual values.
If a tuple contains floats, the floats will be boxed. If one wishes to return a pair of floats, a record containing the two floats may be more efficient than using a tuple - more on this later.
2.5) Option types
Option types are heap allocated. Passing a value as an option will cause memory allocation. The value None is however immediate and represented as 1, the same as the representation of the int value 0.
2.6) Write barriers
A write barrier is a basically a call to caml_modify and looks like the following:
call caml_modify
This webpage
(pdf)
and
this one
(pdf)
give a reasonable overview of what caml_modify
does. Any introductory text on
garbage collection should give you some insight into why write
barriers exist.
2.7) Asserts are helpful in reading asm.
If you have a large function and wish to see how a particular part of the function translates into asm, a simple way to track down the fragment in assembly is to put asserts around it. This is just a hack, but in practice it saves you some time reading asm.
Sometimes OCaml tends to optimize away provably correct asserts - this is something to watch out for. In the generated asm, the presence of the assert is indicated by:
leaq caml_exn_Assert_failure(%rip), %rbx
Hence if you are looking for a snippet of code in function f, first find f and then look for the asserts that you put in it. A more general method is the line numbers methods below:
2.8) Approximating line numbers in OCaml assembly
Assembly generated by OCaml does not have source line numbers. Having line numbers is useful when comparing the generated asm back to OCaml source. This hack lets one simulate have source line numbers in generated assembly. This relies on the fact that OCaml does not inline recursive functions.
To start, we define a handy set of recursive definitions to act as line number markers:
let rec _line0 () = if false then _line0 () let rec _line1 () = if false then _line1 () let rec _line2 () = if false then _line2 () let rec _line3 () = if false then _line3 () let rec _line4 () = if false then _line4 ()
And we annotate a part of the source code that we would like to read:
_line0 (); config.script <- Some script; _line1 (); config.source_server <- server; _line2 (); config.source_path <- source_path; _line3 ();
When we compile this code with the -S
flag, this gives us:
movq $1, %rax
.loc 1 207
call camlFile_copier_test___line0_79835
.L224:
.L225: subq $16, %r15
cmpq caml_young_limit(%rip), %r15
jb .L226
leaq 8(%r15), %rsi
movq $1024, -8(%rsi)
movq 0(%rsp), %rax
movq %rax, (%rsi)
movq camlConfig + 432(%rip), %rdi
call caml_modify
movq $1, %rax
.loc 1 209
call camlFile_copier_test___line1_79836
.L228:
movq 8(%rsp), %rax
movq 88(%rax), %rsi
movq camlConfig + 432(%rip), %rdi
addq $24, %rdi
call caml_modify
movq $1, %rax
.loc 1 211
call camlFile_copier_test___line2_79837
.L229:
movq 8(%rsp), %rax
movq 96(%rax), %rsi
movq camlConfig + 432(%rip), %rdi
addq $32, %rdi
call caml_modify
movq $1, %rax
.loc 1 213
call camlFile_copier_test___line3_79838
Here the call to _line0
looks like:
movq $1, %rax
.loc 1 207
call camlFile_copier_test___line0_79835
The $1
in the %rax
register is the ()
argument. The lines
between the calls _line
functions are the asm generated from the
original source. For example, the source line
config.script <- Some script;
produced:
subq $16, %r15
cmpq caml_young_limit(%rip), %r15
jb .L226
leaq 8(%r15), %rsi
movq $1024, -8(%rsi)
movq 0(%rsp), %rax
movq %rax, (%rsi)
movq camlConfig + 432(%rip), %rdi
call caml_modify
Note that the .loc
annotation also gives us information about where
the function being called is defined in the original source file.
This approach to annotating the asm makes the effects of inlining
operations and such done by the compiler evident in the assembly.
3) Cost of OCaml constructs
3.1) Cost of memory allocation
Having seen the OCaml memory allocation instructions its easy to see that memory allocation itself is cheap - its just one addition and a test, in the common case.
The real cost of memory allocation comes during GC time. The more memory you allocate, even though the allocations are cheap, the more time your program will spend in the GC. So a general rule of thumb for speeding up programs is to reduce allocation rates. Useful reading includes:
- http://mirror.ocamlcore.org/ocaml-tutorial.org/garbage_collection.html (pdf)
- http://rwmj.wordpress.com/2009/08/06/ocaml-internals-part-3-the-minor-heap/ (pdf)
- http://rwmj.wordpress.com/2009/08/07/ocaml-internals-part-4-the-major-heap/ (pdf)
- http://rwmj.wordpress.com/2009/08/08/ocaml-internals-part-5-garbage-collection/ (pdf)
3.2) Closure allocation
One significant source of memory allocation is the allocation of closures. Closure allocation basically involves allocating memory for each of the free variables of the function and for a code (memory operations are more expensive than register operations) and pointer.
-
If a function is called just once or not called at all, try and restructure your code to not allocate the closure. For example: if one is going to fold or map over a list that is frequently empty, testing to see if the list is empty before doing the map operation will avoid the closure allocation.
-
Alternately, try to avoid writing functions that have free variables - instead pass all the arguments explicitly.
3.3) Inlining
It is easier to describe when the compiler does not inline rather than when it does.
- It does not inline large functions. I don’t know the exact metric.
- It does not inline “rec” bound functions - even if they are small and even if the function is not really recursive.
- It does not inline local functions
- It does not inline functions containing a structured constant such
as:
let f x = x :: [1]
- Function given as parameters are never inlined: http://caml.inria.fr/mantis/view.php?id=5917
To see if a function will be inlined or not you can look at the output
of ocamlobjinfo
. The inlining of a function is decided at
declaration time, so a function is either always inlined, either
never. Unless the .cmx
file containing it is not available when
compiling a unit using this function.
3.4) Stack allocation of arguments
Only a small number of arguments are passed in registers to functions - the actual number varies according to platform and possibly other optimizations that the compiler performs.
The remainder of the arguments have to passed via the stack. Hence functions that receive a large number of arguments will involve several memory operations for processing the callee saved register and then the stack allocation of formal arguments. In cases where it is feasible, breaking up the called function into smaller functions will allow for the arguments to be passed in registers and in possibly allow for inlining of the resulting smaller functions.
3.5) Polymorphic compare
OCaml provides a polymorphic compare caml_compare
for datatypes. It
works in much the same way the GC does by looking at the tags on
objects to determine their types at runtime and then using the
appropriate compare functions. This is more expensive than using a
compare operation written specifically for the actual type (for
example caml_float_compare
for comparing floats).
let cmp a b = compare a b
in assembly:
camlTest_asm__cmp_33386:
subq $8, %rsp
movq %rax, %rdi
movq %rbx, %rsi
leaq caml_compare(%rip), %rax
call caml_c_call
Here is an explicit float comparison:
let cmp_float1 (a : float) (b : float) = Float.compare a b
in assembly:
camlTest_asm__cmp_float1_33390:
subq $8, %rsp
movq %rax, %rdi
movq %rbx, %rsi
call caml_float_compare
addq $8, %rsp
ret
Here is an inferred float comparison:
let cmp_float3 (a : float) (b : float) = compare a b
in assembly:
camlTest_asm__cmp_float3_33679:
subq $8, %rsp
.L107:
movq %rax, %rdi
movq %rbx, %rsi
call caml_float_compare
addq $8, %rsp
ret
3.6) Inlining and polymorphic compare
In the following snippets the compiler can infer that the compare is only float comparison, however the inliner does not specialize the comparison operation.
let cmp_float2 (a : float) (b : float) = cmp a b
Note that the inlining has happened as expected, but no specialization:
camlTest_asm__cmp_float2_33676:
subq $8, %rsp
movq %rax, %rdi
movq %rbx, %rsi
leaq caml_compare(%rip), %rax
call caml_c_call
addq $8, %rsp
ret
Doing just renaming also does not buy us anything, it is just as bad.
let cmp_compare = compare let cmp_float4 (a : float) (b : float) = cmp_compare a b
As before, inlined but not specialized.
camlTest_asm__cmp_float4_33682:
subq $8, %rsp
movq %rax, %rdi
movq %rbx, %rsi
leaq caml_compare(%rip), %rax
call caml_c_call
addq $8, %rsp
ret
So when in doubt, manually specify the type specific compare function.
3.7) Float boxing. Time is usually represented as a float.
Why are floats boxed/unboxed? GC, IEEE format, SSE instructions, tagging… this is a long story and one should look for descriptions of this online. Here are some general rules about when floats occur boxed and unboxed:
- Float arguments to (non-inlined) functions are boxed.
- Float return values from (non-inlined) functions are boxed.
- Floats have to be unboxed for doing calculations - they usually use the SSE instructions of the CPU.
- An array of floats uses an immediate representation - ie. each float is not boxed.
- A record of consisting only of floats uses an immediate representation. This does not apply to polymorphic records that get specialized to floats. This does not apply to abstract types that are internally defined to be floats.
- An record that has floats and non-floats has the floats boxed.
- A tuple of floats has the individual floats boxed.
3.8) Float compare
Compare for floats caml_float_compare requires the floats to be boxed. This can be terrible when you have immediate floats.
type t = { x : float; y : float; } let cmp_t1 a b = begin match compare a.x b.x with 0 -> compare a.y b.y x -> x end
Becomes the following. Notice the boxing operations for the floats.
camlTest_asm__cmp_t1_33693:
subq $8, %rsp
.L113:
movq %rax, %rbp
.L114: subq $32, %r15
cmpq caml_young_limit(%rip), %r15
jb .L115
leaq 8(%r15), %rsi
movq $1277, -8(%rsi)
movlpd (%rbx), %xmm0
movlpd %xmm0, (%rsi)
leaq 16(%rsi), %rdi
movq $1277, -8(%rdi)
movlpd (%rbp), %xmm0
movlpd %xmm0, (%rdi)
call caml_float_compare
cmpq $1, %rax
je .L112
addq $8, %rsp
ret
.align 4
.L112:
.L117: subq $32, %r15
cmpq caml_young_limit(%rip), %r15
jb .L118
leaq 8(%r15), %rsi
movq $1277, -8(%rsi)
movlpd 8(%rbx), %xmm0
movlpd %xmm0, (%rsi)
leaq 16(%rsi), %rdi
movq $1277, -8(%rdi)
movlpd 8(%rbp), %xmm0
movlpd %xmm0, (%rdi)
call caml_float_compare
addq $8, %rsp
ret
The floats are boxed and then there are calls to the underlying float comparison. A much more efficient, though tedious function to write would be:
let cmp_t2 a b = if a.x = b.x then begin if a.y = b.y then 0 else if a.y < b.y then -1 else 1 end else begin if a.x < b.x then -1 else 1 end
Which becomes:
camlTest_asm__cmp_t2_33697:
.L124:
movlpd (%rbx), %xmm1
movlpd (%rax), %xmm0
ucomisd %xmm1, %xmm0
jp .L121
jne .L121
movlpd 8(%rbx), %xmm1
movlpd 8(%rax), %xmm0
ucomisd %xmm1, %xmm0
jp .L123
jne .L123
movq $3, %rax
ret
Hence use =, <= and >= when possible in the case of floats. These map directly to the underlying asm instructions. NOTE: It may be useful to rewrite Float.compare using =, <= and >=.
3.9) Mutable fields in records
Mutations have an additional cost when the mutated value is heap allocated - ie. when the value is a boxed float, a list, a tuple, a record etc.
These mutations involve calls to the caml write barrier. A simple benchmark shows that invoking the write barrier repeatedly makes thing about 5-6 times slower. Consider the following snippet: Consider:
type t = { x : int; mutable y : float; } let set_y t y = t.y <- y
Which generates the following. Here there are calls to the write barrier.
camlTest_asm__set_y_33718:
subq $8, %rsp
.L136:
addq $8, %rax
movq %rax, %rdi
movq %rbx, %rsi
call caml_modify
movq $1, %rax
addq $8, %rsp
ret
Contrast the above with the following. Here the only difference is the type of x which causes the float to have an immediate representation (remember that in mixed records floats are boxed, whereas they are not in float-only records).
type t = { x : float; mutable y : float; } let set_y t y = t.y <- y
Which generates:
camlTest_asm__set_y_33727:
.L137:
movlpd (%rbx), %xmm0
movlpd %xmm0, 8(%rax)
movq $1, %rax
ret
Changing representation of your data to avoid calls to the write barrier will help performance. As a side effect, your code may have the additional overhead of introducing unboxing and boxing instructions and so the next result may or may not be advantageous - one has to profile to sure.
3.10) Mutable single-field float records can outperform float refs.
As a direct consequence of the above, float ref represents the float as a boxed value. Mutating the ref causes calls to the write barrier. The following type does not:
type float_ref = { mutable x : float; }
While it is tempting to try and represent floats in an immediate way as much as possible, however this sometimes hinder performance rather than help it. The fact that function arguments/returns are boxed can cause repeated boxing and unboxing as the floats are passed around.
Note: A useful experiment to try is to use the above float_ref
type
everywhere that one uses floats and explicitly cast to float only when
one needs to do calculations. With respect to memory layout,
float_ref
is essentially the same as a boxed float but we explicitly
have to insert the boxing and unboxing operations - making these
explicit can help us optimize away needless cases and save on
mutations.
3.11) Inlining avoids boxing (sometimes).
If the following function is inlined the argument and return value are not boxed.
let f x = x + 0.1
However, in the following function it is possible that x
is used as
a boxed float sometimes, the OCaml compiler will always box the float
on entry to the function.
let f x = if rarely_true_flag then ls := (x :: !ls); x + 0.1
NOTE: This could be something that can be fixed in the compiler by pushing the box allocation into the conditional.
3.12) Local module definitions are not free
Local module definitions have a runtime cost - they are not purely syntactic.
let mod_test x = let module M = T1.T2 in M.f x
Generates:
camlTest_asm__mod_test_33742:
movq camlTest_asm + 120(%rip), %rbx
movq 8(%rbx), %rbx
movq (%rax), %rax
ret
Notice that the value in %rbx has no real purpose. The inlined call to M.f is only the instruction movq (%rax), %rax. If we write:
module M = T1.T2 let mod_test x = M.f x
We have:
camlTest_asm__mod_test_33745:
movq (%rax), %rax
ret
3.13) Functor application is a runtime operation.
This:
module S(T : T_intf) = struct let g x = (T.f x) + 1 end let mod_test2 x = let module M = S(T1.T2) in M.g x
Generates:
call camlTest_asm__S_33743
The name of the functor S shows up as a function name. There is not much one can do about this other than to move functor applications outside function definitions such that they are not repeatedly applied.
3.16) Option.iter is sometimes not inlined
This:
let opt_iter opt = Option.iter opt ~f:(fun x -> x_ref := x)
Becomes:
camlTest_asm__opt_iter_33732:
.L140:
leaq camlTest_asm__18(%rip), %rbx
cmpq $1, %rax
je .L139
movq (%rax), %rax
movq (%rbx), %rdi
jmp *%rdi
.align 4
.L139:
movq $1, %rax
ret
The compiler does not always do this. I don’t know why - one should check.
NOTE: The above code is not as optimal as it could be - in particular the anonymous function itself is not inlined. Even if the inlining is not done it could still be better by moving the functional address lookup inside the conditional and avoiding the assignment to %rax. We should ask the compiler folk to fix this.
One should hand-inline the above code when perf matters.
let opt_iter opt = match opt with Some x -> x_ref := x None -> ()
3.17) Optional arguments are not free even when inlined.
When a function with optional arguments (and default values) is marked as an inlined function, the inlining does not eliminate the cost of allocating the intermediate option type. Details below.
Consider:
let f2 t ?(i1=t.i1) ?(i2=t.i2) ?(sub1=t.sub1) () = let t = { t with i1; i2; sub1; } in t let _f3 i2 = let t = {i1 = 10; i2= 20; sub1={f1=1.0; f2=None}; } in let t = f2 t ~i1:200 ~i2 () in t
At the call site, the call to f2
is inlined. However the asm looks
like this. See annotations inline.
.L128: subq $16, %r15
cmpq caml_young_limit(%rip), %r15 ; (2 words) camlAsm___f3_65118
jb .L129
leaq 8(%r15), %rbx
movq $1024, -8(%rbx) ; cons(tag 0) size 1
movq 0(%rsp), %rax
movq %rax, (%rbx) ; rbx now points to [Some i2] which just got
; allocated.
leaq camlAsm__2(%rip), %rax ; this is the constant ~i1:200
cmpq $1, %rax ; check for None
je .L122
movq (%rax), %rdx ; this path is always taken.
jmp .L121
.align 4
.L122:
movq 8(%rsp), %rax ; this path is never taken.
movq (%rax), %rdx
.L121:
cmpq $1, %rbx ; check for None in rbx (which we had just allocated)
je .L120
movq (%rbx), %rsi ; this path is always taken
jmp .L119
.align 4
.L120:
movq 8(%rsp), %rax ; this path is never taken.
movq 8(%rax), %rsi
.L119:
In summary:
1) The call to f2
has indeed been inlined. However the inlining has
only saved us the cost of making a function call (roughly, the cost of
doing one jump). It has not optimized away the cost of the optional
arguments which can now be fully determined at compile time.
2) The argument ~i1:200
is allocated at compile time to the constant
Some 200
. At runtime the compiler checks this constant to see if
it is a None or Some.
3) The argument ~i2
is allocated to a Some i2
at runtime and is
immediately checked for at runtime to see if it None or Some.
This should really be fixed in the compiler because optional arguments that are properly inlined are a very valuable feature.
4) Misc
Some misc perf notes.
1) glibc pow()
and the OCaml **
operator.
OCaml’s **
operator calls pow()
in glibc. Until glibc version
2.12, the execution time of pow()
from the standard math lib varies
significantly depending on the numeric values of arguments. This
behavior is apparently fixed in glibc 2.14. Consider:
open Core.Std let _ = _squelch_unused_module_warning_ open Core_bench.Std let () = Command.run (Bench.make_command [ Bench.Test.create ~name:"Float ** (1)" (fun () -> ignore (1.0000000000000020 ** 0.5000000000100000)); Bench.Test.create ~name:"Float ** (2)" (fun () -> ignore (1.0000000000000020 ** 0.5000000000000000)); Bench.Test.create ~name:"Float ** (3)" (fun () -> ignore (1.0000000000000020 ** 0.4999999999000000)); Bench.Test.create ~name:"Float ** (4)" (fun () -> ignore (1.0000000000000020 ** 0.4999900000000000)); ])
which produces:
$ ./z.exe -q 1 Estimated testing time 4s (4 benchmarks x 1s). Change using -quota SECS. ┌──────────────┬──────────────┬─────────┬────────────┐ │ Name │ Time/Run │ mWd/Run │ Percentage │ ├──────────────┼──────────────┼─────────┼────────────┤ │ Float ** (1) │ 11_491.57ns │ 2.07w │ 2.87% │ │ Float ** (2) │ 400_987.47ns │ 2.57w │ 100.00% │ │ Float ** (3) │ 11_465.42ns │ 2.07w │ 2.86% │ │ Float ** (4) │ 59.25ns │ 2.00w │ 0.01% │ └──────────────┴──────────────┴─────────┴────────────┘
The typical execution time for this function is 60-70ns, but in some
specific cases it can be as bad as 400us. One workaround for the
problem is to reduce the precision of x
in pow(x, y)
, by doing
(Float.round_nearest (x *. 1e8)) *. 1e-8.
.
There is surprisingly little about this issue on the Internet. Here are two somewhat relevant links:
http://entropymine.com/imageworsener/slowpow/
http://stackoverflow.com/questions/9272155/replacing-extrordinarily-slow-pow-function
2) Strings, Bigstrings and the OCaml runtime lock.
In the following code, the OCaml runtime lock is being held while the time consuming LZ4 compression routine is run. The OCaml runtime lock forces all calls into the C code to be sequential, since it behaves like a critical section.
CAMLprim value _LZ4_compress_bigstring(value src, value dst, value isize) { CAMLparam3(src, dst, isize); char* src_c = get_bstr(src, 0); char* dst_c = get_bstr(dst, 0); int res = LZ4_compress(src_c, dst_c, Int_val(isize)); CAMLreturn (Val_int(res)); }
The code snippet can be made “faster” by releasing runtime lock as follows:
CAMLprim value _LZ4_compress_bigstring(value src, value dst, value isize) { CAMLparam3(src, dst, isize); char* src_c = get_bstr(src, 0); char* dst_c = get_bstr(dst, 0); caml_enter_blocking_section() int res = LZ4_compress(src_c, dst_c, Int_val(isize)); caml_leave_blocking_section() CAMLreturn (Val_int(res)); }
Note that src and dst are Bigstrings.
The snippet is faster in the sense that it allows the OCaml runtime to schedule other threads that may use the runtime lock. For example, this may be the case with an Async program where many jobs are trying to compress strings and multiple such compression routines can happen on different worker threads.
This does not work with strings. In particular, doing the following is unsafe:
CAMLprim value _LZ4_compress_string(value src, value dst, value isize) { CAMLparam3(src, dst, isize); char* src_c = String_val(src); char* dst_c = String_val(dst); caml_enter_blocking_section() int res = LZ4_compress(src_c, dst_c, Int_val(isize)); caml_leave_blocking_section() CAMLreturn (Val_int(res)); }
The reason is that OCaml strings live on the OCaml heap. By releasing the runtime lock, we allow the OCaml GC to run. If the OCaml GC moves the string in memory while the C code is using it, then things fail.
This is safe to do with Bigstrings since they are allocated on the C
heap. The OCaml GC never moves them. The GC may move the stub that
point to the C heap, but the values of the src_c
and dst_c
pointers
always point to the right data.
Related reading: http://d.hatena.ne.jp/camlspotter/20100309/1268111257 (pdf)
3) Fast, Slow and Incorrect Array blits
Array blits in OCaml, in the general case, are slow – i.e. they are not merely memmove/memcpy calls as one might assume.
Array.blit as provided by Core and also supplied by OCaml call
caml_array_blit
provided by the runtime. In the OCaml runtime,
there are arrays tagged as “double arrays” – these are typically
float arrays. Here are the rules:
1) caml_array_blit
does a memmove for double arrays. (For float
arrays)
2) caml_array_blit
does a memmove for arrays in the minor
heap. (i.e. for arrays created recently).
3) For every other array type there is a for-loop that calls
caml_modify
.
Here, caml_modify
is OCaml’s write barrier (see notes earlier about
this). Blits that fall under case 3 are an order of magnitude
slower. Long lived int arrays, bool arrays and arrays of other simple
immediate value representations will fall under the third case thereby
incurring the cost of calling caml_modify.
1) Compiler patching
It may be possible to patch the compiler to create arrays with representation guarantees at initialization time, such they are not conservatively treated as arrays of mutable references. We don’t have this yet.
2) Use Bigarrays
One can use Bigarrays instead of regular OCaml arrays for immediate values. One approach to blitting Bigarrays is to use the provided blit function:
let blit arr ~len ~offset ~delta = let sub1 = Array1.sub arr offset len in let sub2 = Array1.sub arr (offset+delta) len in Array1.blit sub1 sub2
However, this is not as fast as one would expect for two reasons: (1) creating sub-arrays is slow and (2) creating sub-arrays allocates words on the major heap.
A faster blit for Bigarrays is to (1) write your own in C OR (2) hijack the blit provided by Bigstrings (which are essentially Bigarrays). One would do the later as follows:
let unsafe_blit ~arr1 ~arr2 ~len ~offset1 ~offset2 ~sizeof = let src_pos = offset1 * sizeof in let dst_pos = offset2 * sizeof in let len = len * sizeof in Bigstring.unsafe_blit ~src:(Obj.magic arr1) ~dst:(Obj.magic arr2) ~src_pos ~dst_pos ~len
This is indeed a fast blit and works between Bigarrays of immediate types such as floats, ints, bools, chars etc. Further, for large enough blits one can give up the caml runtime lock, thereby allow other ocaml code to run in parallel with the blit. (see Section on the OCaml runtime lock.)
To blit from a bigarray to an OCaml array one would do something like:
let unsafe_blit_bigstring_string ~arr1 ~arr2 ~len ~offset1 ~offset2 ~sizeof = let src_pos = offset1 * sizeof in let dst_pos = offset2 * sizeof in let len = len * sizeof in Bigstring.unsafe_blit_bigstring_string ~src:(Obj.magic arr1) ~dst:(Obj.magic arr2) ~src_pos ~dst_pos ~len
The above snippet works well for Bigarrays of floats but not for much else, i.e. this approach does not work well if one needs to blit values our from Bigarrays to regular OCaml arrays of any type.
The reason this is so is that the when assigning a value into a Bigarray, the library changes the values representation from its OCaml representation to a representation compatible with native C code. (Bigarrays were after all originally designed for interop with C code). In the case of floats, OCaml does change the word level representation of floats and so the above snippet words fine. Hence the following function works as expected:
let snapshot_to_float_array (arr : float_bigarray) ~offset ~len = if len = 0 then [||] else begin let dest = Array.create ~len 0.0 in unsafe_blit_bigstring_string ~arr1:arr ~arr2:dest ~offset1:offset ~offset2:0 ~len ~sizeof:sizeof_float64; dest end
However for ints and other OCaml immediate types, the OCaml representation involves setting some tag bits. When values are assigned to Bigarrays, “bigarr.{i} <- “, this representation is stripped: (x) » 1. When values are read back using the getter provided, “<- bigarr.{i}”, the tag bits are reconstructed: ((intnat)(x) « 1) + 1.
Hence the following does not work:
let snapshot_to_int_array (arr : int_bigarray) ~offset ~len = if len = 0 then [||] else begin let dest = Array.create ~len 0 in unsafe_blit_bigstring_string ~arr1:arr ~arr2:dest ~offset1:offset ~offset2:0 ~len ~sizeof:sizeof_int; dest end
If one were to blit/memove the values from Bigarrays to OCaml arrays, the tag bits are not reconstructed causing numbers to be garbled and confusing the GC. The following does work however, but it is slow:
let snapshot_to_array bigarr ~offset ~len = if len = 0 then [||] else begin let arr = Array.create ~len bigarr.{0} in for i = 0 to len - 1 do arr.(i) <- bigarr.{offset+i} done; arr end
3) Use FFI and build your own.
The most direct solution is to write your own FFI and call a memmove in C. To ensure that this is not misused for array types that may have references, the types should be fixed to the concrete types that this is allowed for.
CAMLprim value _array_blit64(value src, value dst, value src_pos, value dst_pos, value len) { CAMLparam5(src, dst, src_pos, dst_pos, len); int isrc_pos = Int_val(src_pos); int idst_pos = Int_val(dst_pos); int ilen = Int_val(len); int idst_len = Wosize_val(dst); int isrc_len = Wosize_val(src); // ensure we are dealing with 64-bit values assert(sizeof(value) == 8); //ensure that we are within bounds assert(ilen >=0); assert(isrc_pos >= 0 && isrc_pos + ilen <= isrc_len); assert(idst_pos >= 0 && idst_pos + ilen <= idst_len); memmove(String_val(dst) + idst_pos * sizeof(value), String_val(src) + isrc_pos * sizeof(value), ilen * sizeof(value)); CAMLreturn (Val_unit); }
ml/mli file bindings:
external int_array_blit : src:int array -> dst:int array -> src_pos:int -> dst_pos:int -> len:int -> unit = "_array_blit64" external float_array_blit : src:float array -> dst:float array -> src_pos:int -> dst_pos:int -> len:int -> unit = "_array_blit64"
While this gives us fast blits on immediate arrays, we do lose the ability to give up the caml runtime lock for large blits.
4) Time.now()
and RDTSC
What may arguably be the best approach to timing OCaml code is to manually instrument your code using Time.now. One can experiment with various configurations of the code by commenting out parts of the code to understand the relative costs of particular functions. The downside to using Time.now() is that (1) it is an expensive call and (2) it has limited resolution and is hence not useful in measuring snippets that may only take milliseconds.
RDTSC is a light weight CPU counter which can be used instead of Time.now(). It gives a measure of CPU cycles elapsed between subsequent calls and it useful for timing.
CAMLprim value caml_rdtsc( ) { unsigned hi, lo; __asm__ __volatile__ ("rdtsc" : "=a"(lo), "=d"(hi)); return Val_int( ((unsigned long long)lo)|( ((unsigned long long)hi)<<32 )); }