However, sometimes we just want the program to be faster. This chapter introduces some techniques to squeeze the CPU power out.
Finding Bottlenecks
Acquiring Execution Time
The macro time
is very useful for finding out bottlenecks. It takes
a form, evaluates it and prints timing information in
*trace-output*
, as shown below:
* (defun collect (start end)
"Collect numbers [start, end] as list."
(loop for i from start to end
collect i))
* (time (collect 1 10))
Evaluation took:
0.000 seconds of real time
0.000001 seconds of total run time (0.000001 user, 0.000000 system)
100.00% CPU
3,800 processor cycles
0 bytes consed
By using the time
macro it is fairly easy to find out which part of your program
takes too much time.
Please note that the timing information provided here is not guaranteed to be reliable enough for marketing comparisons. It should only be used for tuning purpose, as demonstrated in this chapter.
Know your Lisp’s statistical profiler
Implementations ship their own profilers. SBCL has
sb-profile, a
“classic, per-function-call” deterministic profiler and
sb-sprof, a
statistical profiler. The latter works by taking samples of the
program execution at regular intervals, instead of instrumenting
functions like sb-profile:profile
does.
You might find sb-sprof more useful than the deterministic profiler when profiling functions in the common-lisp-package, SBCL internals, or code where the instrumenting overhead is excessive.
Use flamegraphs and other tracing profilers
cl-flamegraph is a wrapper around SBCL’s statistical profiler to generate FlameGraph charts. Flamegraphs are a very visual way to search for hotspots in your code:
See also tracer, a tracing
profiler for SBCL. Its output is suitable for display in
Chrome’s or Chromium’s Tracing Viewer (chrome://tracing
).
Checking Assembly Code
The function disassemble
takes a function and prints the
compiled code of it to *standard-output*
. For example:
* (defun plus (a b)
(+ a b))
PLUS
* (disassemble 'plus)
; disassembly for PLUS
; Size: 37 bytes. Origin: #x52B8063B
; 3B: 498B5D60 MOV RBX, [R13+96] ; no-arg-parsing entry point
; thread.binding-stack-pointer
; 3F: 48895DF8 MOV [RBP-8], RBX
; 43: 498BD0 MOV RDX, R8
; 46: 488BFE MOV RDI, RSI
; 49: FF14250102 CALL QWORD PTR [#x52100] ; GENERIC-+
; 50: 488B75E8 MOV RSI, [RBP-24]
; 54: 4C8B45F0 MOV R8, [RBP-16]
; 58: 488BE5 MOV RSP, RBP
; 5B: F8 CLC
; 5C: 5D POP RBP
; 5D: C3 RET
; 5E: CC0F BREAK 15 ; Invalid argument count trap
The code above was evaluated in SBCL. In some other implementations such as
CLISP, disassembly
might return something different:
* (defun plus (a b)
(+ a b))
PLUS
* (disassemble 'plus)
Disassembly of function PLUS
2 required arguments
0 optional arguments
No rest parameter
No keyword parameters
4 byte-code instructions:
0 (LOAD&PUSH 2)
1 (LOAD&PUSH 2)
2 (CALLSR 2 55) ; +
5 (SKIP&RET 3)
NIL
It is because SBCL compiles the Lisp code into machine code, while CLISP does not.
Using Declare Expression
The declare expression can be used to provide hints for compilers to perform various optimization. Please note that these hints are implementation-dependent. Some implementations such as SBCL support this feature, and you may refer to their own documentation for detailed information. Here only some basic techniques mentioned in CLHS are introduced.
In general, declare expressions can occur only at the beginning of the bodies of certain forms, or immediately after a documentation string if the context allows. Also, the content of a declare expression is restricted to limited forms. Here we introduce some of them that are related to performance tuning.
Please keep in mind that these optimization skills introduced in this section
are strongly connected to the Lisp implementation selected. Always check their
documentation before using declare
!
Speed and Safety
Lisp allows you to specify several quality properties for the compiler using
the declaration optimize
. Each quality may be assigned a value
from 0 to 3, with 0 being “totally unimportant” and 3 being “extremely
important”.
The most significant qualities might be safety
and speed
.
By default, Lisp considers code safety to be much more important than speed. But you may adjust the weight for more aggressive optimization.
* (defun max-original (a b)
(max a b))
MAX-ORIGINAL
* (disassemble 'max-original)
; disassembly for MAX-ORIGINAL
; Size: 144 bytes. Origin: #x52D450EF
; 7A7: 8D46F1 lea eax, [rsi-15] ; no-arg-parsing entry point
; 7AA: A801 test al, 1
; 7AC: 750E jne L0
; 7AE: 3C0A cmp al, 10
; 7B0: 740A jeq L0
; 7B2: A80F test al, 15
; 7B4: 7576 jne L5
; 7B6: 807EF11D cmp byte ptr [rsi-15], 29
; 7BA: 7770 jnbe L5
; 7BC: L0: 8D43F1 lea eax, [rbx-15]
; 7BF: A801 test al, 1
; 7C1: 750E jne L1
; 7C3: 3C0A cmp al, 10
; 7C5: 740A jeq L1
; 7C7: A80F test al, 15
; 7C9: 755A jne L4
; 7CB: 807BF11D cmp byte ptr [rbx-15], 29
; 7CF: 7754 jnbe L4
; 7D1: L1: 488BD3 mov rdx, rbx
; 7D4: 488BFE mov rdi, rsi
; 7D7: B9C1030020 mov ecx, 536871873 ; generic->
; 7DC: FFD1 call rcx
; 7DE: 488B75F0 mov rsi, [rbp-16]
; 7E2: 488B5DF8 mov rbx, [rbp-8]
; 7E6: 7E09 jle L3
; 7E8: 488BD3 mov rdx, rbx
; 7EB: L2: 488BE5 mov rsp, rbp
; 7EE: F8 clc
; 7EF: 5D pop rbp
; 7F0: C3 ret
; 7F1: L3: 4C8BCB mov r9, rbx
; 7F4: 4C894DE8 mov [rbp-24], r9
; 7F8: 4C8BC6 mov r8, rsi
; 7FB: 4C8945E0 mov [rbp-32], r8
; 7FF: 488BD3 mov rdx, rbx
; 802: 488BFE mov rdi, rsi
; 805: B929040020 mov ecx, 536871977 ; generic-=
; 80A: FFD1 call rcx
; 80C: 4C8B45E0 mov r8, [rbp-32]
; 810: 4C8B4DE8 mov r9, [rbp-24]
; 814: 488B75F0 mov rsi, [rbp-16]
; 818: 488B5DF8 mov rbx, [rbp-8]
; 81C: 498BD0 mov rdx, r8
; 81F: 490F44D1 cmoveq rdx, r9
; 823: EBC6 jmp L2
; 825: L4: CC0A break 10 ; error trap
; 827: 04 byte #X04
; 828: 13 byte #X13 ; OBJECT-NOT-REAL-ERROR
; 829: FE9B01 byte #XFE, #X9B, #X01 ; RBX
; 82C: L5: CC0A break 10 ; error trap
; 82E: 04 byte #X04
; 82F: 13 byte #X13 ; OBJECT-NOT-REAL-ERROR
; 830: FE1B03 byte #XFE, #X1B, #X03 ; RSI
; 833: CC0A break 10 ; error trap
; 835: 02 byte #X02
; 836: 19 byte #X19 ; INVALID-ARG-COUNT-ERROR
; 837: 9A byte #X9A ; RCX
* (defun max-with-speed-3 (a b)
(declare (optimize (speed 3) (safety 0)))
(max a b))
MAX-WITH-SPEED-3
* (disassemble 'max-with-speed-3)
; disassembly for MAX-WITH-SPEED-3
; Size: 92 bytes. Origin: #x52D452C3
; 3B: 48895DE0 mov [rbp-32], rbx ; no-arg-parsing entry point
; 3F: 488945E8 mov [rbp-24], rax
; 43: 488BD0 mov rdx, rax
; 46: 488BFB mov rdi, rbx
; 49: B9C1030020 mov ecx, 536871873 ; generic->
; 4E: FFD1 call rcx
; 50: 488B45E8 mov rax, [rbp-24]
; 54: 488B5DE0 mov rbx, [rbp-32]
; 58: 7E0C jle L1
; 5A: 4C8BC0 mov r8, rax
; 5D: L0: 498BD0 mov rdx, r8
; 60: 488BE5 mov rsp, rbp
; 63: F8 clc
; 64: 5D pop rbp
; 65: C3 ret
; 66: L1: 488945E8 mov [rbp-24], rax
; 6A: 488BF0 mov rsi, rax
; 6D: 488975F0 mov [rbp-16], rsi
; 71: 4C8BC3 mov r8, rbx
; 74: 4C8945F8 mov [rbp-8], r8
; 78: 488BD0 mov rdx, rax
; 7B: 488BFB mov rdi, rbx
; 7E: B929040020 mov ecx, 536871977 ; generic-=
; 83: FFD1 call rcx
; 85: 488B45E8 mov rax, [rbp-24]
; 89: 488B75F0 mov rsi, [rbp-16]
; 8D: 4C8B45F8 mov r8, [rbp-8]
; 91: 4C0F44C6 cmoveq r8, rsi
; 95: EBC6 jmp L0
As you can see, the generated assembly code is much shorter (92 bytes VS 144). The compiler was able to perform optimizations. Yet we can do better by declaring types.
Type Hints
As mentioned in the Type System chapter, Lisp has a relatively powerful type system. You may provide type hints so that the compiler may reduce the size of the generated code.
* (defun max-with-type (a b)
(declare (optimize (speed 3) (safety 0)))
(declare (type integer a b))
(max a b))
MAX-WITH-TYPE
* (disassemble 'max-with-type)
; disassembly for MAX-WITH-TYPE
; Size: 42 bytes. Origin: #x52D48A23
; 1B: 488BF7 mov rsi, rdi ; no-arg-parsing entry point
; 1E: 488975F0 mov [rbp-16], rsi
; 22: 488BD8 mov rbx, rax
; 25: 48895DF8 mov [rbp-8], rbx
; 29: 488BD0 mov rdx, rax
; 2C: B98C030020 mov ecx, 536871820 ; generic-<
; 31: FFD1 call rcx
; 33: 488B75F0 mov rsi, [rbp-16]
; 37: 488B5DF8 mov rbx, [rbp-8]
; 3B: 480F4CDE cmovl rbx, rsi
; 3F: 488BD3 mov rdx, rbx
; 42: 488BE5 mov rsp, rbp
; 45: F8 clc
; 46: 5D pop rbp
; 47: C3 ret
The size of generated assembly code shrunk to about 1/3 of the size. What about speed?
* (time (dotimes (i 10000) (max-original 100 200)))
Evaluation took:
0.000 seconds of real time
0.000107 seconds of total run time (0.000088 user, 0.000019 system)
100.00% CPU
361,088 processor cycles
0 bytes consed
* (time (dotimes (i 10000) (max-with-type 100 200)))
Evaluation took:
0.000 seconds of real time
0.000044 seconds of total run time (0.000036 user, 0.000008 system)
100.00% CPU
146,960 processor cycles
0 bytes consed
You see, by specifying type hints, our code runs much faster!
But wait…What happens if we declare wrong types? The answer is: it depends.
For example, SBCL treats type declarations in a special way. It performs different levels of type checking according to the safety level. If safety level is set to 0, no type checking will be performed. Thus a wrong type specifier might cause a lot of damage.
More on Type Declaration with declaim
If you try to evaluate a declare
form in the top level, you might get the
following error:
Execution of a form compiled with errors.
Form:
(DECLARE (SPEED 3))
Compile-time error:
There is no function named DECLARE. References to DECLARE in some contexts
(like starts of blocks) are unevaluated expressions, but here the expression is
being evaluated, which invokes undefined behaviour.
[Condition of type SB-INT:COMPILED-PROGRAM-ERROR]
This is because type declarations have scopes. In the examples above, we have seen type declarations applied to a function.
During development it is usually useful to raise the importance of safety in order to find out potential problems as soon as possible. On the contrary, speed might be more important after deployment. However, it might be too verbose to specify declaration expression for each single function.
The macro declaim
provides such possibility. It can be used as a
top level form in a file and the declarations will be made at compile-time.
* (declaim (optimize (speed 0) (safety 3)))
NIL
* (defun max-original (a b)
(max a b))
MAX-ORIGINAL
* (disassemble 'max-original)
; disassembly for MAX-ORIGINAL
; Size: 181 bytes. Origin: #x52D47D9C
...
* (declaim (optimize (speed 3) (safety 3)))
NIL
* (defun max-original (a b)
(max a b))
MAX-ORIGINAL
* (disassemble 'max-original)
; disassembly for MAX-ORIGINAL
; Size: 142 bytes. Origin: #x52D4815D
Please note that declaim
works in compile-time of a file. It is mostly
used to make some declares local to that file. And it is unspecified whether
or not the compile-time side-effects of a declaim persist after the file has
been compiled.
Declaring function types
Another useful declaration is a ftype
declaration which establishes
the relationship between the function argument types and the return value type.
If the type of passed arguments matches the declared types, the return value type
is expected to match the declared one. Because of that, a function can have more
than one ftype
declaration associated with it. A ftype
declaration restricts
the type of the argument every time the function is called. It has the following form:
(declaim (ftype (function (arg1 arg2 ...) return-value)
function-name1))
If the function returns nil
, its return type is null
.
This declaration does not put any restriction on the types of arguments by itself.
It only takes effect if the provided arguments have the specified types – otherwise
no error is signaled and declaration has no effect. For example,
the following declamation states that if the argument to the function square
is a fixnum
, the value of the function will also be a fixnum
:
(declaim (ftype (function (fixnum) fixnum) square))
(defun square (x) (* x x))
If we provide it with the argument which is not declared to be of type fixnum
,
no optimization will take place:
(defun do-some-arithmetic (x)
(the fixnum (+ x (square x))))
Now let’s try to optimize the speed. The compiler will state that there is type uncertainty:
(defun do-some-arithmetic (x)
(declare (optimize (speed 3) (debug 0) (safety 0)))
(the fixnum (+ x (square x))))
; compiling (DEFUN DO-SOME-ARITHMETIC ...)
; file: /tmp/slimeRzDh1R
in: DEFUN DO-SOME-ARITHMETIC
; (+ TEST-FRAMEWORK::X (TEST-FRAMEWORK::SQUARE TEST-FRAMEWORK::X))
;
; note: forced to do GENERIC-+ (cost 10)
; unable to do inline fixnum arithmetic (cost 2) because:
; The first argument is a NUMBER, not a FIXNUM.
; unable to do inline (signed-byte 64) arithmetic (cost 5) because:
; The first argument is a NUMBER, not a (SIGNED-BYTE 64).
; etc.
;
; compilation unit finished
; printed 1 note
(disassemble 'do-some-arithmetic)
; disassembly for DO-SOME-ARITHMETIC
; Size: 53 bytes. Origin: #x52CD1D1A
; 1A: 488945F8 MOV [RBP-8], RAX ; no-arg-parsing entry point
; 1E: 488BD0 MOV RDX, RAX
; 21: 4883EC10 SUB RSP, 16
; 25: B902000000 MOV ECX, 2
; 2A: 48892C24 MOV [RSP], RBP
; 2E: 488BEC MOV RBP, RSP
; 31: E8C2737CFD CALL #x504990F8 ; #<FDEFN SQUARE>
; 36: 480F42E3 CMOVB RSP, RBX
; 3A: 488B45F8 MOV RAX, [RBP-8]
; 3E: 488BFA MOV RDI, RDX
; 41: 488BD0 MOV RDX, RAX
; 44: E807EE42FF CALL #x52100B50 ; GENERIC-+
; 49: 488BE5 MOV RSP, RBP
; 4C: F8 CLC
; 4D: 5D POP RBP
; 4E: C3 RET
NIL
Now we can add a type declaration for x
, so the compiler can assume
that the expression (square x)
is a fixnum
, and use the fixnum-specific +
:
(defun do-some-arithmetic (x)
(declare (optimize (speed 3) (debug 0) (safety 0)))
(declare (type fixnum x))
(the fixnum (+ x (square x))))
(disassemble 'do-some-arithmetic)
; disassembly for DO-SOME-ARITHMETIC
; Size: 48 bytes. Origin: #x52C084DA
; 4DA: 488945F8 MOV [RBP-8], RAX ; no-arg-parsing entry point
; 4DE: 4883EC10 SUB RSP, 16
; 4E2: 488BD0 MOV RDX, RAX
; 4E5: B902000000 MOV ECX, 2
; 4EA: 48892C24 MOV [RSP], RBP
; 4EE: 488BEC MOV RBP, RSP
; 4F1: E8020C89FD CALL #x504990F8 ; #<FDEFN SQUARE>
; 4F6: 480F42E3 CMOVB RSP, RBX
; 4FA: 488B45F8 MOV RAX, [RBP-8]
; 4FE: 4801D0 ADD RAX, RDX
; 501: 488BD0 MOV RDX, RAX
; 504: 488BE5 MOV RSP, RBP
; 507: F8 CLC
; 508: 5D POP RBP
; 509: C3 RET
NIL
Code Inline
The declaration inline
replaces function calls with function body,
if the compiler supports it. It will save the cost of function calls but will
potentially increase the code size. The best situation to use inline
might
be those small but frequently used functions. The following snippet shows how
to encourage and prohibit code inline.
;; The globally defined function DISPATCH should be open-coded,
;; if the implementation supports inlining, unless a NOTINLINE
;; declaration overrides this effect.
(declaim (inline dispatch))
(defun dispatch (x) (funcall (get (car x) 'dispatch) x))
;; Here is an example where inlining would be encouraged.
;; Because function DISPATCH was defined as INLINE, the code
;; inlining will be encouraged by default.
(defun use-dispatch-inline-by-default ()
(dispatch (read-command)))
;; Here is an example where inlining would be prohibited.
;; The NOTINLINE here only affects this function.
(defun use-dispatch-with-declare-notinline ()
(declare (notinline dispatch))
(dispatch (read-command)))
;; Here is an example where inlining would be prohibited.
;; The NOTINLINE here affects all following code.
(declaim (notinline dispatch))
(defun use-dispatch-with-declaim-noinline ()
(dispatch (read-command)))
;; Inlining would be encouraged because you specified it.
;; The INLINE here only affects this function.
(defun use-dispatch-with-inline ()
(declare (inline dispatch))
(dispatch (read-command)))
Please note that when the inlined functions change, all the callers must be re-compiled.
Optimizing Generic Functions
Using Static Dispatch
Generic functions provide much convenience and flexibility during development. However, the flexibility comes with cost: generic methods are much slower than trivial functions. The performance cost becomes a burden especially when the flexibility is not needed.
The package inlined-generic-function
provides
functions to convert generic functions to static dispatch, moving the dispatch
cost to compile-time. You just need to define generic function as a
inlined-generic-function
.
Caution
This package is declared as experimental thus is not recommended to be used in a serious software production. Use it at your own risk!
* (defgeneric plus (a b)
(:generic-function-class inlined-generic-function))
#<INLINED-GENERIC-FUNCTION HELLO::PLUS (2)>
* (defmethod plus ((a fixnum) (b fixnum))
(+ a b))
#<INLINED-METHOD HELLO::PLUS (FIXNUM FIXNUM) {10056D7513}>
* (defun func-using-plus (a b)
(plus a b))
FUNC-USING-PLUS
* (defun func-using-plus-inline (a b)
(declare (inline plus))
(plus a b))
FUNC-USING-PLUS-INLINE
* (time
(dotimes (i 100000)
(func-using-plus 100 200)))
Evaluation took:
0.018 seconds of real time
0.017819 seconds of total run time (0.017800 user, 0.000019 system)
100.00% CPU
3 lambdas converted
71,132,440 processor cycles
6,586,240 bytes consed
* (time
(dotimes (i 100000)
(func-using-plus-inline 100 200)))
Evaluation took:
0.001 seconds of real time
0.000326 seconds of total run time (0.000326 user, 0.000000 system)
0.00% CPU
1,301,040 processor cycles
0 bytes consed
The inlining is not enabled by default because once inlined, changes made to methods will not be reflected.
It can be enabled globally by adding :inline-generic-function
flag in
*features*
.
* (push :inline-generic-function *features*)
(:INLINE-GENERIC-FUNCTION :SLYNK :CLOSER-MOP :CL-FAD :BORDEAUX-THREADS
:THREAD-SUPPORT :CL-PPCRE ALEXANDRIA.0.DEV::SEQUENCE-EMPTYP :QUICKLISP
:QUICKLISP-SUPPORT-HTTPS :SB-BSD-SOCKETS-ADDRINFO :ASDF3.3 :ASDF3.2 :ASDF3.1
:ASDF3 :ASDF2 :ASDF :OS-UNIX :NON-BASE-CHARS-EXIST-P :ASDF-UNICODE :ROS.INIT
:X86-64 :64-BIT :64-BIT-REGISTERS :ALIEN-CALLBACKS :ANSI-CL :AVX2
:C-STACK-IS-CONTROL-STACK :CALL-SYMBOL :COMMON-LISP :COMPACT-INSTANCE-HEADER
:COMPARE-AND-SWAP-VOPS :CYCLE-COUNTER :ELF :FP-AND-PC-STANDARD-SAVE ..)
When this feature is present, all inlinable generic functions are inlined
unless it is declared notinline
.
Block compilation
SBCL got block compilation on version 2.0.2, which was in CMUCL since 1991 but a little forgotten since.
You can enable block compilation with a one-liner:
(setq *block-compile-default* t)
But what is it?
Block compilation addresses a known aspect of dynamic languages: function calls to global, top-level functions are expensive.
Much more expensive than in a statically compiled language. They’re slow because of the late-bound nature of top-level defined functions, allowing arbitrary redefinition while the program is running and forcing runtime checks on whether the function is being called with the right number or types of arguments. This type of call is known as a “full call” in Python (the compiler used in CMUCL and SBCL, not to be confused with the programming language), and their calling convention permits the most runtime flexibility.
But local calls, the ones inside a top-level functions (for example lambda
s, labels
and flet
s) are fast.
These calls are more ‘static’ in the sense that they are treated more like function calls in static languages, being compiled “together” and at the same time as the local functions they reference, allowing them to be optimized at compile-time. For example, argument checking can be done at compile time because the number of arguments of the callee is known at compile time, unlike in the full call case where the function, and hence the number of arguments it takes, can change dynamically at runtime at any point. Additionally, the local call calling convention can allow for passing unboxed values like floats around, as they are put into unboxed registers never used in the full call convention, which must use boxed argument and return value registers.
So enabling block compilation kind of turns your code into a giant labels
form.
One evident possible drawback, dependending on your application, is that you can’t redefine functions at runtime anymore.
We can also enable block compilation with the :block-compile
keyword to compile-file
:
(defun foo (x y)
(print (bar x y))
(bar x y))
(defun bar (x y)
(+ x y))
(defun fact (n)
(if (zerop n)
1
(* n (fact (1- n)))))
> (compile-file "foo.lisp" :block-compile t :entry-points nil)
> (load "foo.fasl")
> (sb-disassem:disassemble-code-component #'foo)
If you inspect the assembly,
you [will] see that FOO and BAR are now compiled into the same component (with local calls), and both have valid external entry points. This improves locality of code quite a bit and still allows calling both FOO and BAR externally from the file (e.g. in the REPL). […]
But there is one more goody block compilation adds…
Notice we specified
:entry-points
nil above. That’s telling the compiler to still create external entry points to every function in the file, since we’d like to be able to call them normally from outside the code component (i.e. the compiled compilation unit, here the entire file).
For more explanations, I refer you to the mentioned blog post, the current de-facto documentation for SBCL, in addition to CMUCL’s documentation (note that the form-by-form level granularity in CMUCL ( (declaim (start-block ...)) ... (declaim (end-block ..))
) is missing in SBCL, at the time of writing).
Finally, be aware that “block compiling and inlining currently does not interact very well [in SBCL]”.
See also
- CMUCL’s Advanced Compiler Use and Efficiency Hints, which is were SBCL comes from.
Page source: performance.md