Optimizing C and C++ code
Embedded software often runs on processors with limited computation power, thus optimizing the code becomes a necessity. In this article we will explore the following optimization techniques for C and C++ code developed for Real-time and Embedded Systems.
- Premature optimization is the root of all evil
- Adjust structure sizes to power of two
- Place case labels in narrow range
- Place frequent case labels first
- Break big switch statements into nested switches
- Minimize local variables
- Declare local variables in the inner most scope
- Reduce the number of parameters
- Use references for parameter passing and return value for types bigger than 4 bytes
- Don't define a return value if not used
- Consider locality of reference for code and data
- Locality of reference in multi-dimensional arrays
- Prefer int over char and short
- Define lightweight constructors
- Prefer initialization over assignment
- Use constructor initialization lists
- Do not declare "just in case" virtual functions
- In-line 1 to 3 line functions
- Avoid cascaded function calls
- Prefer preincrement over postincrement
- Define move constructors in C++11
- Use hardware accelerators and SIMD hardware
- Use profile guided optimization
Many techniques discussed here have roots in the material we covered in the articles dealing with C to Assembly translation. A good understanding of the following articles will help:
Premature optimization is the root of all evil
Donald Knuth wrote, "Programmers waste enormous amounts of time thinking about, or worrying about, the speed of noncritical parts of their programs, and these attempts at efficiency actually have a strong negative impact when debugging and maintenance are considered. We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%."
In general, correctness and readability considerations trump code performance issues for most of your code. For a small fraction of your code, you may have to sacrifice readability to improve performance. Such optimizations should be carried out when the project is nearing completion. You have a better idea of the performance critical code when you have a working system.
That said, it is important to recognize that many optimization techniques are just sound programming practices as they improve performance as well as code readability. Such techniques should be applied right from the project start.
Adjust structure sizes to power of two
When arrays of structures are involved, the compiler performs a multiply by the structure size to perform the array indexing. If the structure size is a power of 2, an expensive multiply operation will be replaced by an inexpensive shift operation. Thus keeping structure sizes aligned to a power of 2 will improve performance in array indexing.
Place case labels in narrow range
If the case labels are in a narrow range, the compiler does not generate an if-else-if cascade for the switch statement. Instead, it generates a jump table of case labels along with manipulating the value of the switch to index the table. This code generated is faster than if-else-if cascade code that is generated in cases where the case labels are far apart. Also, performance of a jump table based switch statement is independent of the number of case entries in switch statement.
Place frequent case labels first
If the case labels are placed far apart, the compiler will generate if-else-if cascaded code with comparing for each case label and jumping to the action for leg on hitting a label match. By placing the frequent case labels first, you can reduce the number of comparisons that will be performed for frequently occurring scenarios. Typically this means that cases corresponding to the success of an operation should be placed before cases of failure handling.
Break big switch statements into nested switches
The previous technique does not work for some compilers as they do not generate the cascade of if-else-if in the order specified in the switch statement. In such cases nested switch statements can be used to get the same effect.
To reduce the number of comparisons being performed, judiciously break big switch statements into nested switches. Put frequently occurring case labels into one switch and keep the rest of case labels into another switch which is the default leg of the first switch.
Splitting a switch statement
Minimize local variables
If the number of local variables in a function is less, the compiler will be able to fit them into registers. Hence, it will be avoiding frame pointer operations on local variables that are kept on stack. This can result in considerable improvement due to two reasons:
- All local variables are in registers so this improves performance over accessing them from memory.
- If no local variables need to be saved on the stack, the compiler will not incur the overhead of setting up and restoring the frame pointer.
Declare local variables in the inner most scope
Do not declare all the local variables in the outermost function scope. You will get better performance if local variables are declared in the inner most scope. Consider the example below; here object "a" is needed only in the error case, so it should be invoked only inside the error check. If this parameter was declared in the outermost scope, all function calls would have incurred the overhead of object a's creation (i.e. invoking the default constructor for "a").
Local variable scope
Reduce the number of parameters
Function calls with large number of parameters may be expensive due to large number of parameter pushes on stack on each call. For the same reason, avoid passing complete structures as parameters. Use pointers and references in such cases.
Use references for parameter passing and return value for types bigger than 4 bytes
When parameters are passed by value, the complete parameter memory is copied on the stack. This is fine for regular types like integer, pointer etc. These types are generally restricted to four bytes. When passing bigger types, the cost of copying the object on the stack can be prohibitive. In case of classes there will be an additional overhead of invoking the constructor for the temporary copy that is created on the stack. When the function exits the destructor will also be invoked.
Thus it is efficient to pass references as parameters. This way you save on the overhead of a temporary object creation, copying and destruction. This optimization can be performed easily without a major impact to the code by replacing pass by value parameters by const references. (It is important to pass const references so that a bug in the called function does not change the actual value of the parameter.
Passing bigger objects as return values also has the same performance issues. A temporary return object is created in this case too.
Don't define a return value if not used
The called function does not "know" if the return value is being used. So, it will always pass the return value. This return value passing may be avoided by not defining a return value which is not being used.
Consider locality of reference for code and data
The processor keeps data or code that is referenced in cache so that on its next reference if gets it from cache. These cache references are faster. Hence it is recommended that code and data that are being used together should actually be placed together physically. This is actually enforced into the language in C++. In C++, object's data is stored in contiguous memory, thus improving locality of reference for data. This also applies to code, as most methods that deal with the object will be stored in contiguous memory.
Locality of reference can be improved for C code as well. The declaration order of related code and functions can be arranged so that closely coupled code and data are declared together.
Locality of reference in multi-dimensional arrays
When working with two dimensional arrays, organize the algorithm so that your inner most loop iterates over the second index.
If you have accessed
a[x][y+1] will be faster than accessing
a[x+1][y]. The entry
a[x][y+1] will probably
be read from the processor cache.
a[x+1][y] would most likely need to fetched from the external memory.
Keep in mind that sequential reading of memory is always faster than random
Prefer int over char and short
With C and C++ prefer use of int over char and short. The main reason behind this is that C and C++ perform arithmetic operations and parameter passing at integer level. If you have a value that can fit in a byte, you should still consider using an int to hold the number. If you use a char, the compiler will first convert the values into integer, perform the operations and then convert back the result to char.
Lets consider the following code which presents two functions that perform the same operation with char and int.
Comparing char and int operations
A call to sum_char involves the following operations:
- Convert the second parameter into an int by sign extension (C and C++ push parameters in reverse)
- Push the sign extended parameter on the stack as b.
- Convert the first parameter into an int by sign extension.
- Push the sign extended parameter on to the stack as a.
- The called function adds a and b
- The result is cast to a char.
- The result is stored in char c.
- c is again sign extended
- Sign extended c is copied into the return value register and function returns to caller.
- The caller now converts again from int to char.
- The result is stored.
A call to sum_int involves the following operations:
- Push int b on stack
- Push int a on stack
- Called function adds a and b
- Result is stored in int c
- c is copied into the return value register and function returns to caller.
- The called function stores the returned value.
Thus we can conclude that int should be used for all integer variables unless storage requirements force us to use a char or short. When char and short have to be used, consider the impact of byte alignment and ordering to see if you would really save space. (Many processors align structure elements at 16 byte boundaries)
Define lightweight constructors
Keep the constructor light weight. The constructor will be invoked for every object creation. Keep in mind that many times the compiler might be creating temporary object over and above the explicit object creations in your program. Thus optimizing the constructor might give you a big boost in performance. If you have an array of objects, the default constructor for the object should be optimized first as the constructor gets invoked for every object in the array.
Prefer initialization over assignment
Consider the following example of a complex number::
Initialization and assignment
In the function foo, the complex number c is being initialized first by the instantiation and then by the assignment. In foo_optimized, c is being initialized directly to the final value, thus saving a call to the default constructor of Complex.
Use constructor initialization lists
Use constructor initialization lists to initialize the embedded variables to the final initialization values. Assignments within the constructor body will result in lower performance as the default constructor for the embedded objects would have been invoked anyway. Using constructor initialization lists will directly result in invoking the right constructor, thus saving the overhead of default constructor invocation.
In the example given below, the optimized version of the Employee constructor saves the default constructor calls for m_name and m_designation strings.
Constructor initialization lists
Do not declare "just in case" virtual functions
Virtual function calls are more expensive than regular function calls so do not make functions virtual "just in case" somebody needs to override the default behavior. If the need arises, the developer can just as well edit the additional base class header file to change the declaration to virtual.
In-line 1 to 3 line functions
Converting small functions (1 to 3 lines) into in-line will give you big improvements in throughput. In-lining will remove the overhead of a function call and associated parameter passing. But using this technique for bigger functions can have negative impact on performance due to the associated code bloat. Also keep in mind that making a method inline should not increase the dependencies by requiring a explicit header file inclusion when you could have managed by just using a forward reference in the non-inline version. (See the article on header file include patterns for more details).
Avoid cascaded function calls
A disturbingly common practice in coding is to cascade function calls that return pointers or references. Refer to the following code:
Cascaded function calls
<p>The code above has poor readability and the compiler cannot optimize the common sub-expression as the compiler cannot assume that <code>GetX()</code>, <code>GetY()</code> and <code>GetZ()</code> functions will return the same reference every time. </p> <p>The following code optimizes the code by storing the reference to <code>Z</code> and using it within the loop.</p> <h4>Optimized by storing the cascade in a variable</h4> <script src="https://gist.github.com/eventhelix/675e7995255e6092e608e44859a3ccea.js"></script>
Prefer preincrement over postincrement
For classes that overload the pre-increment and postincrement operators, using the pre-increment operators is more efficient.
An overloaded post-increment operator's code would be:
- Create a copy of the object,
- increment the variable and
- Return the variable created in step 1.
- Increment the variable
- Return the variable
A preincrement operator avoids creating a temporary. An operator's code would be:
Define move constructors in C++11
A C++ 03 compiler always calls the copy constructor when it needs to make a copy of an object. This can be wasteful in scenarios where the object being copied
is a temporary. Consider the case of a
remote_integer class that stores a reference to an object saved on the heap (refer to the following
code block). One of the constructors of this
class takes an integer and allocates memory to store this integer. Now consider what happens when function
bar(); the compiler creates a temporary copy of foo_ri by calling the copy constructor. This temporary object gets copied once more in the bar function
when bar_ri stores the return value.
A function returning
The redundant copies happen because the compiler has no way of telling the user that a copy is being made from a temporary object that does not need to be preserved.
A copy constructor
C++ 11 adds the concept of a move constructor. The compiler will call a move constructor (if one is defined) when it is copying
from a temporary object. This gives the user an option to forgo the copying and literally highjack the contents of the temporary
object. A move constructor is shown below. Note the absence of a const and a double ampersand in the constructor signature. In our
example, the move constructor for the
remote_integer does not allocate any new memory; it just copies the pointer
from the passed
remote_integer special reference (called Rvalue reference).
A move constructor
bar example. In a function return, the compiler will call the move constructor
when returning a temporary in
foo() and saving the return value in
bar. Since the move constructor was
used, the memory allocated in the
remote_integer foo_ri=5; line of
foo() would be still in use. Thus we
saved two memory allocations in just returning a value. The benefits of this will accrue in passing parameters and internal STL copies.
C++ 11 implementation of the standard library has been optimized to take full advantage of move constructors. Implementing move constructors in classes that are used in STL containers will improve performance. For details refer to the video on Rvalue references.
Use hardware accelerators and SIMD hardware
Perform some spring cleaning on your compiler options. Check if new hardware and performance and parallelization options are enabled. For example, Intel processors support AVX, 256-bit SIMD unit. Intel/AMD processors also support SSE 4.2.
If your platform is equipped with a GPU, performance of algorithms that blend themselves to parallelization can be improved with CUDA (NVIDIA) or C++ AMP (Microsoft). C++ AMP can also be used to improve performance when no GPU is available.
Use profile guided optimization
Compilers from Microsoft and Intel support profile guided optimization (POGO). With POGO, compilers use the information from actual program execution to find the areas that need to be optimized for speed. The compiler can also use the loop execution and branching history from previous runs to generate the optimal code for loops and branching.