Optimization
- Cost-benefit analysis: Does it really matter if
a part of your code is slow? This depends on usage
patterns and user expectations.
- It helps to be aware of language-specific performance
issues (especially C++). These issues need to be understood
and dealt with in early design stages, because they can
affect the design in fundamental ways (e.g. virtual
functions being replaced by static polymorphism).
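For a concrete sketch of the static-polymorphism idea (the
names below are invented for illustration), the Curiously
Recurring Template Pattern resolves the call at compile time
instead of through a vtable:

#include <cstdio>

// Static polymorphism via CRTP: the base class knows the derived
// type at compile time, so the call in total_area() is dispatched
// statically and can be inlined; no vtable lookup is involved.
template <class Derived>
struct Shape {
    double area() const {
        return static_cast<const Derived*>(this)->area_impl();
    }
};

struct Circle : Shape<Circle> {
    double r;
    explicit Circle(double radius) : r(radius) {}
    double area_impl() const { return 3.141592653589793 * r * r; }
};

template <class T>
double total_area(const T* shapes, int n) {
    double sum = 0.0;
    for (int i = 0; i < n; ++i)
        sum += shapes[i].area();   // static dispatch
    return sum;
}

int main() {
    Circle circles[3] = { Circle(1.0), Circle(2.0), Circle(3.0) };
    std::printf("%f\n", total_area(circles, 3));
    return 0;
}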
- Make use of the large base of existing libraries.
Build your application on top of libraries which offer
high performance. Writing an efficient library is a
lot of work; you don't want to do it yourself.
- Use a profiler before, during, and after you optimize
to gauge improvements and ensure you aren't wasting your
time trying to tune a routine that accounts for only 2% of
the runtime.
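When a full profiler is not at hand, even a rough wall-clock
timer can confirm whether a change moved the needle. A sketch
(seconds_for is a made-up helper, not a replacement for a real
profiler):

#include <chrono>
#include <cstdio>

// Time the same routine before and after a change to gauge the
// improvement; a real profiler still gives the whole-program picture.
template <class F>
double seconds_for(F f, int iterations) {
    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; ++i)
        f();
    std::chrono::duration<double> elapsed =
        std::chrono::steady_clock::now() - start;
    return elapsed.count();
}

int main() {
    volatile double sink = 0.0;
    double t = seconds_for([&] {
        for (int i = 0; i < 1000; ++i) sink = sink + i * 0.5;
    }, 10000);
    std::printf("%.4f s for 10000 iterations\n", t);
    return 0;
}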
- Look for algorithm and data structure (trees, hash
tables, spatial data structures) improvements first.
These will offer the largest performance savings and will
outlive platform-specific tuning. Maybe you have O(N)
algorithms which could be replaced by O(log N) ones.
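For example, a hypothetical membership test against a large,
mostly static table:

#include <algorithm>
#include <vector>

// The linear scan is O(N) per query; keeping the table sorted and
// using binary search makes each query O(log N).
bool contains_linear(const std::vector<int>& values, int x) {
    return std::find(values.begin(), values.end(), x) != values.end();
}

bool contains_sorted(const std::vector<int>& sorted_values, int x) {
    return std::binary_search(sorted_values.begin(), sorted_values.end(), x);
}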
- Be aware that what is good for one architecture may be
bad for another. If you have to support many platforms,
you should code in a way that guarantees good (not
necessarily stellar) performance on all platforms. Note that
some libraries are able to tune themselves for particular
architectures; if you build on top of them, problem solved.
- Understand which optimizations the compiler is
competent to perform, and let it do them. You should
be optimizing at a level above that of the compiler,
or you are wasting your time.
- Be aware of platform tools: profilers, hardware counters,
assembly annotators, etc. Some compilers (e.g. Intel, SGI)
have full-blown performance tuning environments.
- Experiment with optimization flags. Sometimes higher
levels of optimization produce slower code. Measure compile
time/execution time tradeoffs: who cares if a 3% performance
improvement requires two more hours of compilation? I will
sometimes hack up a script to run through the relevant
combinations of optimization flags and measure compile time
vs run time.
- Try using the "assume no aliasing" flag on your compiler
to see if there is a big performance improvement. This is a
dangerous flag to use, but it provides a quick way to
determine if aliasing is an issue. If it is, you might be
able to use the restrict keyword to fix the problem.
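A minimal sketch, using the widely supported __restrict
spelling (the C99 keyword is restrict; support in C++
compilers varies):

// Telling the compiler that dst and src never overlap lets it keep
// values in registers, reorder loads and stores, and vectorize the
// loop more aggressively.
void scale(float* __restrict dst, const float* __restrict src,
           float k, int n) {
    for (int i = 0; i < n; ++i)
        dst[i] = k * src[i];
}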
- You can commit efficiency sins of the grossest sort as
long as your inner loops are tight. You don't have to avoid
slow language features such as virtual functions, new and
delete, or even scripting (e.g. driving your C++ program
from Python/Perl/Eiffel etc.). Just keep these features away
from your inner loops. It's all about amortization.
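For instance, hoisting an allocation out of the inner loop
amortizes its cost (a sketch with made-up names):

#include <cstddef>
#include <vector>

// The slow version pays for a heap allocation on every row; the fast
// version pays once and reuses the buffer, keeping new/delete out of
// the inner loop.
void process_rows_slow(const std::vector<std::vector<double> >& rows) {
    for (std::size_t r = 0; r < rows.size(); ++r) {
        std::vector<double> scratch(rows[r].size());  // allocation per row
        for (std::size_t i = 0; i < rows[r].size(); ++i)
            scratch[i] = rows[r][i] * 2.0;
    }
}

void process_rows_fast(const std::vector<std::vector<double> >& rows) {
    std::vector<double> scratch;                      // allocated once, reused
    for (std::size_t r = 0; r < rows.size(); ++r) {
        scratch.resize(rows[r].size());
        for (std::size_t i = 0; i < rows[r].size(); ++i)
            scratch[i] = rows[r][i] * 2.0;
    }
}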
There are many optimizations which are usually useless and
only serve to obscure the code. These optimizations make no
difference because of modern architectures and compilers:
- Replacing floating-point math with integer arithmetic.
On some platforms, integer math is slower than floating-point
math. Floating-point has been optimized to death by the
hardware gurus, and it's amazingly efficient.
- Deciding what data should be in registers. The register
keyword is obsolete. Compilers do a much better job of
register assignment than you will.
- Loop unrolling. The compiler almost always does a better
job. When some compilers encounter manually unrolled loops,
they will de-unroll them, then unroll them properly.
- Reversing loops: for (i = N-1; i >= 0; i--). On some
platforms this can increase performance by 3%; on others
it can decrease performance by 10%.
- Replacing multiplications with additions. Modern CPUs
can issue one multiplication each clock cycle. Many of the
"strength reduction" tricks from the 70s and 80s will
now result in slower code. Replacing floating-point
division with multiplication is still relevant; some
compilers will do this automatically if a certain option
is specified.
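A sketch of the one trick that still pays off, a
loop-invariant floating-point division replaced by
multiplication with its reciprocal (note the result can
differ in the last bit, which is why some compilers only do
it under a "fast math"-style option):

void normalize(float* v, int n, float norm) {
    float inv = 1.0f / norm;      // one division, hoisted out of the loop
    for (int i = 0; i < n; ++i)
        v[i] *= inv;              // n multiplications instead of n divisions
}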
Tips for getting efficient code generation from the compiler:
- Data type size: the integer/word type is the most
fundamental and normally the most efficient type for a given
processor. In practice, using a smaller type for a local
variable is computationally inefficient: the value is first
promoted to an integer, the operation is performed, and the
result is then truncated back into the original type, unless
the processor uses an 8- or 16-bit word size. This is a
speed-memory trade-off.
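A sketch of the promote-operate-truncate cost (the narrow
accumulator also wraps modulo 256, so the two functions are
not equivalent; the point is the extra truncation work):

#include <cstdint>

// On a 32/64-bit processor, the uint8_t accumulator must be cut back
// to 8 bits after every addition, while the int accumulator maps
// directly onto a full-width register.
int sum_narrow(const std::uint8_t* a, int n) {
    std::uint8_t sum = 0;          // promoted, added, then truncated each time
    for (int i = 0; i < n; ++i)
        sum += a[i];
    return sum;
}

int sum_wide(const std::uint8_t* a, int n) {
    int sum = 0;                   // stays at the natural word size
    for (int i = 0; i < n; ++i)
        sum += a[i];
    return sum;
}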
- Variable signedness: Unsigned variables can result in
more efficient generated code for certain operations
(division, modulus) than signed variables. Most
architectures' ABIs (Application Binary Interfaces) define
the plain char type as whichever signedness is most efficient
on that target. So, unless a character explicitly needs to be
signed or unsigned, use the plain char type.
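For example, division by a power of two:

// The unsigned version can compile to a single shift; the signed
// version needs extra instructions to round negative values toward
// zero correctly.
unsigned udiv8(unsigned x) { return x / 8u; }
int      sdiv8(int x)      { return x / 8; }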
- Access types:
- Use the static keyword (or C++ anonymous namespaces) for
global data whenever possible. In some cases, this allows
the compiler to deduce things about the access patterns to a
variable. On the other hand, avoid static function-local
data. The data's value needs to be preserved between calls
to the function, making it very expensive to access and
requiring permanent RAM storage.
- The const keyword is helpful in a couple of ways. First,
constant variables can usually be allocated to a read-only
data section, which can save on the amount of RAM required
if the read-only data is allocated to flash or ROM. Second,
when applied to a pointer or reference parameter, it allows
the compiler to recognize certain cases where a variable is
guaranteed not to be modified through the pointer by a
function call.
- The restrict keyword tells the compiler that the specified
pointer is the only way to access a given piece of data
(there is no alias of this pointer). This can allow the
compiler to optimize more aggressively.
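A small sketch tying these access-type hints together (the
names are hypothetical, embedded-flavored):

namespace {                          // anonymous namespace: internal linkage,
    int call_count = 0;              // so the compiler sees every access
}

const int weights[4] = {1, 2, 4, 8}; // const table: can live in ROM/flash

// const pointer parameter: the function promises not to modify *samples.
int weighted_sum(const int* samples, int n) {
    ++call_count;
    int total = 0;
    for (int i = 0; i < n; ++i)
        total += samples[i] * weights[i & 3];
    return total;
}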
- Scope: Declare a variable in the inner-most scope
possible, particularly when its address needs to be taken.
In such cases, the compiler must keep the variable around on
the stack until it goes out of scope. This can inhibit certain
optimizations that depend on the variable being dead.
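A sketch (read_value is a made-up helper):

#include <cstdio>

// Hypothetical helper that writes a result through a pointer.
static void read_value(int* out) { *out = 42; }

// 'value' is declared in the innermost scope in which its address is
// needed, so the compiler only has to keep it addressable inside the
// loop body rather than for the whole function.
void narrow_scope_example(int n) {
    for (int i = 0; i < n; ++i) {
        int value = 0;
        read_value(&value);
        std::printf("%d\n", value + i);
    }
}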
- Pointers: It is usually more efficient to have a function
return a scalar value (i = Function();) than to pass the
scalar by reference or by address (Function(&i);), as the
reference or pointer forces the compiler to allocate the
variable on the stack, all but assuring less efficient code
generation.
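A sketch with invented names:

// Returning the scalar by value lets the result live entirely in
// registers; the by-pointer form forces the caller's variable to
// have a stack address.
int parse_digit_by_value(char c) {
    return c - '0';
}

void parse_digit_by_pointer(char c, int* out) {
    *out = c - '0';
}

// Usage:
//   int a = parse_digit_by_value('7');       // preferred
//   int b; parse_digit_by_pointer('7', &b);  // b must be addressable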