26-05-2012, 10:45 AM
An optimization guide for Windows, Linux and Mac platforms
Optimizing software in C++.pdf (Size: 879.17 KB / Downloads: 22)
Introduction
This manual is for advanced programmers and software developers who want to make their
software faster. It is assumed that the reader has a good knowledge of the C++
programming language and a basic understanding of how compilers work. The C++
language is chosen as the basis for this manual for reasons explained on page 8 below.
This manual is based mainly on my study of how compilers and microprocessors work. The
recommendations are based on the x86 family of microprocessors from Intel, AMD and VIA
including the 64-bit versions. The x86 processors are used in the most common platforms
with Windows, Linux, BSD and Mac OS X operating systems, though these operating
systems can also be used with other microprocessors. Much of the advice may also apply to
other platforms and other compiled programming languages.
The costs of optimizing
University courses in programming nowadays stress the importance of structured and
object-oriented programming, modularity, reusability and systematization of the software
development process. These requirements often conflict with the requirements of
optimizing the software for speed or size.
Today, it is not uncommon for software teachers to recommend that no function or method
should be longer than a few lines. A few decades ago, the recommendation was the
opposite: Don’t put something in a separate subroutine if it is only called once. The reasons
for this shift in software writing style are that software projects have become bigger and
more complex, that there is more focus on the costs of software development, and that
computers have become more powerful.
The high priority of structured software development and the low priority of program
efficiency are reflected, first and foremost, in the choice of programming language and
interface frameworks. This is often a disadvantage for the end user who has to invest in
ever more powerful computers to keep up with the ever bigger software packages and who
is still frustrated by unacceptably long response times, even for simple tasks.
Choice of microprocessor
The benchmark performance of competing brands of microprocessors is very similar
thanks to heavy competition. Processors with multiple cores are advantageous for
applications that can be divided into multiple threads that run in parallel. Small lightweight
processors with low power consumption are actually quite powerful and may be sufficient for
less intensive applications.
Choice of programming language
Before starting a new software project, it is important to decide which programming
language is best suited for the project at hand. Low-level languages are good for optimizing
execution speed or program size, while high-level languages are good for making clear and
well-structured code and for fast and easy development of user interfaces and interfaces to
network resources, databases, etc.
The efficiency of the final application depends on the way the programming language is
implemented. The highest efficiency is obtained when the code is compiled and distributed
as binary executable code. Most implementations of C++, Pascal and Fortran are based on
compilers.
Several other programming languages are implemented with interpretation. The program
code is distributed as it is and interpreted line by line when it is run. Examples include
JavaScript, PHP, ASP and UNIX shell script. Interpreted code is very inefficient because the
body of a loop is interpreted again and again for every iteration of the loop.
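The cost described above can be illustrated with a toy sketch. It is not taken from the manual: the one-statement language ("add <integer>") and the function names interpret_statement, run_interpreted and run_compiled are hypothetical, chosen only to show the difference between re-parsing the text of a loop body on every iteration and letting a compiler translate it once. Many real interpreters reduce this cost with techniques that this sketch does not model.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Toy "interpreter" for a hypothetical one-statement language: "add <integer>".
// It parses the statement text and applies it to an accumulator.
long interpret_statement(const std::string& statement, long accumulator) {
    std::istringstream in(statement);
    std::string op;
    long operand = 0;
    in >> op >> operand;                 // the text is parsed on every call
    assert(op == "add" && !in.fail());   // the toy language has one operation
    return accumulator + operand;
}

// Interpreted style: the loop body is source text that is parsed again
// in every iteration of the loop.
long run_interpreted(const std::string& body, int iterations) {
    long acc = 0;
    for (int i = 0; i < iterations; ++i) {
        acc = interpret_statement(body, acc);
    }
    return acc;
}

// Compiled style: the same loop written directly in C++. The compiler
// translates the body once, before the program runs.
long run_compiled(long operand, int iterations) {
    long acc = 0;
    for (int i = 0; i < iterations; ++i) {
        acc += operand;
    }
    return acc;
}
```

Calling run_interpreted("add 3", 1000) and run_compiled(3, 1000) gives the same result, but the first version repeats the string parsing 1000 times while the second does only the addition.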
Conclusion
Vectorized code often contains a lot of extra instructions for converting the data to the right
format and getting them into the right positions. The amount of extra data conversion and
shuffling that is needed determines whether it is profitable to use vectorized code or not.
The code in example 12.7 is slower than non-vectorized code on older processors, but
faster on processors with 128-bit execution units. The code in examples 12.6b and 12.6c is
faster than the non-vectorized code on all processors despite the extra data conversion,
packing and unpacking. This is because the bottleneck here is not data conversion and
packing, but division. Division is very time-consuming, and there is a lot to save by doing
division in single precision vectors. The code in examples 12.8b and 12.8c benefits a lot from
vectorization
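
The examples numbered 12.6 to 12.8 are not reproduced in this excerpt. As a rough illustration of the trade-off discussed above, the following sketch divides two integer arrays in single precision, four elements at a time, and shows the extra conversion step that vectorization requires. It is a minimal sketch, not the manual's code: the function names divide_scalar and divide_vectorized and the macro SKETCH_USE_SSE2 are hypothetical, and the vector path assumes an x86 compiler with SSE2 support. On other targets the function falls back to the scalar loop.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>   // SSE2 intrinsics
#define SKETCH_USE_SSE2 1
#else
#define SKETCH_USE_SSE2 0
#endif

// Scalar reference: convert each integer to float, then divide.
void divide_scalar(const std::int32_t* a, const std::int32_t* b,
                   float* out, std::size_t n) {
    assert(a && b && out);
    for (std::size_t i = 0; i < n; ++i) {
        out[i] = static_cast<float>(a[i]) / static_cast<float>(b[i]);
    }
}

// Vectorized version: four divisions per instruction in 128-bit vectors.
// The integer-to-float conversions are the "extra" work that vectorized
// code has to do before the division itself.
void divide_vectorized(const std::int32_t* a, const std::int32_t* b,
                       float* out, std::size_t n) {
    assert(a && b && out);
    std::size_t i = 0;
#if SKETCH_USE_SSE2
    for (; i + 4 <= n; i += 4) {
        // Unaligned loads of four 32-bit integers each
        __m128i ai = _mm_loadu_si128(reinterpret_cast<const __m128i*>(a + i));
        __m128i bi = _mm_loadu_si128(reinterpret_cast<const __m128i*>(b + i));
        // Data conversion: integer vectors to single precision vectors
        __m128 af = _mm_cvtepi32_ps(ai);
        __m128 bf = _mm_cvtepi32_ps(bi);
        // Four single precision divisions at once, then an unaligned store
        _mm_storeu_ps(out + i, _mm_div_ps(af, bf));
    }
#endif
    // Remaining elements (or all elements when SSE2 is unavailable)
    for (; i < n; ++i) {
        out[i] = static_cast<float>(a[i]) / static_cast<float>(b[i]);
    }
}
```

Whether the vector version is actually faster depends on the processor and on how much conversion and shuffling surrounds the division, so it should be measured rather than assumed.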