
By: Dov Bulka; David Mayhew
Publisher: Addison-Wesley Professional
Pub. Date: November 03, 1999
Print ISBN-10: 0-201-37950-3
Print ISBN-13: 978-0-201-37950-1
Pages in Print Edition: 336
Object definitions trigger silent execution in the form of object constructors and destructors. We call it “silent execution” as opposed to “silent overhead” because object construction and destruction are not usually overhead. If the computations performed by the constructor and destructor are always necessary, they constitute efficient code (inlining would alleviate the cost of call and return overhead). As we have seen, however, constructors and destructors do not always have such “pure” characteristics, and they can create significant overhead. In some situations, computations performed by the constructor (and/or destructor) are left unused. We should also point out that this is more of a design issue than a C++ language issue; it simply arises less often in C because C lacks constructor and destructor support.
Passing an object by reference does not by itself guarantee good performance. Avoiding the object copy helps, but it would be better still not to have to construct and destroy the object in the first place.
Don't waste effort on computations whose results are not likely to be used. When tracing is off, the creation of the string member is worthless and costly.
Don't aim for the world record in design flexibility. All you need is a design that's sufficiently flexible for the problem domain. A char pointer can sometimes do the simple jobs just as well, and more efficiently, than a string.
Inline. Eliminate the call and return overhead that comes with small, frequently invoked functions. Inlining the Trace constructor and destructor makes the Trace overhead easier to digest.
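To make this concrete, here is a minimal sketch of a Trace class in the spirit of the one discussed above (the details are ours, not the book's exact code). The std::string member is constructed and destroyed on every call, even when tracing is globally off:

```cpp
#include <iostream>
#include <string>

class Trace {
public:
    static bool traceIsActive;  // global trace switch

    // The string member is initialized unconditionally; when tracing is
    // off, its construction (and later destruction) is pure overhead.
    Trace(const char* name) : theFunctionName(name) {
        if (traceIsActive) {
            std::cout << "Enter function " << theFunctionName << '\n';
        }
    }
    ~Trace() {
        if (traceIsActive) {
            std::cout << "Exit function " << theFunctionName << '\n';
        }
    }

private:
    std::string theFunctionName;
};

bool Trace::traceIsActive = false;

void addOne(int& x) {
    Trace t("addOne");  // constructed and destroyed even when tracing is off
    ++x;
}
```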
Constructors and destructors may be as efficient as hand-crafted C code. In practice, however, they often contain overhead in the form of superfluous computations.
The construction (destruction) of an object triggers recursive construction (destruction) of parent and member objects. Watch out for the combinatorial explosion of objects in complex hierarchies. They make construction and destruction more expensive.
Make sure that your code actually uses all the objects that it creates and the computations that they perform. We would encourage people to peer inside the classes that they use. This advice is not going to be popular with OOP advocates: OOP, after all, preaches the use of classes as encapsulated black-box entities and discourages you from looking inside. How do we balance these competing pieces of advice? There is no simple answer because it is context sensitive. Although the black-box approach works perfectly well for 80% of your code, it may wreak havoc on the 20% that is performance critical. It is also application dependent: some applications put a premium on maintainability and flexibility, while others put performance considerations at the top of the list. As a programmer, you are going to have to decide what exactly you are trying to maximize.
The object life cycle is not free of cost. At the very least, construction and destruction of an object may consume CPU cycles. Don't create an object unless you are going to use it. Typically, you want to defer object construction to the scope in which it is manipulated.
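A minimal sketch of deferring construction to the scope that needs the object (the example is ours):

```cpp
#include <string>

void process(int value) {
    // Wasteful alternative: constructing msg at the top of the function
    // would cost every caller, even those that never take the error path.
    if (value < 0) {
        std::string msg("negative input");  // constructed only where used
        // ... log or throw with msg ...
    }
    // ... normal processing: no string is ever constructed ...
}
```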
Compilers must initialize contained member objects prior to entering the constructor body. You ought to use the initialization phase to complete the member object creation. This will save the overhead of calling the assignment operator later in the constructor body. In some cases, it will also avoid the generation of temporary objects.
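For example, here is a minimal sketch (Person is a hypothetical class) contrasting initialization with assignment:

```cpp
#include <string>

class Person {
public:
    // Preferred: 'name' is copy-constructed directly from 's' during the
    // initialization phase -- one operation.
    Person(const std::string& s) : name(s) {}

    // Wasteful alternative (shown for contrast): 'name' would first be
    // default-constructed and then assigned -- two operations:
    //     Person(const std::string& s) { name = s; }

private:
    std::string name;
};
```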
The cost of a virtual function stems from the inability to inline calls that are dynamically bound at run-time. The only potential efficiency issue is the speed that inlining would otherwise have gained, if any. Inlining efficiency is a non-issue for functions whose cost is not dominated by call and return overhead.
Templates are more performance-friendly than inheritance hierarchies. They push type resolution to compile-time, which we consider to be free.
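A minimal sketch of this trade-off (the names are ours): the template version resolves the comparison at compile-time and so can be inlined, while the virtual version is bound at run-time and cannot:

```cpp
// Inheritance: the call is bound dynamically and cannot be inlined.
struct Comparator {
    virtual bool less(int a, int b) const = 0;
    virtual ~Comparator() {}
};
struct IntLess : Comparator {
    bool less(int a, int b) const { return a < b; }
};

int minOfVirtual(int a, int b, const Comparator& c) {
    return c.less(a, b) ? a : b;  // dynamic binding: no inlining
}

// Template policy: the call is resolved at compile-time and is inlinable.
struct FastLess {
    bool less(int a, int b) const { return a < b; }
};

template <class Cmp>
int minOfTemplate(int a, int b, const Cmp& c) {
    return c.less(a, b) ? a : b;  // static binding: inlinable
}
```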
If you must return an object by value, the Return Value Optimization will help performance by eliminating the need for creation and destruction of a local object.
The application of the RVO is up to the discretion of the compiler implementation. You need to consult your compiler documentation, or experiment, to find out if and when the RVO is applied.
You will have a better shot at RVO by deploying the computational constructor.
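A minimal sketch of the computational constructor idea (Complex is illustrative; this is our rendering of the technique, not the book's exact code). Returning an unnamed object constructed directly from the result makes it easy for the compiler to build it in the caller's storage:

```cpp
class Complex {
public:
    // The "computational constructor": it takes the already-computed parts,
    // so arithmetic operators can return an unnamed temporary directly.
    Complex(double r, double i) : real(r), imag(i) {}

    friend Complex operator+(const Complex& a, const Complex& b) {
        // No named local object to create, copy, and destroy -- a prime
        // candidate for the Return Value Optimization.
        return Complex(a.real + b.real, a.imag + b.imag);
    }

private:
    double real, imag;
};
```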
A temporary object could penalize performance twice in the form of constructor and destructor computations.
Declaring a constructor explicit will prevent the compiler from using it for type conversion behind your back.
A temporary object is often created by the compiler to fix a type mismatch. You can avoid it by function overloading.
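A minimal sketch combining the last two points (MyString is illustrative). Without the char* overload, comparing a MyString to a string literal would manufacture a temporary MyString just to fix the type mismatch:

```cpp
#include <cstring>
#include <string>

class MyString {
public:
    // 'explicit' stops the compiler from using this constructor for a
    // silent char* -> MyString conversion behind your back.
    explicit MyString(const char* s) : data(s) {}
    const char* c_str() const { return data.c_str(); }
private:
    std::string data;
};

bool operator==(const MyString& a, const MyString& b) {
    return std::strcmp(a.c_str(), b.c_str()) == 0;
}

bool operator==(const MyString& a, const char* b) {
    return std::strcmp(a.c_str(), b) == 0;  // no temporary MyString created
}
```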
Avoid object copy if you can. Pass and return objects by reference.
You can eliminate temporaries by using <op>= operators where <op> may be +, -, *, or /.
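For example (a minimal sketch):

```cpp
#include <string>

void accumulate(std::string& a, const std::string& b) {
    // a = a + b;  // operator+ constructs, copies, and destroys a temporary
    a += b;        // operator+= computes in place: no temporary at all
}
```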
Flexibility trades off with speed. As the power and flexibility of memory management increases, execution speed decreases.
The global memory manager (implemented by new() and delete()) is general purpose and consequently expensive.
Specialized memory managers can be more than an order of magnitude faster than the global one.
If you allocate mostly memory blocks of a fixed size, a specialized fixed-size memory manager will provide a significant performance boost.
A similar boost is available if you allocate mostly memory blocks that are confined to a single thread. A single-threaded memory manager will help by skipping over the concurrency issues that the global new() and delete() must handle.
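Here is a minimal sketch of a single-threaded, fixed-size pool along those lines (the class and its details are ours). Freed blocks go on a free list, so a typical allocation is just a pointer pop; for brevity, blocks are never returned to the system:

```cpp
#include <cstddef>
#include <new>

class FixedSizePool {
public:
    explicit FixedSizePool(std::size_t blockSize)
        : size(blockSize < sizeof(Link) ? sizeof(Link) : blockSize),
          freeList(0) {}

    void* allocate() {
        if (!freeList) {
            return ::operator new(size);  // slow path: go to the heap
        }
        Link* block = freeList;           // fast path: pop the free list
        freeList = block->next;
        return block;
    }

    void deallocate(void* p) {
        Link* block = static_cast<Link*>(p);
        block->next = freeList;           // push back onto the free list
        freeList = block;
    }

private:
    struct Link { Link* next; };          // overlaid on each free block
    std::size_t size;
    Link* freeList;
};
```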
If you develop a set of efficient single-threaded allocators, you can easily extend their reach into multithreaded environments by the use of templates.
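A minimal sketch of that template idea (the adapter is ours; std::mutex stands in for whatever lock your platform provides):

```cpp
#include <mutex>

template <class Allocator>
class MTAllocator {
public:
    explicit MTAllocator(const Allocator& a) : alloc(a) {}

    void* allocate() {
        std::lock_guard<std::mutex> guard(lock);  // serialize access
        return alloc.allocate();
    }
    void deallocate(void* p) {
        std::lock_guard<std::mutex> guard(lock);
        alloc.deallocate(p);
    }

private:
    Allocator alloc;   // the fast single-threaded allocator, reused as-is
    std::mutex lock;
};

// Usage: MTAllocator<FixedSizePool> pool(FixedSizePool(64));
```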
Inlining is the replacement of a method call with the code for the method.
Inlining improves performance by removing call overhead and allowing cross-call optimizations to take place.
Inlining is primarily an execution-time optimization, though it can also result in smaller executable images.
Literal arguments, when combined with inlining, give the compiler significant opportunities for performance improvements.
Inlining may backfire, and overly aggressive inlining will almost certainly do so. Inlining can increase code size. Large code size suffers a higher rate of cache misses and page faults than smaller code.
Nontrivial inlining decisions should be based on sample execution profiles, not gut feelings.
Consider rewriting high-frequency methods that have a large static size but a small dynamic size: extract the significant dynamic component and then inline it.
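A minimal sketch of such a rewrite (the example is ours): the method's static size is dominated by a rare slow path, while its dynamic size, the code that usually runs, is tiny. Split the two and inline only the dynamic component:

```cpp
#include <cstdio>

// Large, rarely executed slow path: deliberately kept out of line.
void handleOverflow(char*& p, char* end) {
    // ... flush the buffer, grow it, report an error, etc. (elided) ...
    std::fputs("buffer full\n", stderr);
    p = end - 1;  // illustrative recovery only
}

// Tiny, frequently executed fast path: worth inlining.
inline void putChar(char*& p, char* end, char c) {
    if (p == end) handleOverflow(p, end);  // rare case goes out of line
    *p++ = c;                              // common case stays inline
}
```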
Trivial and singleton methods can always be inlined.
Inlining can improve performance. The goal is to find a program's fast path and inline it, though inlining this path may not be trivial.
Conditional inlining prevents inlining from occurring. This decreases compile time and simplifies debugging during the earlier phases of development.
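A minimal sketch of conditional inlining via the preprocessor (the macro name is ours):

```cpp
// During development, build without -DDO_INLINE: functions stay out of
// line, compiles are faster, and debugging is simpler. Define DO_INLINE
// for release builds to turn inlining back on.
#ifdef DO_INLINE
#define INLINE inline
#else
#define INLINE /* nothing: plain out-of-line function */
#endif

INLINE int square(int x) { return x * x; }
```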
Selective inlining is a technique that inlines methods only in some places. It can offset some of the code size explosion potential of inlining a method by inlining method calls only on performance-critical paths.
Recursive inlining is an ugly but effective technique for improving the performance of recursive methods.
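A minimal sketch of recursive inlining (our own example, not the book's code): the recursive call chain is manually replicated so that the compiler can inline a few levels of recursion before falling back to the genuinely recursive version:

```cpp
int length(const char* s);  // ordinary recursive fallback, defined below

inline int length2(const char* s) { return *s ? 1 + length(s + 1)  : 0; }
inline int length1(const char* s) { return *s ? 1 + length2(s + 1) : 0; }
inline int length0(const char* s) { return *s ? 1 + length1(s + 1) : 0; }

// Callers invoke length0(); short strings are handled entirely inline,
// and longer ones fall back to true recursion.
int length(const char* s) { return *s ? 1 + length0(s + 1) : 0; }
```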
Care needs to be taken with local static variables: some compilers, when inlining a method that contains one, replicate what ought to be a single, unique variable.
Inlining is aimed at call elimination. Be sure of the real cost of calls on your system before using inlining.
The STL (Standard Template Library) is an uncommon combination of abstraction, flexibility, and efficiency.
Depending on your application, some containers are more efficient than others for a particular usage pattern.
Unless you know something about the problem domain that the STL doesn't, it is unlikely that you will beat the performance of an STL implementation by a wide enough margin to justify the effort.
It is possible, however, to exceed the performance of an STL implementation in some specific scenarios.
Reference counting is not an automatic performance winner. Reference counting, execution speed, and resource conservation form a delicate interaction that must be evaluated carefully if performance is an important consideration. Reference counting may help or hurt performance depending on the usage pattern. The case in favor of reference counting is strengthened by any one of the following items:
The target object is a large resource consumer
The resource in question is expensive to allocate and free
A high degree of sharing; the reference count is likely to be high due to the use of the assignment operator and copy constructor
The creation or destruction of a reference is relatively cheap
If you reverse these items, you start leaning towards skipping reference counting in favor of the plain uncounted object.
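A minimal sketch of an intrusive reference-counted pointer (the names follow common usage, not necessarily this book's code). Note where the costs live: every copy, assignment, and destruction of an RCPtr pays for count maintenance, which is what the trade-offs above weigh against the savings:

```cpp
class RCObject {
public:
    void addReference()    { ++refCount; }
    void removeReference() { if (--refCount == 0) delete this; }
protected:
    RCObject() : refCount(0) {}
    virtual ~RCObject() {}
private:
    int refCount;
};

template <class T>  // T must derive from RCObject
class RCPtr {
public:
    explicit RCPtr(T* p = 0) : pointee(p) {
        if (pointee) pointee->addReference();
    }
    RCPtr(const RCPtr& rhs) : pointee(rhs.pointee) {
        if (pointee) pointee->addReference();
    }
    ~RCPtr() { if (pointee) pointee->removeReference(); }

    RCPtr& operator=(const RCPtr& rhs) {
        if (rhs.pointee) rhs.pointee->addReference();  // order handles
        if (pointee) pointee->removeReference();       // self-assignment
        pointee = rhs.pointee;
        return *this;
    }

    T* operator->() const { return pointee; }
    T& operator*()  const { return *pointee; }

private:
    T* pointee;
};
```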
Coding optimizations are local in scope and do not necessitate understanding of overall program design. This is a good place to start when you join an ongoing project whose design you don't yet understand.
The fastest code is the one that's never executed. Try the following to bail out of a costly computation:
Are you ever going to use the result? It sounds silly, but it happens. At times we perform computation and never use the results.
Do you need the results now? Defer a computation to the point where it is actually needed. Premature computations may never be used on some execution flows.
Do you know the result already? We have seen costly computations performed even though their results were available two lines above. If you already computed it earlier in the execution flow, make the result available for reuse.
Sometimes you cannot bail out, and you just have to perform the computation. The challenge now is to speed it up:
Is the computation overly generic? You only need to be as flexible as the domain requires, not more. Take advantage of simplifying assumptions. Reduced flexibility increases speed.
Some flexibility is hidden in library calls. You may gain speed by rolling your own version of specific library calls that are called often enough to justify the effort. Familiarize yourself with the hidden cost of those library and system calls that you use.
Minimize memory management calls. They are relatively expensive on most compilers.
If you consider the set of all possible input data, 20% of it shows up 80% of the time. Speed up the processing of typical input at the expense of other scenarios.
The speed differential among cache, RAM, and disk access is significant. Write cache-friendly code.
A fundamental tension exists between software performance and flexibility. On the 20% of your software that executes 80% of the time, performance often comes first at the expense of flexibility.
Caching opportunities may surface in the overall program design as well as in the minute coding details. You can often avoid big blobs of computation by simply stashing away the result of previous computations.
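A minimal sketch of such caching (memoization); the function names are hypothetical:

```cpp
#include <map>
#include <string>

// Stand-in for an expensive computation (imagine a network query here).
std::string resolveOverNetwork(const std::string& host) {
    return "addr-of-" + host;  // placeholder result
}

std::string lookupHostAddress(const std::string& host) {
    static std::map<std::string, std::string> cache;
    std::map<std::string, std::string>::iterator it = cache.find(host);
    if (it != cache.end()) {
        return it->second;                        // cache hit: no recomputation
    }
    std::string addr = resolveOverNetwork(host);  // first time: pay in full
    cache[host] = addr;                           // stash for next time
    return addr;
}
```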
The use of efficient algorithms and data structures is a necessary but not sufficient condition for software efficiency.
Some computations may be necessary only on a subset of the overall likely execution scenarios. Those computations should be deferred to those execution paths that must have them. If a computation is performed prematurely, its result may go unused.
Large-scale software often tends towards chaos. One by-product of chaotic software is the execution of obsolete code: code that once upon a time served a purpose but no longer does. Periodic purges of obsolete code and other useless computations will boost performance as well as overall software hygiene.
SMP is currently the dominant MP architecture. It consists of multiple symmetric processors connected via a single bus to a single memory system. The bus is the scalability weak link in the SMP architecture. Large caches, one per processor, are meant to keep bus contention under control.
Amdahl's Law puts an upper limit on the potential scalability of an application. The scalability is limited by portions of the computation that are serialized.
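In its standard textbook form (a well-known result, not a quotation from this book), Amdahl's Law says that if a fraction s of the computation is serialized, the speedup on N processors is

    Speedup(N) = 1 / (s + (1 - s) / N), which is bounded above by 1 / s.

For example, with s = 0.1 (10% of the work serialized), even an unlimited number of processors cannot deliver more than a tenfold speedup.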
The trick to scalability is to reduce and, if possible, eliminate serialized code; shrinking critical sections, splitting coarse-grained locks, and sharing less data among processors are all steps toward that goal.
The farther the memory you want to use is from the processor, the longer it takes to access. The resources closest to the processor, the registers, are limited in capacity but extremely fast; optimizing their use can be very valuable.
Virtual memory is not free. Indiscriminate reliance on system maintained virtual structures can have very significant performance ramifications, typically negative ones.
Context switches are expensive; avoid them.
Lastly, though we are aware that internally managed asynchronous I/O has its place, we also feel that the coming shift in processor architecture will significantly disadvantage monolithic threading approaches.