22-02-2013, 09:23 AM
Fine-tuning your HPC Investments with Performance Analysis
Performance Analysis for Oil and Gas HPC
• Understand how efficiently your HPC resources are used today
• Identify bottlenecks and improve efficiency on current systems
• Focus attention on appropriate emerging technologies
—multi-core processors
—application accelerators
—interconnection network fabrics and topologies
—storage systems
• Guide procurement of future systems that meet your needs
Performance Analysis Challenges
• Microprocessor architectures are hard to program effectively
—processors that are pipelined, out of order, superscalar
—multi-level memory hierarchy
—multi-level parallelism: multi-core, SIMD instructions
• Gap between typical and peak performance is huge
• HPC applications pose challenges for tools
—large complex programs
– multi-lingual
– multiple instantiations of code: templates, inlining
– threaded parallelism
—binary-only external libraries
– sometimes partially stripped
—complex execution environments
– dynamic loading of code
– batch parallel execution on clusters
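The gap between typical and peak performance mentioned above can be made concrete with a back-of-envelope calculation. All machine parameters below are assumed values for a hypothetical multi-core node, chosen only to illustrate how the levels of parallelism multiply:

```python
# Peak floating-point rate is the product of every level of
# parallelism the hardware offers (assumed figures, not a real chip):
cores = 8               # multi-core parallelism
simd_lanes = 4          # double-precision lanes per SIMD instruction
flops_per_lane = 2      # fused multiply-add counts as two flops
ghz = 2.5               # clock frequency in GHz

peak_gflops = cores * simd_lanes * flops_per_lane * ghz

# A scalar, single-threaded, non-FMA code forfeits every one of
# those factors and can sustain at most one flop per cycle:
scalar_gflops = 1 * 1 * 1 * ghz

gap = peak_gflops / scalar_gflops
```

With these numbers the peak is 160 GFLOP/s against 2.5 GFLOP/s scalar, a 64x gap before memory-hierarchy stalls are even counted.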
Rice’s HPCToolkit Performance Tool Goals
• Measurement of both serial and parallel codes
—cope with all the complexities of real codes
—especially multi-threaded codes on multi-core processors
• Scalable data collection for large-scale parallel executions
• Insightful analysis that pinpoints and explains problems
• Effective presentation of analysis results
—correlate measurements with code (yield actionable results)
—intuitive enough for application scientists to use
—detailed enough to meet the needs of compiler writers
HPCToolkit Design Principles
• Work at binary level for language independence
—support multi-lingual codes with external binary-only libraries
• Profile rather than add code instrumentation
—minimize measurement overhead and distortion
—enable data collection for large-scale parallelism
• Collect and correlate multiple performance measures
—can’t diagnose a problem with only one species of event
• Compute derived metrics to aid analysis
• Associate costs with both static and dynamic context
—loop nests, procedures, inlined code, calling context
• Support top-down performance analysis
—intuitive enough for scientists and engineers to use
—detailed enough to meet the needs of compiler writers
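Two of the principles above, collecting multiple performance measures and computing derived metrics from them, can be sketched together. The procedure names, counter values, and assumed machine peak below are made up for illustration; the derived "wasted cycles" metric combines a cycle count and a flop count, since neither event species alone pinpoints where inefficiency costs the most:

```python
# Hypothetical per-procedure measurements from two counter events
# (cycles and floating-point operations); names and values invented.
measurements = {
    "solve":    {"cycles": 9.0e9, "flops": 1.8e9},
    "assemble": {"cycles": 4.0e9, "flops": 3.2e9},
    "io":       {"cycles": 2.0e9, "flops": 1.0e7},
}

PEAK_FLOPS_PER_CYCLE = 4  # assumed machine peak (e.g. SIMD width)

def derived_metrics(data):
    rows = []
    for name, m in data.items():
        # Achieved flops/cycle: needs BOTH counters to compute.
        achieved = m["flops"] / m["cycles"]
        # Wasted cycles: total cost scaled by the efficiency shortfall.
        waste = m["cycles"] * (1 - achieved / PEAK_FLOPS_PER_CYCLE)
        rows.append((name, achieved, waste))
    # Rank by waste: procedures that are both costly AND inefficient
    # rise to the top, focusing tuning effort where it pays off.
    rows.sort(key=lambda r: r[2], reverse=True)
    return rows
```

Ranking by raw cycles alone would put "solve" first anyway here, but the derived metric separates code that is expensive because it does much work from code that is expensive because it runs poorly.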
Call Path Profiling
• No instrumentation
—statistical sampling of hardware performance counter overflows
—gather calling context information using stack unwinding
—overhead proportional to sampling frequency
– not calling frequency
• Capture samples in full calling context
—attribute each sample to an individual PC and source line
—associate costs with full calling context
– call sites too, not just callers
• Measurement overhead is only a few percent