Programming Massively Parallel Processors - Edition 3 - By David B. Kirk and Wen-mei W. Hwu Elsevier Educate


Programming Massively Parallel Processors: A Hands-on Approach, Edition 3

By David B. Kirk and Wen-mei W. Hwu

Publication Date: 07 Dec 2016




Programming Massively Parallel Processors: A Hands-on Approach, Third Edition shows students and professionals alike the basic concepts of parallel programming and GPU architecture, exploring in detail various techniques for constructing parallel programs.

Case studies demonstrate the development process, detailing computational thinking and ending with effective and efficient parallel programs. Performance, floating-point format, parallel patterns, and dynamic parallelism are covered in depth.

For this new edition, the authors have updated their coverage of CUDA, including newer libraries such as cuDNN; moved content that has become less important to appendices; added two new chapters on parallel patterns; and updated case studies to reflect current industry practices.

Key Features

Teaches computational thinking and problem-solving techniques that facilitate high-performance parallel computing
Utilizes CUDA version 7.5, NVIDIA's software development tool created specifically for massively parallel environments
Contains new and updated case studies
Includes coverage of newer libraries, such as cuDNN for deep learning

About the authors

David B. Kirk, NVIDIA Fellow, and Wen-mei W. Hwu, CTO of MulticoreWare and professor specializing in compiler design, computer architecture, microarchitecture, and parallel processing at the University of Illinois at Urbana-Champaign, USA

Dedication
Preface
- Target Audience
- How to Use the Book
- Illinois–NVIDIA GPU Teaching Kit
- Online Supplements
Acknowledgements
Chapter 1. Introduction
- Abstract
- 1.1 Heterogeneous Parallel Computing
- 1.2 Architecture of a Modern GPU
- 1.3 Why More Speed or Parallelism?
- 1.4 Speeding Up Real Applications
- 1.5 Challenges in Parallel Programming
- 1.6 Parallel Programming Languages and Models
- 1.7 Overarching Goals
- 1.8 Organization of the Book
- References
Chapter 2. Data parallel computing
- Abstract
- 2.1 Data Parallelism
- 2.2 CUDA C Program Structure
- 2.3 A Vector Addition Kernel
- 2.4 Device Global Memory and Data Transfer
- 2.5 Kernel Functions and Threading
- 2.6 Kernel Launch
- 2.7 Summary
- References
Chapter 3. Scalable parallel execution
- Abstract
- 3.1 CUDA Thread Organization
- 3.2 Mapping Threads to Multidimensional Data
- 3.3 Image Blur: A More Complex Kernel
- 3.4 Synchronization and Transparent Scalability
- 3.5 Resource Assignment
- 3.6 Querying Device Properties
- 3.7 Thread Scheduling and Latency Tolerance
- 3.8 Summary
Chapter 4. Memory and data locality
- Abstract
- 4.1 Importance of Memory Access Efficiency
- 4.2 Matrix Multiplication
- 4.3 CUDA Memory Types
- 4.4 Tiling for Reduced Memory Traffic
- 4.5 A Tiled Matrix Multiplication Kernel
- 4.6 Boundary Checks
- 4.7 Memory as a Limiting Factor to Parallelism
- 4.8 Summary
Chapter 5. Performance considerations
- Abstract
- 5.1 Global Memory Bandwidth
- 5.2 More on Memory Parallelism
- 5.3 Warps and SIMD Hardware
- 5.4 Dynamic Partitioning of Resources
- 5.5 Thread Granularity
- 5.6 Summary
- References
Chapter 6. Numerical considerations
- Abstract
- 6.1 Floating-Point Data Representation
- 6.2 Representable Numbers
- 6.3 Special Bit Patterns and Precision in IEEE Format
- 6.4 Arithmetic Accuracy and Rounding
- 6.5 Algorithm Considerations
- 6.6 Linear Solvers and Numerical Stability
- 6.7 Summary
- References
Chapter 7. Parallel patterns: convolution: An introduction to stencil computation
- Abstract
- 7.1 Background
- 7.2 1D Parallel Convolution—A Basic Algorithm
- 7.3 Constant Memory and Caching
- 7.4 Tiled 1D Convolution with Halo Cells
- 7.5 A Simpler Tiled 1D Convolution—General Caching
- 7.6 Tiled 2D Convolution With Halo Cells
- 7.7 Summary
- 7.8 Exercises
Chapter 8. Parallel patterns: prefix sum: An introduction to work efficiency in parallel algorithms
- Abstract
- 8.1 Background
- 8.2 A Simple Parallel Scan
- 8.3 Speed and Work Efficiency
- 8.4 A More Work-Efficient Parallel Scan
- 8.5 An Even More Work-Efficient Parallel Scan
- 8.6 Hierarchical Parallel Scan for Arbitrary-Length Inputs
- 8.7 Single-Pass Scan for Memory Access Efficiency
- 8.8 Summary
- 8.9 Exercises
- References
Chapter 9. Parallel patterns—parallel histogram computation: An introduction to atomic operations and privatization
- Abstract
- 9.1 Background
- 9.2 Use of Atomic Operations
- 9.3 Block versus Interleaved Partitioning
- 9.4 Latency versus Throughput of Atomic Operations
- 9.5 Atomic Operation in Cache Memory
- 9.6 Privatization
- 9.7 Aggregation
- 9.8 Summary
- Reference
Chapter 10. Parallel patterns: sparse matrix computation: An introduction to data compression and regularization
- Abstract
- 10.1 Background
- 10.2 Parallel SpMV Using CSR
- 10.3 Padding and Transposition
- 10.4 Using a Hybrid Approach to Regulate Padding
- 10.5 Sorting and Partitioning for Regularization
- 10.6 Summary
- References
Chapter 11. Parallel patterns: merge sort: An introduction to tiling with dynamic input data identification
- Abstract
- 11.1 Background
- 11.2 A Sequential Merge Algorithm
- 11.3 A Parallelization Approach
- 11.4 Co-Rank Function Implementation
- 11.5 A Basic Parallel Merge Kernel
- 11.6 A Tiled Merge Kernel
- 11.7 A Circular-Buffer Merge Kernel
- 11.8 Summary
- Reference
Chapter 12. Parallel patterns: graph search
- Abstract
- 12.1 Background
- 12.2 Breadth-First Search
- 12.3 A Sequential BFS Function
- 12.4 A Parallel BFS Function
- 12.5 Optimizations
- 12.6 Summary
- References
Chapter 13. CUDA dynamic parallelism
- Abstract
- 13.1 Background
- 13.2 Dynamic Parallelism Overview
- 13.3 A Simple Example
- 13.4 Memory Data Visibility
- 13.5 Configurations and Memory Management
- 13.6 Synchronization, Streams, and Events
- 13.7 A More Complex Example
- 13.8 A Recursive Example
- 13.9 Summary
- References
- A13.1 Code Appendix
Chapter 14. Application case study—non-Cartesian magnetic resonance imaging: An introduction to statistical estimation methods
- Abstract
- 14.1 Background
- 14.2 Iterative Reconstruction
- 14.3 Computing F^H D
- 14.4 Final Evaluation
- References
Chapter 15. Application case study—molecular visualization and analysis
- Abstract
- 15.1 Background
- 15.2 A Simple Kernel Implementation
- 15.3 Thread Granularity Adjustment
- 15.4 Memory Coalescing
- 15.5 Summary
- References
Chapter 16. Application case study—machine learning
- Abstract
- 16.1 Background
- 16.2 Convolutional Neural Networks
- 16.3 Convolutional Layer: A Basic CUDA Implementation of Forward Propagation
- 16.4 Reduction of Convolutional Layer to Matrix Multiplication
- 16.5 cuDNN Library
- References
Chapter 17. Parallel programming and computational thinking
- Abstract
- 17.1 Goals of Parallel Computing
- 17.2 Problem Decomposition
- 17.3 Algorithm Selection
- 17.4 Computational Thinking
- 17.5 Single Program, Multiple Data, Shared Memory and Locality
- 17.6 Strategies for Computational Thinking
- 17.7 A Hypothetical Example: Sodium Map of the Brain
- 17.8 Summary
- References
Chapter 18. Programming a heterogeneous computing cluster
- Abstract
- 18.1 Background
- 18.2 A Running Example
- 18.3 Message Passing Interface Basics
- 18.4 Message Passing Interface Point-to-Point Communication
- 18.5 Overlapping Computation and Communication
- 18.6 Message Passing Interface Collective Communication
- 18.7 CUDA-Aware Message Passing Interface
- 18.8 Summary
- Reference
Chapter 19. Parallel programming with OpenACC
- Abstract
- 19.1 The OpenACC Execution Model
- 19.2 OpenACC Directive Format
- 19.3 OpenACC by Example
- 19.4 Comparing OpenACC and CUDA
- 19.5 Interoperability with CUDA and Libraries
- 19.6 The Future of OpenACC
Chapter 20. More on CUDA and graphics processing unit computing
- Abstract
- 20.1 Model of Host/Device Interaction
- 20.2 Kernel Execution Control
- 20.3 Memory Bandwidth and Compute Throughput
- 20.4 Programming Environment
- 20.5 Future Outlook
- References
Chapter 21. Conclusion and outlook
- Abstract
- 21.1 Goals Revisited
- 21.2 Future Outlook
Appendix A. An introduction to OpenCL
- A.1 Background
- A.2 Data Parallelism Model
- A.3 Device Architecture
- A.4 Kernel Functions
- A.5 Device Management and Kernel Launch
- A.6 Electrostatic Potential Map in OpenCL
- A.7 Summary
Appendix B. THRUST: a productivity-oriented library for CUDA
- B.1 Background
- B.2 Motivation
- B.3 Basic Thrust Features
- B.4 Generic Programming
- B.5 Benefits of Abstraction
- B.6 Best Practices
Appendix C. CUDA Fortran
- C.1 CUDA Fortran and CUDA C Differences
- C.2 A First CUDA Fortran Program
- C.3 Multidimensional Array in CUDA Fortran
- C.4 Overloading Host/Device Routines with Generic Interfaces
- C.5 Calling CUDA C via ISO_C_Binding
- C.6 Kernel Loop Directives and Reduction Operations
- C.7 Dynamic Shared Memory
- C.8 Asynchronous Data Transfers
- C.9 Compilation and Profiling
- C.10 Calling Thrust from CUDA Fortran
Appendix D. An introduction to C++ AMP
- D.1 Core C++ AMP Features
- D.2 Details of the C++ AMP Execution Model
- D.3 Managing Accelerators
- D.4 Tiled Execution
- D.5 C++ AMP Graphics Features
- D.6 Summary
- Reference
Index

ISBN: 9780128119860

Page Count: 576


Pacheco, Introduction to Parallel Programming, Morgan Kaufmann, Jan 2011, 9780123742605, $79.95
Barlas, Multicore and GPU Programming: An Integrated Approach, Morgan Kaufmann, Nov 2014, 9780124171374, $99.95
Herlihy, The Art of Multiprocessor Programming, Revised Reprint, Morgan Kaufmann, May 2012, 9780123973375, $74.95
McCool, Structured Parallel Programming: Patterns for Efficient Computation, Morgan Kaufmann, Jun 2012, 9780124159938, $59.95
Hwang, Distributed and Cloud Computing: From Parallel Processing to the Internet of Things, Morgan Kaufmann, Oct 2011, 9780123858801, $89.95
Culler, Parallel Computer Architecture: A Hardware/Software Approach, Morgan Kaufmann, Aug 1998, 9781558603431, $144.00
