Architecture of an HPC system: an HPC system (like Google's data centers) may have many containers/rooms; a container has many racks; a rack has many chassis/rack servers; a chassis has many hosts/sockets/processors; a host/processor has many cores; and a core has many hardware threads/virtual processors.
Programming up to the chassis level is done using serial and shared-memory parallel programming (Pthreads, OpenMP and Cilk). Above the chassis level we generally use the distributed-memory parallel programming model (MPI: Message Passing Interface). We will cover Pthreads and OpenMP in one class, Cilk in one class, MPI in one class, and GPU/CUDA programming in one class. MPI has more than 100 API calls, but studying about 12 of them will make you good at MPI coding.
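As a small illustration of why a handful of calls is enough, here is a minimal sketch of an MPI program using four of the most common APIs (MPI_Init, MPI_Comm_rank, MPI_Comm_size, MPI_Finalize); the compile and launch commands depend on your MPI installation.

```c
/* Minimal MPI sketch: typically compiled with mpicc and run with
   mpirun -np 4 ./a.out (exact commands depend on the installation). */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);                 /* start the MPI runtime        */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* which process am I?          */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* how many processes in total? */
    printf("Hello from rank %d of %d\n", rank, size);
    MPI_Finalize();                         /* shut down the MPI runtime    */
    return 0;
}
```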
Important problems of multi-core/multi-processor systems: (a) synchronization, (b) coherence and (c) consistency. Synchronization/locking requires support from hardware; this support is system dependent and comes in the form of atomic instructions. Examples of atomic instruction primitives are Test-and-Set (TAS), Exchange (XCHG), Fetch-and-Increment (FAI), and Load-Linked/Store-Conditional (LLSC). Trylock TAS is an advanced version of TAS with exponential backoff: it tries to acquire the lock, and on failure waits for some time; on each further failure the waiting time is multiplied by a factor of 2 (see the sketch below).
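A minimal sketch of a TAS-based lock with exponential backoff, using the GCC/Clang __atomic_test_and_set builtin as the hardware atomic instruction; the initial delay and the cap are assumed tuning constants, not values from the notes.

```c
#include <unistd.h>    /* usleep */

static volatile char lock_var = 0;          /* 0 = free, 1 = held */

/* Acquire: test-and-set in a loop; on each failure wait, then double the wait. */
void lock_acquire(void) {
    unsigned delay_us = 1;                  /* assumed initial backoff */
    while (__atomic_test_and_set((void *)&lock_var, __ATOMIC_ACQUIRE)) {
        usleep(delay_us);                   /* failed: wait for some time       */
        if (delay_us < 1024) delay_us *= 2; /* exponential backoff (factor of 2) */
    }
}

void lock_release(void) {
    __atomic_clear((void *)&lock_var, __ATOMIC_RELEASE);
}
```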
In a chip multiprocessor (CMP), all the L2-level caches of the cores are coherent, and coherence is maintained by the hardware itself. In a CMP, the lock variable gets cached and coherence ensures that locking works correctly.
A monitor uses semaphores, a semaphore uses a mutex, a mutex uses TAS, and TAS uses a hardware-provided atomic instruction (TAS/LLSC/FAI/XCHG).
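A hedged sketch of one step in that layering: a counting semaphore built on top of a Pthreads mutex and condition variable (in practice sem_t already exists; this only illustrates how each layer can be built from the one below).

```c
#include <pthread.h>

/* Counting semaphore built from a mutex + condition variable. */
typedef struct {
    pthread_mutex_t m;
    pthread_cond_t  cv;
    int             count;
} csem_t;

void csem_init(csem_t *s, int value) {
    pthread_mutex_init(&s->m, NULL);
    pthread_cond_init(&s->cv, NULL);
    s->count = value;
}

void csem_wait(csem_t *s) {              /* P(): block while count is zero */
    pthread_mutex_lock(&s->m);
    while (s->count == 0)
        pthread_cond_wait(&s->cv, &s->m);
    s->count--;
    pthread_mutex_unlock(&s->m);
}

void csem_post(csem_t *s) {              /* V(): increment and wake one waiter */
    pthread_mutex_lock(&s->m);
    s->count++;
    pthread_cond_signal(&s->cv);
    pthread_mutex_unlock(&s->m);
}
```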
The book "The Art of Multiprocessor Programming" by Shavit and Herlihy covers detailed proofs of locking protocols, formal reasoning, and the design of concurrent/parallel data structures such as thread-safe and efficient lists, queues, stacks, hash tables, heaps and skip lists.
Consistency is about getting the same result from both the serial version of the code and the parallel version. A good paper on consistency: you may read the references therein.
Many issues need to be discussed: scheduling, execution time, security, power, energy, cost (initial cost/investment: space, capital), heating, cooling, virtualisation, usage, prediction, and renting.
Shared resources: L2 cache, L1-L2 link, L3 cache, L2-L3 link, RAM, L3-RAM link, disk, RAM-disk bus, Internet. In most cases the L1 cache is private to each core.
L2/L3/DRAM/disk space is shared, and links/link bandwidths are shared. A link is defined by two parameters: latency (time to access the first data item) and bandwidth (rate of subsequent accesses per cycle after the first access).
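With those two parameters, a common back-of-the-envelope model (an assumption added here, not spelled out in the notes) for the time to move n items over a link is:

```latex
T(n) \approx \text{Latency} + \frac{n}{\text{Bandwidth}}
```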
Space and links are shared: this requires efficient mechanisms to improve performance and reduce interference. Policies for space/bandwidth sharing using logical partitioning: equal partition, dynamic partition based on demand, and partition based on both utilization and performance.
How do we allocate bandwidth to different cores, and given a bandwidth allocation, how do we realize it? Bandwidth can be realized using a token distribution method: generate N tokens per time slot and distribute them based on each core's allocated portion of the bandwidth. Every time a processor/client wants to access the link, it needs to spend one of its allocated tokens (see the sketch below).
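A minimal sketch of the token distribution idea, with assumed client count, token budget and shares: at every time slot the N tokens are split by allocated fraction, and a client may issue an access only if it still has a token to spend.

```c
#include <stdbool.h>

#define NCLIENTS        4
#define TOKENS_PER_SLOT 100                 /* N tokens generated per time slot */

static double share[NCLIENTS] = {0.4, 0.3, 0.2, 0.1};  /* allocated bandwidth fractions */
static int    tokens[NCLIENTS];

/* Called at the start of every time slot: hand out tokens according to each share. */
void refill_tokens(void) {
    for (int i = 0; i < NCLIENTS; i++)
        tokens[i] = (int)(TOKENS_PER_SLOT * share[i]);
}

/* A client may access the shared link only by spending one token. */
bool try_access(int client) {
    if (tokens[client] > 0) {
        tokens[client]--;
        return true;    /* access granted */
    }
    return false;       /* out of tokens: wait for the next slot */
}
```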
Cache: formal definition: a safe place to keep data for efficient use. L1/L2/L3/DRAM/disk can all be used as caches. Mozilla uses a cache, page replacement also uses caching techniques, and already-computed data can be put into a cache.
Cache organizations: fully associative (like a linked list: a data block can go to any place in the cache; access is slower because the list must be searched sequentially), direct mapped (like array access using an index: the address is hashed, a block can go only to one specific place, there may be collisions, but access is fast), and set associative (a mixed approach: a hashed location, but each hashed location (set) can hold up to 4 or 8 data blocks, which are searched in parallel). A sketch of the address mapping is given below.
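A small sketch (with assumed cache geometry) of how a direct-mapped cache maps an address into a block offset, a hashed index and a tag; a set-associative cache uses the same index but compares the tags of the few blocks in that set in parallel.

```c
#include <stdint.h>
#include <stdio.h>

/* Assumed geometry: 64-byte blocks, 256 sets => 16 KB direct-mapped cache. */
#define BLOCK_BITS 6      /* 2^6 = 64-byte block   */
#define INDEX_BITS 8      /* 2^8 = 256 cache sets  */

int main(void) {
    uint32_t addr   = 0x1234ABCD;                                      /* example address  */
    uint32_t offset = addr & ((1u << BLOCK_BITS) - 1);                 /* byte within block */
    uint32_t index  = (addr >> BLOCK_BITS) & ((1u << INDEX_BITS) - 1); /* hashed set        */
    uint32_t tag    = addr >> (BLOCK_BITS + INDEX_BITS);               /* detects collisions */
    printf("offset=%u index=%u tag=0x%x\n", offset, index, tag);
    return 0;
}
```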
Program cache behaviour: for(i=0; i<100; i++) S=S+A[i]; Let us assume a direct-mapped cache where each block holds four elements. Bringing in one block costs one miss and gives three subsequent hits (spatial locality). For the whole loop this gives 25 misses and 75 hits.
Suppose the same code is run for another 200 iterations: for(j=0; j<200; j++) { for(i=0; i<100; i++) S=S+A[i]; } and the array A[100] fits into the cache. Temporal locality: to run this program we make 200*100 = 20,000 accesses, out of which 25 are misses and 19,975 are hits in the cache. Assuming a cache hit takes 1 cycle and a cache miss takes 20 cycles, this program gets a huge performance gain from the cache.
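A rough check of that gain, counting only memory access time with the hit/miss costs just stated:

```latex
T_{\text{with cache}} = 19{,}975 \times 1 + 25 \times 20 = 20{,}475 \text{ cycles}, \qquad
T_{\text{no cache}} = 20{,}000 \times 20 = 400{,}000 \text{ cycles}, \qquad
\text{speedup} \approx 19.5\times
```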
for(i=0; i<100; i++) A[i]=B[i]+C[i]*D[i]; Three reads, one write, one addition and one multiplication per iteration. Suppose you want to do 2*10^9 operations per second: you need to supply data from memory at 3*4*10^9 bytes per second for the reads and write 4*10^9 bytes per second back to memory.
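Put differently (assuming 4-byte elements, as the factor of 4 above suggests), each iteration performs 2 operations on 16 bytes of memory traffic:

```latex
\text{arithmetic intensity} = \frac{2 \text{ ops}}{(3+1)\times 4 \text{ bytes}} = 0.125 \text{ ops/byte}
```

so without a cache the memory system, not the processor, limits the speed.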
Cache: a safe place to keep data for efficient use: processor (compute), L1, L2, L3, RAM, SSD, disk, Internet cache, cloud server. There are many levels in the cache hierarchy.
Cache Performance Improvement
(1) Reducing hit time (small, simple, direct-mapped cache; pipelined cache; avoiding address translation)
(2) Reducing miss penalty (multi-level caches: don't go to the bank every time your pocket runs out of money, keep some money in your hostel room/locker; critical word first; reads are more urgent than writes, so put all writes into a buffer; wrap-around read; concurrent read from cache/memory with data forwarded directly to the processor; victim cache or "recycle bin": use a couple of blocks of space as a recycle buffer)
(3) Reducing miss rate (larger block size: takes advantage of spatial locality and reduces compulsory misses, but increases miss penalty since a large block must be transferred/swapped in and out of memory on a miss; larger cache: reduces capacity misses but increases hit time, so keep L1 small and L2 big; increased associativity: increases hit rate but also hit time, so use way prediction to treat a set-associative cache like a direct-mapped one and reduce hit time and the power cost of multiple tag matches)
Using a software approach to reduce miss rate: compiler optimizations such as loop interchange.
(4) Reducing miss rate and miss penalty: non-blocking caches, hardware and software prefetching.
Prefetching: streaming or row-wise access. If we know the access pattern is streaming, why not fetch the data ahead of time and compute safely? It is a bad idea to fetch data only at the moment it is needed for computation. Using prefetching we can overlap communication/data transfer with computation. Prefetching can be done manually at the time of writing code, by the compiler (which understands the code and inserts prefetch instructions), or at the hardware level using profile-based/machine-learning-based stream detection. A sketch of manual prefetching is given below.
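A minimal sketch of manual (software) prefetching for a streaming access, using the GCC/Clang __builtin_prefetch intrinsic; the prefetch distance of 16 elements is an assumption and would be tuned to the actual memory latency.

```c
/* Streaming sum with software prefetching (GCC/Clang builtin). */
double stream_sum(const double *a, int n) {
    double s = 0.0;
    for (int i = 0; i < n; i++) {
        if (i + 16 < n)
            __builtin_prefetch(&a[i + 16], 0, 1); /* fetch ahead: read access, low temporal locality */
        s += a[i];          /* the computation overlaps with the prefetched transfer */
    }
    return s;
}
```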
Loop optimization: in the matrix multiplication example, interchanging loops can result in fewer cache misses; we can convert column-wise array accesses into row-wise accesses. Column access results in more cache misses (for row-major storage as in C). See the sketch below.
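A sketch of loop interchange for matrix multiplication (row-major C arrays; the size N is assumed): moving the k loop above the j loop turns the column-wise walk over B into a row-wise walk, which is exactly the access-pattern change described above.

```c
#define N 512
double A[N][N], B[N][N], C[N][N];

/* i-j-k order: B[k][j] is walked column-wise -> many cache misses. */
void matmul_ijk(void) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            for (int k = 0; k < N; k++)
                C[i][j] += A[i][k] * B[k][j];
}

/* i-k-j order (loops interchanged): B[k][j] is walked row-wise. */
void matmul_ikj(void) {
    for (int i = 0; i < N; i++)
        for (int k = 0; k < N; k++)
            for (int j = 0; j < N; j++)
                C[i][j] += A[i][k] * B[k][j];
}
```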
Non-blocking cache: multi-ported caches. Most of the time the processor issues multiple access requests at the same time or in subsequent cycles, so to make the memory system efficient we need to handle multiple accesses concurrently. Hit under miss: while the cache is servicing a miss, another access (hit or miss) is allowed. Miss under miss: while the cache is servicing a miss, another miss is allowed to be processed. This improves the bandwidth of the cache system.
Matrix multiplication is the main core of MATLAB. If we use all of the above techniques (prefetching, loop optimization and non-blocking caches), the performance of matrix multiplication can be enhanced to a great extent. A blocked sketch is given below.
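A hedged sketch combining these ideas: a blocked/tiled matrix multiplication (the size N and tile size TB are assumptions) that keeps small sub-matrices resident in cache so they are reused many times; a real library would additionally combine this with prefetching and vectorization.

```c
#define N  512
#define TB 32           /* tile size chosen (assumed) so three TBxTB tiles fit in cache */
double A[N][N], B[N][N], C[N][N];

void matmul_tiled(void) {
    for (int ii = 0; ii < N; ii += TB)
        for (int kk = 0; kk < N; kk += TB)
            for (int jj = 0; jj < N; jj += TB)
                /* multiply one TBxTB tile: data is reused while it stays in cache */
                for (int i = ii; i < ii + TB; i++)
                    for (int k = kk; k < kk + TB; k++)
                        for (int j = jj; j < jj + TB; j++)
                            C[i][j] += A[i][k] * B[k][j];
}
```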