Architecture of an HPC system: an HPC system (like Google's data centers) may have many containers/rooms; a container has many racks; a rack has many chassis/rack servers; a chassis has many hosts/sockets/processors; a host/processor has many cores; and a core has many hardware threads/virtual processors.
Programming up to the chassis level is done using serial and shared-memory parallel programming (Pthreads, OpenMP and Cilk). Above the chassis level we generally use the distributed-memory parallel programming model (MPI: Message Passing Interface). We will cover Pthreads and OpenMP in one class, Cilk in one class, MPI in one class, and GPU/CUDA programming in one class. MPI has more than 100 API calls, but studying about 12 of them will make you good at MPI coding.
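As a small illustration of why a handful of calls is enough, here is a minimal sketch of an MPI program using four of the most common APIs (MPI_Init, MPI_Comm_rank, MPI_Comm_size, MPI_Finalize); the compile and launch commands depend on your MPI installation.

```c
/* Minimal MPI sketch: typically compiled with mpicc and run with
   mpirun -np 4 ./a.out (exact commands depend on the installation). */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);                 /* start the MPI runtime        */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* which process am I?          */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* how many processes in total? */
    printf("Hello from rank %d of %d\n", rank, size);
    MPI_Finalize();                         /* shut down the MPI runtime    */
    return 0;
}
```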
Important problems of multi-core/multi-processor systems: (a) synchronization, (b) coherence and (c) consistency. Synchronization/locking requires support from hardware; this support is system dependent and comes in the form of atomic instructions. Examples of atomic instruction primitives are Test-and-Set (TAS), Exchange (XCHG), Fetch-and-Increment (FAI), and Load-Linked/Store-Conditional (LLSC). Trylock TAS is an advanced version of TAS with exponential backoff: it tries to acquire the lock, and on failure waits for some time; on each further failure the waiting time is multiplied by a factor of 2 (see the sketch below).
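A minimal sketch of a TAS-based lock with exponential backoff, using the GCC/Clang __atomic_test_and_set builtin as the hardware atomic instruction; the initial delay and the cap are assumed tuning constants, not values from the notes.

```c
#include <unistd.h>    /* usleep */

static volatile char lock_var = 0;          /* 0 = free, 1 = held */

/* Acquire: test-and-set in a loop; on each failure wait, then double the wait. */
void lock_acquire(void) {
    unsigned delay_us = 1;                  /* assumed initial backoff */
    while (__atomic_test_and_set((void *)&lock_var, __ATOMIC_ACQUIRE)) {
        usleep(delay_us);                   /* failed: wait for some time       */
        if (delay_us < 1024) delay_us *= 2; /* exponential backoff (factor of 2) */
    }
}

void lock_release(void) {
    __atomic_clear((void *)&lock_var, __ATOMIC_RELEASE);
}
```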
In a chip multiprocessor (CMP), all the L2-level caches of the cores are coherent, and coherence is maintained by the hardware itself. In a CMP, the lock variable gets cached and coherence ensures that locking works correctly.
A monitor uses semaphores, a semaphore uses a mutex, a mutex uses TAS, and TAS uses a hardware-provided atomic instruction (TAS/LLSC/FAI/XCHG).
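A hedged sketch of one step in that layering: a counting semaphore built on top of a Pthreads mutex and condition variable (in practice sem_t already exists; this only illustrates how each layer can be built from the one below).

```c
#include <pthread.h>

/* Counting semaphore built from a mutex + condition variable. */
typedef struct {
    pthread_mutex_t m;
    pthread_cond_t  cv;
    int             count;
} csem_t;

void csem_init(csem_t *s, int value) {
    pthread_mutex_init(&s->m, NULL);
    pthread_cond_init(&s->cv, NULL);
    s->count = value;
}

void csem_wait(csem_t *s) {              /* P(): block while count is zero */
    pthread_mutex_lock(&s->m);
    while (s->count == 0)
        pthread_cond_wait(&s->cv, &s->m);
    s->count--;
    pthread_mutex_unlock(&s->m);
}

void csem_post(csem_t *s) {              /* V(): increment and wake one waiter */
    pthread_mutex_lock(&s->m);
    s->count++;
    pthread_cond_signal(&s->cv);
    pthread_mutex_unlock(&s->m);
}
```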
The book "The Art of Multiprocessor Programming" by Shavit and Herlihy covers detailed proofs of locking protocols, formal reasoning, and the design of concurrent/parallel data structures such as thread-safe and efficient lists, queues, stacks, hash tables, heaps and skip lists.
Consistency is about getting the same result from both the serial version of the code and the parallel version. A good paper on consistency: you may read the references therein.
Many issues need to be discussed: scheduling, execution time, security, power, energy, cost (initial cost/investment: space, capital), heating, cooling, virtualisation, usage, prediction, and renting.
Shared resources: L2 cache, L1-L2 link, L3 cache, L2-L3 link, RAM, L3-RAM link, disk, RAM-disk bus, Internet. In most cases the L1 cache is private to each core.
L2/L3/DRAM/disk space is shared, and links/link bandwidths are shared. A link is defined by two parameters: latency (time to access the first data item) and bandwidth (rate of subsequent accesses per cycle after the first access).
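With those two parameters, a common back-of-the-envelope model (an assumption added here, not spelled out in the notes) for the time to move n items over a link is:

```latex
T(n) \approx \text{Latency} + \frac{n}{\text{Bandwidth}}
```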
Space and links are shared: this requires efficient mechanisms to improve performance and reduce interference. Policies for space/bandwidth sharing using logical partitioning: equal partition, dynamic partition based on demand, and partition based on both utilization and performance.
How do we allocate bandwidth to different cores, and given a bandwidth allocation, how do we realize it? Bandwidth can be realized using a token distribution method: generate N tokens per time slot and distribute them based on each core's allocated portion of the bandwidth. Every time a processor/client wants to access the link, it needs to spend one of its allocated tokens (see the sketch below).
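A minimal sketch of the token distribution idea, with assumed client count, token budget and shares: at every time slot the N tokens are split by allocated fraction, and a client may issue an access only if it still has a token to spend.

```c
#include <stdbool.h>

#define NCLIENTS        4
#define TOKENS_PER_SLOT 100                 /* N tokens generated per time slot */

static double share[NCLIENTS] = {0.4, 0.3, 0.2, 0.1};  /* allocated bandwidth fractions */
static int    tokens[NCLIENTS];

/* Called at the start of every time slot: hand out tokens according to each share. */
void refill_tokens(void) {
    for (int i = 0; i < NCLIENTS; i++)
        tokens[i] = (int)(TOKENS_PER_SLOT * share[i]);
}

/* A client may access the shared link only by spending one token. */
bool try_access(int client) {
    if (tokens[client] > 0) {
        tokens[client]--;
        return true;    /* access granted */
    }
    return false;       /* out of tokens: wait for the next slot */
}
```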
Cache: formal definition: a safe place to keep data for efficient use. L1/L2/L3/DRAM/disk can all be used as caches. Mozilla uses a cache, page replacement also uses caching techniques, and already-computed data can be put into a cache.
Cache organizations: fully associative (like a linked list: a data block can go to any place in the cache; access is slower because the list must be searched sequentially), direct mapped (like array access using an index: the address is hashed, a block can go only to one specific place, there may be collisions, but access is fast), and set associative (a mixed approach: a hashed location, but each hashed location (set) can hold up to 4 or 8 data blocks, which are searched in parallel). A sketch of the address mapping is given below.
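A small sketch (with assumed cache geometry) of how a direct-mapped cache maps an address into a block offset, a hashed index and a tag; a set-associative cache uses the same index but compares the tags of the few blocks in that set in parallel.

```c
#include <stdint.h>
#include <stdio.h>

/* Assumed geometry: 64-byte blocks, 256 sets => 16 KB direct-mapped cache. */
#define BLOCK_BITS 6      /* 2^6 = 64-byte block   */
#define INDEX_BITS 8      /* 2^8 = 256 cache sets  */

int main(void) {
    uint32_t addr   = 0x1234ABCD;                                      /* example address  */
    uint32_t offset = addr & ((1u << BLOCK_BITS) - 1);                 /* byte within block */
    uint32_t index  = (addr >> BLOCK_BITS) & ((1u << INDEX_BITS) - 1); /* hashed set        */
    uint32_t tag    = addr >> (BLOCK_BITS + INDEX_BITS);               /* detects collisions */
    printf("offset=%u index=%u tag=0x%x\n", offset, index, tag);
    return 0;
}
```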
Program cache behaviour: for(i=0; i<100; i++) S=S+A[i]; Let us assume a direct-mapped cache where each block holds four elements. Bringing in one block costs one miss and gives three subsequent hits (spatial locality). For the whole loop this gives 25 misses and 75 hits.
Suppose the same code is run for another 200 iterations: for(j=0; j<200; j++) { for(i=0; i<100; i++) S=S+A[i]; } and the array A[100] fits into the cache. Temporal locality: to run this program we make 200*100 = 20,000 accesses, out of which 25 are misses and 19,975 are hits in the cache. Assuming a cache hit takes 1 cycle and a cache miss takes 20 cycles, this program gets a huge performance gain from the cache.
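A rough check of that gain, counting only memory access time with the hit/miss costs just stated:

```latex
T_{\text{with cache}} = 19{,}975 \times 1 + 25 \times 20 = 20{,}475 \text{ cycles}, \qquad
T_{\text{no cache}} = 20{,}000 \times 20 = 400{,}000 \text{ cycles}, \qquad
\text{speedup} \approx 19.5\times
```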
for(i=0; i<100; i++) A[i]=B[i]+C[i]*D[i]; Three reads, one write, one addition and one multiplication per iteration. Suppose you want to do 2*10^9 operations per second: you need to supply data from memory at 3*4*10^9 bytes per second for the reads and write 4*10^9 bytes per second back to memory.
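Put differently (assuming 4-byte elements, as the factor of 4 above suggests), each iteration performs 2 operations on 16 bytes of memory traffic:

```latex
\text{arithmetic intensity} = \frac{2 \text{ ops}}{(3+1)\times 4 \text{ bytes}} = 0.125 \text{ ops/byte}
```

so without a cache the memory system, not the processor, limits the speed.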
Cache: a safe place to keep data for efficient use: processor (compute), L1, L2, L3, RAM, SSD, disk, Internet cache, cloud server. There are many levels in the cache hierarchy.
Cache Performance Improvement
(1) Reducing hit time (small, simple, direct-mapped cache; pipelined cache; avoiding address translation)
(2) Reducing miss penalty (multi-level caches: don't go to the bank every time your pocket runs out of money, keep some money in your hostel room/locker; critical word first; reads are more urgent than writes, so put all writes into a buffer; wrap-around read; concurrent read from cache/memory with data forwarded directly to the processor; victim cache or "recycle bin": use a couple of blocks of space as a recycle buffer)
(3) Reducing miss rate (larger block size: takes advantage of spatial locality and reduces compulsory misses, but increases miss penalty since a large block must be transferred/swapped in and out of memory on a miss; larger cache: reduces capacity misses but increases hit time, so keep L1 small and L2 big; increased associativity: increases hit rate but also hit time, so use way prediction to treat a set-associative cache like a direct-mapped one and reduce hit time and the power cost of multiple tag matches)
Using a software approach to reduce miss rate: compiler optimizations such as loop interchange.
(4) Reducing miss rate and miss penalty: non-blocking caches, hardware and software prefetching.
Prefetching: streaming or row-wise access. If we know the access pattern is streaming, why not fetch the data ahead of time and compute safely? It is a bad idea to fetch data only at the moment it is needed for computation. Using prefetching we can overlap communication/data transfer with computation. Prefetching can be done manually at the time of writing code, by the compiler (which understands the code and inserts prefetch instructions), or at the hardware level using profile-based/machine-learning-based stream detection. A sketch of manual prefetching is given below.
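A minimal sketch of manual (software) prefetching for a streaming access, using the GCC/Clang __builtin_prefetch intrinsic; the prefetch distance of 16 elements is an assumption and would be tuned to the actual memory latency.

```c
/* Streaming sum with software prefetching (GCC/Clang builtin). */
double stream_sum(const double *a, int n) {
    double s = 0.0;
    for (int i = 0; i < n; i++) {
        if (i + 16 < n)
            __builtin_prefetch(&a[i + 16], 0, 1); /* fetch ahead: read access, low temporal locality */
        s += a[i];          /* the computation overlaps with the prefetched transfer */
    }
    return s;
}
```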
Loop optimization: in the matrix multiplication example, interchanging loops can result in fewer cache misses; we can convert column-wise array accesses into row-wise accesses. Column access results in more cache misses (for row-major storage as in C). See the sketch below.
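A sketch of loop interchange for matrix multiplication (row-major C arrays; the size N is assumed): moving the k loop above the j loop turns the column-wise walk over B into a row-wise walk, which is exactly the access-pattern change described above.

```c
#define N 512
double A[N][N], B[N][N], C[N][N];

/* i-j-k order: B[k][j] is walked column-wise -> many cache misses. */
void matmul_ijk(void) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            for (int k = 0; k < N; k++)
                C[i][j] += A[i][k] * B[k][j];
}

/* i-k-j order (loops interchanged): B[k][j] is walked row-wise. */
void matmul_ikj(void) {
    for (int i = 0; i < N; i++)
        for (int k = 0; k < N; k++)
            for (int j = 0; j < N; j++)
                C[i][j] += A[i][k] * B[k][j];
}
```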
Non-blocking cache: multi-ported caches. Most of the time the processor issues multiple access requests at the same time or in subsequent cycles, so to make the memory system efficient we need to handle multiple accesses concurrently. Hit under miss: while the cache is servicing a miss, another access (hit or miss) is allowed. Miss under miss: while the cache is servicing a miss, another miss is allowed to be processed. This improves the bandwidth of the cache system.
Matrix multiplication is the main core of MATLAB. If we use all of the above techniques (prefetching, loop optimization and non-blocking caches), the performance of matrix multiplication can be enhanced to a great extent. A blocked sketch is given below.
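A hedged sketch combining these ideas: a blocked/tiled matrix multiplication (the size N and tile size TB are assumptions) that keeps small sub-matrices resident in cache so they are reused many times; a real library would additionally combine this with prefetching and vectorization.

```c
#define N  512
#define TB 32           /* tile size chosen (assumed) so three TBxTB tiles fit in cache */
double A[N][N], B[N][N], C[N][N];

void matmul_tiled(void) {
    for (int ii = 0; ii < N; ii += TB)
        for (int kk = 0; kk < N; kk += TB)
            for (int jj = 0; jj < N; jj += TB)
                /* multiply one TBxTB tile: data is reused while it stays in cache */
                for (int i = ii; i < ii + TB; i++)
                    for (int k = kk; k < kk + TB; k++)
                        for (int j = jj; j < jj + TB; j++)
                            C[i][j] += A[i][k] * B[k][j];
}
```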