## W10D2 notes

**written by Chen Zejia.**

### Async/Sync

* Sync: all components of the system share a common frame of reference, which can be time or space. Easier and clearer to design.
* Async: the components usually don't share a common frame of reference.

### Interrupts (I/O)

* Interrupt: an external device interrupts the CPU. The interrupt controller (IC) collects the request and forwards it to the CPU; the CPU then reads the device-specific information over the bus.
* Daisy chain: the devices are arranged in a line, with the core first. When multiple requests are pending, the request closest to the core is served first.
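The daisy-chain priority rule above can be sketched as a scan for the closest requester (a minimal sketch; the bitmask encoding, with bit $i$ meaning "device $i$ positions from the core is requesting", is illustrative):

```c++
#include <cstdint>

// Grant the pending request closest to the core (lowest set bit).
// Returns the device's position in the chain, or -1 if none request.
int grant(uint32_t requests) {
    if (requests == 0) return -1;     // no pending request
    int device = 0;
    while ((requests & 1u) == 0) {    // walk outward along the chain
        requests >>= 1;
        ++device;
    }
    return device;
}
```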

### Internet

* Use a hypercube topology: increasing the dimension decreases the distance (hop count) between nodes.
* ATM/ISDN (Europe)
  * ATM: Asynchronous Transfer Mode. Builds virtual circuits on demand for transmitting data; all the lines are shared.
* TCP/IP (USA)
  * Data is divided into multiple packets for transmission.

Both ATM/ISDN and TCP/IP are asynchronous.

* SNMP: Simple Network Management Protocol; used to collect status information from each node.
* OSPF: Open Shortest Path First; maintains a topology map of the network and finds the best path between two nodes.
* DNS: maps domain names to IP addresses; DNS is one of the largest distributed databases.

### FHSS

* FHSS: frequency-hopping spread spectrum. The carrier frequency changes constantly while transmitting, spreading the data across carriers of different frequencies.

### CDMA

* CDMA: code-division multiple access. Transmissions are combined using orthogonal carriers (codes); the receiver recovers a particular signal with the corresponding decoder. The method is similar in spirit to the Fourier transform.
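The orthogonal-carrier idea can be sketched with 4-chip Walsh codes (the specific codes and names below are illustrative, not from the lecture). Two senders each encode one bit as ±1, the channel simply adds the chip streams, and correlating with a sender's code recovers its bit:

```c++
#include <array>

using Code = std::array<int, 4>;
constexpr Code c1{+1, +1, +1, +1};
constexpr Code c2{+1, -1, +1, -1};   // orthogonal to c1: dot product is 0

// Both senders transmit at once; the channel superposes the chips.
Code transmit(int bit1, int bit2) {
    Code channel{};
    for (int i = 0; i < 4; ++i)
        channel[i] = bit1 * c1[i] + bit2 * c2[i];
    return channel;
}

// Correlate with one sender's code; orthogonality cancels the other.
int decode(const Code& channel, const Code& code) {
    int dot = 0;
    for (int i = 0; i < 4; ++i) dot += channel[i] * code[i];
    return dot / 4;
}
```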

### W10D2 notes

* [陈则佳](W10D2Notes_ChenZejia)

## W14D1 notes

**written by Chen Zejia.**

### Functions

### Performance

* Metrics: CPI (cycles per instruction) / MFLOPS.
* How to make it faster: reduce latency / parallelism (superscalar, pipeline) / locality.

### Principles

* Small (= fast)
* Simple (= regularity)
* Tradeoff / compromise
* Amdahl's law: $S_p = \frac{1}{(1-\eta) + \eta/s}$, where $\eta$ is the fraction of time sped up and $s$ is its speedup (make the common case fast).
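The formula can be checked numerically (the values below are illustrative):

```c++
// Amdahl's law: eta is the fraction of execution time that benefits,
// s is the speedup of that fraction.
double amdahl(double eta, double s) {
    return 1.0 / ((1.0 - eta) + eta / s);
}
```

For example, speeding up 90% of the work by 10x gives amdahl(0.9, 10) ≈ 5.26 overall: the untouched 10% dominates, which is why the common case must be made fast.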

### Parallelism: Pipeline(ILP)

* Balance (stage to stage)
* Speedup (potential) = $N$, the number of stages

#### Hazard

* Structural: functional-unit conflict, e.g. memory conflict between load and instruction fetch $\leftarrow$ split I-cache / D-cache

* Data: **RAW** (true dependence) $\leftarrow$ depends on distance
  * "small" distance: forwarding (H/W)
  * "large" distance: out-of-order execution (H/W) or code movement (S/W)
  * **WAR, WAW** (register conflicts, pseudo-dependences) $\leftarrow$ register renaming

* Control $\leftarrow$ stall + prediction / kill branch / delay-slot filling

### Locality: cache

$AMAT = T_{hit}\mathrm{(cache)} + \eta_{miss}\times T_{penalty}$
$T_{hit}$: DM is fast, FA is slow. $T_{penalty}$ (depends on memory) $\leftarrow$ wider bus / multi-bank memory.

#### Def of Cache: DM/FA/SA

* DM line: valid + tag + data.
  The address is split into tag + index + bs (block offset); the index decides which line the data is in.
  Line width = 1 valid bit + $t$ tag bits + $2^b$ bytes of data; number of lines = $2^i$, where $t$ is the tag length, $i$ the index length, and $b$ the bs length.
* FA
* SA = DM + FA
  N-way SA: $\mathrm{DM}_1$ $\mathrm{DM}_2$ ... $\mathrm{DM}_n$
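The tag + index + bs split can be sketched for one concrete (assumed) geometry, a direct-mapped cache with $2^i$ lines of $2^b$ bytes:

```c++
#include <cstdint>

constexpr unsigned B_BITS = 4;   // assumed: 16-byte blocks (b = 4)
constexpr unsigned I_BITS = 8;   // assumed: 256 lines (i = 8)

// Block offset: the low b bits of the address.
constexpr uint32_t bs(uint32_t addr) {
    return addr & ((1u << B_BITS) - 1);
}
// Index: the next i bits; selects which line the data is in.
constexpr uint32_t line_index(uint32_t addr) {
    return (addr >> B_BITS) & ((1u << I_BITS) - 1);
}
// Tag: everything above index and offset; compared against the stored tag.
constexpr uint32_t tag(uint32_t addr) {
    return addr >> (B_BITS + I_BITS);
}
```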
### W14D1 notes

* [陈则佳](W14D1Notes_ChenZejia)

## W7D1 notes

**written by Chen Zejia.**

### Cache

* $AMAT = T_{hit} + \eta_{miss}\times T_{penalty}$
* Main causes of misses: compulsory, capacity and conflict (the "3 Cs").

### How to Reduce Misses

Some feasible methods:

* Choose a proper block size.
* Increase the associativity.
* Victim cache: add a small buffer that holds blocks recently evicted from the cache.
* Pseudo-associativity: for example, on a miss, retry with the highest bit of the index toggled.
* Hardware prefetching: for example, add a stream buffer between cache and memory. Its structure is similar to the victim cache and the write buffer. When a block is fetched from memory, the following block is placed in the stream buffer.
* Software prefetching: access the data in advance so that it is loaded into the cache.
  * Binding prefetch: loads the data into a register, but occupies that register.
  * Non-binding prefetch: just touches `Mx`. It may turn out to be useless work, and it makes the instruction set more complex.
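A non-binding prefetch can be sketched with the GCC/Clang builtin `__builtin_prefetch` (the prefetch distance of 16 elements is an arbitrary illustration): it only warms the cache, occupies no register, and the result is unchanged even if the hint is dropped.

```c++
#include <cstddef>

// Sum an array while prefetching a block that will be needed soon.
long sum(const int* a, std::size_t n) {
    long total = 0;
    for (std::size_t i = 0; i < n; ++i) {
        if (i + 16 < n)
            __builtin_prefetch(&a[i + 16]);  // hint only; no side effect
        total += a[i];
    }
    return total;
}
```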

#### Compiler Optimizations

Reorder code and data so as to reduce conflict misses.

##### Merging Arrays

```c++
int val[SIZE];
int key[SIZE];           // Before

struct merge {
    int val;
    int key;
} merged_array[SIZE];    // After
```

Improve spatial locality by merging multiple arrays.

Note that a conflict between 2 arrays can already be absorbed by a victim cache; merging arrays pays off when the number of conflicting arrays exceeds the victim cache's capacity.

##### Loop Interchange

```c++
for (k = 0; k < 100; k = k + 1)
    for (j = 0; j < 100; j = j + 1)
        for (i = 0; i < 5000; i = i + 1)
            x[i][j] = 2 * x[i][j];    // Before

for (k = 0; k < 100; k = k + 1)
    for (i = 0; i < 5000; i = i + 1)
        for (j = 0; j < 100; j = j + 1)
            x[i][j] = 2 * x[i][j];    // After
```

Swap the order of the loops so that the data is accessed sequentially. This improves spatial locality.

##### Loop Fusion

```c++
for (i = 0; i < N; i = i + 1)
    for (j = 0; j < N; j = j + 1)
        a[i][j] = 1 / b[i][j] * c[i][j];
for (i = 0; i < N; i = i + 1)
    for (j = 0; j < N; j = j + 1)
        d[i][j] = a[i][j] + c[i][j];      // Before

for (i = 0; i < N; i = i + 1)
    for (j = 0; j < N; j = j + 1) {
        a[i][j] = 1 / b[i][j] * c[i][j];
        d[i][j] = a[i][j] + c[i][j];
    }                                     // After
```

Before: 2 misses per access to `a` and `c`; after: one miss per access.
Merging the loops reuses data while it is still in the cache. This improves temporal locality.

##### Blocking

```c++
for (i = 0; i < N; i = i + 1)
    for (j = 0; j < N; j = j + 1) {
        r = 0;
        for (k = 0; k < N; k = k + 1)
            r = r + y[i][k] * z[k][j];
        x[i][j] = r;
    }                                    // Before

// After: x[][] must start at 0, since each x[i][j] is now
// accumulated across the kk-blocks.
for (jj = 0; jj < N; jj = jj + B)
    for (kk = 0; kk < N; kk = kk + B)
        for (i = 0; i < N; i = i + 1)
            for (j = jj; j < min(jj + B, N); j = j + 1) {
                r = 0;
                for (k = kk; k < min(kk + B, N); k = k + 1)
                    r = r + y[i][k] * z[k][j];
                x[i][j] = x[i][j] + r;
            }
```

Access the data block by block so that each block is used fully while it is in the cache.
It works better with a fully associative cache, which can use most of the space; choosing a proper block size $B$ (so the working blocks fit in the cache) is also important.
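The blocked loop nest can be packaged as a self-contained function and checked against the naive product (the sizes `N = 8`, `B = 4` are illustrative):

```c++
#include <algorithm>
#include <vector>

constexpr int N = 8, B = 4;            // illustrative sizes
using Mat = std::vector<std::vector<int>>;

Mat multiply_blocked(const Mat& y, const Mat& z) {
    Mat x(N, std::vector<int>(N, 0));  // accumulated, so start at 0
    for (int jj = 0; jj < N; jj += B)
        for (int kk = 0; kk < N; kk += B)
            for (int i = 0; i < N; ++i)
                for (int j = jj; j < std::min(jj + B, N); ++j) {
                    int r = 0;
                    for (int k = kk; k < std::min(kk + B, N); ++k)
                        r += y[i][k] * z[k][j];
                    x[i][j] += r;
                }
    return x;
}

Mat multiply_naive(const Mat& y, const Mat& z) {
    Mat x(N, std::vector<int>(N, 0));
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j)
            for (int k = 0; k < N; ++k)
                x[i][j] += y[i][k] * z[k][j];
    return x;
}
```

Both versions compute the same product; only the order of memory accesses, and hence cache behavior, differs.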
### W7D1 notes

* 陈则佳: [notes](https://github.com/czj0xyz/Arch2022-Notes/blob/main/W7D1/W7D1Notes_ChenZejia.md)