- 【Better Explained】
-
-
-
Part I
Program Structure and Execution
-
Representing and Manipulating Information
-
【Better Explained】
- 01——计算机上的信息和数据的真实面目,最本质的存在形态。一切其它形态均源于此形态——内存级别的16进制和二进制数
-
Computer numbering formats
http://en.wikipedia.org/wiki/Computer_numbering_formats
-
【Better Explained】
- Arithmetic overflow
- Byte order or endianness
- big-endian
- 内存中与平时书写习惯一致 也是网络字节序
- little-endian
- Linux
- Conversions Between Signed and Unsigned
- 字长相等
- C 中 同样字长的 unsigned 和 signed 转换时 ,数值可能改变(正数不变),内存中的位不变。
- 字长不等
- Expanding the Bit Representation of a Number
- 【Better Explained】
- short foo=-12345;
(unsigned) foo 等价于 (unsigned) (int) foo 值为4294954951
(unsigned) (unsigned short) foo 值为53191【思考下很有意思!】
- short 先变成 unsigned 【符号扩展】 之后在从有符变成无符【机器解释】【CSapp的解释】
- 我的解释当进行类型转换时先确定被转换数据类型【有符还是无符的?】若是有符的进行符号扩展【根据当前所有数据二进制位在在当前这种类型里是正还是负,并在保持正负性的情况下进行的符号扩展】,完了!不存在有符到无符转换这一歩,实际上这一歩是 Cpu 根据 context 【这个 context 是 printf 的输出符号或者是声明定义类型时确定的】信息 自动判定是有符还是无符。也就是整个转换其实只有符号扩展这一歩而已。
- Zero extension
- 无符数转换成更大的数据
- Sign extension
- 有符数转换成更大的数据
- Truncating Numbers
- x mod (2的k次方)
-
Data types
- Integer
- Unsigned Encodings
- B2Uw
- UMaxw
- Signed number representations
- Sign-and-magnitude method
- Ones' complement
- Two’s-complement
- Excess-N
- Base −2
- Floating point
http://en.wikipedia.org/wiki/Floating_point
- IEEE 754-2008
- Rounding algorithms
- Roundings to nearest
- Round to nearest, ties away from zero
- Round to nearest, ties to even
- Directed roundings
- Round toward −∞
- Round toward 0
- Round toward +∞
- IEEE 754: floating point in modern computers
- 【Better Explained】
- exp 是 unsigned
- 格式化的
- exp不为0或255 significand为小于1大于等于0的任意值 s为0或1
- 非格式化的
- 正负零
- exp全为0 significand 全为0值 s为0或1
- gradual underflow
- exp全为0 significand 为非0值 s为0或1
- 特殊值
- +∞ -∞
- exp全为1 significand 全为 0 s为0或1
- NaN
- Categories of single-precision, floating-point values.
- Internal representation
- Half
- 10-bit significand
- Single
- 24-bit significand
- Double
- 53-bit significand
- Extended
- 64-bit significand
- Quad
- 112-bit significand
- Special values
- Signed zero
- Subnormal numbers
- Infinities
- NaNs
- word
- 32或64个二进制位(mathematics)大小的数。这里的位区分于最小内存的单元的位(computer)
-
Numeral system
- Octal
- Hexadecimal Notation
- Two’s-Complement Encodings
- Unsigned Encodings
- Binary numeral system
http://en.wikipedia.org/wiki/Binary_numeral_system#Fractions_in_binary
- Fractional Binary Numbers
- 就是 Fixed point
-
Character encoding
http://en.wikipedia.org/wiki/Character_encoding
- UTF-8
- UTF-16
- GB2312
- 。。。
-
Computational complexity of mathematical operations
http://en.wikipedia.org/wiki/Computational_complexity_of_mathematical_operations
-
Arithmetic
- Integer Arithmetic
- 【Better Explained】
- 计算机执行整数运算实际上是一种模运算形式
- Arithmetic overflow
- Abelian group
- Binary multiplier
- Unsigned Addition
- overflow
- Two’s-Complement Addition
- Negative overflow
- Positive overflow
- Two’s-Complement Negation
- Complement and Increment
- Complement Upper Bits
- Unsigned Multiplication
- 模数乘法罢了
- Two’s-Complement Multiplication
- http://www1.hrbust.edu.cn/zuzhijigou/metc/material/zcyl/Chap02/2.3.2.htm 【注意计算时符号位的积用()括起来 ,方便区分】
- Multiplying by Constants
- 乘法转换成移位和加减
- Dividing by Powers of Two
- Floating-point arithmetic operations
-
Bitwise operation
- 【Better Explained】
- Bool algebra
- Boolean ring
- Logical reasoning
- Deduction
- Induction
- Abduction
- 以为运算采取向零舍入【正负两个方向向令靠近】
- Bool Operations
- Bitwise NOT
- Bitwise AND
- Bitwise OR
- Bitwise XOR
- Bit shifts
- Arithmetic shift
- signed binary numbers
- Logical shift
- unsigned binary numbers
- Rotate no carry
- Rotate through carry
-
Logical Operations【我加的,不在父词条中】
- Logical negation (NOT)
- Logical AND
- Logical OR
-
Machine-Level Representation of Programs
-
【Better Explained】
-
Computers execute machine code, sequences of bytes encoding the low-level operations that manipulate data, manage memory, read and write data on storage devices, and communicate over networks.
- 计算机上的数据信息动起来!!!通过把和数据相关的一些元操作【只是最常用和主要的】一一映射为一些字节序列【就是连续的二进制位罢了,呵呵】,再通过让CPU识别这些字节序列,完成相应的元操作。本质上,这些元操作对应的字节序列也是数据。
-
Assembly code, a textual representation of the machine code giving the individual instructions in the program.
- 汇编代码就是这些元操作字节序列的,文本存在形态。本质上是没有区别的只是机器码是给机器看的,汇编代码是给人看的,呵呵
-
汇编指令与机器码地相互转换
- http://blog.csdn.net/jiangyuanfu/archive/2009/08/18/4456306.aspx
- Assembly Language
- x86
- x86 assembly language
-
GNU toolchain
http://en.wikipedia.org/wiki/GNU_toolchain
-
【Better Explained】
- Gnu as Document
- http://sourceware.org/binutils/docs-2.16/as/
- GNU make
-
GNU Compiler Collection (GCC)
- The C preprocessor (cpp)
- gcc -E file.i
- The C compiler (cc1)
- gcc -S file.s
- The GNU Assembler(GAS,as)
- gcc -c file.o
- GNU linker (GNU ld)
- gcc -o file
- GNU Binutils
- GNU Bison
- GNU m4
-
GNU Debugger (GDB)
- objdump maybe the best partner
-
GNU build system (autotools)
- Autoconf
- Autoheader
- Automake
- Libtool
-
AT&T assembly
-
【Better Explained】
- 操作数不能同时为 Memory reference
-
Accessing Information
- x86 registers
- %eax和%ah是整体和部分的关系
- Operand Specifiers
- immediate
- In ATT-format assembly code, these are written with a ‘$’ followed by an integer using standard C notation, for example, $-577 or $0x1F.
- register
- Register, denotes the contents of one of the registers, either one of the eight 32-bit registers (e.g., %eax) for a double-word operation, one of the eight 16-bit registers (e.g., %ax) for a word operation, or one of the eight single-byte register elements (e.g., %al) for a byte operation.
- memory reference
- Operand forms
- Data Movement Instructions
- Data movement instructions
- With sign expansion, the upper bits of the destination are filled in with copies of the most significant bit of the source value.
-
Arithmetic and Logical Operations
- Integer arithmetic operations
- Load Effective Address
- leal 的目的操作数必须是register
- Unary Operations
- Binary Operations
- Shift Operations
- Special Arithmetic Operations
-
Control
- Condition Codes
- Accessing the Condition Codes
- SET指令
- CMP指令
- TEST指令
- Jump Instructions
-
Translate C language in assembly language
- Conditional Branches
- Loops
- Conditional Move Instructions
- Switch Statements
-
Array Allocation and Access
- Basic Principles
- Pointer Arithmetic
- Nested Arrays
- Fixed-Size Arrays
- Variable-Size Arrays
-
Heterogeneous Data Structures
- Structures
- Unions
- Data Alignment
-
Procedures
- Stack Frame Structure
- Transferring Control
- Register Usage Conventions
- Recursive Procedures
- Out-of-Bounds Memory References and Buffer Overflow
- x86-64
-
Processor Architecture
-
List of CPU architectures
http://en.wikipedia.org/wiki/List_of_CPU_architectures
-
Embedded CPU architectures
- ARM's ARM Architecture
- Intel's 8051 architecture
- Zilog's Z80 architecture
- Western Design Center's 65816 architecture
- Hitachi's SuperH architecture
- Axis Communications' ETRAX CRIS architecture
- Power Architecture (formerly PowerPC)
-
Microcomputer CPU architectures
- x86
- IA-32
- x86-64
- AMD64
- Intel 64
- Advanced RISC Machines' (originally Acorn) ARM
- Motorola's 6800 and 68000 architectures
- Power Architecture (formerly IBM POWER and PowerPC)
-
Workstation/Server CPU architectures
- Intel's Itanium architecture (formerly IA-64)
- Sun Microsystems's SPARC architecture
- Power Architecture (formerly IBM POWER and PowerPC)
- DEC's Alpha architecture
-
Mini/Mainframe CPU architectures
- IBM's System/360, System/370, ESA/390 and z/Architecture (1964-present)
- DEC's PDP-11 architecture, and its successor, the VAX architecture
-
Mixed-core CPU architectures
- IBM's Cell architecture
-
List of Instruction sets
http://en.wikipedia.org/wiki/List_of_instruction_sets
- Intel
- AMD
- ARM
- Donald Knuth
- IBM
- Motorola
- ISA
http://en.wikipedia.org/wiki/Instruction_set
-
Cpu Design
http://en.wikipedia.org/wiki/CPU_design
-
【Better Explained】
- CPU design focuses on these areas:
- datapaths (such as ALUs and pipelines)
- control unit: logic which controls the datapaths
- Memory components such as register files, caches
- Clock circuitry such as clock drivers, PLLs, clock distribution networks
- Pad transceiver circuitry
- Logic gate cell library which is used to implement the logic
- A CPU design project generally has these major tasks:
- Programmer-visible instruction set architecture, which can be implemented by a variety of microarchitectures
- Architectural study and performance modeling in ANSI C/C++ or SystemC
- High-level synthesis (HLS) or RTL (e.g. logic) implementation
- RTL Verification
- Circuit design of speed critical components (caches, registers, ALUs)
- Logic synthesis or logic-gate-level design
- Timing analysis to confirm that all logic and circuits will run at the specified operating frequency
- Physical design including floorplanning, place and route of logic gates
- Checking that RTL, gate-level, transistor-level and physical-level representations are equivalent
- Checks for signal integrity, chip manufacturability
-
Y86
- Y86 Instruction set Architecture
- 【Better Explained】
- Y86 Programs
- Some Y86 Instruction Details
- push %esp
- pop %esp
- Processor's State
- Program Registers
- Condition Codes
- Program Counter (PC)
- Memory
- Program States
- AOK
- HLT
- ADR
- INS
- Instruction set
- Four movl instructions
- Four integer operation instructions
- The seven jump instructions
- Six conditional move instructions
- The call instruction pushes the return address on the stack and jumps to the destination address.
- The ret instruction returns from such a call.
- The pushl and popl instructions
- The halt instruction
- Instruction Encoding
- 【Better Explained】
- Initial byte
- the higher 4 bits , code part
- the lower 4 bits ,function part
- register specifier byte
- 4-byte constant word
- Exceptions
- Sequential Y86 Implementations
- Organizing Processing into Stages
- Fetch
- 取指
- Decode
- 译码
- Execute
- 执行
- Memory
- 访存
- Write back
- 写回
- PC update
- SEQ Hardware Structure
- Fetch
- Hardware Unit
- Instruction memory
- PC incrementer
- Control logic blocks
- Program stat
- Decode
- Hardware Unit
- Register file
- A
- B
- Control logic blocks
- srcA
- srcB
- Program stat
- Execute
- Hardware Unit
- ALU
- CC
- Control logic blocks
- ALU B
- ALU fun
- ALU A
- Program stat
- Memory
- Hardware Unit
- Virtual Memory System
- Control logic blocks
- Mem. control
- Addr
- Data
- Program stat
- Write back
- Hardware Unit
- Register file
- E
- M
- Control logic blocks
- dstM
- dstE
- Program stat
- PC update
- SEQ Timing
- 【Better Explained】
- The processor never needs to read back the state updated by an instruction in order to complete the processing of this instruction.
- Some rule in Y86
- PC is loaded with a new instruction address every clock cycle
- CC is loaded only when an integer operation instruction is executed.
- Virtual memory is written only when an rmmovl pushl or call instruction is executed.
- Hardware unit that require Clock Sequence control
- Clocked registers
- PC
- CC
- Random-access memories
- Virtual memory system
- The register file
- SEQ stage implementations
- Fetch Stage
- seq_fetch.png
- Decode & Write back
- seq_decode&writeback.png
- Execute Stage
- seq_execute.png
- Memory Stage
- seq_memory.png
- PC update Stage
- seq_PC.png
- Pipelined Y86 Implementations
- SEQ+: Rearranging the Computation Stages
- Circuit Retiming
- Inserting Pipeline Registers
- Rearranging and Relabeling Signals
- Next PC Prediction
- Branch Prediction
- Pipeline Hazards
- Data Hazard
- Program registers
- Control Hazard
- PC
- Mispredicted branches
- Ret instructions require special handing
- Avoiding Data Hazards by Stalling
- Avoiding Data Hazards by Forwarding
- Data forwarding
- Bypassing
- Load/Use Data Hazards
- One class of data hazards cannot be handled purely by forwarding,because memory reads occur late in the pipeline. 言外之意,就是内存中的值 valM 不像 valE 从一开始出现就是被钉死了的。本质原因还是出来的晚,CSAPP言简意赅。
- Load inerlock
- Exception Handling
- Inner Exception
- Halt
- illegal Instruction that combine code and fun
- Fetch,read or write a illegal address
- Excepting instruction
- Exception Handler
- This is the one for your OS
- Some details
- The higher priority,the deeper of a instruction,that will be reported to your OS
- A excepting instruction occurs after a mispredicted branch
- The different stage in pipeline will be updated by different instruction. For instance ,if there is some error in one stage of pipeline made by a malicious instruction will effect other instruction that should not be executed.
- Some mechanism
- Take stat and other information to Program stat and W_register
- PIPE Stage Implementations
- PC Selection and Fetch Stage
- Decode and Write-Back Stage
- Execute Stage
- Memory Stage
- Pipeline Control Logic
- Special logic control cases
- Processing ret
- The pipeline must stall until the ret instruction reaches the write-back stage.
- Load/use hazard
- The pipeline must stall for one cycle between an instruction that reads a value from memory and an instruction that uses this value.
- Mispredicted branches
- By the time the branch logic detects that a jump should not have been taken,several instructions at the branch target will have started down the pipeline. These instructions must be removed from the pipeline.
- Instruction squashing
- Exception
- 【Better Explained】
- Bad stuff
- There are two stage will make exception:Fetch and Memory
- There are three stage will update the new program stat:Execute Memory and Write-back
- If an instruction leads an exception ,we forbid the later instruction updating the programmer-visible stat. In addition , when the excepting instruction reaches the write-back stage,we stop the application.
- Detecting Special Control Conditions
- Pipeline Control Mechanisms
- Combinations of Control Conditions
- Control logic implementation
-
Logic Design
- Tools
- Hardware Description Language HDL
- Verilog
- VHDL
- Hardware Control Language HCL
- 【Better Explained】
- The different between HCL's Combinational Circuits and C's logic expression
- 组合电路 随输入 实时 变化 ,C 语言的 逻辑表达式只在程序执行到时求值。
- 组合电路 操作数只有0和1 ,C 中0表FALSE,something else 表示TREUE
- C语言逻辑操作有短路属性,组合电路无
- Signal Declarations
- boolsig name ’C-expr’
- intsig name ’C-expr’
- Quoted Text
- quote ’string’
- Expressions
- HCL Boolean expressions
- 0
- Logic value 0
- 1
- Logic value 1
- name
- Named Boolean signal
- int -expr in {int -expr 1,int -expr 2, . . . ,int -expr k}
- Set membership test
- int -expr 1 == int -expr 2
- Equality test
- int -expr 1 != int -expr 2
- Not equal test
- int -expr 1 < int -expr 2
- Less than test
- int -expr 1 <= int -expr 2
- Less than or equal test
- int -expr 1 > int -expr 2
- Greater than test
- int -expr 1 >= int -expr 2
- Greater than or equal test
- !bool -expr
- NOT
- bool -expr 1 && bool -expr 2
- AND
- bool -expr 1 || bool -expr 2
- OR
- HCL Integer expressions
- Numbers
- Named integer signals
- Case expressions
- [bool -expr 1 : int -expr 1;bool -expr 2 : int -expr 2;...;];
- Blocks
- bool name = bool -expr;
- int name = int -expr;
- Logic synthesis
- General Principles of Pipelining
- Pipeline
- Limitations of Pipelining
- Nonuniform Partitioning
- Diminishing Returns of Deep Pipelining
- Pipelining a System with Feedback
- 【Better Explained】
- Introducing pipelining into a system containing feedback paths is a peril
- We must deal with feedback effects properly
- Data dependency
- Control dependency
-
Hardware Units
- Circuit
- Logic Gates
- Combinational Circuits
- Bit equal
- MUX,multiplexor
- Word-Level Combinational Circuits
- Arithmetic logic unit ALU
- Sequential circuit
- Memory device
- 【Better Explained】
- Read and write a register at the same time
- 我们会看到一个从旧值到新值的变化
- Clocked registers
- PC
- CC
- Random-access memories
- Virtual memory system
- The register file
- The instruction memory
- Clocking
-
Optimizing Program Performance
http://en.wikipedia.org/wiki/Program_optimization
-
Common theme
- Trade-offs
- Bottlenecks
- When to optimize
- Macros
- Time taken for optimization
- Platform dependent and independent optimizations
-
【Better Explained】
-
Writing Efficient Programs
- Bentley's rules
- Algorithms and Data structures
- Write source code that the compiler can effectively optimize.
- Parallel Computing
-
Understanding Processors
- 【Better Explained】
- Boundary
- Latency bound
- Throughput bound
- the reciprocal of issue time
- Overall Operation
- Nehalem
- Superscalar
- Multi operation per clock-cycle
- Out-of-order
- ICU Instruction Control Unit
- Instruction Cache
- Fetch Control
- Branch prediction
- Speculative execution
- Instruction Decode
- FIFO
- Retired
- Flushed
- Retirement unit
- Register file
- EU Execution Unit
- Functional units
- 【Better Explained】
- Register renaming
- Load
- Store
- FP add + integer
- FP mul/div +integer
- Branch +integer
- Data cache
- Functional Unit Performance
- 【Better Explained】
- Latency
- deep in pipeline
- Issue time
- Fully pipeline
- An Abstract Model of Processor Operation
- Transformation from machine code to data-flow
- Other performance stuffs
-
Understanding Memory
- 【Better Explained】
- Write/read dependency
- the different
- Load Store
- Store Load
- Load unit
- Store unit
- Store buffer
-
Life in the Real World: Performance Improvement Techniques
- High-level design. Choose appropriate algorithms and data structures for the problem at hand. Be especially
vigilant to avoid algorithms or coding techniques that yield asymptotically poor performance.
- Basic coding principles. Avoid optimization blockers so that a compiler can generate efficient code.
- Eliminate excessive function calls. Move computations out of loops when possible. Consider selective compromises of program modularity to gain greater efficiency.
- Eliminate unnecessary memory references. Introduce temporary variables to hold intermediate results. Store a result in an array or global variable only when the final value has been computed.
- Low-level optimizations.
- Try various forms of pointer versus array code.
- Reduce loop overhead by unrolling loops.
- Find ways to make use of the pipelined functional units by techniques such as iteration splitting.
- A final word of advice to the reader is to be careful to avoid expending effort on misleading results.
- Algorithmic efficiency
http://en.wikipedia.org/wiki/Algorithmic_efficiency
-
Source code level
-
【Better Explained】
- Expressing Program Performance
- GHz:9 powers of 10 cycles per second
- CPE:Cycles Per Element
- strlen :Asymptotic inefficiency
-
Loop optimizations
- Loop-invariant code motion
- Loop unrolling
- 【Better Explained】
- Reducing the auxiliary operations
- Conditional branch test
- Computing loop index
- Reducing the critical path operations
- The last index
-
Enhancing Parallelism
Reducing data dependencies
- Multiple Accumulators
- Loop unrolling , Multiple parallel
- IA32 Threshold 4
- x86-64 Threshold 12
- Reassociation Transformation
-
Reducing procedure call
- Code Transformation and replacement
- Reducing memory reference
-
Reducing conditional tests
- Trasformat Commanded style to Functional style
-
Compiler optimization
http://en.wikipedia.org/wiki/Optimizing_compiler
-
【Better Explained】
- Optimization Blocker
- Memory aliasing
- Function calls
-
Optimization techniques
- Common themes
- Loop optimizations
- Data-flow optimizations
- SSA-based optimizations
- Code generator optimizations
- Functional language optimizations
- Other optimizations
- Assembly level
-
Run time
-
Identifying and Eliminating Performance Bottlenecks
- Program Profiling
- Gprof
- Using a Profiler to Guide Optimization
- Amdahl’s Law
-
The Memory Hierarchy
http://en.wikipedia.org/wiki/Memory_hierarchy
-
【Better Explained】
-
Buses
- 【Better explained】
- North bridge
- South bridge
- Front side bus,FSB
- HyperTransport
- QuickPath
- A bus is a collection of parallel wires that carry address, data, and control signals
- bus transaction
- read transaction
- write transaction
- System bus
- Memory bus
- I/O bridge
- Bus interface
- I/O bus
- Universal Serial Bus,USB
- Graphics Card
- Peripheral Component Interconnect ,PCI
- Host bus
- SCSI scuzzy
- SATA satuh
-
Storage Technologies
-
Random-Access Memory
http://en.wikipedia.org/wiki/RAM
- Volatile memory
- SRAM
- bistable latching circuitry
- DRAM
- capacitor within an integrated circuit
- Memory Modules
- DIMM
- SIMM
- Conventional DRAM
- supercell
- two-dimensional array
- RAS
- CAS
- Internal row buffer
- Enhanced DRAM
- FPM DRAM
- EDO DRAM
- SDRAM
- DDR SDRAM
- DDR (2 bits)
- DDR2 (4 bits)
- DDR3 (8 bits)
- RDRAM
- VRAM
- VRAM output is produced by shifting the entire
contents of the internal buffer in sequence
- VRAM allows concurrent reads and writes to the
memory
- Nonvolatile Memory
- PROM
- EPROM
- EEPROM
- Flash
- ROM
- Firmware
-
Disk Storage
- Disk drive
- platter
- Two surfaces
- magnetic recording material
- spindle
- revolutions per minute (RPM)
- Track
- Sector
- Gap
- formatting bits
- Cylinder
- Disk Capacity
- Recoding density
- Track density
- Areal density
- Multiple zone recording
- Recoding zone
- Disk Operation
- Read/write head
- At any point in time, all heads are positioned on the same cylinder
- Disks read and write data in sector-sized blocks
- Actuator arm
- radial axis
- Seek
- Head crash
- Access time
- Seek time
- Rotational latency
- Transfer time
- Logical Disk Blocks
- disk controller
- a small buffer on the controller
- Firmware
- logical block number
- (surface, track, sector) triple
- logical blocks
- Accessing Disks
- Memory-mapped I/O
- I/O port
- Direct Memory Access,DMA
-
Solid State Disks
- USB
- SATA
- Flash chip
- Flash translation layer
- Block
- Page
-
Locality
-
【Better Explained】
- they tend to reference data items that
are near other recently referenced data items, or that were recently referenced themselves
- principle of locality
- temporal locality
- a memory location that is referenced once is likely to be referenced again multiple times in the near future
- spatial locality
- if a memory location is referenced once, then the program is likely to reference a nearby memory location in the near future
-
Locality of References to Program Data
- Stride-1 reference pattern
- Sequential reference pattern
- stride-k reference pattern
- Locality of Instruction Fetches
-
Summary of Locality
- Programs that repeatedly reference the same variables enjoy good temporal locality
- the smaller the stride the better the spatial locality
- The smaller the loop body and the greater the number of loop iterations, the better the locality
-
The Memory Hierarchy
-
Caching
- 【Better Explained】
- chunk
- block
- Transfer unit
- Cache Hits
- Cache Misses
- Replacing
- Evicting
- Victim block
- Replacement policy
- LRU
- Kinds of Cache Misses
- Cold cache
- Warmed up
- Compulsory miss/Cold miss
- Conflict miss
- Working set
- Capacity miss
- Cache Management
- compiler
- Register
- MMU
- TLB
- hardware logic built into the caches
- L1
- L2
- L3
- OS & address translation hardware on the CPU
- DRAM
-
Cache Memories
- Generic Cache Memory Organization
- General organization of cache (S; E; B;m)
- cache
- cache set
- cache line
- block
- m address bits t tag bits, s set index bits, b block offset bits
- Direct-Mapped Caches
- Set Selection
- s
- Line Matching
- valid bit
- tag bit
- block offset bit
- Word Selection
- Line Replacement on Misses
- Conflict Misses
- Thrash
- Set Associative Caches
- E-way set associative cache
- Set Selection
- Line Matching
- Word Selection
- Line Replacement on Misses
- LRU
- LFU
- Fully Associative Caches
- Set Selection
- Only one
- Line Matching
- Word Selection
- Issues with Writes
- Write hit
- Write through
- Write back
- Dirty bit
- Write miss
- Write allocate
- Not write allocate
-
Core i7 Cache Hierarchy
- Core 0
- L1 i -cache
- L1 d -cache
- L2 Unified cache
- L3 Unified cache
-
Performance Impact of Cache Parameters
- Metrics
- Miss rate
- Hit rate
- Hit time
- Miss penalty
- Cache Size
- Block Size
- Associativity E
- L1 L2 8
- L3 16
- Write Strategy
-
Writing Cache-friendly Code
- Make the common case go fast
- Minimize the number of cache misses in each inner loop
- Repeated references to local variables are good because the compiler can cache them in the register file (temporal locality)
- Stride-1 reference patterns are good because caches at all levels of the memory hierarchy store data
as contiguous blocks (spatial locality)
-
The Impact of Caches on Program Performance
- The Memory Mountain
-
Part II
Running Programs on a System
-
Linking
-
【Better Explained】
-
Time of the Linking
- Compile time
- Load time
- Run time
- Separate compilation
-
Benefits of Understanding Linking
- Build large programs
- Avoid dangerous programming errors
- Understand how language scoping rules are implemented
- Understand other important systems concepts
- Exploit shared libraries
-
Linking
-
Symbol resolution
- C++ Java Mangling demangling
- Multiply Defined Global Symbols
- Uninitialized global variables get weak symbols.
- Functions and initialized global variables get strong symbols.
- Rule 1: Multiple strong symbols are not allowed.
- Rule 2: Given a strong symbol and multiple weak symbols, choose the strong symbol.
- Rule 3: Given multiple weak symbols, choose any of the weak symbols.
- Resolve References with Static Libraries
- U
- E
- D
-
Relocation
- Relocating sections and symbol definitions
- The linker then assigns run-time memory addresses to the new aggregate sections, to each section defined by the input modules, and to each symbol defined by the input modules.
- Relocating symbol references within sections
- Relocation Entries
- relocation types
- R 386 PC32
- R 386 32
- Relocating PC-Relative References
- Relocating Absolute References
-
Static Linking
- Linking with Static Libraries
- The general rule for libraries is to place them at the end of the command line.
- On the other hand, if the libraries are not independent, then they must be ordered
- Libraries can be repeated on the command line if necessary to satisfy the dependence requirements.
-
Dynamic Linking
- Shared libraries
- .interp
- JNI
- Linux API
- PIC
- GOT
- .data
- PLT
- .test
- PIC Data References
- performance disadvantages
- PIC Function Calls
- lazy binding
- Loading and Linking Shared Libraries from Applications
- Life in real world
- Distribution Software
- Building high-performance web server
-
Object Files
-
Tools for Manipulating Object Files
- objdump
- readelf
- ar
- strings
- strip
- nm
- size
- ldd
-
Object file formats
- a.out
- COFF
- PE
- ELF
-
Relocatable object file
- Typical ELF relocatable object file.png
- symbol table
- Global symbols defined by module that located file in C or Class in C++ Java
- Not include static global variables
- Global symbols defined by some other module
- Local symbols
- Only Include static global variables and functions. not local variables
-
Executable object file
- Typical ELF executable object file
- objdump -dx
- execve loader _start
- Linux run-time memory image.png
- Shared object file
-
Exceptional Control Flow
-
【Better Explained】
- control transfer
-
control flow of the processor.
- “smooth” sequence
- ECF
- Hardware level
- operating systems level
- application level
-
Benefits of understand ECF
- understand important systems concepts
- understand how applications interact with the operating system
- write interesting new application programs
- understand concurrency
- understand how software exceptions work
-
Tools for Manipulating Processes
- strace
- ps
- top
- pmap
- /proc
-
Exceptions
-
【Better Explained】
- An exception is an abrupt change in the control flow in response to some change in the processor’s state.
- processor’s state
- event
- related to current instruction
- unrelated to current instruction
- The differences to Procedure call
- Return
- Procedure call,pop return address from the stack
- Depending on the class of exception
- Exception handlers run in kernel mode
- Push some additional processor state onto the stack,EFLAGS
- If control is being transferred from a user program to the kernel, all of these items are pushed onto the kernel’s stack rather than onto the user’s stack.
- It looks like meaning that exception occurs in user mode rather than kernel mode
- Asynchronous exceptions occur as a result of events in I/O devices that are external to the processor
- Synchronous exceptions occur as a direct result of executing an instruction
- system-level functions
- system calls
- The standard C library
-
Exception Handling
- exception number
- exception table
- exception table base register
- exception handler
- Exception return
- depending on the type of event that caused the exception
- Icurr
- Inext
- Abort
-
Classes of Exceptions
- Interrupts
- Signal from I/O device
- Async
- Handler returns to next instruction
- Traps
- Intentional exception
- Sync
- Handler returns to instruction following the syscall
- Faults
- Potentially recoverable error
- Sync
- Handler either reexecutes current instruction or aborts.
- Aborts
- Nonrecoverable error
- Sync
- Handler returns to abort routine
-
Exceptions in Linux/IA32 Systems
- exception types
- 0 to 31 exceptions defined by the Intel architects
- 0 Divide error
- Fault
- 13 General protection fault
- Segmentation faults
- Fault
- 14 Page fault
- Fault
- 18 Machine check
- Abort
- 32 to 255 interrupts and traps defined by Linux
- 128 0x80 System call
- Trap
- Linux/IA32 System Calls
- By convention, %eax %ebx, %ecx, %edx, %esi,%edi, and %ebp
-
Processes
-
Context
- program’s codeand data stored in memory
- the contents of its general-purpose registers
- the program counter
- status registers
- the floating-point registers
- user’s stack
- kernel’s stack
- environment variables
- various kernel data structures
- page table
- process table
- file table
-
abstractions
- Logical control flow
- A sequence of PC values
- Corresponded exclusively to instructions contained in our program’s executable object file or in shared objects linked into our program dynamically at run time.
- Instances
- Exception handlers, processes, signal handlers, threads, and Java processes
- Concurrent Flows
- A logical flow whose execution overlaps in time with another flow
- concurrency
- The general phenomenon of multiple flows executing concurrently
- multitasking / Time slicing
- The notion of a process taking turns with other processes
- time slice
- Each time period that a process executes a portion of its flow
- parallel flows
- If two flows are running concurrently on different processor cores or computers
- Private address space
- Private
- Can not be accessed by other process
- 0x08048000
- 0x00400000
-
User and Kernel Modes
- Processors typically provide this capability with a mode bit in some control register that characterizes the privileges that the process currently enjoys.
- /proc
- exports the contents of many kernel data structures as a hierarchy of text files
- /sys
- exports additional low-level information about system buses and devices
-
Process scheduling
- At certain points during the execution of a process, the kernel can decide to preempt the current process and restart a previously preempted process
-
Context Switches
- (1) saves the context of the current process
- (2) restores the saved context of some previously preempted process
- (3) passes control to this newly restored process
-
Process Control
- Obtaining Process IDs
- Creating and Terminating Processes
- process states
- Running.
- Stopped
- Terminated
- fork
- Call once, return twice
- Concurrent execution
- Duplicate but separate address spaces
- Shared files
- process hierarchy
- Reaping Child Processes
- zombie
- waitpid
- Determining the Members of the Wait Set
- pid > 0
- pid = -1
- Modifying the Default Behavior
- 0
- WNOHANG
- WUNTRACED
- Checking the Exit Status of a Reaped Child
- WIFEXITED(status)
- WEXITSTATUS(status)
- WIFSIGNALED(status)
- WTERMSIG(status)
- WIFSTOPPED(status)
- WSTOPSIG(status)
- Error Conditions
- -1
- ECHILD
- EINTR
- Putting Processes to Sleep
- sleep
- 0
- >0
- pause
- Loading and Running Programs
- execve
- called once and never returns or -1
- Typical organization of the user stack when a new program starts.png
- getenv
- setenv
- unsetenv
- Using fork and execve to Run Programs
- shell
- The read step reads a command line from the user
- The evaluate step parses the command line and runs programs on behalf of the user.
- Web servers
-
Process Groups
- Every process belongs to exactly one process group
- getpgrp
- setpgid
-
Signals
-
Linux signals
- A signal is a message that notifies a process that an event of some type has occurred in the system.
- Default action
- Dumping core
- Years ago, main memory was implemented with a technology known as core memory.
- an historical term that means writing an image of the code and data memory segments to disk.
- SIGTRAP
- SIGABRT
- SIGFPE
- SIGSEGV
- ignore
- SIGCHLD
- SIGCONT
- SIGURG
- SIGWINCH
- stop until next SIGCONT
- SIGSTOP
- SIGTSTP
- SIGTTIN
- SIGTTOU
- neither be caught nor ignored
- terminate*
- SIGKILL
- stop until next SIGCONT*
- SIGSTOP
- Each signal type corresponds to some kind of system event. Low-level hardware exceptions are processed by the kernel’s exception handlers and would not normally be visible to user processes. Signals provide a mechanism for exposing the occurrence of such exceptions to user processes.
-
Signal Terminology
- Sending a signal
- The kernel sends (delivers) a signal to a destination process by updating some state in the context of the destination process.
- delivered reasons
- (1) the kernel has detected a system event
- a divide-by-zero error
- the termination of a child process
- (2) A process has invoked the kill function
- A process can send a signal to itself.
- How to send
- the kill Program
- the Keyboard
- the kill Function
- the alarm Function
- Receiving a signal
- A destination process receives a signal when it is forced by the kernel to react in some way to the delivery of the signal.
- pending signal
- A signal that has been sent but not yet received
-
Signal Handling Issues
- Pending signals can be blocked
- Unix signal handlers typically block pending signals of the type currently being processed by the handler
- Pending signals are not queued
- The crucial lesson is that signals cannot be used to count the occurrence of events in other processes.
- System calls can be interrupted
- Portable Signal Handling
- sigaction
- Only signals of the type currently being processed by the handler are blocked
- As with all signal implementations, signals are not queued
- Once the signal handler is installed, it remains installed
- Interrupted system calls are automatically restarted whenever possible
-
Race
- A good method to expose your race in code
- Coin of sleep
-
Nonlocal Jumps
- An important application of nonlocal jumps is to permit an immediate return from a deeply nested function call, usually as a result of detecting some error condition
- Another important application of nonlocal jumps is to branch out of a signal handler to a specific code location, rather than returning to the instruction that was interrupted by the arrival of the signal.
-
setjmp
- called once but returns multiple times
-
longjmp
- called once but never returns
-
Virtual Memory
http://en.wikipedia.org/wiki/Virtual_memory
-
【Better Explained】
-
three important capabilities
- It uses main memory efficiently by treating it as a cache for an address space stored on disk, keeping only the active areas in main memory, and transferring data back and forth between disk and memory as needed
- It simplifies memory management by providing each process
with a uniform address space
- It protects the address space of each process from corruption by other processes
-
Benefits of understand Virtual Memory
- Virtual memory is central
- Virtual memory is powerful
- Virtual memory is dangerous
-
Address Spaces
- LAS
- VAS
- PAS
-
Core i7/Linux Memory System
- Core i7 Address Translation
- Linux Virtual Memory System
- Linux Virtual Memory Areas(also called segments)
- An area is a contiguous chunk of existing (allocated) virtual memory whose pages are related in some way
- task_struct
- mm_struct
- pgd
- vm_area_structs
- vm_start
- vm_end
- vm_prot
- vm_flags
- vm_next
- Linux Page Fault Exception Handling
- Is virtual address legal?
- segmentation fault:
accessing a non-existing page
- Is the attempted memory access legal?
- protection exception:
- normal page fault
- Memory Mapping
- 【Better explained】
- initializes the contents of a virtual memory area by associating it with an object on disk
- swap file/space/area,maintained by the kernel.
- A object mapped into physical memory that the physical pages are not necessarily contiguous
- A virtual memory area that a shared object is mapped into is often called a shared area. Similarly for a private area
- Mapping object
- Regular file in the Unix filesystem
- Anonymous file
- created by the kernel, that contains all binary zeros
- no data is actually transferred between disk and memory
- demand-zero pages
- Shared Objects
- shared object
- It is visible to any other processes that have also mapped the shared object into their virtual memory
- The changes are also reflected in the original object on disk
- private object
- not visible to other processes
- not reflected back to the object on disk
- private copy-on-write
- The fork Function
- The execve Function
- Delete existing user areas
- Map private areas
- Map shared areas
- Set PC
- User-level Memory Mapping
- Dynamic Memory Allocation
- brk ptr
- Explicit allocators
- malloc calloc realloc free
- sbrk
- Why Dynamic Memory Allocation?
- Often we do not know the sizes of certain data structures until the program actually runs
- Allocator Requirements and Goals
- Requirements
- Handling arbitrary request sequences
- Making immediate responses to requests
- Using only the heap
- Aligning blocks (alignment requirement)
- The allocator must align blocks in such a way that they can hold any type of data object. On most systems, this means that the block returned by the allocator is aligned on an eight-byte (double-word) boundary.
- Not modifying allocated blocks
- Goals
- 【Better Explained】
- aggregate payload
- peak utilization
- Maximizing throughput
- throughput is defined as the number of requests that it completes per unit time
- Maximizing memory utilization
- In fact, the total amount of virtual memory allocated by all of the processes in a system is limited by the amount of swap space on disk
- Fragmentation
- internal fragmentation
- external fragmentation
- Implementation
- Free block organization
- Implicit Free Lists
- block
- header
- payload
- padding
- Explicit Free Lists
- Simple Segregated Storage
- Placement
- first fit
- next fit
- best fit
- Segregated Fits
- Buddy Systems
- Splitting
- Coalescing
- false fragmentation
- immediate coalescing
- deferred coalescing
- Coalescing with Boundary Tags
- footer
- Getting Additional Heap Memory
- sbrk
- Implicit allocators /Garbage collectors
- Garbage Collector Basics
- directed reachability graph
- a set of root nodes
- a set of heap nodes
- conservative garbage collectors
- Mark&Sweep Garbage Collectors
- Conservative Mark&Sweep for C Programs
- Common Memory-Related Bugs in C
- Dereferencing Bad Pointers
- Reading Uninitialized Memory
- Allowing Stack Buffer Overflows
- Assuming that Pointers and the Objects they Point to Are the Same Size
- Making Off-by-One Errors
- Referencing a Pointer Instead of the Object it Points To
- Misunderstanding Pointer Arithmetic
- Referencing Nonexistent Variables
- Referencing Data in Free Heap Blocks
- Introducing Memory Leaks
-
Physical and Virtual Addressing
- Memory management unit (MMU)
- Page table base register (PTBR)
-
Address Translation
- Address translation with a page table.png
- Page hit.png
- Page fault.png
- Integrating Caches and VM.png
- Translation lookaside buffer (TLB)
- Components of a virtual address that are used to access the TLB.png
- Multi-level Page Tables
- This scheme reduces memory requirements
- If a PTE in the level-1 table is null that represents a significant potential savings
- only the level-1 table and the most heavily used level-2 page tables need to be cached in main memory
-
Virtual address (VA)
- Virtual page offset (VPO)
- Virtual page number (VPN)
-
Physical address (PA)
- Physical page number (PPN)
- Physical page offset (PPO)
-
The Role of VM
-
VM as a Tool for Caching
- Page
- virtual page (VP)
- Unallocated
- Cached
- Uncached
- physical page (PP)page frames
- DRAM SRAM
- Page Tables
- page table entry (PTE)
- Page Hits
- Page Faults
- swapping or paging
- swapped in (paged in)
- swapped out (paged out)
- demand paging
- Allocating Pages
- Locality
- working set or resident set
- thrashing
-
VM as a Tool for Memory Management
- demand paging and separate virtual address spaces
- Simplifying sharing
- Simplifying loading
- Simplifying linking
- Simplifying memory allocation
-
VM as a Tool for Memory Protection
- Permission bits
- SUP
- READ
- WRITE
- segmentation fault
- more bits for other process access
-
Part III
Interaction and Communication Between Programs
-
System-Level I/O
-
【Better Explained】
-
Benefits of understand Unix I/O
- understand other systems concepts
- Sometimes you have no choice but to use Unix I/O
- there are problems with the standard I/O library that make it risky to use for network programming.
- Unix file is a sequence of some bytes
- All I/O devices, such as networks, disks, and terminals, are modeled as files
- This elegant mapping allows kernel to export a simple, low-level API that enables all input and output to be performed in a uniform and consistent way
-
R I/O
- W.Richard Stevens
-
Unix I/O
-
Opening and Closing Files
- descriptor
- Closing files
- free the data structures
- restore the descriptor
-
Reading and Writing Files
- file position
- EOF
- There is no explicit “EOF character” at the end of a file
- read
- 0 EOF
-
Reading File Metadata
- stat
-
Sharing Files
- Descriptor table
- File table
- v-node table
-
I/O Redirection
- dup2
-
Standard I/O
- Stream is a pointer a structure of type FILE
- Full duplex
- Restriction 1:Input functions following output functions.
- Restriction 2:Output functions following input functions
- ?
-
Network Programming
- The Client-Server Programming Model
-
Networks
-
WAN
- Router
- LAN
- Bridged Ethernet
- Some twisted pairs of wires
- Multiple Ethernet segments
- Ethernet segment
- Some twisted pairs of wires
- Hub
- Bridge
-
Internet Protocol
- Naming scheme
- 48-bit address
- internet addresses
- Delivery mechanism
- Frame
- Header
- Playload
- Encapsulation
-
The Global IP Internet
-
IP Addresses
- scalar IP address in a structure
- dotted-decimal representation
- network byte order
- inet_aton
- inet_ntoa
-
Internet Domain Names
- ICANN
- Internet domain name hierarchy
- HOSTS.txt
- DNS
- gethostbyname
- gethostbyaddr
-
Internet Connections
- Socket Address Structures
- sockaddr
- SA
- Steven
- sockaddr_in
- in_addr
- The Sockets Interface
- socket
- bind
-
Web Servers
-
Web Basics
- HTTP
- HTML
- Hyperlinks
- WWW
- Tim Berners-Lee
- MIME
- URL
- ?
- &
-
Web Content
- Fetch a disk file and return its contents to the client
- static content
- serving static content
- Run an executable file and return its output to the client
- dynamic content
- serving dynamic content
- /
- HTTP Transactions
-
Concurrent Programming
-
【Better Explained】
-
Application-level concurrency
- Accessing slow I/O devices
- Interacting with humans
- Reducing latency by deferring work
- Servicing multiple network clients
- Computing in parallel on multi-core machines
-
concurrent programs
- Processes
- each logical control flow is a process that is scheduled and maintained by the kernel
- Since processes have separate virtual address spaces, communicating with each other flows by IPC
- I/O multiplexing
- applications explicitly schedule their own logical flows in the context of a single process
- Since the program is a single process, all flows share the same address space
- Logical flows are modeled as state machines that the main program explicitly transitions from state to state as a result of data arriving on file descriptors.
- Threads
- Threads are logical flows that run in the context of a single process and are scheduled by the kernel.
- scheduled by the kernel like process flows
- sharing the same virtual address space like I/O multiplexing flows
-
Concurrent ProgrammingWith Processes
-
Pros and Cons of Processes
- file tables are shared
- separate address spaces for processes
- Avoiding accidentally overite the virtual memory of another process
- make it more difficult for processes to share state information
- Slower because the overhead for process control and IPC is high.
-
Traditional view of a process
- Process context
- Data registers
- Condition codes
- Stack pointer (SP)
- Program counter (PC)
- Kernel context
- Process ID (PID)
- VM structures
- Open files
- Signal handlers
- brk pointer
- Code, data, and stack
- stack
- shared libraries
- run-time heap
- read/write data
- read-only code/data
-
Concurrent ProgrammingWith I/O Multiplexing
-
Pros and Cons of I/O Multiplexing
- event-driven designs give programmers more control over the behavior of their programs than process-based designs
- I/O multiplexing runs in the context of a single process
- do not require a process context switch to schedule a new flow
- coding complexity
-
Concurrent ProgrammingWith Threads
-
【Better Explained】
- main thread
- peer thread
- pool of peers
- The main impact of this notion of a pool of peers is that a thread can kill any of its peers, or wait for any of its peers to terminate.
- thread routine
- Posix Threads
- A joinable thread like a zombie process
- A joinable thread can be reaped and killed by other
threads. Its memory resources (such as the stack) are not freed until it is reaped by another thread
- A detached thread cannot be reaped or killed by other threads. Its memory resources are freed
automatically by the system when it terminates.
- Mapping Variables to Memory
- Global variables
- Local automatic variables
- Local static variables
-
Threads Memory Model
- Thread 1 context:
- Data registers
- general-purpose registers
- Condition codes
- stack 1
- On my view ,the stack is the thread routine call stack,in other worlds ,it is just a ordinary function call stack.
- ????
- SP1
- PC1
- TID1
- Thread 2 context:
- Data registers
- general-purpose registers
- Condition codes
- stack 2
- SP2
- PC2
- TID2
- Shared code and data
- shared libraries
- run-time heap
- read/write data
- read-only code/data
- Kernel context:
- VM structures
- Open files
- Signal handlers
- brk pointer
- PID
-
Synchronizing Threads with Semaphores
- Sequential Consistency
- Synchronization error
- Progress Graphs
- critical section
- safe trajectory
- unsafe trajectory
- Semaphores
- Edsger Dijkstra
- Proberen
- Verhogen
- Semaphore invariant
- Binary semaphore
- Mutex
- Counter semaphore
- Schedule Shared Resources
- Forbidden region
- Producer-consumer model
- Using Threads for Parallelism
- Strong scaling
- Weak scaling
-
Thread-unsafe functions
- Failing to protect shared variables
- Semaphore
- it does not require any changes in the calling program
- the additional synchronization will slow down the program
- Relying on state across multiple function invocations
- The only way to make a function such as rand thread-safe is to rewrite it
- Returning a pointer to a static variable
- Rewrite it.
- Lock-and -copy
- the additional synchronization will slow down the program
- Such as gethostbyname return a pointer to a complex structure,if we want to copy the entire Hierarchy ,we need deep copy.
- In terms of the class of multiple invocations functions ,it doesn't work.
- Calling thread-unsafe functions
- Functions relying on multiple function invocations will be unsafe
- Functions that Involved shared variables and static variable can make it safe by using semaphore and lock-and-copy
-
Reentrancy
- Reentrant function
- Reentrant functions are characterized by the property that they do not reference any shared data when they are called by multiple threads
- Reentrant functions are typically more efficient than non-reentrant
thread-safe functions because they require no synchronization operations
- explicitly reentrant
- all function arguments are passed by value (i.e., no pointers)
- all data references are to local automatic stack variables (i.e., no references to static or global variables)
- Implicity reentrant
- Function arguments can be passed by reference (that is, we allow them to pass pointers)
- the calling threads must be careful to pass arguments that pointers to non-shared data
-
Library Functions in Threaded Programs
- Rewrite
- Lock-and-copy-deep copy
-
Race
- A race occurs when the correctness of a program depends on one thread reaching point x in its control flow before another thread reaches point y
- Golden rule
- Threaded programs must work correctly for any feasible trajectory
-
Deadlock
- a collection of threads are blocked, waiting for a condition that will never be true
- Consistent lock will avoid the deadlock if you use mutex