Fundamentals Database

ACID
1. Atomicity
  1. One of the transactions can succeed or fail. No 3rd status
2. Consistency
  1. In Data
    1. Defined by the user
    2. Referential integrity ( foreign keys )
  2. In Reads
    1. After commit data. get should return exactly data committed
3. Isolation
  1. Read Uncommitted
    1. Dirty Read
  2. Read Committed
  3. Repeatable Read
    1. Phantom Reads
  4. SnapShot
  5. Serializable
  6. Read phenomena
4. Durability
  1. Changes made by committed transactions must be persisted in durable non volatile storage
  2. WAL - Write ahead log before do it. when crash have log and cannot lost data. Store data after computing not store event ( query )
    1. expensive because when store data are indexes, data, files, ... when have change that why dbms using...
    2. write-ahead-log segments only log version change
  3. Asynchronous snapshot
    1. Lostable data but high performance than wal
  4. AOF ( Redis using ) record event ( query ) and will trigger it when crash
  5. When write data to disk OS will be write to cache after that plush to disk. we can be lost data when system crash
  6. using fsync ( force sync ) to force write to disk without cache but it will be slow
  7. Can be config db engine ( innodb,... )
  8. Default config mostly depend on os
Indexes
1. Clustered index
  1. B-tree or HashMap has pointer to the Heap
  2. Indexes store index of page and row
  3. MYSQL InnoDB this is primary key
  4. They will store all data in record
  5. Other index not primary is non-clustered index It has pointer to the clustered index ( MYSQL )
  6. Better if set auto-increment because Data will be store ( order by ) this key. very bad if insert value with id not increment
2. Non-clustered index
  1. Only store value of column in index pointer to clustered index
  2. Need go to clustered index to get other data if needed
3. How to db choose index
  1. Will be scan table if have a little data
  2. will choose an index if the data on one side that is very small
  3. Other will be using multiple index
  4. Select two column have two index
4. Explain Query
5. Key vs non-Key column index
6. BitMap Index Scan
  1. Use bitwise to calculate and | or
7. Bloom Filter
8. Index Scan
9. Table Scan
10. B+Tree
  1. Btree
Storage
1. Heap
  1. Page.
2. IO operation always fetch minimum a page (8/16kb) not fetch only single row
3. Column Base
  1. Writes are slower
  2. OLAP
  3. Compress greatly
  4. Amazing for aggregation
  5. Inefficient query multi-columns
4. Row Base
  1. Optimize for read/writes
  2. OLTP
  3. Compression isn't efficient
  4. Aggregation isn't efficient
  5. Efficient queries multi-columns
5. Primary key
6. Compare
Concurrency Control
1. Shared Lock
2. Exclusive Locks
3. Level locking
  1. Row
  2. Table
  3. Page
4. Pessimistic
  1. Locking start when start transaction. everyone else need wait for lock release and continues
    1. Good when transactions often conflict
5. Optimistic
  1. when start transaction they got number current version of transaction. when commit system will compare with current version if not change this transaction can be apply. but if change transaction need do again
    1. Good when transactions are less conflict
6. Dead Locks
  1. When two transactions lock each other
7. Two-phase Locking
8. Offset slow
  1. Should using where > id because offset will fetch all data and filter not filter in storage engines
Partitioning
1. Split to multiple table same database server
2. Typpe of split
  1. Horizontal ( Split rows )
  2. Vertical ( Split columns )
  3. By Range
  4. By List
  5. By Hash
3. Pros
  1. Query faster when accessing single partition
  2. Easy bulk loading ( attach partition )
  3. Archive old data very easy
4. Cons
  1. Change schema is hard
  2. Ineffective when scan all data
  3. Hard for update -> move data between partitions is hard
Sharding
1. Pros
  1. Scale database
  2. Security ( user can access certain shards )
  3. Optimal and smaller index size
2. Cons
  1. Complex client logic ( logic aware shard )
  2. Cannot join, transactions beetwen shards
  3. Schema changes are hard
3. consistent hashing
4. Split to multiple database server
5. Consider about sharing is last choosen