1. DynamoDB basics
    1. Concepts
      1. Table
      2. Item
      3. Primary key
        1. Uniquely identifies items in your table
        2. Simple (Partition key)
        3. Composite (Partition key + Sort key)
      4. Attributes
        1. Types
      5. Item collection
        1. All items sharing the same partition key are co-located in the same partition
      6. Secondary indexes
      7. How it works under the hood
        1. Data is split into partitions of up to 10 GB each
        2. A request router will hash the PK and figure out the partition it needs to access in O(1)
    2. SQL vs NoSQL
      1. Problem NoSQL tries to solve
        1. High scalability
        2. Fast and predictable query performance at any scale
        3. Optimizing access time over storage space
      2. Implications for your model
        1. DynamoDB doesn't allow operations that won't scale
        2. Almost all your access patterns use your primary key
        3. There are no joins
        4. You need to know your access patterns upfront
        5. You need to add secondary indexes
        6. All your entities in one table + generic PKs
          1. Generic PKs means your key attributes have generic names such as PK and SK
    3. What is DynamoDB?
      1. It's an indexed object store
      2. Only the partition key is mandatory; it determines how data is distributed
  2. Under the hood
    1. A distributed hash table
      1. A hash of the partition key locates the right partition, where the data is organized by the sort key (if present) as a B-tree
    2. How data is accessed
      1. 1. Find node for partition key using a Hash Table -> O(1)
      2. 2. Find starting value for sort key using a B-Tree -> O(log n) where n is the size of the item collection, not the entire table
      3. 3. Read values sequentially until the end of the sort key match. A single read is limited to 1 MB of data
      4. This process guarantees consistent access time at any scale; you just need to model your data so an access pattern doesn't require multiple round trips (see the query sketch below)
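      A minimal sketch of this access path with boto3. The table name, key names, and values are illustrative assumptions, not from these notes:
```python
import boto3
from boto3.dynamodb.conditions import Key

# Hypothetical table using generic PK/SK attribute names.
table = boto3.resource("dynamodb").Table("app-table")

# 1. The partition key is hashed to locate the partition (O(1)).
# 2. The sort key condition is resolved against the B-tree (O(log n)).
# 3. Matching items are read sequentially (up to 1 MB per request).
response = table.query(
    KeyConditionExpression=Key("PK").eq("CUSTOMER#123")
    & Key("SK").between("ORDER#2024-01-01", "ORDER#2024-12-31")
)
items = response["Items"]
```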
  3. Data modeling process
    1. 1. Understand your application
    2. 2. Create your entity-relationship diagram (ERD)
    3. 3. List your access patterns
      1. Think about each entity and access patterns from that entity (list none if there are none)
    4. 4. Design your table to handle your access patterns
      1. (Screenshot: list all your access patterns)
      2. (Screenshot: identify entities by PK)
    5. 5. Know your strategies
  4. Data modeling best practices
    1. Denormalize data that doesn't change much
      1. If it changes, but not often, you can use DynamoDB Streams and Lambda functions to update the denormalized data
    2. Design your partition key properly
      1. If your partition key does not distribute data uniformly across partitions, you can use the write sharding technique
  5. One-to-many relationships
    1. Denormalization + complex attribute
    2. Denormalization + duplication
    3. Composite primary key + query
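    A hedged sketch of the composite primary key + query strategy above. The entity names, key layout, and table name are assumptions:
```python
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("app-table")  # hypothetical table

# The customer and its orders share a partition key, so they form one item collection.
table.put_item(Item={"PK": "CUSTOMER#123", "SK": "PROFILE#123", "name": "Alice"})
table.put_item(Item={"PK": "CUSTOMER#123", "SK": "ORDER#2024-03-01#A1", "total": 42})
table.put_item(Item={"PK": "CUSTOMER#123", "SK": "ORDER#2024-03-05#A2", "total": 17})

# One query fetches the customer together with all of its orders.
customer_and_orders = table.query(
    KeyConditionExpression=Key("PK").eq("CUSTOMER#123")
)["Items"]

# Or fetch only the orders by constraining the sort key prefix.
orders = table.query(
    KeyConditionExpression=Key("PK").eq("CUSTOMER#123") & Key("SK").begins_with("ORDER#")
)["Items"]
```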
  6. Many-to-many relationships
    1. These are represented using the adjacency list pattern (see the sketch below)
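    A minimal sketch of the adjacency list pattern; the entity names, key layout, and table name are illustrative assumptions:
```python
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("app-table")  # hypothetical table

# One item per student-course edge. The simplest variant duplicates the edge in
# both directions; a GSI that swaps PK/SK is a common alternative to the duplicate.
table.put_item(Item={"PK": "STUDENT#1", "SK": "COURSE#MATH", "enrolled_at": "2024-03-01"})
table.put_item(Item={"PK": "COURSE#MATH", "SK": "STUDENT#1", "enrolled_at": "2024-03-01"})

# All courses for a student...
courses = table.query(KeyConditionExpression=Key("PK").eq("STUDENT#1"))["Items"]
# ...and all students in a course.
students = table.query(KeyConditionExpression=Key("PK").eq("COURSE#MATH"))["Items"]
```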
  7. Dive deep into secondary indexes
    1. Additional primary key in the table
    2. Data is copied into secondary indexes with new PKs behind the scenes
    3. Can't write to secondary indexes
    4. Impacts on your write limits (throttling)
  8. Filtering strategies
    1. You need to be more intentional about how you model your data because there is no WHERE clause
    2. 1. Filtering with the partition key
      1. You get the whole item collection
    3. 2. Filtering with the sort key
      1. When the SK has a meaning (e.g. a timestamp), you can filter by a range (between two values) or get a specific one
      2. It has a meaning based on how you modeled it.
    4. 3. Sparse index
      1. You create a secondary index which is sparse (i.e. only has some of the items)
      2. Filtering within an entity based on condition
        1. e.g. you want to filter all users that are administrators
        2. Only items that have all the attributes of the secondary index's primary key are copied; you set that attribute when writing an item if you want it copied to the secondary index
      3. Projecting a single type of entity into a secondary index
        1. Out of all the entity types in your table, you project only one entity type's keys into a secondary index, so the index contains only those entities and you can scan it without going through every entity type
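      A possible sketch of a sparse index: only items written with the index's key attribute appear in the index. The index name (AdminIndex), attribute name (AdminIndexPK), and table name are assumptions:
```python
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("app-table")  # assumes a GSI named "AdminIndex"

# Regular user: the index key attribute is absent, so the item is NOT copied to the GSI.
table.put_item(Item={"PK": "USER#1", "SK": "PROFILE", "name": "Bob"})

# Administrator: writing the index key attribute makes the item appear in the GSI.
table.put_item(Item={"PK": "USER#2", "SK": "PROFILE", "name": "Eve", "AdminIndexPK": "ADMIN"})

# Query only administrators without scanning every user.
admins = table.query(
    IndexName="AdminIndex",
    KeyConditionExpression=Key("AdminIndexPK").eq("ADMIN"),
)["Items"]
```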
    5. 4. Client-side filtering
      1. Good when the filtering is difficult to model in the database and the dataset isn't that big (e.g. the entity has tens to hundreds of items)
  9. Sorting strategies
    1. Only the sort key is sorted, and it's sorted within a single partition key (item collection)
      1. You can fetch all items in a partition key
      2. You can query the sort key with: =, <, >, >=, <=, "begins with", and "between". You can also sort the results, count them, and get the top/bottom N values
    2. Sorted based on UTF-8 bytes
      1. Is case sensitive (uppercase before lowercase)
    3. Sorting with timestamps
      1. Epoch timestamps or ISO-8601 strings both work; the latter has the benefit of being human readable
      2. Use the same timezone (e.g. UTC); see the sketch below
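      A small sketch of building sortable ISO-8601 sort keys in UTC; the ORDER# key layout and helper name are assumptions:
```python
from datetime import datetime, timezone

def order_sort_key(order_id: str) -> str:
    # ISO-8601 timestamps in UTC sort lexicographically in chronological order
    # and stay human readable, e.g. ORDER#2024-03-05T14:30:00+00:00#A2
    ts = datetime.now(timezone.utc).isoformat(timespec="seconds")
    return f"ORDER#{ts}#{order_id}"

print(order_sort_key("A2"))
```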
    4. Unique, sortable IDs
      1. UUIDs are not sortable, so no
      2. ULID
        1. 128 bits
        2. URL safe
        3. Lexicographically sortable
      3. Snowflake
        1. 64 bits
        2. URL safe
        3. Lexicographically sortable
      4. KSUID
        1. 160 bits (20 bytes)
        2. Lexicographically sortable
    5. Sorting on changing attributes
      1. Changing a primary key requires a delete + put, which makes sorting on these types of attributes a hassle
      2. Better use a secondary index
    6. Ascending vs descending
      1. You can use the ScanIndexForward property (true for ascending, false for descending)
      2. Important when combining multiple entity types in a single item collection
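      A sketch of descending order with ScanIndexForward; the table and key names are assumptions:
```python
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("app-table")  # hypothetical table

# Latest orders first: descending sort-key order, limited to the newest 10.
latest_orders = table.query(
    KeyConditionExpression=Key("PK").eq("CUSTOMER#123") & Key("SK").begins_with("ORDER#"),
    ScanIndexForward=False,  # False = descending; the default (True) is ascending
    Limit=10,
)["Items"]
```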
  10. Access patterns strategies
    1. When multiple entities in the same table need to be queried there are usually two approaches
      1. Using secondary indexes
      2. Inserting additional items
  11. Pagination
  12. Partition overloading, secondary indexes
  13. Global Secondary indexes (GSI)
    1. Support secondary access patterns
    2. They create a separate table where the data from the main table is copied (DynamoDB handles this behind the scenes)
    3. You can choose what data to project to save costs
      1. KEYS_ONLY
        1. Only copies the primary keys of the original table
      2. INCLUDE
        1. You select which attributes from the main table to project
      3. ALL
        1. All the data of the item is copied
    4. RCUs / WCUs are provisioned separately from your main table
    5. Can be created after table creation and will be backfilled automatically (see the sketch below)
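    A hedged sketch of adding a GSI to an existing table with boto3; the table, index, and attribute names are assumptions, and it assumes on-demand billing (a provisioned table would also need ProvisionedThroughput in the Create block):
```python
import boto3

client = boto3.client("dynamodb")

# Add a KEYS_ONLY GSI after table creation; DynamoDB backfills it from existing items.
client.update_table(
    TableName="app-table",  # hypothetical table
    AttributeDefinitions=[
        {"AttributeName": "GSI1PK", "AttributeType": "S"},
        {"AttributeName": "GSI1SK", "AttributeType": "S"},
    ],
    GlobalSecondaryIndexUpdates=[
        {
            "Create": {
                "IndexName": "GSI1",
                "KeySchema": [
                    {"AttributeName": "GSI1PK", "KeyType": "HASH"},
                    {"AttributeName": "GSI1SK", "KeyType": "RANGE"},
                ],
                # KEYS_ONLY / INCLUDE / ALL decides how much data is duplicated into the index.
                "Projection": {"ProjectionType": "KEYS_ONLY"},
            }
        }
    ],
)
```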
    6. Write sharding strategy
      1. When you want to add a GSI on an attribute with high write throughput, you "salt" the values by appending a random value to the attribute, which causes the items to be spread across multiple partitions
      2. e.g. you want to index by the status of an order; IN_PROGRESS always hashes to the same partition, so all in-progress orders are written to that partition, leading to throttling. By salting the status attribute you end up with values such as IN_PROGRESS#0 to IN_PROGRESS#9 (if we choose 10 random suffixes)
      3. The cardinality of these random values determines how many partitions are used, which impacts write throughput (WCU), so you need to know your expected write throughput
      4. When reading those keys you can parallelize the reads which will spread across all the shards (1 thread for IN_PROGRESS#1, another for IN_PROGRESS#2, etc)
      5. This can be abstracted in the data access layer, as can the fact that you need to strip the salt from the value before it's returned to the business layer
      6. Instead of a purely random salt, other algorithms (e.g. a calculated suffix) can distribute writes evenly across partitions (see the sketch below)
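      A rough sketch of write sharding with a random salt and parallel reads; the shard count, table name, GSI name, key names, and helper names are hypothetical:
```python
import random
from concurrent.futures import ThreadPoolExecutor

import boto3
from boto3.dynamodb.conditions import Key

SHARDS = 10
table = boto3.resource("dynamodb").Table("app-table")  # hypothetical table with a GSI named "GSI1"

def write_order(order_id: str, status: str) -> None:
    # Salt the GSI partition key so IN_PROGRESS spreads over several partitions.
    salted = f"{status}#{random.randrange(SHARDS)}"  # e.g. IN_PROGRESS#0 .. IN_PROGRESS#9
    table.put_item(Item={"PK": f"ORDER#{order_id}", "SK": "META", "GSI1PK": salted})

def read_status(status: str) -> list:
    # Read all shards in parallel and strip the salt before returning to the business layer.
    def one_shard(i: int):
        resp = table.query(
            IndexName="GSI1",
            KeyConditionExpression=Key("GSI1PK").eq(f"{status}#{i}"),
        )
        return resp["Items"]

    with ThreadPoolExecutor(max_workers=SHARDS) as pool:
        shard_results = list(pool.map(one_shard, range(SHARDS)))
    return [item for shard in shard_results for item in shard]
```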
    7. Index overloading
      1. By using generic attribute names for secondary index keys, you can overload the index to support different entity types in the same index
  14. High velocity aggregations
    1. Using streams and lambdas
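    One possible sketch of a stream-driven aggregation Lambda; the table name, key layout, and the aggregate item shape are assumptions:
```python
import boto3

table = boto3.resource("dynamodb").Table("app-table")  # hypothetical table

def handler(event, context):
    # Triggered by the DynamoDB stream: for every new order item, bump a
    # pre-aggregated counter item instead of scanning at read time.
    new_orders = sum(
        1
        for record in event["Records"]
        if record["eventName"] == "INSERT"
        and record["dynamodb"]["Keys"]["SK"]["S"].startswith("ORDER#")
    )
    if new_orders:
        table.update_item(
            Key={"PK": "STATS#orders", "SK": "TOTAL"},
            UpdateExpression="ADD order_count :n",
            ExpressionAttributeValues={":n": new_orders},
        )
```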
  15. Pessimistic locking
    1. Use semaphore items
      1. To support one writer with multiple readers but only writing when there's no reader
      2. You can use one attribute called lock (for writers) and a counter of the readers
      3. To obtain the write lock, update the item: set lock = true and writeTimeout = timestamp, with the condition lock = false; then keep polling until readers is zero
      4. To obtain a read lock, update the item: set readers += 1 and readTimeout = timestamp, with the condition lock = false
      5. If the condition fails, poll until lock = false
      6. Call an error handler if writeTimeout expires (see the sketch below)
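      A hedged sketch of acquiring the write lock on a semaphore item with a conditional update; the item layout and attribute names follow the notes above but the table name, key layout, and helper are still assumptions:
```python
import time

import boto3
from botocore.exceptions import ClientError

table = boto3.resource("dynamodb").Table("app-table")  # hypothetical table

def acquire_write_lock(resource_id: str, timeout_s: int = 30) -> bool:
    try:
        # Atomically set lock = true only if nobody else currently holds it.
        table.update_item(
            Key={"PK": f"LOCK#{resource_id}", "SK": "SEMAPHORE"},
            UpdateExpression="SET #lock = :t, #wt = :deadline",
            ConditionExpression="attribute_not_exists(#lock) OR #lock = :f",
            ExpressionAttributeNames={"#lock": "lock", "#wt": "writeTimeout"},
            ExpressionAttributeValues={
                ":t": True,
                ":f": False,
                ":deadline": int(time.time()) + timeout_s,
            },
        )
    except ClientError as err:
        if err.response["Error"]["Code"] == "ConditionalCheckFailedException":
            return False  # someone else holds the write lock
        raise
    # The caller still polls until the readers counter drops to zero before writing.
    return True
```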
  16. Single table design
    1. We construct queries that return all needed data in a single interaction with the database. This is important for speeding up the performance of the application for these specific access patterns. However, there is a potential downside: the design of your data model is tailored towards supporting these specific access patterns, which could conflict with other access patterns and make those less efficient. Because of this trade-off it's important to prioritize your access patterns and optimize for performance as well as cost based on priority.
    2. You need to understand your application's data access patterns. Access patterns are dictated by your design, and using a single-table design requires a different way of thinking about data modeling.
    3. It forces you to avoid joins because joins are expensive. Most applications have a higher read than write ratio, which justifies spending time optimizing for reads
    4. Many developers apply relational design patterns with DynamoDB even though they don’t have the relational tools like the join operation. This means they put their items into different tables according to their type. However, since there are no joins in DynamoDB, they’ll need to make multiple, serial requests to fetch both the Orders and the Customer record.
    5. The main reason for using a single table in DynamoDB is to retrieve multiple, heterogeneous item types using a single request
    6. Downsides
      1. The steep learning curve to understand single-table design
      2. The inflexibility of adding new access patterns
      3. The difficulty of exporting your tables for analytics
    7. When not to use single-table design
      1. in new applications where developer agility is more important than application performance
        1. If you’re in the situation where you’re out-scaling a relational database, you probably have a good sense of the access patterns you need. But if you’re making a greenfield application at a startup, it’s unlikely you absolutely require the scaling capabilities of DynamoDB to start, and you may not know how your application will evolve over time.
      2. in applications using GraphQL
  17. Adapting to new access patterns
    1. https://dev.to/aws-builders/managing-changing-access-patterns-with-dynamodb-2ef9
  18. Transactions
    1. How do you use transactions to insert multiple items into a table (a consequence of single-table design) while maintaining consistency? See the sketch below
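    A sketch of a transactional write across several items of the same table using TransactWriteItems, which makes them succeed or fail atomically; the table name and item shapes are hypothetical:
```python
import boto3

client = boto3.client("dynamodb")

# Insert an order and bump the customer's order counter atomically.
client.transact_write_items(
    TransactItems=[
        {
            "Put": {
                "TableName": "app-table",  # hypothetical single-table design
                "Item": {
                    "PK": {"S": "CUSTOMER#123"},
                    "SK": {"S": "ORDER#2024-03-05#A2"},
                    "total": {"N": "17"},
                },
                "ConditionExpression": "attribute_not_exists(PK)",  # reject duplicate orders
            }
        },
        {
            "Update": {
                "TableName": "app-table",
                "Key": {"PK": {"S": "CUSTOMER#123"}, "SK": {"S": "PROFILE#123"}},
                "UpdateExpression": "ADD order_count :one",
                "ExpressionAttributeValues": {":one": {"N": "1"}},
            }
        },
    ]
)
```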
  19. Design patterns and best practices
    1. Global tables and summary analytics
    2. Write sharding for selective reads
  20. Resources
    1. AWS re:Invent 2020: Data modeling with Amazon DynamoDB – Part 1
      1. DynamoDB basics
      2. SQL vs NoSQL
      3. One-to-many relationships
    2. AWS re:Invent 2020: Data modeling with Amazon DynamoDB – Part 2
      1. Data modeling process
      2. Filtering strategies
      3. Sorting strategies
    3. AWS re:Invent 2020: Amazon DynamoDB advanced design patterns – Part 1
      1. Partition Overloading, secondary indexes
      2. Global tables and summary analytics
      3. Write sharding for selective reads
    4. AWS re:Invent 2020: Amazon DynamoDB advanced design patterns – Part 2
      1. Pessimistic Locking
      2. Shard key optimization
      3. Single vs Multi-table shootout
    5. Single table design
      1. https://www.alexdebrie.com/posts/dynamodb-single-table/
      2. https://www.alexdebrie.com/posts/dynamodb-no-bad-queries/
      3. time complexity of SQL JOINs
    6. Pagination
      1. https://theburningmonk.com/2018/02/guys-were-doing-pagination-wrong/
    7. https://www.dynamodbguide.com/what-is-dynamo-db
  21. Why DynamoDB?
    1. Global replication across regions
      1. Active-active
      2. Fully managed
      3. Cross region replication under 2 seconds
      4. Relies on DynamoDB streams to propagate the data
    2. Scalability
      1. Our current solution with MySQL consists of splitting into more databases (horizontal scaling), with the added complexity of breaking transaction boundaries
      2. Horizontal scaling works best when you can shard the data in a way that a single request can be handled by a single machine. Jumping around multiple boxes and making network requests between them will result in slower performance.
      3. We can accomplish horizontal scalability without breaking the transaction boundaries by adopting DynamoDB with a single table design
    3. Performance
      1. Single-digit millisecond latency on any query at any scale
    4. Bounded operations
      1. SQL queries are unbounded. There’s no inherent limit to the amount of data you can scan in a single request to your database, which also means there’s no inherent limit to how a single bad query can gobble up your resources and lock up your database.
      2. Aggregations are also expensive on a number of different resources — CPU, memory, and disk. And it’s completely hidden from the developer.
      3. DynamoDB does not allow joins and puts explicit bounds on your queries.
    5. Streams
      1. This is a double-edged sword: along with the flexibility they provide, we need to monitor the different Lambdas and also treat them as clients, taking care of backward compatibility. So streams have to be used with care
      2. Data processing options
    6. Large number of write operations
      1. From orders topic
      2. From admin dashboard
    7. Trade offs
      1. Need to know the access patterns upfront
      2. We need a data access layer clearly split from the main code, because some DynamoDB optimizations, such as write sharding, decorate the data we write
      3. Spend more time on database design, optimization, and monitoring
      4. Learn how to interact with the service (no SQL)
      5. How to do data analytics
        1. AWS Glue
        2. CDC with streams
  22. Important metrics
    1. Throttling
    2. Errors
    3. Latency
    4. Hot keys / partitions
      1. Can be measured by counting the accesses to primary keys and finding outliers
      2. https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/contributorinsights_HowItWorks.html
      3. Partition throughput limits: 3000 RCUs / 1000 WCUs per partition per second
    5. Item size < 400KB
    6. Item collection size < 10 GB
    7. Query and scan limit of 1MB
      1. If your operation has additional results after 1MB, DynamoDB will return a LastEvaluatedKey property that you can use to handle pagination on the client side. We could monitor when we receive LastEvaluatedKey
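      A small sketch of handling the 1 MB limit with LastEvaluatedKey, as described above; the table name, key names, and helper are assumptions:
```python
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("app-table")  # hypothetical table

def query_all(pk: str) -> list:
    items, start_key = [], None
    while True:
        kwargs = {"KeyConditionExpression": Key("PK").eq(pk)}
        if start_key:
            kwargs["ExclusiveStartKey"] = start_key
        resp = table.query(**kwargs)
        items.extend(resp["Items"])
        start_key = resp.get("LastEvaluatedKey")  # present => more pages (worth monitoring)
        if not start_key:
            return items
```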
    8. Consumed capacity: WCU and RCU usage
    9. Cost (?)
    10. Datadog
      1. Integration
      2. Metrics
      3. Other interesting metrics from other providers