url:: https://youtu.be/CZ3wIuvmHeM ## Challenges and Solutions ### Dependency #### Intra-service Requests - Microservice A talking to Microservice B - Problems - Network Latency, Congestion, failure - Logical or Scaling failure - Solutions - Have a fallback service to call or at the very least a static response that allows the customer to carry on with their business - fail fast, and return to the fallback or wait to recover - (FIT) Fault Injection Testing - Synthetic Transactions - Override by device or account - % of live traffic up to 100% (test a launched service under load from live customers) - Enforced thoughout the call - How do we contrain testing scope? - the most critical services are identified as a group for barest functionality and a FIT reciepe is made and blacklists all non-essential services #### Client Libraries #### Data Persistence - CAP Theorem: "In the presence of a network partition, you much choose between consistency and availability." - If you have 1 service needing to write to 3 databases, what if one write fails? - Do you cancel the write? or do you write to what you can? - you can aim for eventual consistency by writing to what databases you can and settle up later - The client writes to one node which then orchestrates the writing to all the other nodes - "Local Quorum" #### Infrastructure - Have redundant hosting across nodes to prevent catastrophic down time ### Scale #### Stateless Services - Its not a cache or database - frequently accessed metadata - no instance affinity - loss of a node is a non-event - Autoscaling groups - Compute efficiency - Node failure - Traffic Spikes - Performance Bugs - Chaos monkey tool test that when a node dies, the service continues to work #### Stateful Services - databases and caches - sometimes a custom app that holds large amounts of data (avoid storing business logic, and state within 1 application if you can avoid it) - loss of a node is a notable event - redundancy is fundemental - EVCache -> difference nodes -> each node has multiple shard caches - separate out systems used for batch versus real time transactions - do request level caching - have an encrypted token with the data to fall back on should the service be unavailable to updated the requested data ### Variance #### Operation Drift - drift over time - alert thresholds - timeouts, retries, fallbacks - throughput (RPS) - Across microservices - Reliability best practices - Continious learning and automation - Incident --> Resolution --> Review --> Remediation --> Analysis --> Best Practices? --> Automation --> Adoption - Production Ready best practices - Alerts - Apache & tomcat - Automated canary Analysis - Autoscaling - Chaos - Consistent naming - ELB Config - Healthcheck - Immutable machine images - Squeeze testing - Staged, red/black deployments - Timeouts, retries, fallbacks #### Polyglot & Containers - The Paved Road (do this for a smooth experience) - Stash - Nebula/Gradle - BaseAMI/Ubuntu - Jenkins - Spinnaker - Runtime Platform - Cost of Variance - Productivity Tooling - Insight & Triage Capabilities - Base Image Fragmentation - Node management - Library/Platform duplication - Learning curve - production expertise - Strategic Stance - Raise awareness of costs - Constrain centralized support - Prioritize by impact - Seek reusable solutions ### Change - Integrated, Automated practices - Conformity checks - Red/black pipelines - Automated canaries - Staged deployments - Squeeze tests