tc's page

peace & chaos

not posts, but wikis

Mechanisms

These are some good habits and practices that I learned and established during my time at Amazon.

  1. Done-Done
  2. Working backwards
  3. COE (Correction of Errors)
    1. 5-WHYs
    2. Template
  4. Programmatic Automation
  5. Flywheel
  6. Focus on input improvement, measure with outcome metrics
  7. RCA: once fixed, it remains fixed.
  8. Good intentions won’t work, mechanisms do.

Toolkit by Design

  • Phonetool: everybody wants a phonetool
  • Sage: An internal Q&A forum
    • leveling and badge (phonetool)
  • TamperMonkey: framework for internal helper tools
  • Paste: A simple notebook and sharing tool

  • broadcast, particularly PoA + BuildTool + AWSPerfEng
  • SDE Good Reads
  • Users/Tcchao (personal wiki/blog culture)
  • Tech Blog

  • CR tool (w. add-ons and even customized add-ons)
    • dryrun
    • goodcorp
  • Profile service
  • Weblab tool (w. overrides)

  • VersionSet
  • BuildTool
  • Conflict Resolver

References

GTD

Brad Porter's common steps of all projects (my annotations). I find this framework a good way of thinking about projects. I’ve put in my thoughts and links to papers I like that go a level deeper.

Projects at Amazon tend to succeed only when they complete each of these steps in order.

# Problem identified
All problems at Amazon are customer problems (internal reference, good book)
Why now?
This is typically a PM/PM-T, with SDE/TPM/PM-T weighing in on technical feasibility
Artifacts: Email, PRFAQ, 1/3/6 Pager (On Writing)

# Problem widely understood (between leaders, stakeholders...; company-wide requires a much broader group to understand) - If the problem is not widely understood, you will not get much backing to solve it.
This is usually the first point at which a tie-break is needed, if a similar system already exists
Do you have data to make a decision? Understand the customer point of view: what's being done now, what's the process, what's the reality on the ground? This helps ensure that the problem is understood and that the coming decision is data-driven.
The SDE L6+ role here is to help steer toward technically credible solutions ("is this nuts?")
Artifacts: Email, PRFAQ, 1/3/6 Pager (On Writing)

# Owner Identified - Nothing happens at Amazon without a tagged owner that everyone believes owns it. Usually decided in the management chain, someone stepping up, assigned by someone more senior, etc.
Ownership modes should follow the PE Roles framework (really, the L6+ Roles framework) (Roles Framework)
Focus on establishing high-velocity decision making (Alignment vs Decision Making)
This is typically a TPM for an OLT/STeam goal, or an SDE if we don't have a TPM leveled to the scope
Artifacts: Meeting minutes, Project wiki, email list

# Plan in Place
Define the parameters of a 'good' solution: constraints, SLA/SLO, Tenets, are dates important? (Tenets)
Define how to measure success, input/output metrics (Metrics and Measurement, good book)
Focus on how a solution works (not what we will build); this scales better across teams
Define scope boundary, possibly leading to forked off problems (back to step 1)
Don't do all the design upfront; bound sub-designs by constraints so that they can be built in "step 5: executing"
Separate a roadmap of 'ideal' end state from milestones of incremental value
Artifacts: "How" technical document, prototype code (catalyst) (On Writing)

# Executing (you may not do all the work here but you HAVE to track it.)
Design and build iteratively to settle 'what' to build so that it meets the 'how' from step 4.
Define key milestones to deliver value
Artifacts: Design docs, Code, Test Plans (On Writing)

# Adoption / Solution (do not stop at PLAN, this is a common PE failure pattern).
Track with business/health metrics.
Guide the growth from pilot to everywhere; scaling to new regions and customers stresses designs
Remove the short-term decisions (tech debt) used to launch quickly
Help re-evaluate milestones based on customer data, recommend changes as necessary
Artifacts: Dashboards, Page 0 metrics, SIMs/CRs removing tech debt and bug fixing, helping with "escalations"
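
As a rough illustration of the "in order" rule above, here is a minimal Python sketch that treats the six steps as an ordered checklist and flags any step that was skipped. The names `Step`, `next_step` and `skipped_steps` are my own and hypothetical; they are not part of the framework or of any Amazon tool.

```python
from enum import IntEnum


class Step(IntEnum):
    """The six project steps, numbered in the order listed above."""
    PROBLEM_IDENTIFIED = 1
    PROBLEM_WIDELY_UNDERSTOOD = 2
    OWNER_IDENTIFIED = 3
    PLAN_IN_PLACE = 4
    EXECUTING = 5
    ADOPTION = 6


def next_step(completed):
    """Return the first step not yet completed, scanning in order."""
    for step in Step:
        if step not in completed:
            return step
    return None  # every step is done


def skipped_steps(completed):
    """Return steps left undone even though a later step was completed."""
    if not completed:
        return []
    furthest = max(completed)
    return [s for s in Step if s < furthest and s not in completed]


# Hypothetical project that has an owner and a plan, but never made
# sure the problem was widely understood.
project = {Step.PROBLEM_IDENTIFIED, Step.OWNER_IDENTIFIED, Step.PLAN_IN_PLACE}
print(next_step(project).name)
print([s.name for s in skipped_steps(project)])
```

A non-empty result from `skipped_steps` is the warning sign: the project has moved ahead of a step it still has to go back and complete.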

S3 EE

S3 L6 Engineer LPs

The Amazon L6 leveling guidelines describe what makes a successful L6 engineer at Amazon. In January 2020, the S3G PEs met to go deeper on a subset of these guidelines that we wanted to discuss across the org. That resulted in the following writeup. Note that these aren't “new requirements” just for S3G engineers. These are from the leveling guidelines doc, just explained in a different way.

Quality obsessed. The quality of what your team delivers directly impacts your customer's experience, and so you own the quality of the software and operations in your space. You are an effective steward of security, durability, and availability. You drive your team’s thinking on testing, operational readiness, and durability to the level needed for the most trusted object store on the planet.  You partner with other leaders to help create a culture that ensures your teams think in terms of quality and safety.

Delivery Oriented. Our roadmap exists to deliver value to customers, and so you find ways to help your teams get things over the finish line.  This involves producing exemplary designs and code. But you recognize that you must deliver through others to scale out your contributions.  To do so, you mentor teammates on their work directly but also mentor teammates on how to mentor others.  When your team is deadlocked on a decision, you help move things forward rather than digging in and defending “your side”.  You recognize that your teams may tend to be overly cautious at the expense of delivery, and you find approaches and tradeoffs to move forward quickly while maintaining safety.

L6 engineers need a lot of "grit" to deliver.  S3G is a multi-billion-dollar business with over 100 trillion objects.  No projects are inherently easy.  L6 leaders need to figure out the right problems to solve in their services to deliver on the broader problems we want to be solving.

Seek and Incorporate Diverse Feedback. You seek out and incorporate diverse feedback from outside of your immediate team on your goals, key service metrics, and designs for your teams and projects. This may involve iteration on your ideas or letting them go altogether.  You are transparent about what you’re working on to other technical leaders in object storage.  On the flip side, you make yourself available to provide feedback to others and help build communities with key peers for ideas to be discussed.

Be a Subject Matter Expert. You understand your team’s software, the value it provides, and the challenges it presents. You’re also deeply familiar with what your teams are working on and why those projects are valuable to customers.  You have a deep understanding of the broader S3G architecture in which your services must exist.  You use this knowledge to inform your decisions, and others seek you out for your expertise. 

Influence the roadmap. You deeply understand what your customers need from your space, and you use that understanding to shape your team’s key service metrics and goals.  You're a partner to your SDM. You disambiguate the work we need to do to achieve a goal. You decompose work into projects/milestones for your fellow SDEs. 

Have Backbone. You speak up when you notice a problem or opportunity to do better.  This could be in response to quality problems, velocity problems, or even in response to a team working on the wrong thing altogether.  At the same time, be part of the solution: clarify the problem, drive to a solution, and start the conversation with your own alternatives.

S3 Tenets

Tenets
Secure – We recognize that threats evolve and we continuously raise the bar on the protection of customer data. Customer data is secure by default and we provide built-in security controls that are simple to use, easy to audit, and scalable. 
Reliable – We exceed customer expectations for durability and availability of each storage class. Our systems are resilient to failures, and do not need planned or unplanned downtime. 
Scalable – We scale availability, speed, throughput, capacity, and robustness to support an unlimited number or variety of web-scale applications. We design our systems to use scale as an advantage so that system growth increases, not decreases, our availability, speed, throughput, capacity, and robustness. 
Cost Effective – We charge only for what is used and continually innovate in software and hardware to drive down cost. We pass savings along to our customers. Customers should not be able to easily build or buy the same services elsewhere for less. 
Simple – We build our features to be easy to understand, easy to adopt and easy to use. We provide a customer experience that is intuitive and effective in solving customer problems. We help customers avoid making mistakes and recover if a mistake is made. 
Performance – We deliver consistent performance in our storage classes. While being mindful of costs, we work to improve our offerings or deliver new offerings to meet our customers’ evolving performance requirements. 
First Use Matters – Our largest customer tomorrow may be using our services for the first time today. We balance our investments and aren’t afraid to self-disrupt to ensure our services remain differentiated and compelling for these customers.
Innovative – We innovate on behalf of customers by listening carefully and not being constrained by perceptions or existing implementations. We invest in testing and safety mechanisms so that our teams can iterate rapidly without compromising the customer experience. We plant seeds, protect saplings, and double down when we see customer delight.

Peru Public Doc

The Peru project public wiki was a great read; what impressed me most is actually this structure. It is the kind of structure that MUST be answered whenever a re-architecture is proposed.

The rest of this document will attempt to articulate:
1. Our goals.
2. The current state of affairs.
3. How did we get here?
4. Why can’t the <live VersionSet> be fixed?
5. Why are we looking at an evolution of <Brazil> and not just counting on <CodeSuite>?
6. What would a replacement for the live VersionSet look like in terms of content and user experience?
7. Why do we think we won’t make the same mistakes?