Some notes about A/B Testing (V 2.0)

What
1. the practice of showing two variants of the same web page to different segments of visitors at the same time and comparing which variant drives more conversions
2. A/B testing is a general methodology used online when testing product changes and new features
  1. A/B testing works best when testing incremental changes
  2. A/B testing doesn’t work well when testing major changes, like new products, new branding or completely new user experiences
3. the basis for Data-Driven Development (DDD)
4. two user experiences, with users randomly distributed between them
  1. randomization evens out all other factors on average
  2. so the difference between the two experiences can be measured by a single metric
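The random distribution of users can be sketched as deterministic, hash-based bucketing. Each user ID is hashed together with an experiment name, so a returning user always sees the same variant, while the population as a whole splits roughly 50/50. This is a minimal sketch, assuming string user IDs and a hypothetical experiment name `checkout_button`; real testing tools implement assignment in their own ways.

```python
import hashlib


def assign_variant(user_id: str, experiment: str, share_b: float = 0.5) -> str:
    """Return "A" or "B" for a user, stable across calls.

    Hashing experiment + user_id looks random across users
    but is deterministic for any single user.
    """
    key = f"{experiment}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).hexdigest()
    # Map the first 8 hex digits (32 bits) to a number in [0, 1).
    bucket = int(digest[:8], 16) / 0x100000000
    return "B" if bucket < share_b else "A"


if __name__ == "__main__":
    users = [f"user-{i}" for i in range(10_000)]
    variants = [assign_variant(u, "checkout_button") for u in users]
    share = variants.count("B") / len(variants)
    print(f"share of users in B: {share:.3f}")  # close to 0.5, not exact
```

Including the experiment name in the hash means the same user can land in different groups in different experiments, which keeps experiments independent of each other.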
What to test
1. Acquisition marketing
  1. channels
2. Product solutions
  1. think about the concept of product value
    1. think about ROI
3. technical solutions
  1. find strange scenarios
  2. understand the value of refactoring
4. business model
Why
1. to expand your business by acquiring new customers, and to build relationships by catering to existing ones
2. Solve Visitor Pain Points
3. Get Better ROI from Existing Traffic
4. Reduce Bounce Rate
5. Make Low-risk Modifications
6. Achieve Statistically Significant Improvements
7. Profitably Redesign your Website
8. Data-Driven Development (DDD)
Mistakes
1. Not Planning your Optimization Roadmap
  1. Invalid hypothesis
2. Testing too Many Elements Together
3. Ignoring Statistical Significance
4. Using Unbalanced Traffic
5. Testing for Incorrect Duration
6. Failing to Follow an Iterative Process
7. Using the Wrong Tools
8. Testing in a company that has not yet reached the required level of DDD maturity
  1. tests are ineffective
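For the "Using Unbalanced Traffic" mistake, one common check is a sample ratio mismatch (SRM) test: a chi-squared goodness-of-fit test comparing the observed number of users in each group with the intended split. This is a minimal sketch, assuming a two-group test; the function name `srm_p_value` is hypothetical and the check is one illustration of the mistake, not the only one.

```python
import math


def srm_p_value(n_a: int, n_b: int, expected_share_a: float = 0.5) -> float:
    """Chi-squared goodness-of-fit test (1 degree of freedom) for the A/B split.

    A very small p-value means the observed split is unlikely under the
    intended allocation, i.e. the traffic is unbalanced.
    """
    total = n_a + n_b
    expected_a = total * expected_share_a
    expected_b = total - expected_a
    chi2 = (n_a - expected_a) ** 2 / expected_a + (n_b - expected_b) ** 2 / expected_b
    # For 1 degree of freedom, P(chi2 > x) = erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(chi2 / 2))


print(srm_p_value(5000, 5050))  # small imbalance: large p-value, nothing suspicious
print(srm_p_value(5000, 5300))  # p-value well below 0.01: check the split itself
```

When this p-value is very small, the problem is in the traffic split rather than in the product change, and the test results should not be trusted until the cause is found.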
"+" / "-"
1. what is good
  1. learn more about our users
    1. to make strategic decisions
  2. guards against mistaking random fluctuations for product improvements
  3. a more flexible project infrastructure for release management: quick release of changes
    1. Teams know the main metrics
    2. the team stays aware of changes and is involved in the process
  4. hypotheses are formulated when setting tasks for the team; the team understands the metrics and links them to the company's goals, which shows how relevant each business goal is
2. and what is bad
  1. DDD degrades the design
  2. tests are hard to run correctly, and it is easy to draw the wrong conclusions
  3. a cool idea may turn out not to affect the business
    1. very disappointing
  4. long, expensive, tiring
  5. local optimum trap
    1. a big step is very expensive
  6. focus on quick, short-term, easily understood goals; no focus on the long term
Maths
1. population and sample
  1. population (general population)
    1. all objects that interest us
  2. sample
    1. the part of the population we actually observe; from its results we try to draw conclusions about the whole population
2. statistical methods
  1. sample studies
    1. help us make informed decisions based on probabilities
  2. estimation accuracy
    1. confidence interval
      1. a range built so that, with a chosen probability (the confidence level), it contains the true value
      2. how to calculate it
      3. https://en.wikipedia.org/wiki/Binomial_proportion_confidence_interval
      4. https://sample-size.net/confidence-interval-proportion/
      5. https://www.calculator.net/confidence-interval-calculator.html
      6. c ± 1.645 × sqrt(c × (1 - c) / N)
      7. c - conversion rate in the sample
      8. 1.645 - z-coefficient for the chosen confidence level (90%, two-sided interval)
      9. N - number of observations
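The confidence-interval formula above can be computed directly. This is a minimal sketch of the normal-approximation (Wald) interval, assuming a hypothetical sample of 50 conversions out of 1,000 visitors; the function name `conversion_ci` is illustrative.

```python
import math


def conversion_ci(conversions: int, n: int, z: float = 1.645) -> tuple[float, float]:
    """Normal-approximation (Wald) interval: c ± z * sqrt(c * (1 - c) / N).

    z = 1.645 gives a two-sided 90% confidence interval.
    """
    c = conversions / n
    margin = z * math.sqrt(c * (1 - c) / n)
    return c - margin, c + margin


low, high = conversion_ci(conversions=50, n=1000)
print(f"90% CI: [{low:.4f}, {high:.4f}]")  # 90% CI: [0.0387, 0.0613]
```

This simple interval becomes unreliable for small N or for conversion rates close to 0 or 1. The Wikipedia article on binomial proportion confidence intervals describes alternatives such as the Wilson score interval.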
Result
1. Stat test
  1. Task
    1. determine whether the difference in results is caused by the product change rather than by chance
      1. possible outcomes (mistakes)
        1. true negative
          1. there is no difference, and the test shows none
        2. false negative
          1. Type II error: the test showed no difference, but there is one
          2. experiment power (sensitivity)
            1. the probability of detecting an effect that actually exists
            2. equals 1 - β, where β is the probability of making a Type II error
            3. depends on
              1. the selected confidence level
              2. the sample size
              3. the effect size: the smaller the actual effect, the more likely it is to be overlooked
        3. false positive
          1. Type I error: the test showed a difference, but there is none
          2. confidence level
            1. the probability of making this mistake is the significance level α = 1 - confidence level
          3. p-value
            1. how surprising the observed result would be if there were no real difference
            2. the probability of obtaining test results at least as extreme as the results actually observed, under the assumption that the null hypothesis is correct
        4. true positive
          1. there is a difference, and the test shows it
  2. tools
    1. https://abtestguide.com/calc/
    2. https://abtestguide.com/abtestsize/
  3. structure
    1. Let's use H0
      1. assumption: there is no difference; check whether the data contradict this assumption
      2. null hypothesis (H0)
        1. there is no difference
      3. alternative hypothesis (H1)
        1. samples A and B are taken from populations with different distributions
    2. options
      1. for binary values (e.g. converted / did not convert)
        1. G-test
        2. χ² (chi-squared) test
      2. for continuous variables (e.g. revenue per user)
        1. Student's t-test
  4. triangle: three quantities that trade off against each other
    1. confidence level and power level
    2. sample size
    3. minimum detectable effect
    4. find a balance
      1. for a more sensitive experiment we increase the sample size
      2. if you reduce the number of observations, the minimum detectable effect becomes larger
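The H0 check for conversion rates can be sketched with a pooled two-proportion z-test. It is swapped in here for the G-test and χ² test listed above because it is short to write with the standard library; for a 2×2 table its z² equals the Pearson χ² statistic without continuity correction, so both give the same two-sided p-value. This is a minimal sketch with hypothetical numbers (200/5,000 conversions in A, 250/5,000 in B).

```python
import math
from statistics import NormalDist


def two_proportion_z_test(conv_a: int, n_a: int, conv_b: int, n_b: int,
                          one_sided: bool = True) -> tuple[float, float]:
    """Pooled two-proportion z-test.

    H0: A and B have the same conversion rate.
    H1 (one-sided): B converts better than A.
    Returns (z, p_value).
    """
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)  # common conversion rate if H0 is true
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    if one_sided:
        p_value = 1 - NormalDist().cdf(z)
    else:
        p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


z, p = two_proportion_z_test(conv_a=200, n_a=5000, conv_b=250, n_b=5000)
print(f"z = {z:.2f}, one-sided p-value = {p:.4f}")  # roughly z = 2.41, p = 0.008
```

At a 90% confidence level, H0 is rejected when the p-value is below 0.10, so in this example the data contradict the "no difference" assumption.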
Let's Test
1. Research
  1. How the website is currently performing
  2. use Heatmap tools
  3. quantitative and qualitative research
  4. Observe and formulate a hypothesis
2. Define Typical Values
  1. Confidence level
    1. 90%
      1. Type I error: the test shows an effect, but in fact there is none (10% risk at this level)
  2. Power
    1. 80%
      1. Type II error: the test misses an effect that actually exists (20% risk at this power)
  3. experiment duration
    1. round up to whole weeks, so that every day of the week is covered equally
  4. Hypothesis type
    1. One-sided
3. Create samples
  1. Let's calculate the size of the required sample
    1. https://abtestguide.com/abtestsize/
    2. find out the duration of the experiment
4. Run Test
  1. Split URL Testing
    1. How
      1. Split URL Testing is testing multiple versions of your webpage hosted on different URLs
      2. to compare two versions of a product
      3. to find out how changes in your product have affected its use: to compare the key product metrics for each version
    2. Strategy
      1. Setting up pages for the Split URL test
      2. Adding conversion goals and estimating test duration
      3. Finalizing the test
      4. Previewing and starting the test
  2. Multivariate Testing (MVT)
    1. changes are made to multiple sections of a webpage, and variations are created for all the possible combinations
  3. Multipage Testing
    1. to test changes to particular elements across multiple pages
5. Let's test our results
  1. estimate the p-value
    1. https://abtestguide.com/calc/
6. use tools
  1. calculators
    1. AB Testguide
      1. https://abtestguide.com/calc/
    2. GTM testing
      1. Google Tag Manager
      2. https://abtestguide.com/gtmtesting/
    3. Bayesian A/B-test Calculator
      1. https://abtestguide.com/bayesian/
    4. Optimizely
      1. https://www.optimizely.com/sample-size-calculator/
    5. ProdCalc
      1. https://prodcalc.app/
    6. A/B Split & Multivariate Test Duration Calculator
      1. https://vwo.com/tools/ab-test-duration-calculator/
    7. CLT (central limit theorem) for means
      1. https://gallery.shinyapps.io/CLT_mean/
    8. Normal table (z-table, right-tailed)
      1. http://www.normaltable.com/ztable-righttailed.html
    9. Distribution Calculator
      1. https://gallery.shinyapps.io/dist_calc/
    10. Sample Size Calculator (Evan’s Awesome A/B Tools)
      1. https://www.evanmiller.org/ab-testing/sample-size.html
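The "Define Typical Values" and "Create samples" steps can be sketched as a sample-size calculation followed by rounding the duration up to whole weeks. This minimal sketch uses the common normal-approximation formula for a one-sided two-proportion test with 90% confidence and 80% power; online calculators may use slightly different formulas, so their numbers can differ a little. The base rate, relative uplift and `daily_visitors` value are hypothetical.

```python
import math
from statistics import NormalDist


def sample_size_per_variant(base_rate: float, relative_mde: float,
                            confidence: float = 0.90, power: float = 0.80) -> int:
    """Approximate users needed per variant for a one-sided two-proportion test."""
    p1 = base_rate
    p2 = base_rate * (1 + relative_mde)  # smallest uplift we want to detect
    z_alpha = NormalDist().inv_cdf(confidence)  # one-sided critical value
    z_beta = NormalDist().inv_cdf(power)
    p_bar = (p1 + p2) / 2
    numerator = (z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return math.ceil(numerator / (p2 - p1) ** 2)


def duration_weeks(n_per_variant: int, variants: int, daily_visitors: int) -> int:
    """Days needed to collect the sample, rounded up to whole weeks."""
    days = math.ceil(n_per_variant * variants / daily_visitors)
    return math.ceil(days / 7)


n = sample_size_per_variant(base_rate=0.05, relative_mde=0.10)
weeks = duration_weeks(n, variants=2, daily_visitors=2000)
print(n, weeks)  # about 18,000 users per variant; 3 weeks
```

Doubling `relative_mde` shrinks the required sample roughly fourfold, which is the triangle trade-off: a larger detectable effect needs fewer observations.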
CI-process
1. Measure
2. Prioritize
  1. CIE Prioritization Framework
    1. Confidence
    2. Importance
    3. Ease
3. A/B test
4. Repeat
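The Prioritize step can be sketched as scoring each hypothesis on Confidence, Importance and Ease and testing the highest scores first. This is a minimal sketch assuming a 1-10 scale for each criterion and a plain average as the score; both the scale and the averaging rule are assumptions, and the backlog items are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Hypothesis:
    name: str
    confidence: int  # how sure we are the change will work, 1-10
    importance: int  # how much the affected page or traffic matters, 1-10
    ease: int        # how easy it is to build and test, 1-10

    @property
    def score(self) -> float:
        # Assumed scoring rule: plain average of the three ratings.
        return (self.confidence + self.importance + self.ease) / 3


def prioritize(hypotheses: list[Hypothesis]) -> list[Hypothesis]:
    """Return hypotheses sorted by CIE score, highest first."""
    return sorted(hypotheses, key=lambda h: h.score, reverse=True)


backlog = [
    Hypothesis("shorter signup form", confidence=7, importance=8, ease=6),
    Hypothesis("new pricing page layout", confidence=5, importance=9, ease=3),
    Hypothesis("green CTA button", confidence=3, importance=4, ease=9),
]
for h in prioritize(backlog):
    print(f"{h.score:.1f}  {h.name}")
```

Teams sometimes weight the criteria differently; changing only the `score` property keeps the rest of the sketch intact.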
We are working in a changing and unpredictable environment
1. our changes to the product
  1. better
  2. worse
  3. typical ways of checking a hypothesis, and their pitfalls
    1. user feedback
      1. feedback is not always truthful and relevant
    2. Sampling bias
      1. feature usage statistics
      2. biases
      3. confusing correlation with causation
      4. survivorship bias
    3. Before/after comparisons over time
      1. the product is influenced by many factors
      2. changes in competitors
      3. technical features, new technologies
      4. the product has become faster / slower
      5. seasonal demand
      6. pure chance
    4. Evolutionary (cognitive) biases
      1. we see patterns where there are none
      2. we notice the facts that confirm we are right and ignore the rest (confirmation bias)
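The "pure chance" and "patterns where there are none" points can be made concrete with simulated A/A tests, where both groups get the identical experience. At a 90% confidence level, roughly one test in ten still reports a "significant" difference. This is a minimal sketch; the 10% conversion rate, 1,000 users per group and 400 runs are arbitrary assumptions.

```python
import math
import random
from statistics import NormalDist


def z_test_p_value(conv_a: int, n_a: int, conv_b: int, n_b: int) -> float:
    """Two-sided pooled two-proportion z-test p-value."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0  # no conversions at all (or all converted): no evidence
    z = (conv_b / n_b - conv_a / n_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def aa_false_positive_rate(runs: int = 400, n: int = 1000, rate: float = 0.10,
                           alpha: float = 0.10, seed: int = 42) -> float:
    """Share of A/A tests (no real difference) that look significant."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(runs):
        conv_a = sum(rng.random() < rate for _ in range(n))
        conv_b = sum(rng.random() < rate for _ in range(n))
        if z_test_p_value(conv_a, n, conv_b, n) < alpha:
            hits += 1
    return hits / runs


print(f"share of A/A tests that look significant: {aa_false_positive_rate():.2f}")
```

Every one of those "significant" results is a Type I error, which is why a single test that barely crosses the threshold is weak evidence on its own.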
Sources
1. https://vc.ru/flood/6371-ab-errors
2. https://vwo.com/ab-testing/
3. https://blog.hubspot.com/marketing/how-to-do-a-b-testing
4. https://www.crazyegg.com/blog/ab-testing/
5. https://medium.com/@robbiegeoghegan/implementing-a-b-tests-in-python-514e9eb5b3a1
6. https://classroom.udacity.com/courses/ud257/lessons/4018018619/concepts/40043986940923
my sketch
1. https://twitter.com/ManukhinaDarya/status/1295284820365520896?s=20