
Commit 3b7e8dc

Updated README and docs
1 parent 7da8991 commit 3b7e8dc

3 files changed: +138 −19 lines changed

README.md

Lines changed: 109 additions & 6 deletions
@@ -1,13 +1,116 @@
-#slots
-###*A multi-armed bandit library for Python*
+# slots
+### *A multi-armed bandit library for Python*

Slots is intended to be a basic, very easy-to-use multi-armed bandit library for Python.

-See [slots-notes.md](https://github.com/roycoding/slots/blob/master/slots-notes.md) for design ideas.
-
-####Author
+#### Author
[Roy Keyes](https://roycoding.github.io) -- roy.coding@gmail

-####License: BSD
+#### License: BSD
See [LICENSE.txt](https://github.com/roycoding/slots/blob/master/LICENSE.txt)

+
+### Introduction
+slots is a Python library designed to allow the user to explore and use simple multi-armed bandit (MAB) strategies. The basic concept behind the multi-armed bandit problem is that you are faced with *n* choices (e.g. slot machines, medicines, or UI/UX designs), each of which results in a "win" with some unknown probability. Multi-armed bandit strategies are designed to let you quickly determine which choice will yield the highest result over time, while reducing the number of tests (or arm pulls) needed to make this determination. Typically, MAB strategies attempt to strike a balance between "exploration", testing different arms in order to find the best, and "exploitation", using the best known choice. There are many variations of this problem; see [here](https://en.wikipedia.org/wiki/Multi-armed_bandit) for more background.
+
+slots provides a hopefully simple API to allow you to explore, test, and use these strategies. Basic usage looks like this:
+
+```Python
+import slots
+
+# Try 3 bandits with arbitrary win probabilities
+b = slots.MAB()
+b.run()
+```
+
+To inspect the results and compare the estimated win probabilities versus the true win probabilities:
+```Python
+b.best
+> 0
+
+# Assuming payout of 1.0 for all "wins"
+b.est_payouts()
+> array([ 0.83888149, 0.78534031, 0.32786885])
+
+b.bandits.probs
+> [0.8020877268854065, 0.7185844454955193, 0.16348877912363646]
+```
+
+For "real world" (online) usage, test results can be sequentially fed into an `MAB` object. The tests will continue until a stopping criterion is met.
+
+Using slots to determine the best of 3 variations on a live website:
+```Python
+mab = slots.MAB(live=True, payouts=[0]*3)
+```
+
+Make the first choice randomly, record the response, and input the reward (here arm 2 was chosen and paid out). Run online trials (feeding in the most recent result each time) until the stopping criterion is met.
+```Python
+mab.online_trial(bandit=2, payout=1)
+```
+
+The response of `mab.online_trial()` is a dict of the form:
+```Python
+{'new_trial': boolean, 'choice': int, 'best': int}
+```
+Where:
+- If the stopping criterion is met, `new_trial` = `False`.
+- `choice` is the current choice of arm to try.
+- `best` is the current best estimate of the highest-payout arm.
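A minimal sketch of such an online loop, assuming a hypothetical `get_reward()` helper that returns the observed payout each time an arm is shown:

```Python
import slots

mab = slots.MAB(live=True, payouts=[0]*3)

choice = 0  # arbitrary first arm
while True:
    payout = get_reward(choice)  # hypothetical: observe the live result for this arm
    result = mab.online_trial(bandit=choice, payout=payout)
    if not result['new_trial']:  # stopping criterion met
        break
    choice = result['choice']  # arm the strategy wants to try next

print('Best arm:', result['best'])
```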
+
+By default, slots uses the epsilon greedy strategy. Besides epsilon greedy, the softmax and upper credible bound (UCB) strategies are also implemented.
+
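For example, per the API notes in slots-docs.md below, a specific strategy and its parameters can be passed to `run()` (the values here are illustrative):

```Python
import slots

b = slots.MAB()
# Epsilon greedy with a larger exploration rate and more trials
b.run(strategy='eps_greedy', params={'eps': 0.2}, trials=10000)
print(b.best)
```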
+#### Regret analysis
+A common metric used to evaluate the relative success of a MAB strategy is "regret". This reflects the fraction of payouts (wins) that has been lost by using the sequence of pulls versus the currently best known arm. The current regret value can be calculated by calling the `mab.regret()` method.
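As a rough numerical illustration of this definition (a sketch assuming per-trial normalization, consistent with the 0–0.2 range of the regret plot below; the pull and win counts are invented):

```Python
probs = [0.4, 0.9, 0.8]  # true win probabilities (payout = 1.0)
pulls = [10, 80, 10]     # invented: pulls per arm after 100 trials
wins = [4, 72, 8]        # invented: wins observed per arm

trials = sum(pulls)
best_expected = max(probs) * trials  # expected payout from always pulling the best arm
actual_payout = sum(wins)            # payout actually received
regret = (best_expected - actual_payout) / trials
print(regret)  # 0.06
```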
+
+For example, the regret curves for several different MAB strategies can be generated as follows:
+```Python
+import matplotlib.pyplot as plt
+import seaborn as sns
+import slots
+
+# Test multiple strategies for the same bandit probabilities
+probs = [0.4, 0.9, 0.8]
+
+ba = slots.MAB(probs=probs)
+bb = slots.MAB(probs=probs)
+bc = slots.MAB(probs=probs)
+
+# Run trials and calculate the regret after each trial
+rega = []
+regb = []
+regc = []
+for t in range(10000):
+    ba._run('eps_greedy')
+    rega.append(ba.regret())
+    bb._run('softmax')
+    regb.append(bb.regret())
+    bc._run('ucb')
+    regc.append(bc.regret())
+
+# Pretty plotting
+sns.set_style('whitegrid')
+sns.set_context('poster')
+
+plt.figure(figsize=(15,4))
+plt.plot(rega, label=r'$\epsilon$-greedy ($\epsilon$=0.1)')
+plt.plot(regb, label='Softmax ($T$=0.1)')
+plt.plot(regc, label='UCB')
+plt.legend()
+plt.xlabel('Trials')
+plt.ylabel('Regret')
+plt.title('Multi-armed bandit strategy performance (slots)')
+plt.ylim(0,0.2);
+```
+![](./misc/regret_plot.png)
+
+### API documentation
+For documentation on the slots API, see [slots-docs.md](https://github.com/roycoding/slots/blob/master/slots-docs.md).
+
+### Todo list:
+- More MAB strategies
+- Bayesian bandits
+- Argument to save regret values after each trial in an array.
+- TESTS!

misc/regret_plot.png

41.4 KB

slots-notes.md renamed to slots-docs.md

Lines changed: 29 additions & 13 deletions
@@ -1,6 +1,10 @@
-#Multi-armed bandit library notes
+# slots
+## Multi-armed bandit library in Python

-### What does the library need to do?
+## Documentation
+This document details the current and planned API for slots. Non-implemented features are noted as such.
+
+### What does the library need to do? An aspirational list.
1. Set up N bandits with probabilities, p_i, and payouts, pay_i.
2. Implement several MAB strategies, with kwargs as parameters, and consistent API.
3. Allow for T trials.
@@ -10,7 +14,8 @@
2. number of trials completed for each arm
3. scores for each arm
4. average payout per arm (payout*wins/trials?)
-5. Current regret. Regret = Trials*mean_max - sum^T_t=1(reward_t) See [ref](https://www.princeton.edu/~sbubeck/SurveyBCB12.pdf)
+5. Current regret. Regret = Trials * mean_max - sum_{t=1}^{T}(reward_t)
+    - See [ref](http://research.microsoft.com/en-us/um/people/sebubeck/SurveyBCB12.pdf)
6. Use sane defaults.
7. Be obvious and clean.

@@ -32,47 +37,53 @@ mab = slots.MAB(payouts = [1,10,15])

# Bandits with payouts specified by arrays (i.e. payout data with unknown probabilities)
# payouts is an N * T array, with N bandits and T trials
+# (Partially implemented)
mab = slots.MAB(live = True, payouts = [[0,0,0,0,1.2,0,0],[0,0.1,0,0,0.1,0.1,0]])
```

Running tests with strategy, S

```Python
-# Default: Epsilon-greedy, epsilon = 0.1, num_trials = 1000
+# Default: Epsilon-greedy, epsilon = 0.1, num_trials = 100
mab.run()

-# Run chosen strategy with specified parameters and trials
-mab.eps_greedy(eps = 0.2, trials = 10000)
+# Run chosen strategy with specified parameters and number of trials
mab.run(strategy = 'eps_greedy', params = {'eps':0.2}, trials = 10000)

# Run strategy, updating old trial data
+# (NOT YET IMPLEMENTED)
mab.run(continue = True)
```

Displaying / retrieving bandit properties

```Python
# Default: display number of bandits, probabilities and payouts
+# (NOT YET IMPLEMENTED)
mab.bandits.info()

# Display info for bandit i
+# (NOT YET IMPLEMENTED)
mab.bandits[i]

# Retrieve bandits' payouts, probabilities, etc
mab.bandits.payouts
mab.bandits.probs

# Retrieve count of bandits
+# (NOT YET IMPLEMENTED)
mab.bandits.count
```

Setting bandit properties

```Python
# Reset bandits to defaults
+# (NOT YET IMPLEMENTED)
mab.bandits.reset()

# Set probabilities or payouts
+# (NOT YET IMPLEMENTED)
mab.bandits.probs_set([0.1,0.05,0.2,0.15])
mab.bandits.payouts_set([1,1.5,0.5,0.8])
```
@@ -84,33 +95,38 @@ Displaying / retrieving test info
mab.best()

# Retrieve bandit probability estimates
+# (NOT YET IMPLEMENTED)
mab.prob_est()

# Retrieve bandit probability estimate of bandit i
+# (NOT YET IMPLEMENTED)
mab.prob_est(i)

# Retrieve bandit payout estimates (p * payout)
-mab.payout_est()
+mab.est_payout()

# Retrieve current bandit choice
+# (NOT YET IMPLEMENTED, use mab.choices[-1])
mab.current()

# Retrieve sequence of choices
mab.choices

-# Retrieve probabilty estimate history
+# Retrieve probability estimate history
+# (NOT YET IMPLEMENTED)
mab.prob_est_sequence

# Retrieve test strategy info (current strategy) -- a dict
+# (NOT YET IMPLEMENTED)
mab.strategy_info()
```

### Proposed MAB strategies
-1. Epsilon-greedy
-2. Epsilon decreasing
-3. Softmax
-4. Softmax decreasing
-5. Upper credible bound
+- [x] Epsilon-greedy
+- [ ] Epsilon decreasing
+- [x] Softmax
+- [ ] Softmax decreasing
+- [x] Upper credible bound

### Example: Running slots with a live website
```Python
