Commit a009954

Add doctest & more tests (#8)

* adding doctest
* add doctest workflow
* add ml100k download step
* confused by curl vs wget
* add sphinx deps
* Improve Readme & index.rst
* test for encoders
* add test for dataframeencoder
1 parent 3028ff1 commit a009954

28 files changed: +519 -166 lines changed

.github/workflows/doctest.yml

Lines changed: 34 additions & 0 deletions
@@ -0,0 +1,34 @@
+name: Doctest
+on: [push]
+jobs:
+  run_pytest_upload_coverage:
+    runs-on: ubuntu-latest
+    env:
+      OS: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v2
+        with:
+          fetch-depth: 0
+      - name: Setup Python
+        uses: actions/setup-python@master
+        with:
+          python-version: "3.8"
+      - name: Build myfm
+        run: |
+          pip install --upgrade pip
+          pip install numpy scipy pandas scikit-learn
+          python setup.py install
+          curl http://files.grouplens.org/datasets/movielens/ml-100k.zip -o ~/.ml-100k.zip
+      - name: Run pytest
+        run: |
+          pip install pytest phmdoctest sphinx==4.4.0 sphinx_rtd_theme
+      - name: Test Readme.md
+        run: |
+          GEN_TEST_FILE=phmdoctest_out.py
+          phmdoctest README.md --outfile "$GEN_TEST_FILE"
+          pytest "$GEN_TEST_FILE"
+          rm "$GEN_TEST_FILE"
+      - name: Run sphinx doctest
+        run: |
+          cd doc
+          make doctest

README.md

Lines changed: 13 additions & 19 deletions
@@ -1,11 +1,13 @@
 # myFM
+[![Python](https://img.shields.io/badge/python-3.7%20%7C%203.8%20%7C%203.9%20%7C%203.10-blue)](https://www.python.org)
+[![pypi](https://img.shields.io/pypi/v/myfm.svg)](https://pypi.python.org/pypi/myfm)
+[![GitHub license](https://img.shields.io/badge/license-MIT-blue.svg)](https://github.com/tohtsky/myFM)
+[![Build](https://github.com/tohtsky/myFM/workflows/Build%20wheel/badge.svg?branch=main)](https://github.com/tohtsky/myfm)
+[![Read the Docs](https://readthedocs.org/projects/myfm/badge/?version=stable)](https://myfm.readthedocs.io/en/stable/)
+[![codecov](https://codecov.io/gh/tohtsky/myfm/branch/main/graph/badge.svg?token=kLgOKTQqcV)](https://codecov.io/gh/tohtsky/myfm)
 
-myFM is an implementation of Bayesian [Factorization Machines](https://ieeexplore.ieee.org/abstract/document/5694074/) based on Gibbs sampling, which I believe is a wheel worth reinventing.
-
-The goal of this project is to
 
-1. Implement Gibbs sampler easy to use from Python.
-2. Use modern technology like [Eigen](http://eigen.tuxfamily.org/index.php?title=Main_Page) and [pybind11](https://github.com/pybind/pybind11) for simpler and faster implementation.
+myFM is an implementation of Bayesian [Factorization Machines](https://ieeexplore.ieee.org/abstract/document/5694074/) based on Gibbs sampling, which I believe is a wheel worth reinventing.
 
 Currently this supports most options for libFM MCMC engine, such as
 
@@ -19,33 +21,25 @@ There are also functionalities not present in libFM:
 
 Tutorial and reference doc is provided at https://myfm.readthedocs.io/en/latest/.
 
-# Requirements
-
-Python >= 3.6 and recent version of gcc/clang with C++ 11 support.
-
 # Installation
 
-For Linux / Mac OSX, type
+The package is pip-installable.
 
 ```
 pip install myfm
 ```
 
-In addition to installing python dependencies (`numpy`, `scipy`, `pybind11`, ...), the above command will automatically download eigen (ver 3.3.7) to its build directory and use it for the build.
-
-If you want to use another version of eigen, you can also do
+There are binaries for major operating systems.
 
-```
-EIGEN3_INCLUDE_DIR=/path/to/eigen pip install git+https://github.com/tohtsky/myFM
-```
+If you are working with a less popular OS/architecture, pip will attempt to build myFM from the source (you need a decent C++ compiler!). In that case, in addition to installing python dependencies (`numpy`, `scipy`, `pandas`, ...), the above command will automatically download eigen (ver 3.4.0) to its build directory and use it during the build.
 
 # Examples
 
 ## A Toy example
 
 This example is taken from [pyfm](https://github.com/coreylynch/pyFM) with some modification.
 
-```Python
+```python
 import myfm
 from sklearn.feature_extraction import DictVectorizer
 import numpy as np
@@ -75,7 +69,7 @@ This example will require `pandas` and `scikit-learn`. `movielens100k_loader` is
 
 You will be able to obtain a result comparable to SOTA algorithms like GC-MC. See `examples/ml-100k.ipynb` for the detailed version.
 
-```Python
+```python
 import numpy as np
 from sklearn.preprocessing import OneHotEncoder
 from sklearn import metrics
@@ -133,7 +127,7 @@ Below is a toy movielens-like example which utilizes relational data format prop
 
 This example, however, is too simplistic to exhibit the computational advantage of this data format. For an example with drastically reduced computational complexity, see `examples/ml-100k-extended.ipynb`;
 
-```Python
+```python
 import pandas as pd
 import numpy as np
 from myfm import MyFMRegressor, RelationBlock

doc/requirements.txt

Lines changed: 1 addition & 1 deletion
@@ -1 +1 @@
-sphinx==3.2.1
+sphinx==4.4.0

doc/source/conf.py

Lines changed: 1 addition & 0 deletions
@@ -32,6 +32,7 @@
     "sphinx.ext.autodoc",
     "sphinx.ext.autosummary",
     "sphinx.ext.todo",
+    "sphinx.ext.doctest",
     "sphinx.ext.viewcode",
     "sphinx.ext.autodoc",
     "sphinx.ext.napoleon",

doc/source/dependencies.rst

Lines changed: 0 additions & 25 deletions
This file was deleted.

doc/source/index.rst

Lines changed: 19 additions & 12 deletions
@@ -7,17 +7,23 @@
 myFM - Bayesian Factorization Machines in Python/C++
 ====================================================
 
-**myFM** is an unofficial implementation of Bayesian Factorization Machines. Its goals are to
+**myFM** is an unofficial implementation of Bayesian Factorization Machines in Python/C++.
+Notable features include:
 
-* implement a `libFM <http://libfm.org/>`_ - like functionality that is easy to use from Python
-* provide a simpler and faster implementation with `Pybind11 <https://github.com/pybind/pybind11>`_ and `Eigen <http://eigen.tuxfamily.org/index.php?title=Main_Page>`_
+* Implementation of most functionalities of the `libFM <http://libfm.org/>`_ MCMC engine (including grouping & relation blocks)
+* A simpler and faster implementation with `Pybind11 <https://github.com/pybind/pybind11>`_ and `Eigen <http://eigen.tuxfamily.org/index.php?title=Main_Page>`_
+* Gibbs sampling for **ordinal regression** with probit link function. See :ref:`the tutorial <OrdinalRegression>` for its usage.
+* Variational inference, which converges faster and requires less memory (but is usually less accurate than the Gibbs sampling).
 
-If you have a standard Python environment on MacOS/Linux, you can install the library from PyPI: ::
+
+In most cases, you can install the library from PyPI: ::
 
     pip install myfm
 
 It has an interface similar to sklearn, and you can use them for wide variety of prediction tasks.
-For example, ::
+For example,
+
+.. testcode::
 
     from sklearn.datasets import load_breast_cancer
     from sklearn.model_selection import train_test_split
@@ -35,16 +41,18 @@ For example, ::
     )
     fm = MyFMClassifier(rank=2).fit(X_train, y_train)
 
-    metrics.roc_auc_score(y_test, fm.predict_proba(X_test))
+    print(metrics.roc_auc_score(y_test, fm.predict_proba(X_test)))
     # 0.9954
 
-Try out the following :ref:`examples <MovielensIndex>` to see how Bayesian approaches to explicit collaborative filtering
-are still very competitive (almost unbeaten)!
+.. testoutput::
+    :hide:
+    :options: +ELLIPSIS
 
-One of the distinctive features of myFM is the support for ordinal regression with probit link function.
-See :ref:`the tutorial <OrdinalRegression>` for its usage.
+    0.99...
 
-In version 0.3, we have also implemented Variational Inference, which converges faster and requires lower memory (as we don't have to keep numerous samples).
+
+Try out the following :ref:`examples <MovielensIndex>` to see how Bayesian approaches to explicit collaborative filtering
+are still very competitive (almost unbeaten)!
 
 .. toctree::
    :caption: Basic Usage
@@ -59,7 +67,6 @@ In version 0.3, we have also implemented Variational Inference, which converges
    :caption: Details
    :maxdepth: 1
 
-   dependencies
    api_reference
 
 
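The new hidden ``testoutput`` blocks above rely on doctest's ELLIPSIS option, under which ``...`` in the expected output matches any text, so ``0.99...`` matches a concrete score such as 0.9954. Since `sphinx.ext.doctest` delegates the comparison to the standard `doctest` module, the matching rule can be demonstrated with the stdlib alone:

```python
import doctest

# The hidden testoutput block expects "0.99..." with ":options: +ELLIPSIS";
# "..." then matches any substring of the actual printed output.
checker = doctest.OutputChecker()

# check_output(want, got, optionflags) -> True when `got` matches `want`.
assert checker.check_output("0.99...\n", "0.9954\n", doctest.ELLIPSIS)

# Without the "0.99" prefix the match fails, so an AUC regression below 0.99
# would make `make doctest` fail.
assert not checker.check_output("0.99...\n", "0.9812\n", doctest.ELLIPSIS)

print("ELLIPSIS matching behaves as the testoutput block assumes")
```

This is why the docs can pin only the leading digits of a stochastic result while still catching real regressions.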

doc/source/movielens.rst

Lines changed: 58 additions & 20 deletions
@@ -30,10 +30,10 @@ This formulation is equivalent to Factorization Machines with
 So you can efficiently use encoder like sklearn's `OneHotEncoder <https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html>`_
 to prepare the input matrix.
 
-::
+.. testcode ::
 
     import numpy as np
-    from sklearn.preprocessing import OneHotEncoder
+    from sklearn.preprocessing import MultiLabelBinarizer, OneHotEncoder
     from sklearn import metrics
 
     import myfm
@@ -60,6 +60,12 @@ to prepare the input matrix.
     mae = np.abs(y_test - prediction).mean()
    print(f'rmse={rmse}, mae={mae}')
 
+.. testoutput::
+    :hide:
+    :options: +ELLIPSIS
+
+    rmse=..., mae=...
+
 The above script should give you RMSE=0.8944, MAE=0.7031 which is already
 impressive compared with other recent methods.
 
@@ -78,7 +84,9 @@ user vectors and item vectors are drawn from separate normal priors:
 
 However, we haven't provided any information about which columns are users' and items'.
 
-You can tell :py:class:`myfm.MyFMRegressor` these information (i.e., which parameters share a common mean and variance) by ``group_shapes`` option: ::
+You can tell :py:class:`myfm.MyFMRegressor` this information (i.e., which parameters share a common mean and variance) by the ``group_shapes`` option:
+
+.. testcode ::
 
     fm_grouped = myfm.MyFMRegressor(
         rank=FM_RANK, random_seed=42,
@@ -93,6 +101,13 @@ You can tell :py:class:`myfm.MyFMRegressor` these information (i.e., which para
     mae = np.abs(y_test - prediction_grouped).mean()
     print(f'rmse={rmse}, mae={mae}')
 
+.. testoutput::
+    :hide:
+    :options: +ELLIPSIS
+
+    rmse=..., mae=...
+
+
 This will slightly improve the performance to RMSE=0.8925, MAE=0.7001.
 
 
@@ -102,23 +117,32 @@ Adding Side information
 
 It is straightforward to include user/item side information.
 
-First we retrieve the side information from ``Movielens100kDataManager``: ::
+First we retrieve the side information from ``Movielens100kDataManager``:
+
+.. testcode ::
 
     user_info = data_manager.load_user_info().set_index('user_id')
-    user_info['age'] = user_info.age // 5 * 5
-    user_info['zipcode'] = user_info.zipcode.str[0]
+    user_info["age"] = user_info.age // 5 * 5
+    user_info["zipcode"] = user_info.zipcode.str[0]
     user_info_ohe = OneHotEncoder(handle_unknown='ignore').fit(user_info)
 
-    movie_info, movie_genres = data_manager.load_movie_info()
+    movie_info = data_manager.load_movie_info().set_index('movie_id')
     movie_info['release_year'] = [
         str(x) for x in movie_info['release_date'].dt.year.fillna('NaN')
-    ] # hack to avoid NaN
-    movie_info = movie_info[['movie_id', 'release_year'] + movie_genres].set_index('movie_id')
-    movie_info_ohe = OneHotEncoder(handle_unknown='ignore').fit(movie_info.drop(columns=movie_genres))
+    ]
+    movie_info = movie_info[['release_year', 'genres']]
+    movie_info_ohe = OneHotEncoder(handle_unknown='ignore').fit(movie_info[['release_year']])
+    movie_genre_mle = MultiLabelBinarizer(sparse_output=True).fit(
+        movie_info.genres.apply(lambda x: x.split('|'))
+    )
+
+
 
 Note that the way movie genre information is represented in ``movie_info`` DataFrame is a bit tricky (it is already binary encoded).
 
-We can then augment ``X_train`` / ``X_test`` with auxiliary information. The `hstack <https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.hstack.html>`_ function of ``scipy.sparse`` is very convenient for this purpose: ::
+We can then augment ``X_train`` / ``X_test`` with auxiliary information. The `hstack <https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.hstack.html>`_ function of ``scipy.sparse`` is very convenient for this purpose:
+
+.. testcode ::
 
     import scipy.sparse as sps
     X_train_extended = sps.hstack([
@@ -127,9 +151,11 @@ We can then augment ``X_train`` / ``X_test`` with auxiliary information. The `hs
             user_info.reindex(df_train.user_id)
         ),
         movie_info_ohe.transform(
-            movie_info.reindex(df_train.movie_id).drop(columns=movie_genres)
+            movie_info.reindex(df_train.movie_id).drop(columns=['genres'])
         ),
-        movie_info[movie_genres].reindex(df_train.movie_id).values
+        movie_genre_mle.transform(
+            movie_info.genres.reindex(df_train.movie_id).apply(lambda x: x.split('|'))
+        )
     ])
 
     X_test_extended = sps.hstack([
@@ -138,17 +164,23 @@ We can then augment ``X_train`` / ``X_test`` with auxiliary information. The `hs
             user_info.reindex(df_test.user_id)
         ),
         movie_info_ohe.transform(
-            movie_info.reindex(df_test.movie_id).drop(columns=movie_genres)
+            movie_info.reindex(df_test.movie_id).drop(columns=['genres'])
        ),
-        movie_info[movie_genres].reindex(df_test.movie_id).values
+        movie_genre_mle.transform(
+            movie_info.genres.reindex(df_test.movie_id).apply(lambda x: x.split('|'))
+        )
     ])
 
-Then we can regress ``X_train_extended`` against ``y_train`` ::
+Then we can regress ``X_train_extended`` against ``y_train``
 
-    group_shapes_extended = [len(group) for group in ohe.categories_] + \
-        [len(group) for group in user_info_ohe.categories_] + \
-        [len(group) for group in movie_info_ohe.categories_] + \
-        [ len(movie_genres)]
+.. testcode ::
+
+    group_shapes_extended = (
+        [len(group) for group in ohe.categories_] +
+        [len(group) for group in user_info_ohe.categories_] +
+        [len(group) for group in movie_info_ohe.categories_] +
+        [len(movie_genre_mle.classes_)]
+    )
 
     fm_side_info = myfm.MyFMRegressor(
         rank=FM_RANK, random_seed=42,
@@ -163,6 +195,12 @@ Then we can regress ``X_train_extended`` against ``y_train`` ::
     mae = np.abs(y_test - prediction_side_info).mean()
     print(f'rmse={rmse}, mae={mae}')
 
+.. testoutput::
+    :hide:
+    :options: +ELLIPSIS
+
+    rmse=..., mae=...
+
 The result should improve further with RMSE = 0.8855, MAE = 0.6944.
 
 Unfortunately, the running time is somewhat (~ 4 times) slower compared to
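The side-information hunks above replace the pre-binarized genre columns with a ``MultiLabelBinarizer`` fitted on pipe-separated genre strings, and the final entry of ``group_shapes_extended`` is then simply ``len(movie_genre_mle.classes_)``. The following is a dependency-free sketch of what that encoding produces; the titles and genre strings are made up for illustration:

```python
# Dependency-free sketch of what the MultiLabelBinarizer in the hunks above
# produces: each pipe-separated genre string becomes one binary row, with one
# column per distinct genre label. Titles/genres below are made-up examples.
genres_per_movie = {
    "Movie A": "Animation|Children|Comedy",
    "Movie B": "Action|Crime|Thriller",
    "Movie C": "Action|Adventure|Thriller",
}

# .fit() learns the sorted label vocabulary (MultiLabelBinarizer.classes_).
classes = sorted({g for s in genres_per_movie.values() for g in s.split("|")})


def binarize(genre_string: str) -> list:
    """One row of the indicator matrix (what .transform yields per sample)."""
    labels = set(genre_string.split("|"))
    return [1 if c in labels else 0 for c in classes]


matrix = [binarize(s) for s in genres_per_movie.values()]

# All genre columns form a single parameter group in group_shapes_extended,
# so that group's size is just the number of distinct labels; len(classes)
# here plays the role of len(movie_genre_mle.classes_) in the documentation.
genre_group_size = len(classes)
print(classes)
print(matrix[0])
```

A movie tagged with three genres thus contributes three non-zero entries within one shared group, which is why the genre block needs no per-column grouping of its own.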
