Commit a009954

Add doctest & more tests (#8)

* adding doctest
* add doctest workflow
* add ml100k download step
* confused by curl vs wget
* add sphinx deps
* Improve Readme & index.rst
* test for encoders
* add test for dataframeencoder
1 parent 3028ff1 commit a009954

28 files changed: +519 -166 lines changed

.github/workflows/doctest.yml

Lines changed: 34 additions & 0 deletions
@@ -0,0 +1,34 @@
+name: Doctest
+on: [push]
+jobs:
+  run_pytest_upload_coverage:
+    runs-on: ubuntu-latest
+    env:
+      OS: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v2
+        with:
+          fetch-depth: 0
+      - name: Setup Python
+        uses: actions/setup-python@master
+        with:
+          python-version: "3.8"
+      - name: Build myfm
+        run: |
+          pip install --upgrade pip
+          pip install numpy scipy pandas scikit-learn
+          python setup.py install
+          curl http://files.grouplens.org/datasets/movielens/ml-100k.zip -o ~/.ml-100k.zip
+      - name: Run pytest
+        run: |
+          pip install pytest phmdoctest sphinx==4.4.0 sphinx_rtd_theme
+      - name: Test Readme.md
+        run: |
+          GEN_TEST_FILE=phmdoctest_out.py
+          phmdoctest README.md --outfile "$GEN_TEST_FILE"
+          pytest "$GEN_TEST_FILE"
+          rm "$GEN_TEST_FILE"
+      - name: Run sphinx doctest
+        run: |
+          cd doc
+          make doctest

README.md

Lines changed: 13 additions & 19 deletions
@@ -1,11 +1,13 @@
 # myFM
+[![Python](https://img.shields.io/badge/python-3.7%20%7C%203.8%20%7C%203.9%20%7C%203.10-blue)](https://www.python.org)
+[![pypi](https://img.shields.io/pypi/v/myfm.svg)](https://pypi.python.org/pypi/myfm)
+[![GitHub license](https://img.shields.io/badge/license-MIT-blue.svg)](https://github.com/tohtsky/myFM)
+[![Build](https://github.com/tohtsky/myFM/workflows/Build%20wheel/badge.svg?branch=main)](https://github.com/tohtsky/myfm)
+[![Read the Docs](https://readthedocs.org/projects/myfm/badge/?version=stable)](https://myfm.readthedocs.io/en/stable/)
+[![codecov](https://codecov.io/gh/tohtsky/myfm/branch/main/graph/badge.svg?token=kLgOKTQqcV)](https://codecov.io/gh/tohtsky/myfm)
 
-myFM is an implementation of Bayesian [Factorization Machines](https://ieeexplore.ieee.org/abstract/document/5694074/) based on Gibbs sampling, which I believe is a wheel worth reinventing.
-
-The goal of this project is to
 
-1. Implement Gibbs sampler easy to use from Python.
-2. Use modern technology like [Eigen](http://eigen.tuxfamily.org/index.php?title=Main_Page) and [pybind11](https://github.com/pybind/pybind11) for simpler and faster implementation.
+myFM is an implementation of Bayesian [Factorization Machines](https://ieeexplore.ieee.org/abstract/document/5694074/) based on Gibbs sampling, which I believe is a wheel worth reinventing.
 
 Currently this supports most options for libFM MCMC engine, such as
 
@@ -19,33 +21,25 @@ There are also functionalities not present in libFM:
 
 Tutorial and reference doc is provided at https://myfm.readthedocs.io/en/latest/.
 
-# Requirements
-
-Python >= 3.6 and recent version of gcc/clang with C++ 11 support.
-
 # Installation
 
-For Linux / Mac OSX, type
+The package is pip-installable.
 
 ```
 pip install myfm
 ```
 
-In addition to installing python dependencies (`numpy`, `scipy`, `pybind11`, ...), the above command will automatically download eigen (ver 3.3.7) to its build directory and use it for the build.
-
-If you want to use another version of eigen, you can also do
+There are binaries for major operating systems.
 
-```
-EIGEN3_INCLUDE_DIR=/path/to/eigen pip install git+https://github.com/tohtsky/myFM
-```
+If you are working with a less popular OS/architecture, pip will attempt to build myFM from the source (you need a decent C++ compiler!). In that case, in addition to installing python dependencies (`numpy`, `scipy`, `pandas`, ...), the above command will automatically download eigen (ver 3.4.0) to its build directory and use it during the build.
 
 # Examples
 
 ## A Toy example
 
 This example is taken from [pyfm](https://github.com/coreylynch/pyFM) with some modification.
 
-```Python
+```python
 import myfm
 from sklearn.feature_extraction import DictVectorizer
 import numpy as np
@@ -75,7 +69,7 @@ This example will require `pandas` and `scikit-learn`. `movielens100k_loader` is
 
 You will be able to obtain a result comparable to SOTA algorithms like GC-MC. See `examples/ml-100k.ipynb` for the detailed version.
 
-```Python
+```python
 import numpy as np
 from sklearn.preprocessing import OneHotEncoder
 from sklearn import metrics
@@ -133,7 +127,7 @@ Below is a toy movielens-like example which utilizes relational data format prop
 
 This example, however, is too simplistic to exhibit the computational advantage of this data format. For an example with drastically reduced computational complexity, see `examples/ml-100k-extended.ipynb`;
 
-```Python
+```python
 import pandas as pd
 import numpy as np
 from myfm import MyFMRegressor, RelationBlock

doc/requirements.txt

Lines changed: 1 addition & 1 deletion
@@ -1 +1 @@
-sphinx==3.2.1
+sphinx==4.4.0

doc/source/conf.py

Lines changed: 1 addition & 0 deletions
@@ -32,6 +32,7 @@
     "sphinx.ext.autodoc",
     "sphinx.ext.autosummary",
     "sphinx.ext.todo",
+    "sphinx.ext.doctest",
     "sphinx.ext.viewcode",
     "sphinx.ext.autodoc",
     "sphinx.ext.napoleon",

doc/source/dependencies.rst

Lines changed: 0 additions & 25 deletions
This file was deleted.

doc/source/index.rst

Lines changed: 19 additions & 12 deletions
@@ -7,17 +7,23 @@
 myFM - Bayesian Factorization Machines in Python/C++
 ====================================================
 
-**myFM** is an unofficial implementation of Bayesian Factorization Machines. Its goals are to
+**myFM** is an unofficial implementation of Bayesian Factorization Machines in Python/C++.
+Notable features include:
 
-* implement a `libFM <http://libfm.org/>`_ - like functionality that is easy to use from Python
-* provide a simpler and faster implementation with `Pybind11 <https://github.com/pybind/pybind11>`_ and `Eigen <http://eigen.tuxfamily.org/index.php?title=Main_Page>`_
+* Implementation of most functionalities of the `libFM <http://libfm.org/>`_ MCMC engine (including grouping & relation blocks)
+* A simpler and faster implementation with `Pybind11 <https://github.com/pybind/pybind11>`_ and `Eigen <http://eigen.tuxfamily.org/index.php?title=Main_Page>`_
+* Gibbs sampling for **ordinal regression** with probit link function. See :ref:`the tutorial <OrdinalRegression>` for its usage.
+* Variational inference, which converges faster and requires less memory (but is usually less accurate than the Gibbs sampling).
 
-If you have a standard Python environment on MacOS/Linux, you can install the library from PyPI: ::
+
+In most cases, you can install the library from PyPI: ::
 
     pip install myfm
 
 It has an interface similar to sklearn, and you can use them for wide variety of prediction tasks.
-For example, ::
+For example,
+
+.. testcode::
 
     from sklearn.datasets import load_breast_cancer
     from sklearn.model_selection import train_test_split
@@ -35,16 +41,18 @@ For example, ::
     )
     fm = MyFMClassifier(rank=2).fit(X_train, y_train)
 
-    metrics.roc_auc_score(y_test, fm.predict_proba(X_test))
+    print(metrics.roc_auc_score(y_test, fm.predict_proba(X_test)))
     # 0.9954
 
-Try out the following :ref:`examples <MovielensIndex>` to see how Bayesian approaches to explicit collaborative filtering
-are still very competitive (almost unbeaten)!
+.. testoutput::
+    :hide:
+    :options: +ELLIPSIS
 
-One of the distinctive features of myFM is the support for ordinal regression with probit link function.
-See :ref:`the tutorial <OrdinalRegression>` for its usage.
+    0.99...
 
-In version 0.3, we have also implemented Variational Inference, which converges faster and requires lower memory (as we don't have to keep numerous samples).
+
+Try out the following :ref:`examples <MovielensIndex>` to see how Bayesian approaches to explicit collaborative filtering
+are still very competitive (almost unbeaten)!
 
 .. toctree::
    :caption: Basic Usage
@@ -59,7 +67,6 @@ In version 0.3, we have also implemented Variational Inference, which converges
    :caption: Details
    :maxdepth: 1
 
-   dependencies
    api_reference
 
 
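The new hidden ``testoutput`` blocks above rely on doctest's ELLIPSIS option, under which ``...`` in the expected output matches any text, so ``0.99...`` matches a concrete score such as 0.9954. Since `sphinx.ext.doctest` delegates the comparison to the standard `doctest` module, the matching rule can be demonstrated with the stdlib alone:

```python
import doctest

# The hidden testoutput block expects "0.99..." with ":options: +ELLIPSIS";
# "..." then matches any substring of the actual printed output.
checker = doctest.OutputChecker()

# check_output(want, got, optionflags) -> True when `got` matches `want`.
assert checker.check_output("0.99...\n", "0.9954\n", doctest.ELLIPSIS)

# Without the "0.99" prefix the match fails, so an AUC regression below 0.99
# would make `make doctest` fail.
assert not checker.check_output("0.99...\n", "0.9812\n", doctest.ELLIPSIS)

print("ELLIPSIS matching behaves as the testoutput block assumes")
```

This is why the docs can pin only the leading digits of a stochastic result while still catching real regressions.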

doc/source/movielens.rst

Lines changed: 58 additions & 20 deletions
@@ -30,10 +30,10 @@ This formulation is equivalent to Factorization Machines with
 So you can efficiently use encoder like sklearn's `OneHotEncoder <https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html>`_
 to prepare the input matrix.
 
-::
+.. testcode ::
 
     import numpy as np
-    from sklearn.preprocessing import OneHotEncoder
+    from sklearn.preprocessing import MultiLabelBinarizer, OneHotEncoder
     from sklearn import metrics
 
     import myfm
@@ -60,6 +60,12 @@ to prepare the input matrix.
     mae = np.abs(y_test - prediction).mean()
    print(f'rmse={rmse}, mae={mae}')
 
+.. testoutput::
+    :hide:
+    :options: +ELLIPSIS
+
+    rmse=..., mae=...
+
 The above script should give you RMSE=0.8944, MAE=0.7031 which is already
 impressive compared with other recent methods.
 
@@ -78,7 +84,9 @@ user vectors and item vectors are drawn from separate normal priors:
 
 However, we haven't provided any information about which columns are users' and items'.
 
-You can tell :py:class:`myfm.MyFMRegressor` these information (i.e., which parameters share a common mean and variance) by ``group_shapes`` option: ::
+You can tell :py:class:`myfm.MyFMRegressor` this information (i.e., which parameters share a common mean and variance) by the ``group_shapes`` option:
+
+.. testcode ::
 
     fm_grouped = myfm.MyFMRegressor(
         rank=FM_RANK, random_seed=42,
@@ -93,6 +101,13 @@ You can tell :py:class:`myfm.MyFMRegressor` these information (i.e., which para
     mae = np.abs(y_test - prediction_grouped).mean()
     print(f'rmse={rmse}, mae={mae}')
 
+.. testoutput::
+    :hide:
+    :options: +ELLIPSIS
+
+    rmse=..., mae=...
+
+
 This will slightly improve the performance to RMSE=0.8925, MAE=0.7001.
 
 
@@ -102,23 +117,32 @@ Adding Side information
 
 It is straightforward to include user/item side information.
 
-First we retrieve the side information from ``Movielens100kDataManager``: ::
+First we retrieve the side information from ``Movielens100kDataManager``:
+
+.. testcode ::
 
     user_info = data_manager.load_user_info().set_index('user_id')
-    user_info['age'] = user_info.age // 5 * 5
-    user_info['zipcode'] = user_info.zipcode.str[0]
+    user_info["age"] = user_info.age // 5 * 5
+    user_info["zipcode"] = user_info.zipcode.str[0]
     user_info_ohe = OneHotEncoder(handle_unknown='ignore').fit(user_info)
 
-    movie_info, movie_genres = data_manager.load_movie_info()
+    movie_info = data_manager.load_movie_info().set_index('movie_id')
     movie_info['release_year'] = [
         str(x) for x in movie_info['release_date'].dt.year.fillna('NaN')
-    ] # hack to avoid NaN
-    movie_info = movie_info[['movie_id', 'release_year'] + movie_genres].set_index('movie_id')
-    movie_info_ohe = OneHotEncoder(handle_unknown='ignore').fit(movie_info.drop(columns=movie_genres))
+    ]
+    movie_info = movie_info[['release_year', 'genres']]
+    movie_info_ohe = OneHotEncoder(handle_unknown='ignore').fit(movie_info[['release_year']])
+    movie_genre_mle = MultiLabelBinarizer(sparse_output=True).fit(
+        movie_info.genres.apply(lambda x: x.split('|'))
+    )
+
+
 
 Note that the way movie genre information is represented in ``movie_info`` DataFrame is a bit tricky (it is already binary encoded).
 
-We can then augment ``X_train`` / ``X_test`` with auxiliary information. The `hstack <https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.hstack.html>`_ function of ``scipy.sparse`` is very convenient for this purpose: ::
+We can then augment ``X_train`` / ``X_test`` with auxiliary information. The `hstack <https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.hstack.html>`_ function of ``scipy.sparse`` is very convenient for this purpose:
+
+.. testcode ::
 
     import scipy.sparse as sps
     X_train_extended = sps.hstack([
@@ -127,9 +151,11 @@ We can then augment ``X_train`` / ``X_test`` with auxiliary information. The `hs
             user_info.reindex(df_train.user_id)
         ),
         movie_info_ohe.transform(
-            movie_info.reindex(df_train.movie_id).drop(columns=movie_genres)
+            movie_info.reindex(df_train.movie_id).drop(columns=['genres'])
         ),
-        movie_info[movie_genres].reindex(df_train.movie_id).values
+        movie_genre_mle.transform(
+            movie_info.genres.reindex(df_train.movie_id).apply(lambda x: x.split('|'))
+        )
     ])
 
     X_test_extended = sps.hstack([
@@ -138,17 +164,23 @@ We can then augment ``X_train`` / ``X_test`` with auxiliary information. The `hs
             user_info.reindex(df_test.user_id)
         ),
         movie_info_ohe.transform(
-            movie_info.reindex(df_test.movie_id).drop(columns=movie_genres)
+            movie_info.reindex(df_test.movie_id).drop(columns=['genres'])
        ),
-        movie_info[movie_genres].reindex(df_test.movie_id).values
+        movie_genre_mle.transform(
+            movie_info.genres.reindex(df_test.movie_id).apply(lambda x: x.split('|'))
+        )
     ])
 
-Then we can regress ``X_train_extended`` against ``y_train`` ::
+Then we can regress ``X_train_extended`` against ``y_train``
 
-    group_shapes_extended = [len(group) for group in ohe.categories_] + \
-        [len(group) for group in user_info_ohe.categories_] + \
-        [len(group) for group in movie_info_ohe.categories_] + \
-        [ len(movie_genres)]
+.. testcode ::
+
+    group_shapes_extended = (
+        [len(group) for group in ohe.categories_] +
+        [len(group) for group in user_info_ohe.categories_] +
+        [len(group) for group in movie_info_ohe.categories_] +
+        [len(movie_genre_mle.classes_)]
+    )
 
     fm_side_info = myfm.MyFMRegressor(
         rank=FM_RANK, random_seed=42,
@@ -163,6 +195,12 @@ Then we can regress ``X_train_extended`` against ``y_train`` ::
     mae = np.abs(y_test - prediction_side_info).mean()
     print(f'rmse={rmse}, mae={mae}')
 
+.. testoutput::
+    :hide:
+    :options: +ELLIPSIS
+
+    rmse=..., mae=...
+
 The result should improve further with RMSE = 0.8855, MAE = 0.6944.
 
 Unfortunately, the running time is somewhat (~ 4 times) slower compared to
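The side-information hunks above replace the pre-binarized genre columns with a ``MultiLabelBinarizer`` fitted on pipe-separated genre strings, and the final entry of ``group_shapes_extended`` is then simply ``len(movie_genre_mle.classes_)``. The following is a dependency-free sketch of what that encoding produces; the titles and genre strings are made up for illustration:

```python
# Dependency-free sketch of what the MultiLabelBinarizer in the hunks above
# produces: each pipe-separated genre string becomes one binary row, with one
# column per distinct genre label. Titles/genres below are made-up examples.
genres_per_movie = {
    "Movie A": "Animation|Children|Comedy",
    "Movie B": "Action|Crime|Thriller",
    "Movie C": "Action|Adventure|Thriller",
}

# .fit() learns the sorted label vocabulary (MultiLabelBinarizer.classes_).
classes = sorted({g for s in genres_per_movie.values() for g in s.split("|")})


def binarize(genre_string: str) -> list:
    """One row of the indicator matrix (what .transform yields per sample)."""
    labels = set(genre_string.split("|"))
    return [1 if c in labels else 0 for c in classes]


matrix = [binarize(s) for s in genres_per_movie.values()]

# All genre columns form a single parameter group in group_shapes_extended,
# so that group's size is just the number of distinct labels; len(classes)
# here plays the role of len(movie_genre_mle.classes_) in the documentation.
genre_group_size = len(classes)
print(classes)
print(matrix[0])
```

A movie tagged with three genres thus contributes three non-zero entries within one shared group, which is why the genre block needs no per-column grouping of its own.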
