Refactoring and improving the README

blokhin · blokhin · commit 009a0bd1ce2a · 2018-03-22T16:33:04.000+01:00
diff --git a/README.md b/README.md
@@ -11,12 +11,16 @@ Live demo
 Rationale
 ------
 
-This is the proof of concept, how a relatively unsophisticated statistical model (namely, _random forest regressor_) trained on the large MPDS dataset predicts a set of physical properties from the only crystalline structure. Similarly to _ab initio_, this method could be called _ab datum_. (Note however that the simulation of physical properties with a comparable precision normally takes days, weeks or even months, whereas the present prediction method takes less than a second.) A crystal structure in either CIF or POSCAR format is required. The following physical properties are predicted:
+This is a proof of concept showing how a relatively unsophisticated statistical model (namely, a _random forest regressor_) trained on the large MPDS dataset predicts a set of physical properties from the crystalline structure alone. By analogy with _ab initio_, this method could be called _ab datum_. (Note, however, that simulating physical properties with comparable precision normally takes days, weeks or even months, whereas the present method takes less than a second!) A crystal structure in either CIF or POSCAR format is required. The following physical properties are predicted:
 
 - isothermal bulk modulus
 - enthalpy of formation
 - heat capacity at constant pressure
 - melting temperature
+- Debye temperature
+- Seebeck coefficient
+- linear thermal expansion coefficient
+- band gap (or its absence, _i.e._ whether a crystal is a conductor or an insulator)
 
 Installation
 ------
@@ -34,24 +38,26 @@ Currently only *Python 2* is supported (*Python 3* support is coming).
 Preparation
 ------
 
-The model is trained on the MPDS data using the MPDS API and the script `ml_mpds.py`. Some subset of the full MPDS data is opened and possible to obtain via MPDS API [for free](https://mpds.io/open-data-api).
+The model is trained on the MPDS data using the MPDS API and the scripts `train_regressor.py` and `train_classifier.py`. A subset of the full MPDS data is open and can be obtained via the MPDS API [for free](https://mpds.io/open-data-api).
 
 Architecture and usage
 ------
 
-This is the client-server application. The client is not required although, and it is possible to employ the server code as a standalone command-line application. The client is used for a convenience only. The client and the server communicate using HTTP. Any client able to execute HTTP requests is supported, be it a `curl` command-line client or rich web-browser user interface. As an example of the latter, a simple HTML5 app `index.html` is supplied. Server part is a Flask app, loading the pre-trained ML models:
+The code can be used either as a standalone command-line application or as a client-server application. In the latter case, the client and the server communicate over HTTP, and any client able to execute HTTP requests is supported, be it the `curl` command-line client or a rich web-browser user interface. As an example of the latter, a simple HTML5 app `index.html` is supplied in the `webassets` folder. The server part is a Flask app:
 
 ```python
-python index.py /tmp/path_to_model_one /tmp/path_to_model_two
+python mpds_ml_labs/app.py
 ```
 
-Web-browser user interface is then available under `http://localhost:5000`. To serve the requests the development Flask server is used. Therefore an _AS-IS_ deployment in an online environment without the suitable WSGI container is highly discouraged. Serving of the ML models is very simple. For the production environments under high load it is recommended to follow e.g. [TensorFlow Serving](https://www.tensorflow.org/serving).
+The web-browser user interface is then available under `http://localhost:5000`. By default, the development Flask server is used to serve the requests. Therefore an _AS-IS_ deployment in an online environment without a suitable WSGI container is **highly discouraged**. For production environments under high load it is recommended to use something like [TensorFlow Serving](https://www.tensorflow.org/serving).
 
 
 Used descriptor and model details
 ------
 
-The term _descriptor_ stands for the compact information-rich representation, allowing the convenient mathematical treatment of the encoded complex data (_i.e._ crystalline structure). Any crystalline structure is populated to a certain relatively big fixed volume of minimum one cubic nanometer. Then the descriptor is constructed using the periodic numbers of atoms and the lengths of their radius-vectors. The details are in the file `prediction.py`. As a machine-learning model an ensemble of decision trees ([random forest regressor](http://scikit-learn.org/stable/modules/ensemble.html)) is used, as implemented in [scikit-learn](http://scikit-learn.org) Python machine-learning toolkit. The whole MPDS dataset is used for training. In order to estimate the prediction quality, the metrics of _mean absolute error_ and _R2 coefficient of determination_ are used. The evaluation process is repeated at least 30 times to achieve a statistical reliability.
+The term _descriptor_ stands for a compact, information-rich representation that allows convenient mathematical treatment of the encoded complex data (_i.e._ a crystalline structure). Any crystalline structure is expanded to a certain relatively big fixed volume of at least one cubic nanometer. Then the descriptor is constructed using the periodic numbers of atoms and the lengths of their radius-vectors. The details are in the file `mpds_ml_labs/prediction.py`.
+
+As the machine-learning model, an ensemble of decision trees ([random forest regressor](http://scikit-learn.org/stable/modules/ensemble.html)) is used, as implemented in the [scikit-learn](http://scikit-learn.org) Python machine-learning toolkit. The whole MPDS dataset can be used for training. To estimate the prediction quality of the _regressor_ model, the metrics of _mean absolute error_ and _R2 coefficient of determination_ are used. To estimate the prediction quality of the _classifier_ model (binary case), the simple error percentage is used (`(false positives + false negatives)/all outcomes`). The evaluation process is repeated at least 30 times to achieve statistical reliability.
 
 API
 ------
@@ -91,5 +97,7 @@ License
 Citation
 ------
 
-Please feel free to cite:
-- Blokhin E, Villars P, PAULING FILE and MPDS materials data infrastructure, in preparation, 2018
+[![DOI](https://zenodo.org/badge/110734326.svg)](https://zenodo.org/badge/latestdoi/110734326)
+
+Also please feel free to cite:
+- Blokhin E, Villars P, PAULING FILE and MPDS materials data infrastructure, in preparation, **2018**
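
The two regressor metrics named in the README text above can be checked by hand. The following is a minimal pure-Python sketch, not the code used in this repository (which relies on scikit-learn's `mean_absolute_error` and `r2_score`); the function names `mean_abs_error` and `r2` and the sample numbers are illustrative assumptions only.

```python
def mean_abs_error(y_true, y_pred):
    """Average absolute deviation between observed and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / float(len(y_true))


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / float(len(y_true))
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


observed = [1.0, 2.0, 3.0]    # made-up values, not MPDS data
predicted = [1.0, 2.0, 4.0]   # made-up values, not model output
print(mean_abs_error(observed, predicted))
print(r2(observed, predicted))
```

For these made-up values the mean absolute error is 1/3 and R2 is 0.5: one of three predictions is off by one unit, and the residual sum of squares (1) is half the total sum of squares (2).
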
diff --git a/data/settings.ini.sample b/data/settings.ini.sample
@@ -3,3 +3,5 @@ serve_ui = true
 ml_models =
     /path_to_models/model_one.pkl
     /path_to_models/model_two.pkl
+api_key =
+api_endpoint =
diff --git a/mpds_ml_labs/app.py b/mpds_ml_labs/app.py
@@ -7,13 +7,13 @@
 
 from struct_utils import detect_format, poscar_to_ase, symmetrize, get_formula
 from cif_utils import cif_to_ase, ase_to_eq_cif
-from prediction import ase_to_ml_model, get_legend, load_ml_model
+from prediction import ase_to_prediction, get_legend, load_ml_models
 from common import SERVE_UI, ML_MODELS
 
 
 app_labs = Blueprint('app_labs', __name__)
 static_path = os.path.realpath(os.path.join(os.path.dirname(__file__), '../webassets'))
-active_ml_model = None
+active_ml_models = None
 
 def fmt_msg(msg, http_code=400):
     return Response('{"error":"%s"}' % msg, content_type='application/json', status=http_code)
@@ -85,7 +85,7 @@ def predict():
     if error:
         return fmt_msg(error)
 
-    prediction, error = ase_to_ml_model(ase_obj, active_ml_model)
+    prediction, error = ase_to_prediction(ase_obj, active_ml_models)
     if error:
         return fmt_msg(error)
 
@@ -110,11 +110,11 @@ def predict():
 if __name__ == '__main__':
     if sys.argv[1:]:
         print("Models to load:\n" + "\n".join(sys.argv[1:]))
-        active_ml_model = load_ml_model(sys.argv[1:])
+        active_ml_models = load_ml_models(sys.argv[1:])
 
     elif ML_MODELS:
         print("Models to load:\n" + "\n".join(ML_MODELS))
-        active_ml_model = load_ml_model(ML_MODELS)
+        active_ml_models = load_ml_models(ML_MODELS)
 
     else:
         print("No models to load")
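
The `ase_to_prediction` call introduced above combines two kinds of models for the band gap (`w`): a binary classifier and a regressor. As read from this commit, a classifier output of `0` forces the reported band gap to zero, and otherwise the regressor value is kept. A simplified sketch of that rule follows; `combine_band_gap` is a hypothetical helper that does not exist in the repository, and reading class `0` as "conductor" is an assumption based on the README wording.

```python
def combine_band_gap(clf_label, regressed_gap, rounding=1):
    """Return the band gap (eV) to report.

    clf_label     -- binary classifier output; 0 is assumed to mean "conductor"
    regressed_gap -- band gap predicted by the regressor, in eV
    rounding      -- number of decimals to keep (1 for 'w' in prop_models)
    """
    if clf_label == 0:
        # The classifier overrides the regressor: no gap at all
        return 0
    return round(regressed_gap, rounding)


print(combine_band_gap(0, 3.27))  # classifier says "no gap"
print(combine_band_gap(1, 3.27))  # regressor value is kept, rounded
```
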
diff --git a/mpds_ml_labs/common.py b/mpds_ml_labs/common.py
@@ -9,11 +9,18 @@
 
 if os.path.exists(config_path):
     config.read(config_path)
+
     SERVE_UI = config.get('mpds_ml_labs', 'serve_ui')
-    ML_MODELS = [path.strip() for path in filter(
-        None,
-        config.get('mpds_ml_labs', 'ml_models').split()
-    )]
+    ML_MODELS = config.get('mpds_ml_labs', 'ml_models') or ''
+    API_KEY = config.get('mpds_ml_labs', 'api_key')
+    API_ENDPOINT = config.get('mpds_ml_labs', 'api_endpoint')
+
+    ML_MODELS = [
+        path.strip() for path in filter(None, ML_MODELS.split())
+    ]
+
 else:
     SERVE_UI = True
     ML_MODELS = []
+    API_KEY = None
+    API_ENDPOINT = None
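
To illustrate how the multi-line `ml_models` value from `settings.ini.sample` turns into a list of paths, here is a standalone sketch. It is written for Python 3's `configparser` (the repository itself targets Python 2, where the module is named `ConfigParser`). The `parse_settings` helper, the use of `getboolean` for `serve_ui`, and the mapping of empty values to `None` are assumptions made for this example and differ from the diff above, which keeps the raw strings.

```python
from configparser import ConfigParser

# Sample contents; the section name is taken from the config.get() calls above
SAMPLE = """
[mpds_ml_labs]
serve_ui = true
ml_models =
    /path_to_models/model_one.pkl
    /path_to_models/model_two.pkl
api_key =
api_endpoint =
"""


def parse_settings(text):
    """Parse the ini text into a plain dict (hypothetical helper)."""
    config = ConfigParser()
    config.read_string(text)
    section = 'mpds_ml_labs'

    # Indented continuation lines are joined with newlines,
    # so a whitespace split yields one entry per path
    raw_models = config.get(section, 'ml_models') or ''

    return {
        'serve_ui': config.getboolean(section, 'serve_ui'),
        'ml_models': raw_models.split(),
        'api_key': config.get(section, 'api_key') or None,
        'api_endpoint': config.get(section, 'api_endpoint') or None,
    }


settings = parse_settings(SAMPLE)
print(settings['ml_models'])
```
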
diff --git a/mpds_ml_labs/prediction.py b/mpds_ml_labs/prediction.py
@@ -5,12 +5,16 @@
 
 import numpy as np
 
+from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
+from sklearn.model_selection import train_test_split
+from sklearn.metrics import mean_absolute_error, r2_score, confusion_matrix
 
-prop_semantics = {
+
+prop_models = {
     'w': {
-        'name': 'band gap for direct transition',
+        'name': 'band gap',
         'units': 'eV',
-        'symbol': 'e<sub>dir.</sub>',
+        'symbol': 'e<sub>dir. or indir.</sub>',
         'rounding': 1,
         'interval': [0.01, 20]
     },
@@ -124,35 +128,35 @@ def get_descriptor(ase_obj, kappa=None, overreach=False):
     return np.array(DV).flatten()
 
 
-def load_ml_model(prop_model_files):
-    ml_model = {}
+def load_ml_models(prop_model_files):
+    ml_models = {}
     for n, file_name in enumerate(prop_model_files, start=1):
         if not os.path.exists(file_name):
             print("No file %s" % file_name)
             continue
 
         basename = file_name.split(os.sep)[-1]
-        if basename.startswith('ml') and basename[3:4] == '_' and basename[2:3] in prop_semantics:
+        if basename.startswith('ml') and basename[3:4] == '_' and basename[2:3] in prop_models:
             prop_id = basename[2:3]
-            print("Detected property %s in file %s" % (prop_semantics[prop_id]['name'], basename))
+            print("Detected property %s in file %s" % (prop_models[prop_id]['name'], basename))
         else:
             prop_id = str(n)
             print("No property name detected in file %s" % basename)
 
         with open(file_name, 'rb') as f:
             model = cPickle.load(f)
             if hasattr(model, 'predict') and hasattr(model, 'metadata'):
-                ml_model[prop_id] = model
+                ml_models[prop_id] = model
                 print("Model metadata: %s" % model.metadata)
 
-    print("Loaded property models: %s" % len(ml_model))
-    return ml_model
+    print("Loaded property models: %s" % len(ml_models))
+    return ml_models
 
 
 def get_legend(pred_dict):
     legend = {}
     for key in pred_dict.keys():
-        legend[key] = prop_semantics.get(key, {
+        legend[key] = prop_models.get(key, {
             'name': 'Unspecified property ' + str(key),
             'units': 'arb.u.',
             'symbol': 'P' + str(key),
@@ -161,32 +165,127 @@ def get_legend(pred_dict):
     return legend
 
 
-def ase_to_ml_model(ase_obj, ml_model):
+def ase_to_prediction(ase_obj, ml_models):
+    """
+    Execute all the regressor models against a given structure descriptor;
+    the results of the "w" regressor model will depend on the
+    output of the binary classifier model
+    """
     result = {}
     descriptor = get_descriptor(ase_obj, overreach=True)
     d_dim = len(descriptor)
+    should_invoke_clfr = 'w' in prop_models.keys()
+
+    # testing
+    if not ml_models:
+        result = {prop_id: {'value': 42, 'mae': 0, 'r2': 0} for prop_id in prop_models.keys()}
 
-    if not ml_model: # testing
-        return {prop_id: {'value': 42, 'mae': 0, 'r2': 0} for prop_id in prop_semantics.keys()}, None
+        if should_invoke_clfr:
+            result['w'] = {'value': 0, 'mae': 0, 'r2': 0}
 
-    for prop_id, regr in ml_model.items(): # production
+    # production
+    for prop_id, model in (ml_models or {}).items():
 
-        if d_dim < regr.n_features_:
+        if d_dim < model.n_features_:
             continue
-        elif d_dim > regr.n_features_:
-            d_input = descriptor[:regr.n_features_]
+        elif d_dim > model.n_features_:
+            d_input = descriptor[:model.n_features_]
         else:
             d_input = descriptor[:]
 
         try:
-            prediction = regr.predict([d_input])[0]
+            prediction = model.predict([d_input])[0]
         except Exception as e:
             return None, str(e)
 
-        result[prop_id] = {
-            'value': round(prediction, prop_semantics[prop_id]['rounding']),
-            'mae': round(regr.metadata['mae'], prop_semantics[prop_id]['rounding']),
-            'r2': regr.metadata['r2']
-        }
+        # classifier
+        if 'error_percentage' in model.metadata:
+
+            if should_invoke_clfr:
+
+                if prediction == 0:
+                    result['w'] = {'value': 0, 'mae': 0, 'r2': 0}
+
+        # regressor
+        else:
+            if prop_id not in prop_models or \
+            (prop_id == 'w' and prop_id in result):
+                continue
+
+            result[prop_id] = {
+                'value': round(prediction, prop_models[prop_id]['rounding']),
+                'mae': round(model.metadata['mae'], prop_models[prop_id]['rounding']),
+                'r2': model.metadata['r2']
+            }
 
     return result, None
+
+
+def get_regr(a=None, b=None):
+
+    if not a: a = 100
+    if not b: b = 2
+
+    return RandomForestRegressor(
+        n_estimators=a,
+        max_features=b,
+        max_depth=None,
+        min_samples_split=2, # recommended value
+        min_samples_leaf=5, # recommended value
+        bootstrap=True, # recommended value
+        n_jobs=-1
+    )
+
+
+def get_clfr(a=None, b=None):
+
+    if not a: a = 100
+    if not b: b = 2
+
+    return RandomForestClassifier(
+        n_estimators=a,
+        max_features=b,
+        max_depth=None,
+        min_samples_split=2, # recommended value
+        min_samples_leaf=5, # recommended value
+        bootstrap=True, # recommended value
+        n_jobs=-1
+    )
+
+
+def estimate_regr_quality(algo, args, values, attempts=30, nsamples=0.33):
+
+    results = []
+
+    for _ in range(attempts):
+        X_train, X_test, y_train, y_test = train_test_split(args, values, test_size=nsamples)
+        algo.fit(X_train, y_train)
+
+        prediction = algo.predict(X_test)
+
+        mae = mean_absolute_error(y_test, prediction)
+        r2 = r2_score(y_test, prediction)
+        results.append([mae, r2])
+
+    results = list(map(list, zip(*results))) # transpose
+
+    avg_mae = np.median(results[0])
+    avg_r2 = np.median(results[1])
+    return avg_mae, avg_r2
+
+
+def estimate_clfr_quality(algo, args, values, attempts=30, nsamples=0.33):
+
+    results = []
+
+    for _ in range(attempts):
+        X_train, X_test, y_train, y_test = train_test_split(args, values, test_size=nsamples)
+        algo.fit(X_train, y_train)
+
+        prediction = algo.predict(X_test)
+
+        tn, fp, fn, tp = confusion_matrix(y_test, prediction).ravel()
+        error_percentage = (fp + fn) / float(tn + fp + fn + tp)
+        results.append(error_percentage)
+
+    return np.median(results)
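
The classifier quality measure used in `estimate_clfr_quality` is the share of misclassified samples, `(fp + fn) / (tn + fp + fn + tp)`. For binary labels this equals the number of mismatches divided by the number of samples, which the sketch below computes without scikit-learn. `error_fraction` is a hypothetical name and the labels are made up. Note that the division has to be a true (floating-point) division: on Python 2, dividing two integers with `/` truncates unless `from __future__ import division` is in effect.

```python
def error_fraction(y_true, y_pred):
    """(false positives + false negatives) / all outcomes, for 0/1 labels."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("expected two non-empty label lists of equal length")

    # Every mismatch is either a false positive or a false negative
    wrong = sum(1 for t, p in zip(y_true, y_pred) if t != p)

    # float() forces true division on both Python 2 and Python 3
    return wrong / float(len(y_true))


labels = [0, 0, 1, 1]       # made-up ground truth
predictions = [0, 1, 1, 0]  # made-up classifier output
print(error_fraction(labels, predictions))
```
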
diff --git a/train_regressor.py b/train_regressor.py