mpds-io
diff --git a/‎README.md‎
Lines changed: 30 additions & 15 deletions b/‎README.md‎
diff --git a/‎data/settings.ini.sample‎
Lines changed: 10 additions & 1 deletion b/‎data/settings.ini.sample‎
diff --git a/‎model_importer.py‎
Lines changed: 87 additions & 0 deletions b/‎model_importer.py‎
diff --git a/‎mpds_ml_labs/app.py‎
Lines changed: 135 additions & 8 deletions b/‎mpds_ml_labs/app.py‎
@@ -1,12 +1,16 @@
-Data-driven predictions from the crystalline structure
+Data-driven predictions: from crystal structure to physical properties and vice versa
 ======
 
+[![DOI](https://zenodo.org/badge/110734326.svg)](https://zenodo.org/badge/latestdoi/110734326)
+
 ![Materials simulations ab datum](https://raw.githubusercontent.com/mpds-io/mpds-ml-labs/master/crystallographer_mpds_cc_by_40.png "Materials simulation ab datum")
 
-Live demo
+
+Live demos
 ------
 
-[mpds.io/ml](https://mpds.io/ml)
+[mpds.io/ml](https://mpds.io/ml) and [mpds.io/materials-design](https://mpds.io/materials-design)
+
 
 Rationale
 ------
@@ -22,6 +26,9 @@ This is the proof of concept, how a relatively unsophisticated statistical model
 - linear thermal expansion coefficient
 - band gap (or its absence, _i.e._ whether a crystal is a conductor or an insulator)
 
+Further, the reverse task is solved: predicting a possible crystalline structure from a set of given properties. Suitable chemical elements are found, and the resulting structure is generated (if possible) based on the available MPDS prototypes.
+
+
 Installation
 ------
 
@@ -33,17 +40,19 @@ cd REPO_FOLDER
 pip install -r requirements.txt
 ```
 
-Currently only *Python 2* is supported (*Python 3* support is coming).
+Currently only *Python 2* is supported (*Python 3* support is almost there).
+
 
 Preparation
 ------
 
 The model is trained on the MPDS data using the MPDS API and the scripts `train_regressor.py` and `train_classifier.py`. A subset of the full MPDS data is open and can be obtained via the MPDS API [for free](https://mpds.io/open-data-api).
 
+
 Architecture and usage
 ------
 
-Can be used either as a standalone command-line application or as a client-server application. In the latter case, the client and the server communicate over HTTP, and any client able to execute HTTP requests is supported, be it a `curl` command-line client or rich web-browser user interface. As an example of the latter, a simple HTML5 app `index.html` is supplied in the `webassets` folder. Server part is a Flask app:
+The code can be used either as a standalone command-line application or as a client-server application. In the latter case, the client and the server communicate over HTTP, and any client able to execute HTTP requests is supported, be it a `curl` command-line client or a rich web-browser user interface. For example, the simple HTML5 apps `props.html` and `design.html` are supplied in the `webassets` folder. The server part is a Flask app:
 
 ```python
 python mpds_ml_labs/app.py
@@ -57,7 +66,10 @@ Used descriptor and model details
 
 The term _descriptor_ stands for the compact information-rich representation, allowing the convenient mathematical treatment of the encoded complex data (_i.e._ crystalline structure). Any crystalline structure is populated to a certain relatively big fixed volume of minimum one cubic nanometer. Then the descriptor is constructed using the periodic numbers of atoms and the lengths of their radius-vectors. The details are in the file `mpds_ml_labs/prediction.py`.
 
-As a machine-learning model an ensemble of decision trees ([random forest regressor](http://scikit-learn.org/stable/modules/ensemble.html)) is used, as implemented in [scikit-learn](http://scikit-learn.org) Python machine-learning toolkit. The whole MPDS dataset can be used for training. In order to estimate the prediction quality of the _regressor_ model, the _mean absolute error_ and _R2 coefficient of determination_ is saved. In order to estimate the prediction quality of the binary _classifier_ model, the _fraction incorrect_ (_i.e._ _error percentage_) is saved. The evaluation process is repeated at least 30 times to achieve a statistical reliability.
+An ensemble of decision trees ([random forest regressor](http://scikit-learn.org/stable/modules/ensemble.html)) is used as the machine-learning model, as implemented in the [scikit-learn](http://scikit-learn.org) Python machine-learning toolkit. The whole MPDS dataset can be used for training. To estimate the prediction quality of the _regressor_ model, the _mean absolute error_ and the _R2 coefficient of determination_ are saved. To estimate the prediction quality of the binary _classifier_ model, the _fraction incorrect_ (_i.e._ the _error percentage_) is saved. The evaluation process is repeated at least 30 times to achieve statistical reliability.
+
+For generating the crystal structure from the physical properties, see `mpds_ml_labs/test_design.py`.
+
 
 API
 ------
@@ -66,18 +78,21 @@ At the local server:
 
 ```shell
 curl -XPOST http://localhost:5000/predict -d "structure=data_in_CIF_or_POSCAR"
+curl -XPOST http://localhost:5000/design -d "numerics=ranges_of_values_of_the_8_properties_in_JSON"
 ```
 
-At the demonstration Tilde server (may be switched off):
+At the demonstration MPDS server (may be switched off):
 
 ```shell
-curl -XPOST https://tilde.pro/services/predict -d "structure=data_in_CIF_or_POSCAR"
+curl -XPOST https://labs.mpds.io/predict -d "structure=data_in_CIF_or_POSCAR"
+curl -XPOST https://labs.mpds.io/design -d "numerics=ranges_of_values_of_the_8_properties_in_JSON"
 ```
 
+
 Credits
 ------
 
-This project is built on top of the following open-source scientific software:
+This project is built on top of open-source scientific software, such as:
 
 - [scikit-learn](http://scikit-learn.org)
 - [pandas](https://pandas.pydata.org)
@@ -87,17 +102,17 @@ This project is built on top of the following open-source scientific software:
 - [cifplayer](http://tilde-lab.github.io/player.html)
 - [MPDS API client](http://developer.mpds.io)
 
+
 License
 ------
 
 - The client and the server code: *LGPL-2.1+*
-- The [open part](https://mpds.io/open-data-api) of the MPDS data (5%): *CC BY 4.0*
-- The closed part of the MPDS data (95%): *commercial*
+- The machine-learning MPDS data generated as presented here: *CC BY 4.0*
+- The [open part](https://mpds.io/open-data-api) of the experimental MPDS data (5%): *CC BY 4.0*
+- The closed part of the experimental MPDS data (95%): *commercial*
+
 
 Citation
 ------
 
-[![DOI](https://zenodo.org/badge/110734326.svg)](https://zenodo.org/badge/latestdoi/110734326)
-
-Also please feel free to cite:
-- Blokhin E, Villars P, PAULING FILE and MPDS materials data infrastructure, in preparation, **2018**
+- Blokhin E, Villars P, Quantitative trends in physical properties of inorganic compounds via machine learning, [arXiv](https://arxiv.org/abs/1806.03553), **2018**
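Reviewer note: the `/design` call documented in the README's API section can also be issued from Python. The sketch below is only an illustration, not part of this change. It assumes the request shape that `design()` in `mpds_ml_labs/app.py` validates: a form field `numerics` holding a JSON object that maps each property id to a two-element `[min, max]` list. The helper names are hypothetical, and only the ids `w` and `t` are visible in this diff; the server rejects requests that omit any id listed in `prop_models`.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_design_numerics(ranges):
    """Turn {prop_id: (min, max)} into the {prop_id: [min, max]} object
    that the /design endpoint expects inside the `numerics` field."""
    numerics = {}
    for prop_id, bounds in ranges.items():
        low, high = bounds  # raises if not exactly two items
        low, high = float(low), float(high)
        if low > high:
            raise ValueError("min exceeds max for property %r" % prop_id)
        numerics[prop_id] = [low, high]
    return numerics


def encode_design_body(numerics):
    """URL-encode the form body, as `curl -d "numerics=..."` would send it."""
    return urlencode({"numerics": json.dumps(numerics)}).encode("ascii")


def request_design(url, ranges, timeout=30):
    """POST the ranges and decode the JSON answer (hypothetical helper)."""
    body = encode_design_body(build_design_numerics(ranges))
    with urlopen(Request(url, data=body), timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

For a local server this would be called as `request_design("http://localhost:5000/design", ranges)`, with `ranges` covering all eight properties. Note that, per the server code, a band gap range of `[0, 0]` for `w` is treated as "any band gap is allowed".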
@@ -4,4 +4,13 @@ ml_models =
     /path_to_models/model_one.pkl
     /path_to_models/model_two.pkl
 api_key =
-api_endpoint =
+api_endpoint = https://api.mpds.io/v0/download/facet
+els_endpoint = https://api.mpds.io/v0/download/els_comb
+
+[db]
+user = postgres
+password =
+database = materials_ai
+table = ml_knn
+host = localhost
+port = 5432
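Reviewer note: the new `[db]` section can be read with the standard library alone. The sketch below is a minimal illustration for Python 3 (`configparser`; the module is named `ConfigParser` in Python 2). How `connect_database` in `common.py` actually consumes these values is not shown in this diff, so the function name and the returned dictionary layout are assumptions.

```python
import configparser

# Same keys as the [db] section added to data/settings.ini.sample
SAMPLE_SETTINGS = """
[db]
user = postgres
password =
database = materials_ai
table = ml_knn
host = localhost
port = 5432
"""


def read_db_settings(text):
    """Parse the [db] section into a plain dict (hypothetical helper)."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    db = parser["db"]
    return {
        "user": db.get("user"),
        "password": db.get("password", ""),  # an empty value parses as ''
        "database": db.get("database"),
        "table": db.get("table"),
        "host": db.get("host"),
        "port": db.getint("port"),  # converted to int for the driver
    }


settings = read_db_settings(SAMPLE_SETTINGS)
```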
@@ -0,0 +1,87 @@
+"""
+Use to migrate and deploy the models to servers
+with a different architecture, since the saved sklearn models
+are not transferable
+"""
+import os
+import json
+
+import numpy as np
+import pandas as pd
+from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
+
+from imblearn.over_sampling import RandomOverSampler
+
+from mpds_client import MPDSExport
+from mpds_ml_labs.prediction import estimate_regr_quality, estimate_clfr_quality
+
+
+def get_regr(params=None):
+    # avoid a mutable default argument
+    return RandomForestRegressor(**(params or {}))
+
+def get_clfr(params=None):
+    # avoid a mutable default argument
+    return RandomForestClassifier(**(params or {}))
+
+
+results = []
+
+DATA_DIR = '/data'
+
+with open('ml_export.json', 'r') as f:
+    final_values = json.load(f)
+
+for key, value in final_values.items():
+    print("*"*100)
+    print("Importing model-%s" % key)
+
+    if key == '0':
+        white_data_file, black_data_file = os.path.join(DATA_DIR, value['white']), os.path.join(DATA_DIR, value['black'])
+        white_df, black_df = pd.read_pickle(white_data_file), pd.read_pickle(black_data_file)
+        white_df['Class'] = 0
+        black_df['Class'] = 1
+        all_df = pd.concat([white_df, black_df])
+        X = all_df['Descriptor'].tolist()
+        y = all_df['Class'].tolist()
+
+        min_x_len = min([len(j) for j in X])
+        for n in range(len(X)):
+            if len(X[n]) > min_x_len:
+                X[n] = X[n][:min_x_len]
+
+        X = np.array(X, dtype=float)
+        ros = RandomOverSampler()
+        X_resampled, y_resampled = ros.fit_sample(X, y)
+
+        error_percentage = estimate_clfr_quality(get_clfr(value['params']), X_resampled, y_resampled)
+        print("Avg. error percentage: %.3f" % error_percentage)
+
+        algo = get_clfr(value['params'])
+        algo.fit(X_resampled, y_resampled)
+        algo.metadata = {'error_percentage': error_percentage}
+
+        export_file = MPDSExport.save_model(algo, 0)
+        print("Saving %s" % export_file)
+        results.append(export_file)
+
+    else:
+        data_file = os.path.join(DATA_DIR, value['file'])
+        df = pd.read_pickle(data_file)
+        X = np.array(df['Descriptor'].tolist())
+        n_samples, n_x, n_y = X.shape
+        X = X.reshape(n_samples, n_x * n_y)
+        y = df['Avgvalue'].tolist()
+
+        avg_mae, avg_r2 = estimate_regr_quality(get_regr(value['params']), X, y)
+        print("Avg. MAE: %.2f; avg. R2 score: %.2f" % (avg_mae, avg_r2))
+
+        algo = get_regr(value['params'])
+        algo.fit(X, y)
+        algo.metadata = {'mae': avg_mae, 'r2': round(avg_r2, 2)}
+
+        export_file = MPDSExport.save_model(algo, key)
+        print("Saving %s" % export_file)
+        results.append(export_file)
+
+for f in results:
+    print(f)
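Reviewer note: the two array-shaping steps in `model_importer.py` can be isolated as small functions, which makes them easier to test. The sketch below only mirrors what the script does to the `Descriptor` column: the classifier branch trims every descriptor to the shortest length, and the regressor branch flattens each 2D descriptor into one row. The function names are hypothetical.

```python
import numpy as np


def trim_to_shortest(rows):
    """Cut every row to the length of the shortest one, so that the
    rows stack into a rectangular float array (classifier branch)."""
    min_len = min(len(row) for row in rows)
    return np.array([row[:min_len] for row in rows], dtype=float)


def flatten_descriptors(descriptors):
    """Reshape (n_samples, n_x, n_y) descriptors into
    (n_samples, n_x * n_y) feature rows (regressor branch)."""
    X = np.asarray(descriptors, dtype=float)
    n_samples, n_x, n_y = X.shape
    return X.reshape(n_samples, n_x * n_y)


trimmed = trim_to_shortest([[1, 2, 3], [4, 5], [6, 7, 8, 9]])
flat = flatten_descriptors(np.zeros((4, 3, 2)))
```

Trimming discards the tail of the longer descriptors, so the shortest sample in the data set determines how much information every other sample keeps.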
@@ -5,10 +5,13 @@
 
 from flask import Flask, Blueprint, Response, request, send_from_directory
 
-from struct_utils import detect_format, poscar_to_ase, refine, get_formula
+from struct_utils import detect_format, poscar_to_ase, refine, get_formula, order_disordered
 from cif_utils import cif_to_ase, ase_to_eq_cif
-from prediction import get_prediction, get_aligned_descriptor, get_ordered_descriptor, get_legend, load_ml_models
-from common import SERVE_UI, ML_MODELS
+from prediction import prop_models, get_prediction, get_aligned_descriptor, get_ordered_descriptor, get_legend, load_ml_models
+from common import SERVE_UI, ML_MODELS, connect_database
+from knn_sample import knn_sample
+from similar_els import materialize, score
+from prediction_ranges import TOL_QUALITY
 
 
 app_labs = Blueprint('app_labs', __name__)
@@ -41,27 +44,47 @@ def html_formula(string):
 
 if SERVE_UI:
     @app_labs.route('/', methods=['GET'])
+    @app_labs.route('/props.html', methods=['GET'])
     def index():
-        return send_from_directory(static_path, 'index.html')
-    @app_labs.route('/index.css', methods=['GET'])
-    def style():
-        return send_from_directory(static_path, 'index.css')
+        return send_from_directory(static_path, 'props.html')
+
+    @app_labs.route('/common.css', methods=['GET'])
+    def css():
+        return send_from_directory(static_path, 'common.css')
+
     @app_labs.route('/player.html', methods=['GET'])
     def player():
         return send_from_directory(static_path, 'player.html')
 
+    @app_labs.route('/design.html', methods=['GET'])
+    def md():
+        return send_from_directory(static_path, 'design.html')
+
+    @app_labs.route('/jquery.min.js', methods=['GET'])
+    def jquery():
+        return send_from_directory(static_path, 'jquery.min.js')
+
+    @app_labs.route('/nouislider.min.js', methods=['GET'])
+    def nouislider():
+        return send_from_directory(static_path, 'nouislider.min.js')
+
 @app_labs.after_request
 def add_cors_header(response):
     response.headers['Access-Control-Allow-Origin'] = '*'
     return response
 
 @app_labs.route("/predict", methods=['POST'])
 def predict():
+    """
+    The main endpoint for property
+    prediction, based on the provided CIF
+    or POSCAR
+    """
     if 'structure' not in request.values:
         return fmt_msg('Invalid request')
 
     structure = request.values.get('structure')
-    if not 0 < len(structure) < 32768:
+    if not 0 < len(structure) < 200000:
         return fmt_msg('Request size is invalid')
 
     if not is_plain_text(structure):
@@ -116,6 +139,110 @@ def predict():
         content_type='application/json'
     )
 
+@app_labs.route("/download_cif", methods=['POST'])
+def download_cif():
+    """
+    A utility endpoint to force
+    a browser file (CIF) download
+    """
+    structure = request.values.get('structure')
+    title = request.values.get('title')
+
+    if not structure or not title:
+        return fmt_msg('Invalid request')
+
+    if not 0 < len(structure) < 100000:
+        return fmt_msg('Request size is invalid')
+
+    return Response(structure, mimetype="chemical/x-cif", headers={
+        "Content-Disposition": "attachment;filename=%s.cif" % title
+    })
+
+@app_labs.route("/design", methods=['POST'])
+def design():
+    """
+    A main endpoint for generating
+    the CIF structure based on
+    the provided values of the properties
+    """
+    if 'numerics' not in request.values:
+        return fmt_msg('Invalid request')
+
+    try: numerics = json.loads(request.values.get('numerics'))
+    except (TypeError, ValueError): # malformed or missing JSON
+        return fmt_msg('Invalid request')
+    if not isinstance(numerics, dict):
+        return fmt_msg('Invalid request')
+
+    user_ranges_dict = {}
+
+    for prop_id in prop_models:
+        if prop_id not in numerics or type(numerics[prop_id]) != list or len(numerics[prop_id]) != 2:
+            return fmt_msg('Invalid request')
+        try: user_ranges_dict[prop_id + '_min'], user_ranges_dict[prop_id + '_max'] = float(numerics[prop_id][0]), float(numerics[prop_id][1])
+        except (TypeError, ValueError, OverflowError): # non-numeric range bound
+            return fmt_msg('Invalid request')
+
+    if user_ranges_dict['w_min'] == 0 and user_ranges_dict['w_max'] == 0:
+        user_ranges_dict['w_min'], user_ranges_dict['w_max'] = -100, 100 # NB. any band gap is allowed
+
+    cursor, connection = connect_database()
+
+    result, error = None, "No results (outside of prediction capabilities)"
+
+    els_samples = knn_sample(cursor, user_ranges_dict)
+    for els_sample in els_samples:
+        #print "TRYING TO MATERIALIZE", ", ".join(els_sample)
+
+        scoring, error = materialize(els_sample, active_ml_models)
+        if error or not scoring:
+            continue
+
+        result = score(scoring, user_ranges_dict)
+        break
+
+    connection.close()
+
+    if result:
+        answer_props = {prop_id: result['prediction'][prop_id]['value'] for prop_id in result['prediction']}
+        answer_props['t'] /= 100000 # normalization 10**5
+
+        if 'disordered' in result['structure'].info:
+            result['structure'], error = order_disordered(result['structure'])
+            if error: return fmt_msg(error)
+            result['structure'].center(about=0.0)
+
+        formula = get_formula(result['structure'])
+
+        result_quality, aux_info = 0, []
+        for k, v in answer_props.items():
+            # NB. the requested ranges live in user_ranges_dict ("sample" was undefined here)
+            aux_info.append([
+                prop_models[k]['name'].replace(' ', '_'),
+                user_ranges_dict[k + '_min'],
+                v,
+                user_ranges_dict[k + '_max'],
+                prop_models[k]['units']
+            ])
+            tol = (user_ranges_dict[k + '_max'] - user_ranges_dict[k + '_min']) * TOL_QUALITY
+            if user_ranges_dict[k + '_min'] - tol < v < user_ranges_dict[k + '_max'] + tol:
+                result_quality += 1
+
+        return Response(
+            json.dumps({
+                'vis_cif': ase_to_eq_cif(
+                    result['structure'],
+                    supply_sg=False,
+                    mpds_labs_loop=[result_quality] + aux_info
+                ),
+                'props': answer_props,
+                'formula': html_formula(formula),
+                'title': formula
+                }, indent=4, escape_forward_slashes=False
+            ),
+            content_type='application/json'
+        )
+    return fmt_msg(error)
+
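Reviewer note: the quality score computed at the end of `design()` counts how many predicted properties fall inside the requested range, after widening the range on both sides by a fraction of its width. The sketch below restates that rule as a standalone function. The function name is hypothetical, and `tol_fraction` stands in for `TOL_QUALITY`, whose actual value is defined in `prediction_ranges.py` and is not shown in this diff.

```python
def count_in_range(predicted, ranges, tol_fraction):
    """Count predicted values lying strictly inside their requested
    [min, max] range, widened by tol_fraction of the range width."""
    hits = 0
    for prop_id, value in predicted.items():
        low = ranges[prop_id + "_min"]
        high = ranges[prop_id + "_max"]
        tol = (high - low) * tol_fraction
        if low - tol < value < high + tol:
            hits += 1
    return hits


# Illustrative numbers only, not real predictions
requested = {"w_min": 0.5, "w_max": 2.0, "t_min": 0.0, "t_max": 10.0}
predicted = {"w": 1.0, "t": 12.0}
strict = count_in_range(predicted, requested, 0.1)
loose = count_in_range(predicted, requested, 0.25)
```

With a tolerance fraction of 0.1 only `w` counts as a hit, since `t` may exceed its upper bound by at most 1.0. With 0.25 the allowed overshoot for `t` grows to 2.5 and both properties count.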
 
 if __name__ == '__main__':
     if sys.argv[1:]: