This repository was archived by the owner on Jul 20, 2025. It is now read-only.

Commit dc98568

Merge commits from an MPDS branch
2 parents 6566f7e + 32ecfd5 commit dc98568

19 files changed: 1443 additions and 145 deletions

README.md

Lines changed: 30 additions & 15 deletions
@@ -1,12 +1,16 @@
1-
Data-driven predictions from the crystalline structure
1+
Data-driven predictions: from crystal structure to physical properties and vice versa
22
======
33

4+
[![DOI](https://zenodo.org/badge/110734326.svg)](https://zenodo.org/badge/latestdoi/110734326)
5+
46
![Materials simulations ab datum](https://raw.githubusercontent.com/mpds-io/mpds-ml-labs/master/crystallographer_mpds_cc_by_40.png "Materials simulation ab datum")
57

6-
Live demo
8+
9+
Live demos
710
------
811

9-
[mpds.io/ml](https://mpds.io/ml)
12+
[mpds.io/ml](https://mpds.io/ml) and [mpds.io/materials-design](https://mpds.io/materials-design)
13+
1014

1115
Rationale
1216
------
@@ -22,6 +26,9 @@ This is the proof of concept, how a relatively unsophisticated statistical model
2226
- linear thermal expansion coefficient
2327
- band gap (or its absence, _i.e._ whether a crystal is a conductor or an insulator)
2428

29+
Further, the reverse task is solved: predicting a possible crystalline structure from a set of given properties. Suitable chemical elements are found, and the resulting structure is generated (if possible) from the available MPDS prototypes.
30+
31+
2532
Installation
2633
------
2734

@@ -33,17 +40,19 @@ cd REPO_FOLDER
3340
pip install -r requirements.txt
3441
```
3542

36-
Currently only *Python 2* is supported (*Python 3* support is coming).
43+
Currently only *Python 2* is supported (*Python 3* support is almost there).
44+
3745

3846
Preparation
3947
------
4048

4149
The model is trained on the MPDS data using the MPDS API and the scripts `train_regressor.py` and `train_classifier.py`. A subset of the full MPDS data is open and can be obtained via the MPDS API [for free](https://mpds.io/open-data-api).
4250

51+
4352
Architecture and usage
4453
------
4554

46-
Can be used either as a standalone command-line application or as a client-server application. In the latter case, the client and the server communicate over HTTP, and any client able to execute HTTP requests is supported, be it a `curl` command-line client or rich web-browser user interface. As an example of the latter, a simple HTML5 app `index.html` is supplied in the `webassets` folder. Server part is a Flask app:
55+
Can be used either as a standalone command-line application or as a client-server application. In the latter case, the client and the server communicate over HTTP, and any client able to execute HTTP requests is supported, be it a `curl` command-line client or rich web-browser user interface. For example, the simple HTML5 apps `props.html` and `design.html` are supplied in the `webassets` folder. Server part is a Flask app:
4756

4857
```python
4958
python mpds_ml_labs/app.py
@@ -57,7 +66,10 @@ Used descriptor and model details
5766

5867
The term _descriptor_ stands for the compact information-rich representation, allowing the convenient mathematical treatment of the encoded complex data (_i.e._ crystalline structure). Any crystalline structure is populated to a certain relatively big fixed volume of minimum one cubic nanometer. Then the descriptor is constructed using the periodic numbers of atoms and the lengths of their radius-vectors. The details are in the file `mpds_ml_labs/prediction.py`.
5968

60-
As a machine-learning model an ensemble of decision trees ([random forest regressor](http://scikit-learn.org/stable/modules/ensemble.html)) is used, as implemented in [scikit-learn](http://scikit-learn.org) Python machine-learning toolkit. The whole MPDS dataset can be used for training. In order to estimate the prediction quality of the _regressor_ model, the _mean absolute error_ and _R2 coefficient of determination_ is saved. In order to estimate the prediction quality of the binary _classifier_ model, the _fraction incorrect_ (_i.e._ _error percentage_) is saved. The evaluation process is repeated at least 30 times to achieve a statistical reliability.
69+
As a machine-learning model, an ensemble of decision trees ([random forest regressor](http://scikit-learn.org/stable/modules/ensemble.html)) is used, as implemented in the [scikit-learn](http://scikit-learn.org) Python machine-learning toolkit. The whole MPDS dataset can be used for training. In order to estimate the prediction quality of the _regressor_ model, the _mean absolute error_ and the _R2 coefficient of determination_ are saved. In order to estimate the prediction quality of the binary _classifier_ model, the _fraction incorrect_ (_i.e._ the _error percentage_) is saved. The evaluation process is repeated at least 30 times to achieve statistical reliability.
70+
71+
For generating the crystal structure from the physical properties, see `mpds_ml_labs/test_design.py`.
72+
6173

6274
API
6375
------
@@ -66,18 +78,21 @@ At the local server:
6678

6779
```shell
6880
curl -XPOST http://localhost:5000/predict -d "structure=data_in_CIF_or_POSCAR"
81+
curl -XPOST http://localhost:5000/design -d "numerics=ranges_of_values_of_the_8_properties_in_JSON"
6982
```
7083

71-
At the demonstration Tilde server (may be switched off):
84+
At the demonstration MPDS server (may be switched off):
7285

7386
```shell
74-
curl -XPOST https://tilde.pro/services/predict -d "structure=data_in_CIF_or_POSCAR"
87+
curl -XPOST https://labs.mpds.io/predict -d "structure=data_in_CIF_or_POSCAR"
88+
curl -XPOST https://labs.mpds.io/design -d "numerics=ranges_of_values_of_the_8_properties_in_JSON"
7589
```
7690

91+
7792
Credits
7893
------
7994

80-
This project is built on top of the following open-source scientific software:
95+
This project is built on top of open-source scientific software, such as:
8196

8297
- [scikit-learn](http://scikit-learn.org)
8398
- [pandas](https://pandas.pydata.org)
@@ -87,17 +102,17 @@ This project is built on top of the following open-source scientific software:
87102
- [cifplayer](http://tilde-lab.github.io/player.html)
88103
- [MPDS API client](http://developer.mpds.io)
89104

105+
90106
License
91107
------
92108

93109
- The client and the server code: *LGPL-2.1+*
94-
- The [open part](https://mpds.io/open-data-api) of the MPDS data (5%): *CC BY 4.0*
95-
- The closed part of the MPDS data (95%): *commercial*
110+
- The machine-learning MPDS data generated as presented here: *CC BY 4.0*
111+
- The [open part](https://mpds.io/open-data-api) of the experimental MPDS data (5%): *CC BY 4.0*
112+
- The closed part of the experimental MPDS data (95%): *commercial*
113+
96114

97115
Citation
98116
------
99117

100-
[![DOI](https://zenodo.org/badge/110734326.svg)](https://zenodo.org/badge/latestdoi/110734326)
101-
102-
Also please feel free to cite:
103-
- Blokhin E, Villars P, PAULING FILE and MPDS materials data infrastructure, in preparation, **2018**
118+
- Blokhin E, Villars P, Quantitative trends in physical properties of inorganic compounds via machine learning, [arXiv](https://arxiv.org/abs/1806.03553), **2018**

data/settings.ini.sample

Lines changed: 10 additions & 1 deletion
@@ -4,4 +4,13 @@ ml_models =
44
/path_to_models/model_one.pkl
55
/path_to_models/model_two.pkl
66
api_key =
7-
api_endpoint =
7+
api_endpoint = https://api.mpds.io/v0/download/facet
8+
els_endpoint = https://api.mpds.io/v0/download/els_comb
9+
10+
[db]
11+
user = postgres
12+
password =
13+
database = materials_ai
14+
table = ml_knn
15+
host = localhost
16+
port = 5432
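
The new `[db]` section could be read along the following lines. This is only a minimal sketch using the standard-library `configparser` under Python 3; the helper name `read_db_settings` is hypothetical and not part of the repository, which obtains its connection through `connect_database` from `common`.

```python
import configparser

# Sample text mirroring the [db] section added to settings.ini.sample
SAMPLE = """
[db]
user = postgres
password =
database = materials_ai
table = ml_knn
host = localhost
port = 5432
"""


def read_db_settings(text):
    """Parse the [db] section of INI text into a dict (hypothetical helper)."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    section = parser["db"]
    return {
        "user": section.get("user"),
        "password": section.get("password"),  # empty string when left blank
        "database": section.get("database"),
        "table": section.get("table"),
        "host": section.get("host"),
        "port": section.getint("port"),       # converted to int
    }


settings = read_db_settings(SAMPLE)
```

An empty `password =` line parses as an empty string rather than `None`, so a caller would need to decide explicitly how to treat a blank password.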

model_importer.py

Lines changed: 87 additions & 0 deletions
@@ -0,0 +1,87 @@
1+
"""
2+
Use to migrate and deploy models at the servers
3+
with the different architecture, since the sklearn models
4+
are not transferable
5+
"""
6+
import os
7+
import json
8+
9+
import numpy as np
10+
import pandas as pd
11+
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
12+
13+
from imblearn.over_sampling import RandomOverSampler
14+
15+
from mpds_client import MPDSExport
16+
from mpds_ml_labs.prediction import estimate_regr_quality, estimate_clfr_quality
17+
18+
19+
def get_regr(params={}):
20+
return RandomForestRegressor(**params)
21+
22+
def get_clfr(params={}):
23+
return RandomForestClassifier(**params)
24+
25+
26+
results = []
27+
28+
DATA_DIR = '/data'
29+
30+
f = open('ml_export.json', 'r')
31+
final_values = json.loads(f.read())
32+
f.close()
33+
34+
for key, value in final_values.items():
35+
print("*"*100)
36+
print("Importing model-%s" % key)
37+
38+
if key == '0':
39+
white_data_file, black_data_file = os.path.join(DATA_DIR, value['white']), os.path.join(DATA_DIR, value['black'])
40+
white_df, black_df = pd.read_pickle(white_data_file), pd.read_pickle(black_data_file)
41+
white_df['Class'] = 0
42+
black_df['Class'] = 1
43+
all_df = pd.concat([white_df, black_df])
44+
X = all_df['Descriptor'].tolist()
45+
y = all_df['Class'].tolist()
46+
47+
min_x_len = min([len(j) for j in X])
48+
for n in range(len(X)):
49+
if len(X[n]) > min_x_len:
50+
X[n] = X[n][:min_x_len]
51+
52+
X = np.array(X, dtype=float)
53+
ros = RandomOverSampler()
54+
X_resampled, y_resampled = ros.fit_sample(X, y)
55+
56+
error_percentage = estimate_clfr_quality(get_clfr(value['params']), X_resampled, y_resampled)
57+
print("Avg. error percentage: %.3f" % error_percentage)
58+
59+
algo = get_clfr(value['params'])
60+
algo.fit(X_resampled, y_resampled)
61+
algo.metadata = {'error_percentage': error_percentage}
62+
63+
export_file = MPDSExport.save_model(algo, 0)
64+
print("Saving %s" % export_file)
65+
results.append(export_file)
66+
67+
else:
68+
data_file = os.path.join(DATA_DIR, value['file'])
69+
df = pd.read_pickle(data_file)
70+
X = np.array(df['Descriptor'].tolist())
71+
n_samples, n_x, n_y = X.shape
72+
X = X.reshape(n_samples, n_x * n_y)
73+
y = df['Avgvalue'].tolist()
74+
75+
avg_mae, avg_r2 = estimate_regr_quality(get_regr(value['params']), X, y)
76+
print("Avg. MAE: %.2f; avg. R2 score: %.2f" % (avg_mae, avg_r2))
77+
78+
algo = get_regr(value['params'])
79+
algo.fit(X, y)
80+
algo.metadata = {'mae': avg_mae, 'r2': round(avg_r2, 2)}
81+
82+
export_file = MPDSExport.save_model(algo, key)
83+
print("Saving %s" % export_file)
84+
results.append(export_file)
85+
86+
for f in results:
87+
print f
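
The two branches of the importer prepare the descriptors differently: the classifier branch cuts every descriptor to the length of the shortest one, while the regressor branch flattens each two-dimensional descriptor into a single row. The sketch below illustrates both steps on toy data. It uses plain Python lists to stay dependency-free (the script itself uses NumPy), and the function names are hypothetical.

```python
def truncate_to_common_length(descriptors):
    """Cut every 1-D descriptor to the length of the shortest one."""
    min_len = min(len(d) for d in descriptors)
    return [[float(v) for v in d[:min_len]] for d in descriptors]


def flatten_descriptors(descriptors):
    """Turn each n_x-by-n_y descriptor into one row of n_x * n_y values."""
    return [[float(v) for row in d for v in row] for d in descriptors]


# Toy data, not real MPDS descriptors
X_clf = truncate_to_common_length([[1, 2, 3], [4, 5], [6, 7, 8, 9]])
X_reg = flatten_descriptors([[[1, 2], [3, 4]], [[5, 6], [7, 8]]])
```

Truncation discards the trailing entries of longer descriptors, so it presumably relies on the most informative entries coming first.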

mpds_ml_labs/app.py

Lines changed: 135 additions & 8 deletions
@@ -5,10 +5,13 @@
55

66
from flask import Flask, Blueprint, Response, request, send_from_directory
77

8-
from struct_utils import detect_format, poscar_to_ase, refine, get_formula
8+
from struct_utils import detect_format, poscar_to_ase, refine, get_formula, order_disordered
99
from cif_utils import cif_to_ase, ase_to_eq_cif
10-
from prediction import get_prediction, get_aligned_descriptor, get_ordered_descriptor, get_legend, load_ml_models
11-
from common import SERVE_UI, ML_MODELS
10+
from prediction import prop_models, get_prediction, get_aligned_descriptor, get_ordered_descriptor, get_legend, load_ml_models
11+
from common import SERVE_UI, ML_MODELS, connect_database
12+
from knn_sample import knn_sample
13+
from similar_els import materialize, score
14+
from prediction_ranges import TOL_QUALITY
1215

1316

1417
app_labs = Blueprint('app_labs', __name__)
@@ -41,27 +44,47 @@ def html_formula(string):
4144

4245
if SERVE_UI:
4346
@app_labs.route('/', methods=['GET'])
47+
@app_labs.route('/props.html', methods=['GET'])
4448
def index():
45-
return send_from_directory(static_path, 'index.html')
46-
@app_labs.route('/index.css', methods=['GET'])
47-
def style():
48-
return send_from_directory(static_path, 'index.css')
49+
return send_from_directory(static_path, 'props.html')
50+
51+
@app_labs.route('/common.css', methods=['GET'])
52+
def css():
53+
return send_from_directory(static_path, 'common.css')
54+
4955
@app_labs.route('/player.html', methods=['GET'])
5056
def player():
5157
return send_from_directory(static_path, 'player.html')
5258

59+
@app_labs.route('/design.html', methods=['GET'])
60+
def md():
61+
return send_from_directory(static_path, 'design.html')
62+
63+
@app_labs.route('/jquery.min.js', methods=['GET'])
64+
def jquery():
65+
return send_from_directory(static_path, 'jquery.min.js')
66+
67+
@app_labs.route('/nouislider.min.js', methods=['GET'])
68+
def nouislider():
69+
return send_from_directory(static_path, 'nouislider.min.js')
70+
5371
@app_labs.after_request
5472
def add_cors_header(response):
5573
response.headers['Access-Control-Allow-Origin'] = '*'
5674
return response
5775

5876
@app_labs.route("/predict", methods=['POST'])
5977
def predict():
78+
"""
79+
A main endpoint for the properties
80+
prediction, based on the provided CIF
81+
or POSCAR
82+
"""
6083
if 'structure' not in request.values:
6184
return fmt_msg('Invalid request')
6285

6386
structure = request.values.get('structure')
64-
if not 0 < len(structure) < 32768:
87+
if not 0 < len(structure) < 200000:
6588
return fmt_msg('Request size is invalid')
6689

6790
if not is_plain_text(structure):
@@ -116,6 +139,110 @@ def predict():
116139
content_type='application/json'
117140
)
118141

142+
@app_labs.route("/download_cif", methods=['POST'])
143+
def download_cif():
144+
"""
145+
A utility endpoint to force
146+
a browser file (CIF) download
147+
"""
148+
structure = request.values.get('structure')
149+
title = request.values.get('title')
150+
151+
if not structure or not title:
152+
return fmt_msg('Invalid request')
153+
154+
if not 0 < len(structure) < 100000:
155+
return fmt_msg('Request size is invalid')
156+
157+
return Response(structure, mimetype="chemical/x-cif", headers={
158+
"Content-Disposition": "attachment;filename=%s.cif" % title
159+
})
160+
161+
@app_labs.route("/design", methods=['POST'])
162+
def design():
163+
"""
164+
A main endpoint for generating
165+
the CIF structure based on
166+
the provided values of the properties
167+
"""
168+
if 'numerics' not in request.values:
169+
return fmt_msg('Invalid request')
170+
171+
try: numerics = json.loads(request.values.get('numerics'))
172+
except:
173+
return fmt_msg('Invalid request')
174+
if type(numerics) != dict:
175+
return fmt_msg('Invalid request')
176+
177+
user_ranges_dict = {}
178+
179+
for prop_id in prop_models:
180+
if prop_id not in numerics or type(numerics[prop_id]) != list or len(numerics[prop_id]) != 2:
181+
return fmt_msg('Invalid request')
182+
try: user_ranges_dict[prop_id + '_min'], user_ranges_dict[prop_id + '_max'] = float(numerics[prop_id][0]), float(numerics[prop_id][1])
183+
except:
184+
return fmt_msg('Invalid request')
185+
186+
if user_ranges_dict['w_min'] == 0 and user_ranges_dict['w_max'] == 0:
187+
user_ranges_dict['w_min'], user_ranges_dict['w_max'] = -100, 100 # NB. any band gap is allowed
188+
189+
cursor, connection = connect_database()
190+
191+
result, error = None, "No results (outside of prediction capabilities)"
192+
193+
els_samples = knn_sample(cursor, user_ranges_dict)
194+
for els_sample in els_samples:
195+
#print "TRYING TO MATERIALIZE", ", ".join(els_sample)
196+
197+
scoring, error = materialize(els_sample, active_ml_models)
198+
if error or not scoring:
199+
continue
200+
201+
result = score(scoring, user_ranges_dict)
202+
break
203+
204+
connection.close()
205+
206+
if result:
207+
answer_props = {prop_id: result['prediction'][prop_id]['value'] for prop_id in result['prediction']}
208+
answer_props['t'] /= 100000 # normalization 10**5
209+
210+
if 'disordered' in result['structure'].info:
211+
result['structure'], error = order_disordered(result['structure'])
212+
if error: return fmt_msg(error)
213+
result['structure'].center(about=0.0)
214+
215+
formula = get_formula(result['structure'])
216+
217+
result_quality, aux_info = 0, []
218+
for k, v in answer_props.items():
219+
aux_info.append([
220+
prop_models[k]['name'].replace(' ', '_'),
221+
sample[k + '_min'],
222+
v,
223+
sample[k + '_max'],
224+
prop_models[k]['units']
225+
])
226+
tol = (sample[k + '_max'] - sample[k + '_min']) * TOL_QUALITY
227+
if sample[k + '_min'] - tol < v < sample[k + '_max'] + tol:
228+
result_quality += 1
229+
230+
return Response(
231+
json.dumps({
232+
'vis_cif': ase_to_eq_cif(
233+
result['structure'],
234+
supply_sg=False,
235+
mpds_labs_loop=[result_quality] + aux_info
236+
),
237+
'props': answer_props,
238+
'formula': html_formula(formula),
239+
'title': formula
240+
}, indent=4, escape_forward_slashes=False
241+
),
242+
content_type='application/json'
243+
)
244+
return fmt_msg(error)
245+
119246

120247
if __name__ == '__main__':
121248
if sys.argv[1:]:
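
The quality figure reported by the `/design` endpoint counts how many predicted properties fall inside the requested range once that range is widened by a tolerance proportional to its width. The sketch below isolates this counting step under stated assumptions: the function name `count_in_range` is hypothetical, the ranges are passed in explicitly, and the tolerance factor `0.25` is an illustrative value, since the actual `TOL_QUALITY` constant is not shown in this diff.

```python
def count_in_range(predicted, ranges, tol_quality):
    """Count the properties whose predicted value lies within the requested
    [min, max] range, widened on both sides by tol_quality * (max - min)."""
    quality = 0
    for prop_id, value in predicted.items():
        low = ranges[prop_id + "_min"]
        high = ranges[prop_id + "_max"]
        tol = (high - low) * tol_quality
        if low - tol < value < high + tol:
            quality += 1
    return quality


# Toy values for two property ids; the numbers are made up for illustration
ranges = {"w_min": 1.0, "w_max": 3.0, "t_min": 0.0, "t_max": 10.0}
predicted = {"w": 3.2, "t": 14.0}

quality = count_in_range(predicted, ranges, tol_quality=0.25)  # assumed factor
strict_quality = count_in_range(predicted, ranges, tol_quality=0.0)
```

Here `w` is accepted only because of the tolerance (3.2 lies within the widened interval from 0.5 to 3.5), while `t` misses even the widened interval from -2.5 to 12.5.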
