Commit cc98f12

authored
Update db.rst
1 parent f2e9941 commit cc98f12

docs/modules/db.rst

Lines changed: 35 additions & 42 deletions
@@ -4,7 +4,6 @@ API - Database
44
This is the alpha version of the database management system.
55
If you have trouble, you can ask for help on `[email protected] <[email protected]>`_ .
66

7-
87
.. note::
98
We are still writing the documentation, please be patient.
109

@@ -13,39 +12,37 @@ Why TensorDB
1312
----------------
1413

1514
TensorLayer is designed for production and aims to support large-scale machine learning applications.
16-
TensorDB introduce the database infracture to address the many challenges in large scale machine learning project, such as
15+
TensorDB introduces database infrastructure to address the many challenges in large-scale machine learning projects, such as:
1716

18-
1. how to mangage the training data and load the training datasets
17+
1. How to manage the training data and load the training datasets.
1918
2. How to handle a dataset so large that it exceeds the storage capacity of one computer.
20-
3. how shoud we managment different models and version, and comparing different models.
19+
3. How to manage different models and versions, and how to compare different models.
2120
4. How to automate the whole process of training, evaluating and deploying machine learning models.
2221

23-
In the TensorLayer system, we introduce the database technology to the issues above.
22+
In the TensorLayer system, we apply database technology to the issues above.
2423

2524
TensorDB is designed by following three principles.
2625

2726
Everything is Data
2827
^^^^^^^^^^^^^^^^^^
2928

30-
TensorDB is a data warehouse that stores that capture the whole machine learning development process. the data inside tensordb can be catagloried as
31-
32-
1. Data and Labels. Which includes all the data for training, validation and prediction. The labels can be manually labelled or generated by machine
33-
2. Model Architecture.This group store the different model architecture, which user can select to use
34-
3. Model Parameters.This tables stores all the model parameters of echo in the training step.
35-
4. Jobs.all the computation is cutted into several jobs. Each jobs constains some computing work load. for training , the jobs includes training data , the model parameter, the model architecture, how many epochs the training want to do. Similarity are the validation jobs and inference jobs.
36-
5. Logs. The logs store all the step time and accuracy and other metric of each training steps and also the time stamps.
29+
TensorDB is a data warehouse that captures the whole machine learning development process. The data inside TensorDB can be categorized as:
3730

38-
TensorDB is in principal is a key-word based search engine. each model, parameters, or training data are assigned many tags.
39-
The data are stored in two layers. On the top, there is the index layer, which instore the blob storage reference with all the tags assigned to the data. which is implemented based on NoSQL document database such as mongodb. The second layer is used store big chunk of data, such as videos, medical images or image mask, which is usually implemented as file system. Our open source implementation is implemented based MongoDB. The blob data is in store in the gridfs while the tag index is stored in the documents.
31+
1. Data and Labels: This includes all the data for training, validation and prediction. The labels can be manually assigned or generated by machine.
32+
2. Model Architecture: This group stores the different model architectures, which users can select to use.
33+
3. Model Parameters: This table stores all the model parameters of each epoch in the training step.
34+
4. Jobs: All computation is split into several jobs. Each job contains some computing workload. For training, a job includes the training data, the model parameters, the model architecture, and the number of epochs to train. Validation and inference jobs are similar.
35+
5. Logs: The logs store the step time, accuracy and other metrics of each training step, along with the timestamps.
4036

37+
TensorDB is in principle a keyword-based search engine. Each model, parameter set, or training dataset is assigned many tags.
38+
The data are stored in two layers. On top is the index layer, which stores the blob storage references together with all the tags assigned to the data; it is implemented on a NoSQL document database such as MongoDB. The second layer stores big chunks of data, such as videos, medical images or image masks, and is usually implemented as a file system. Our open source implementation is based on MongoDB: the blob data is stored in GridFS, while the tag index is stored in documents.
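The two-layer layout can be illustrated without a running database. The following is a minimal in-memory sketch, not the TensorDB implementation: the class ``TwoLayerStore`` and its methods are hypothetical names chosen for illustration, a plain dict stands in for the blob layer and a list of dicts stands in for the tag index.

```python
import hashlib


class TwoLayerStore:
    """Hypothetical in-memory stand-in for a tag index plus a blob store."""

    def __init__(self):
        self.blobs = {}   # blob layer: blob id -> raw bytes
        self.index = []   # index layer: documents holding tags and a blob reference

    def save(self, payload, tags):
        # Store the payload in the blob layer and keep only a reference in the index.
        blob_id = hashlib.sha1(payload).hexdigest()
        self.blobs[blob_id] = payload
        doc = dict(tags)
        doc["blob_id"] = blob_id
        self.index.append(doc)
        return blob_id

    def find(self, query):
        # Return index documents whose tags match every key/value pair in the query.
        return [d for d in self.index
                if all(d.get(k) == v for k, v in query.items())]

    def load(self, doc):
        # Follow the reference from the index layer into the blob layer.
        return self.blobs[doc["blob_id"]]


store = TwoLayerStore()
store.save(b"weights-epoch-1", {"type": "params", "study": "demo", "epoch": 1})
store.save(b"weights-epoch-2", {"type": "params", "study": "demo", "epoch": 2})
matches = store.find({"type": "params", "epoch": 2})
```

Searching touches only the small index documents; the large payload is read only when ``load`` is called on a matching document.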
4139

4240

4341
Everying is identified by Query
4442
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
4543

46-
47-
Within TensorDB framework, any entity within the data warehouse, such as the data, model or jobs are specified by the database query language.
48-
The first advantage is the query is more efficient in space and can specify multiple objects in a concise way.
44+
Within the TensorDB framework, any entity within the data warehouse, such as data, models or jobs, is specified by the database query language.
45+
The first advantage is that a query is more efficient in space and can specify multiple objects in a concise way.
4946
Another advantage of such a design is that it enables a highly flexible software system.
5047
Data, model architecture and training are interchangeable.
5148
Many tasks can be implemented by simply rewiring different components.
@@ -61,10 +58,10 @@ Also the training system have no clue of epochs, instead, it knows batchize and
6158
Many techniques are introduced behind the streaming interface.
6259
The stream is implemented based on the database cursor technology, so for every search, only the cursors are returned, not the actual data.
6360
The actual data is loaded only when the generator is evaluated.
64-
The data loading is further optimise
61+
The data loading is further optimised:
6562

66-
1. data are compressed and decompressed,
67-
2. the dataloaded in bulk model to optimise the IO traffic
63+
1. Data are compressed and decompressed,
64+
2. The data are loaded in bulk mode to optimise the IO traffic.
6865
3. Augmentation and random sampling are computed on the fly after the data are loaded onto the local computer.
6966
4. To optimise space, there will also be a cache system that stores only the recent blob data.
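The streaming ideas listed above (lazy evaluation, compression, bulk grouping and a cache of recent blobs) can be sketched with the standard library alone. This is an illustrative sketch under stated assumptions, not the TensorDB code: ``make_store``, ``BlobCache`` and ``stream`` are hypothetical names, a dict of compressed bytes stands in for the remote blob store, and a list of keys stands in for a database cursor.

```python
import zlib
from collections import OrderedDict


def make_store(samples):
    # Compress each sample once, as a remote blob store might hold it.
    return {i: zlib.compress(s) for i, s in enumerate(samples)}


class BlobCache:
    """Hypothetical cache that keeps only the most recently used blobs."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = OrderedDict()

    def get(self, key, loader):
        if key in self.items:
            self.items.move_to_end(key)
            return self.items[key]
        value = loader(key)
        self.items[key] = value
        if len(self.items) > self.capacity:
            self.items.popitem(last=False)  # evict the least recently used entry
        return value


def stream(store, keys, cache, bulk_size=2):
    # `keys` plays the role of a cursor: only references are held until iteration.
    for start in range(0, len(keys), bulk_size):
        # Group keys so that a real backend could fetch them in one request.
        bulk = keys[start:start + bulk_size]
        for key in bulk:
            yield cache.get(key, lambda k: zlib.decompress(store[k]))


store = make_store([b"a" * 100, b"b" * 100, b"c" * 100])
cache = BlobCache(capacity=2)
gen = stream(store, [0, 1, 2], cache)  # nothing is decompressed yet
first = next(gen)                      # the first sample is decompressed here
```

Creating the generator does no work; each blob is decompressed only when the consumer asks for it, and the cache never holds more than ``capacity`` decompressed blobs.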
7067

@@ -82,8 +79,7 @@ The exisitng implementation is based on Mongodb.
8279
Further implementations on other databases will be released depending on progress.
8380
It will be straightforward to port the TensorDB system to Google Cloud, AWS and Azure.
8481

85-
86-
The following tutorial is based on the MongoDb implmenetation.
82+
The following tutorials are based on the MongoDB implementation.
8783

8884

8985
Install MongoDB
@@ -93,13 +89,11 @@ The installation instruction of Mongodb can be found at
9389
`MongoDB Docs <https://docs.mongodb.com/manual/installation/>`_
9490
There are also managed MongoDB services from Amazon or GCP, and MongoDB Atlas from MongoDB.
9591

92+
Users can also use Docker, which is a powerful tool for `deploying software <https://hub.docker.com/_/mongo/>`_ .
9693

97-
User can also user docker, which is a powerful tool for `deploy software <https://hub.docker.com/_/mongo/>`_
94+
After installing MongoDB, a MongoDB management tool with a graphical user interface will be extremely valuable.
9895

99-
100-
after install mongodb, a mongod db management tool with graphic user interface will be extremely valuale.
101-
102-
users can install the Studio3T( mongochef), which is free for none commerical user interface.
96+
Users can install Studio 3T (MongoChef), which is free for non-commercial use.
10397
`studio3t <https://studio3t.com/>`_
10498

10599

@@ -125,9 +119,10 @@ To use TensorDB mongodb implmentaiton, you need pymongo client.
125119

126120
You can install it with:
127121

128-
``pip install pymongo``
122+
.. code-block:: bash
129123
130-
``pip install lz4``
124+
pip install pymongo
125+
pip install lz4
131126
132127
133128
It is very straightforward to connect to the TensorDB system.
@@ -139,12 +134,12 @@ you can try the following code
139134
db = TensorDB(ip='127.0.0.1', port=27017, db_name='your_db', user_name=None, password=None, studyID='ministMLP')
140135
141136
142-
the ip is the ip address of the database, and port number is number of mongodb.
143-
you may need to specificy the database name and studyid.
144-
the study id is an unique identifier for an experiement.
137+
The ``ip`` is the IP address of the database, and ``port`` is the port number of MongoDB.
138+
You may need to specify the database name and study ID.
139+
The study ID is a unique identifier for an experiment.
145140

146141
TensorDB stores different studies in one data warehouse.
147-
This has pros and cons, the benefits is that suppose the each study we try a different model architecutre
142+
This has pros and cons. One benefit is that if each study tries a different model architecture,
148143
it is very easy for us to evaluate the different model architectures.
149144

150145

@@ -154,7 +149,7 @@ log and parameters
154149
The basic application is to use TensorDB to save the model parameters and training/evaluation/testing logs.
155150
With TensorDB, this can be easily done by replacing the print function with the db.log function.
156151

157-
for save the trainning log, we have
152+
To save the training log, we have
158153
``db.train_log``
159154

160155
and
@@ -163,7 +158,7 @@ and
163158

164159
methods
165160

166-
suppose we save the log each step and save the parameters each epoch, we can have the code like this
161+
Suppose we save the log at each step and the parameters at each epoch; the code can look like this:
167162

168163
.. code-block:: python
169164
@@ -189,31 +184,29 @@ for example, in many our our cases, we just simpliy specify the python code.
189184
'''
190185
db.save_model_architecutre(code,{'name':'print'})
191186
192-
c,fid=db.find_model_architecutre({'name':'print'})
187+
c,fid = db.find_model_architecutre({'name':'print'})
193188
exec(c)
194189
195190
db.push_job(code,{'type':'train'})
196191
197-
##worker
198-
code=db.pop_job()
192+
## worker
193+
code = db.pop_job()
199194
exec(code)
200195
201196
202197
Database Interface
203198
------------------
204199
205200
The training set is managed by a separate database.
206-
each application has its own database,
207-
however, all the database interface should support two interface,
201+
Each application has its own database.
202+
However, every database interface should support two methods:
208203
1. find_data,
209204
2. data_generator
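The source names ``find_data`` and ``data_generator`` but does not give their signatures, so the following is only a sketch of what such a pair could look like. The class ``InMemoryDataset``, the argument names ``query``, ``batch_size`` and ``seed``, and the record layout with ``x``, ``y`` and ``split`` fields are all assumptions made for illustration.

```python
import random


class InMemoryDataset:
    """Hypothetical dataset backend exposing find_data and data_generator."""

    def __init__(self, records):
        self.records = list(records)

    def find_data(self, query):
        # Return the ids of records whose fields match the query.
        return [i for i, r in enumerate(self.records)
                if all(r.get(k) == v for k, v in query.items())]

    def data_generator(self, query, batch_size, seed=0):
        # Yield shuffled batches of (inputs, labels) for the matching records.
        ids = self.find_data(query)
        random.Random(seed).shuffle(ids)
        for start in range(0, len(ids), batch_size):
            batch = [self.records[i] for i in ids[start:start + batch_size]]
            yield [r["x"] for r in batch], [r["y"] for r in batch]


dataset = InMemoryDataset(
    {"x": [float(i)], "y": i % 2, "split": "train" if i < 8 else "test"}
    for i in range(10)
)
batches = list(dataset.data_generator({"split": "train"}, batch_size=3))
```

Keeping selection (``find_data``) separate from iteration (``data_generator``) lets the training loop consume batches without knowing how the records were chosen.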
210205
211206
An example for the MNIST dataset is included in the TensorLabDemo code.
212207
213208
214209
215-
216-
217210
Data Importing
218211
^^^^^^^^^^^^^^^
219212
With a database, the development workflow is very flexible.
