Update db.rst

zsdonghao · web-flow · commit cc98f12e7125 · 2017-06-02T12:06:01.000+01:00
diff --git a/docs/modules/db.rst b/docs/modules/db.rst
@@ -4,7 +4,6 @@ API - Database
 This is the alpha version of the database management system.
 If you have trouble, you can ask for help at `fangde.liu@imperial.ac.uk <fangde.liu@imperial.ac.uk>`_ .
 
-
 .. note::
    We are still writing the documentation; please be patient.
 
@@ -13,39 +12,37 @@ Why TensorDB
 ----------------
 
 TensorLayer is designed for production and aims to support large-scale machine learning applications.
-TensorDB introduce the database infracture to address the many challenges in large scale machine learning project, such as 
+TensorDB introduces database infrastructure to address the many challenges of large-scale machine learning projects, such as:
 
-1. how to mangage the training data and load the training datasets
+1. How to manage the training data and load the training datasets.
 2. How to handle a dataset so large that it exceeds the storage capacity of one computer.
-3. how shoud we managment different models and version, and comparing different models.
+3. How to manage different models and versions, and compare them.
 4. How to automate the whole process of training, evaluating and deploying machine learning models.
 
-In the TensorLayer system, we introduce the database technology to the issues above.
+In the TensorLayer system, we introduce database technology to address the issues above.
 
 TensorDB is designed by following three principles.
 
 Everything is Data
 ^^^^^^^^^^^^^^^^^^
 
-TensorDB is a data  warehouse that stores that capture the whole machine learning development process. the data inside tensordb can be catagloried as
-
-1. Data and Labels. Which includes all the data for training, validation and prediction. The labels can be manually labelled or generated by machine
-2. Model Architecture.This group store the different model architecture, which user can select to use
-3. Model Parameters.This tables stores all the model parameters of echo in the training step.
-4.  Jobs.all the computation  is cutted into several jobs. Each jobs constains some computing work load. for training , the jobs includes training data , the model parameter, the model architecture, how many epochs the training want to do. Similarity are the validation jobs and inference jobs. 
-5. Logs. The logs store all the step time and accuracy and other metric of each training steps and also the time stamps.
+TensorDB is a data warehouse that stores and captures the whole machine learning development process. The data inside TensorDB can be categorized as:
 
-TensorDB is in principal is a key-word based search engine.  each model, parameters, or training data are assigned many tags.
-The data are stored in two layers. On the top, there is the index layer, which instore the blob storage reference with all the tags assigned to the data. which is implemented based on NoSQL document database such as mongodb. The second layer is used store big chunk of data, such as videos, medical images or image mask, which is usually implemented as file system.   Our open source implementation is implemented based MongoDB. The blob data is in store in the gridfs while the tag index is  stored in the documents.
+1. Data and Labels: All the data for training, validation and prediction. The labels can be manually annotated or generated by machine.
+2. Model Architecture: This group stores the different model architectures, from which users can select.
+3. Model Parameters: This table stores all the model parameters of each epoch during training.
+4. Jobs: All the computation is cut into several jobs. Each job contains some computing workload. For training, a job includes the training data, the model parameters, the model architecture and the number of epochs to train. Validation jobs and inference jobs are similar.
+5. Logs: The logs store the step time, accuracy and other metrics of each training step, along with the timestamps.
 
+TensorDB is in principle a keyword-based search engine. Each model, parameter set or training dataset is assigned many tags.
+The data are stored in two layers. On top is the index layer, which stores the blob storage references together with all the tags assigned to the data; it is implemented on a NoSQL document database such as MongoDB. The second layer stores large chunks of data, such as videos, medical images or image masks, and is usually implemented as a file system. Our open source implementation is based on MongoDB: the blob data is stored in GridFS, while the tag index is stored in documents.
 
 
 Everying is identified by Query
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
-
-Within TensorDB framework, any entity within the data warehouse, such as  the data, model or jobs are specified by the database query language.  
-The first advantage is the query is more efficient  in space and can specify multiple objects in a concise way.
+Within the TensorDB framework, any entity in the data warehouse, such as data, models or jobs, is specified by a database query.
+The first advantage is that a query is more space-efficient and can specify multiple objects concisely.
 Another advantage of such a design is that it enables a highly flexible software system.
 Data, model architectures and training are interchangeable.
 Many tasks can be implemented by simply rewiring different components.
@@ -61,10 +58,10 @@ Also the training system have no clue of epochs, instead, it knows batchize and
 Many techniques are introduced behind the streaming interface.
 The stream is implemented based on the database cursor technology, so for every search, only the cursors are returned, not the actual data.
 The actual data is loaded only when the generator is evaluated.
-The data loading is further optimise 
+The data loading is further optimised:
 
-1. data are compressed and decompressed, 
-2. the dataloaded in bulk model to optimise the IO traffic 
+1. Data are compressed and decompressed.
+2. The data are loaded in bulk mode to optimise the IO traffic.
 3. Augmentation and random sampling are computed on the fly after the data are loaded onto the local computer.
 4. To optimise space, there will also be a cache system that stores only the recent blob data.
 
@@ -82,8 +79,7 @@ The exisitng implementation is based on Mongodb.
 Further implementations on other databases will be released depending on progress.
 It will be straightforward to port the TensorDB system to Google Cloud, AWS and Azure.
 
-
-The following tutorial is based on the MongoDb implmenetation.
+The following tutorials are based on the MongoDB implementation.
 
 
 Install MongoDB
@@ -93,13 +89,11 @@ The installation instruction of Mongodb can be found at
 `MongoDB Docs <https://docs.mongodb.com/manual/installation/>`_ .
 There are also managed MongoDB services from Amazon or GCP, as well as MongoDB Atlas from MongoDB.
 
+Users can also use Docker, which is a powerful tool for `deploying software <https://hub.docker.com/_/mongo/>`_ .
 
-User can also user docker, which is a powerful tool for `deploy software <https://hub.docker.com/_/mongo/>`_
+After installing MongoDB, a MongoDB management tool with a graphical user interface will be extremely valuable.
 
-
-after install mongodb, a mongod db management tool with graphic user interface will be extremely valuale.
-
-users can install the Studio3T( mongochef), which is free for none commerical user interface.
+Users can install Studio 3T (MongoChef), which is free for non-commercial use.
 `studio3t <https://studio3t.com/>`_
 
 
@@ -125,9 +119,10 @@ To use TensorDB mongodb implmentaiton,  you need pymongo client.
 
 You can install it with:
 
-``pip install pymongo``
+.. code-block:: bash
 
-``pip install lz4``
+  pip install pymongo
+  pip install lz4
 
 
 It is very straightforward to connect to the TensorDB system.
@@ -139,12 +134,12 @@ you can try the following code
   db = TensorDB(ip='127.0.0.1', port=27017, db_name='your_db', user_name=None, password=None, studyID='ministMLP')
   
 
-the ip is the ip address of the database, and port number is number of mongodb.
-you may need to specificy the database name and studyid.
-the study id is an unique identifier for an experiement.
+The ``ip`` is the IP address of the database, and ``port`` is the port number of MongoDB.
+You may need to specify the database name and study ID.
+The study ID is a unique identifier for an experiment.
 
 TensorDB stores different studies in one data warehouse.
-This has pros and cons, the benefits is that suppose the each study we try a different model architecutre
+This has pros and cons. The benefit is that if each study tries a different model architecture,
 it is very easy for us to compare the different model architectures.
 
 
@@ -154,7 +149,7 @@ log and parameters
 The basic application is to use TensorDB to save the model parameters and training/evaluation/testing logs.
 With TensorDB, this can be done easily by replacing the print function with the db.log function.
 
-for save the trainning log, we have
+To save the training log, we have
 ``db.train_log``
 
 and 
@@ -163,7 +158,7 @@ and
 
 methods
 
-suppose we save the log each step and save the parameters each epoch, we can have the code like this
+Suppose we save the log at each step and the parameters at each epoch; the code could look like this:
 
 .. code-block:: python
 
@@ -189,31 +184,29 @@ for example, in many our our cases, we just simpliy specify the python code.
    '''
    db.save_model_architecutre(code,{'name':'print'})
    
-   c,fid=db.find_model_architecutre({'name':'print'})
+   c,fid = db.find_model_architecutre({'name':'print'})
    exec(c)
    
    db.push_job(code,{'type':'train'})
    
-   ##worker
-   code=db.pop_job()
+   ## worker
+   code = db.pop_job()
    exec(code)
    
    
 Database Interface
 ------------------
 
 The training set is managed by a separate database.
-each application has its own database,
-however, all the database interface should support two interface,
+Each application has its own database.
+However, every database interface should support two interfaces:
 1. find_data,
 2. data_generator
 
 An example for the MNIST dataset is included in the TensorLabDemo code.
 
 
 
-
-
 Data Importing
 ^^^^^^^^^^^^^^^
 With a database, the development workflow is very flexible.