API - Database
=========================
This is the alpha version of the database management system.
If you have any trouble, please ask for help at `tensorlayer@gmail.com <mailto:tensorlayer@gmail.com>`_ .

Why Database
----------------

TensorLayer is designed for real-world production and is capable of supporting large-scale machine learning applications.
The TensorLayer database is introduced to address the many data management challenges in large-scale machine learning projects, such as:

1. Finding training data in an enterprise data warehouse.
2. Loading large datasets that exceed the storage capacity of a single computer.
3. Managing different models with version control, and comparing them (e.g. by accuracy).
4. Automating the process of training, evaluating and deploying machine learning models.

With the TensorLayer system, we introduce this database technology to address the challenges above.
The database management system is designed with the following three principles in mind.

Everything is Data
^^^^^^^^^^^^^^^^^^

Data warehouses can store and capture the entire machine learning development process. The data can be categorized as follows:

1. Dataset: all the data used for training, validation and prediction. The labels can be specified manually or generated by model prediction.
2. Model architecture: a table that stores the different model architectures, enabling users to reuse previous model development work.
3. Model parameters: the model parameters saved at each epoch of the training step.
4. Tasks: a project usually includes many small tasks. Each task contains the necessary information, such as the hyper-parameters for training or validation. For a training task, typical information includes the training data, the model parameters, the model architecture and the number of epochs. Validation, testing and inference are also supported by the task system.
5. Loggings: the logs store all the metrics of each machine learning model, such as the timestamp, and the loss and accuracy of each batch or epoch.

The TensorLayer database is in principle a keyword-based search engine: each model, parameter set, or piece of training data is assigned many tags.
The storage system organizes data into two layers: the index layer and the blob layer.
The index layer stores all the tags and references to the blob storage, and is implemented on a NoSQL document database such as MongoDB.
The blob layer stores videos, medical images or label masks in large chunks, and is usually implemented on top of a file system.
Our implementation is based on MongoDB: the blobs are stored in GridFS, while the indexes are stored as documents.

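As a minimal illustration of this two-layer design (using ``pymongo`` and ``gridfs`` directly rather than the TensorLayer API; the collection and field names below are only assumptions for demonstration), a blob is written to GridFS and a small tag document referencing it is written to the index layer:

.. code-block:: python

    import pickle
    import numpy as np
    import pymongo
    import gridfs

    client = pymongo.MongoClient('localhost', 27017)
    index = client['temp']              # index layer: a NoSQL document database
    fs = gridfs.GridFS(index)           # blob layer: GridFS on top of MongoDB

    # store a large array in the blob layer and keep the returned reference (ObjectId)
    array = np.random.rand(1000, 784).astype('float32')
    blob_id = fs.put(pickle.dumps(array, protocol=2))

    # store the searchable tags in the index layer, pointing to the blob
    index.datasets.insert_one({
        'dataset_name': 'mnist',
        'version': '1.0',
        'description': 'this is a tutorial',
        'blob_id': blob_id,             # reference into the blob layer
    })

    # later: search by tags first, and fetch the blob only when it is needed
    doc = index.datasets.find_one({'dataset_name': 'mnist'})
    restored = pickle.loads(fs.get(doc['blob_id']).read())
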
Everything is identified by Query
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Within this framework, any entity in the data warehouse, such as data, a model or a task, is specified by a database query.
Compared with explicit references, a query is more space efficient to store and can specify multiple objects in a concise way.
Another advantage of this design is that it enables a highly flexible software system:
many systems can be implemented simply by rewriting individual components, and many new applications can be implemented just by updating the query, without modifying any application code.

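For example, with the MongoDB backend, "all models of this project with a test accuracy above 0.9" can be written as a single query document instead of a list of file paths (the collection and field names below are assumptions for illustration):

.. code-block:: python

    import pymongo

    client = pymongo.MongoClient('localhost', 27017)
    index = client['temp']

    # one query document identifies a whole set of objects
    query = {'project_name': 'tutorial', 'test_accuracy': {'$gt': 0.9}}
    best_first = index.models.find(query).sort('test_accuracy', pymongo.DESCENDING)
    for doc in best_first:
        print(doc['test_accuracy'], doc.get('description'))
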
A pulling based Stream processing pipeline
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

For large training datasets, we provide a stream interface, which can in theory support an unlimited amount of data.
The stream interface, implemented as a Python generator, keeps generating new data during training.
When using the stream interface, the idea of an epoch no longer applies; instead, we specify the batch size and imagine that an epoch has a fixed, large number of steps.

Many techniques are used behind the stream interface for performance optimization.
The stream interface is based on database cursor technology:
for every data query, only the cursors are returned immediately, not the actual query results.
The actual data are loaded later, when the generators are evaluated.
The data loading is further optimized in several ways:

1. Data are compressed and decompressed.
2. Data are loaded in bulk mode to further optimize the IO traffic.
3. Data augmentation and random sampling are computed on the fly, only after the data are loaded into the local computer memory.
4. A simple cache system stores the most recent blob data.

Based on the stream interface, a continuous machine learning system can easily be implemented.
On a distributed system, model training, validation and deployment can be run by different computing nodes, all running continuously.
The trainer keeps optimizing the models with newly added data, the evaluation node keeps evaluating the recently generated models, and the deployment system keeps pulling the best models from the database warehouse for applications.

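Below is a minimal sketch of such a pulling-based stream, built directly on a ``pymongo`` cursor rather than on the TensorLayer API (the collection and field names are assumptions for illustration): the query returns a cursor immediately, and the blobs are fetched and decoded only while the generator is being consumed.

.. code-block:: python

    import pickle
    import pymongo
    import gridfs

    client = pymongo.MongoClient('localhost', 27017)
    index = client['temp']
    fs = gridfs.GridFS(index)

    def data_stream(query, batch_size=32):
        """Yield batches forever; cursors are cheap, blobs are loaded lazily."""
        while True:                               # no notion of epoch: keep generating
            cursor = index.samples.find(query)    # returns immediately, nothing loaded yet
            batch = []
            for doc in cursor:
                # the blob is fetched and decoded only now, on the local machine
                sample = pickle.loads(fs.get(doc['blob_id']).read())
                batch.append(sample)
                if len(batch) == batch_size:
                    yield batch
                    batch = []

    # a training loop consumes a fixed number of steps instead of epochs
    stream = data_stream({'dataset_name': 'mnist', 'split': 'train'}, batch_size=32)
    for step in range(100):
        batch = next(stream)
        # train_one_step(batch)  # placeholder for the actual training update
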
Preparation
--------------

In principle, the database can be implemented on top of any document-oriented NoSQL database system.
The existing implementation is based on MongoDB; implementations on other databases will be released depending on progress.
It will be straightforward to port our database system to Google Cloud, AWS and Azure.
The following tutorials are based on the MongoDB implementation.

Installing and running MongoDB
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The installation instructions for MongoDB can be found in the
`MongoDB Docs <https://docs.MongoDB.com/manual/installation/>`__.
There are also managed MongoDB services on Amazon and GCP, such as MongoDB Atlas from MongoDB.
Users can also use Docker, which is a powerful tool for `deploying software <https://hub.docker.com/_/mongo/>`_ .
After installing MongoDB, a MongoDB management tool with a graphical user interface is extremely useful.
For example, `Studio 3T <https://studio3t.com/>`_ (formerly MongoChef) is a powerful user interface tool for MongoDB and is free for non-commercial use.

Tutorials
----------

Connect to the database
^^^^^^^^^^^^^^^^^^^^^^^^^

As with MongoDB management tools, an IP address and a port number are required to connect to the database.
To distinguish between different projects, the database instances take a ``project_name`` argument.
In the following example, we connect to MongoDB on the local machine, with the IP ``localhost`` and port ``27017`` (the default MongoDB port).

.. code-block:: python

    db = tl.db.TensorHub(ip='localhost', port=27017, dbname='temp',
                         username=None, password='password', project_name='tutorial')

Dataset management
^^^^^^^^^^^^^^^^^^^^

You can save a dataset into the database and allow all machines to access it.
Apart from the dataset key, you can also insert custom arguments such as a version and a description, for better management of the datasets.
Note that all saving functions automatically store a timestamp, allowing you to load items (data, models, tasks) by timestamp.

.. code-block:: python

    db.save_dataset(dataset=[X_train, y_train, X_test, y_test], dataset_name='mnist', description='this is a tutorial')

After saving the dataset, others can access it as follows:

.. code-block:: python

    dataset = db.find_dataset('mnist')
    dataset = db.find_dataset('mnist', version='1.0')

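Since the dataset was saved as the list ``[X_train, y_train, X_test, y_test]``, the returned object can presumably be unpacked in the same order (a sketch under that assumption):

.. code-block:: python

    X_train, y_train, X_test, y_test = db.find_dataset('mnist')
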
If you have multiple datasets that use the same dataset key, you can get all of them as follows:

.. code-block:: python

    datasets = db.find_all_datasets('mnist')

Model management
^^^^^^^^^^^^^^^^^^^^^^^^^

Save the model architecture and parameters into the database.
The model architecture is represented by a TL graph, and the parameters are stored as a list of arrays.

.. code-block:: python

    db.save_model(net, accuracy=0.8, loss=2.3, name='second_model')

After saving the model into the database, we can load it as follows:

.. code-block:: python

    net = db.find_model(sess=sess, accuracy=0.8, loss=2.3)

If there are many models, you can use MongoDB's ``sort`` method to find the model you want.
To get the newest or oldest model, sort by time:

.. code-block:: python

    import pymongo

    ## newest model
    net = db.find_model(sess=sess, sort=[("time", pymongo.DESCENDING)])
    net = db.find_model(sess=sess, sort=[("time", -1)])

    ## oldest model
    net = db.find_model(sess=sess, sort=[("time", pymongo.ASCENDING)])
    net = db.find_model(sess=sess, sort=[("time", 1)])

If you saved the model along with its accuracy, you can get the model with the best accuracy as follows:

.. code-block:: python

    net = db.find_model(sess=sess, sort=[("test_accuracy", -1)])

To delete all models in a project:

.. code-block:: python

    db.delete_model()

If you only want to delete specific models, pass the matching arguments instead.

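For example, assuming ``delete_model`` accepts the same keyword filters that were used when the model was saved, deleting only the model named above might look like this (the keyword is an assumption based on the saving example):

.. code-block:: python

    db.delete_model(name='second_model')
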
Event / Logging management
^^^^^^^^^^^^^^^^^^^^^^^^^^^

Save a training log:

.. code-block:: python

    db.save_training_log(accuracy=0.33)
    db.save_training_log(accuracy=0.44)

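Validation and testing logs can presumably be saved in the same way; the function names below are assumptions that mirror the delete functions shown at the end of this section:

.. code-block:: python

    db.save_validation_log(accuracy=0.55)
    db.save_testing_log(accuracy=0.66)
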
Delete the logs that match a given requirement:

.. code-block:: python

    db.delete_training_log(accuracy=0.33)

Delete all logs of this project:

.. code-block:: python

    db.delete_training_log()
    db.delete_validation_log()
    db.delete_testing_log()

Task distribution
^^^^^^^^^^^^^^^^^^^^^^^^^^^

A project usually consists of many tasks, such as hyper-parameter selection.
To make this easier, we can distribute these tasks to several GPU servers.
A task consists of a task script, hyper-parameters, desired results and a status.
A task distributor can push both datasets and tasks into the database, allowing task runners on GPU servers to pull and run them.
The following example pushes three tasks with different hyper-parameters.

.. code-block:: python

    import time

    ## save the dataset into the database, so that other servers can use it
    X_train, y_train, X_val, y_val, X_test, y_test = tl.files.load_mnist_dataset(shape=(-1, 784))
    db.save_dataset((X_train, y_train, X_val, y_val, X_test, y_test), 'mnist', description='handwriting digit')

    ## push tasks into the database, so that other servers can pull and run them
    db.create_task(
        task_name='mnist', script='task_script.py', hyper_parameters=dict(n_units1=800, n_units2=800),
        saved_result_keys=['test_accuracy'], description='800-800'
    )
    db.create_task(
        task_name='mnist', script='task_script.py', hyper_parameters=dict(n_units1=600, n_units2=600),
        saved_result_keys=['test_accuracy'], description='600-600'
    )
    db.create_task(
        task_name='mnist', script='task_script.py', hyper_parameters=dict(n_units1=400, n_units2=400),
        saved_result_keys=['test_accuracy'], description='400-400'
    )

    ## wait for the tasks to finish
    while db.check_unfinished_task(task_name='mnist'):
        print("waiting for runners to finish the tasks")
        time.sleep(1)

    ## at the end, you can get the models and results from the database and do some analysis

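Once ``check_unfinished_task`` reports that everything has finished, the distributor can, for example, pull the model with the best test accuracy back from the database for analysis, reusing the sorted query shown in the model management section:

.. code-block:: python

    import pymongo

    ## fetch the model with the highest test accuracy among all finished tasks
    best_net = db.find_model(sess=sess, sort=[("test_accuracy", pymongo.DESCENDING)])
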
The task runners on the GPU servers monitor the database and run tasks as soon as they become available.
Inside the task script, we can save the final model and results to the database, which allows the task distributor to retrieve the desired models and results.

.. code-block:: python

    import time

    ## monitor the database and pull tasks to run
    while True:
        print("waiting for tasks from the distributor")
        db.run_task(task_name='mnist', sort=[("time", -1)])
        time.sleep(1)

Example codes
^^^^^^^^^^^^^^^^

See `here <https://github.com/tensorlayer/tensorlayer/tree/master/example/database>`__.

TensorHub API
---------------------

.. automodule:: tensorlayer.db

.. autoclass:: TensorHub
    :members: