docs/src/sysadmin/dba.md
Lines changed: 289 additions & 3 deletions
@@ -1,4 +1,167 @@
1
-
# User Management
1
+
# Database Administration
2
+
3
+
## Hosting
4
+
5
+
Let’s say a person, a lab, or a multi-lab consortium decides to use DataJoint as their
6
+
data pipeline platform.
7
+
What IT resources and support will be required?
8
+
9
+
DataJoint uses a MySQL-compatible database server such as MySQL, MariaDB, Percona
10
+
Server, or Amazon Aurora to store the structured data used for all relational
11
+
operations.
12
+
Large blocks of data associated with these records such as multidimensional numeric
13
+
arrays (signals, images, scans, movies, etc.) can be stored within the database or
14
+
stored in additionally configured [bulk storage](../client/stores.md).
15
+
16
+
The first decisions you need to make are where this server will be hosted and how it
17
+
will be administered.
18
+
The server may be hosted on your personal computer, on a dedicated machine in your lab,
19
+
or in a cloud-based database service.
20
+
21
+
### Cloud hosting
22
+
23
+
Increasingly, many teams make use of cloud-hosted database services, which allow great
24
+
flexibility and easy administration of the database server.
25
+
A cloud hosting option will be provided through https://works.datajoint.com.
26
+
DataJoint Works simplifies the setup for labs that wish to host their data pipelines in
27
+
the cloud and allows sharing pipelines between multiple groups and locations.
28
+
Because DataJoint is an open-source solution, other cloud services such as Amazon RDS can also be used
29
+
in this role, albeit with less DataJoint-centric customization.
30
+
31
+
### Self hosting
32
+
33
+
In the most basic configuration, the relational database software and DataJoint are
34
+
installed onto a single computer which is used by an individual user.
35
+
To support a small group of users, a larger computer can be used instead and configured
36
+
for remote access.
37
+
As the number of users grows, individual workstations can be installed with the
38
+
DataJoint software and used to connect to a larger and more specialized centrally
39
+
located database server machine.
40
+
41
+
For even larger groups or multi-site collaborations, multiple database servers may be
42
+
configured in a replicated fashion to support larger workloads and simultaneous
43
+
multi-site access.
44
+
This section and the subsequent sections of the documentation provide some basic
45
+
guidelines for these configurations.
46
+
47
+
### General server / hardware support requirements
48
+
49
+
The following table lists some likely scenarios for DataJoint database server
50
+
deployments and some reasonable estimates of the required computer hardware.
51
+
The IT/systems support needed to ensure smooth operations in the absence of
52
+
local database expertise is also listed.
53
+
54
+
#### IT infrastructures
55
+
56
+
| Usage Scenario | DataJoint Database Computer | Required IT Support |
57
+
| -- | -- | -- |
58
+
| Single User | Personal Laptop or Workstation | Self-Supported or Ad-Hoc General IT Support |
59
+
| Small Group (e.g. 2-10 Users) | Workstation or Small Server | Ad-Hoc General or Experienced IT Support |
60
+
| Medium Group (e.g. 10-30 Users) | Small to Medium Server | Ad-Hoc/Part Time Experienced or Specialized IT Support |
61
+
| Large Group/Department (e.g. 30-50+ Users) | Medium/Large Server or Multi-Server Replication | Part Time/Dedicated Experienced or Specialized IT Support |
62
+
| Multi-Location Collaboration (30+ users, Geographically Distributed) | Large Server, Advanced Replication | Dedicated Specialized IT Support |
63
+
64
+
## Configuration
65
+
66
+
### Hardware considerations
67
+
68
+
As in any computer system, CPU, RAM, disk storage, and network speed are
69
+
important components of performance.
70
+
The relational database component of DataJoint is no exception to this rule.
71
+
This section discusses the various factors relating to selecting a server for your
72
+
DataJoint pipelines.
73
+
74
+
#### CPU
75
+
76
+
CPU speed and parallelism (number of cores/threads) will impact the speed of queries
77
+
and the number of simultaneous queries which can be efficiently supported by the system.
78
+
It is a good rule of thumb to have enough cores to support the number of active users
79
+
and background tasks you expect to have running during a typical 'busy' day of usage.
80
+
For example, a team of 10 people might want to have 8 cores to support a few active
81
+
queries and background tasks.
82
+
83
+
#### RAM
84
+
85
+
The amount of RAM will impact the amount of DataJoint data kept in memory, allowing for
86
+
faster querying of data since the data can be searched and returned to the user without
87
+
needing to access the slower disk drives.
88
+
It is a good idea to get enough memory to fully store the more important and frequently
89
+
accessed portions of your dataset with room to spare, especially if in-database blob
90
+
storage is used instead of external [bulk storage](bulk-storage.md).
91
+
92
+
#### Disk
93
+
94
+
The disk storage for a DataJoint database server should have fast random access,
95
+
ideally with flash-based storage to eliminate the rotational delay of mechanical hard
96
+
drives.
97
+
98
+
#### Networking
99
+
100
+
When network connections are used, network speed and latency are important to ensure
101
+
that large query results can be quickly transferred across the network and that delays
102
+
due to data entry/query round-trip have minimal impact on the runtime of the program.
103
+
104
+
#### General recommendations
105
+
106
+
DataJoint datasets can consist of many thousands or even millions of records.
107
+
Generally speaking, one would want to make sure that the relational database system has
108
+
sufficient CPU speed and parallelism to support a typical number of concurrent users
109
+
and to execute searches quickly.
110
+
The system should have enough RAM to store the primary key values of commonly used
111
+
tables and operating system caches.
112
+
Disk storage should be fast enough to support quick loading of and searching through
113
+
the data.
114
+
Lastly, network bandwidth must be sufficient to support transferring user records
115
+
quickly.
116
+
117
+
### Large-scale installations
118
+
119
+
Database replication may be beneficial if system downtime or precise database
120
+
responsiveness is a concern.
121
+
Replication can allow for easier coordination of maintenance activities, faster
122
+
recovery in the event of system problems, and distribution of the database workload
123
+
across server machines to increase throughput and responsiveness.
124
+
125
+
#### Multi-master replication
126
+
127
+
Multi-master replication configurations allow for all replicas to be used in a
128
+
read/write fashion, with the workload being distributed among all machines.
129
+
However, multi-master replication is also more complicated, requiring front-end
130
+
machines to distribute the workload, similar performance characteristics on all
131
+
replicas to prevent bottlenecks, and redundant network connections to ensure the
132
+
replicated machines are always in sync.
133
+
134
+
### Recommendations
135
+
136
+
It is usually best to go with the simplest solution which can suit the requirements of
137
+
the installation, adjusting workloads where possible and adding complexity only as
138
+
needs dictate.
139
+
140
+
Resource requirements of course depend on the data collection and processing needs of
141
+
the given pipeline, but there are general size guidelines that can inform any system
142
+
configuration decisions.
143
+
A reasonably powerful workstation or small server should support the needs of a small
144
+
group (2-10 users).
145
+
A medium or large server should support the needs of a larger user community (10-30
146
+
users).
147
+
A replicated or distributed setup of 2 or more medium or large servers may be required
148
+
in larger cases.
149
+
These requirements can be reduced through the use of external or cloud storage, which
| Single User | Personal Laptop or Workstation | 4 Cores, 8-16GB or more of RAM, SSD or better storage |
155
+
| Small Group (e.g. 2-10 Users) | Workstation or Small Server | 8 or more Cores, 16GB or more of RAM, SSD or better storage |
156
+
| Medium Group (e.g. 10-30 Users) | Small to Medium Server | 8-16 or more Cores, 32GB or more of RAM, SSD/RAID or better storage |
157
+
| Large Group/Department (e.g. 30-50+ Users) | Medium/Large Server or Multi-Server Replication | 16-32 or more Cores, 64GB or more of RAM, SSD RAID storage, multiple machines |
158
+
| Multi-Location Collaboration (30+ users, Geographically Distributed) | Large Server, Advanced Replication | 16-32 or more Cores, 64GB or more of RAM, SSD RAID storage, multiple machines, potentially in multiple locations |
159
+
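The sizing guidelines in the tables can be read as a simple lookup from team size to a starting configuration. The following is a minimal sketch of that lookup, not part of DataJoint: the names `Tier` and `recommend_tier` are hypothetical, and because the table ranges overlap at 10 and 30 users, the sketch assumes those boundary values fall into the smaller tier. It also assumes the multi-location tier applies only to geographically distributed groups of 30 or more users.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    """One row of the hardware guideline table (hypothetical helper type)."""
    scenario: str
    computer: str
    hardware: str


SINGLE = Tier("Single User", "Personal Laptop or Workstation",
              "4 Cores, 8-16GB or more of RAM, SSD or better storage")
SMALL = Tier("Small Group", "Workstation or Small Server",
             "8 or more Cores, 16GB or more of RAM, SSD or better storage")
MEDIUM = Tier("Medium Group", "Small to Medium Server",
              "8-16 or more Cores, 32GB or more of RAM, SSD/RAID or better storage")
LARGE = Tier("Large Group/Department",
             "Medium/Large Server or Multi-Server Replication",
             "16-32 or more Cores, 64GB or more of RAM, SSD RAID storage, "
             "multiple machines")
MULTI_SITE = Tier("Multi-Location Collaboration",
                  "Large Server, Advanced Replication",
                  "16-32 or more Cores, 64GB or more of RAM, SSD RAID storage, "
                  "multiple machines, potentially in multiple locations")


def recommend_tier(n_users: int, multi_site: bool = False) -> Tier:
    """Return the guideline row for a team of n_users.

    Assumption: boundary values (10 and 30 users) map to the smaller tier.
    """
    if n_users < 1:
        raise ValueError("n_users must be at least 1")
    if multi_site and n_users >= 30:
        return MULTI_SITE
    if n_users == 1:
        return SINGLE
    if n_users <= 10:
        return SMALL
    if n_users <= 30:
        return MEDIUM
    return LARGE


if __name__ == "__main__":
    tier = recommend_tier(12)
    print(f"{tier.scenario}: {tier.computer} ({tier.hardware})")
```

These rows are starting points only; the actual requirements depend on the data collection and processing needs of the pipeline.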
160
+
### Docker
161
+
162
+
A Docker image is available for a MySQL server configured to work with DataJoint: https://github.com/datajoint/mysql-docker.
163
+
164
+
## User Management
2
165
3
166
Create user accounts on the MySQL server. For example, if your
4
167
username is alice, the SQL code for this step is:
@@ -42,7 +205,7 @@ statement.
42
205
SHOW GRANTS FOR 'alice'@'%';
43
206
```
44
207
45
-
## Grouping with Wildcards
208
+
### Grouping with Wildcards
46
209
47
210
Depending on the complexity of your installation, using additional
48
211
wildcards to group access rules together might make managing user
@@ -61,7 +224,7 @@ GRANT SELECT ON `user\_%\_%`.* TO 'bob'@'%';
61
224
62
225
to enable `bob` to query all other users' tables using the
63
226
`user_username_database` convention without needing to explicitly
64
-
give him access to ``alice\_%``, ``charlie\_%``, and so on.
227
+
give him access to `alice\_%`, `charlie\_%`, and so on.
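To check which database names a wildcard grant such as `user\_%\_%` would cover, the pattern can be tested offline. The sketch below approximates MySQL's matching rules for database names in `GRANT` statements: `%` matches any sequence of characters, `_` matches exactly one character, and a backslash makes the next character literal. The function names are hypothetical, the database names are made up for illustration, and the sketch is case-sensitive, which may differ from a given server's configuration.

```python
import re


def grant_pattern_to_regex(pattern: str) -> "re.Pattern[str]":
    """Translate a MySQL-style wildcard pattern into a compiled regex."""
    parts = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\" and i + 1 < len(pattern):
            # Escaped character: treat the next character literally.
            parts.append(re.escape(pattern[i + 1]))
            i += 2
            continue
        if ch == "%":
            parts.append(".*")   # any sequence of characters
        elif ch == "_":
            parts.append(".")    # exactly one character
        else:
            parts.append(re.escape(ch))
        i += 1
    return re.compile("".join(parts), re.DOTALL)


def pattern_matches(pattern: str, database: str) -> bool:
    """Return True if the whole database name matches the pattern."""
    return grant_pattern_to_regex(pattern).fullmatch(database) is not None


if __name__ == "__main__":
    for name in ["user_alice_ephys", "user_charlie_imaging", "group_shared_x"]:
        print(name, pattern_matches(r"user\_%\_%", name))
```

The escaped underscores matter: an unescaped `_` is itself a wildcard, so `user_%` would also match a name such as `userXalice`.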
65
228
66
229
This convention can be further expanded to create notions of groups
67
230
and protected schemas for background processing, etc. For example:
@@ -78,3 +241,126 @@ could allow both bob an alice to read/write into the
78
241
```group\_shared``` databases, but in the case of the
79
242
```group\_wonderland``` databases, read/write access is restricted
80
243
to alice.
244
+
245
+
## Backups and Recovery
246
+
247
+
Backing up your DataJoint installation is critical to ensuring that your work is safe
248
+
and can be continued in the event of system failures, and several mechanisms are
249
+
available to use.
250
+
251
+
Much like your live installation, your backup will consist of two portions:
252
+
253
+
- Backup of the relational data
254
+
- Backup of optional external bulk storage
255
+
256
+
This section primarily deals with backup of the relational data since most of the
257
+
optional bulk storage options use "regular" flat-files for storage and can be backed up
258
+
via any "normal" disk backup regime.
259
+
260
+
There are many options for backing up MySQL; subsequent sections discuss a few of them.
261
+
262
+
### Cloud hosted backups
263
+
264
+
In the case of cloud-hosted options, many cloud vendors provide automated backup of
265
+
your data, and some facility for downloading such backups externally.
266
+
Due to the wide variety of cloud-specific options, discussion of these options falls
267
+
outside of the scope of this documentation.
268
+
However, since the cloud server is also a MySQL server, other options listed here may
269
+
work for your situation.
270
+
271
+
### Disk-based backup
272
+
273
+
The simplest option for many cases is to perform a disk-level backup of your MySQL
274
+
installation using standard disk backup tools.
275
+
All database activity should be stopped for the duration of the
276
+
backup to prevent errors with the backed up data.
277
+
This can be done in one of two ways:
278
+
279
+
- Stopping the MySQL server program
280
+
- Using database locks
281
+
282
+
These methods are required since MySQL data operations can be ongoing in the background
283
+
even when no user activity is ongoing.
284
+
To use a database lock to perform a backup, the following commands can be used as the
285
+
MySQL administrator:
286
+
287
+
```mysql
288
+
FLUSH TABLES WITH READ LOCK;
289
+
UNLOCK TABLES;
290
+
```
291
+
292
+
The backup should be performed between the issuing of these two commands, ensuring the
293
+
database data is consistent on disk when it is backed up.
294
+
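The lock-then-backup sequence can also be scripted. One detail worth keeping in mind is that the global read lock belongs to the session that issued it, so the same connection must stay open for the whole backup and should release the lock even if the backup fails. The following is a minimal sketch under those assumptions: `backup_under_read_lock` and `run_backup` are hypothetical names, and `cursor` stands for any DB-API 2.0 style cursor connected as the MySQL administrator.

```python
from typing import Callable, Protocol


class Cursor(Protocol):
    """Minimal cursor interface assumed by this sketch (DB-API 2.0 style)."""

    def execute(self, statement: str) -> object: ...


def backup_under_read_lock(cursor: Cursor, run_backup: Callable[[], None]) -> None:
    """Hold a global read lock while run_backup() copies the data files.

    run_backup is a caller-supplied function (hypothetical) that performs
    the disk-level backup with whatever tool the site uses.
    """
    # Flush tables to disk and block writes from all sessions.
    cursor.execute("FLUSH TABLES WITH READ LOCK")
    try:
        run_backup()
    finally:
        # Always release the lock, even if the backup raised an error.
        cursor.execute("UNLOCK TABLES")
```

Passing the backup step in as a function keeps the locking logic independent of the backup tool, and the `try`/`finally` ensures a failed backup does not leave the server locked against writes.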
295
+
### MySQLDump
296
+
297
+
Disk-based backups may not be feasible for every installation, or a database may
298
+
require constant activity such that stopping it for backups is not feasible.