---
title: "Big data tools"
subtitle: "Lesson 2 — Parallelized computing on distributed data"
author: Arthur Katossky & Rémi Pépin
institute: "ENSAI (Rennes, France)"
date: "March 2020"
output:
  xaringan::moon_reader:
    css: ["default", "css/presentation.css"]
    nature:
      ratio: 16:10
      scroll: false
---
# Forewords
---
## Course outline
1. Refresher `<= Last week`
2. File systems vs. databases
3. The fundamental problems of distribution
4. Distributing file systems
5. Distributing databases `<= Today`
6. Distributing tasks
7. Parallelizing computation on distributed data
8. Statistical applications
9. Conclusions and perspectives
---
## Quiz
Go on Moodle. You have 10 mins.
---
# 1. Refresher
<!-- vocabulary: cluster, node, etc. -->
---
## Storage improvement and limits
???
Storage has improved a lot, but not as fast as data size.
We generate much more data than we can store on a single hard disk.
---
.height600[]
---
.height600[]
---
## Parallelisation
???
Parallelisation is not an ideal solution: most of the time it does not bring as much improvement as you might expect, since some parts of a program cannot be parallelised, and a lot of time is spent moving data around.
The map-then-reduce scheme is a clever and general way to parallelise algorithms.
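To make the idea concrete, here is a minimal word count in plain Python (no framework): the map step can run on one node per document, the reduce step on one node per key.
```python
from collections import defaultdict

documents = ["big data tools", "big data big clusters"]

# Map: emit one (key, value) pair per word; each document can be
# processed on a different node, independently of the others.
pairs = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle + reduce: group pairs by key, then aggregate each group;
# each key can be reduced on a different node.
counts = defaultdict(int)
for word, n in pairs:
    counts[word] += n

print(dict(counts))  # {'big': 3, 'data': 2, 'tools': 1, 'clusters': 1}
```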
---
## Cloud computing
???
Cloud computing lets you run much of your computing in the cloud.
In particular, you can store data too big to fit on your personal computer.
You can launch a full cluster of machines, and arrange them as you wish, to do whatever tasks you want.
Cloud providers also offer readily available solutions for storage and computation at large scales.
However, these storage solutions hide a lot of the complexity involved in storing data on several computers.
---
# 2. File systems vs. databases
---
## File systems vs. databases
**Goal:** what do we do when data exceeds the physical limits of our storage?
-> We first have to understand how things are stored in the normal case.
???
Last time we did not really talk about either file systems or databases.
This will be the focus of this course.
---
## File systems
- Manages files on storage space (persistent memory, i.e. most often a hard disk)
- Makes the "physical file" (bits) and the "logical file" (fragments) transparent for the user
- Does not care about file formats
- Can handle _heterogeneous_, _unstructured_ data (image, sound, text, application)
- Principal use cases:
  - Read
  - Write
  - Security (access rights)
???
Transparent here means "not visible", "not disturbing".
---
## File systems
### Storage schema
.height400.center[]
---
## File systems
### Traditional file processing
.height400.center[]
---
## File Systems
### Key features
- Is responsible for the correspondence between the physical and the logical file
- Maintains the **index** of where data is physically situated on disk
- Maintains the **namespace**, a virtual hierarchy of files and directories making it possible to identify a single file by a string, called a **path**
- Doesn't understand file contents, only understands metadata (e.g. size, owner, time stamp of last change)
- Responsible for the integrity of files
- Responsible for the redundancy of files (_not needed on your personal computer_)
- Security (access rights)
**Provides a useful abstraction for the clients (= programs).** It is not the program's responsibility to know how files are stored (physical location, redundancy, access rights). But it is the program's responsibility to know how to read the files.
???
To the file system, all files are just bytes: the FS opens a stream between the file and the application that wants to read it. The application knows how to read it.
---
## Databases
- Manages files on storage space (persistent memory, i.e. most often a hard disk)
- Makes the "physical file" (bits) and the "logical file" (fragments) transparent for the user
- Possibly cares about data formats (numbers, texts, dates...)
- Can handle _homogeneous_, _structured_ data
- Principal use cases:
  - Unique entry point to access data
  - Knows how to read/write data
  - Exposes new tools to manipulate data
  - Specific language (SQL for instance)
  - Transactions and concurrency management
  - Models to organize data
- Models to organize data
???
We conflate the database (the storage part) and the DBMS (the management system). This is not a mistake, just a simplification.
---
## Databases
### Why use a database?
- Integrated tool to process data
- "Centralizes" the data
---
## Databases
### Database processing
.height400.center[]
---
## Databases
### Database transactions
- It is the database's responsibility to guarantee the coherence and validity of data
- **Transaction**: a unit of work
  - **A**tomicity: completes entirely or not at all
  - **C**onsistency: changes affect data only in allowed ways
  - **I**solation: must not affect other transactions
  - **D**urability: changes must be written to persistent storage
---
background-image: url(https://media2.giphy.com/media/5C472t1RGNuq4/giphy.gif?cid=790b76114b58b79ee92f2fa959969b9d111633782b5a5a7a&rid=giphy.gif)
background-size: cover
---
# 3. The fundamental problems of distribution
---
## Are the problems fundamentally different for file systems and databases?
The distinction becomes blurry, especially when you start giving up on the _relational_ aspect of databases.
> The difference between a distributed file system and a distributed [database] is that a distributed file system allows files to be accessed using the same interfaces and semantics as local files.
Such semantics include using folder-like hierarchy, attributing permissions to individual files, copying files to different locations, etc.
We will thus start with the _common_ problems.
---
## The fundamental problems of distribution
- **Availability:** you want your data available 24/7
--
- **Latency**: you don't want to wait for hours to get an answer (read and write)
--
- **Throughput:** maybe you want to read ALL your data or write GB at once
--
- **Fault tolerance:** a single node failure shouldn't make your data unavailable
--
- **Coherence:** the same query on the same data returns the same result
--
- **Atomicity*:** completes entirely or not at all
--
- **Durability*:** your data mustn't get corrupted over time
--
- **Inter-node consistency:** at any given time, all the nodes see the same data
--
- **Schema consistency*:** changes affect data only in allowed ways
--
- **Isolation*:** transactions mustn't overlap
--
- **Elasticity/Scalability:** (under the constraint of constant or acceptably-increasing request time)
--
  - can it store more or bigger files?
--
  - can it process more requests? (reads / writes)
--
  - can I add more nodes? (_horizontal scaling_)
--
  - does scaling require a lot of _ad hoc_ work? can the scaling happen automatically? (_elasticity_)
???
Properties marked with an asterisk (*) are ACID properties.
---
## The fundamental problems of distribution
Plus the usual questions of large-scale systems:
- **Confidentiality:** only authorized persons can access your data
- **Data governance:** under which law does your data fall?
- **Environment:** isn't your system too big? Does it use too much energy?
- **Economy:** isn't it too expensive to manage?
- ...
---
## The fundamental problems of distribution
### Some solutions
- **Redundancy / replication:** keep copies of the data in far away nodes, so that you don't lose information under hardware failure (.green[++ availability], .green[++ fault-tolerance], .green[++ durability], .red[-- inter-node consistency], .red[-- schema-consistency], .red[--cost], .red[--environment])
--
- **Balancing/rebalancing:** use all your nodes fairly (.red[-- availability now], .green[++ availability later], .green[++ scalability])
--
- **Timestamp-based concurrency control:** use timestamps to resolve conflicts (first in, first out) (.green[++ isolation], .red[-- availability])
--
- **Bring data closer to the client:** if the data is close to the client, there is less network time (.green[++ availability], .red[- inter-node consistency], .red[- governance]) (*edge, fog, mist computing*)
--
- **Have a master:** it organizes the work to avoid conflicts (.green[+ consistency], .red[- availability], .red[- fault-tolerance])
--
- **Asynchronous processing:** nodes can accept change locally, and consolidate the transactions only in a second phase (.green[++availability], .red[--inter-node consistency])
--
- **First-class actions:** you may choose to privilege reads over writes, or to completely prevent modifying files, for instance (.green[++ availability], .green[++ consistency])
???
In this part we do not distinguish between file systems and databases.
Homogeneous (all nodes run the same software / OS) vs. heterogeneous (different software / OS).
> Confidentiality, availability and integrity are the main keys for a secure system.
> A server belongs to a rack, a room, a data center, a country, and a continent, in order to precisely identify its geographical location
> The need to support append operations and allow file contents to be visible even while a file is being written
> Communication is reliable among working machines: TCP/IP is used with a remote procedure call RPC communication abstraction. TCP allows the client to know almost immediately when there is a problem and a need to make a new connection.
**Source:** https://en.wikipedia.org/wiki/Distributed_file_system_for_cloud
> Distributed file systems may aim for "transparency" in a number of aspects. That is, they aim to be "invisible" to client programs, which "see" a system which is similar to a local file system. Behind the scenes, the distributed file system handles locating files, transporting data, and potentially providing other features listed below.
---
## The CAP theorem

???
Brewer's theorem, published as a conjecture in 1999, proved in 2002
Source : https://en.wikipedia.org/wiki/CAP_theorem
---
## Transparency requirements
- **Access transparency:** clients are unaware that files/data are distributed and can access them in the same way as local files are accessed
.footnote[**Source:** https://en.wikipedia.org/wiki/Clustered_file_system#Distributed_file_systems]
--
- **Location transparency:** the way to refer to a file/data does not depend on its location
--
- **Concurrency transparency:** all clients should have the same view of the state of the file system / database
--
- **Failure transparency:** clients should not notice a single node failure
--
- **Heterogeneity and scale transparency:** it should not matter on which specific machines or on how many machines the file system / the database is distributed
--
- **Replication transparency:** clients should be unaware of the file replication performed across multiple servers to support scalability
--
- **Migration transparency:** files should be able to move between different servers without the client's knowledge (*interoperability*).
--
- ...
---
# 4. Distributed file systems
???
**Source:**
- https://en.wikipedia.org/wiki/Clustered_file_system#Distributed_file_systems
- https://en.wikipedia.org/wiki/Distributed_file_system_for_cloud
---
## Distributed file systems
**Distributed file systems** are also known historically as **network file systems**.
They date back to the beginnings of networking in the 1960s.
In 1985, Sun Microsystems created the "Network File System" (NFS), still in use today.
---
## An example of distributed file system: HDFS
HDFS stands for "Hadoop Distributed File System."
It is an open-source project hosted by the Apache Software Foundation.
.footnote[

This section is heavily inspired by the HDFS documentation pages ([link](https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html)). Asterisks (*) denote (almost) exact citations.
]
???
**Sources:**
- https://en.wikipedia.org/wiki/Apache_Hadoop (the explanation seems to be about Hadoop v1, since HDFS is referred to as the job tracker)
- https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html
---
## An example of distributed file system: HDFS
_Hadoop_ is actually a complete suite of _modules_, of which HDFS and YARN are the basic components.
But "_Hadoop_" in a broader sense refers to a complete software ecosystem, most of which is also supported by the Apache Foundation. This ecosystem encompasses the _Hadoop_ modules MapReduce, Ozone and Submarine, and the libraries Ambari, Avro, Cassandra, Chukwa, HBase, Hive, Mahout, Pig, Spark, Tez and ZooKeeper. More information on [Hadoop's website](https://hadoop.apache.org).
---

.footnote[
**Source:** https://www.oreilly.com/library/view/apache-hive-essentials/9781788995092/e846ea02-6894-45c9-983a-03875076bb5b.xhtml
]
???
<!-- CHECK : what do the libraries mentionned on the previous slide do, that are not displayed on this table (e.g. Tez) ? Conversely, what are the software mentionned in this table that are not in the previous list (e.g. Flume) ? -->
---
## An example of distributed file system: HDFS
_Hadoop_ was first released in 2006, and has evolved a lot since.
HDFS's development was inspired by the publication of _Google File System_, a now deprecated¹ file system developed by Google, and was built to serve as infrastructure for the _Apache Nutch_ web search engine project.
We will here focus on the latest version, Hadoop 3.
.footnote[_Hadoop_ is mostly coded in _Java_. <br/>
¹ Source: https://www.systutorials.com/colossus-successor-to-google-file-system-gfs/ <br/>
Map-reduce principles: https://static.googleusercontent.com/media/research.google.com/fr//archive/mapreduce-osdi04.pdf <br/>
Google File System: https://static.googleusercontent.com/media/research.google.com/fr//archive/gfs-sosp2003.pdf]
---
## An example of distributed file system: HDFS
### Why HDFS?
- open-source
- well-documented
- widespread
---
## An example of distributed file system: HDFS
### The architecture
HDFS has a master/slave architecture.
**NameNode:** a master server that manages the file system namespace and regulates access to files by clients*
**DataNodes:** slaves which manage storage attached to the nodes that they run on*
---
## An example of distributed file system: HDFS
### Key ideas
When given a new file, the **NameNode** splits it into one or more **blocks** (default size 128 MB, as sketched below) and assigns these blocks to a set of **DataNodes** for storage. Each block is stored several times, on different DataNodes; this number is the **replication factor** of that file.
**Files in HDFS cannot be modified**, except for appends and truncates. Emphasis is indeed on the reading part, a scheme HDFS calls "write-once-read-many".
By default, the replication factor is 3, and the NameNode tries to allocate the replicas intelligently: one on a given node, which then sends a copy to a close node (faster but less fault-tolerant) and another to a more distant node (slower but more fault-tolerant).
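As a rough illustration (plain Python, not HDFS's actual code), splitting a file into fixed-size blocks looks like this:
```python
BLOCK_SIZE = 128 * 1024 * 1024  # HDFS's default block size: 128 MB

def split_into_blocks(path, block_size=BLOCK_SIZE):
    """Yield successive fixed-size blocks of a file (the last one may be smaller)."""
    with open(path, "rb") as f:
        while True:
            block = f.read(block_size)
            if not block:
                return
            yield block
```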
---
## An example of distributed file system: HDFS
### Key ideas
If a client wants to write a file, they ask the NameNode (a sketch of the pipeline follows the list below).
1. The NameNode splits the file into blocks.
2. For each block, it selects a number of DataNodes to write onto (typically 3), based on:
    1. disk space (more space is better)
    2. proximity to the client (closer is better)
    3. proximity to each other (one replica close, one far)
    4. distribution of blocks (blocks of the same file should be on different nodes)
3. It then passes the block split and the lists to the client.
4. The client writes each block to the first DataNode on its list.
5. Each DataNode passes the block and the list on to the next DataNode on the list. (This minimizes the use of the network between the cluster and the client, which is likely to be slower than the network inside the cluster.)
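A toy sketch of this write pipeline (hypothetical classes, not the real HDFS API):
```python
class DataNode:
    """Toy stand-in for a DataNode's local block storage."""
    def __init__(self, name):
        self.name, self.blocks = name, {}

    def receive(self, block_id, data, forward_to):
        self.blocks[block_id] = data   # store the block locally...
        if forward_to:                 # ...then forward it down the pipeline
            forward_to[0].receive(block_id, data, forward_to[1:])

def write_block(block_id, data, pipeline):
    # The client sends the block to the first DataNode only;
    # replication cascades inside the cluster's (faster) network.
    pipeline[0].receive(block_id, data, pipeline[1:])

pipeline = [DataNode(f"dn{i}") for i in range(3)]  # chosen by the NameNode
write_block("blk_0001", b"...block contents...", pipeline)
print([sorted(dn.blocks) for dn in pipeline])      # all three nodes hold the block
```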
.footnote[The "client" is often not directly the user, but some other module asking for read / write access.]
---
## An example of distributed file system: HDFS
### Key ideas
If a client wants to read a file, they ask the NameNode (sketched below).
1. The NameNode looks into the index to map the file path to blocks.
2. The NameNode locates, for each block, the DataNode holding it that is closest to the client.
3. It passes this information to the client, who in turn reads directly from the specified DataNodes.
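Schematically (toy in-memory stand-ins, not the real client API):
```python
# The NameNode's metadata: namespace (on disk) and block index (in memory).
namespace   = {"/data/file.csv": ["blk_1", "blk_2"]}   # path -> block ids
block_index = {"blk_1": "dn0", "blk_2": "dn1"}         # block -> closest DataNode
datanodes   = {"dn0": {"blk_1": b"part1,"}, "dn1": {"blk_2": b"part2"}}

def read_file(path):
    blocks = namespace[path]   # 1-2. metadata lookup only
    # 3. Bulk data is fetched directly from the DataNodes;
    # file contents never flow through the NameNode.
    return b"".join(datanodes[block_index[b]][b] for b in blocks)

print(read_file("/data/file.csv"))  # b'part1,part2'
```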
.footnote[The "client" is often not directly the user, but some other module asking for read / write access.]
---
## An example of distributed file system: HDFS
### Key ideas
.center.height400[]
---
## An example of distributed file system: HDFS
### Role of the master (NameNode)
The **NameNode**:
- is the entry point for a client's requests
- decides how to split files into blocks and on which DataNodes to store these blocks
- never handles actual files, only stores metadata
- knows at any time the correspondence between a file's name and its block identifiers (the **namespace**, stored on disk) and the physical location of blocks (the **block index**, kept in memory)
- detects DataNode failure by listening to their **heartbeat** and demands block replication as necessary
???
It periodically receives a Heartbeat and a Blockreport from each of the DataNodes in the cluster. Receipt of a Heartbeat implies that the DataNode is functioning properly. A Blockreport contains a list of all blocks on a DataNode.
- splits
- manages and stores the data registry
The NameNode is the arbitrator and repository for all HDFS metadata. The system is designed in such a way that user data never flows through the NameNode.*
The number of copies of a file is called the replication factor of that file. This information is stored by the NameNode.*
---
## An example of distributed file system: HDFS
### Role of the master (NameNode)
The NameNode executes file system namespace operations like opening, closing, and renaming files and directories. It also determines the mapping of blocks to DataNodes.*
The **namespace** is persisted to disk at regular intervals (every X seconds or every Y changes). In between, a record of the changes is also written to disk, in a file known as the edit log, so that at all times the entire file system is preserved. In case of a failure of the master, the data on disk is restored.
_Where_ exactly the blocks are stored, however, is not stored on disk, but kept in memory.
---
## An example of distributed file system: HDFS
### Role of the slaves
The DataNodes are responsible for serving read and write requests from the file system’s clients. The DataNodes also perform block creation, deletion, and replication upon instruction from the NameNode.*
**DataNodes:**
- emit a regular **heartbeat** containing the list of all the blocks stored locally
- send copies directly to each other when the NameNode requires a copy to be made
- give access (in read or write) directly to the client
---
## An example of distributed file system: HDFS
### Properties
#### Fault-tolerance
> The primary objective of HDFS is to store data reliably even in the presence of failures. The three common types of failures are NameNode failures, DataNode failures and network partitions.
---
## An example of distributed file system: HDFS
### Properties
#### Fault-tolerance
1. Hadoop modules have **location awareness**, i.e. they use the locations of the nodes relative to each other. This enables HDFS to obtain **safe redundancy** by replicating data in different locations. By default, there is one copy in the vicinity (typically the same rack of the server room) and one copy further away (typically, another rack).
---
## An example of distributed file system: HDFS
### Properties
#### Fault-tolerance
2. Failure is explicitly taken into account in the design of the file system. There may be 3 kinds of failure:
    - **NameNode failure:** when the NameNode fails, it is restarted, and:
        - restores the latest index saved on disk,
        - applies all the changes that are recorded in the edit log, also saved on disk,
        - reconstitutes the location of blocks in memory from the DataNodes' heartbeats.
      Running several NameNodes with a distributed edit log is also possible.
    - **Network failure:** when a subset of DataNodes lose connectivity with the NameNode (aka _network partition_), the NameNode detects the absence of _heartbeat_ and immediately demands replication of the blocks below the replication factor. Since replication happens in distinct locations, it is unlikely that all of the replicated blocks become unavailable. (By default, a DataNode is considered dead after 10 min of silence.)
    - **DataNode failure:** this is a special case of _network partition_ with only one node disconnected from the rest.
---
## An example of distributed file system: HDFS
### Properties
#### Availability
1. **Latency / throughput** is achieved by delegating to the slaves all the communication-intensive work. Yet, the NameNode is a bottleneck, since all communication goes through it. **High throughput** is obtained at the expense of a (relatively) **high latency**. HDFS is not conceived for interactive use.
2. HDFS natively supports **balancing**: it will use the least-used resources first.
3. HDFS is also compatible with _rebalancing_.
???
In some cases, user requests may start to concentrate on only a few DataNodes, and cause over-load and slow-down. Maybe a specific file is requested particularly often. Or maybe some DataNodes are full, whereas some others are almost empty — this might happen because you just added a new node, or because you deleted some voluminous file. Rebalancing (i.e. moving data blocks from over-used nodes to under-used ones, or replicating often-requested blocks) helps in such cases, even though it is not implemented by default as of today.
---
## An example of distributed file system: HDFS
### Properties
#### Scalability
1. By splitting files into blocks, HDFS does not limit the size of a single file, nor the number of files.
2. You can add slave servers at any time.
3. Users interact only briefly with the NameNode, and most of the interaction with the cluster is decentralized, hence enabling a large number of simultaneous users.
However, HDFS is not meant for frequent writes, and does not scale in this regard. Indeed, even though data does not go through the NameNode, editing the index — which is, please remember, copied onto disk — and demanding copies from the DataNodes take time. Multiple NameNodes can cope with this, but that is not the standard installation. HDFS calls this scheme "write-once-read-many".
---
## An example of distributed file system: HDFS
### Properties
#### Integrity
HDFS ensures integrity by storing the _**checksum**_ of a block alongside the block itself. If the block gets corrupted, the computed checksum does not match the stored one. (It is possible, but much less likely, that the checksum itself gets corrupted.)
Clients checksum the blocks they download and inform the DataNode of the result. DataNodes also run regular checksums of all unchecked blocks.
DataNodes store information about the checks already made, and can thus narrow down which data has become corrupted.
In case of a checksum mismatch, clients flag the incriminated block as corrupted to the NameNode, which then demands a new copy.
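A minimal sketch of the mechanism (using SHA-256 for illustration; HDFS itself uses CRC-style checksums computed per chunk):
```python
import hashlib

def checksum(block: bytes) -> str:
    return hashlib.sha256(block).hexdigest()

block = b"...block contents..."
stored = checksum(block)   # stored alongside the block at write time

def verify(data: bytes, stored_checksum: str) -> bool:
    # On read, recompute the checksum and compare it to the stored one.
    return checksum(data) == stored_checksum

assert verify(block, stored)             # an intact block passes
assert not verify(block + b"!", stored)  # a corrupted block is flagged
```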
---
## Other distributed file systems
- Proprietary file systems such as Amazon S3, Azure Storage, Google Cloud Storage, etc.
- **NFS** (Linux), **SMB** (Windows) or **AFP** (Apple), typically used in Network-attached storages (NAS's)
- **Lustre**, typically used in clusters of super computers
- ...
???
**Source:** https://en.wikipedia.org/wiki/Comparison_of_distributed_file_systems
<!--Highlight some differences?-->
> Network File System (NFS) uses a client-server architecture, which allows sharing files between a number of machines on a network as if they were located locally, providing a standardized view. The NFS protocol allows heterogeneous clients' processes, probably running on different machines and under different operating systems, to access files on a distant server, ignoring the actual location of files. Relying on a single server results in the NFS protocol suffering from potentially low availability and poor scalability. Using multiple servers does not solve the availability problem since each server is working independently.[5] The model of NFS is a remote file service. This model is also called the remote access model, which is in contrast with the upload/download model:
> Remote access model: provides transparency; the client has access to a file and sends requests to the remote file (while the file remains on the server).[6]
> Upload/download model: the client can access the file only locally. It means that the client has to download the file, make modifications, and upload it again, to be used by other clients.
> The file system used by NFS is almost the same as the one used by Unix systems. Files are hierarchically organized into a naming graph in which directories and files are represented by nodes.
---
# 5. Distributed databases
.footnote[This section is heavily inspired by _Cassandra: The Definitive Guide: Distributed Data at Web Scale_ by Jeff Carpenter]
???
https://en.wikipedia.org/wiki/Distributed_database
---
## Specific problems of distributed databases
A key feature of (relational) databases is their **consistency**.
For instance, if your schema requires a foreign key, you are not allowed to add a line to a given table without linking to a record from another table.
However, when distributing databases, it becomes increasingly difficult to maintain **consistency** across copies of many tables. Indeed, the mere lookup of a foreign key in a table distributed across many nodes may take a long time, and one is supposed to do so for each record created or updated.
Distributed databases thus generally give up the relational component, and are thus known as **no-SQL databases**.
Doing so, they become mere **distributed data stores**, and **the frontier with a file system tends to blur**. They often consist of just a querying program and an interface with the underlying file system.
---
## Specific problems of distributed databases
More fundamentally, **strict consistency** means every node containing the data gets updated simultaneously; this means that:
- there is some global time that the nodes can refer to;
- the nodes can agree in a sure way that some change will be acknowledged at a given instant (and the other nodes know that this node knows that they know it knows...¹).
The only way to ensure this form of consistency is through locking and releasing resources.
.footnote[¹ This is known as **the Two Generals' Problem**, where two generals refuse to attack until each is completely sure that the other is also sure that they both agree to attack.]
---
## Specific solutions for distributed databases
### Denormalisation
.center.height400[]
---
## Specific solutions for distributed databases
### Denormalisation
When data becomes large, join operations become expensive. (So do consistency checks on foreign keys.)
A solution is to pre-compute joins, so that the lookup is performed only once, as in the sketch below. This is known as denormalisation.
But then maintaining consistency across the tables becomes a problem.
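A toy illustration with pandas (hypothetical tables):
```python
import pandas as pd

orders   = pd.DataFrame({"order_id": [1, 2], "product_id": ["p1", "p3"]})
products = pd.DataFrame({"product_id": ["p1", "p3"], "price": [45, 200]})

# Normalised: every read pays for the join (and the foreign-key lookup).
joined = orders.merge(products, on="product_id")

# Denormalised: compute the join once and store the wide table; reads become
# single-table lookups, but the price of p1 is now duplicated and must be
# kept consistent everywhere it appears.
orders_denormalised = joined
print(orders_denormalised)
```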
---
## Specific solutions for distributed databases
### Two-phase commit
In order to maintain schema consistency, you lock a resource until the transaction is finished (see the sketch below).
Downsides:
- The resource is locked in the meantime, preventing other clients from accessing it. It is thus only acceptable for transactions that complete very fast.
- Infinite waiting is possible.
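A bare-bones sketch of the protocol (plain Python; a real implementation would also persist the decision to survive coordinator crashes):
```python
class Participant:
    def __init__(self, name, will_succeed=True):
        self.name, self.will_succeed = name, will_succeed

    def prepare(self):
        # Phase 1: do the work and lock the resource, but commit nothing yet;
        # the vote says whether this node *can* commit.
        return self.will_succeed

    def commit(self):
        print(f"{self.name}: commit and release the lock")

    def rollback(self):
        print(f"{self.name}: roll back and release the lock")

def two_phase_commit(participants):
    votes = [p.prepare() for p in participants]  # phase 1: collect all votes
    if all(votes):                               # unanimity is required
        for p in participants:
            p.commit()                           # phase 2: everybody commits
        return True
    for p in participants:
        p.rollback()                             # any "no" aborts everybody
    return False

two_phase_commit([Participant("node1"), Participant("node2", will_succeed=False)])
```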
---
## Specific solutions for distributed databases
### Compensation
In order to maintain schema consistency, you undo a previously applied change if the transaction finally ends up in an error: for instance, subtracting 10 if you initially added 10 (see the sketch below).
Downsides:
- does not work for legally binding transactions (stock sales, bank transfers...)
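Schematically (a toy money-transfer sketch): every step has an inverse that is applied if a later step fails.
```python
a = {"balance": 100}
b = {"balance": 0, "frozen": True}

def transfer(src, dst, amount):
    src["balance"] -= amount              # step 1: debit
    try:
        if dst["frozen"]:                 # step 2: credit, which may fail
            raise RuntimeError("credit refused")
        dst["balance"] += amount
    except RuntimeError:
        src["balance"] += amount          # compensation: undo the debit
        raise

try:
    transfer(a, b, 10)
except RuntimeError:
    pass
print(a["balance"])  # 100: the failed transfer was compensated
```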
---
## Specific solutions for distributed databases
### Sharding
<!-- read more -->
Sharding is the idea of splitting data over multiple nodes. Common approaches:
1. functional segmentation
2. (manual) key-based sharding (sketched below)
3. lookup tables
Downsides:
- imbalance / no duplication
- single point of failure
- every request goes through the lookup table
- the sharding key has to be cleverly chosen
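A minimal key-based sharding sketch (plain Python): the shard is derived from the key itself, so any client can route a request without a lookup table.
```python
N_SHARDS = 4
shards = [dict() for _ in range(N_SHARDS)]   # one store per node

def shard_for(key: str) -> int:
    # Deterministic hash-based routing. Note: adding a node changes
    # N_SHARDS and remaps most keys, which is why real systems prefer
    # consistent hashing.
    return hash(key) % N_SHARDS

def put(key, value):
    shards[shard_for(key)][key] = value

def get(key):
    return shards[shard_for(key)][key]

put("p1", 45)
put("p8", 1000)
print(get("p8"))  # 1000, fetched from the right shard
```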
<!--
## Specific solutions for distributed databases
### _share-nothing_ architecture
A _share-nothing_ architecture is an architecure where each node is independant from each other, so that consistency becomes irrelevant.
.footnote[Michael Stonebraker, 1986, "The Case for Shared Nothing."]
-->
---
## Specific solutions for distributed databases
### Sharding
.pull-left[
Product table, stored on one node
| Product | Price |
|---------|-------|
| p1 | 45 |
| p2 | 50 |
| p3 | 200 |
| p4 | 230 |
| p5 | 500 |
| p6 | 12 |
| p7 | 56 |
| p8 | 1000 |
]
.pull-right[
Product table sharded. Each shard is stored on one node (or multiple for redundancy)
.pull-left[
Shard 1, price < 100
| Product | Price |
|---------|-------|
| p1 | 45 |
| p2 | 50 |
| p6 | 12 |
| p7 | 56 |
]
.pull-right[
Shard 2, price >= 100
| Product | Price |
|---------|-------|
| p3 | 200 |
| p4 | 230 |
| p5 | 500 |
| p8 | 1000 |
]
]
---
## Specific solutions for distributed databases
### No-SQL databases
If we accept some departure from the ACID requirements, it is possible to reach (horizontal) scalability and high availability through distribution.
Such systems have always existed, but the unifying "no-SQL" label was a 2010s phenomenon.
- key-value store (ex: Voldemort)
- column store (ex: Cassandra)
- document store (ex: MongoDB)
- graph database (ex: Neo4J)
---
## An example of distributed database: Cassandra
Cassandra is part of the Hadoop ecosystem (it supports map-reduce tasks, and Cassandra nodes can be installed on top of Hadoop).
Born at Facebook in 2007, Cassandra bases its distribution design on Amazon’s Dynamo and its data model on Google’s Bigtable.
Some of the largest production deployments use Cassandra, including Apple's, with over 75,000 nodes storing over 10 PB of data, Netflix's (2,500 nodes, 420 TB, over 1 trillion requests per day), Chinese search engine Easou's (270 nodes, 300 TB, over 800 million requests per day), and eBay's (over 100 nodes, 250 TB).
.footnote[

]
???
Distributed
Every node in the cluster has the same role. There is no single point of failure. Data is distributed across the cluster (so each node contains different data), but there is no master as every node can service any request.
Supports replication and multi data center replication
Replication strategies are configurable.[16] Cassandra is designed as a distributed system, for deployment of large numbers of nodes across multiple data centers. Key features of Cassandra’s distributed architecture are specifically tailored for multiple-data center deployment, for redundancy, for failover and disaster recovery.
Scalability
Cassandra has innate balancing and rebalancing capabilities in the case of adding or removing nodes.
Cassandra uses **location awareness**, and distinguishes between two levels: the rack, and the data center.
<!-- flexible schema -->
1. The node launches a "snitch" (French: _cafteur_), whose task is to find the fastest available node having the requested records.
2. The client downloads the records.
3. Simultaneously, the node asks for the checksums of the same records from the other (slower) nodes that contain the same data. As many checksums are requested as the consistency level specified in the request.
### Gossiping
Once per second, a node will send a message to a random node of the cluster.
If the target node does not answer, the sender marks it locally as down.
<!--Not so clear...-->
.footnote[The term "gossip protocol" dates back to 1987, in an article by Alan Demers.]
Cassandra is typically classified as an AP system, meaning that availability and partition tolerance are generally considered to be more important than consistency in Cassandra,[17] Writes and reads offer a tunable level of consistency, all the way from "writes never fail" to "block for all replicas to be readable", with the quorum level in the middle.
Data is automatically replicated to multiple nodes for fault-tolerance. Replication across multiple data centers is supported. Failed nodes can be replaced with no downtime.
Designed to have read and write throughput both increase linearly as new machines are added, with the aim of no downtime or interruption to applications.
---
## An example of distributed database: Cassandra
### Key ideas
Cassandra is **decentralized**: it is a peer-to-peer database, at the opposite end of the spectrum from a master-slave architecture. Each node functions exactly like the others, and no single one is necessary to the overall functioning of the database (no **single point of failure**).
The nodes maintain their knowledge of the network through **gossip**.
Cassandra is a **sparse**, **row-oriented** database, meaning that not all columns need exist for each record. Each row gets a key, which is used to distribute and replicate the records. This allows Cassandra to save the space of unassigned values.
Cassandra's fundamental choice is giving up consistency in exchange for availability: there is no guarantee that data will be up to date when reading. Cassandra controls the trade-off between consistency and fault-tolerance with 2 parameters (illustrated below):
- the **replication factor** is the number of copies the database has to maintain
- the **consistency level** is specified at each interaction: it is the number of nodes the database has to consult, and that have to agree, for the operation to be considered a success.
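With the DataStax Python driver, for instance, the replication factor is set once when the keyspace is created, while the consistency level is chosen per query (a sketch, assuming a reachable cluster and an existing `shop.products` table):
```python
from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

cluster = Cluster(["127.0.0.1"])   # any node can serve as the entry point
session = cluster.connect()

# Replication factor: a property of the keyspace, fixed at creation.
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS shop
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3}
""")

# Consistency level: chosen per request. QUORUM means a majority of the
# 3 replicas must answer for this read to be considered a success.
query = SimpleStatement(
    "SELECT price FROM shop.products WHERE product = 'p1'",
    consistency_level=ConsistencyLevel.QUORUM,
)
rows = session.execute(query)
```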
---
## An example of distributed database: Cassandra
### Typical execution
When a client wants to write or read, it addresses itself to any of the nodes, while specifying a desired consistency level.
1. That node becomes the coordinator for the request.
2. The coordinator forwards the request to all the <!--known?--> replicas hosting the requested data.
3. In case of a write, each replica logs the request in its "commit log" on disk, performs the write in memory, then confirms the transaction's success and returns a result if any.
4. The coordinator waits for as many confirmations as the client requested with its consistency level before confirming the read/write and returning the consolidated response.
---
## An example of distributed database: Cassandra
### Replicas
When a write instruction is given, a replica:
1. logs it in a log (called "commit log")
2. pushes the changes in memory ("memtable")
3. if the current memtable reaches a threshold, creates a new memtable
4. when idle, copies the memtable to disk
This strikes a trade-off between availability (fast in-memory writes) and fault-tolerance (the on-disk commit log survives a crash).
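A toy sketch of this write path (plain Python, deliberately simplified):
```python
class Replica:
    def __init__(self, threshold=2):
        self.commit_log = []   # append-only, on disk: survives a crash
        self.memtable = {}     # in memory: makes writes fast
        self.sstables = []     # immutable memtables flushed to disk
        self.threshold = threshold

    def write(self, key, value):
        self.commit_log.append((key, value))   # 1. log first, for durability
        self.memtable[key] = value             # 2. then update memory
        if len(self.memtable) >= self.threshold:
            self.flush()                       # 3-4. full memtable goes to disk

    def flush(self):
        self.sstables.append(self.memtable)    # copy the memtable to disk...
        self.memtable = {}                     # ...and start a fresh one

r = Replica()
r.write("p1", 45)
r.write("p2", 50)              # the second write triggers a flush
print(r.sstables, r.memtable)  # [{'p1': 45, 'p2': 50}] {}
```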
---
## An example of distributed database: Cassandra
### Properties
#### Fault-tolerance
---
## An example of distributed database: Cassandra
### Properties
#### Scalability
---
## An example of distributed database: Cassandra
### Properties
#### Consistency
---
# 6. Distributing tasks
---
## Scheduling