basics-docs/Big Data Basics.txt at master · lwooden/basics-docs · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
Big Data Basics

4 V's of Big Data
------------------
Volume - large volumes of data
Variety - various forms; structured, semi-structured, unstructured
Velocity - batch, streaming, real time
Variability - inconsistent, sometimes inaccurate, varying data

Big Data is a result of technology advances and innovations made in the past few decades
Cheap and faster storage + the internet + virtualization + cloud computing

Sources of Big Data
-------------------
Social Media
Machine-Generated Data
Business Transactions
IOT
Sensors

Big Data Example
----------------
UPS uses sensors on its vehicles to monitor speed, miles per gallon, mileage, number of stops and engine health.
 - UPS collects [200 data points] for each 80,000 vehicles every day


EDW = Enterprise Data Warehouse


Cycle of Big Data Management
----------------------------
1. Capture - depends of the problem to be solved
2. Organize - cleanse, organize, and validate data
3. Integrate - integrate with business rules or data warehouses
4. Analyze - real-time analysis, batch, reports, visualizations
5. Act - use analysis to solve the business problem

Layers of Big Data Architecture
--------------------------------
Layer 1: Physical Infrastructure - clusters, network, in-memory analytics
Layer 2: Security & Governance
Layer 3: Data - Text/Audio, Social Media, Machine Generated, Human Generated
Layer 4: Capture - Distributed File Systems, Streaming, NoSQL, RDBMS
Layer 5: Organize & Integrate - Apache Spark, Hadoop, ETL, Data Warehouse
Layer 6: Analyze - Predictive Analytics, Advanced Analytics, Alerts/Recommendations, Visualizations/Reports/Dashboards

*Not an architecture, but a strategy!

Organizing & Integrate Data

1. Cleaning data - identifying missing or incomplete data
2. Transformation - morphing data to a suitable form for analysis
3. Normalization - modifying structure of database schema to minimize redundancy


Data Lake -> ETL -> Hadoop/Apache Spark -> Data Warehouse -> Analyze
Text/Audio  						     Visualizations
Social Media						     Reports
Machine Gen						     Dashboards
Human Gen

ETL = Extract Transform Load

1. Extract - read data from the data source
2. Transform - convert the format of the data so that it conforms to the requirements of the target database
3. Load - write data to the target database

Apache Hadoop Map Reduce - good for batch processing large volumes of data in  scalable and resilient style
Apace Spark - good for applying complex analytics using machine learning models

Data Warehouse
--------------
A relational database that is designed for query and analysis rather than for transaction processing.
Usually contains historical data derived from transaction data from other sources. It seperates analysis workload from
transaction workload.

1. Organized by subject area, it contains an abstracted view of the business
2. Highly transformed and structured
3. Data loaded in the data warehouse is based on a strictly defined use case

Platforms: Redshift, MongoDB, SQL

Data Lake
-----------
A storage repository that holds vast amounts of raw data in its native format, including structured, semi-structured,
and unstructured data. The data structure and requirements are not defined until the data is needed

Platforms: S3, HDFS, Azure


Data Lake Vs. Data Warehouse
----------------------------

[Data Lake]
1. Retains all data
2. Data types consist of data sources such as web server logs, sensor data, text, images
3. Processing of data follows "Schema-on-read"
4. Extremely agile and can adapt to changes in businesss requirements quickly
5. Harder to secure
6. Hardware must be clustered to support a data lake
7. Used for advanced analytics


[Data Warehouse]
1. Only data that is required for the business use case and is highly modeled and structured
2. Consists of data extracted from transactional systems and consist of quantitative metrics and the attributes
3. Processing of data follows "Schema-on-write"
4. Due to strucuted modeling, time consuming to modify business processes
5. Older more mature technology; secure
6. Hardware consists of enterprise grade servers
7. Used for operational analysis to report on KPIs and slices of data

Types of Analytics
------------------
Predictive - using statistical models to predict the outcome of a task
Advanced - deep learning for apps like speech/face recognition
Social Media - determine market trends, forecast demand, etc
Text - derive analytics from natural language processing
Alerts/Recommendations - fraud detection and recommendation engines
Reports, Dashboards, and Visualizations = descriptive; answers what, when, where