-
Notifications
You must be signed in to change notification settings - Fork 1
Expand file tree
/
Copy pathBig Data Basics.txt
More file actions
120 lines (92 loc) · 4.53 KB
/
Big Data Basics.txt
File metadata and controls
120 lines (92 loc) · 4.53 KB
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
Big Data Basics
4 V's of Big Data
------------------
Volume - large volumes of data
Variety - various forms; structured, semi-structured, unstructured
Velocity - batch, streaming, real time
Variability - inconsistent, sometimes inaccurate, varying data
Big Data is a result of technology advances and innovations made in the past few decades
Cheap and faster storage + the internet + virtualization + cloud computing
Sources of Big Data
-------------------
Social Media
Machine-Generated Data
Business Transactions
IOT
Sensors
Big Data Example
----------------
UPS uses sensors on its vehicles to monitor speed, miles per gallon, mileage, number of stops and engine health.
- UPS collects [200 data points] for each 80,000 vehicles every day
EDW = Enterprise Data Warehouse
Cycle of Big Data Management
----------------------------
1. Capture - depends of the problem to be solved
2. Organize - cleanse, organize, and validate data
3. Integrate - integrate with business rules or data warehouses
4. Analyze - real-time analysis, batch, reports, visualizations
5. Act - use analysis to solve the business problem
Layers of Big Data Architecture
--------------------------------
Layer 1: Physical Infrastructure - clusters, network, in-memory analytics
Layer 2: Security & Governance
Layer 3: Data - Text/Audio, Social Media, Machine Generated, Human Generated
Layer 4: Capture - Distributed File Systems, Streaming, NoSQL, RDBMS
Layer 5: Organize & Integrate - Apache Spark, Hadoop, ETL, Data Warehouse
Layer 6: Analyze - Predictive Analytics, Advanced Analytics, Alerts/Recommendations, Visualizations/Reports/Dashboards
*Not an architecture, but a strategy!
Organizing & Integrate Data
1. Cleaning data - identifying missing or incomplete data
2. Transformation - morphing data to a suitable form for analysis
3. Normalization - modifying structure of database schema to minimize redundancy
Data Lake -> ETL -> Hadoop/Apache Spark -> Data Warehouse -> Analyze
Text/Audio Visualizations
Social Media Reports
Machine Gen Dashboards
Human Gen
ETL = Extract Transform Load
1. Extract - read data from the data source
2. Transform - convert the format of the data so that it conforms to the requirements of the target database
3. Load - write data to the target database
Apache Hadoop Map Reduce - good for batch processing large volumes of data in scalable and resilient style
Apace Spark - good for applying complex analytics using machine learning models
Data Warehouse
--------------
A relational database that is designed for query and analysis rather than for transaction processing.
Usually contains historical data derived from transaction data from other sources. It seperates analysis workload from
transaction workload.
1. Organized by subject area, it contains an abstracted view of the business
2. Highly transformed and structured
3. Data loaded in the data warehouse is based on a strictly defined use case
Platforms: Redshift, MongoDB, SQL
Data Lake
-----------
A storage repository that holds vast amounts of raw data in its native format, including structured, semi-structured,
and unstructured data. The data structure and requirements are not defined until the data is needed
Platforms: S3, HDFS, Azure
Data Lake Vs. Data Warehouse
----------------------------
[Data Lake]
1. Retains all data
2. Data types consist of data sources such as web server logs, sensor data, text, images
3. Processing of data follows "Schema-on-read"
4. Extremely agile and can adapt to changes in businesss requirements quickly
5. Harder to secure
6. Hardware must be clustered to support a data lake
7. Used for advanced analytics
[Data Warehouse]
1. Only data that is required for the business use case and is highly modeled and structured
2. Consists of data extracted from transactional systems and consist of quantitative metrics and the attributes
3. Processing of data follows "Schema-on-write"
4. Due to strucuted modeling, time consuming to modify business processes
5. Older more mature technology; secure
6. Hardware consists of enterprise grade servers
7. Used for operational analysis to report on KPIs and slices of data
Types of Analytics
------------------
Predictive - using statistical models to predict the outcome of a task
Advanced - deep learning for apps like speech/face recognition
Social Media - determine market trends, forecast demand, etc
Text - derive analytics from natural language processing
Alerts/Recommendations - fraud detection and recommendation engines
Reports, Dashboards, and Visualizations = descriptive; answers what, when, where