Commit 1a3916e (2 parents: d7d6e5c + f067047)
Merge pull request #103224 from mikeburek/patch-1
Fixed list of template folders

1 file changed: 7 additions, 7 deletions
articles/storage/blobs/data-lake-storage-best-practices.md

In IoT workloads, there can be a great deal of data being ingested that spans across numerous products, devices, organizations, and customers. It's important to pre-plan the directory layout for organization, security, and efficient processing of the data for downstream consumers. A general template to consider might be the following layout:

- *{Region}/{SubjectMatter(s)}/{yyyy}/{mm}/{dd}/{hh}/*
For example, landing telemetry for an airplane engine within the UK might look like the following structure:

- *UK/Planes/BA1293/Engine1/2017/08/11/12/*
In this example, by putting the date at the end of the directory structure, you can use ACLs to more easily secure regions and subject matters to specific users and groups. If you put the date structure at the beginning, it would be much more difficult to secure these regions and subject matters. For example, if you wanted to provide access only to UK data or certain planes, you'd need to apply a separate permission for numerous directories under every hour directory. This structure would also exponentially increase the number of directories as time went on.
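As a rough illustration (not part of the article), the template above can be generated with a small helper; the function name `telemetry_dir` and its signature are invented for this sketch:

```python
from datetime import datetime
from pathlib import PurePosixPath

def telemetry_dir(region: str, subjects: list[str], ts: datetime) -> str:
    """Build a {Region}/{SubjectMatter(s)}/{yyyy}/{mm}/{dd}/{hh} directory name."""
    parts = [region, *subjects, f"{ts:%Y}", f"{ts:%m}", f"{ts:%d}", f"{ts:%H}"]
    return str(PurePosixPath(*parts))

# Reproduces the article's airplane-engine example:
print(telemetry_dir("UK", ["Planes", "BA1293", "Engine1"], datetime(2017, 8, 11, 12)))
# UK/Planes/BA1293/Engine1/2017/08/11/12
```

Keeping the date components at the end, as here, means a single ACL on *UK/* or *UK/Planes/BA1293/* covers every hour directory beneath it.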

Sometimes file processing is unsuccessful due to data corruption or unexpected formats. In such cases, a directory structure might benefit from a **/bad** folder to move the files to for further inspection. The batch job might also handle the reporting or notification of these *bad* files for manual intervention. Consider the following template structure:

- *{Region}/{SubjectMatter(s)}/In/{yyyy}/{mm}/{dd}/{hh}/*
- *{Region}/{SubjectMatter(s)}/Out/{yyyy}/{mm}/{dd}/{hh}/*
- *{Region}/{SubjectMatter(s)}/Bad/{yyyy}/{mm}/{dd}/{hh}/*
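The **/in**, **/out**, **/bad** routing above can be sketched as follows; the function name, the CSV check standing in for real processing, and the `processed_` prefix are all assumptions for illustration, not the article's implementation:

```python
import csv
import shutil
from pathlib import Path

def process_extract(in_file: Path, out_dir: Path, bad_dir: Path) -> Path:
    """Process one landed extract; route it to Out on success or Bad on failure."""
    try:
        with in_file.open(newline="") as f:
            rows = list(csv.reader(f))  # stand-in for the real processing step
        if not rows:
            raise ValueError("empty extract")
        dest = out_dir / f"processed_{in_file.name}"
    except (csv.Error, ValueError):
        dest = bad_dir / in_file.name  # park the file for manual inspection
    dest.parent.mkdir(parents=True, exist_ok=True)
    shutil.move(str(in_file), str(dest))
    return dest
```

A real batch job would also report or notify on anything that lands in **/bad**, as the article notes.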
For example, a marketing firm receives daily data extracts of customer updates from their clients in North America. It might look like the following snippet before and after being processed:

- *NA/Extracts/ACMEPaperCo/In/2017/08/14/updates_08142017.csv*
- *NA/Extracts/ACMEPaperCo/Out/2017/08/14/processed_updates_08142017.csv*
In the common case of batch data being processed directly into databases such as Hive or traditional SQL databases, there isn't a need for an **/in** or **/out** directory because the output already goes into a separate folder for the Hive table or external database. For example, daily extracts from customers would land into their respective directories. Then, a service such as [Azure Data Factory](../../data-factory/introduction.md), [Apache Oozie](https://oozie.apache.org/), or [Apache Airflow](https://airflow.apache.org/) would trigger a daily Hive or Spark job to process and write the data into a Hive table.
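As a purely hypothetical sketch of that pattern, a daily job might map a landed extract path to a per-table partition directory; the `warehouse/customer_updates` layout, partition keys, and function name below are invented for illustration and are not from the article:

```python
from pathlib import PurePosixPath

def hive_partition_dir(extract_path: str,
                       warehouse: str = "warehouse/customer_updates") -> str:
    """Derive a hypothetical per-table partition directory from a landed extract."""
    # Expected landing layout: {Region}/Extracts/{Customer}/In/{yyyy}/{mm}/{dd}/{file}
    region, _, customer, _, yyyy, mm, dd, _ = PurePosixPath(extract_path).parts
    return f"{warehouse}/region={region}/customer={customer}/ds={yyyy}-{mm}-{dd}"

print(hive_partition_dir("NA/Extracts/ACMEPaperCo/In/2017/08/14/updates_08142017.csv"))
# warehouse/customer_updates/region=NA/customer=ACMEPaperCo/ds=2017-08-14
```

Because each day's output lands in its own partition folder, the table directory itself serves as the "out" location and no separate **/out** tree is needed.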
