Commit 659af58

Merge pull request #2 from AIML4OS/Austria
merge Austria chapter 1 - data to main
2 parents 33f6eee + ce24d25 commit 659af58

1 file changed, +12 -1 lines changed

chapters/chapter1.qmd

Lines changed: 12 additions & 1 deletion
@@ -3,6 +3,7 @@ title: "Data"
 format: html # This qmd file will be compiled as an HTML web page
 ---

+
 # Data storage

 ## Germany
@@ -12,6 +13,10 @@ For programming and data processing, we use Cloudera Machine Learning (CML) with
 We store our data in the Parquet format, which is ideal for big data. In addition, to make it easier for users to handle and cross-check the data, we use Hue (Hadoop User Experience), an open-source, SQL-based cloud editor.
 For rights management, we use Ranger, which provides a wide variety of access controls to ensure data security.

+## Austria
+Training data is stored as CSV files. Subject matter experts add new files quarterly (between 300 and 500 data entries), which are then used to retrain the model.
+
+
 # Data cleaning

 ## Germany
@@ -21,11 +26,17 @@ First, data augmentation is performed by adding new text entries (e.g. text like
 Adding a variety of new textual entries helps the model to generalize better.
 Secondly, we clean the data by removing punctuation and handling missing values.

+## Austria
+Duplicate entries are removed from the data. Text inputs are converted to lower case. We also remove stop words, umlaut characters (ä, ö, ü), special characters (e.g. -, +, #), gender-specific word endings (e.g. "-in", ":innen"), and numbers.
+Each categorical variable has a predefined set of valid input classes, since the model can only handle known classes. All known inputs are mapped onto this set of classes. Unknown inputs are assigned to their respective "unknown" category.
+
+
 # Data versioning

 ## Germany

 None

-
+## Austria
+None

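The Austria cleaning steps added in this commit could be sketched roughly as follows. This is only an illustration under stated assumptions, not the code used by the Austrian team: the stop-word list, the alias table, the function names (`clean_text`, `normalize_category`, `drop_duplicates`) and the exact set of gender-ending markers are all hypothetical. Umlauts are dropped literally, as the chapter text describes; a real pipeline might transliterate them instead.

```python
import re

# Hypothetical stop-word list; the chapter does not say which list is used.
STOP_WORDS = {"der", "die", "das", "und", "oder", "in", "im", "mit"}

# Naive pattern for explicitly marked gender endings such as "-in" or ":innen".
# The markers "*", "_" and "/" are an assumption beyond the two examples given.
GENDER_SUFFIX = re.compile(r"[:*_/\-]+in(?:nen)?\b")


def clean_text(text: str) -> str:
    """Lower-case the text and strip the elements listed in the chapter."""
    text = text.lower()
    # Remove gender endings first, while their separators are still present.
    text = GENDER_SUFFIX.sub("", text)
    # Drop umlaut characters (assumption: removed, not transliterated).
    text = re.sub(r"[äöü]", "", text)
    # Replace special characters, digits and underscores with spaces.
    text = re.sub(r"[^\w\s]|[\d_]", " ", text)
    # Remove stop words and collapse whitespace.
    tokens = [t for t in text.split() if t not in STOP_WORDS]
    return " ".join(tokens)


def normalize_category(value: str, aliases: dict, valid_classes: set) -> str:
    """Map a raw input onto the predefined classes, else return "unknown"."""
    key = value.strip().lower()
    mapped = aliases.get(key, key)
    return mapped if mapped in valid_classes else "unknown"


def drop_duplicates(rows: list) -> list:
    """Remove duplicated entries while keeping the original order."""
    return list(dict.fromkeys(rows))


if __name__ == "__main__":
    print(clean_text("Die Lehrer:innen und 3 Verkäufer-in #Büro"))
    # prints "lehrer verkufer bro"
```

The suffix pattern is deliberately simple and will also strip "-in" from compounds such as "check-in", so a production version would likely need a more careful rule. Rows passed to `drop_duplicates` must be hashable, for example tuples.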