Commit 1d12cba

updated docs regarding double parsing
1 parent 89e0d41 commit 1d12cba

2 files changed: +68 -5 lines changed
docs/StardustDocs/topics/convert.md

Lines changed: 1 addition & 1 deletion
@@ -44,7 +44,7 @@ df.convert { name }.asFrame { it.add("fullName") { "$firstName $lastName" } }
 * `Int` (and `Char`)
 * `Long`
 * `Float`
-* `Double`
+* `Double` (See [parsing doubles](parse.md#parsing-doubles) for `String` to `Double` conversion)
 * `BigDecimal`
 * `BigInteger`
 * `LocalDateTime` (kotlinx.datetime and java.time)

docs/StardustDocs/topics/parse.md

Lines changed: 67 additions & 4 deletions
@@ -5,6 +5,10 @@ Returns a [`DataFrame`](DataFrame.md) in which the given `String` columns are pa
 
 This is a special case of the [convert](convert.md) operation.
 
+This parsing operation is sometimes executed implicitly, for example, when [reading from CSV](read.md) or
+[type converting from `String` columns](convert.md).
+You can recognize this by the `locale` or `parserOptions` arguments in these functions.
+
 <!---FUN parseAll-->
 
 ```kotlin
@@ -25,6 +29,8 @@ df.parse { age and weight }
 <dataFrame src="org.jetbrains.kotlinx.dataframe.samples.api.Modify.parseSome.html"/>
 <!---END-->
 
+### Parsing Order
+
 `parse` tries to parse every `String` column into one of supported types in the following order:
 * `Int`
 * `Long`
@@ -34,16 +40,30 @@ df.parse { age and weight }
 * `Duration` (`kotlin.time` and `java.time`)
 * `LocalTime` (`java.time`)
 * `URL` (`java.net`)
-* `Double` (with optional locale settings)
+* [`Double` (with optional locale settings)](#parsing-doubles)
 * `Boolean`
 * `BigDecimal`
 * `JSON` (arrays and objects)
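
To illustrate why the order of this list matters, here is a minimal per-value sketch that uses only the Kotlin standard library. `parseFirstMatch` is a hypothetical helper, not a DataFrame function. It covers only a subset of the listed types, ignores locale settings, and works on single values, whereas `parse` works on whole `String` columns.

```kotlin
// Hypothetical helper (not part of DataFrame): tries a subset of the documented
// types in the documented relative order and returns the first successful result.
// Requires Kotlin 1.5+ for toBooleanStrictOrNull().
fun parseFirstMatch(text: String): Any =
    text.toIntOrNull()                   // Int is tried first
        ?: text.toLongOrNull()           // then Long
        ?: text.toDoubleOrNull()         // then Double (not locale-aware in this sketch)
        ?: text.toBooleanStrictOrNull()  // then Boolean ("true" / "false" only)
        ?: text.toBigDecimalOrNull()     // then BigDecimal
        ?: text                          // nothing matched: keep the String

fun main() {
    println(parseFirstMatch("42"))         // 42 (Int, because Int is tried before Long and Double)
    println(parseFirstMatch("3000000000")) // 3000000000 (Long, too large for Int)
    println(parseFirstMatch("1.5"))        // 1.5 (Double)
    println(parseFirstMatch("true"))       // true (Boolean)
    println(parseFirstMatch("hello"))      // hello (left as String)
}
```

Because narrower types come first, a value such as `"42"` ends up as `Int` even though later parsers would also accept it.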
 
+### Parser Options
+
+DataFrame supports multiple parser options that can be used to customize the parsing behavior.
+These can be supplied to the `parse` function (or any other function that can implicitly parse `Strings`)
+as an argument:
+
 Available parser options:
-* `locale: Locale` is used to parse doubles
+* `locale: Locale` is used to [parse doubles](#parsing-doubles)
+  * Default locale is `Locale.getDefault()`
 * `dateTimePattern: String` is used to parse date and time
 * `dateTimeFormatter: DateTimeFormatter` is used to parse date and time
-* `nullStrings: List<String>` is used to treat particular strings as `null` value. Default null strings are **"null"** and **"NULL"**
+* `nullStrings: List<String>` is used to treat particular strings as `null` value
+  * Default null strings are **"null"** and **"NULL"**
+  * When [reading from CSV](read.md), we include even more defaults, like **""**, and **"NA"**.
+    See the KDocs there for the exact details
+* `skipTypes: Set<KType>` types that should be skipped during parsing
+  * Empty set by default; parsing can result in any supported type
+* `useFastDoubleParser: Boolean` is used to enable or disable the [new fast double parser](#parsing-doubles)
+  * Enabled by default
 
 <!---FUN parseWithOptions-->
 
@@ -54,8 +74,13 @@ df.parse(options = ParserOptions(locale = Locale.CHINA, dateTimeFormatter = Date
 <dataFrame src="org.jetbrains.kotlinx.dataframe.samples.api.Modify.parseWithOptions.html"/>
 <!---END-->
 
+### Global Parser Options
+
 You can also set global parser options that will be used by default in [`read`](read.md), [`convert`](convert.md),
-and `parse` operations:
+and other `parse` operations.
+These can be seen as a global fallback for the `parserOptions` argument.
+
+For example, to change the locale to French and add a custom date-time pattern:
 
 <!---FUN globalParserOptions-->
 
@@ -64,4 +89,42 @@ DataFrame.parser.locale = Locale.FRANCE
 DataFrame.parser.addDateTimePattern("dd.MM.uuuu HH:mm:ss")
 ```
 
+This means that the locale being used by the parser is defined as:
+
+↪ The locale given as function argument directly, or in `parserOptions`, if it is not `null`, else
+
+&nbsp;&nbsp;&nbsp;&nbsp;↪ The locale set by `DataFrame.parser.locale = ...`, if it is not `null`, else
+
+&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;`Locale.getDefault()`, which is the system's default locale that can be changed with `Locale.setDefault()`.
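
The three-step fallback described above can be sketched with a plain Elvis chain. This is only an illustration of the documented lookup order: `resolveLocale` and its parameter names are hypothetical and do not exist in the DataFrame API.

```kotlin
import java.util.Locale

// Hypothetical helper (not part of DataFrame): mirrors the documented lookup order.
fun resolveLocale(argumentLocale: Locale?, globalLocale: Locale?): Locale =
    argumentLocale              // 1. locale passed directly or inside parserOptions
        ?: globalLocale         // 2. locale set via DataFrame.parser.locale = ...
        ?: Locale.getDefault()  // 3. the JVM default locale

fun main() {
    println(resolveLocale(Locale.CHINA, Locale.FRANCE)) // the explicit argument wins
    println(resolveLocale(null, Locale.FRANCE))         // the global setting is used
    println(resolveLocale(null, null))                  // falls back to Locale.getDefault()
}
```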
+
+### Parsing Doubles
+
+DataFrame has a new fast and powerful double parser enabled by default.
+It is based on [the FastDoubleParser library](https://github.com/wrandelshofer/FastDoubleParser) for its
+high performance and configurability
+(in the future, we might expand this support to `Float`, `BigDecimal`, and `BigInteger` as well).
+
+The parser is locale-aware; it will use the locale set by the [parser options](#parser-options) to parse the doubles.
+It also has a fallback mechanism built in, meaning it can recognize characters from
+all other locales (and some from [Wikipedia](https://en.wikipedia.org/wiki/Decimal_separator))
+and parse them correctly as long as they don't conflict with the current locale.
+
+For example, if your locale uses ',' as decimal separator, it will not recognize ',' as thousands separator, but it will
+recognize ''', ' ', '٬', '_', ' ', etc. as such.
+The same holds for characters like "e", "inf", "×10^", "NaN", etc. (ignoring case).
+
+This means you can safely parse `"123'456 789,012.345×10^6"` with a US locale but not `"1.234,5"`.
+
+Aside from this, DataFrame also explicitly recognizes "∞", "inf", "infinity", and "infty" as `Double.POSITIVE_INFINITY`
+(as well as their negative counterparts), "nan", "na", and "n/a" as `Double.NaN`,
+and all forms of whitespace are treated equally.
+
+If `FastDoubleParser` fails to parse a `String` as `Double`, DataFrame will try
+to parse it using the standard `NumberFormat.parse()` function as a last resort.
+
+If you experience any issues with the new parser, you can turn it off by setting
+`useFastDoubleParser = false`, which will use the old `NumberFormat.parse()` function instead.
+
+Please [report](https://github.com/Kotlin/dataframe/issues) any issues you encounter.
+
 <!---END-->
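
For readers unfamiliar with the `NumberFormat.parse()` fallback mentioned in the added section, the sketch below shows locale-aware parsing with the JDK alone. `parseDoubleWithLocale` is a hypothetical helper, and the requirement that the whole string be consumed is an assumption of this sketch; the diff does not say how DataFrame invokes `NumberFormat` internally.

```kotlin
import java.text.NumberFormat
import java.text.ParsePosition
import java.util.Locale

// Hypothetical helper (not part of DataFrame): parses text with the JDK's
// locale-aware NumberFormat and returns null unless the whole string was consumed.
fun parseDoubleWithLocale(text: String, locale: Locale): Double? {
    val position = ParsePosition(0)
    val number: Number? = NumberFormat.getInstance(locale).parse(text, position)
    // NumberFormat stops at the first character it cannot use, so the end
    // position is checked to reject partially parsed input (assumption of this sketch).
    return if (number != null && position.index == text.length) number.toDouble() else null
}

fun main() {
    println(parseDoubleWithLocale("1,234.5", Locale.US))      // 1234.5
    println(parseDoubleWithLocale("1.234,5", Locale.GERMANY)) // 1234.5
    println(parseDoubleWithLocale("1.234,5", Locale.US))      // null
}
```

The last call matches the statement in the diff that `"1.234,5"` cannot be parsed with a US locale: the grouping and decimal separators conflict with that locale's conventions.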
