
Commit 8c11679: Datatype conversions

1 parent 9516060 commit 8c11679

File tree

5 files changed: +179 -1 lines changed

docs/_quarto.yml

Lines changed: 9 additions & 0 deletions

```diff
@@ -121,6 +121,15 @@ book:
       - href: notes/data-wrangling/overview.qmd
         text: "Data Wrangling Overview"
+      - href: notes/data-wrangling/datatypes.qmd
+        text: "Datatype Conversions"
+
+      #- href: notes/data-wrangling/handling-nulls.qmd
+      #  text: "Handling Nulls"
+
+      #- href: notes/data-wrangling/handling-duplicates.qmd
+      #  text: "Handling Duplicates"
+
       - href: notes/data-wrangling/grouping-pivoting.qmd
         text: "Grouping and Pivoting"
```
docs/notes/data-wrangling/datatypes.qmd

Lines changed: 112 additions & 0 deletions

@@ -0,0 +1,112 @@
# Datatype Conversions

## Converting to Numeric

For this example, we will use a dataset of publicly traded companies, from the Nasdaq:

```{python}
#| code-fold: true

from pandas import read_csv

request_url = "https://raw.githubusercontent.com/prof-rossetti/applied-data-science-python-book/refs/heads/main/docs/data/nasdaq_screener_1735923271750.csv"

df = read_csv(request_url)
df.rename(columns={"% Change": "Pct Change", "Last Sale": "Latest Close"}, inplace=True)
df.drop(columns=["Market Cap", "Country", "IPO Year", "Volume", "Industry", "Sector"], inplace=True)
df.head()
```

In this case, we see the "Latest Close" column contains values like "$134.125" (including the dollar sign), and the "Pct Change" column contains values like "2.386%" (including the percent sign). These values are likely string datatypes, which would unfortunately prevent us from performing numeric calculations with them.

We can more formally inspect the datatypes of each column, using the [`dtypes` property](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.dtypes.html):

```{python}
df.dtypes
```

Here we see the datatype of the "Latest Close" and "Pct Change" columns is the generic "object", which is often used to represent strings. However, we need to convert them to numeric datatypes instead.

:::{.callout-note}
With [`pandas` datatypes](https://pandas.pydata.org/docs/user_guide/basics.html#basics-dtypes), the "object" datatype "can hold any Python object, including strings."

To disambiguate, we can always ask for the datatype of one of the values itself:

```{python}
val = df["Pct Change"].values[0]
print(val)
print(type(val))
```

Here we see these values are indeed string datatypes.
:::

We can use the [`to_numeric` function](https://pandas.pydata.org/docs/reference/api/pandas.to_numeric.html) from pandas to convert a string datatype to numeric:

```{python}
from pandas import to_numeric

df["Latest Close Numeric"] = to_numeric(df["Latest Close"].str.lstrip("$"))

df["Pct Change Numeric"] = to_numeric(df["Pct Change"].str.rstrip("%")) / 100

df[["Symbol", "Latest Close", "Latest Close Numeric", "Pct Change", "Pct Change Numeric"]].head()
```
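By default, `to_numeric` raises an error if it encounters a value it cannot parse. If some rows contained malformed entries, the `errors="coerce"` option would convert them to null instead of raising. A minimal sketch, using hypothetical values rather than the real dataset:

```python
from pandas import Series, to_numeric

# hypothetical price strings, including one malformed entry:
prices = Series(["$134.125", "$9.50", "N/A"])

# errors="coerce" turns unparseable values into NaN instead of raising:
numeric_prices = to_numeric(prices.str.lstrip("$"), errors="coerce")
print(numeric_prices)
```

The resulting nulls can then be handled separately (for example dropped or filled), rather than aborting the whole conversion.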
:::{.callout-tip}
In this example, we are using string methods such as `lstrip` and `rstrip` to remove a character from the beginning or end of a string, respectively. However, there are many other helpful string manipulation methods in pandas. For more information about string column operations, see [Working with Text Data](https://pandas.pydata.org/docs/user_guide/text.html#text-data-types).
:::

After converting to numeric datatypes, we see the new "Latest Close Numeric" column contains float values like 134.125, and the new "Pct Change Numeric" column contains float values like 0.02386.

We can now use these numeric values to perform calculations, for example calculating the average return, and determining whether there was a gain or loss:

```{python}
df["Pct Change Numeric"].mean().round(4)
```

```{python}
df["Gain"] = df["Pct Change Numeric"] > 0
df.head()
```
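Because boolean values behave like ones and zeros in arithmetic, a column like "Gain" can also be aggregated directly, for example to compute the share of gaining stocks. A minimal sketch, with hypothetical values standing in for the real dataset:

```python
from pandas import DataFrame

# small hypothetical frame mimicking the columns above:
df = DataFrame({"Symbol": ["AAA", "BBB", "CCC", "DDD"],
                "Pct Change Numeric": [0.02386, -0.011, 0.005, -0.03]})

df["Gain"] = df["Pct Change Numeric"] > 0

# booleans act as 1s and 0s, so the mean is the share of gainers:
share_of_gainers = df["Gain"].mean()
print(share_of_gainers)  # 0.5 (2 of the 4 rows gained)
```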
## Datatype Casting

In the previous example, we used the `to_numeric` function to convert strings to numbers. Alternatively, we can perform a wider variety of datatype casting using the [`astype` method](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.astype.html#pandas.DataFrame.astype).

For example, casting between strings and numbers:

```{python}
df["Latest Close Numeric"] = df["Latest Close"].str.lstrip("$").astype(float)

df["Latest Close Reconstructed"] = "$" + df["Latest Close Numeric"].astype(str)

df[["Symbol", "Latest Close", "Latest Close Numeric", "Latest Close Reconstructed"]].head()
```

```{python}
df["Pct Change Numeric"] = df["Pct Change"].str.rstrip("%").astype(float) / 100

df["Pct Change Reconstructed"] = (df["Pct Change Numeric"] * 100).astype(str) + "%"

df[["Symbol", "Pct Change", "Pct Change Numeric", "Pct Change Reconstructed"]].head()
```
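One caveat when reconstructing strings this way: `astype(str)` uses the full float representation, which can expose floating-point artifacts like "2.3860000000000003%". A minimal sketch (hypothetical values) using string formatting to control the precision instead:

```python
from pandas import Series

# hypothetical percent changes, stored as decimal fractions:
pct = Series([0.02386, -0.011])

# format each value to three decimal places, then append the percent sign:
reconstructed = (pct * 100).map("{:.3f}%".format)
print(reconstructed.tolist())  # ['2.386%', '-1.100%']
```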
And casting between booleans and integers:

```{python}
df["Gain Binary"] = df["Gain"].astype(int)
df["Gain Reconstructed"] = df["Gain Binary"].astype(bool)

df[["Symbol", "Pct Change", "Gain", "Gain Binary", "Gain Reconstructed"]].head()
```
After all these conversions, we can confirm the datatypes for good measure:

```{python}
df.dtypes.sort_index()
```
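Once the datatypes are in order, it can also be handy to select columns by datatype, using the `select_dtypes` method. A minimal sketch with a small hypothetical frame mimicking the mixed datatypes above:

```python
from pandas import DataFrame

# hypothetical frame with string, float, and boolean columns:
df = DataFrame({"Symbol": ["AAA"], "Latest Close": ["$134.125"],
                "Latest Close Numeric": [134.125], "Gain": [True]})

# select_dtypes filters columns by datatype, e.g. numbers only
# (booleans and strings are excluded from "number"):
numeric_df = df.select_dtypes(include="number")
print(numeric_df.columns.tolist())  # ['Latest Close Numeric']
```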

docs/notes/data-wrangling/handling-duplicates.qmd

Lines changed: 15 additions & 0 deletions
@@ -1,12 +1,23 @@
# Handling Duplicate Records

```
from pandas import read_csv

request_url = ""
df = read_csv(request_url)
df.head()
```

In this particular dataset, there are X records total:

```
len(df)
```

We see the dataset has columns for "", "", and "". We should try to figure out which of the columns might be a unique identifier.

We see "cik_str", which is a likely candidate to uniquely represent each row in the dataset. However, upon further investigation, we see there are fewer unique values than rows, which means either some values are null or some values are shared by more than one row.

```
df["cik_str"].nunique()
```

@@ -27,3 +38,7 @@ Filtering using a mask:

```
dups = df[df.duplicated(subset="cik_str", keep=False)]
dups.sort_values(by="ticker").head(20)
```
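Beyond inspecting duplicates, we may want to remove them, which the `drop_duplicates` method handles. A minimal sketch, with hypothetical values standing in for the real dataset:

```python
from pandas import DataFrame

# hypothetical rows sharing a "cik_str" value:
df = DataFrame({"cik_str": [1001, 1001, 1002],
                "ticker": ["AAA", "AAA-B", "BBB"]})

# keep="first" retains only the first occurrence of each duplicated key:
deduped = df.drop_duplicates(subset="cik_str", keep="first")
print(deduped["ticker"].tolist())  # ['AAA', 'BBB']
```

Whether `keep="first"`, `keep="last"`, or a manual review is appropriate depends on why the key is duplicated in the first place.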
docs/notes/data-wrangling/handling-nulls.qmd

Lines changed: 42 additions & 0 deletions

@@ -1,3 +1,45 @@
# Handling Null Values

```
from pandas import read_csv

request_url = "https://raw.githubusercontent.com/prof-rossetti/applied-data-science-python-book/refs/heads/main/docs/data/nasdaq_screener_1735923271750.csv"

df = read_csv(request_url)
df.head()
```

```
len(df)
```

```
df.info()
```

```
df.isna()  # boolean mask of null values
```

```
df.isna().sum()  # number of null values per column
```

```
df[df["Symbol"].isna()]  # rows where "Symbol" is null
```

```
# drop rows with nulls in any of these columns:
df.dropna(subset=["Symbol", "Market Cap", "Country", "Industry", "Sector"])
```

```
df.dropna()  # drop rows containing any null value
```
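Dropping rows is not the only option: nulls can also be replaced with a default value using the `fillna` method. A minimal sketch, with a small hypothetical frame standing in for the real dataset:

```python
from pandas import DataFrame
from numpy import nan

# hypothetical frame with missing values:
df = DataFrame({"Symbol": ["AAA", None, "CCC"],
                "Country": ["United States", nan, nan]})

# fillna accepts a dict mapping column names to fill values,
# leaving other columns' nulls untouched:
filled = df.fillna({"Country": "Unknown"})
print(filled["Country"].tolist())  # ['United States', 'Unknown', 'Unknown']
```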

docs/notes/data-wrangling/joining-merging-2.qmd

Lines changed: 1 addition & 1 deletion

```diff
@@ -18,7 +18,7 @@ For example, a financial analyst may merge transaction-level data with customer
 
 If you've used the `VLOOKUP` function in spreadsheet software, or the `JOIN` clause in SQL, you've already merged datasets. Let's explore how to merge datasets in Python.
 
-## Merging Data Frames
+## Merging DataFrames
 
 We can use the [`merge` function](https://pandas.pydata.org/docs/reference/api/pandas.merge.html) from `pandas` to join two datasets together. The `merge` function accepts two `DataFrame` objects as initial parameters, as well as a `how` parameter to indicate the join strategy (see "Join Strategies" section below). There are additional parameters to denote which columns or indices to join on.
```
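The `merge` usage this section describes can be sketched as follows, using small hypothetical transaction and customer frames (the column names here are illustrative, not from the book's dataset):

```python
from pandas import DataFrame, merge

# hypothetical transaction-level and customer-level data:
transactions = DataFrame({"customer_id": [1, 2, 1], "amount": [100, 250, 75]})
customers = DataFrame({"customer_id": [1, 2], "segment": ["retail", "wholesale"]})

# how="inner" keeps only rows whose key appears in both frames;
# on= names the shared join column:
merged = merge(transactions, customers, how="inner", on="customer_id")
print(merged)
```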
