Commit 87ef4f0

fillna for Categorical columns added
1 parent d58a5f6 commit 87ef4f0

1 file changed

+294
-0
lines changed

2-Working-With-Data/08-data-preparation/notebook.ipynb

Lines changed: 294 additions & 0 deletions
@@ -1614,6 +1614,300 @@
16141614
"You could use `isnull` to do this in place, but that can be laborious, particularly if you have a lot of values to fill. Because this is such a common task in data science, pandas provides `fillna`, which returns a copy of the `Series` or `DataFrame` with the missing values replaced with one of your choosing. Let's create another example `Series` to see how this works in practice."
16151615
]
16161616
},
1617+
{
1618+
"cell_type": "markdown",
1619+
"metadata": {
1620+
"id": "CE8S7louLezV"
1621+
},
1622+
"source": [
1623+
"First, let us consider non-numeric data. Datasets often contain columns of categorical data, e.g. gender, or True/False values.\n",
1624+
"\n",
1625+
"In most of these cases, we replace missing values with the `mode` of the column. Say we have 100 data points, of which 90 are True, 8 are False and 2 are missing. Then we can fill the 2 missing values with True, the most frequent value in the column.\n",
1626+
"\n",
1627+
"Again, we can use domain knowledge here. Let us consider an example of filling with the mode."
1628+
]
1629+
},
1630+
{
1631+
"cell_type": "code",
1632+
"metadata": {
1633+
"id": "MY5faq4yLdpQ",
1634+
"outputId": "c3838b07-0d15-471e-8dad-370de91d4bdc",
1635+
"colab": {
1636+
"base_uri": "https://localhost:8080/",
1637+
"height": 204
1638+
}
1639+
},
1640+
"source": [
1641+
"fill_with_mode = pd.DataFrame([[1,2,\"True\"],\n",
1642+
" [3,4,None],\n",
1643+
" [5,6,\"False\"],\n",
1644+
" [7,8,\"True\"],\n",
1645+
" [9,10,\"True\"]])\n",
1646+
"\n",
1647+
"fill_with_mode"
1648+
],
1649+
"execution_count": 28,
1650+
"outputs": [
1651+
{
1652+
"output_type": "execute_result",
1653+
"data": {
1654+
"text/html": [
1655+
"<div>\n",
1656+
"<style scoped>\n",
1657+
" .dataframe tbody tr th:only-of-type {\n",
1658+
" vertical-align: middle;\n",
1659+
" }\n",
1660+
"\n",
1661+
" .dataframe tbody tr th {\n",
1662+
" vertical-align: top;\n",
1663+
" }\n",
1664+
"\n",
1665+
" .dataframe thead th {\n",
1666+
" text-align: right;\n",
1667+
" }\n",
1668+
"</style>\n",
1669+
"<table border=\"1\" class=\"dataframe\">\n",
1670+
" <thead>\n",
1671+
" <tr style=\"text-align: right;\">\n",
1672+
" <th></th>\n",
1673+
" <th>0</th>\n",
1674+
" <th>1</th>\n",
1675+
" <th>2</th>\n",
1676+
" </tr>\n",
1677+
" </thead>\n",
1678+
" <tbody>\n",
1679+
" <tr>\n",
1680+
" <th>0</th>\n",
1681+
" <td>1</td>\n",
1682+
" <td>2</td>\n",
1683+
" <td>True</td>\n",
1684+
" </tr>\n",
1685+
" <tr>\n",
1686+
" <th>1</th>\n",
1687+
" <td>3</td>\n",
1688+
" <td>4</td>\n",
1689+
" <td>None</td>\n",
1690+
" </tr>\n",
1691+
" <tr>\n",
1692+
" <th>2</th>\n",
1693+
" <td>5</td>\n",
1694+
" <td>6</td>\n",
1695+
" <td>False</td>\n",
1696+
" </tr>\n",
1697+
" <tr>\n",
1698+
" <th>3</th>\n",
1699+
" <td>7</td>\n",
1700+
" <td>8</td>\n",
1701+
" <td>True</td>\n",
1702+
" </tr>\n",
1703+
" <tr>\n",
1704+
" <th>4</th>\n",
1705+
" <td>9</td>\n",
1706+
" <td>10</td>\n",
1707+
" <td>True</td>\n",
1708+
" </tr>\n",
1709+
" </tbody>\n",
1710+
"</table>\n",
1711+
"</div>"
1712+
],
1713+
"text/plain": [
1714+
" 0 1 2\n",
1715+
"0 1 2 True\n",
1716+
"1 3 4 None\n",
1717+
"2 5 6 False\n",
1718+
"3 7 8 True\n",
1719+
"4 9 10 True"
1720+
]
1721+
},
1722+
"metadata": {},
1723+
"execution_count": 28
1724+
}
1725+
]
1726+
},
1727+
{
1728+
"cell_type": "markdown",
1729+
"metadata": {
1730+
"id": "MLAoMQOfNPlA"
1731+
},
1732+
"source": [
1733+
"Now, let's first find the mode before filling the `None` value with it."
1734+
]
1735+
},
1736+
{
1737+
"cell_type": "code",
1738+
"metadata": {
1739+
"id": "WKy-9Y2tN5jv",
1740+
"outputId": "41f5064e-502d-4aec-dc2d-86f885068b4f",
1741+
"colab": {
1742+
"base_uri": "https://localhost:8080/"
1743+
}
1744+
},
1745+
"source": [
1746+
"fill_with_mode[2].value_counts()"
1747+
],
1748+
"execution_count": 29,
1749+
"outputs": [
1750+
{
1751+
"output_type": "execute_result",
1752+
"data": {
1753+
"text/plain": [
1754+
"True 3\n",
1755+
"False 1\n",
1756+
"Name: 2, dtype: int64"
1757+
]
1758+
},
1759+
"metadata": {},
1760+
"execution_count": 29
1761+
}
1762+
]
1763+
},
1764+
{
1765+
"cell_type": "markdown",
1766+
"metadata": {
1767+
"id": "6iNz_zG_OKrx"
1768+
},
1769+
"source": [
1770+
"So, we will replace `None` with `'True'`."
1771+
]
1772+
},
1773+
{
1774+
"cell_type": "code",
1775+
"metadata": {
1776+
"id": "TxPKteRvNPOs"
1777+
},
1778+
"source": [
1779+
"fill_with_mode[2] = fill_with_mode[2].fillna('True')"
1780+
],
1781+
"execution_count": 30,
1782+
"outputs": []
1783+
},
1784+
{
1785+
"cell_type": "code",
1786+
"metadata": {
1787+
"id": "tvas7c9_OPWE",
1788+
"outputId": "7282c4f7-0e59-4398-b4f2-5919baf61164",
1789+
"colab": {
1790+
"base_uri": "https://localhost:8080/",
1791+
"height": 204
1792+
}
1793+
},
1794+
"source": [
1795+
"fill_with_mode"
1796+
],
1797+
"execution_count": 31,
1798+
"outputs": [
1799+
{
1800+
"output_type": "execute_result",
1801+
"data": {
1802+
"text/html": [
1803+
"<div>\n",
1804+
"<style scoped>\n",
1805+
" .dataframe tbody tr th:only-of-type {\n",
1806+
" vertical-align: middle;\n",
1807+
" }\n",
1808+
"\n",
1809+
" .dataframe tbody tr th {\n",
1810+
" vertical-align: top;\n",
1811+
" }\n",
1812+
"\n",
1813+
" .dataframe thead th {\n",
1814+
" text-align: right;\n",
1815+
" }\n",
1816+
"</style>\n",
1817+
"<table border=\"1\" class=\"dataframe\">\n",
1818+
" <thead>\n",
1819+
" <tr style=\"text-align: right;\">\n",
1820+
" <th></th>\n",
1821+
" <th>0</th>\n",
1822+
" <th>1</th>\n",
1823+
" <th>2</th>\n",
1824+
" </tr>\n",
1825+
" </thead>\n",
1826+
" <tbody>\n",
1827+
" <tr>\n",
1828+
" <th>0</th>\n",
1829+
" <td>1</td>\n",
1830+
" <td>2</td>\n",
1831+
" <td>True</td>\n",
1832+
" </tr>\n",
1833+
" <tr>\n",
1834+
" <th>1</th>\n",
1835+
" <td>3</td>\n",
1836+
" <td>4</td>\n",
1837+
" <td>True</td>\n",
1838+
" </tr>\n",
1839+
" <tr>\n",
1840+
" <th>2</th>\n",
1841+
" <td>5</td>\n",
1842+
" <td>6</td>\n",
1843+
" <td>False</td>\n",
1844+
" </tr>\n",
1845+
" <tr>\n",
1846+
" <th>3</th>\n",
1847+
" <td>7</td>\n",
1848+
" <td>8</td>\n",
1849+
" <td>True</td>\n",
1850+
" </tr>\n",
1851+
" <tr>\n",
1852+
" <th>4</th>\n",
1853+
" <td>9</td>\n",
1854+
" <td>10</td>\n",
1855+
" <td>True</td>\n",
1856+
" </tr>\n",
1857+
" </tbody>\n",
1858+
"</table>\n",
1859+
"</div>"
1860+
],
1861+
"text/plain": [
1862+
" 0 1 2\n",
1863+
"0 1 2 True\n",
1864+
"1 3 4 True\n",
1865+
"2 5 6 False\n",
1866+
"3 7 8 True\n",
1867+
"4 9 10 True"
1868+
]
1869+
},
1870+
"metadata": {},
1871+
"execution_count": 31
1872+
}
1873+
]
1874+
},
1875+
{
1876+
"cell_type": "markdown",
1877+
"metadata": {
1878+
"id": "SktitLxxOR16"
1879+
},
1880+
"source": [
1881+
"As we can see, the null value has been replaced. Needless to say, we could have written anything in place of `'True'` and it would have been substituted."
1882+
]
1883+
},
1884+
{
1885+
"cell_type": "markdown",
1886+
"metadata": {
1887+
"id": "heYe1I0dOmQ_"
1888+
},
1889+
"source": [
1890+
"Now, coming to numeric data. Here, we have two common ways of replacing missing values:\n",
1891+
"\n",
1892+
"1. Replace with the median of the column\n",
1893+
"2. Replace with the mean of the column\n",
1894+
"\n",
1895+
"We replace with the median when the data is skewed or has outliers. This is because the median is robust to outliers.\n",
1896+
"\n",
1897+
"When the data is roughly symmetric (for example, normally distributed), we can use the mean, as in that case the mean and median would be pretty close."
1898+
]
1899+
},
1900+
{
1901+
"cell_type": "code",
1902+
"metadata": {
1903+
"id": "09HM_2feOj5Y"
1904+
},
1905+
"source": [
1906+
""
1907+
],
1908+
"execution_count": null,
1909+
"outputs": []
1910+
},
16171911
{
16181912
"cell_type": "code",
16191913
"metadata": {
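
Review note on the mode-filling cells in this diff: the fill value `'True'` is typed by hand after reading `value_counts()`. A minimal sketch of computing it instead is shown below, assuming the same toy `fill_with_mode` frame as in the notebook. The variable name `most_common` is illustrative and not part of the commit.

```python
import pandas as pd

# Same toy data as in the notebook cell; column 2 holds string labels with one gap.
fill_with_mode = pd.DataFrame([[1, 2, "True"],
                               [3, 4, None],
                               [5, 6, "False"],
                               [7, 8, "True"],
                               [9, 10, "True"]])

# Series.mode() ignores missing values by default and returns a Series
# (there can be ties), so take the first entry.
most_common = fill_with_mode[2].mode().iloc[0]

# Assign the filled column back to the frame rather than relying on
# inplace=True applied to a column selection.
fill_with_mode[2] = fill_with_mode[2].fillna(most_common)

print(most_common)
print(fill_with_mode[2].tolist())
```

If several values tie for most frequent, `mode()` returns all of them, and taking the first entry is an arbitrary choice; domain knowledge may suggest a better one.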

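
Review note on the numeric-data cell, which is followed by an empty code cell in this commit: the difference between the two strategies could be illustrated roughly as follows. This is only a sketch, and the `values` series, including its outlier, is made up for the purpose and does not come from the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical column with one missing value and one large outlier (100.0).
values = pd.Series([1.0, 2.0, np.nan, 3.0, 100.0], name="measurement")

# mean() and median() skip NaN by default.
col_mean = values.mean()      # (1 + 2 + 3 + 100) / 4 = 26.5
col_median = values.median()  # midpoint of 2 and 3 = 2.5

# fillna returns a new Series; the original is left untouched.
filled_with_mean = values.fillna(col_mean)
filled_with_median = values.fillna(col_median)

print(filled_with_mean.tolist())
print(filled_with_median.tolist())
```

Here the single outlier pulls the mean far above most of the observed values, while the median stays close to them, which is why the median is usually preferred for skewed data.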