|
1614 | 1614 | "You could use `isnull` to do this in place, but that can be laborious, particularly if you have a lot of values to fill. Because this is such a common task in data science, pandas provides `fillna`, which returns a copy of the `Series` or `DataFrame` with the missing values replaced with one of your choosing. Let's create another example `Series` to see how this works in practice." |
1615 | 1615 | ] |
1616 | 1616 | }, |
| 1617 | + { |
| 1618 | + "cell_type": "markdown", |
| 1619 | + "metadata": { |
| 1620 | + "id": "CE8S7louLezV" |
| 1621 | + }, |
| 1622 | + "source": [ |
| 1623 | + "First, let us consider non-numeric data. Datasets often contain columns with categorical data, e.g. Gender, True or False, etc.\n", |
| 1624 | + "\n", |
| 1625 | + "In most of these cases, we replace missing values with the `mode` of the column. Say we have 100 data points: 90 say True, 8 say False, and 2 are missing. Then we can fill the 2 missing values with True, the most frequent value in the column.\n", |
| 1626 | + "\n", |
| 1627 | + "Again, we can use domain knowledge here. Let us consider an example of filling with the mode." |
| 1628 | + ] |
| 1629 | + }, |
| 1630 | + { |
| 1631 | + "cell_type": "code", |
| 1632 | + "metadata": { |
| 1633 | + "id": "MY5faq4yLdpQ", |
| 1634 | + "outputId": "c3838b07-0d15-471e-8dad-370de91d4bdc", |
| 1635 | + "colab": { |
| 1636 | + "base_uri": "https://localhost:8080/", |
| 1637 | + "height": 204 |
| 1638 | + } |
| 1639 | + }, |
| 1640 | + "source": [ |
| 1641 | + "fill_with_mode = pd.DataFrame([[1,2,\"True\"],\n", |
| 1642 | + " [3,4,None],\n", |
| 1643 | + " [5,6,\"False\"],\n", |
| 1644 | + " [7,8,\"True\"],\n", |
| 1645 | + " [9,10,\"True\"]])\n", |
| 1646 | + "\n", |
| 1647 | + "fill_with_mode" |
| 1648 | + ], |
| 1649 | + "execution_count": 28, |
| 1650 | + "outputs": [ |
| 1651 | + { |
| 1652 | + "output_type": "execute_result", |
| 1653 | + "data": { |
| 1654 | + "text/html": [ |
| 1655 | + "<div>\n", |
| 1656 | + "<style scoped>\n", |
| 1657 | + " .dataframe tbody tr th:only-of-type {\n", |
| 1658 | + " vertical-align: middle;\n", |
| 1659 | + " }\n", |
| 1660 | + "\n", |
| 1661 | + " .dataframe tbody tr th {\n", |
| 1662 | + " vertical-align: top;\n", |
| 1663 | + " }\n", |
| 1664 | + "\n", |
| 1665 | + " .dataframe thead th {\n", |
| 1666 | + " text-align: right;\n", |
| 1667 | + " }\n", |
| 1668 | + "</style>\n", |
| 1669 | + "<table border=\"1\" class=\"dataframe\">\n", |
| 1670 | + " <thead>\n", |
| 1671 | + " <tr style=\"text-align: right;\">\n", |
| 1672 | + " <th></th>\n", |
| 1673 | + " <th>0</th>\n", |
| 1674 | + " <th>1</th>\n", |
| 1675 | + " <th>2</th>\n", |
| 1676 | + " </tr>\n", |
| 1677 | + " </thead>\n", |
| 1678 | + " <tbody>\n", |
| 1679 | + " <tr>\n", |
| 1680 | + " <th>0</th>\n", |
| 1681 | + " <td>1</td>\n", |
| 1682 | + " <td>2</td>\n", |
| 1683 | + " <td>True</td>\n", |
| 1684 | + " </tr>\n", |
| 1685 | + " <tr>\n", |
| 1686 | + " <th>1</th>\n", |
| 1687 | + " <td>3</td>\n", |
| 1688 | + " <td>4</td>\n", |
| 1689 | + " <td>None</td>\n", |
| 1690 | + " </tr>\n", |
| 1691 | + " <tr>\n", |
| 1692 | + " <th>2</th>\n", |
| 1693 | + " <td>5</td>\n", |
| 1694 | + " <td>6</td>\n", |
| 1695 | + " <td>False</td>\n", |
| 1696 | + " </tr>\n", |
| 1697 | + " <tr>\n", |
| 1698 | + " <th>3</th>\n", |
| 1699 | + " <td>7</td>\n", |
| 1700 | + " <td>8</td>\n", |
| 1701 | + " <td>True</td>\n", |
| 1702 | + " </tr>\n", |
| 1703 | + " <tr>\n", |
| 1704 | + " <th>4</th>\n", |
| 1705 | + " <td>9</td>\n", |
| 1706 | + " <td>10</td>\n", |
| 1707 | + " <td>True</td>\n", |
| 1708 | + " </tr>\n", |
| 1709 | + " </tbody>\n", |
| 1710 | + "</table>\n", |
| 1711 | + "</div>" |
| 1712 | + ], |
| 1713 | + "text/plain": [ |
| 1714 | + " 0 1 2\n", |
| 1715 | + "0 1 2 True\n", |
| 1716 | + "1 3 4 None\n", |
| 1717 | + "2 5 6 False\n", |
| 1718 | + "3 7 8 True\n", |
| 1719 | + "4 9 10 True" |
| 1720 | + ] |
| 1721 | + }, |
| 1722 | + "metadata": {}, |
| 1723 | + "execution_count": 28 |
| 1724 | + } |
| 1725 | + ] |
| 1726 | + }, |
| 1727 | + { |
| 1728 | + "cell_type": "markdown", |
| 1729 | + "metadata": { |
| 1730 | + "id": "MLAoMQOfNPlA" |
| 1731 | + }, |
| 1732 | + "source": [ |
| 1733 | + "Now, let's first find the mode before filling the `None` value with it." |
| 1734 | + ] |
| 1735 | + }, |
| 1736 | + { |
| 1737 | + "cell_type": "code", |
| 1738 | + "metadata": { |
| 1739 | + "id": "WKy-9Y2tN5jv", |
| 1740 | + "outputId": "41f5064e-502d-4aec-dc2d-86f885068b4f", |
| 1741 | + "colab": { |
| 1742 | + "base_uri": "https://localhost:8080/" |
| 1743 | + } |
| 1744 | + }, |
| 1745 | + "source": [ |
| 1746 | + "fill_with_mode[2].value_counts()" |
| 1747 | + ], |
| 1748 | + "execution_count": 29, |
| 1749 | + "outputs": [ |
| 1750 | + { |
| 1751 | + "output_type": "execute_result", |
| 1752 | + "data": { |
| 1753 | + "text/plain": [ |
| 1754 | + "True 3\n", |
| 1755 | + "False 1\n", |
| 1756 | + "Name: 2, dtype: int64" |
| 1757 | + ] |
| 1758 | + }, |
| 1759 | + "metadata": {}, |
| 1760 | + "execution_count": 29 |
| 1761 | + } |
| 1762 | + ] |
| 1763 | + }, |
| 1764 | + { |
| 1765 | + "cell_type": "markdown", |
| 1766 | + "metadata": { |
| 1767 | + "id": "6iNz_zG_OKrx" |
| 1768 | + }, |
| 1769 | + "source": [ |
| 1770 | + "So, we will replace `None` with `'True'`." |
| 1771 | + ] |
| 1772 | + }, |
| 1773 | + { |
| 1774 | + "cell_type": "code", |
| 1775 | + "metadata": { |
| 1776 | + "id": "TxPKteRvNPOs" |
| 1777 | + }, |
| 1778 | + "source": [ |
| 1779 | + "fill_with_mode[2] = fill_with_mode[2].fillna('True')" |
| 1780 | + ], |
| 1781 | + "execution_count": 30, |
| 1782 | + "outputs": [] |
| 1783 | + }, |
| 1784 | + { |
| 1785 | + "cell_type": "code", |
| 1786 | + "metadata": { |
| 1787 | + "id": "tvas7c9_OPWE", |
| 1788 | + "outputId": "7282c4f7-0e59-4398-b4f2-5919baf61164", |
| 1789 | + "colab": { |
| 1790 | + "base_uri": "https://localhost:8080/", |
| 1791 | + "height": 204 |
| 1792 | + } |
| 1793 | + }, |
| 1794 | + "source": [ |
| 1795 | + "fill_with_mode" |
| 1796 | + ], |
| 1797 | + "execution_count": 31, |
| 1798 | + "outputs": [ |
| 1799 | + { |
| 1800 | + "output_type": "execute_result", |
| 1801 | + "data": { |
| 1802 | + "text/html": [ |
| 1803 | + "<div>\n", |
| 1804 | + "<style scoped>\n", |
| 1805 | + " .dataframe tbody tr th:only-of-type {\n", |
| 1806 | + " vertical-align: middle;\n", |
| 1807 | + " }\n", |
| 1808 | + "\n", |
| 1809 | + " .dataframe tbody tr th {\n", |
| 1810 | + " vertical-align: top;\n", |
| 1811 | + " }\n", |
| 1812 | + "\n", |
| 1813 | + " .dataframe thead th {\n", |
| 1814 | + " text-align: right;\n", |
| 1815 | + " }\n", |
| 1816 | + "</style>\n", |
| 1817 | + "<table border=\"1\" class=\"dataframe\">\n", |
| 1818 | + " <thead>\n", |
| 1819 | + " <tr style=\"text-align: right;\">\n", |
| 1820 | + " <th></th>\n", |
| 1821 | + " <th>0</th>\n", |
| 1822 | + " <th>1</th>\n", |
| 1823 | + " <th>2</th>\n", |
| 1824 | + " </tr>\n", |
| 1825 | + " </thead>\n", |
| 1826 | + " <tbody>\n", |
| 1827 | + " <tr>\n", |
| 1828 | + " <th>0</th>\n", |
| 1829 | + " <td>1</td>\n", |
| 1830 | + " <td>2</td>\n", |
| 1831 | + " <td>True</td>\n", |
| 1832 | + " </tr>\n", |
| 1833 | + " <tr>\n", |
| 1834 | + " <th>1</th>\n", |
| 1835 | + " <td>3</td>\n", |
| 1836 | + " <td>4</td>\n", |
| 1837 | + " <td>True</td>\n", |
| 1838 | + " </tr>\n", |
| 1839 | + " <tr>\n", |
| 1840 | + " <th>2</th>\n", |
| 1841 | + " <td>5</td>\n", |
| 1842 | + " <td>6</td>\n", |
| 1843 | + " <td>False</td>\n", |
| 1844 | + " </tr>\n", |
| 1845 | + " <tr>\n", |
| 1846 | + " <th>3</th>\n", |
| 1847 | + " <td>7</td>\n", |
| 1848 | + " <td>8</td>\n", |
| 1849 | + " <td>True</td>\n", |
| 1850 | + " </tr>\n", |
| 1851 | + " <tr>\n", |
| 1852 | + " <th>4</th>\n", |
| 1853 | + " <td>9</td>\n", |
| 1854 | + " <td>10</td>\n", |
| 1855 | + " <td>True</td>\n", |
| 1856 | + " </tr>\n", |
| 1857 | + " </tbody>\n", |
| 1858 | + "</table>\n", |
| 1859 | + "</div>" |
| 1860 | + ], |
| 1861 | + "text/plain": [ |
| 1862 | + " 0 1 2\n", |
| 1863 | + "0 1 2 True\n", |
| 1864 | + "1 3 4 True\n", |
| 1865 | + "2 5 6 False\n", |
| 1866 | + "3 7 8 True\n", |
| 1867 | + "4 9 10 True" |
| 1868 | + ] |
| 1869 | + }, |
| 1870 | + "metadata": {}, |
| 1871 | + "execution_count": 31 |
| 1872 | + } |
| 1873 | + ] |
| 1874 | + }, |
| 1875 | + { |
| 1876 | + "cell_type": "markdown", |
| 1877 | + "metadata": { |
| 1878 | + "id": "SktitLxxOR16" |
| 1879 | + }, |
| 1880 | + "source": [ |
| 1881 | + "As we can see, the null value has been replaced. Needless to say, we could have written anything in place of `'True'` and it would have been substituted." |
| 1882 | + ] |
| 1883 | + }, |
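Rather than reading the mode off `value_counts()` and typing it by hand, the mode can also be computed and passed to `fillna` directly. The following is a minimal sketch that rebuilds the same toy `DataFrame`; the variable name `most_common` is introduced here only for illustration.

```python
import pandas as pd

# Same toy data as in the cells above; column 2 is the categorical column.
fill_with_mode = pd.DataFrame([[1, 2, "True"],
                               [3, 4, None],
                               [5, 6, "False"],
                               [7, 8, "True"],
                               [9, 10, "True"]])

# Series.mode() ignores missing values and returns a Series,
# because several values can tie for most frequent. Take the first one.
most_common = fill_with_mode[2].mode()[0]

# Assign the filled column back rather than using inplace=True on a column selection.
fill_with_mode[2] = fill_with_mode[2].fillna(most_common)

print(fill_with_mode)
```

If several values tie for the mode, `mode()[0]` picks the first of them in sorted order, so domain knowledge may still be needed to choose between them.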
| 1884 | + { |
| 1885 | + "cell_type": "markdown", |
| 1886 | + "metadata": { |
| 1887 | + "id": "heYe1I0dOmQ_" |
| 1888 | + }, |
| 1889 | + "source": [ |
| 1890 | + "Now, coming to numeric data. Here, we have two common ways of replacing missing values:\n", |
| 1891 | + "\n", |
| 1892 | + "1. Replace with the median of the column\n", |
| 1893 | + "2. Replace with the mean of the column\n", |
| 1894 | + "\n", |
| 1895 | + "We replace with the median in the case of skewed data with outliers. This is because the median is robust to outliers.\n", |
| 1896 | + "\n", |
| 1897 | + "When the data is roughly symmetric (for example, approximately normally distributed), we can use the mean, since in that case the mean and median are close." |
| 1898 | + ] |
| 1899 | + }, |
| 1900 | + { |
| 1901 | + "cell_type": "code", |
| 1902 | + "metadata": { |
| 1903 | + "id": "09HM_2feOj5Y" |
| 1904 | + }, |
| 1905 | + "source": [ |
| 1906 | + "" |
| 1907 | + ], |
| 1908 | + "execution_count": null, |
| 1909 | + "outputs": [] |
| 1910 | + }, |
1617 | 1911 | { |
1618 | 1912 | "cell_type": "code", |
1619 | 1913 | "metadata": { |
|