-pandas supports the integration with many file formats or data sources out of the box (csv, excel, sql, json, parquet,…). Importing data from each of these
-data sources is provided by function with the prefix ``read_*``. Similarly, the ``to_*`` methods are used to store data.
+pandas supports the integration with many file formats or data sources out of the box (csv, excel, sql, json, parquet,…). The ability to import data from each of these
+data sources is provided by functions with the prefix ``read_*``. Similarly, the ``to_*`` methods are used to store data.
.. image:: ../_static/schemas/02_io_readwrite.svg
:align: center
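As a minimal sketch of this ``read_*``/``to_*`` pairing (the file names here are hypothetical):

.. code-block:: python

    import pandas as pd

    # import from one format...
    df = pd.read_csv("data.csv")

    # ...and store the same data in another
    df.to_parquet("data.parquet")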
@@ -181,7 +180,7 @@ data sources is provided by function with the prefix ``read_*``. Similarly, the
-Selecting or filtering specific rows and/or columns? Filtering the data on a condition? Methods for slicing, selecting, and extracting the
+Selecting or filtering specific rows and/or columns? Filtering the data on a particular condition? Methods for slicing, selecting, and extracting the
data you need are available in pandas.
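A minimal sketch of both selection styles, assuming a hypothetical ``age`` column:

.. code-block:: python

    ages = df["age"]             # select a single column
    adults = df[df["age"] > 35]  # filter rows on a condition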
.. image:: ../_static/schemas/03_subset_columns_rows.svg
@@ -228,7 +227,7 @@ data you need are available in pandas.
-pandas provides plotting your data out of the box, using the power of Matplotlib. You can pick the plot type (scatter, bar, boxplot,...)
+pandas provides plotting for your data right out of the box with the power of Matplotlib. Simply pick the plot type (scatter, bar, boxplot,...)
corresponding to your data.
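For example, with a hypothetical ``DataFrame`` ``df`` holding numeric ``width`` and ``height`` columns, each plot type is a method under ``.plot``:

.. code-block:: python

    df.plot()                               # line plot of all numeric columns
    df.plot.scatter(x="width", y="height")  # scatter plot of two columns
    df.plot.box()                           # one boxplot per column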
.. image:: ../_static/schemas/04_plot_overview.svg
@@ -275,7 +274,7 @@ corresponding to your data.
-There is no need to loop over all rows of your data table to do calculations. Data manipulations on a column work elementwise.
+There's no need to loop over all rows of your data table to do calculations. Column data manipulations work elementwise in pandas.
Adding a column to a :class:`DataFrame` based on existing data in other columns is straightforward.
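A minimal sketch, assuming hypothetical columns ``a`` and ``b``:

.. code-block:: python

    # the division is applied to every row at once; no loop needed
    df["ratio"] = df["a"] / df["b"]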
.. image:: ../_static/schemas/05_newcolumn_2.svg
@@ -322,7 +321,7 @@ Adding a column to a :class:`DataFrame` based on existing data in other columns
-Basic statistics (mean, median, min, max, counts...) are easily calculable. These or custom aggregations can be applied on the entire
+Basic statistics (mean, median, min, max, counts...) are easily calculable across data frames. These, or even custom aggregations, can be applied on the entire
data set, a sliding window of the data, or grouped by categories. The latter is also known as the split-apply-combine approach.
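A minimal sketch of the split-apply-combine pattern, assuming hypothetical ``category`` and ``value`` columns:

.. code-block:: python

    df["value"].mean()                       # statistic over the whole column
    df.groupby("category")["value"].mean()   # the same statistic per group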
.. image:: ../_static/schemas/06_groupby.svg
@@ -369,8 +368,8 @@ data set, a sliding window of the data, or grouped by categories. The latter is
-Change the structure of your data table in multiple ways. You can :func:`~pandas.melt` your data table from wide to long/tidy form or :func:`~pandas.pivot`
-from long to wide format. With aggregations built-in, a pivot table is created with a single command.
+Change the structure of your data table in a variety of ways. You can use :func:`~pandas.melt` to reshape your data from a wide format to a long and tidy one. Use :func:`~pandas.pivot`
+to go from long to wide format. With aggregations built-in, a pivot table can be created with a single command.
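A minimal sketch of both reshapes, assuming hypothetical ``date``, ``station``, and ``no2`` columns:

.. code-block:: python

    # wide -> long
    long = df.melt(id_vars="date", var_name="station", value_name="no2")
    # long -> wide again
    wide = long.pivot(index="date", columns="station", values="no2")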
.. image:: ../_static/schemas/07_melt.svg
:align: center
@@ -416,7 +415,7 @@ from long to wide format. With aggregations built-in, a pivot table is created w
-Multiple tables can be concatenated both column wise and row wise as database-like join/merge operations are provided to combine multiple tables of data.
+Multiple tables can be concatenated column wise or row wise with pandas' database-like join and merge operations.
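A minimal sketch, assuming two hypothetical tables that share a ``key`` column:

.. code-block:: python

    combined = pd.concat([df1, df2], axis=0)           # stack row wise
    merged = pd.merge(df1, df2, how="left", on="key")  # database-like join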
.. image:: ../_static/schemas/08_concat_row.svg
:align: center
@@ -505,7 +504,7 @@ pandas has great support for time series and has an extensive set of tools for w
-Data sets do not only contain numerical data. pandas provides a wide range of functions to clean textual data and extract useful information from it.
+Data sets often contain more than just numerical data. pandas provides a wide range of functions to clean textual data and extract useful information from it.
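A minimal sketch of the vectorized string methods, assuming a hypothetical ``name`` column:

.. code-block:: python

    df["name"].str.lower()                       # lowercase every value
    df["name"].str.contains("miss", case=False)  # boolean mask from a match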
.. raw:: html
@@ -551,9 +550,9 @@ the pandas-equivalent operations compared to software you already know:
:class-card: comparison-card
:shadow: md
-   The `R programming language `__ provides the
-   ``data.frame`` data structure and multiple packages, such as
-   `tidyverse `__ use and extend ``data.frame``
+   The `R programming language `__ provides a
+   ``data.frame`` data structure as well as packages like
+   `tidyverse `__ which use and extend ``data.frame``
for convenient data handling functionalities similar to pandas.
+++
@@ -572,8 +571,8 @@ the pandas-equivalent operations compared to software you already know:
:class-card: comparison-card
:shadow: md
- Already familiar to ``SELECT``, ``GROUP BY``, ``JOIN``, etc.?
- Most of these SQL manipulations do have equivalents in pandas.
+ Already familiar with ``SELECT``, ``GROUP BY``, ``JOIN``, etc.?
+ Many SQL manipulations have equivalents in pandas.
+++
@@ -613,7 +612,7 @@ the pandas-equivalent operations compared to software you already know:
Users of `Excel `__
or other spreadsheet programs will find that many of the concepts are
- transferrable to pandas.
+ transferable to pandas.
+++
@@ -631,10 +630,10 @@ the pandas-equivalent operations compared to software you already know:
:class-card: comparison-card
:shadow: md
- The `SAS `__ statistical software suite
- also provides the ``data set`` corresponding to the pandas ``DataFrame``.
- Also SAS vectorized operations, filtering, string processing operations,
- and more have similar functions in pandas.
+   `SAS `__, the statistical software suite,
+   uses the ``data set`` structure, which closely corresponds to pandas' ``DataFrame``.
+   SAS vectorized operations such as filtering or string processing also
+   have similar functions in pandas.
+++
diff --git a/doc/source/getting_started/install.rst b/doc/source/getting_started/install.rst
index 1d7eca5223544..b3982c4ad091f 100644
--- a/doc/source/getting_started/install.rst
+++ b/doc/source/getting_started/install.rst
@@ -6,88 +6,75 @@
Installation
============
-The easiest way to install pandas is to install it
-as part of the `Anaconda `__ distribution, a
-cross platform distribution for data analysis and scientific computing.
-The `Conda `__ package manager is the
-recommended installation method for most users.
+The pandas development team officially distributes pandas for installation
+through the following methods:
-Instructions for installing :ref:`from source `,
-:ref:`PyPI `, or a
-:ref:`development version ` are also provided.
+* Available on `conda-forge `__ for installation with the conda package manager.
+* Available on `PyPI `__ for installation with pip.
+* Available on `GitHub `__ for installation from source.
+
+.. note::
+ pandas may be installable from other sources besides the ones listed above,
+ but they are **not** managed by the pandas development team.
.. _install.version:
Python version support
----------------------
-Officially Python 3.9, 3.10 and 3.11.
+See :ref:`Python support policy `.
Installing pandas
-----------------
-.. _install.anaconda:
+.. _install.conda:
-Installing with Anaconda
-~~~~~~~~~~~~~~~~~~~~~~~~
+Installing with Conda
+~~~~~~~~~~~~~~~~~~~~~
-For users that are new to Python, the easiest way to install Python, pandas, and the
-packages that make up the `PyData `__ stack
-(`SciPy `__, `NumPy `__,
-`Matplotlib `__, `and more `__)
-is with `Anaconda `__, a cross-platform
-(Linux, macOS, Windows) Python distribution for data analytics and
-scientific computing. Installation instructions for Anaconda
-`can be found here `__.
+For users working with the `Conda `__ package manager,
+pandas can be installed from the ``conda-forge`` channel.
-.. _install.miniconda:
+.. code-block:: shell
-Installing with Miniconda
-~~~~~~~~~~~~~~~~~~~~~~~~~
+ conda install -c conda-forge pandas
-For users experienced with Python, the recommended way to install pandas with
-`Miniconda `__.
-Miniconda allows you to create a minimal, self-contained Python installation compared to Anaconda and use the
-`Conda `__ package manager to install additional packages
-and create a virtual environment for your installation. Installation instructions for Miniconda
-`can be found here `__.
+To install the Conda package manager on your system, the
+`Miniforge distribution `__
+is recommended.
-The next step is to create a new conda environment. A conda environment is like a
-virtualenv that allows you to specify a specific version of Python and set of libraries.
-Run the following commands from a terminal window.
+Additionally, it is recommended to install and run pandas from a virtual environment.
.. code-block:: shell
conda create -c conda-forge -n name_of_my_env python pandas
-
-This will create a minimal environment with only Python and pandas installed.
-To put your self inside this environment run.
-
-.. code-block:: shell
-
+    # On Linux or macOS
source activate name_of_my_env
# On Windows
activate name_of_my_env
-.. _install.pypi:
+.. tip::
+ For users that are new to Python, the easiest way to install Python, pandas, and the
+ packages that make up the `PyData `__ stack such as
+ `SciPy `__, `NumPy `__ and
+ `Matplotlib `__
+ is with `Anaconda `__, a cross-platform
+ (Linux, macOS, Windows) Python distribution for data analytics and
+ scientific computing.
-Installing from PyPI
-~~~~~~~~~~~~~~~~~~~~
+ However, pandas from Anaconda is **not** officially managed by the pandas development team.
-pandas can be installed via pip from
-`PyPI `__.
+.. _install.pip:
-.. code-block:: shell
-
- pip install pandas
+Installing with pip
+~~~~~~~~~~~~~~~~~~~
-.. note::
- You must have ``pip>=19.3`` to install from PyPI.
+For users working with the `pip `__ package manager,
+pandas can be installed from `PyPI `__.
-.. note::
+.. code-block:: shell
- It is recommended to install and run pandas from a virtual environment, for example,
- using the Python standard library's `venv `__
+ pip install pandas
pandas can also be installed with sets of optional dependencies to enable certain functionality. For example,
to install pandas with the optional dependencies to read Excel files.
@@ -98,25 +85,8 @@ to install pandas with the optional dependencies to read Excel files.
The full list of extras that can be installed can be found in the :ref:`dependency section <install.dependencies>`.
-Handling ImportErrors
-~~~~~~~~~~~~~~~~~~~~~
-
-If you encounter an ``ImportError``, it usually means that Python couldn't find pandas in the list of available
-libraries. Python internally has a list of directories it searches through, to find packages. You can
-obtain these directories with.
-
-.. code-block:: python
-
- import sys
- sys.path
-
-One way you could be encountering this error is if you have multiple Python installations on your system
-and you don't have pandas installed in the Python installation you're currently using.
-In Linux/Mac you can run ``which python`` on your terminal and it will tell you which Python installation you're
-using. If it's something like "/usr/bin/python", you're using the Python from the system, which is not recommended.
-
-It is highly recommended to use ``conda``, for quick installation and for package and dependency updates.
-You can find simple installation instructions for pandas :ref:`in this document `.
+Additionally, it is recommended to install and run pandas from a virtual environment, for example,
+using the Python standard library's `venv `__
.. _install.source:
@@ -144,49 +114,24 @@ index from the PyPI registry of anaconda.org. You can install it by running.
pip install --pre --extra-index https://pypi.anaconda.org/scientific-python-nightly-wheels/simple pandas
-Note that you might be required to uninstall an existing version of pandas to install the development version.
+.. note::
+ You might be required to uninstall an existing version of pandas to install the development version.
-.. code-block:: shell
+ .. code-block:: shell
- pip uninstall pandas -y
+ pip uninstall pandas -y
Running the test suite
----------------------
-pandas is equipped with an exhaustive set of unit tests. The packages required to run the tests
-can be installed with ``pip install "pandas[test]"``. To run the tests from a
-Python terminal.
-
-.. code-block:: python
-
- >>> import pandas as pd
- >>> pd.test()
- running: pytest -m "not slow and not network and not db" /home/user/anaconda3/lib/python3.9/site-packages/pandas
-
- ============================= test session starts ==============================
- platform linux -- Python 3.9.7, pytest-6.2.5, py-1.11.0, pluggy-1.0.0
- rootdir: /home/user
- plugins: dash-1.19.0, anyio-3.5.0, hypothesis-6.29.3
- collected 154975 items / 4 skipped / 154971 selected
- ........................................................................ [ 0%]
- ........................................................................ [ 99%]
- ....................................... [100%]
-
- ==================================== ERRORS ====================================
-
- =================================== FAILURES ===================================
-
- =============================== warnings summary ===============================
-
- =========================== short test summary info ============================
-
- = 1 failed, 146194 passed, 7402 skipped, 1367 xfailed, 5 xpassed, 197 warnings, 10 errors in 1090.16s (0:18:10) =
+If pandas has been installed :ref:`from source <install.source>`, running ``pytest pandas`` will run all of the pandas unit tests.
+The unit tests can also be run from the pandas module itself with the :func:`test` function. The packages required to run the tests
+can be installed with ``pip install "pandas[test]"``.
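For example, once the test dependencies are installed, the module-level entry point is:

.. code-block:: python

    import pandas as pd

    pd.test()  # runs the bundled unit tests via pytest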
.. note::
- This is just an example of what information is shown. Test failures are not necessarily indicative
- of a broken pandas installation.
+ Test failures are not necessarily indicative of a broken pandas installation.
.. _install.dependencies:
@@ -203,9 +148,8 @@ pandas requires the following dependencies.
================================================================ ==========================
Package Minimum supported version
================================================================ ==========================
-`NumPy `__ 1.22.4
+`NumPy `__ 1.23.5
`python-dateutil `__ 2.8.2
-`pytz `__ 2020.1
`tzdata `__ 2022.7
================================================================ ==========================
@@ -220,7 +164,7 @@ For example, :func:`pandas.read_hdf` requires the ``pytables`` package, while
optional dependency is not installed, pandas will raise an ``ImportError`` when
the method requiring that dependency is called.
-If using pip, optional pandas dependencies can be installed or managed in a file (e.g. requirements.txt or pyproject.toml)
+With pip, optional pandas dependencies can be installed or managed in a file (e.g. requirements.txt or pyproject.toml)
as optional extras (e.g. ``pandas[performance, aws]``). All optional dependencies can be installed with ``pandas[all]``,
and specific sets of dependencies are listed in the sections below.
@@ -269,6 +213,8 @@ SciPy 1.10.0 computation Miscellaneous stati
xarray 2022.12.0 computation pandas-like API for N-dimensional data
========================= ================== =============== =============================================================
+.. _install.excel_dependencies:
+
Excel files
^^^^^^^^^^^
@@ -277,11 +223,12 @@ Installable with ``pip install "pandas[excel]"``.
========================= ================== =============== =============================================================
Dependency Minimum Version pip extra Notes
========================= ================== =============== =============================================================
-xlrd 2.0.1 excel Reading Excel
-xlsxwriter 3.0.5 excel Writing Excel
-openpyxl 3.1.0 excel Reading / writing for xlsx files
+xlrd 2.0.1 excel Reading for xls files
+xlsxwriter 3.0.5 excel Writing for xlsx files
+openpyxl 3.1.0 excel Reading / writing for Excel 2010 xlsx/xlsm/xltx/xltm files
pyxlsb 1.0.10 excel Reading for xlsb files
-python-calamine 0.1.7 excel Reading for xls/xlsx/xlsb/ods files
+python-calamine 0.1.7 excel Reading for xls/xlsx/xlsm/xlsb/xla/xlam/ods files
+odfpy 1.4.1 excel Reading / writing for OpenDocument 1.2 files
========================= ================== =============== =============================================================
HTML
@@ -345,7 +292,7 @@ SQLAlchemy 2.0.0 postgresql, SQL support for dat
sql-other
psycopg2 2.9.6 postgresql PostgreSQL engine for sqlalchemy
pymysql 1.0.2 mysql MySQL engine for sqlalchemy
-adbc-driver-postgresql 0.8.0 postgresql ADBC Driver for PostgreSQL
+adbc-driver-postgresql 0.10.0 postgresql ADBC Driver for PostgreSQL
adbc-driver-sqlite 0.8.0 sql-other ADBC Driver for SQLite
========================= ================== =============== =============================================================
@@ -360,7 +307,7 @@ Dependency Minimum Version pip extra Notes
PyTables 3.8.0 hdf5 HDF5-based reading / writing
blosc 1.21.3 hdf5 Compression for HDF5; only available on ``conda``
zlib hdf5 Compression for HDF5
-fastparquet 2022.12.0 - Parquet reading / writing (pyarrow is default)
+fastparquet 2023.10.0 - Parquet reading / writing (pyarrow is default)
pyarrow 10.0.1 parquet, feather Parquet, ORC, and feather reading / writing
pyreadstat 1.2.0 spss SPSS files (.sav) reading
odfpy 1.4.1 excel Open document format (.odf, .ods, .odt) reading / writing
@@ -385,7 +332,6 @@ Dependency Minimum Version pip extra Notes
fsspec 2022.11.0 fss, gcp, aws Handling files aside from simple local and HTTP (required
dependency of s3fs, gcsfs).
gcsfs 2022.11.0 gcp Google Cloud Storage access
-pandas-gbq 0.19.0 gcp Google Big Query access
s3fs 2022.11.0 aws Amazon S3 access
========================= ================== =============== =============================================================
@@ -418,13 +364,13 @@ Dependency Minimum Version pip extra Notes
Zstandard 0.19.0 compression Zstandard compression
========================= ================== =============== =============================================================
-Consortium Standard
-^^^^^^^^^^^^^^^^^^^
+Timezone
+^^^^^^^^
-Installable with ``pip install "pandas[consortium-standard]"``
+Installable with ``pip install "pandas[timezone]"``
========================= ================== =================== =============================================================
Dependency Minimum Version pip extra Notes
========================= ================== =================== =============================================================
-dataframe-api-compat 0.1.7 consortium-standard Consortium Standard-compatible implementation based on pandas
+pytz 2023.4 timezone Alternative timezone library to ``zoneinfo``.
========================= ================== =================== =============================================================
diff --git a/doc/source/getting_started/intro_tutorials/01_table_oriented.rst b/doc/source/getting_started/intro_tutorials/01_table_oriented.rst
index caaff3557ae40..efcdb22778ef4 100644
--- a/doc/source/getting_started/intro_tutorials/01_table_oriented.rst
+++ b/doc/source/getting_started/intro_tutorials/01_table_oriented.rst
@@ -46,7 +46,7 @@ I want to store passenger data of the Titanic. For a number of passengers, I kno
"Name": [
"Braund, Mr. Owen Harris",
"Allen, Mr. William Henry",
- "Bonnell, Miss. Elizabeth",
+ "Bonnell, Miss Elizabeth",
],
"Age": [22, 35, 58],
"Sex": ["male", "male", "female"],
@@ -192,8 +192,8 @@ Check more options on ``describe`` in the user guide section about :ref:`aggrega
.. note::
This is just a starting point. Similar to spreadsheet
software, pandas represents data as a table with columns and rows. Apart
- from the representation, also the data manipulations and calculations
- you would do in spreadsheet software are supported by pandas. Continue
+ from the representation, the data manipulations and calculations
+ you would do in spreadsheet software are also supported by pandas. Continue
reading the next tutorials to get started!
.. raw:: html
@@ -204,7 +204,7 @@ Check more options on ``describe`` in the user guide section about :ref:`aggrega
- Import the package, aka ``import pandas as pd``
- A table of data is stored as a pandas ``DataFrame``
- Each column in a ``DataFrame`` is a ``Series``
-- You can do things by applying a method to a ``DataFrame`` or ``Series``
+- You can do things by applying a method on a ``DataFrame`` or ``Series``
.. raw:: html
@@ -215,7 +215,7 @@ Check more options on ``describe`` in the user guide section about :ref:`aggrega
To user guide
-A more extended explanation to ``DataFrame`` and ``Series`` is provided in the :ref:`introduction to data structures
`.
+A more extended explanation of ``DataFrame`` and ``Series`` is provided in the :ref:`introduction to data structures ` page.
.. raw:: html
diff --git a/doc/source/getting_started/intro_tutorials/02_read_write.rst b/doc/source/getting_started/intro_tutorials/02_read_write.rst
index 832c2cc25712f..0549c17a1013c 100644
--- a/doc/source/getting_started/intro_tutorials/02_read_write.rst
+++ b/doc/source/getting_started/intro_tutorials/02_read_write.rst
@@ -97,11 +97,11 @@ in this ``DataFrame`` are integers (``int64``), floats (``float64``) and
strings (``object``).
.. note::
- When asking for the ``dtypes``, no brackets are used!
+ When asking for the ``dtypes``, no parentheses ``()`` are used!
``dtypes`` is an attribute of a ``DataFrame`` and ``Series``. Attributes
- of a ``DataFrame`` or ``Series`` do not need brackets. Attributes
+ of a ``DataFrame`` or ``Series`` do not need ``()``. Attributes
represent a characteristic of a ``DataFrame``/``Series``, whereas
- methods (which require brackets) *do* something with the
+ methods (which require parentheses ``()``) *do* something with the
``DataFrame``/``Series`` as introduced in the :ref:`first tutorial <10min_tut_01_tableoriented>`.
.. raw:: html
@@ -111,6 +111,12 @@ strings (``object``).
My colleague requested the Titanic data as a spreadsheet.
+.. note::
+ If you want to use :func:`~pandas.to_excel` and :func:`~pandas.read_excel`,
+ you need to install an Excel reader as outlined in the
+  :ref:`Excel files <install.excel_dependencies>` section of the
+ installation documentation.
+
.. ipython:: python
titanic.to_excel("titanic.xlsx", sheet_name="passengers", index=False)
@@ -166,11 +172,11 @@ The method :meth:`~DataFrame.info` provides technical information about a
- The table has 12 columns. Most columns have a value for each of the
rows (all 891 values are ``non-null``). Some columns do have missing
values and less than 891 ``non-null`` values.
-- The columns ``Name``, ``Sex``, ``Cabin`` and ``Embarked`` consists of
+- The columns ``Name``, ``Sex``, ``Cabin`` and ``Embarked`` consist of
textual data (strings, aka ``object``). The other columns are
- numerical data with some of them whole numbers (aka ``integer``) and
- others are real numbers (aka ``float``).
-- The kind of data (characters, integers,…) in the different columns
+  numerical data; some of them are whole numbers (``integer``) and
+ others are real numbers (``float``).
+- The kind of data (characters, integers, …) in the different columns
are summarized by listing the ``dtypes``.
- The approximate amount of RAM used to hold the DataFrame is provided
as well.
@@ -188,7 +194,7 @@ The method :meth:`~DataFrame.info` provides technical information about a
- Getting data in to pandas from many different file formats or data
sources is supported by ``read_*`` functions.
- Exporting data out of pandas is provided by different
- ``to_*``\ methods.
+ ``to_*`` methods.
- The ``head``/``tail``/``info`` methods and the ``dtypes`` attribute
are convenient for a first check.
diff --git a/doc/source/getting_started/intro_tutorials/03_subset_data.rst b/doc/source/getting_started/intro_tutorials/03_subset_data.rst
index 6d7ec01551572..ce7aa629a89fc 100644
--- a/doc/source/getting_started/intro_tutorials/03_subset_data.rst
+++ b/doc/source/getting_started/intro_tutorials/03_subset_data.rst
@@ -101,7 +101,7 @@ selection brackets ``[]``.
.. note::
The inner square brackets define a
:ref:`Python list ` with column names, whereas
- the outer brackets are used to select the data from a pandas
+ the outer square brackets are used to select the data from a pandas
``DataFrame`` as seen in the previous example.
The returned data type is a pandas DataFrame:
@@ -300,7 +300,7 @@ want to select.
-When using the column names, row labels or a condition expression, use
+When using column names, row labels or a condition expression, use
the ``loc`` operator in front of the selection brackets ``[]``. For both
the part before and after the comma, you can use a single label, a list
of labels, a slice of labels, a conditional expression or a colon. Using
@@ -342,7 +342,7 @@ the name ``anonymous`` to the first 3 elements of the fourth column:
To user guide
-See the user guide section on :ref:`different choices for indexing
` to get more insight in the usage of ``loc`` and ``iloc``.
+See the user guide section on :ref:`different choices for indexing ` to get more insight into the usage of ``loc`` and ``iloc``.
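For instance, the two operators side by side on this tutorial's Titanic table (mirroring the selections made earlier in the tutorial):

.. code-block:: python

    titanic.loc[titanic["Age"] > 35, "Name"]  # labels: a condition plus a column name
    titanic.iloc[9:25, 2:5]                   # positions: rows 10-25, columns 3-5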
.. raw:: html
@@ -354,13 +354,11 @@ See the user guide section on :ref:`different choices for indexing REMEMBER
- When selecting subsets of data, square brackets ``[]`` are used.
-- Inside these brackets, you can use a single column/row label, a list
+- Inside these square brackets, you can use a single column/row label, a list
of column/row labels, a slice of labels, a conditional expression or
a colon.
-- Select specific rows and/or columns using ``loc`` when using the row
- and column names.
-- Select specific rows and/or columns using ``iloc`` when using the
- positions in the table.
+- Use ``loc`` for label-based selection (using row/column names).
+- Use ``iloc`` for position-based selection (using table positions).
- You can assign new values to a selection based on ``loc``/``iloc``.
.. raw:: html
diff --git a/doc/source/getting_started/intro_tutorials/04_plotting.rst b/doc/source/getting_started/intro_tutorials/04_plotting.rst
index e96eb7c51a12a..e9f83c602d086 100644
--- a/doc/source/getting_started/intro_tutorials/04_plotting.rst
+++ b/doc/source/getting_started/intro_tutorials/04_plotting.rst
@@ -32,8 +32,10 @@ How do I create plots in pandas?
air_quality.head()
.. note::
- The usage of the ``index_col`` and ``parse_dates`` parameters of the ``read_csv`` function to define the first (0th) column as
- index of the resulting ``DataFrame`` and convert the dates in the column to :class:`Timestamp` objects, respectively.
+ The ``index_col=0`` and ``parse_dates=True`` parameters passed to the ``read_csv`` function define
+ the first (0th) column as index of the resulting ``DataFrame`` and convert the dates in the column
+ to :class:`Timestamp` objects, respectively.
+
.. raw:: html
@@ -85,7 +87,7 @@ I want to plot only the columns of the data table with the data from Paris.
air_quality["station_paris"].plot()
plt.show()
-To plot a specific column, use the selection method of the
+To plot a specific column, use a selection method from the
:ref:`subset data tutorial <10min_tut_03_subset>` in combination with the :meth:`~DataFrame.plot`
method. Hence, the :meth:`~DataFrame.plot` method works on both ``Series`` and
``DataFrame``.
@@ -127,7 +129,7 @@ standard Python to get an overview of the available plot methods:
]
.. note::
- In many development environments as well as IPython and
+ In many development environments such as IPython and
Jupyter Notebook, use the TAB button to get an overview of the available
methods, for example ``air_quality.plot.`` + TAB.
@@ -238,7 +240,7 @@ This strategy is applied in the previous example:
- The ``.plot.*`` methods are applicable on both Series and DataFrames.
- By default, each of the columns is plotted as a different element
- (line, boxplot,…).
+ (line, boxplot, …).
- Any plot created by pandas is a Matplotlib object.
.. raw:: html
diff --git a/doc/source/getting_started/intro_tutorials/05_add_columns.rst b/doc/source/getting_started/intro_tutorials/05_add_columns.rst
index d59a70cc2818e..481c094870e12 100644
--- a/doc/source/getting_started/intro_tutorials/05_add_columns.rst
+++ b/doc/source/getting_started/intro_tutorials/05_add_columns.rst
@@ -51,7 +51,7 @@ hPa, the conversion factor is 1.882*)
air_quality["london_mg_per_cubic"] = air_quality["station_london"] * 1.882
air_quality.head()
-To create a new column, use the ``[]`` brackets with the new column name
+To create a new column, use the square brackets ``[]`` with the new column name
at the left side of the assignment.
.. raw:: html
@@ -89,8 +89,8 @@ values in each row*.
-Also other mathematical operators (``+``, ``-``, ``*``, ``/``,…) or
-logical operators (``<``, ``>``, ``==``,…) work element-wise. The latter was already
+Other mathematical operators (``+``, ``-``, ``*``, ``/``, …) and logical
+operators (``<``, ``>``, ``==``, …) also work element-wise. The latter was already
used in the :ref:`subset data tutorial <10min_tut_03_subset>` to filter
rows of a table using a conditional expression.
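For instance, continuing with this tutorial's air quality table (the ``ratio`` column name is hypothetical):

.. code-block:: python

    # arithmetic: a new column computed element-wise
    air_quality["ratio"] = (
        air_quality["station_paris"] / air_quality["station_antwerp"]
    )
    # logical: an element-wise boolean result
    air_quality["station_paris"] > 25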
diff --git a/doc/source/getting_started/intro_tutorials/06_calculate_statistics.rst b/doc/source/getting_started/intro_tutorials/06_calculate_statistics.rst
index fe3ae820e7085..1399ab66426f4 100644
--- a/doc/source/getting_started/intro_tutorials/06_calculate_statistics.rst
+++ b/doc/source/getting_started/intro_tutorials/06_calculate_statistics.rst
@@ -162,7 +162,7 @@ columns by passing ``numeric_only=True``:
It does not make much sense to get the average value of the ``Pclass``.
If we are only interested in the average age for each gender, the
-selection of columns (rectangular brackets ``[]`` as usual) is supported
+selection of columns (square brackets ``[]`` as usual) is supported
on the grouped data as well:
.. ipython:: python
@@ -235,7 +235,7 @@ category in a column.
-The function is a shortcut, as it is actually a groupby operation in combination with counting of the number of records
+The function is a shortcut; it is actually a groupby operation in combination with counting the number of records
within each group:
.. ipython:: python
diff --git a/doc/source/getting_started/intro_tutorials/08_combine_dataframes.rst b/doc/source/getting_started/intro_tutorials/08_combine_dataframes.rst
index 9081f274cd941..024300bb8a9b0 100644
--- a/doc/source/getting_started/intro_tutorials/08_combine_dataframes.rst
+++ b/doc/source/getting_started/intro_tutorials/08_combine_dataframes.rst
@@ -137,7 +137,7 @@ Hence, the resulting table has 3178 = 1110 + 2068 rows.
Most operations like concatenation or summary statistics are by default
across rows (axis 0), but can be applied across columns as well.
-Sorting the table on the datetime information illustrates also the
+Sorting the table on the datetime information also illustrates the
combination of both tables, with the ``parameter`` column defining the
origin of the table (either ``no2`` from table ``air_quality_no2`` or
``pm25`` from table ``air_quality_pm25``):
@@ -271,7 +271,7 @@ Add the parameters' full description and name, provided by the parameters metada
Compared to the previous example, there is no common column name.
However, the ``parameter`` column in the ``air_quality`` table and the
-``id`` column in the ``air_quality_parameters_name`` both provide the
+``id`` column in the ``air_quality_parameters`` table both provide the
measured variable in a common format. The ``left_on`` and ``right_on``
arguments are used here (instead of just ``on``) to make the link
between the two tables.
@@ -286,7 +286,7 @@ between the two tables.
To user guide
-pandas supports also inner, outer, and right joins.
+pandas also supports inner, outer, and right joins.
More information on join/merge of tables is provided in the user guide section on
:ref:`database style merging of tables
`. Or have a look at the
:ref:`comparison with SQL` page.
@@ -300,7 +300,7 @@ More information on join/merge of tables is provided in the user guide section o
REMEMBER
-- Multiple tables can be concatenated both column-wise and row-wise using
+- Multiple tables can be concatenated column-wise or row-wise using
the ``concat`` function.
- For database-like merging/joining of tables, use the ``merge``
function.
diff --git a/doc/source/getting_started/intro_tutorials/09_timeseries.rst b/doc/source/getting_started/intro_tutorials/09_timeseries.rst
index b0530087e5b84..6ba3c17fac3c3 100644
--- a/doc/source/getting_started/intro_tutorials/09_timeseries.rst
+++ b/doc/source/getting_started/intro_tutorials/09_timeseries.rst
@@ -77,9 +77,9 @@ I want to work with the dates in the column ``datetime`` as datetime objects ins
Initially, the values in ``datetime`` are character strings and do not
provide any datetime operations (e.g. extract the year, day of the
-week,…). By applying the ``to_datetime`` function, pandas interprets the
+week, …). By applying the ``to_datetime`` function, pandas interprets the
strings and convert these to datetime (i.e. ``datetime64[ns, UTC]``)
-objects. In pandas we call these datetime objects similar to
+objects. In pandas, we refer to these datetime objects, which are similar to
``datetime.datetime`` from the standard library, as :class:`pandas.Timestamp`.
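A minimal sketch, using the ``datetime`` column from this tutorial's data set:

.. code-block:: python

    air_quality["datetime"] = pd.to_datetime(air_quality["datetime"])
    air_quality["datetime"].dt.weekday  # datetime operations are now available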
.. raw:: html
@@ -117,7 +117,7 @@ length of our time series:
air_quality["datetime"].max() - air_quality["datetime"].min()
The result is a :class:`pandas.Timedelta` object, similar to ``datetime.timedelta``
-from the standard Python library and defining a time duration.
+from the standard Python library, which defines a time duration.
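For example, subtracting two :class:`pandas.Timestamp` objects directly:

.. code-block:: python

    pd.Timestamp("2019-06-21") - pd.Timestamp("2019-05-07")  # Timedelta('45 days 00:00:00')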
.. raw:: html
@@ -257,7 +257,7 @@ the adapted time scale on plots. Let’s apply this on our data.
-
-Create a plot of the :math:`NO_2` values in the different stations from the 20th of May till the end of 21st of May
+Create a plot of the :math:`NO_2` values in the different stations from May 20th till the end of May 21st.
.. ipython:: python
:okwarning:
@@ -295,7 +295,7 @@ Aggregate the current hourly time series values to the monthly maximum value in
.. ipython:: python
- monthly_max = no_2.resample("ME").max()
+ monthly_max = no_2.resample("MS").max()
monthly_max
A very powerful method on time series data with a datetime index, is the
@@ -310,7 +310,7 @@ converting secondly data into 5-minutely data).
The :meth:`~Series.resample` method is similar to a groupby operation:
- it provides a time-based grouping, by using a string (e.g. ``M``,
- ``5H``,…) that defines the target frequency
+ ``5H``, …) that defines the target frequency
- it requires an aggregation function such as ``mean``, ``max``, …
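For example, with the hourly ``no_2`` series used above, both halves of the analogy appear in a single expression:

.. code-block:: python

    no_2.resample("MS").max()  # group by month (month start) and take the maximum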
.. raw:: html
diff --git a/doc/source/getting_started/intro_tutorials/10_text_data.rst b/doc/source/getting_started/intro_tutorials/10_text_data.rst
index 5b1885791d8fb..8493a071863c4 100644
--- a/doc/source/getting_started/intro_tutorials/10_text_data.rst
+++ b/doc/source/getting_started/intro_tutorials/10_text_data.rst
@@ -134,8 +134,8 @@ only one countess on the Titanic, we get one row as a result.
.. note::
More powerful extractions on strings are supported, as the
:meth:`Series.str.contains` and :meth:`Series.str.extract` methods accept `regular
- expressions `__, but out of
- scope of this tutorial.
+ expressions `__, but are out of
+ the scope of this tutorial.
.. raw:: html
@@ -200,7 +200,7 @@ In the "Sex" column, replace values of "male" by "M" and values of "female" by "
Whereas :meth:`~Series.replace` is not a string method, it provides a convenient way
to use mappings or vocabularies to translate certain values. It requires
-a ``dictionary`` to define the mapping ``{from : to}``.
+a ``dictionary`` to define the mapping ``{from: to}``.
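A minimal sketch of the mapping, mirroring the example above:

.. code-block:: python

    titanic["Sex_short"] = titanic["Sex"].replace({"male": "M", "female": "F"})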
.. raw:: html
diff --git a/doc/source/getting_started/overview.rst b/doc/source/getting_started/overview.rst
index 05a7d63b7ff47..a8b7a387d80ec 100644
--- a/doc/source/getting_started/overview.rst
+++ b/doc/source/getting_started/overview.rst
@@ -6,11 +6,11 @@
Package overview
****************
-pandas is a `Python `__ package providing fast,
+pandas is a `Python `__ package that provides fast,
flexible, and expressive data structures designed to make working with
"relational" or "labeled" data both easy and intuitive. It aims to be the
-fundamental high-level building block for doing practical, **real-world** data
-analysis in Python. Additionally, it has the broader goal of becoming **the
+fundamental high-level building block for Python's practical, **real-world** data
+analysis. Additionally, it seeks to become **the
most powerful and flexible open source data analysis/manipulation tool
available in any language**. It is already well on its way toward this goal.
diff --git a/doc/source/reference/arrays.rst b/doc/source/reference/arrays.rst
index fe65364896f54..a631cd517e3c2 100644
--- a/doc/source/reference/arrays.rst
+++ b/doc/source/reference/arrays.rst
@@ -539,6 +539,21 @@ To create a Series of dtype ``category``, use ``cat = s.astype(dtype)`` or
If the :class:`Series` is of dtype :class:`CategoricalDtype`, ``Series.cat`` can be used to change the categorical
data. See :ref:`api.series.cat` for more.
+More methods are available on :class:`Categorical`:
+
+.. autosummary::
+ :toctree: api/
+
+ Categorical.as_ordered
+ Categorical.as_unordered
+ Categorical.set_categories
+ Categorical.rename_categories
+ Categorical.reorder_categories
+ Categorical.add_categories
+ Categorical.remove_categories
+ Categorical.remove_unused_categories
+ Categorical.map
+
.. _api.arrays.sparse:
Sparse
@@ -685,7 +700,6 @@ Scalar introspection
api.types.is_float
api.types.is_hashable
api.types.is_integer
- api.types.is_interval
api.types.is_number
api.types.is_re
api.types.is_re_compilable
diff --git a/doc/source/reference/frame.rst b/doc/source/reference/frame.rst
index fefb02dd916cd..7680c8b434866 100644
--- a/doc/source/reference/frame.rst
+++ b/doc/source/reference/frame.rst
@@ -48,7 +48,7 @@ Conversion
DataFrame.convert_dtypes
DataFrame.infer_objects
DataFrame.copy
- DataFrame.bool
+ DataFrame.to_numpy
Indexing, iteration
~~~~~~~~~~~~~~~~~~~
@@ -74,6 +74,7 @@ Indexing, iteration
DataFrame.where
DataFrame.mask
DataFrame.query
+ DataFrame.isetitem
For more information on ``.at``, ``.iat``, ``.loc``, and
``.iloc``, see the :ref:`indexing documentation `.
@@ -117,7 +118,6 @@ Function application, GroupBy & window
DataFrame.apply
DataFrame.map
- DataFrame.applymap
DataFrame.pipe
DataFrame.agg
DataFrame.aggregate
@@ -185,11 +185,9 @@ Reindexing / selection / label manipulation
DataFrame.duplicated
DataFrame.equals
DataFrame.filter
- DataFrame.first
DataFrame.head
DataFrame.idxmax
DataFrame.idxmin
- DataFrame.last
DataFrame.reindex
DataFrame.reindex_like
DataFrame.rename
@@ -209,7 +207,6 @@ Missing data handling
.. autosummary::
:toctree: api/
- DataFrame.backfill
DataFrame.bfill
DataFrame.dropna
DataFrame.ffill
@@ -219,7 +216,6 @@ Missing data handling
DataFrame.isnull
DataFrame.notna
DataFrame.notnull
- DataFrame.pad
DataFrame.replace
Reshaping, sorting, transposing
@@ -238,7 +234,6 @@ Reshaping, sorting, transposing
DataFrame.swaplevel
DataFrame.stack
DataFrame.unstack
- DataFrame.swapaxes
DataFrame.melt
DataFrame.explode
DataFrame.squeeze
@@ -382,7 +377,6 @@ Serialization / IO / conversion
DataFrame.to_feather
DataFrame.to_latex
DataFrame.to_stata
- DataFrame.to_gbq
DataFrame.to_records
DataFrame.to_string
DataFrame.to_clipboard
diff --git a/doc/source/reference/groupby.rst b/doc/source/reference/groupby.rst
index 771163ae1b0bc..3b02ffe20c10e 100644
--- a/doc/source/reference/groupby.rst
+++ b/doc/source/reference/groupby.rst
@@ -80,7 +80,6 @@ Function application
DataFrameGroupBy.describe
DataFrameGroupBy.diff
DataFrameGroupBy.ffill
- DataFrameGroupBy.fillna
DataFrameGroupBy.first
DataFrameGroupBy.head
DataFrameGroupBy.idxmax
@@ -131,7 +130,6 @@ Function application
SeriesGroupBy.describe
SeriesGroupBy.diff
SeriesGroupBy.ffill
- SeriesGroupBy.fillna
SeriesGroupBy.first
SeriesGroupBy.head
SeriesGroupBy.last
diff --git a/doc/source/reference/index.rst b/doc/source/reference/index.rst
index 7da02f7958416..639bac4d40b70 100644
--- a/doc/source/reference/index.rst
+++ b/doc/source/reference/index.rst
@@ -24,13 +24,14 @@ The following subpackages are public.
`pandas-stubs `_ package
which has classes in addition to those that occur in pandas for type-hinting.
-In addition, public functions in ``pandas.io`` and ``pandas.tseries`` submodules
-are mentioned in the documentation.
+In addition, public functions in ``pandas.io``, ``pandas.tseries``, ``pandas.util`` submodules
+are explicitly mentioned in the documentation. Further APIs in these modules are not guaranteed
+to be stable.
.. warning::
- The ``pandas.core``, ``pandas.compat``, and ``pandas.util`` top-level modules are PRIVATE. Stable functionality in such modules is not guaranteed.
+ The ``pandas.core``, ``pandas.compat`` top-level modules are PRIVATE. Stable functionality in such modules is not guaranteed.
.. If you update this toctree, also update the manual toctree in the
.. main index.rst.template
@@ -61,7 +62,6 @@ are mentioned in the documentation.
..
.. toctree::
- api/pandas.Index.holds_integer
api/pandas.Index.nlevels
api/pandas.Index.sort
diff --git a/doc/source/reference/indexing.rst b/doc/source/reference/indexing.rst
index fa6105761df0a..7a4bc0f467f9a 100644
--- a/doc/source/reference/indexing.rst
+++ b/doc/source/reference/indexing.rst
@@ -41,6 +41,7 @@ Properties
Index.empty
Index.T
Index.memory_usage
+ Index.array
Modifying and computations
~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -61,13 +62,6 @@ Modifying and computations
Index.identical
Index.insert
Index.is_
- Index.is_boolean
- Index.is_categorical
- Index.is_floating
- Index.is_integer
- Index.is_interval
- Index.is_numeric
- Index.is_object
Index.min
Index.max
Index.reindex
@@ -110,6 +104,7 @@ Conversion
Index.to_list
Index.to_series
Index.to_frame
+ Index.to_numpy
Index.view
Sorting
diff --git a/doc/source/reference/io.rst b/doc/source/reference/io.rst
index fbd0f6bd200b9..805fb8b783459 100644
--- a/doc/source/reference/io.rst
+++ b/doc/source/reference/io.rst
@@ -188,13 +188,6 @@ SQL
read_sql
DataFrame.to_sql
-Google BigQuery
-~~~~~~~~~~~~~~~
-.. autosummary::
- :toctree: api/
-
- read_gbq
-
STATA
~~~~~
.. autosummary::
diff --git a/doc/source/reference/offset_frequency.rst b/doc/source/reference/offset_frequency.rst
index ab89fe74e7337..8bb2c6ffe73be 100644
--- a/doc/source/reference/offset_frequency.rst
+++ b/doc/source/reference/offset_frequency.rst
@@ -26,8 +26,6 @@ Properties
DateOffset.normalize
DateOffset.rule_code
DateOffset.n
- DateOffset.is_month_start
- DateOffset.is_month_end
Methods
~~~~~~~
@@ -35,7 +33,6 @@ Methods
:toctree: api/
DateOffset.copy
- DateOffset.is_anchored
DateOffset.is_on_offset
DateOffset.is_month_start
DateOffset.is_month_end
@@ -82,7 +79,6 @@ Methods
:toctree: api/
BusinessDay.copy
- BusinessDay.is_anchored
BusinessDay.is_on_offset
BusinessDay.is_month_start
BusinessDay.is_month_end
@@ -122,7 +118,6 @@ Methods
:toctree: api/
BusinessHour.copy
- BusinessHour.is_anchored
BusinessHour.is_on_offset
BusinessHour.is_month_start
BusinessHour.is_month_end
@@ -169,7 +164,6 @@ Methods
:toctree: api/
CustomBusinessDay.copy
- CustomBusinessDay.is_anchored
CustomBusinessDay.is_on_offset
CustomBusinessDay.is_month_start
CustomBusinessDay.is_month_end
@@ -209,7 +203,6 @@ Methods
:toctree: api/
CustomBusinessHour.copy
- CustomBusinessHour.is_anchored
CustomBusinessHour.is_on_offset
CustomBusinessHour.is_month_start
CustomBusinessHour.is_month_end
@@ -244,7 +237,6 @@ Methods
:toctree: api/
MonthEnd.copy
- MonthEnd.is_anchored
MonthEnd.is_on_offset
MonthEnd.is_month_start
MonthEnd.is_month_end
@@ -279,7 +271,6 @@ Methods
:toctree: api/
MonthBegin.copy
- MonthBegin.is_anchored
MonthBegin.is_on_offset
MonthBegin.is_month_start
MonthBegin.is_month_end
@@ -323,7 +314,6 @@ Methods
:toctree: api/
BusinessMonthEnd.copy
- BusinessMonthEnd.is_anchored
BusinessMonthEnd.is_on_offset
BusinessMonthEnd.is_month_start
BusinessMonthEnd.is_month_end
@@ -367,7 +357,6 @@ Methods
:toctree: api/
BusinessMonthBegin.copy
- BusinessMonthBegin.is_anchored
BusinessMonthBegin.is_on_offset
BusinessMonthBegin.is_month_start
BusinessMonthBegin.is_month_end
@@ -415,7 +404,6 @@ Methods
:toctree: api/
CustomBusinessMonthEnd.copy
- CustomBusinessMonthEnd.is_anchored
CustomBusinessMonthEnd.is_on_offset
CustomBusinessMonthEnd.is_month_start
CustomBusinessMonthEnd.is_month_end
@@ -463,7 +451,6 @@ Methods
:toctree: api/
CustomBusinessMonthBegin.copy
- CustomBusinessMonthBegin.is_anchored
CustomBusinessMonthBegin.is_on_offset
CustomBusinessMonthBegin.is_month_start
CustomBusinessMonthBegin.is_month_end
@@ -499,7 +486,6 @@ Methods
:toctree: api/
SemiMonthEnd.copy
- SemiMonthEnd.is_anchored
SemiMonthEnd.is_on_offset
SemiMonthEnd.is_month_start
SemiMonthEnd.is_month_end
@@ -535,7 +521,6 @@ Methods
:toctree: api/
SemiMonthBegin.copy
- SemiMonthBegin.is_anchored
SemiMonthBegin.is_on_offset
SemiMonthBegin.is_month_start
SemiMonthBegin.is_month_end
@@ -571,7 +556,6 @@ Methods
:toctree: api/
Week.copy
- Week.is_anchored
Week.is_on_offset
Week.is_month_start
Week.is_month_end
@@ -607,7 +591,6 @@ Methods
:toctree: api/
WeekOfMonth.copy
- WeekOfMonth.is_anchored
WeekOfMonth.is_on_offset
WeekOfMonth.weekday
WeekOfMonth.is_month_start
@@ -645,7 +628,6 @@ Methods
:toctree: api/
LastWeekOfMonth.copy
- LastWeekOfMonth.is_anchored
LastWeekOfMonth.is_on_offset
LastWeekOfMonth.is_month_start
LastWeekOfMonth.is_month_end
@@ -681,7 +663,6 @@ Methods
:toctree: api/
BQuarterEnd.copy
- BQuarterEnd.is_anchored
BQuarterEnd.is_on_offset
BQuarterEnd.is_month_start
BQuarterEnd.is_month_end
@@ -717,7 +698,6 @@ Methods
:toctree: api/
BQuarterBegin.copy
- BQuarterBegin.is_anchored
BQuarterBegin.is_on_offset
BQuarterBegin.is_month_start
BQuarterBegin.is_month_end
@@ -753,7 +733,6 @@ Methods
:toctree: api/
QuarterEnd.copy
- QuarterEnd.is_anchored
QuarterEnd.is_on_offset
QuarterEnd.is_month_start
QuarterEnd.is_month_end
@@ -789,7 +768,6 @@ Methods
:toctree: api/
QuarterBegin.copy
- QuarterBegin.is_anchored
QuarterBegin.is_on_offset
QuarterBegin.is_month_start
QuarterBegin.is_month_end
@@ -825,7 +803,6 @@ Methods
:toctree: api/
BYearEnd.copy
- BYearEnd.is_anchored
BYearEnd.is_on_offset
BYearEnd.is_month_start
BYearEnd.is_month_end
@@ -861,7 +838,6 @@ Methods
:toctree: api/
BYearBegin.copy
- BYearBegin.is_anchored
BYearBegin.is_on_offset
BYearBegin.is_month_start
BYearBegin.is_month_end
@@ -897,7 +873,6 @@ Methods
:toctree: api/
YearEnd.copy
- YearEnd.is_anchored
YearEnd.is_on_offset
YearEnd.is_month_start
YearEnd.is_month_end
@@ -933,7 +908,6 @@ Methods
:toctree: api/
YearBegin.copy
- YearBegin.is_anchored
YearBegin.is_on_offset
YearBegin.is_month_start
YearBegin.is_month_end
@@ -973,7 +947,6 @@ Methods
FY5253.copy
FY5253.get_rule_code_suffix
FY5253.get_year_end
- FY5253.is_anchored
FY5253.is_on_offset
FY5253.is_month_start
FY5253.is_month_end
@@ -1014,7 +987,6 @@ Methods
FY5253Quarter.copy
FY5253Quarter.get_rule_code_suffix
FY5253Quarter.get_weeks
- FY5253Quarter.is_anchored
FY5253Quarter.is_on_offset
FY5253Quarter.year_has_extra_week
FY5253Quarter.is_month_start
@@ -1050,7 +1022,6 @@ Methods
:toctree: api/
Easter.copy
- Easter.is_anchored
Easter.is_on_offset
Easter.is_month_start
Easter.is_month_end
@@ -1071,7 +1042,6 @@ Properties
.. autosummary::
:toctree: api/
- Tick.delta
Tick.freqstr
Tick.kwds
Tick.name
@@ -1086,7 +1056,6 @@ Methods
:toctree: api/
Tick.copy
- Tick.is_anchored
Tick.is_on_offset
Tick.is_month_start
Tick.is_month_end
@@ -1107,7 +1076,6 @@ Properties
.. autosummary::
:toctree: api/
- Day.delta
Day.freqstr
Day.kwds
Day.name
@@ -1122,7 +1090,6 @@ Methods
:toctree: api/
Day.copy
- Day.is_anchored
Day.is_on_offset
Day.is_month_start
Day.is_month_end
@@ -1143,7 +1110,6 @@ Properties
.. autosummary::
:toctree: api/
- Hour.delta
Hour.freqstr
Hour.kwds
Hour.name
@@ -1158,7 +1124,6 @@ Methods
:toctree: api/
Hour.copy
- Hour.is_anchored
Hour.is_on_offset
Hour.is_month_start
Hour.is_month_end
@@ -1179,7 +1144,6 @@ Properties
.. autosummary::
:toctree: api/
- Minute.delta
Minute.freqstr
Minute.kwds
Minute.name
@@ -1194,7 +1158,6 @@ Methods
:toctree: api/
Minute.copy
- Minute.is_anchored
Minute.is_on_offset
Minute.is_month_start
Minute.is_month_end
@@ -1215,7 +1178,6 @@ Properties
.. autosummary::
:toctree: api/
- Second.delta
Second.freqstr
Second.kwds
Second.name
@@ -1230,7 +1192,6 @@ Methods
:toctree: api/
Second.copy
- Second.is_anchored
Second.is_on_offset
Second.is_month_start
Second.is_month_end
@@ -1251,7 +1212,6 @@ Properties
.. autosummary::
:toctree: api/
- Milli.delta
Milli.freqstr
Milli.kwds
Milli.name
@@ -1266,7 +1226,6 @@ Methods
:toctree: api/
Milli.copy
- Milli.is_anchored
Milli.is_on_offset
Milli.is_month_start
Milli.is_month_end
@@ -1287,7 +1246,6 @@ Properties
.. autosummary::
:toctree: api/
- Micro.delta
Micro.freqstr
Micro.kwds
Micro.name
@@ -1302,7 +1260,6 @@ Methods
:toctree: api/
Micro.copy
- Micro.is_anchored
Micro.is_on_offset
Micro.is_month_start
Micro.is_month_end
@@ -1323,7 +1280,6 @@ Properties
.. autosummary::
:toctree: api/
- Nano.delta
Nano.freqstr
Nano.kwds
Nano.name
@@ -1338,7 +1294,6 @@ Methods
:toctree: api/
Nano.copy
- Nano.is_anchored
Nano.is_on_offset
Nano.is_month_start
Nano.is_month_end
diff --git a/doc/source/reference/resampling.rst b/doc/source/reference/resampling.rst
index edbc8090fc849..2e0717081b129 100644
--- a/doc/source/reference/resampling.rst
+++ b/doc/source/reference/resampling.rst
@@ -38,7 +38,6 @@ Upsampling
Resampler.ffill
Resampler.bfill
Resampler.nearest
- Resampler.fillna
Resampler.asfreq
Resampler.interpolate
diff --git a/doc/source/reference/series.rst b/doc/source/reference/series.rst
index af262f9e6c336..43d7480899dc4 100644
--- a/doc/source/reference/series.rst
+++ b/doc/source/reference/series.rst
@@ -47,7 +47,6 @@ Conversion
Series.convert_dtypes
Series.infer_objects
Series.copy
- Series.bool
Series.to_numpy
Series.to_period
Series.to_timestamp
@@ -177,17 +176,16 @@ Reindexing / selection / label manipulation
:toctree: api/
Series.align
+ Series.case_when
Series.drop
Series.droplevel
Series.drop_duplicates
Series.duplicated
Series.equals
- Series.first
Series.head
Series.idxmax
Series.idxmin
Series.isin
- Series.last
Series.reindex
Series.reindex_like
Series.rename
@@ -209,7 +207,6 @@ Missing data handling
.. autosummary::
:toctree: api/
- Series.backfill
Series.bfill
Series.dropna
Series.ffill
@@ -219,7 +216,6 @@ Missing data handling
Series.isnull
Series.notna
Series.notnull
- Series.pad
Series.replace
Reshaping, sorting
@@ -237,10 +233,8 @@ Reshaping, sorting
Series.unstack
Series.explode
Series.searchsorted
- Series.ravel
Series.repeat
Series.squeeze
- Series.view
Combining / comparing / joining / merging
-----------------------------------------
@@ -341,7 +335,6 @@ Datetime properties
Series.dt.tz
Series.dt.freq
Series.dt.unit
- Series.dt.normalize
Datetime methods
^^^^^^^^^^^^^^^^
diff --git a/doc/source/reference/style.rst b/doc/source/reference/style.rst
index 2256876c93e01..0e1d93841d52f 100644
--- a/doc/source/reference/style.rst
+++ b/doc/source/reference/style.rst
@@ -41,6 +41,7 @@ Style application
Styler.map_index
Styler.format
Styler.format_index
+ Styler.format_index_names
Styler.relabel_index
Styler.hide
Styler.concat
diff --git a/doc/source/reference/testing.rst b/doc/source/reference/testing.rst
index a5d61703aceed..1f164d1aa98b4 100644
--- a/doc/source/reference/testing.rst
+++ b/doc/source/reference/testing.rst
@@ -58,8 +58,6 @@ Exceptions and warnings
errors.PossiblePrecisionLoss
errors.PyperclipException
errors.PyperclipWindowsException
- errors.SettingWithCopyError
- errors.SettingWithCopyWarning
errors.SpecificationError
errors.UndefinedVariableError
errors.UnsortedIndexError
diff --git a/doc/source/user_guide/10min.rst b/doc/source/user_guide/10min.rst
index c8e67710c85a9..72bb93d21a99f 100644
--- a/doc/source/user_guide/10min.rst
+++ b/doc/source/user_guide/10min.rst
@@ -19,7 +19,7 @@ Customarily, we import as follows:
Basic data structures in pandas
-------------------------------
-Pandas provides two types of classes for handling data:
+pandas provides two types of classes for handling data:
1. :class:`Series`: a one-dimensional labeled array holding data of any type
such as integers, strings, Python objects etc.
@@ -91,8 +91,8 @@ will be completed:
df2.any df2.combine
df2.append df2.D
df2.apply df2.describe
- df2.applymap df2.diff
df2.B df2.duplicated
+ df2.diff
As you can see, the columns ``A``, ``B``, ``C``, and ``D`` are automatically
tab completed. ``E`` and ``F`` are there as well; the rest of the attributes have been
@@ -101,7 +101,7 @@ truncated for brevity.
Viewing data
------------
-See the :ref:`Essentially basics functionality section `.
+See the :ref:`Essential basic functionality section `.
Use :meth:`DataFrame.head` and :meth:`DataFrame.tail` to view the top and bottom rows of the frame
respectively:
@@ -177,7 +177,7 @@ See the indexing documentation :ref:`Indexing and Selecting Data ` and
Getitem (``[]``)
~~~~~~~~~~~~~~~~
-For a :class:`DataFrame`, passing a single label selects a columns and
+For a :class:`DataFrame`, passing a single label selects a column and
yields a :class:`Series` equivalent to ``df.A``:
.. ipython:: python
@@ -563,7 +563,7 @@ columns:
.. ipython:: python
- stacked = df2.stack(future_stack=True)
+ stacked = df2.stack()
stacked
With a "stacked" DataFrame or Series (having a :class:`MultiIndex` as the
diff --git a/doc/source/user_guide/advanced.rst b/doc/source/user_guide/advanced.rst
index 453536098cfbb..f7ab466e92d93 100644
--- a/doc/source/user_guide/advanced.rst
+++ b/doc/source/user_guide/advanced.rst
@@ -11,13 +11,6 @@ and :ref:`other advanced indexing features `.
See the :ref:`Indexing and Selecting Data ` for general indexing documentation.
-.. warning::
-
- Whether a copy or a reference is returned for a setting operation may
- depend on the context. This is sometimes called ``chained assignment`` and
- should be avoided. See :ref:`Returning a View versus Copy
- `.
-
See the :ref:`cookbook` for some advanced strategies.
.. _advanced.hierarchical:
@@ -402,6 +395,7 @@ slicers on a single axis.
Furthermore, you can *set* the values using the following methods.
.. ipython:: python
+ :okwarning:
df2 = dfmi.copy()
df2.loc(axis=0)[:, :, ["C1", "C3"]] = -10
diff --git a/doc/source/user_guide/basics.rst b/doc/source/user_guide/basics.rst
index f7d89110e6c8f..ffd7a2ad7bb01 100644
--- a/doc/source/user_guide/basics.rst
+++ b/doc/source/user_guide/basics.rst
@@ -155,17 +155,6 @@ speedups. ``numexpr`` uses smart chunking, caching, and multiple cores. ``bottle
a set of specialized cython routines that are especially fast when dealing with arrays that have
``nans``.
-Here is a sample (using 100 column x 100,000 row ``DataFrames``):
-
-.. csv-table::
- :header: "Operation", "0.11.0 (ms)", "Prior Version (ms)", "Ratio to Prior"
- :widths: 25, 25, 25, 25
- :delim: ;
-
- ``df1 > df2``; 13.32; 125.35; 0.1063
- ``df1 * df2``; 21.71; 36.63; 0.5928
- ``df1 + df2``; 22.04; 36.50; 0.6039
-
You are highly encouraged to install both libraries. See the section
:ref:`Recommended Dependencies ` for more installation info.
@@ -299,8 +288,7 @@ Boolean reductions
~~~~~~~~~~~~~~~~~~
You can apply the reductions: :attr:`~DataFrame.empty`, :meth:`~DataFrame.any`,
-:meth:`~DataFrame.all`, and :meth:`~DataFrame.bool` to provide a
-way to summarize a boolean result.
+:meth:`~DataFrame.all`.
.. ipython:: python
@@ -477,15 +465,15 @@ For example:
.. ipython:: python
df
- df.mean(0)
- df.mean(1)
+ df.mean(axis=0)
+ df.mean(axis=1)
All such methods have a ``skipna`` option signaling whether to exclude missing
data (``True`` by default):
.. ipython:: python
- df.sum(0, skipna=False)
+ df.sum(axis=0, skipna=False)
df.sum(axis=1, skipna=True)
Combined with the broadcasting / arithmetic behavior, one can describe various
@@ -496,8 +484,8 @@ standard deviation of 1), very concisely:
ts_stand = (df - df.mean()) / df.std()
ts_stand.std()
- xs_stand = df.sub(df.mean(1), axis=0).div(df.std(1), axis=0)
- xs_stand.std(1)
+ xs_stand = df.sub(df.mean(axis=1), axis=0).div(df.std(axis=1), axis=0)
+ xs_stand.std(axis=1)
Note that methods like :meth:`~DataFrame.cumsum` and :meth:`~DataFrame.cumprod`
preserve the location of ``NaN`` values. This is somewhat different from
@@ -1309,8 +1297,8 @@ filling method chosen from the following table:
:header: "Method", "Action"
:widths: 30, 50
- pad / ffill, Fill values forward
- bfill / backfill, Fill values backward
+ ffill, Fill values forward
+ bfill, Fill values backward
nearest, Fill from the nearest index value
We illustrate these fill methods on a simple Series:
@@ -1608,7 +1596,7 @@ For instance:
This method does not convert the row to a Series object; it merely
returns the values inside a namedtuple. Therefore,
:meth:`~DataFrame.itertuples` preserves the data type of the values
-and is generally faster as :meth:`~DataFrame.iterrows`.
+and is generally faster than :meth:`~DataFrame.iterrows`.
.. note::
diff --git a/doc/source/user_guide/boolean.rst b/doc/source/user_guide/boolean.rst
index 3c361d4de17e5..7de0430123fd2 100644
--- a/doc/source/user_guide/boolean.rst
+++ b/doc/source/user_guide/boolean.rst
@@ -37,6 +37,19 @@ If you would prefer to keep the ``NA`` values you can manually fill them with ``
s[mask.fillna(True)]
+If you create a column of ``NA`` values (for example to fill them later)
+with ``df['new_col'] = pd.NA``, the ``dtype`` of the new column would be set
+to ``object``. Operations on such a column will be slower than with
+the appropriate type. It's better to use
+``df['new_col'] = pd.Series(pd.NA, dtype="boolean")``
+(or another ``dtype`` that supports ``NA``).
+
+.. ipython:: python
+
+ df = pd.DataFrame()
+ df['objects'] = pd.NA
+ df.dtypes
+
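+For comparison, a short illustration of the recommended pattern, which
+preserves the nullable ``dtype``:
+
+.. ipython:: python
+
+    df = pd.DataFrame()
+    df['booleans'] = pd.Series(pd.NA, dtype="boolean")
+    df.dtypes
+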
.. _boolean.kleene:
Kleene logical operations
diff --git a/doc/source/user_guide/categorical.rst b/doc/source/user_guide/categorical.rst
index 8fb991dca02db..1e7d66dfeb142 100644
--- a/doc/source/user_guide/categorical.rst
+++ b/doc/source/user_guide/categorical.rst
@@ -245,7 +245,8 @@ Equality semantics
Two instances of :class:`~pandas.api.types.CategoricalDtype` compare equal
whenever they have the same categories and order. When comparing two
-unordered categoricals, the order of the ``categories`` is not considered.
+unordered categoricals, the order of the ``categories`` is not considered. Note
+that two instances whose ``categories`` have different dtypes are not considered equal.
.. ipython:: python
@@ -263,6 +264,16 @@ All instances of ``CategoricalDtype`` compare equal to the string ``'category'``
c1 == "category"
+Notice that the ``categories_dtype`` should be considered, especially
+when comparing two empty ``CategoricalDtype`` instances.
+
+.. ipython:: python
+
+ c2 = pd.Categorical(np.array([], dtype=object))
+ c3 = pd.Categorical(np.array([], dtype=float))
+
+ c2.dtype == c3.dtype
+
Description
-----------
@@ -782,7 +793,7 @@ Assigning a ``Categorical`` to parts of a column of other types will use the val
:okwarning:
df = pd.DataFrame({"a": [1, 1, 1, 1, 1], "b": ["a", "a", "a", "a", "a"]})
- df.loc[1:2, "a"] = pd.Categorical(["b", "b"], categories=["a", "b"])
+ df.loc[1:2, "a"] = pd.Categorical([2, 2], categories=[2, 3])
df.loc[2:3, "b"] = pd.Categorical(["b", "b"], categories=["a", "b"])
df
df.dtypes
diff --git a/doc/source/user_guide/cookbook.rst b/doc/source/user_guide/cookbook.rst
index b1a6aa8753be1..42430fb1fbba0 100644
--- a/doc/source/user_guide/cookbook.rst
+++ b/doc/source/user_guide/cookbook.rst
@@ -311,7 +311,7 @@ The :ref:`multindexing ` docs.
df.columns = pd.MultiIndex.from_tuples([tuple(c.split("_")) for c in df.columns])
df
# Now stack & Reset
- df = df.stack(0, future_stack=True).reset_index(1)
+ df = df.stack(0).reset_index(1)
df
# And fix the labels (Notice the label 'level_1' got added automatically)
df.columns = ["Sample", "All_X", "All_Y"]
@@ -688,7 +688,7 @@ The :ref:`Pivot ` docs.
aggfunc="sum",
margins=True,
)
- table.stack("City", future_stack=True)
+ table.stack("City")
`Frequency table like plyr in R
`__
@@ -914,7 +914,7 @@ Using TimeGrouper and another grouping to create subgroups, then apply a custom
`__
`Resample intraday frame without adding new days
-`__
+`__
`Resample minute data
`__
diff --git a/doc/source/user_guide/copy_on_write.rst b/doc/source/user_guide/copy_on_write.rst
index 050c3901c3420..90353d9f49f00 100644
--- a/doc/source/user_guide/copy_on_write.rst
+++ b/doc/source/user_guide/copy_on_write.rst
@@ -8,16 +8,12 @@ Copy-on-Write (CoW)
.. note::
- Copy-on-Write will become the default in pandas 3.0. We recommend
- :ref:`turning it on now `
- to benefit from all improvements.
+ Copy-on-Write is now the default with pandas 3.0.
Copy-on-Write was first introduced in version 1.5.0. Starting from version 2.0 most of the
optimizations that become possible through CoW are implemented and supported. All possible
optimizations are supported starting from pandas 2.1.
-CoW will be enabled by default in version 3.0.
-
CoW will lead to more predictable behavior since it is not possible to update more than
one object with one statement, e.g. indexing operations or methods won't have side-effects. Additionally, through
delaying copies as long as possible, the average performance and memory usage will improve.
@@ -29,21 +25,25 @@ pandas indexing behavior is tricky to understand. Some operations return views w
other return copies. Depending on the result of the operation, mutating one object
might accidentally mutate another:
-.. ipython:: python
+.. code-block:: ipython
- df = pd.DataFrame({"foo": [1, 2, 3], "bar": [4, 5, 6]})
- subset = df["foo"]
- subset.iloc[0] = 100
- df
+ In [1]: df = pd.DataFrame({"foo": [1, 2, 3], "bar": [4, 5, 6]})
+ In [2]: subset = df["foo"]
+ In [3]: subset.iloc[0] = 100
+ In [4]: df
+ Out[4]:
+ foo bar
+ 0 100 4
+ 1 2 5
+ 2 3 6
-Mutating ``subset``, e.g. updating its values, also updates ``df``. The exact behavior is
+
+Mutating ``subset``, e.g. updating its values, also updated ``df``. The exact behavior was
hard to predict. Copy-on-Write solves accidentally modifying more than one object,
-it explicitly disallows this. With CoW enabled, ``df`` is unchanged:
+by explicitly disallowing it. ``df`` is unchanged:
.. ipython:: python
- pd.options.mode.copy_on_write = True
-
df = pd.DataFrame({"foo": [1, 2, 3], "bar": [4, 5, 6]})
subset = df["foo"]
subset.iloc[0] = 100
@@ -57,13 +57,13 @@ applications.
Migrating to Copy-on-Write
--------------------------
-Copy-on-Write will be the default and only mode in pandas 3.0. This means that users
+Copy-on-Write is the default and only mode in pandas 3.0. This means that users
need to migrate their code to be compliant with CoW rules.
-The default mode in pandas will raise warnings for certain cases that will actively
+The default mode in pandas < 3.0 raises warnings for certain cases that will actively
change behavior and thus change user intended behavior.
-We added another mode, e.g.
+pandas 2.2 has a warning mode that can be enabled with:
.. code-block:: python
@@ -84,7 +84,6 @@ The following few items describe the user visible changes:
**Accessing the underlying array of a pandas object will return a read-only view**
-
.. ipython:: python
ser = pd.Series([1, 2, 3])
@@ -101,16 +100,21 @@ for more details.
**Only one pandas object is updated at once**
-The following code snippet updates both ``df`` and ``subset`` without CoW:
+The following code snippet updated both ``df`` and ``subset`` without CoW:
-.. ipython:: python
+.. code-block:: ipython
- df = pd.DataFrame({"foo": [1, 2, 3], "bar": [4, 5, 6]})
- subset = df["foo"]
- subset.iloc[0] = 100
- df
+ In [1]: df = pd.DataFrame({"foo": [1, 2, 3], "bar": [4, 5, 6]})
+ In [2]: subset = df["foo"]
+ In [3]: subset.iloc[0] = 100
+ In [4]: df
+ Out[4]:
+ foo bar
+ 0 100 4
+ 1 2 5
+ 2 3 6
-This won't be possible anymore with CoW, since the CoW rules explicitly forbid this.
+This is not possible anymore with CoW, since the CoW rules explicitly forbid this.
This includes updating a single column as a :class:`Series` and relying on the change
propagating back to the parent :class:`DataFrame`.
This statement can be rewritten into a single statement with ``loc`` or ``iloc`` if
@@ -146,7 +150,7 @@ A different alternative would be to not use ``inplace``:
**Constructors now copy NumPy arrays by default**
-The Series and DataFrame constructors will now copy NumPy array by default when not
+The Series and DataFrame constructors now copy NumPy arrays by default when not
otherwise specified. This was changed to avoid mutating a pandas object when the
NumPy array is changed inplace outside of pandas. You can set ``copy=False`` to
avoid this copy.
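+A short sketch of the new default and the explicit opt-out:
+
+.. code-block:: python
+
+    arr = np.array([1, 2, 3])
+    ser = pd.Series(arr)  # the array is copied by default
+    arr[0] = 100  # ser is unaffected by this mutation
+    ser_shared = pd.Series(arr, copy=False)  # opt out and share the buffer
+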
@@ -162,7 +166,7 @@ that shares data with another DataFrame or Series object inplace.
This avoids side-effects when modifying values and hence, most methods can avoid
actually copying the data and only trigger a copy when necessary.
-The following example will operate inplace with CoW:
+The following example will operate inplace:
.. ipython:: python
@@ -207,15 +211,17 @@ listed in :ref:`Copy-on-Write optimizations `.
Previously, when operating on views, the view and the parent object was modified:
-.. ipython:: python
-
- with pd.option_context("mode.copy_on_write", False):
- df = pd.DataFrame({"foo": [1, 2, 3], "bar": [4, 5, 6]})
- view = df[:]
- df.iloc[0, 0] = 100
+.. code-block:: ipython
- df
- view
+    In [1]: df = pd.DataFrame({"foo": [1, 2, 3], "bar": [4, 5, 6]})
+    In [2]: view = df[:]
+    In [3]: df.iloc[0, 0] = 100
+    In [4]: df
+    Out[4]:
+       foo  bar
+    0  100    4
+    1    2    5
+    2    3    6
+    In [5]: view
+    Out[5]:
+       foo  bar
+    0  100    4
+    1    2    5
+    2    3    6
CoW triggers a copy when ``df`` is changed to avoid mutating ``view`` as well:
@@ -236,16 +242,19 @@ Chained Assignment
Chained assignment references a technique where an object is updated through
two subsequent indexing operations, e.g.
-.. ipython:: python
- :okwarning:
+.. code-block:: ipython
- with pd.option_context("mode.copy_on_write", False):
- df = pd.DataFrame({"foo": [1, 2, 3], "bar": [4, 5, 6]})
- df["foo"][df["bar"] > 5] = 100
- df
+ In [1]: df = pd.DataFrame({"foo": [1, 2, 3], "bar": [4, 5, 6]})
+ In [2]: df["foo"][df["bar"] > 5] = 100
+ In [3]: df
+ Out[3]:
+ foo bar
+    0    1    4
+    1    2    5
+    2  100    6
-The column ``foo`` is updated where the column ``bar`` is greater than 5.
-This violates the CoW principles though, because it would have to modify the
+The column ``foo`` was updated where the column ``bar`` is greater than 5.
+This violated the CoW principles though, because it would have to modify the
view ``df["foo"]`` and ``df`` in one step. Hence, chained assignment will
consistently never work and raise a ``ChainedAssignmentError`` warning
with CoW enabled:
@@ -272,7 +281,6 @@ shares data with the initial DataFrame:
The array is a copy if the initial DataFrame consists of more than one array:
-
.. ipython:: python
df = pd.DataFrame({"a": [1, 2], "b": [1.5, 2.5]})
@@ -295,7 +303,7 @@ This array is read-only, which means that it can't be modified inplace:
The same holds true for a Series, since a Series always consists of a single array.
-There are two potential solution to this:
+There are two potential solutions to this:
- Trigger a copy manually if you want to avoid updating DataFrames that share memory with your array.
- Make the array writeable. This is a more performant solution but circumvents Copy-on-Write rules, so
@@ -317,7 +325,7 @@ you are modifying one object inplace.
.. ipython:: python
df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
- df2 = df.reset_index()
+ df2 = df.reset_index(drop=True)
df2.iloc[0, 0] = 100
This creates two objects that share data and thus the setitem operation will trigger a
@@ -328,7 +336,7 @@ held by the object.
.. ipython:: python
df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
- df = df.reset_index()
+ df = df.reset_index(drop=True)
df.iloc[0, 0] = 100
No copy is necessary in this example.
@@ -347,22 +355,3 @@ and :meth:`DataFrame.rename`.
These methods return views when Copy-on-Write is enabled, which provides a significant
performance improvement compared to the regular execution.
-
-.. _copy_on_write_enabling:
-
-How to enable CoW
------------------
-
-Copy-on-Write can be enabled through the configuration option ``copy_on_write``. The option can
-be turned on __globally__ through either of the following:
-
-.. ipython:: python
-
- pd.set_option("mode.copy_on_write", True)
-
- pd.options.mode.copy_on_write = True
-
-.. ipython:: python
- :suppress:
-
- pd.options.mode.copy_on_write = False
diff --git a/doc/source/user_guide/dsintro.rst b/doc/source/user_guide/dsintro.rst
index d1e981ee1bbdc..b9c285ca30c96 100644
--- a/doc/source/user_guide/dsintro.rst
+++ b/doc/source/user_guide/dsintro.rst
@@ -41,8 +41,8 @@ Here, ``data`` can be many different things:
* an ndarray
* a scalar value (like 5)
-The passed **index** is a list of axis labels. Thus, this separates into a few
-cases depending on what **data is**:
+The passed **index** is a list of axis labels. The constructor's behavior
+depends on the type of **data**:
**From ndarray**
@@ -87,8 +87,9 @@ index will be pulled out.
**From scalar value**
-If ``data`` is a scalar value, an index must be
-provided. The value will be repeated to match the length of **index**.
+If ``data`` is a scalar value, the value will be repeated to match
+the length of **index**. If the **index** is not provided, it defaults
+to ``RangeIndex(1)``.
.. ipython:: python
@@ -97,7 +98,7 @@ provided. The value will be repeated to match the length of **index**.
Series is ndarray-like
~~~~~~~~~~~~~~~~~~~~~~
-:class:`Series` acts very similarly to a ``ndarray`` and is a valid argument to most NumPy functions.
+:class:`Series` acts very similarly to a :class:`numpy.ndarray` and is a valid argument to most NumPy functions.
However, operations such as slicing will also slice the index.
.. ipython:: python
@@ -111,7 +112,7 @@ However, operations such as slicing will also slice the index.
.. note::
We will address array-based indexing like ``s.iloc[[4, 3, 1]]``
- in :ref:`section on indexing `.
+ in the :ref:`section on indexing `.
Like a NumPy array, a pandas :class:`Series` has a single :attr:`~Series.dtype`.
diff --git a/doc/source/user_guide/enhancingperf.rst b/doc/source/user_guide/enhancingperf.rst
index 8c510173819e0..c4721f3a6b09c 100644
--- a/doc/source/user_guide/enhancingperf.rst
+++ b/doc/source/user_guide/enhancingperf.rst
@@ -453,7 +453,7 @@ by evaluate arithmetic and boolean expression all at once for large :class:`~pan
:func:`~pandas.eval` is many orders of magnitude slower for
smaller expressions or objects than plain Python. A good rule of thumb is
to only use :func:`~pandas.eval` when you have a
- :class:`.DataFrame` with more than 10,000 rows.
+ :class:`~pandas.core.frame.DataFrame` with more than 10,000 rows.
Supported syntax
~~~~~~~~~~~~~~~~
diff --git a/doc/source/user_guide/gotchas.rst b/doc/source/user_guide/gotchas.rst
index 99c85ac66623d..26eb656357bf6 100644
--- a/doc/source/user_guide/gotchas.rst
+++ b/doc/source/user_guide/gotchas.rst
@@ -315,19 +315,8 @@ Why not make NumPy like R?
Many people have suggested that NumPy should simply emulate the ``NA`` support
present in the more domain-specific statistical programming language `R
-`__. Part of the reason is the NumPy type hierarchy:
-
-.. csv-table::
- :header: "Typeclass","Dtypes"
- :widths: 30,70
- :delim: |
-
- ``numpy.floating`` | ``float16, float32, float64, float128``
- ``numpy.integer`` | ``int8, int16, int32, int64``
- ``numpy.unsignedinteger`` | ``uint8, uint16, uint32, uint64``
- ``numpy.object_`` | ``object_``
- ``numpy.bool_`` | ``bool_``
- ``numpy.character`` | ``bytes_, str_``
+`__. Part of the reason is the
+`NumPy type hierarchy `__.
The R language, by contrast, only has a handful of built-in data types:
``integer``, ``numeric`` (floating-point), ``character``, and
diff --git a/doc/source/user_guide/groupby.rst b/doc/source/user_guide/groupby.rst
index 11863f8aead31..8c80fa7052dd5 100644
--- a/doc/source/user_guide/groupby.rst
+++ b/doc/source/user_guide/groupby.rst
@@ -137,15 +137,6 @@ We could naturally group by either the ``A`` or ``B`` columns, or both:
``df.groupby('A')`` is just syntactic sugar for ``df.groupby(df['A'])``.
-If we also have a MultiIndex on columns ``A`` and ``B``, we can group by all
-the columns except the one we specify:
-
-.. ipython:: python
-
- df2 = df.set_index(["A", "B"])
- grouped = df2.groupby(level=df2.index.names.difference(["B"]))
- grouped.sum()
-
The above GroupBy will split the DataFrame on its index (rows). To split by columns, first do
a transpose:
@@ -247,7 +238,7 @@ GroupBy object attributes
~~~~~~~~~~~~~~~~~~~~~~~~~
The ``groups`` attribute is a dictionary whose keys are the computed unique groups
-and corresponding values are the axis labels belonging to each group. In the
+and corresponding values are the index labels belonging to each group. In the
above example we have:
.. ipython:: python
@@ -289,7 +280,7 @@ the number of groups, which is the same as the length of the ``groups`` dictiona
In [1]: gb. # noqa: E225, E999
gb.agg gb.boxplot gb.cummin gb.describe gb.filter gb.get_group gb.height gb.last gb.median gb.ngroups gb.plot gb.rank gb.std gb.transform
gb.aggregate gb.count gb.cumprod gb.dtype gb.first gb.groups gb.hist gb.max gb.min gb.nth gb.prod gb.resample gb.sum gb.var
- gb.apply gb.cummax gb.cumsum gb.fillna gb.gender gb.head gb.indices gb.mean gb.name gb.ohlc gb.quantile gb.size gb.tail gb.weight
+ gb.apply gb.cummax gb.cumsum gb.gender gb.head gb.indices gb.mean gb.name gb.ohlc gb.quantile gb.size gb.tail gb.weight
.. _groupby.multiindex:
@@ -425,6 +416,12 @@ You can also include the grouping columns if you want to operate on them.
grouped[["A", "B"]].sum()
+.. note::
+
+   The ``groupby`` operation in pandas drops the ``name`` field of the columns ``Index`` object
+ after the operation. This change ensures consistency in syntax between different
+ column selection methods within groupby operations.
+
.. _groupby.iterating-label:
Iterating through groups
@@ -509,29 +506,28 @@ listed below, those with a ``*`` do *not* have an efficient, GroupBy-specific, i
.. csv-table::
:header: "Method", "Description"
:widths: 20, 80
- :delim: ;
-
- :meth:`~.DataFrameGroupBy.any`;Compute whether any of the values in the groups are truthy
- :meth:`~.DataFrameGroupBy.all`;Compute whether all of the values in the groups are truthy
- :meth:`~.DataFrameGroupBy.count`;Compute the number of non-NA values in the groups
- :meth:`~.DataFrameGroupBy.cov` * ;Compute the covariance of the groups
- :meth:`~.DataFrameGroupBy.first`;Compute the first occurring value in each group
- :meth:`~.DataFrameGroupBy.idxmax`;Compute the index of the maximum value in each group
- :meth:`~.DataFrameGroupBy.idxmin`;Compute the index of the minimum value in each group
- :meth:`~.DataFrameGroupBy.last`;Compute the last occurring value in each group
- :meth:`~.DataFrameGroupBy.max`;Compute the maximum value in each group
- :meth:`~.DataFrameGroupBy.mean`;Compute the mean of each group
- :meth:`~.DataFrameGroupBy.median`;Compute the median of each group
- :meth:`~.DataFrameGroupBy.min`;Compute the minimum value in each group
- :meth:`~.DataFrameGroupBy.nunique`;Compute the number of unique values in each group
- :meth:`~.DataFrameGroupBy.prod`;Compute the product of the values in each group
- :meth:`~.DataFrameGroupBy.quantile`;Compute a given quantile of the values in each group
- :meth:`~.DataFrameGroupBy.sem`;Compute the standard error of the mean of the values in each group
- :meth:`~.DataFrameGroupBy.size`;Compute the number of values in each group
- :meth:`~.DataFrameGroupBy.skew` *;Compute the skew of the values in each group
- :meth:`~.DataFrameGroupBy.std`;Compute the standard deviation of the values in each group
- :meth:`~.DataFrameGroupBy.sum`;Compute the sum of the values in each group
- :meth:`~.DataFrameGroupBy.var`;Compute the variance of the values in each group
+
+ :meth:`~.DataFrameGroupBy.any`,Compute whether any of the values in the groups are truthy
+ :meth:`~.DataFrameGroupBy.all`,Compute whether all of the values in the groups are truthy
+ :meth:`~.DataFrameGroupBy.count`,Compute the number of non-NA values in the groups
+ :meth:`~.DataFrameGroupBy.cov` * ,Compute the covariance of the groups
+ :meth:`~.DataFrameGroupBy.first`,Compute the first occurring value in each group
+ :meth:`~.DataFrameGroupBy.idxmax`,Compute the index of the maximum value in each group
+ :meth:`~.DataFrameGroupBy.idxmin`,Compute the index of the minimum value in each group
+ :meth:`~.DataFrameGroupBy.last`,Compute the last occurring value in each group
+ :meth:`~.DataFrameGroupBy.max`,Compute the maximum value in each group
+ :meth:`~.DataFrameGroupBy.mean`,Compute the mean of each group
+ :meth:`~.DataFrameGroupBy.median`,Compute the median of each group
+ :meth:`~.DataFrameGroupBy.min`,Compute the minimum value in each group
+ :meth:`~.DataFrameGroupBy.nunique`,Compute the number of unique values in each group
+ :meth:`~.DataFrameGroupBy.prod`,Compute the product of the values in each group
+ :meth:`~.DataFrameGroupBy.quantile`,Compute a given quantile of the values in each group
+ :meth:`~.DataFrameGroupBy.sem`,Compute the standard error of the mean of the values in each group
+ :meth:`~.DataFrameGroupBy.size`,Compute the number of values in each group
+ :meth:`~.DataFrameGroupBy.skew` * ,Compute the skew of the values in each group
+ :meth:`~.DataFrameGroupBy.std`,Compute the standard deviation of the values in each group
+ :meth:`~.DataFrameGroupBy.sum`,Compute the sum of the values in each group
+ :meth:`~.DataFrameGroupBy.var`,Compute the variance of the values in each group
Some examples:
@@ -672,8 +668,9 @@ column, which produces an aggregated result with a hierarchical column index:
grouped[["C", "D"]].agg(["sum", "mean", "std"])
-The resulting aggregations are named after the functions themselves. If you
-need to rename, then you can add in a chained operation for a ``Series`` like this:
+The resulting aggregations are named after the functions themselves.
+
+For a ``Series``, if you need to rename, you can add in a chained operation like this:
.. ipython:: python
@@ -683,8 +680,19 @@ need to rename, then you can add in a chained operation for a ``Series`` like th
.rename(columns={"sum": "foo", "mean": "bar", "std": "baz"})
)
+Or, you can simply pass a list of tuples, each containing the name of the new column and the aggregate function:
+
+.. ipython:: python
+
+ (
+ grouped["C"]
+ .agg([("foo", "sum"), ("bar", "mean"), ("baz", "std")])
+ )
+
For a grouped ``DataFrame``, you can rename in a similar manner:
+By chaining the ``rename`` operation:
+
.. ipython:: python
(
@@ -693,6 +701,16 @@ For a grouped ``DataFrame``, you can rename in a similar manner:
)
)
+Or, by passing a list of tuples:
+
+.. ipython:: python
+
+ (
+ grouped[["C", "D"]].agg(
+ [("foo", "sum"), ("bar", "mean"), ("baz", "std")]
+ )
+ )
+
.. note::
In general, the output column names should be unique, but pandas will allow
@@ -835,19 +853,18 @@ The following methods on GroupBy act as transformations.
.. csv-table::
:header: "Method", "Description"
:widths: 20, 80
- :delim: ;
-
- :meth:`~.DataFrameGroupBy.bfill`;Back fill NA values within each group
- :meth:`~.DataFrameGroupBy.cumcount`;Compute the cumulative count within each group
- :meth:`~.DataFrameGroupBy.cummax`;Compute the cumulative max within each group
- :meth:`~.DataFrameGroupBy.cummin`;Compute the cumulative min within each group
- :meth:`~.DataFrameGroupBy.cumprod`;Compute the cumulative product within each group
- :meth:`~.DataFrameGroupBy.cumsum`;Compute the cumulative sum within each group
- :meth:`~.DataFrameGroupBy.diff`;Compute the difference between adjacent values within each group
- :meth:`~.DataFrameGroupBy.ffill`;Forward fill NA values within each group
- :meth:`~.DataFrameGroupBy.pct_change`;Compute the percent change between adjacent values within each group
- :meth:`~.DataFrameGroupBy.rank`;Compute the rank of each value within each group
- :meth:`~.DataFrameGroupBy.shift`;Shift values up or down within each group
+
+ :meth:`~.DataFrameGroupBy.bfill`,Back fill NA values within each group
+ :meth:`~.DataFrameGroupBy.cumcount`,Compute the cumulative count within each group
+ :meth:`~.DataFrameGroupBy.cummax`,Compute the cumulative max within each group
+ :meth:`~.DataFrameGroupBy.cummin`,Compute the cumulative min within each group
+ :meth:`~.DataFrameGroupBy.cumprod`,Compute the cumulative product within each group
+ :meth:`~.DataFrameGroupBy.cumsum`,Compute the cumulative sum within each group
+ :meth:`~.DataFrameGroupBy.diff`,Compute the difference between adjacent values within each group
+ :meth:`~.DataFrameGroupBy.ffill`,Forward fill NA values within each group
+ :meth:`~.DataFrameGroupBy.pct_change`,Compute the percent change between adjacent values within each group
+ :meth:`~.DataFrameGroupBy.rank`,Compute the rank of each value within each group
+ :meth:`~.DataFrameGroupBy.shift`,Shift values up or down within each group
In addition, passing any built-in aggregation method as a string to
:meth:`~.DataFrameGroupBy.transform` (see the next section) will broadcast the result
@@ -1095,11 +1112,10 @@ efficient, GroupBy-specific, implementation.
.. csv-table::
:header: "Method", "Description"
:widths: 20, 80
- :delim: ;
- :meth:`~.DataFrameGroupBy.head`;Select the top row(s) of each group
- :meth:`~.DataFrameGroupBy.nth`;Select the nth row(s) of each group
- :meth:`~.DataFrameGroupBy.tail`;Select the bottom row(s) of each group
+ :meth:`~.DataFrameGroupBy.head`,Select the top row(s) of each group
+ :meth:`~.DataFrameGroupBy.nth`,Select the nth row(s) of each group
+ :meth:`~.DataFrameGroupBy.tail`,Select the bottom row(s) of each group
Users can also use transformations along with Boolean indexing to construct complex
filtrations within groups. For example, suppose we are given groups of products and
@@ -1730,4 +1746,4 @@ column index name will be used as the name of the inserted column:
result
- result.stack(future_stack=True)
+ result.stack()
diff --git a/doc/source/user_guide/indexing.rst b/doc/source/user_guide/indexing.rst
index 4954ee1538697..503f7cc7cbe73 100644
--- a/doc/source/user_guide/indexing.rst
+++ b/doc/source/user_guide/indexing.rst
@@ -29,13 +29,6 @@ this area.
production code, we recommended that you take advantage of the optimized
pandas data access methods exposed in this chapter.
-.. warning::
-
- Whether a copy or a reference is returned for a setting operation, may
- depend on the context. This is sometimes called ``chained assignment`` and
- should be avoided. See :ref:`Returning a View versus Copy
- `.
-
See the :ref:`MultiIndex / Advanced Indexing ` for ``MultiIndex`` and more advanced indexing documentation.
See the :ref:`cookbook` for some advanced strategies.
@@ -101,13 +94,14 @@ well). Any of the axes accessors may be the null slice ``:``. Axes left out of
the specification are assumed to be ``:``, e.g. ``p.loc['a']`` is equivalent to
``p.loc['a', :]``.
-.. csv-table::
- :header: "Object Type", "Indexers"
- :widths: 30, 50
- :delim: ;
- Series; ``s.loc[indexer]``
- DataFrame; ``df.loc[row_indexer,column_indexer]``
+.. ipython:: python
+
+ ser = pd.Series(range(5), index=list("abcde"))
+ ser.loc[["a", "c", "e"]]
+
+ df = pd.DataFrame(np.arange(25).reshape(5, 5), index=list("abcde"), columns=list("abcde"))
+ df.loc[["a", "c", "e"], ["b", "d"]]
.. _indexing.basics:
@@ -123,10 +117,9 @@ indexing pandas objects with ``[]``:
.. csv-table::
:header: "Object Type", "Selection", "Return Value Type"
:widths: 30, 30, 60
- :delim: ;
- Series; ``series[label]``; scalar value
- DataFrame; ``frame[colname]``; ``Series`` corresponding to colname
+ Series, ``series[label]``, scalar value
+ DataFrame, ``frame[colname]``, ``Series`` corresponding to colname
Here we construct a simple time series data set to use for illustrating the
indexing functionality:
@@ -269,6 +262,10 @@ The most robust and consistent way of slicing ranges along arbitrary axes is
described in the :ref:`Selection by Position ` section
detailing the ``.iloc`` method. For now, we explain the semantics of slicing using the ``[]`` operator.
+.. note::
+
+   When the :class:`Series` has float indices, slicing will select by position.
+
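+   For instance, a brief sketch of the behavior described here:
+
+   .. code-block:: python
+
+      ser = pd.Series([10, 20, 30], index=[1.5, 2.5, 3.5])
+      ser[1:3]  # selects positions 1 and 2 (labels 2.5 and 3.5)
+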
With Series, the syntax works exactly as with an ndarray, returning a slice of
the values and the corresponding labels:
@@ -299,12 +296,6 @@ largely as a convenience since it is such a common operation.
Selection by label
------------------
-.. warning::
-
- Whether a copy or a reference is returned for a setting operation, may depend on the context.
- This is sometimes called ``chained assignment`` and should be avoided.
- See :ref:`Returning a View versus Copy `.
-
.. warning::
``.loc`` is strict when you present slicers that are not compatible (or convertible) with the index type. For example
@@ -412,9 +403,9 @@ are returned:
s = pd.Series(list('abcde'), index=[0, 3, 2, 5, 4])
s.loc[3:5]
-If at least one of the two is absent, but the index is sorted, and can be
-compared against start and stop labels, then slicing will still work as
-expected, by selecting labels which *rank* between the two:
+If the index is sorted, and can be compared against start and stop labels,
+then slicing will still work as expected, by selecting labels which *rank*
+between the two:
.. ipython:: python
@@ -445,12 +436,6 @@ For more information about duplicate labels, see
Selection by position
---------------------
-.. warning::
-
- Whether a copy or a reference is returned for a setting operation, may depend on the context.
- This is sometimes called ``chained assignment`` and should be avoided.
- See :ref:`Returning a View versus Copy `.
-
pandas provides a suite of methods in order to get **purely integer based indexing**. The semantics follow closely Python and NumPy slicing. These are ``0-based`` indexing. When slicing, the start bound is *included*, while the upper bound is *excluded*. Trying to use a non-integer, even a **valid** label will raise an ``IndexError``.
The ``.iloc`` attribute is the primary access method. The following are valid inputs:
@@ -967,7 +952,7 @@ To select a row where each column meets its own criterion:
values = {'ids': ['a', 'b'], 'ids2': ['a', 'c'], 'vals': [1, 3]}
- row_mask = df.isin(values).all(1)
+ row_mask = df.isin(values).all(axis=1)
df[row_mask]
@@ -1722,234 +1707,10 @@ You can assign a custom index to the ``index`` attribute:
df_idx.index = pd.Index([10, 20, 30, 40], name="a")
df_idx
-.. _indexing.view_versus_copy:
-
-Returning a view versus a copy
-------------------------------
-
-.. warning::
-
- :ref:`Copy-on-Write `
- will become the new default in pandas 3.0. This means than chained indexing will
- never work. As a consequence, the ``SettingWithCopyWarning`` won't be necessary
- anymore.
- See :ref:`this section `
- for more context.
- We recommend turning Copy-on-Write on to leverage the improvements with
-
- ```
- pd.options.mode.copy_on_write = True
- ```
-
- even before pandas 3.0 is available.
-
-When setting values in a pandas object, care must be taken to avoid what is called
-``chained indexing``. Here is an example.
-
-.. ipython:: python
-
- dfmi = pd.DataFrame([list('abcd'),
- list('efgh'),
- list('ijkl'),
- list('mnop')],
- columns=pd.MultiIndex.from_product([['one', 'two'],
- ['first', 'second']]))
- dfmi
-
-Compare these two access methods:
-
-.. ipython:: python
-
- dfmi['one']['second']
-
-.. ipython:: python
-
- dfmi.loc[:, ('one', 'second')]
-
-These both yield the same results, so which should you use? It is instructive to understand the order
-of operations on these and why method 2 (``.loc``) is much preferred over method 1 (chained ``[]``).
-
-``dfmi['one']`` selects the first level of the columns and returns a DataFrame that is singly-indexed.
-Then another Python operation ``dfmi_with_one['second']`` selects the series indexed by ``'second'``.
-This is indicated by the variable ``dfmi_with_one`` because pandas sees these operations as separate events.
-e.g. separate calls to ``__getitem__``, so it has to treat them as linear operations, they happen one after another.
-
-Contrast this to ``df.loc[:,('one','second')]`` which passes a nested tuple of ``(slice(None),('one','second'))`` to a single call to
-``__getitem__``. This allows pandas to deal with this as a single entity. Furthermore this order of operations *can* be significantly
-faster, and allows one to index *both* axes if so desired.
-
Why does assignment fail when using chained indexing?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-.. warning::
-
- :ref:`Copy-on-Write `
- will become the new default in pandas 3.0. This means than chained indexing will
- never work. As a consequence, the ``SettingWithCopyWarning`` won't be necessary
- anymore.
- See :ref:`this section `
- for more context.
- We recommend turning Copy-on-Write on to leverage the improvements with
-
- ```
- pd.options.mode.copy_on_write = True
- ```
-
- even before pandas 3.0 is available.
-
-The problem in the previous section is just a performance issue. What's up with
-the ``SettingWithCopy`` warning? We don't **usually** throw warnings around when
-you do something that might cost a few extra milliseconds!
-
-But it turns out that assigning to the product of chained indexing has
-inherently unpredictable results. To see this, think about how the Python
-interpreter executes this code:
-
-.. code-block:: python
-
- dfmi.loc[:, ('one', 'second')] = value
- # becomes
- dfmi.loc.__setitem__((slice(None), ('one', 'second')), value)
-
-But this code is handled differently:
-
-.. code-block:: python
-
- dfmi['one']['second'] = value
- # becomes
- dfmi.__getitem__('one').__setitem__('second', value)
-
-See that ``__getitem__`` in there? Outside of simple cases, it's very hard to
-predict whether it will return a view or a copy (it depends on the memory layout
-of the array, about which pandas makes no guarantees), and therefore whether
-the ``__setitem__`` will modify ``dfmi`` or a temporary object that gets thrown
-out immediately afterward. **That's** what ``SettingWithCopy`` is warning you
-about!
-
-.. note:: You may be wondering whether we should be concerned about the ``loc``
- property in the first example. But ``dfmi.loc`` is guaranteed to be ``dfmi``
- itself with modified indexing behavior, so ``dfmi.loc.__getitem__`` /
- ``dfmi.loc.__setitem__`` operate on ``dfmi`` directly. Of course,
- ``dfmi.loc.__getitem__(idx)`` may be a view or a copy of ``dfmi``.
-
-Sometimes a ``SettingWithCopy`` warning will arise at times when there's no
-obvious chained indexing going on. **These** are the bugs that
-``SettingWithCopy`` is designed to catch! pandas is probably trying to warn you
-that you've done this:
-
-.. code-block:: python
-
- def do_something(df):
- foo = df[['bar', 'baz']] # Is foo a view? A copy? Nobody knows!
- # ... many lines here ...
- # We don't know whether this will modify df or not!
- foo['quux'] = value
- return foo
-
-Yikes!
-
-.. _indexing.evaluation_order:
-
-Evaluation order matters
-~~~~~~~~~~~~~~~~~~~~~~~~
-
-.. warning::
-
- :ref:`Copy-on-Write `
- will become the new default in pandas 3.0. This means than chained indexing will
- never work. As a consequence, the ``SettingWithCopyWarning`` won't be necessary
- anymore.
- See :ref:`this section `
- for more context.
- We recommend turning Copy-on-Write on to leverage the improvements with
-
- ```
- pd.options.mode.copy_on_write = True
- ```
-
- even before pandas 3.0 is available.
-
-When you use chained indexing, the order and type of the indexing operation
-partially determine whether the result is a slice into the original object, or
-a copy of the slice.
-
-pandas has the ``SettingWithCopyWarning`` because assigning to a copy of a
-slice is frequently not intentional, but a mistake caused by chained indexing
-returning a copy where a slice was expected.
-
-If you would like pandas to be more or less trusting about assignment to a
-chained indexing expression, you can set the :ref:`option `
-``mode.chained_assignment`` to one of these values:
-
-* ``'warn'``, the default, means a ``SettingWithCopyWarning`` is printed.
-* ``'raise'`` means pandas will raise a ``SettingWithCopyError``
- you have to deal with.
-* ``None`` will suppress the warnings entirely.
-
-.. ipython:: python
- :okwarning:
-
- dfb = pd.DataFrame({'a': ['one', 'one', 'two',
- 'three', 'two', 'one', 'six'],
- 'c': np.arange(7)})
-
- # This will show the SettingWithCopyWarning
- # but the frame values will be set
- dfb['c'][dfb['a'].str.startswith('o')] = 42
-
-This however is operating on a copy and will not work.
-
-.. ipython:: python
- :okwarning:
- :okexcept:
-
- with pd.option_context('mode.chained_assignment','warn'):
- dfb[dfb['a'].str.startswith('o')]['c'] = 42
-
-A chained assignment can also crop up in setting in a mixed dtype frame.
-
-.. note::
-
- These setting rules apply to all of ``.loc/.iloc``.
-
-The following is the recommended access method using ``.loc`` for multiple items (using ``mask``) and a single item using a fixed index:
-
-.. ipython:: python
-
- dfc = pd.DataFrame({'a': ['one', 'one', 'two',
- 'three', 'two', 'one', 'six'],
- 'c': np.arange(7)})
- dfd = dfc.copy()
- # Setting multiple items using a mask
- mask = dfd['a'].str.startswith('o')
- dfd.loc[mask, 'c'] = 42
- dfd
-
- # Setting a single item
- dfd = dfc.copy()
- dfd.loc[2, 'a'] = 11
- dfd
-
-The following *can* work at times, but it is not guaranteed to, and therefore should be avoided:
-
-.. ipython:: python
- :okwarning:
-
- dfd = dfc.copy()
- dfd['a'][2] = 111
- dfd
-
-Last, the subsequent example will **not** work at all, and so should be avoided:
-
-.. ipython:: python
- :okwarning:
- :okexcept:
-
- with pd.option_context('mode.chained_assignment','raise'):
- dfd.loc[0]['a'] = 1111
-
-.. warning::
-
- The chained assignment warnings / exceptions are aiming to inform the user of a possibly invalid
- assignment. There may be false positives; situations where a chained assignment is inadvertently
- reported.
+:ref:`Copy-on-Write ` is the new default with pandas 3.0.
+This means that assignment through chained indexing will never work.
+See :ref:`this section `
+for more context.
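+
+Instead, such an assignment can be written as a single ``.loc`` call,
+for example:
+
+.. code-block:: python
+
+    df = pd.DataFrame({"foo": [1, 2, 3], "bar": [4, 5, 6]})
+    df.loc[df["bar"] > 5, "foo"] = 100  # a single __setitem__ call; works under CoW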
diff --git a/doc/source/user_guide/integer_na.rst b/doc/source/user_guide/integer_na.rst
index 1a727cd78af09..76a2f22b7987d 100644
--- a/doc/source/user_guide/integer_na.rst
+++ b/doc/source/user_guide/integer_na.rst
@@ -84,6 +84,19 @@ with the dtype.
In the future, we may provide an option for :class:`Series` to infer a
nullable-integer dtype.
+If you create a column of ``NA`` values (for example to fill them later)
+with ``df['new_col'] = pd.NA``, the ``dtype`` of the new column would be set
+to ``object``. Operations on such a column will be slower than with
+the appropriate type. It's better to use
+``df['new_col'] = pd.Series(pd.NA, dtype="Int64")``
+(or another ``dtype`` that supports ``NA``).
+
+.. ipython:: python
+
+ df = pd.DataFrame()
+ df['objects'] = pd.NA
+ df.dtypes
+
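+For comparison, a short illustration of the recommended pattern, which
+preserves the nullable ``dtype``:
+
+.. ipython:: python
+
+    df = pd.DataFrame()
+    df['integers'] = pd.Series(pd.NA, dtype="Int64")
+    df.dtypes
+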
Operations
----------
diff --git a/doc/source/user_guide/io.rst b/doc/source/user_guide/io.rst
index 6148086452d54..fa64bce60caf4 100644
--- a/doc/source/user_guide/io.rst
+++ b/doc/source/user_guide/io.rst
@@ -16,27 +16,25 @@ The pandas I/O API is a set of top level ``reader`` functions accessed like
.. csv-table::
:header: "Format Type", "Data Description", "Reader", "Writer"
:widths: 30, 100, 60, 60
- :delim: ;
-
- text;`CSV `__;:ref:`read_csv`;:ref:`to_csv`
- text;Fixed-Width Text File;:ref:`read_fwf`
- text;`JSON `__;:ref:`read_json`;:ref:`to_json`
- text;`HTML `__;:ref:`read_html`;:ref:`to_html`
- text;`LaTeX `__;;:ref:`Styler.to_latex`
- text;`XML `__;:ref:`read_xml`;:ref:`to_xml`
- text; Local clipboard;:ref:`read_clipboard`;:ref:`to_clipboard`
- binary;`MS Excel `__;:ref:`read_excel`;:ref:`to_excel`
- binary;`OpenDocument `__;:ref:`read_excel`;
- binary;`HDF5 Format `__;:ref:`read_hdf`;:ref:`to_hdf`
- binary;`Feather Format `__;:ref:`read_feather`;:ref:`to_feather`
- binary;`Parquet Format `__;:ref:`read_parquet`;:ref:`to_parquet`
- binary;`ORC Format `__;:ref:`read_orc`;:ref:`to_orc`
- binary;`Stata `__;:ref:`read_stata`;:ref:`to_stata`
- binary;`SAS `__;:ref:`read_sas`;
- binary;`SPSS `__;:ref:`read_spss`;
- binary;`Python Pickle Format `__;:ref:`read_pickle`;:ref:`to_pickle`
- SQL;`SQL `__;:ref:`read_sql`;:ref:`to_sql`
- SQL;`Google BigQuery `__;:ref:`read_gbq`;:ref:`to_gbq`
+
+ text,`CSV `__, :ref:`read_csv`, :ref:`to_csv`
+ text,Fixed-Width Text File, :ref:`read_fwf` , NA
+ text,`JSON `__, :ref:`read_json`, :ref:`to_json`
+ text,`HTML `__, :ref:`read_html`, :ref:`to_html`
+    text,`LaTeX `__, NA , :ref:`Styler.to_latex`
+ text,`XML `__, :ref:`read_xml`, :ref:`to_xml`
+ text, Local clipboard, :ref:`read_clipboard`, :ref:`to_clipboard`
+ binary,`MS Excel `__ , :ref:`read_excel`, :ref:`to_excel`
+ binary,`OpenDocument `__, :ref:`read_excel`, NA
+ binary,`HDF5 Format `__, :ref:`read_hdf`, :ref:`to_hdf`
+ binary,`Feather Format `__, :ref:`read_feather`, :ref:`to_feather`
+ binary,`Parquet Format `__, :ref:`read_parquet`, :ref:`to_parquet`
+ binary,`ORC Format `__, :ref:`read_orc`, :ref:`to_orc`
+ binary,`Stata `__, :ref:`read_stata`, :ref:`to_stata`
+ binary,`SAS `__, :ref:`read_sas` , NA
+ binary,`SPSS `__, :ref:`read_spss` , NA
+ binary,`Python Pickle Format `__, :ref:`read_pickle`, :ref:`to_pickle`
+    SQL,`SQL `__, :ref:`read_sql`, :ref:`to_sql`
:ref:`Here ` is an informal performance comparison for some of these IO methods.
@@ -61,8 +59,8 @@ Basic
+++++
filepath_or_buffer : various
- Either a path to a file (a :class:`python:str`, :class:`python:pathlib.Path`,
- or :class:`py:py._path.local.LocalPath`), URL (including http, ftp, and S3
+  Either a path to a file (a :class:`python:str`, :class:`python:pathlib.Path`),
+ URL (including http, ftp, and S3
locations), or any object with a ``read()`` method (such as an open file or
:class:`~python:io.StringIO`).
sep : str, defaults to ``','`` for :func:`read_csv`, ``\t`` for :func:`read_table`
@@ -75,14 +73,6 @@ sep : str, defaults to ``','`` for :func:`read_csv`, ``\t`` for :func:`read_tabl
delimiters are prone to ignoring quoted data. Regex example: ``'\\r\\t'``.
delimiter : str, default ``None``
Alternative argument name for sep.
-delim_whitespace : boolean, default False
- Specifies whether or not whitespace (e.g. ``' '`` or ``'\t'``)
- will be used as the delimiter. Equivalent to setting ``sep='\s+'``.
- If this option is set to ``True``, nothing should be passed in for the
- ``delimiter`` parameter.
-
- .. deprecated: 2.2.0
- Use ``sep="\\s+" instead.
Column and index locations and names
++++++++++++++++++++++++++++++++++++
@@ -179,7 +169,7 @@ dtype_backend : {"numpy_nullable", "pyarrow"}, defaults to NumPy backed DataFram
implementation when "numpy_nullable" is set, pyarrow is used for all
dtypes if "pyarrow" is set.
- The dtype_backends are still experimential.
+ The dtype_backends are still experimental.
.. versionadded:: 2.0
@@ -272,34 +262,9 @@ parse_dates : boolean or list of ints or names or list of lists or dict, default
* If ``True`` -> try parsing the index.
* If ``[1, 2, 3]`` -> try parsing columns 1, 2, 3 each as a separate date
column.
- * If ``[[1, 3]]`` -> combine columns 1 and 3 and parse as a single date
- column.
- * If ``{'foo': [1, 3]}`` -> parse columns 1, 3 as date and call result 'foo'.
.. note::
A fast-path exists for iso8601-formatted dates.
-infer_datetime_format : boolean, default ``False``
- If ``True`` and parse_dates is enabled for a column, attempt to infer the
- datetime format to speed up the processing.
-
- .. deprecated:: 2.0.0
- A strict version of this argument is now the default, passing it has no effect.
-keep_date_col : boolean, default ``False``
- If ``True`` and parse_dates specifies combining multiple columns then keep the
- original columns.
-date_parser : function, default ``None``
- Function to use for converting a sequence of string columns to an array of
- datetime instances. The default uses ``dateutil.parser.parser`` to do the
- conversion. pandas will try to call date_parser in three different ways,
- advancing to the next if an exception occurs: 1) Pass one or more arrays (as
- defined by parse_dates) as arguments; 2) concatenate (row-wise) the string
- values from the columns defined by parse_dates into a single array and pass
- that; and 3) call date_parser once for each row using one or more strings
- (corresponding to the columns defined by parse_dates) as arguments.
-
- .. deprecated:: 2.0.0
- Use ``date_format`` instead, or read in as ``object`` and then apply
- :func:`to_datetime` as-needed.
date_format : str or dict of column -> format, default ``None``
If used in conjunction with ``parse_dates``, will parse dates according to this
format. For anything more complex,
@@ -831,71 +796,8 @@ The simplest case is to just pass in ``parse_dates=True``:
It is often the case that we may want to store date and time data separately,
or store various date fields separately. the ``parse_dates`` keyword can be
-used to specify a combination of columns to parse the dates and/or times from.
-
-You can specify a list of column lists to ``parse_dates``, the resulting date
-columns will be prepended to the output (so as to not affect the existing column
-order) and the new column names will be the concatenation of the component
-column names:
-
-.. ipython:: python
- :okwarning:
-
- data = (
- "KORD,19990127, 19:00:00, 18:56:00, 0.8100\n"
- "KORD,19990127, 20:00:00, 19:56:00, 0.0100\n"
- "KORD,19990127, 21:00:00, 20:56:00, -0.5900\n"
- "KORD,19990127, 21:00:00, 21:18:00, -0.9900\n"
- "KORD,19990127, 22:00:00, 21:56:00, -0.5900\n"
- "KORD,19990127, 23:00:00, 22:56:00, -0.5900"
- )
-
- with open("tmp.csv", "w") as fh:
- fh.write(data)
-
- df = pd.read_csv("tmp.csv", header=None, parse_dates=[[1, 2], [1, 3]])
- df
-
-By default the parser removes the component date columns, but you can choose
-to retain them via the ``keep_date_col`` keyword:
-
-.. ipython:: python
- :okwarning:
-
- df = pd.read_csv(
- "tmp.csv", header=None, parse_dates=[[1, 2], [1, 3]], keep_date_col=True
- )
- df
-
-Note that if you wish to combine multiple columns into a single date column, a
-nested list must be used. In other words, ``parse_dates=[1, 2]`` indicates that
-the second and third columns should each be parsed as separate date columns
-while ``parse_dates=[[1, 2]]`` means the two columns should be parsed into a
-single column.
-
-You can also use a dict to specify custom name columns:
-
-.. ipython:: python
- :okwarning:
-
- date_spec = {"nominal": [1, 2], "actual": [1, 3]}
- df = pd.read_csv("tmp.csv", header=None, parse_dates=date_spec)
- df
-
-It is important to remember that if multiple text columns are to be parsed into
-a single date column, then a new column is prepended to the data. The ``index_col``
-specification is based off of this new set of columns rather than the original
-data columns:
-
-
-.. ipython:: python
- :okwarning:
+used to specify the columns from which to parse dates and/or times.
- date_spec = {"nominal": [1, 2], "actual": [1, 3]}
- df = pd.read_csv(
- "tmp.csv", header=None, parse_dates=date_spec, index_col=0
- ) # index is the nominal column
- df
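+
+For example, separate date and time columns can be combined after reading with
+:func:`to_datetime` (a sketch, assuming the file has ``date`` and ``time`` columns):
+
+.. code-block:: python
+
+    df = pd.read_csv("tmp.csv")
+    df["datetime"] = pd.to_datetime(df["date"] + " " + df["time"])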
.. note::
If a column or index contains an unparsable date, the entire column or
@@ -909,10 +811,6 @@ data columns:
for your data to store datetimes in this format, load times will be
significantly faster, ~20x has been observed.
-.. deprecated:: 2.2.0
- Combining date columns inside read_csv is deprecated. Use ``pd.to_datetime``
- on the relevant result columns instead.
-
Date parsing functions
++++++++++++++++++++++
@@ -928,12 +826,6 @@ Performance-wise, you should try these methods of parsing dates in order:
then use ``to_datetime``.
-.. ipython:: python
- :suppress:
-
- os.remove("tmp.csv")
-
-
.. _io.csv.mixed_timezones:
Parsing a CSV with mixed timezones
@@ -1045,7 +937,7 @@ Writing CSVs to binary file objects
``df.to_csv(..., mode="wb")`` allows writing a CSV to a file object
opened binary mode. In most cases, it is not necessary to specify
-``mode`` as Pandas will auto-detect whether the file object is
+``mode`` as pandas will auto-detect whether the file object is
opened in text or binary mode.
.. ipython:: python
@@ -1605,7 +1497,7 @@ Specifying ``iterator=True`` will also return the ``TextFileReader`` object:
Specifying the parser engine
''''''''''''''''''''''''''''
-Pandas currently supports three engines, the C engine, the python engine, and an experimental
+pandas currently supports three engines, the C engine, the python engine, and an experimental
pyarrow engine (requires the ``pyarrow`` package). In general, the pyarrow engine is fastest
on larger workloads and is equivalent in speed to the C engine on most other workloads.
The python engine tends to be slower than the pyarrow and C engines on most workloads. However,
@@ -1619,7 +1511,6 @@ Currently, options unsupported by the C and pyarrow engines include:
* ``sep`` other than a single character (e.g. regex separators)
* ``skipfooter``
-* ``sep=None`` with ``delim_whitespace=False``
Specifying any of the above options will produce a ``ParserWarning`` unless the
python engine is selected explicitly using ``engine='python'``.
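+
+For example (a sketch; ``data.csv`` is a hypothetical file, and ``skipfooter``
+is one of the options above that requires the python engine):
+
+.. code-block:: python
+
+    df = pd.read_csv("data.csv", skipfooter=1, engine="python")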
@@ -1634,14 +1525,12 @@ Options that are unsupported by the pyarrow engine which are not covered by the
* ``memory_map``
* ``dialect``
* ``on_bad_lines``
-* ``delim_whitespace``
* ``quoting``
* ``lineterminator``
* ``converters``
* ``decimal``
* ``iterator``
* ``dayfirst``
-* ``infer_datetime_format``
* ``verbose``
* ``skipinitialspace``
* ``low_memory``
@@ -1704,7 +1593,7 @@ option parameter:
.. code-block:: python
- storage_options = {"client_kwargs": {"endpoint_url": "http://127.0.0.1:5555"}}}
+ storage_options = {"client_kwargs": {"endpoint_url": "http://127.0.0.1:5555"}}
df = pd.read_json("s3://pandas-test/test-1", storage_options=storage_options)
More sample configurations and documentation can be found at `S3Fs documentation
@@ -1838,14 +1727,13 @@ with optional parameters:
.. csv-table::
:widths: 20, 150
- :delim: ;
- ``split``; dict like {index -> [index], columns -> [columns], data -> [values]}
- ``records``; list like [{column -> value}, ... , {column -> value}]
- ``index``; dict like {index -> {column -> value}}
- ``columns``; dict like {column -> {index -> value}}
- ``values``; just the values array
- ``table``; adhering to the JSON `Table Schema`_
+ ``split``, dict like {index -> [index]; columns -> [columns]; data -> [values]}
+ ``records``, list like [{column -> value}; ... ]
+ ``index``, dict like {index -> {column -> value}}
+ ``columns``, dict like {column -> {index -> value}}
+ ``values``, just the values array
+ ``table``, adhering to the JSON `Table Schema`_
* ``date_format`` : string, type of date conversion, 'epoch' for timestamp, 'iso' for ISO8601.
* ``double_precision`` : The number of decimal places to use when encoding floating point values, default 10.
@@ -1950,13 +1838,6 @@ Writing in ISO date format, with microseconds:
json = dfd.to_json(date_format="iso", date_unit="us")
json
-Epoch timestamps, in seconds:
-
-.. ipython:: python
-
- json = dfd.to_json(date_format="epoch", date_unit="s")
- json
-
Writing to a file, with a date index and a date column:
.. ipython:: python
@@ -1966,7 +1847,7 @@ Writing to a file, with a date index and a date column:
dfj2["ints"] = list(range(5))
dfj2["bools"] = True
dfj2.index = pd.date_range("20130101", periods=5)
- dfj2.to_json("test.json")
+ dfj2.to_json("test.json", date_format="iso")
with open("test.json") as fh:
print(fh.read())
@@ -2033,14 +1914,13 @@ is ``None``. To explicitly force ``Series`` parsing, pass ``typ=series``
.. csv-table::
:widths: 20, 150
- :delim: ;
- ``split``; dict like {index -> [index], columns -> [columns], data -> [values]}
- ``records``; list like [{column -> value}, ... , {column -> value}]
- ``index``; dict like {index -> {column -> value}}
- ``columns``; dict like {column -> {index -> value}}
- ``values``; just the values array
- ``table``; adhering to the JSON `Table Schema`_
+ ``split``, dict like {index -> [index]; columns -> [columns]; data -> [values]}
+    ``records``, list like [{column -> value}; ... ]
+ ``index``, dict like {index -> {column -> value}}
+ ``columns``, dict like {column -> {index -> value}}
+ ``values``, just the values array
+ ``table``, adhering to the JSON `Table Schema`_
* ``dtype`` : if True, infer dtypes, if a dict of column to dtype, then use those, if ``False``, then don't infer dtypes at all, default is True, apply only to the data.
@@ -2141,7 +2021,7 @@ Dates written in nanoseconds need to be read back in nanoseconds:
.. ipython:: python
from io import StringIO
- json = dfj2.to_json(date_unit="ns")
+ json = dfj2.to_json(date_format="iso", date_unit="ns")
# Try to parse timestamps as milliseconds -> Won't Work
dfju = pd.read_json(StringIO(json), date_unit="ms")
@@ -2281,7 +2161,7 @@ a JSON string with two fields, ``schema`` and ``data``.
{
"A": [1, 2, 3],
"B": ["a", "b", "c"],
- "C": pd.date_range("2016-01-01", freq="d", periods=3),
+ "C": pd.date_range("2016-01-01", freq="D", periods=3),
},
index=pd.Index(range(3), name="idx"),
)
@@ -2390,7 +2270,7 @@ round-trippable manner.
{
"foo": [1, 2, 3, 4],
"bar": ["a", "b", "c", "d"],
- "baz": pd.date_range("2018-01-01", freq="d", periods=4),
+ "baz": pd.date_range("2018-01-01", freq="D", periods=4),
"qux": pd.Categorical(["a", "b", "c", "c"]),
},
index=pd.Index(range(4), name="idx"),
@@ -3013,16 +2893,17 @@ Read in the content of the "books.xml" as instance of ``StringIO`` or
df
Even read XML from AWS S3 buckets such as NIH NCBI PMC Article Datasets providing
-Biomedical and Life Science Jorurnals:
+Biomedical and Life Science Journals:
-.. ipython:: python
- :okwarning:
+.. code-block:: python
- df = pd.read_xml(
- "s3://pmc-oa-opendata/oa_comm/xml/all/PMC1236943.xml",
- xpath=".//journal-meta",
- )
- df
+ >>> df = pd.read_xml(
+ ... "s3://pmc-oa-opendata/oa_comm/xml/all/PMC1236943.xml",
+ ... xpath=".//journal-meta",
+    ... )
+ >>> df
+ journal-id journal-title issn publisher
+ 0 Cardiovasc Ultrasound Cardiovascular Ultrasound 1476-7120 NaN
With `lxml`_ as default ``parser``, you access the full-featured XML library
that extends Python's ElementTree API. One powerful tool is ability to query
@@ -3122,7 +3003,7 @@ However, if XPath does not reference node names such as default, ``/*``, then
.. note::
Since ``xpath`` identifies the parent of content to be parsed, only immediate
- desendants which include child nodes or current attributes are parsed.
+ descendants which include child nodes or current attributes are parsed.
Therefore, ``read_xml`` will not parse the text of grandchildren or other
descendants and will not parse attributes of any descendant. To retrieve
lower level content, adjust xpath to lower level. For example,
@@ -3247,7 +3128,7 @@ output (as shown below for demonstration) for easier parse into ``DataFrame``:
"""
- df = pd.read_xml(StringIO(xml), stylesheet=xsl)
+ df = pd.read_xml(StringIO(xml), stylesheet=StringIO(xsl))
df
For very large XML files that can range in hundreds of megabytes to gigabytes, :func:`pandas.read_xml`
@@ -3418,7 +3299,7 @@ Write an XML and transform with stylesheet:
"""
- print(geom_df.to_xml(stylesheet=xsl))
+ print(geom_df.to_xml(stylesheet=StringIO(xsl)))
XML Final Notes
@@ -3471,20 +3352,15 @@ saving a ``DataFrame`` to Excel. Generally the semantics are
similar to working with :ref:`csv` data.
See the :ref:`cookbook` for some advanced strategies.
-.. warning::
-
- The `xlrd `__ package is now only for reading
- old-style ``.xls`` files.
+.. note::
- Before pandas 1.3.0, the default argument ``engine=None`` to :func:`~pandas.read_excel`
- would result in using the ``xlrd`` engine in many cases, including new
- Excel 2007+ (``.xlsx``) files. pandas will now default to using the
- `openpyxl `__ engine.
+ When ``engine=None``, the following logic will be used to determine the engine:
- It is strongly encouraged to install ``openpyxl`` to read Excel 2007+
- (``.xlsx``) files.
- **Please do not report issues when using ``xlrd`` to read ``.xlsx`` files.**
- This is no longer supported, switch to using ``openpyxl`` instead.
+ - If ``path_or_buffer`` is an OpenDocument format (.odf, .ods, .odt),
+ then `odf `_ will be used.
+ - Otherwise if ``path_or_buffer`` is an xls format, ``xlrd`` will be used.
+ - Otherwise if ``path_or_buffer`` is in xlsb format, ``pyxlsb`` will be used.
+ - Otherwise ``openpyxl`` will be used.
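+
+   An explicit ``engine`` can also be passed to override this logic (here with a
+   hypothetical OpenDocument file; reading it requires the ``odfpy`` package):
+
+   .. code-block:: python
+
+      df = pd.read_excel("data.ods", engine="odf")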
.. _io.excel_reader:
@@ -3659,7 +3535,7 @@ For example, to read in a ``MultiIndex`` index without names:
df = pd.read_excel("path_to_file.xlsx", index_col=[0, 1])
df
-If the index has level names, they will parsed as well, using the same
+If the index has level names, they will be parsed as well, using the same
parameters.
.. ipython:: python
@@ -3913,6 +3789,20 @@ The look and feel of Excel worksheets created from pandas can be modified using
* ``float_format`` : Format string for floating point numbers (default ``None``).
* ``freeze_panes`` : A tuple of two integers representing the bottommost row and rightmost column to freeze. Each of these parameters is one-based, so (1, 1) will freeze the first row and first column (default ``None``).
+.. note::
+
+ As of pandas 3.0, by default spreadsheets created with the ``to_excel`` method
+ will not contain any styling. Users wishing to bold text, add bordered styles,
+   etc., in a worksheet output by ``to_excel`` can do so by using :meth:`Styler.to_excel`
+   to create styled Excel files. For documentation on styling spreadsheets, see
+ `here `__.
+
+
+.. code-block:: python
+
+ css = "border: 1px solid black; font-weight: bold;"
+ df.style.map_index(lambda x: css).map_index(lambda x: css, axis=1).to_excel("myfile.xlsx")
+
Using the `Xlsxwriter`_ engine provides many options for controlling the
format of an Excel worksheet created with the ``to_excel`` method. Excellent examples can be found in the
`Xlsxwriter`_ documentation here: https://xlsxwriter.readthedocs.io/working_with_pandas.html
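A minimal, hedged sketch of selecting that engine (the file and sheet names are
placeholders, and the ``xlsxwriter`` package must be installed):

.. code-block:: python

    import pandas as pd

    df = pd.DataFrame({"A": [1, 2]})
    with pd.ExcelWriter("output.xlsx", engine="xlsxwriter") as writer:
        df.to_excel(writer, sheet_name="Sheet1")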
@@ -5100,7 +4990,7 @@ Caveats
convenience you can use ``store.flush(fsync=True)`` to do this for you.
* Once a ``table`` is created, its columns (DataFrame)
are fixed; only exactly the same columns can be appended
-* Be aware that timezones (e.g., ``pytz.timezone('US/Eastern')``)
+* Be aware that timezones (e.g., ``zoneinfo.ZoneInfo('US/Eastern')``)
are not necessarily equal across timezone versions. So if data is
localized to a specific timezone in the HDFStore using one version
of a timezone library and that data is updated with another version, the data
@@ -5279,6 +5169,8 @@ See the `Full Documentation `__.
.. ipython:: python
+ import pytz
+
df = pd.DataFrame(
{
"a": list("abc"),
@@ -5288,7 +5180,7 @@ See the `Full Documentation `__.
"e": [True, False, True],
"f": pd.Categorical(list("abc")),
"g": pd.date_range("20130101", periods=3),
- "h": pd.date_range("20130101", periods=3, tz="US/Eastern"),
+ "h": pd.date_range("20130101", periods=3, tz=pytz.timezone("US/Eastern")),
"i": pd.date_range("20130101", periods=3, freq="ns"),
}
)
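A hedged sketch of the round trip that presumably follows (the file name is a
placeholder):

.. code-block:: python

    df.to_feather("example.feather")
    result = pd.read_feather("example.feather")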
@@ -5957,10 +5849,10 @@ You can check if a table exists using :func:`~pandas.io.sql.has_table`
Schema support
''''''''''''''
-Reading from and writing to different schema's is supported through the ``schema``
+Reading from and writing to different schemas is supported through the ``schema``
keyword in the :func:`~pandas.read_sql_table` and :func:`~pandas.DataFrame.to_sql`
functions. Note however that this depends on the database flavor (sqlite does not
-have schema's). For example:
+have schemas). For example:
.. code-block:: python
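    # A hedged sketch (``engine`` and the table name are placeholders):
    # the ``schema`` keyword routes reads and writes to a non-default schema
    df.to_sql(name="my_table", con=engine, schema="other_schema")
    pd.read_sql_table("my_table", engine, schema="other_schema")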
@@ -6100,10 +5992,6 @@ Google BigQuery
The ``pandas-gbq`` package provides functionality to read/write from Google BigQuery.
-pandas integrates with this external package. if ``pandas-gbq`` is installed, you can
-use the pandas methods ``pd.read_gbq`` and ``DataFrame.to_gbq``, which will call the
-respective functions from ``pandas-gbq``.
-
Full documentation can be found `here `__.
.. _io.stata:
@@ -6395,7 +6283,7 @@ ignored.
In [2]: df = pd.DataFrame({'A': np.random.randn(sz), 'B': [1] * sz})
In [3]: df.info()
-<class 'pandas.core.frame.DataFrame'>
+<class 'pandas.DataFrame'>
RangeIndex: 1000000 entries, 0 to 999999
Data columns (total 2 columns):
A 1000000 non-null float64
diff --git a/doc/source/user_guide/merging.rst b/doc/source/user_guide/merging.rst
index c9c8478a719f0..cfd2f40aa93a3 100644
--- a/doc/source/user_guide/merging.rst
+++ b/doc/source/user_guide/merging.rst
@@ -249,7 +249,7 @@ a :class:`MultiIndex`) associate specific keys with each original :class:`DataFr
p.plot(frames, result, labels=["df1", "df2", "df3"], vertical=True)
plt.close("all");
-The ``keys`` argument cane override the column names
+The ``keys`` argument can override the column names
when creating a new :class:`DataFrame` based on existing :class:`Series`.
.. ipython:: python
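    # A hedged illustration (names are placeholders): ``keys`` replaces the
    # Series names as the column labels of the new DataFrame
    s1 = pd.Series([0, 1], name="foo")
    s2 = pd.Series([2, 3], name="bar")
    pd.concat([s1, s2], axis=1, keys=["red", "blue"])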
@@ -484,7 +484,7 @@ either the left or right tables, the values in the joined table will be
p.plot([left, right], result, labels=["left", "right"], vertical=False);
plt.close("all");
-You can :class:`Series` and a :class:`DataFrame` with a :class:`MultiIndex` if the names of
+You can merge :class:`Series` and a :class:`DataFrame` with a :class:`MultiIndex` if the names of
the :class:`MultiIndex` correspond to the columns from the :class:`DataFrame`. Transform
the :class:`Series` to a :class:`DataFrame` using :meth:`Series.reset_index` before merging
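A minimal sketch of that pattern (all names and values are illustrative):

.. code-block:: python

    import pandas as pd

    ser = pd.Series(
        [10, 20],
        index=pd.MultiIndex.from_tuples([("a", 1), ("b", 2)], names=["k1", "k2"]),
        name="val",
    )
    df = pd.DataFrame({"k1": ["a", "b"], "k2": [1, 2]})

    # reset_index turns the MultiIndex levels into columns so merge can align them
    pd.merge(df, ser.reset_index(), on=["k1", "k2"])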
@@ -763,7 +763,7 @@ Joining a single Index to a MultiIndex
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
You can join a :class:`DataFrame` with an :class:`Index` to a :class:`DataFrame` with a :class:`MultiIndex` on a level.
-The ``name`` of the :class:`Index` with match the level name of the :class:`MultiIndex`.
+The ``name`` of the :class:`Index` will match the level name of the :class:`MultiIndex`.
.. ipython:: python
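    # A hedged illustration (names are placeholders): the Index name "key"
    # matches the MultiIndex level of the same name, so the join aligns on it
    left = pd.DataFrame({"v": [1, 2]}, index=pd.Index(["a", "b"], name="key"))
    right = pd.DataFrame(
        {"w": [10, 20]},
        index=pd.MultiIndex.from_tuples([("a", 1), ("b", 2)], names=["key", "num"]),
    )
    left.join(right, how="inner")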
@@ -974,7 +974,7 @@ with optional filling of missing data with ``fill_method``.
:func:`merge_asof`
---------------------
-:func:`merge_asof` is similar to an ordered left-join except that mactches are on the
+:func:`merge_asof` is similar to an ordered left-join except that matches are on the
nearest key rather than equal keys. For each row in the ``left`` :class:`DataFrame`,
the last row in the ``right`` :class:`DataFrame` is selected where the ``on`` key is less
than the left's key. Both :class:`DataFrame` objects must be sorted by the key.
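A minimal, hedged sketch (keys and values are illustrative):

.. code-block:: python

    import pandas as pd

    left = pd.DataFrame({"t": [1, 5, 10], "left_val": ["a", "b", "c"]})
    right = pd.DataFrame({"t": [1, 2, 3, 6, 7], "right_val": [1, 2, 3, 6, 7]})

    # Each left row is matched with the last right row whose key is <= its own
    pd.merge_asof(left, right, on="t")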
@@ -1073,7 +1073,7 @@ compare two :class:`DataFrame` or :class:`Series`, respectively, and summarize t
df.compare(df2)
By default, if two corresponding values are equal, they will be shown as ``NaN``.
-Furthermore, if all values in an entire row / column, the row / column will be
+Furthermore, if all values in an entire row / column are equal, that row / column will be
omitted from the result. The remaining differences will be aligned on columns.
Stack the differences on rows.
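A hedged sketch (``df`` and ``df2`` stand in for the frames compared above):

.. code-block:: python

    # align_axis=0 stacks the differing values on rows instead of columns
    df.compare(df2, align_axis=0)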
diff --git a/doc/source/user_guide/missing_data.rst b/doc/source/user_guide/missing_data.rst
index 4a2aa565dd15c..e15939eb49239 100644
--- a/doc/source/user_guide/missing_data.rst
+++ b/doc/source/user_guide/missing_data.rst
@@ -21,7 +21,7 @@ is that the original data type will be coerced to ``np.float64`` or ``object``.
pd.Series([True, False], dtype=np.bool_).reindex([0, 1, 2])
:class:`NaT` for NumPy ``np.datetime64``, ``np.timedelta64``, and :class:`PeriodDtype`. For typing applications,
-use :class:`api.types.NaTType`.
+use :class:`api.typing.NaTType`.
.. ipython:: python
@@ -30,9 +30,9 @@ use :class:`api.types.NaTType`.
pd.Series(["2020", "2020"], dtype=pd.PeriodDtype("D")).reindex([0, 1, 2])
:class:`NA` for :class:`StringDtype`, :class:`Int64Dtype` (and other bit widths),
-:class:`Float64Dtype`(and other bit widths), :class:`BooleanDtype` and :class:`ArrowDtype`.
+:class:`Float64Dtype` (and other bit widths), :class:`BooleanDtype` and :class:`ArrowDtype`.
These types will maintain the original data type of the data.
-For typing applications, use :class:`api.types.NAType`.
+For typing applications, use :class:`api.typing.NAType`.
.. ipython:: python
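    # A hedged illustration: nullable dtypes keep their dtype and use <NA>
    pd.Series([1, 2], dtype="Int64").reindex([0, 1, 2])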
@@ -60,7 +60,7 @@ To detect these missing value, use the :func:`isna` or :func:`notna` methods.
.. warning::
- Equality compaisons between ``np.nan``, :class:`NaT`, and :class:`NA`
+ Equality comparisons between ``np.nan``, :class:`NaT`, and :class:`NA`
do not act like ``None``
.. ipython:: python
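    # Unlike ``None == None``, these comparisons do not return True
    np.nan == np.nan  # False
    pd.NaT == pd.NaT  # False
    pd.NA == pd.NA    # <NA>: the comparison itself propagates missingness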
@@ -88,7 +88,7 @@ To detect these missing value, use the :func:`isna` or :func:`notna` methods.
.. warning::
- Experimental: the behaviour of :class:`NA`` can still change without warning.
+ Experimental: the behaviour of :class:`NA` can still change without warning.
Starting from pandas 1.0, an experimental :class:`NA` value (singleton) is
available to represent scalar missing values. The goal of :class:`NA` is to provide a
@@ -105,7 +105,7 @@ dtype, it will use :class:`NA`:
s[2]
s[2] is pd.NA
-Currently, pandas does not yet use those data types using :class:`NA` by default
+Currently, pandas does not use these :class:`NA`-based data types by default in
a :class:`DataFrame` or :class:`Series`, so you need to specify
the dtype explicitly. An easy way to convert to those dtypes is explained in the
:ref:`conversion section `.
@@ -253,8 +253,8 @@ Conversion
^^^^^^^^^^
If you have a :class:`DataFrame` or :class:`Series` using ``np.nan``,
-:meth:`Series.convert_dtypes` and :meth:`DataFrame.convert_dtypes`
-in :class:`DataFrame` that can convert data to use the data types that use :class:`NA`
+:meth:`DataFrame.convert_dtypes` and :meth:`Series.convert_dtypes`, respectively,
+will convert your data to use the nullable data types supporting :class:`NA`,
such as :class:`Int64Dtype` or :class:`ArrowDtype`. This is especially helpful after reading
in data sets from IO methods where data types were inferred.
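A minimal sketch of that conversion (values are illustrative):

.. code-block:: python

    import numpy as np
    import pandas as pd

    s = pd.Series([1.0, 2.0, np.nan])
    # Integral floats with np.nan become Int64, with np.nan replaced by <NA>
    s.convert_dtypes()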
@@ -319,7 +319,7 @@ Missing values propagate through arithmetic operations between pandas objects.
The descriptive statistics and computational methods discussed in the
:ref:`data structure overview ` (and listed :ref:`here
-` and :ref:`here