Commit 39adf88

author
sah0725
committed
DOC: Add comprehensive floating point precision documentation for CSV operations
Addresses issue #13159 by adding detailed documentation about:

- Why floating point precision loss occurs in CSV roundtrips
- How to use float_format parameter to control precision
- Format specifier reference (%.6f, %.10g, %.6e, etc.)
- Best practices for different data types (scientific, financial)
- Testing function to validate roundtrip precision
- dtype preservation behavior

Includes 6 working code examples with proper cleanup and comprehensive guidance for users experiencing CSV precision issues.
1 parent b917b37 commit 39adf88

1 file changed

+202
-0
lines changed

doc/source/user_guide/io.rst
@@ -1671,6 +1671,208 @@ function takes a number of arguments. Only the first is required.
* ``chunksize``: Number of rows to write at a time
* ``date_format``: Format string for datetime objects

.. _io.csv_precision:

Floating Point Precision in CSV
++++++++++++++++++++++++++++++++

Floating point values can lose precision when they are written to a CSV file and
read back. This section explains why this happens and how to control the written
precision with the ``float_format`` parameter.

Understanding Precision Loss
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Floating point numbers are stored in binary, while a CSV file holds them as decimal
text. Most decimal fractions have no exact binary representation, so a value can
change during the roundtrip if the text carries too few digits to identify it.
Consider this example:

.. ipython:: python

   import pandas as pd
   import numpy as np

   # Create a DataFrame with a problematic floating point value
   df = pd.DataFrame({'value': [0.1 + 0.2]})
   print(f"Original value: {df['value'].iloc[0]!r}")

   # Save to CSV and read back
   df.to_csv('test_precision.csv', index=False)
   df_read = pd.read_csv('test_precision.csv')
   print(f"After CSV roundtrip: {df_read['value'].iloc[0]!r}")
   print(f"Values are equal: {df['value'].iloc[0] == df_read['value'].iloc[0]}")

.. ipython:: python
   :suppress:

   import os
   if os.path.exists('test_precision.csv'):
       os.remove('test_precision.csv')
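
The original value prints as ``0.30000000000000004`` rather than ``0.3`` because
``0.1``, ``0.2`` and ``0.3`` cannot be represented exactly in binary floating point.
That discrepancy comes from floating point arithmetic, not from the CSV file: with
the default settings, ``to_csv`` writes enough digits to identify the stored value,
so this roundtrip normally compares equal. Precision is lost when the written text
is truncated, for example by a ``float_format`` with too few digits.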
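
To see what is actually stored, the exact binary value can be printed with the
standard library ``decimal`` module. The snippet below is a minimal sketch that
does not involve pandas; it only illustrates that Python's ``repr`` produces
enough digits to recover the stored value.

.. code-block:: python

   from decimal import Decimal

   x = 0.1 + 0.2

   # Decimal(float) shows the exact value of the binary double, digit for digit.
   exact = Decimal(x)
   print(exact)

   # repr() gives the shortest decimal string that still identifies the double,
   # so converting that string back yields the same value.
   text = repr(x)
   print(text)
   roundtrip_ok = float(text) == x
   print(roundtrip_ok)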
1714+
1715+
Using float_format for Precision Control
1716+
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1717+
1718+
The ``float_format`` parameter allows you to control how floating point numbers are
1719+
formatted when written to CSV. This can help preserve precision and ensure reliable
1720+
roundtrip operations.
1721+
1722+
.. ipython:: python
1723+
1724+
# Example with high precision number
1725+
df = pd.DataFrame({'precision_test': [123456789.123456789]})
1726+
print(f"Original: {df['precision_test'].iloc[0]}")
1727+
1728+
# Default behavior
1729+
df.to_csv('default.csv', index=False)
1730+
df_default = pd.read_csv('default.csv')
1731+
1732+
# With explicit precision control
1733+
df.to_csv('formatted.csv', index=False, float_format='%.15g')
1734+
df_formatted = pd.read_csv('formatted.csv')
1735+
1736+
print(f"Default read: {df_default['precision_test'].iloc[0]}")
1737+
print(f"Formatted read: {df_formatted['precision_test'].iloc[0]}")
1738+
1739+
.. ipython:: python
1740+
:suppress:
1741+
1742+
for f in ['default.csv', 'formatted.csv']:
1743+
if os.path.exists(f):
1744+
os.remove(f)
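
The effect of the digit count can be checked without writing a file. As an
assumption made for illustration, the sketch below applies the same printf-style
specifiers with Python's ``%`` operator rather than through ``to_csv``; the helper
name ``survives`` is hypothetical.

.. code-block:: python

   x = 123456789.123456789


   def survives(fmt, value):
       """Return True if formatting with fmt and parsing back gives the same float."""
       return float(fmt % value) == value


   # Compare how many significant digits are needed for this particular value.
   for fmt in ['%.15g', '%.16g', '%.17g']:
       print(fmt, fmt % x, survives(fmt, x))

Seventeen significant digits are always enough to identify a ``float64`` value,
which is why ``'%.17g'`` is a common choice when an exact roundtrip matters.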

Format Specifiers
~~~~~~~~~~~~~~~~~

Different format specifiers have different effects on precision and output format.
The sample outputs below are for the value ``123456789.123456789``.

**Fixed-point notation (f)**:

- ``'%.6f'`` - 6 decimal places: ``123456789.123457``
- ``'%.10f'`` - 10 decimal places: ``123456789.1234567910``
- Best for: Numbers with known decimal precision requirements

**General format (g)**:

- ``'%.6g'`` - 6 significant digits: ``1.23457e+08``
- ``'%.15g'`` - 15 significant digits: ``123456789.123457``
- Best for: Preserving significant digits, automatic scientific notation

**Scientific notation (e)**:

- ``'%.6e'`` - Scientific with 6 decimal places: ``1.234568e+08``
- ``'%.10e'`` - Scientific with 10 decimal places: ``1.2345678912e+08``
- Best for: Very large or very small numbers

.. ipython:: python

   # Demonstrate different format effects
   df = pd.DataFrame({'number': [123456789.123456789]})

   formats = {'%.6f': '6 decimal places',
              '%.10g': '10 significant digits',
              '%.6e': 'scientific notation'}

   for fmt, description in formats.items():
       df.to_csv('temp.csv', index=False, float_format=fmt)
       with open('temp.csv', 'r') as f:
           csv_content = f.read().strip().split('\n')[1]
       print(f"{description:20}: {csv_content}")

.. ipython:: python
   :suppress:

   if os.path.exists('temp.csv'):
       os.remove('temp.csv')

Best Practices
~~~~~~~~~~~~~~

**For high-precision scientific data**:
Use ``float_format='%.17g'`` to preserve maximum precision:

.. ipython:: python

   # High precision example
   scientific_data = pd.DataFrame({
       'measurement': [1.23456789012345e-10, 9.87654321098765e15]
   })
   scientific_data.to_csv('scientific.csv', index=False, float_format='%.17g')

.. ipython:: python
   :suppress:

   if os.path.exists('scientific.csv'):
       os.remove('scientific.csv')

**For financial data**:
Use fixed decimal places like ``float_format='%.2f'``:

.. ipython:: python

   # Financial data example
   financial_data = pd.DataFrame({
       'price': [19.99, 1234.56, 0.01]
   })
   financial_data.to_csv('financial.csv', index=False, float_format='%.2f')

.. ipython:: python
   :suppress:

   if os.path.exists('financial.csv'):
       os.remove('financial.csv')
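
Note that ``'%.2f'`` rounds the stored binary value, which is not always the
decimal number that was typed. The sketch below uses plain Python formatting (an
assumption made for illustration, not a call into pandas) to show a case where the
result may be surprising, with the standard library ``decimal`` module as one
possible way to keep exact decimal amounts.

.. code-block:: python

   from decimal import Decimal, ROUND_HALF_UP

   price = 2.675

   # The double nearest to 2.675 is slightly below it, so '%.2f' rounds down.
   formatted = '%.2f' % price
   print(formatted)

   # A Decimal built from a string keeps the decimal digits exactly,
   # so half-up rounding behaves as expected for currency.
   exact = Decimal('2.675').quantize(Decimal('0.01'), rounding=ROUND_HALF_UP)
   print(exact)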

**For ensuring exact roundtrip**:
Test your specific data to find the minimum precision needed:

.. ipython:: python

   def test_roundtrip_precision(df, float_format):
       """Test if a float_format preserves data during CSV roundtrip."""
       df.to_csv('test.csv', index=False, float_format=float_format)
       df_read = pd.read_csv('test.csv')
       return df.equals(df_read)

   # Test data
   test_df = pd.DataFrame({'values': [123.456789, 0.000123456, 1.23e15]})

   # Test different precisions
   for fmt in ['%.6g', '%.10g', '%.15g']:
       success = test_roundtrip_precision(test_df, fmt)
       print(f"Format {fmt}: roundtrip {'passed' if success else 'failed'}")

.. ipython:: python
   :suppress:

   if os.path.exists('test.csv'):
       os.remove('test.csv')
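
The same check can be run without temporary files. The following is a sketch
under stated assumptions: the helper name ``roundtrips_in_memory`` is hypothetical,
the CSV text is held in an ``io.StringIO`` buffer, and ``read_csv`` is called with
``float_precision='round_trip'`` so that the parser itself does not introduce
rounding.

.. code-block:: python

   import io

   import pandas as pd


   def roundtrips_in_memory(df, float_format=None):
       """Write df to an in-memory CSV, read it back, and compare."""
       buffer = io.StringIO()
       df.to_csv(buffer, index=False, float_format=float_format)
       buffer.seek(0)
       # float_precision='round_trip' selects the exact (slower) float parser.
       df_read = pd.read_csv(buffer, float_precision='round_trip')
       return df.equals(df_read)


   test_df = pd.DataFrame({'values': [123.456789, 0.000123456, 1.23e15]})

   for fmt in ['%.6g', '%.17g']:
       print(fmt, roundtrips_in_memory(test_df, fmt))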
1848+
1849+
**dtype Preservation Note**:
1850+
Be aware that CSV format does not preserve NumPy dtypes. All numeric data
1851+
will be read back as ``float64`` or ``int64`` regardless of the original dtype:
1852+
1853+
.. ipython:: python
1854+
1855+
# dtype preservation example
1856+
original_df = pd.DataFrame({
1857+
'float32_col': np.array([1.23], dtype=np.float32),
1858+
'float64_col': np.array([1.23], dtype=np.float64)
1859+
})
1860+
1861+
print("Original dtypes:")
1862+
print(original_df.dtypes)
1863+
1864+
original_df.to_csv('dtypes.csv', index=False)
1865+
read_df = pd.read_csv('dtypes.csv')
1866+
1867+
print("\nAfter CSV roundtrip:")
1868+
print(read_df.dtypes)
1869+
1870+
.. ipython:: python
1871+
:suppress:
1872+
1873+
if os.path.exists('dtypes.csv'):
1874+
os.remove('dtypes.csv')
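
If a narrower dtype matters, it can be requested when reading. The sketch below
passes the ``dtype`` argument of ``read_csv`` for one column and uses an in-memory
buffer instead of a file; the column names are taken from the example above.

.. code-block:: python

   import io

   import numpy as np
   import pandas as pd

   original_df = pd.DataFrame({
       'float32_col': np.array([1.23], dtype=np.float32),
       'float64_col': np.array([1.23], dtype=np.float64),
   })

   buffer = io.StringIO()
   original_df.to_csv(buffer, index=False)
   buffer.seek(0)

   # CSV carries no dtype information, so state the intended dtype explicitly.
   read_df = pd.read_csv(buffer, dtype={'float32_col': np.float32})
   print(read_df.dtypes)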

Writing a formatted string
++++++++++++++++++++++++++
