Commit 805b56b

Merge remote-tracking branch 'upstream/main' into bugfix/60678-pdNA-error
2 parents c14e08d + fef01c5 commit 805b56b

58 files changed: +1295 -158 lines changed

.github/workflows/unit-tests.yml

Lines changed: 22 additions & 8 deletions
@@ -22,10 +22,11 @@ defaults:
 
 jobs:
   ubuntu:
-    runs-on: ubuntu-22.04
+    runs-on: ${{ matrix.platform }}
     timeout-minutes: 90
     strategy:
       matrix:
+        platform: [ubuntu-22.04, ubuntu-24.04-arm]
         env_file: [actions-310.yaml, actions-311.yaml, actions-312.yaml]
         # Prevent the include jobs from overriding other jobs
         pattern: [""]
@@ -35,9 +36,11 @@ jobs:
             env_file: actions-311-downstream_compat.yaml
             pattern: "not slow and not network and not single_cpu"
             pytest_target: "pandas/tests/test_downstream.py"
+            platform: ubuntu-22.04
           - name: "Minimum Versions"
             env_file: actions-310-minimum_versions.yaml
             pattern: "not slow and not network and not single_cpu"
+            platform: ubuntu-22.04
           - name: "Locale: it_IT"
             env_file: actions-311.yaml
             pattern: "not slow and not network and not single_cpu"
@@ -48,6 +51,7 @@ jobs:
             # Also install it_IT (its encoding is ISO8859-1) but do not activate it.
             # It will be temporarily activated during tests with locale.setlocale
             extra_loc: "it_IT"
+            platform: ubuntu-22.04
           - name: "Locale: zh_CN"
             env_file: actions-311.yaml
             pattern: "not slow and not network and not single_cpu"
@@ -58,25 +62,32 @@ jobs:
             # Also install zh_CN (its encoding is gb2312) but do not activate it.
             # It will be temporarily activated during tests with locale.setlocale
             extra_loc: "zh_CN"
+            platform: ubuntu-22.04
           - name: "Future infer strings"
             env_file: actions-312.yaml
             pandas_future_infer_string: "1"
+            platform: ubuntu-22.04
           - name: "Future infer strings (without pyarrow)"
             env_file: actions-311.yaml
             pandas_future_infer_string: "1"
+            platform: ubuntu-22.04
           - name: "Pypy"
             env_file: actions-pypy-39.yaml
             pattern: "not slow and not network and not single_cpu"
             test_args: "--max-worker-restart 0"
+            platform: ubuntu-22.04
           - name: "Numpy Dev"
             env_file: actions-311-numpydev.yaml
             pattern: "not slow and not network and not single_cpu"
             test_args: "-W error::DeprecationWarning -W error::FutureWarning"
+            platform: ubuntu-22.04
           - name: "Pyarrow Nightly"
             env_file: actions-311-pyarrownightly.yaml
             pattern: "not slow and not network and not single_cpu"
+            pandas_future_infer_string: "1"
+            platform: ubuntu-22.04
       fail-fast: false
-    name: ${{ matrix.name || format('ubuntu-latest {0}', matrix.env_file) }}
+    name: ${{ matrix.name || format('{0} {1}', matrix.platform, matrix.env_file) }}
     env:
       PATTERN: ${{ matrix.pattern }}
       LANG: ${{ matrix.lang || 'C.UTF-8' }}
@@ -91,7 +102,7 @@ jobs:
       REMOVE_PYARROW: ${{ matrix.name == 'Future infer strings (without pyarrow)' && '1' || '0' }}
     concurrency:
       # https://github.community/t/concurrecy-not-work-for-push/183068/7
-      group: ${{ github.event_name == 'push' && github.run_number || github.ref }}-${{ matrix.env_file }}-${{ matrix.pattern }}-${{ matrix.extra_apt || '' }}-${{ matrix.pandas_future_infer_string }}
+      group: ${{ github.event_name == 'push' && github.run_number || github.ref }}-${{ matrix.env_file }}-${{ matrix.pattern }}-${{ matrix.extra_apt || '' }}-${{ matrix.pandas_future_infer_string }}-${{ matrix.platform }}
       cancel-in-progress: true
 
     services:
@@ -419,20 +430,20 @@ jobs:
         with:
           fetch-depth: 0
 
-      - name: Set up Python for Pyodide
+      - name: Set up Python for pyodide-build
         id: setup-python
         uses: actions/setup-python@v5
         with:
-          python-version: '3.11.3'
+          python-version: '3.12'
 
       - name: Set up Emscripten toolchain
         uses: mymindstorm/setup-emsdk@v14
         with:
-          version: '3.1.46'
+          version: '3.1.58'
           actions-cache-folder: emsdk-cache
 
       - name: Install pyodide-build
-        run: pip install "pyodide-build==0.25.1"
+        run: pip install "pyodide-build>=0.29.2"
 
       - name: Build pandas for Pyodide
         run: |
@@ -441,10 +452,13 @@ jobs:
       - name: Set up Node.js
         uses: actions/setup-node@v4
         with:
-          node-version: '18'
+          node-version: '20'
 
       - name: Set up Pyodide virtual environment
+        env:
+          pyodide-version: '0.27.1'
         run: |
+          pyodide xbuildenv install ${{ env.pyodide-version }}
           pyodide venv .venv-pyodide
           source .venv-pyodide/bin/activate
           pip install dist/*.whl
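
Reviewer note on the `concurrency.group` change in this file: once `platform` becomes a matrix axis, two jobs that differ only in platform would otherwise share one group key, and with `cancel-in-progress: true` one would presumably cancel the other. The following is a minimal Python sketch of that collision, not part of the commit. It models only the base matrix (the `include` entries are ignored), and the ref string `refs/pull/1/merge` as well as the helper names `job_name` and `group` are hypothetical.

```python
from collections import Counter
from itertools import product

# Base matrix axes as they appear in the diff (include entries are ignored here).
platforms = ["ubuntu-22.04", "ubuntu-24.04-arm"]
env_files = ["actions-310.yaml", "actions-311.yaml", "actions-312.yaml"]
patterns = [""]


def job_name(platform, env_file, name=None):
    # Mirrors: matrix.name || format('{0} {1}', matrix.platform, matrix.env_file)
    return name or f"{platform} {env_file}"


def group(ref, env_file, pattern, platform=None, extra_apt="", infer_string=""):
    # Mirrors the concurrency group expression; platform=None models the old key.
    parts = [ref, env_file, pattern, extra_apt, infer_string]
    if platform is not None:
        parts.append(platform)
    return "-".join(parts)


ref = "refs/pull/1/merge"  # hypothetical pull request ref
jobs = list(product(platforms, env_files, patterns))

old_groups = Counter(group(ref, e, p) for _, e, p in jobs)
new_groups = Counter(group(ref, e, p, platform=plat) for plat, e, p in jobs)

print(len(jobs), len(old_groups), len(new_groups))
for plat, env_file, _ in jobs:
    print(job_name(plat, env_file))
```

Under these assumptions the six base jobs fall onto three group keys with the old expression and onto six distinct keys with the new one.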

.github/workflows/wheels.yml

Lines changed: 1 addition & 0 deletions
@@ -94,6 +94,7 @@ jobs:
         buildplat:
         - [ubuntu-22.04, manylinux_x86_64]
         - [ubuntu-22.04, musllinux_x86_64]
+        - [ubuntu-24.04-arm, manylinux_aarch64]
         - [macos-13, macosx_x86_64]
         # Note: M1 images on Github Actions start from macOS 14
         - [macos-14, macosx_arm64]

.gitignore

Lines changed: 4 additions & 0 deletions
@@ -137,3 +137,7 @@ doc/source/savefig/
 # Interactive terminal generated files #
 ########################################
 .jupyterlite.doit.db
+
+# Pyodide/WASM related files #
+##############################
+/.pyodide-xbuildenv-*

ci/code_checks.sh

Lines changed: 0 additions & 4 deletions
@@ -72,21 +72,17 @@ if [[ -z "$CHECK" || "$CHECK" == "docstrings" ]]; then
         -i "pandas.Series.dt PR01" `# Accessors are implemented as classes, but we do not document the Parameters section` \
         -i "pandas.Period.freq GL08" \
         -i "pandas.Period.ordinal GL08" \
-        -i "pandas.RangeIndex.from_range PR01,SA01" \
         -i "pandas.Timedelta.max PR02" \
         -i "pandas.Timedelta.min PR02" \
         -i "pandas.Timedelta.resolution PR02" \
         -i "pandas.Timestamp.max PR02" \
         -i "pandas.Timestamp.min PR02" \
         -i "pandas.Timestamp.resolution PR02" \
         -i "pandas.Timestamp.tzinfo GL08" \
-        -i "pandas.arrays.ArrowExtensionArray PR07,SA01" \
-        -i "pandas.arrays.TimedeltaArray PR07,SA01" \
         -i "pandas.core.groupby.DataFrameGroupBy.plot PR02" \
         -i "pandas.core.groupby.SeriesGroupBy.plot PR02" \
         -i "pandas.core.resample.Resampler.quantile PR01,PR07" \
         -i "pandas.core.resample.Resampler.transform PR01,RT03,SA01" \
-        -i "pandas.plotting.andrews_curves RT03,SA01" \
         -i "pandas.tseries.offsets.BDay PR02,SA01" \
         -i "pandas.tseries.offsets.BQuarterBegin.is_on_offset GL08" \
         -i "pandas.tseries.offsets.BQuarterBegin.n GL08" \

ci/deps/actions-311-pyarrownightly.yaml

Lines changed: 1 addition & 1 deletion
@@ -23,7 +23,7 @@ dependencies:
 
   - pip:
     - "tzdata>=2022.7"
-    - "--extra-index-url https://pypi.fury.io/arrow-nightlies/"
+    - "--extra-index-url https://pypi.anaconda.org/scientific-python-nightly-wheels/simple"
     - "--prefer-binary"
     - "--pre"
     - "pyarrow"
Lines changed: 229 additions & 0 deletions
@@ -0,0 +1,229 @@
+.. _compare_with_spss:
+
+{{ header }}
+
+Comparison with SPSS
+********************
+For potential users coming from `SPSS <https://www.ibm.com/spss>`__, this page is meant to demonstrate
+how various SPSS operations would be performed using pandas.
+
+.. include:: includes/introduction.rst
+
+Data structures
+---------------
+
+General terminology translation
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. csv-table::
+    :header: "pandas", "SPSS"
+    :widths: 20, 20
+
+    :class:`DataFrame`, data file
+    column, variable
+    row, case
+    groupby, split file
+    ``NaN``, system-missing
+
+:class:`DataFrame`
+~~~~~~~~~~~~~~~~~~
+
+A :class:`DataFrame` in pandas is analogous to an SPSS data file - a two-dimensional
+data source with labeled columns that can be of different types. As will be shown in this
+document, almost any operation that can be performed in SPSS can also be accomplished in pandas.
+
+:class:`Series`
+~~~~~~~~~~~~~~~
+
+A :class:`Series` is the data structure that represents one column of a :class:`DataFrame`. SPSS doesn't have a
+separate data structure for a single variable, but in general, working with a :class:`Series` is analogous
+to working with a variable in SPSS.
+
+:class:`Index`
+~~~~~~~~~~~~~~
+
+Every :class:`DataFrame` and :class:`Series` has an :class:`Index` -- labels on the *rows* of the data. SPSS does not
+have an exact analogue, as cases are simply numbered sequentially from 1. In pandas, if no index is
+specified, a :class:`RangeIndex` is used by default (first row = 0, second row = 1, and so on).
+
+While using a labeled :class:`Index` or :class:`MultiIndex` can enable sophisticated analyses and is ultimately an
+important part of pandas to understand, for this comparison we will essentially ignore the :class:`Index` and
+just treat the :class:`DataFrame` as a collection of columns. Please see the :ref:`indexing documentation<indexing>`
+for much more on how to use an :class:`Index` effectively.
+
+
+Copies vs. in place operations
+------------------------------
+
+.. include:: includes/copies.rst
+
+
+Data input / output
+-------------------
+
+Reading external data
+~~~~~~~~~~~~~~~~~~~~~
+
+Like SPSS, pandas provides utilities for reading in data from many formats. The ``tips`` dataset, found within
+the pandas tests (`csv <https://raw.githubusercontent.com/pandas-dev/pandas/main/pandas/tests/io/data/csv/tips.csv>`_)
+will be used in many of the following examples.
+
+In SPSS, you would use File > Open > Data to import a CSV file:
+
+.. code-block:: text
+
+    FILE > OPEN > DATA
+      /TYPE=CSV
+      /FILE='tips.csv'
+      /DELIMITERS=","
+      /FIRSTCASE=2
+      /VARIABLES=col1 col2 col3.
+
+The pandas equivalent would use :func:`read_csv`:
+
+.. code-block:: python
+
+    url = (
+        "https://raw.githubusercontent.com/pandas-dev"
+        "/pandas/main/pandas/tests/io/data/csv/tips.csv"
+    )
+    tips = pd.read_csv(url)
+    tips
+
+Like SPSS's data import wizard, ``read_csv`` can take a number of parameters to specify how the data should be parsed.
+For example, if the data was instead tab delimited, and did not have column names, the pandas command would be:
+
+.. code-block:: python
+
+    tips = pd.read_csv("tips.csv", sep="\t", header=None)
+
+    # alternatively, read_table is an alias to read_csv with tab delimiter
+    tips = pd.read_table("tips.csv", header=None)
+
+
+Data operations
+---------------
+
+Filtering
+~~~~~~~~~
+
+In SPSS, filtering is done through Data > Select Cases:
+
+.. code-block:: text
+
+    SELECT IF (total_bill > 10).
+    EXECUTE.
+
+In pandas, boolean indexing can be used:
+
+.. code-block:: python
+
+    tips[tips["total_bill"] > 10]
+
+
+Sorting
+~~~~~~~
+
+In SPSS, sorting is done through Data > Sort Cases:
+
+.. code-block:: text
+
+    SORT CASES BY sex total_bill.
+    EXECUTE.
+
+In pandas, this would be written as:
+
+.. code-block:: python
+
+    tips.sort_values(["sex", "total_bill"])
+
+
+String processing
+-----------------
+
+Finding length of string
+~~~~~~~~~~~~~~~~~~~~~~~~
+
+In SPSS:
+
+.. code-block:: text
+
+    COMPUTE length = LENGTH(time).
+    EXECUTE.
+
+.. include:: includes/length.rst
+
+
+Changing case
+~~~~~~~~~~~~~
+
+In SPSS:
+
+.. code-block:: text
+
+    COMPUTE upper = UPCASE(time).
+    COMPUTE lower = LOWER(time).
+    EXECUTE.
+
+.. include:: includes/case.rst
+
+
+Merging
+-------
+
+In SPSS, merging data files is done through Data > Merge Files.
+
+.. include:: includes/merge_setup.rst
+.. include:: includes/merge.rst
+
+
+GroupBy operations
+------------------
+
+Split-file processing
+~~~~~~~~~~~~~~~~~~~~~
+
+In SPSS, split-file analysis is done through Data > Split File:
+
+.. code-block:: text
+
+    SORT CASES BY sex.
+    SPLIT FILE BY sex.
+    DESCRIPTIVES VARIABLES=total_bill tip
+      /STATISTICS=MEAN STDDEV MIN MAX.
+
+The pandas equivalent would be:
+
+.. code-block:: python
+
+    tips.groupby("sex")[["total_bill", "tip"]].agg(["mean", "std", "min", "max"])
+
+
+Missing data
+------------
+
+SPSS uses the period (``.``) for numeric missing values and blank spaces for string missing values.
+pandas uses ``NaN`` (Not a Number) for numeric missing values and ``None`` or ``NaN`` for string
+missing values.
+
+.. include:: includes/missing.rst
+
+
+Other considerations
+--------------------
+
+Output management
+~~~~~~~~~~~~~~~~~
+
+While pandas does not have a direct equivalent to SPSS's Output Management System (OMS), you can
+capture and export results in various ways:
+
+.. code-block:: python
+
+    # Save summary statistics to CSV
+    tips.groupby('sex')[['total_bill', 'tip']].mean().to_csv('summary.csv')
+
+    # Save multiple results to Excel sheets
+    with pd.ExcelWriter('results.xlsx') as writer:
+        tips.describe().to_excel(writer, sheet_name='Descriptives')
+        tips.groupby('sex').mean(numeric_only=True).to_excel(writer, sheet_name='Means by Gender')
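
Reviewer note on the grouped means in this file: the `tips` data carries string columns (`sex`, `smoker`, `day`, `time`), and in pandas 2.x calling `mean()` on a grouped frame that still contains string columns raises a `TypeError` rather than dropping them. The sketch below is not part of the commit. It uses a made-up four-row frame standing in for `tips` and shows two ways to restrict the aggregation to numeric columns.

```python
import pandas as pd

# Hypothetical stand-in for the tips dataset: two numeric and two string columns.
tips = pd.DataFrame(
    {
        "total_bill": [16.99, 10.34, 21.01, 23.68],
        "tip": [1.01, 1.66, 3.50, 3.31],
        "sex": ["Female", "Male", "Male", "Male"],
        "day": ["Sun", "Sun", "Sun", "Sun"],
    }
)

# Option 1: ask mean() to skip non-numeric columns.
means = tips.groupby("sex").mean(numeric_only=True)

# Option 2: select the numeric columns explicitly before aggregating.
selected = tips.groupby("sex")[["total_bill", "tip"]].mean()

print(means)
```

Both options give the same table here. Selecting columns explicitly is the more conservative choice when the set of numeric columns is known in advance.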

doc/source/getting_started/comparison/index.rst

Lines changed: 1 addition & 0 deletions
@@ -14,3 +14,4 @@ Comparison with other tools
     comparison_with_spreadsheets
     comparison_with_sas
     comparison_with_stata
+    comparison_with_spss
