
Commit 843539c

Khemkaran authored and committed

Merge remote-tracking branch 'upstream/main' into issue_61917

2 parents e8a7607 + e72c8a1

File tree

4 files changed: +13, -154 lines changed

ci/deps/actions-311-downstream_compat.yaml

Lines changed: 1 addition & 2 deletions

```diff
@@ -50,8 +50,7 @@ dependencies:
   - pytz>=2023.4
   - pyxlsb>=1.0.10
   - s3fs>=2023.12.2
-  # TEMP upper pin for scipy (https://github.com/statsmodels/statsmodels/issues/9584)
-  - scipy>=1.12.0,<1.16
+  - scipy>=1.12.0
   - sqlalchemy>=2.0.0
   - tabulate>=0.9.0
   - xarray>=2024.1.1
```

pandas/core/algorithms.py

Lines changed: 2 additions & 2 deletions

```diff
@@ -751,7 +751,7 @@ def factorize(
     array([0, 0, 1])
     >>> uniques
     ['a', 'c']
-    Categories (3, object): ['a', 'b', 'c']
+    Categories (3, str): [a, b, c]
 
     Notice that ``'b'`` is in ``uniques.categories``, despite not being
     present in ``cat.values``.
@@ -764,7 +764,7 @@ def factorize(
     >>> codes
     array([0, 0, 1])
     >>> uniques
-    Index(['a', 'c'], dtype='object')
+    Index(['a', 'c'], dtype='str')
 
     If NaN is in the values, and we want to include NaN in the uniques of the
     values, it can be achieved by setting ``use_na_sentinel=False``.
```

pandas/core/base.py

Lines changed: 10 additions & 10 deletions

```diff
@@ -323,12 +323,12 @@ def transpose(self, *args, **kwargs) -> Self:
         0     Ant
         1    Bear
         2     Cow
-        dtype: object
+        dtype: str
         >>> s.T
         0     Ant
         1    Bear
         2     Cow
-        dtype: object
+        dtype: str
 
         For Index:
 
@@ -383,7 +383,7 @@ def ndim(self) -> int:
         0     Ant
         1    Bear
         2     Cow
-        dtype: object
+        dtype: str
         >>> s.ndim
         1
 
@@ -452,9 +452,9 @@ def nbytes(self) -> int:
         0     Ant
         1    Bear
         2     Cow
-        dtype: object
+        dtype: str
         >>> s.nbytes
-        24
+        34
 
         For Index:
 
@@ -487,7 +487,7 @@ def size(self) -> int:
         0     Ant
         1    Bear
         2     Cow
-        dtype: object
+        dtype: str
         >>> s.size
         3
 
@@ -567,7 +567,7 @@ def array(self) -> ExtensionArray:
         >>> ser = pd.Series(pd.Categorical(["a", "b", "a"]))
         >>> ser.array
         ['a', 'b', 'a']
-        Categories (2, object): ['a', 'b']
+        Categories (2, str): [a, b]
         """
         raise AbstractMethodError(self)
 
@@ -1076,15 +1076,15 @@ def value_counts(
 
         >>> df.dtypes
         a    category
-        b      object
+        b         str
         c    category
         d    category
         dtype: object
 
         >>> df.dtypes.value_counts()
         category    2
         category    1
-        object      1
+        str         1
         Name: count, dtype: int64
         """
         return algorithms.value_counts_internal(
@@ -1386,7 +1386,7 @@ def factorize(
         ... )
         >>> ser
         ['apple', 'bread', 'bread', 'cheese', 'milk']
-        Categories (4, object): ['apple' < 'bread' < 'cheese' < 'milk']
+        Categories (4, str): [apple < bread < cheese < milk]
 
         >>> ser.searchsorted('bread')
         np.int64(1)
```
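These docstring edits all track the same underlying change: examples that previously reported `object` dtype now report `str` under the new default string dtype. A rough sketch of what the updated `size`/`nbytes` examples assume (the exact `nbytes` value depends on the string storage backend, hence the 24 → 34 edit):

```python
import pandas as pd

# Assumes a pandas build where str is the default dtype for Python strings.
s = pd.Series(["Ant", "Bear", "Cow"])
print(s.dtype)   # str under the new default; previously object
print(s.size)    # 3
print(s.nbytes)  # 34 in the updated docstring; storage-dependent
print(s.T)       # transposing a Series returns the Series unchanged
```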

web/pandas/about/roadmap.md

Lines changed: 0 additions & 140 deletions

```diff
@@ -34,143 +34,3 @@ For more information about PDEPs, and how to submit one, please refer to
 </ul>
 
 {% endfor %}
-
-## Roadmap points pending a PDEP
-
-<div class="alert alert-warning" role="alert">
-pandas is in the process of moving roadmap points to PDEPs (implemented in
-August 2022). During the transition, some roadmap points will exist as PDEPs,
-while others will exist as sections below.
-</div>
-
-### Extensibility
-
-Pandas `extending.extension-types` allow
-for extending NumPy types with custom data types and array storage.
-Pandas uses extension types internally, and provides an interface for
-3rd-party libraries to define their own custom data types.
-
-Many parts of pandas still unintentionally convert data to a NumPy
-array. These problems are especially pronounced for nested data.
-
-We'd like to improve the handling of extension arrays throughout the
-library, making their behavior more consistent with the handling of
-NumPy arrays. We'll do this by cleaning up pandas' internals and
-adding new methods to the extension array interface.
-
-### Apache Arrow interoperability
-
-[Apache Arrow](https://arrow.apache.org) is a cross-language development
-platform for in-memory data. The Arrow logical types are closely aligned
-with typical pandas use cases.
-
-We'd like to provide better-integrated support for Arrow memory and
-data types within pandas. This will let us take advantage of its I/O
-capabilities and provide for better interoperability with other
-languages and libraries using Arrow.
-
-### Decoupling of indexing and internals
-
-The code for getting and setting values in pandas' data structures
-needs refactoring. In particular, we must clearly separate code that
-converts keys (e.g., the argument to `DataFrame.loc`) to positions from
-code that uses these positions to get or set values. This is related to
-the proposed BlockManager rewrite. Currently, the BlockManager sometimes
-uses label-based, rather than position-based, indexing. We propose that
-it should only work with positional indexing, and the translation of
-keys to positions should be entirely done at a higher level.
-
-Indexing is a complicated API with many subtleties. This refactor will require care
-and attention. The following principles should inspire refactoring of indexing code and
-should result in cleaner, simpler, and more performant code.
-
-1. Label indexing must never involve looking in an axis twice for the same label(s).
-   This implies that any validation step must either:
-
-   * limit validation to general features (e.g. dtype/structure of the key/index), or
-   * reuse the result for the actual indexing.
-
-2. Indexers must never rely on an explicit call to other indexers.
-   For instance, it is OK to have some internal method of `.loc` call some
-   internal method of `__getitem__` (or of their common base class),
-   but never in the code flow of `.loc` should `the_obj[something]` appear.
-
-3. Execution of positional indexing must never involve labels (as currently, sadly, happens).
-   That is, the code flow of a getter call (or a setter call in which the right hand side is non-indexed)
-   to `.iloc` should never involve the axes of the object in any way.
-
-4. Indexing must never involve accessing/modifying values (i.e., act on `._data` or `.values`) more than once.
-   The following steps must hence be clearly decoupled:
-
-   * find positions we need to access/modify on each axis
-   * (if we are accessing) derive the type of object we need to return (dimensionality)
-   * actually access/modify the values
-   * (if we are accessing) construct the return object
-
-5. As a corollary to the decoupling between 4.i and 4.iii, any code which deals with how data is stored
-   (including any combination of handling multiple dtypes, and sparse storage, categoricals, third-party types)
-   must be independent from code that deals with identifying affected rows/columns,
-   and take place only once step 4.i is completed.
-
-   * In particular, such code should most probably not live in `pandas/core/indexing.py`
-   * ... and must not depend in any way on the type(s) of axes (e.g. no `MultiIndex` special cases)
-
-6. As a corollary to point 1.i, `Index` (sub)classes must provide separate methods for any desired validity check of label(s) which does not involve actual lookup,
-   on the one side, and for any required conversion/adaptation/lookup of label(s), on the other.
-
-7. Use of trial and error should be limited, and anyway restricted to catch only exceptions
-   which are actually expected (typically `KeyError`).
-
-   * In particular, code should never (intentionally) raise new exceptions in the `except` portion of a `try... except`
-
-8. Any code portion which is not specific to setters and getters must be shared,
-   and when small differences in behavior are expected (e.g. getting with `.loc` raises for
-   missing labels, setting still doesn't), they can be managed with a specific parameter.
-
-### Numba-accelerated operations
-
-[Numba](https://numba.pydata.org) is a JIT compiler for Python code.
-We'd like to provide ways for users to apply their own Numba-jitted
-functions where pandas accepts user-defined functions (for example,
-`Series.apply`,
-`DataFrame.apply`,
-`DataFrame.applymap`, and in groupby and
-window contexts). This will improve the performance of
-user-defined functions in these operations by staying within compiled
-code.
-
-### Documentation improvements
-
-We'd like to improve the content, structure, and presentation of the
-pandas documentation. Some specific goals include
-
-- Overhaul the HTML theme with a modern, responsive design
-  (`15556`)
-- Improve the "Getting Started" documentation, designing and writing
-  learning paths for users of different backgrounds (e.g. brand new to
-  programming, familiar with other languages like R, already familiar
-  with Python).
-- Improve the overall organization of the documentation and specific
-  subsections of the documentation to make navigation and finding
-  content easier.
-
-### Performance monitoring
-
-Pandas uses [airspeed velocity](https://asv.readthedocs.io/en/stable/)
-to monitor for performance regressions. ASV itself is a fabulous tool,
-but requires some additional work to be integrated into an open source
-project's workflow.
-
-The [asv-runner](https://github.com/asv-runner) organization, currently
-made up of pandas maintainers, provides tools built on top of ASV. We
-have a physical machine for running a number of projects' benchmarks,
-and tools for managing the benchmark runs and reporting on results.
-
-We'd like to fund improvements and maintenance of these tools to
-
-- Be more stable. Currently, they're maintained on the nights and
-  weekends when a maintainer has free time.
-- Tune the system for benchmarks to improve stability, following
-  <https://pyperf.readthedocs.io/en/latest/system.html>
-- Build a GitHub bot to request ASV runs *before* a PR is merged.
-  Currently, the benchmarks are only run nightly.
```
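The removed "Numba-accelerated operations" item has since landed in part in the library itself. As a hedged illustration of the pattern that roadmap text describes, a sketch assuming pandas >= 2.2 with numba installed (`engine="numba"` on `DataFrame.apply` requires `raw=True`):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": np.arange(1_000.0), "b": 2.0})

def row_sum(row):
    # With raw=True the UDF receives a plain ndarray, which Numba can compile.
    return (row * 2.0).sum()

# engine="numba" JIT-compiles the UDF to stay in compiled code;
# engine="python" is the default interpreted path.
result = df.apply(row_sum, axis=1, raw=True, engine="numba")
```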
