GH-48672, GH-48465: [Python] Add an option for truncating intraday milliseconds in Date64 #48466

HyukjinKwon · 2025-12-12T01:46:43Z

Rationale for this change

arrow/python/pyarrow/src/arrow/python/arrow_to_pandas.cc

Lines 1655 to 1656 in 0bfbd19

    // Date64Type is millisecond timestamp stored as int64_t
    // TODO(wesm): Do we want to make sure to zero out the milliseconds?

arrow/python/pyarrow/src/arrow/python/python_to_arrow.cc

Line 312 in d09233a

// TODO: introduce an option for this

What changes are included in this PR?

This PR adds an option, `truncate_date64_time`, for truncating intraday milliseconds in Date64. It is disabled by default for the pandas conversion and enabled by default for the Python conversion, so no existing behavior changes.

Are these changes tested?

Yes, unit tests were added and run as follows:

pytest pyarrow/tests/test_pandas.py

Are there any user-facing changes?

Not by default. The PR adds a new option, `truncate_date64_time`.

(Generated by ChatGPT)

| Conversion type | Default behavior | With explicit option | Option value | Result |
|---|---|---|---|---|
| Python sequences → Arrow (`pa.array()`) | Truncates time | Preserves time | `truncate_date64_time=False` | int64: `946684800000` (truncated) → `946730096123` (preserved) |
| NumPy arrays → Arrow (`pa.array()`) | Truncates time | Preserves time | `truncate_date64_time=False` | int64: `946684800000` (truncated) → `946730096123` (preserved) |
| Pandas Series → Arrow (`pa.array()` with `from_pandas=True`) | Truncates time | Preserves time | `truncate_date64_time=False` | int64: `946684800000` (truncated) → `946730096123` (preserved) |
| Arrow → Pandas (`to_pandas()`) | Preserves time | Truncates time | `truncate_date64_time=True` | `2018-05-10 00:02:03.456000` (preserved) → `2018-05-10 00:00:00` (truncated) |

import datetime
import pyarrow as pa

dt_with_time = datetime.datetime(2000, 1, 1, 12, 34, 56, 123456)
dt_date_only = datetime.datetime(2000, 1, 1)

# ============================================================================
# 1. Python sequences (lists)
# ============================================================================

# BEFORE (default behavior - truncates time)
arr_python_before = pa.array([dt_with_time], type=pa.date64())
arr_python_date_only_before = pa.array([dt_date_only], type=pa.date64())
print("Python sequences - BEFORE (default):")
print(f"  int64: {arr_python_before.view('int64')[0].as_py()}")  # 946684800000
print(f"  int64: {arr_python_date_only_before.view('int64')[0].as_py()}")  # 946684800000
print(f"  Equal? {arr_python_before.equals(arr_python_date_only_before)}")  # True

# AFTER (explicit truncate_date64_time=False - preserves time)
arr_python_after = pa.array([dt_with_time], type=pa.date64(), truncate_date64_time=False)
arr_python_date_only_after = pa.array([dt_date_only], type=pa.date64(), truncate_date64_time=False)
print("Python sequences - AFTER (truncate_date64_time=False):")
print(f"  int64: {arr_python_after.view('int64')[0].as_py()}")  # 946730096123
print(f"  int64: {arr_python_date_only_after.view('int64')[0].as_py()}")  # 946684800000
print(f"  Equal? {arr_python_after.equals(arr_python_date_only_after)}")  # False

# ============================================================================
# 2. NumPy arrays
# ============================================================================

import numpy as np

arr_numpy = np.array([dt_with_time], dtype=object)
arr_numpy_date_only = np.array([dt_date_only], dtype=object)

# BEFORE (default behavior - truncates time, since array() defaults to True)
arr_numpy_before = pa.array(arr_numpy, type=pa.date64())
arr_numpy_date_only_before = pa.array(arr_numpy_date_only, type=pa.date64())
print("\nNumPy arrays - BEFORE (default):")
print(f"  int64: {arr_numpy_before.view('int64')[0].as_py()}")  # 946684800000
print(f"  int64: {arr_numpy_date_only_before.view('int64')[0].as_py()}")  # 946684800000
print(f"  Equal? {arr_numpy_before.equals(arr_numpy_date_only_before)}")  # True

# AFTER (explicit truncate_date64_time=False - preserves time)
arr_numpy_after = pa.array(arr_numpy, type=pa.date64(), truncate_date64_time=False)
arr_numpy_date_only_after = pa.array(arr_numpy_date_only, type=pa.date64(), truncate_date64_time=False)
print("NumPy arrays - AFTER (truncate_date64_time=False):")
print(f"  int64: {arr_numpy_after.view('int64')[0].as_py()}")  # 946730096123
print(f"  int64: {arr_numpy_date_only_after.view('int64')[0].as_py()}")  # 946684800000
print(f"  Equal? {arr_numpy_after.equals(arr_numpy_date_only_after)}")  # False

# ============================================================================
# 3. Pandas Series
# ============================================================================

import pandas as pd

series_pandas = pd.Series([dt_with_time], dtype=object)
series_pandas_date_only = pd.Series([dt_date_only], dtype=object)

# BEFORE (default behavior - truncates time, since array() defaults to True)
arr_pandas_before = pa.array(series_pandas, type=pa.date64(), from_pandas=True)
arr_pandas_date_only_before = pa.array(series_pandas_date_only, type=pa.date64(), from_pandas=True)
print("\nPandas Series - BEFORE (default):")
print(f"  int64: {arr_pandas_before.view('int64')[0].as_py()}")  # 946684800000
print(f"  int64: {arr_pandas_date_only_before.view('int64')[0].as_py()}")  # 946684800000
print(f"  Equal? {arr_pandas_before.equals(arr_pandas_date_only_before)}")  # True

# AFTER (explicit truncate_date64_time=False - preserves time)
arr_pandas_after = pa.array(series_pandas, type=pa.date64(), from_pandas=True, truncate_date64_time=False)
arr_pandas_date_only_after = pa.array(series_pandas_date_only, type=pa.date64(), from_pandas=True, truncate_date64_time=False)
print("Pandas Series - AFTER (truncate_date64_time=False):")
print(f"  int64: {arr_pandas_after.view('int64')[0].as_py()}")  # 946730096123
print(f"  int64: {arr_pandas_date_only_after.view('int64')[0].as_py()}")  # 946684800000
print(f"  Equal? {arr_pandas_after.equals(arr_pandas_date_only_after)}")  # False

# ============================================================================
# 4. Arrow to Pandas conversion (to_pandas)
# ============================================================================

milliseconds_at_midnight = 1525910400000  # 2018-05-10 00:00:00
milliseconds_with_time = milliseconds_at_midnight + 123456  # 2018-05-10 00:02:03.456

arr_arrow = pa.array([milliseconds_at_midnight, milliseconds_with_time], type=pa.date64())

# BEFORE (default behavior - preserves time, since to_pandas() defaults to False)
result_before = arr_arrow.to_pandas(date_as_object=False)
print("\nArrow to Pandas - BEFORE (default):")
print(f"  arr.to_pandas(date_as_object=False)[0] = {result_before[0]}")  # 2018-05-10 00:00:00
print(f"  arr.to_pandas(date_as_object=False)[1] = {result_before[1]}")  # 2018-05-10 00:02:03.456000

# AFTER (explicit truncate_date64_time=True - truncates time)
result_after = arr_arrow.to_pandas(date_as_object=False, truncate_date64_time=True)
print("Arrow to Pandas - AFTER (truncate_date64_time=True):")
print(f"  arr.to_pandas(date_as_object=False, truncate_date64_time=True)[0] = {result_after[0]}")  # 2018-05-10 00:00:00
print(f"  arr.to_pandas(date_as_object=False, truncate_date64_time=True)[1] = {result_after[1]}")  # 2018-05-10 00:00:00

Python sequences - BEFORE (default):
  int64: 946684800000
  int64: 946684800000
  Equal? True
Python sequences - AFTER (truncate_date64_time=False):
  int64: 946730096123
  int64: 946684800000
  Equal? False

NumPy arrays - BEFORE (default):
  int64: 946684800000
  int64: 946684800000
  Equal? True
NumPy arrays - AFTER (truncate_date64_time=False):
  int64: 946730096123
  int64: 946684800000
  Equal? False

Pandas Series - BEFORE (default):
  int64: 946684800000
  int64: 946684800000
  Equal? True
Pandas Series - AFTER (truncate_date64_time=False):
  int64: 946730096123
  int64: 946684800000
  Equal? False

Arrow to Pandas - BEFORE (default):
  arr.to_pandas(date_as_object=False)[0] = 2018-05-10 00:00:00
  arr.to_pandas(date_as_object=False)[1] = 2018-05-10 00:02:03.456000
Arrow to Pandas - AFTER (truncate_date64_time=True):
  arr.to_pandas(date_as_object=False, truncate_date64_time=True)[0] = 2018-05-10 00:00:00
  arr.to_pandas(date_as_object=False, truncate_date64_time=True)[1] = 2018-05-10 00:00:00
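
The int64 values printed above follow from plain epoch arithmetic. As a rough illustration, the sketch below reproduces them with the standard library only. The helpers `to_epoch_ms` and `truncate_to_day` are hypothetical names chosen for this example, not pyarrow functions, and the naive datetime is assumed to be in UTC.

```python
import datetime

MS_PER_DAY = 86_400_000
EPOCH = datetime.datetime(1970, 1, 1)


def to_epoch_ms(dt):
    """Milliseconds since the UNIX epoch for a naive datetime.

    Sub-millisecond digits are dropped (hypothetical helper, not pyarrow).
    """
    delta = dt - EPOCH
    return (delta.days * 86_400 + delta.seconds) * 1000 + delta.microseconds // 1000


def truncate_to_day(ms):
    """Zero out the intraday part of a non-negative millisecond value."""
    return ms - ms % MS_PER_DAY


dt_with_time = datetime.datetime(2000, 1, 1, 12, 34, 56, 123456)
preserved = to_epoch_ms(dt_with_time)
truncated = truncate_to_day(preserved)
print(preserved)  # 946730096123
print(truncated)  # 946684800000
```

The two printed numbers match the "preserved" and "truncated" int64 values in the table, which suggests the option only controls whether the modulo step is applied.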

github-actions · 2025-12-12T01:47:10Z

⚠️ GitHub issue #48465 has been automatically assigned in GitHub to PR creator.

alippai · 2025-12-15T13:25:35Z

By spec Date64 should be limited to full day values in arrow

alippai · 2025-12-15T15:58:39Z

Interesting, the arr.to_pandas(date_as_object=False) docs also say it should be the appropriate time unit (which is D in this case, not ms).

Overall I'm not a fan of introducing a slower conversion for managing a case violating the spec.

HyukjinKwon · 2025-12-18T07:26:51Z

@alippai Thanks for reviewing this. I am fine with keeping the original behaviour as is and adding a switch. That is actually another TODO for the Python conversion at:

arrow/python/pyarrow/src/arrow/python/python_to_arrow.cc

Line 312 in d09233a

// TODO: introduce an option for this

If that's preferred, I can add a switch on both the Python and Arrow conversion sides and keep the original behaviour as is (True for the Python conversion, False for the Arrow conversion).

Otherwise, we can simply remove this TODO.

AlenkaF · 2025-12-23T09:12:10Z

cc @rok pinging in case you have any opinions on this topic.

rok · 2025-12-23T09:49:51Z

I don't have a strong opinion on this either way. Avoiding a performance regression by making this non-default behavior seems like a good idea at this point.

HyukjinKwon · 2025-12-23T11:07:48Z

Yeah let me work on it 👍

github-actions · 2025-12-29T06:08:42Z

⚠️ GitHub issue #48672 has been automatically assigned in GitHub to PR creator.

HyukjinKwon · 2025-12-29T07:41:53Z

This PR should be ready for a look.

alippai · 2025-12-29T08:13:38Z

Looks good, thanks for the change

EnricoMi · 2026-01-06T10:24:21Z

python/pyarrow/src/arrow/python/arrow_to_pandas.cc

+      // Date64Type is millisecond timestamp
+      if (this->options_.truncate_date64_time) {
+        // Truncate intraday milliseconds
+        ConvertDatetimeWithTruncation<1L>(*data, out_values);


Can we avoid computing the ... * 1L for each value in the array when SHIFT == 1? Or will the compiler optimize this away?

HyukjinKwon · 2026-01-07

I believe the compiler will optimize it out as a no-op, but to make sure, I changed the code slightly to use constexpr for the 1 case. It should be resolved as a compile-time branch and be optimized well enough, as documented for the C++ language.

EnricoMi · 2026-01-06T10:38:09Z

python/pyarrow/src/arrow/python/arrow_to_pandas.cc

+template <int64_t SHIFT>
+inline void ConvertDatetimeWithTruncation(const ChunkedArray& data, int64_t* out_values) {
+  for (int c = 0; c < data.num_chunks(); c++) {
+    const auto& arr = *data.chunk(c);
+    const int64_t* in_values = GetPrimitiveValues<int64_t>(arr);
+    for (int64_t i = 0; i < arr.length(); ++i) {
+      *out_values++ = arr.IsNull(i)
+                          ? kPandasTimestampNull
+                          : ((in_values[i] - in_values[i] % kMillisecondsInDay) * SHIFT);
+    }
+  }
+}


The name SHIFT suggests bit-shifting, whereas this is really a multiplication factor.

Suggested change

-template <int64_t SHIFT>
-inline void ConvertDatetimeWithTruncation(const ChunkedArray& data, int64_t* out_values) {
-  for (int c = 0; c < data.num_chunks(); c++) {
-    const auto& arr = *data.chunk(c);
-    const int64_t* in_values = GetPrimitiveValues<int64_t>(arr);
-    for (int64_t i = 0; i < arr.length(); ++i) {
-      *out_values++ = arr.IsNull(i)
-                          ? kPandasTimestampNull
-                          : ((in_values[i] - in_values[i] % kMillisecondsInDay) * SHIFT);
-    }
-  }
-}
+template <int64_t FACTOR>
+inline void ConvertDatetimeWithTruncation(const ChunkedArray& data, int64_t* out_values) {
+  for (int c = 0; c < data.num_chunks(); c++) {
+    const auto& arr = *data.chunk(c);
+    const int64_t* in_values = GetPrimitiveValues<int64_t>(arr);
+    for (int64_t i = 0; i < arr.length(); ++i) {
+      *out_values++ = arr.IsNull(i)
+                          ? kPandasTimestampNull
+                          : ((in_values[i] - in_values[i] % kMillisecondsInDay) * FACTOR);
+    }
+  }
+}

EnricoMi · 2026-01-06

Looks like this naming exists in ConvertDatetime as well :-(.

HyukjinKwon · 2026-01-07

Yeah, let me just keep it consistent for now.
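
For readers following the review thread, here is a pure-Python model of the loop under discussion. It is only a sketch under stated assumptions: nulls are represented as `None`, `PANDAS_TIMESTAMP_NULL` stands in for `kPandasTimestampNull` and is assumed to be the int64 minimum, and `c_remainder` emulates the C++ `%` operator, whose result takes the sign of the dividend. The names `c_remainder` and `convert_with_truncation` are hypothetical.

```python
MS_PER_DAY = 86_400_000
# Assumption: stands in for Arrow's kPandasTimestampNull (int64 minimum, i.e. NaT).
PANDAS_TIMESTAMP_NULL = -(2**63)


def c_remainder(a, b):
    """Remainder with C++ semantics: the result takes the sign of the dividend."""
    r = abs(a) % abs(b)
    return -r if a < 0 else r


def convert_with_truncation(values, factor=1):
    """Model of the reviewed loop; None marks a null slot (illustrative only)."""
    out = []
    for v in values:
        if v is None:
            out.append(PANDAS_TIMESTAMP_NULL)
        else:
            # Drop the intraday remainder, then scale to the target unit.
            out.append((v - c_remainder(v, MS_PER_DAY)) * factor)
    return out


result_ms = convert_with_truncation([1525910400000 + 123456, None, -1])
result_ns = convert_with_truncation([1525910400000 + 123456], factor=1_000_000)
print(result_ms)
print(result_ns)
```

With a factor of 1 the multiplication leaves the truncated value unchanged, which is the case the first review question asks about. The model also suggests that, under C++ remainder semantics, a negative value such as -1 ms is truncated toward the epoch (to 0) rather than down to the previous midnight.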

EnricoMi

LGTM!

HyukjinKwon · 2026-01-08T21:51:59Z

@AlenkaF do you mind taking a look when you find some time? I believe I resolved all comments. Now it does not change any default behaviour 🫡

AlenkaF · 2026-01-12T07:44:37Z

Can you rebase and resolve the conflicts?

AlenkaF

Thank you for working on this @HyukjinKwon!

Sorry if I am delaying this work being merged, but I am a bit sceptical about adding the truncate_date64_time option to all the APIs here. Is there a middle way where we add the option only for cases with genuine ambiguity (like conversions from Arrow to pandas) and not for constructors, where one could simply use Date32 instead?

@pitrou what do you think?
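
To illustrate the Date32 alternative mentioned in this comment: a day-count representation has no room for an intraday component, so the question of truncation does not arise. The sketch below uses only the standard library; `to_date32_days` is a hypothetical helper for illustration, not a pyarrow API.

```python
import datetime

EPOCH_DATE = datetime.date(1970, 1, 1)


def to_date32_days(value):
    """Days since the UNIX epoch; a datetime's time of day is discarded."""
    if isinstance(value, datetime.datetime):
        value = value.date()
    return (value - EPOCH_DATE).days


days_with_time = to_date32_days(datetime.datetime(2000, 1, 1, 12, 34, 56, 123456))
days_date_only = to_date32_days(datetime.date(2000, 1, 1))
print(days_with_time)  # 10957
print(days_date_only)  # 10957
```

Both inputs map to the same day count, which is the property that makes a separate truncation switch unnecessary for a days-based type.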

pitrou · 2026-01-13T14:22:39Z

Date64 data that has a non-zero intraday component is invalid according to the Arrow format, so this entire feature is undesirable:

arrow/format/Schema.fbs

Lines 246 to 254 in d54a205

    /// Date is either a 32-bit or 64-bit signed integer type representing an
    /// elapsed time since UNIX epoch (1970-01-01), stored in either of two units:
    ///
    /// * Milliseconds (64 bits) indicating UNIX time elapsed since the epoch (no
    ///   leap seconds), where the values are evenly divisible by 86400000
    /// * Days (32 bits) since the UNIX epoch
    table Date {
      unit: DateUnit = MILLISECOND;
    }

Unrelated, but @HyukjinKwon please write a proper PR description next time, instead of pasting multiple pages of ChatGPT-generated text.

rok · 2026-01-13T15:57:54Z

Agreed with @AlenkaF here, NdarrayToArrow shouldn't have a truncation parameter. Also tests could be parameterized to be significantly less verbose.

HyukjinKwon · 2026-01-13T21:24:55Z

Yeah let me try to dedup and fix!

@pitrou I meant to give the TL;DR first and then the reproducer and details (e.g. the tables) generated by LLMs. That's the approach I usually take, but it seems it was too verbose this time.

So usually you don't have to read past the (Generated by ChatGPT) marker (I add it to be upfront about that). I'll make it clear and concise next time.

pitrou · 2026-01-14T08:57:42Z

@HyukjinKwon Thanks for the heads up, but please see this sentence also:

Date64 data that has a non-zero intraday component is invalid according to the Arrow format

So it does not make sense to expose an option to truncate something that should already be truncated. I recommend closing this PR and the associated issues.

pitrou · 2026-01-14T09:00:08Z

For the record:

>>> dates = pa.array([0, 86400000, None], type=pa.int64()).view(pa.date64())
>>> dates.validate(full=True)
>>> dates = pa.array([0, 86400001, None], type=pa.int64()).view(pa.date64())
>>> dates.validate(full=True)
Traceback (most recent call last):
  Cell In[8], line 1
    dates.validate(full=True)
  File pyarrow/array.pxi:1851 in pyarrow.lib.Array.validate
  File pyarrow/error.pxi:92 in pyarrow.lib.check_status
ArrowInvalid: date64[ms] 86400001 does not represent a whole number of days
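
The rule that this validation enforces can be stated as a simple modulo test. The following is a minimal stdlib-only sketch of that rule, not pyarrow's actual implementation; `invalid_date64_values` is a hypothetical name.

```python
MS_PER_DAY = 86_400_000


def invalid_date64_values(values):
    """Return the non-null values that are not a whole number of days."""
    return [v for v in values if v is not None and v % MS_PER_DAY != 0]


ok = invalid_date64_values([0, 86400000, None])
bad = invalid_date64_values([0, 86400001, None])
print(ok)   # []
print(bad)  # [86400001]
```

The second input mirrors the failing case in the session above: 86400001 is one millisecond past a day boundary, so it is reported.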

HyukjinKwon · 2026-01-14T09:59:58Z

Ah, okay, that works for me!

HyukjinKwon · 2026-01-14T10:01:47Z

Ps: I'll be on vacation till 19th so responses might be delayed!

HyukjinKwon requested review from AlenkaF, raulcd and rok as code owners December 12, 2025 01:46

github-actions bot added the Component: Python and awaiting review labels Dec 12, 2025

HyukjinKwon marked this pull request as draft December 23, 2025 11:07

HyukjinKwon mentioned this pull request Dec 29, 2025

[Python] Add a switch for truncating intraday milliseconds in Date64 in Python conversion #48672

Closed

HyukjinKwon changed the title ~~GH-48465: [Python] Truncate intraday milliseconds in Date64 to pandas conversion~~ GH-48672, GH-48465: [Python] Add an option for truncating intraday milliseconds in Date64 Dec 29, 2025

HyukjinKwon force-pushed the truncate-millies branch 5 times, most recently from 2ffa9a0 to 7e8eb86 Compare December 29, 2025 07:23

HyukjinKwon marked this pull request as ready for review December 29, 2025 07:25

HyukjinKwon force-pushed the truncate-millies branch from 7e8eb86 to 3896331 Compare December 29, 2025 07:41

HyukjinKwon force-pushed the truncate-millies branch from 3896331 to 0bf9fec Compare December 29, 2025 07:54

EnricoMi reviewed Jan 6, 2026

github-actions bot added the awaiting committer review label and removed the awaiting review label Jan 6, 2026

HyukjinKwon force-pushed the truncate-millies branch from 0bf9fec to e478128 Compare January 7, 2026 02:03

EnricoMi approved these changes Jan 7, 2026

apacheGH-48672, apacheGH-48465: [Python] Add an option for truncating…

3e75327

… intraday milliseconds in Date64

HyukjinKwon force-pushed the truncate-millies branch from 9066b84 to 3e75327 Compare January 9, 2026 07:34

Merge branch 'main' into truncate-millies

70afa4b

AlenkaF reviewed Jan 13, 2026

HyukjinKwon closed this Jan 14, 2026
