Description

Pandas version checks

- [x] I have checked that this issue has not already been reported.
- [x] I have confirmed this issue exists on the latest version of pandas.
- [x] I have confirmed this issue exists on the main branch of pandas.
Reproducible Example

In a Jupyter notebook:

```python
import pandas as pd

range_test = pd.DataFrame(index=pd.RangeIndex(0, 20000))
ts_test = pd.DataFrame(index=pd.date_range("2024-01-01", periods=20000, freq="30min"))
ts_test_utc = pd.DataFrame(
    index=pd.date_range("2024-01-01", periods=20000, freq="30min", tz="UTC")
)
```

```python
%%timeit
ts_test.index.to_numpy()
# 1.02 μs ± 6.69 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
```

```python
%%timeit
ts_test_utc.index.to_numpy()
# 13.1 ms ± 102 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
```

```python
%%timeit
range_test.index.to_numpy()
# 229 ns ± 5.81 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
```
Summary table:

| Index type | Mean time | Std. dev. | Speed-up rel. to RangeIndex | Rel. to DatetimeIndex | Rel. to DatetimeIndex (UTC) |
|---|---|---|---|---|---|
| RangeIndex | 229 ns | 5.81 ns | 1.0 | 4.45x | 57,205x |
| DatetimeIndex | 1.02 μs | 6.69 ns | 0.22x | 1.0 | 12,843x |
| DatetimeIndex (UTC) | 13.1 ms | 102 μs | 0.000017x | 0.000078x | 1.0 |
I understand that this is likely because numpy does not support tz-aware datetimes. However, given that `to_numpy()` is the recommended way of accessing the underlying numpy array, a ~50,000x increase in runtime seems problematic.
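As far as I can tell, the slowdown comes from the return dtype rather than anything timezone-specific in the timing itself: on a tz-aware index, `to_numpy()` produces an `object` array and boxes every element into a `pd.Timestamp`, while the tz-naive case hands back a `datetime64[ns]` array. A minimal sketch illustrating the difference (the output comments are what I observe on pandas 2.3.2):

```python
import pandas as pd

idx = pd.date_range("2024-01-01", periods=3, freq="30min", tz="UTC")

# tz-aware: to_numpy() boxes every element into a pd.Timestamp
print(idx.to_numpy().dtype)                    # object

# tz-naive: to_numpy() returns a plain datetime64[ns] array
print(idx.tz_localize(None).to_numpy().dtype)  # datetime64[ns]
```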
FYI - using `.values` gives a ~100,000x speed-up for a tz-aware DatetimeIndex, presumably because it returns the underlying `datetime64[ns]` data directly (in UTC, with the timezone dropped):

```python
%%timeit
ts_test_utc.index.values
# 100 ns ± 0.565 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)
```
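In case anyone else hits this in the meantime, a workaround sketch (my own, not an official recommendation) is to strip the timezone before exporting: `tz_convert(None)` converts to UTC and drops the tz info, so the subsequent `to_numpy()` call takes the fast tz-naive path:

```python
import pandas as pd

idx = pd.date_range("2024-01-01", periods=20000, freq="30min", tz="UTC")

# Convert to UTC and drop the timezone, then export; this yields a
# datetime64[ns] array instead of an object array of Timestamps.
arr = idx.tz_convert(None).to_numpy()
print(arr.dtype)  # datetime64[ns]
```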
Installed Versions

```
INSTALLED VERSIONS
commit : 4665c10
python : 3.11.13
python-bits : 64
OS : Darwin
OS-release : 24.6.0
Version : Darwin Kernel Version 24.6.0: Mon Jul 14 11:30:40 PDT 2025; root:xnu-11417.140.69~1/RELEASE_ARM64_T6041
machine : arm64
processor : arm
byteorder : little
LC_ALL : None
LANG : None
LOCALE : en_US.UTF-8
pandas : 2.3.2
numpy : 1.26.4
pytz : 2025.2
dateutil : 2.9.0.post0
pip : 25.2
Cython : 3.1.3
sphinx : 8.2.3
IPython : 9.5.0
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : None
blosc : None
bottleneck : None
dataframe-api-compat : None
fastparquet : None
fsspec : 2025.7.0
html5lib : None
hypothesis : None
gcsfs : None
jinja2 : 3.1.6
lxml.etree : None
matplotlib : 3.8.4
numba : 0.61.2
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
psycopg2 : None
pymysql : None
pyarrow : 21.0.0
pyreadstat : None
pytest : 8.4.1
python-calamine : None
pyxlsb : None
s3fs : None
scipy : 1.11.4
sqlalchemy : 2.0.41
tables : None
tabulate : None
xarray : 2025.8.0
xlrd : None
xlsxwriter : None
zstandard : 0.24.0
tzdata : 2025.2
qtpy : None
pyqt5 : None
```
Prior Performance

I hadn't noticed a difference previously, likely because I had not used `to_numpy()` in a workload where it is called this frequently on such a large DataFrame.