GH-28859: [Doc][Python] Use only code-block directive and set up doctest for the python user guide #48619

AlenkaF · 2025-12-22T15:25:08Z

Rationale for this change

In many places in the Python User Guide the code examples are written with the IPython directive (elsewhere code-block is used). IPython directives are converted to IPython format (In and Out) during the doc build. This can lead to slower builds.

What changes are included in this PR?

IPython directives are converted to runnable code-block directives (with >>> and ... prompts), and pytest doctest support for .rst files is added to the conda-python-docs CI job. This means the code in the Python User Guide is tested separately from the building of the documentation.
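
As a rough illustration of what "runnable code-block" means here, the sketch below feeds a small, made-up .rst page through the standard library's doctest machinery. The page content and the name page.rst are hypothetical, and the CI job itself uses pytest rather than calling doctest directly; this only shows that the >>> and ... prompts are what make a block checkable.

```python
import doctest

# A hypothetical .rst page: the code-block uses >>> and ... prompts,
# followed by the expected output, so doctest can verify it.
RST_PAGE = """
Creating a list
---------------

.. code-block:: python

   >>> values = [1, 2, 3]
   >>> sum(values)
   6
   >>> for v in values:
   ...     print(v * 2)
   2
   4
   6
"""

# Parse the whole page as one doctest and run it.
parser = doctest.DocTestParser()
test = parser.get_doctest(RST_PAGE, {}, "page.rst", "page.rst", 0)
runner = doctest.DocTestRunner(verbose=False)
results = runner.run(test)

print(f"attempted={results.attempted} failed={results.failed}")
```

A block written without the prompts would be rendered in the docs but silently ignored by the doctest run.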

Are these changes tested?

Yes, with the CI.

Are there any user-facing changes?

Changes to the Python User Guide examples will have to be tested with pytest --doctest-glob='*.rst' docs/source/python/file.rst

GitHub Issue: [Doc][Python] The use of IPython directive or doctest code blocks in the python user guide #28859

AlenkaF · 2025-12-23T15:09:00Z

Converting this PR to draft until I figure out the best way to run the RST doctest on the Python 3.12 Sphinx Documentation CI job and not on the Python 3.10 Sphinx & Numpydoc job.

AlenkaF · 2026-01-08T10:03:07Z

@raulcd the main reason why AMD64 Conda Python 3.10 Sphinx & Numpydoc fails and AMD64 Conda Python 3.12 Sphinx Documentation succeeds is the Python version and the use of datetime.UTC, which was only added in Python 3.11, see https://docs.python.org/3/library/datetime.html#datetime.UTC.

I think the easiest solution would be to run Sphinx & Numpydoc on Python 3.11, or even Python 3.12 (I am not aware of any reason we would need the oldest Python version we support here. Sphinx Documentation runs on docs changes only while Sphinx & Numpydoc runs on any Python or C++ changes and validates the docstrings).

raulcd · 2026-01-08T10:16:46Z

Thanks for checking that @AlenkaF ! So currently we are providing a snippet on our documentation:

.. ipython:: python
    :okexcept:

    import datetime

    current_year = datetime.datetime.now(datetime.UTC).year
    for table_chunk in birthdays_dataset.to_batches():
        print("AGES", pc.subtract(current_year, table_chunk["years"]))

that will fail for some users, as we still support Python 3.10, right? Is it worth using datetime.UTC in the example? Should we just use current_year = datetime.datetime.now().year instead?
Or maybe add a comment with a note?

I am ok to just bump the Python version of the job but we probably should not provide examples that will fail on some of the supported versions.

AlenkaF · 2026-01-08T12:53:03Z

Yeah, you are right. Changing to datetime.datetime.now().year or even datetime.datetime.now(datetime.timezone.utc) makes much more sense! Will update 👍

AlenkaF · 2026-01-08T12:58:20Z

Ha ha, the example would fail anyways as the year changed in the meantime 🤣
Probably it is best to just hardcode it.

raulcd · 2026-01-08T13:58:55Z

Ha ha, the example would fail anyways as the year changed in the meantime 🤣 Probably it is best to just hardcode it.

Yes, we don't want to have to update this every year because the data changes 😄
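
The two options discussed above can be sketched with the standard library alone. This is only an illustration: the birth years and the reference year 2022 are made-up values, not the data used in the User Guide.

```python
import datetime

# datetime.timezone.utc is available on Python 3.10, whereas the
# datetime.UTC alias only exists from Python 3.11 onwards.
now_utc = datetime.datetime.now(datetime.timezone.utc)
current_year = now_utc.year  # changes over time, so unsuitable for doctest output

# Hardcoding the year (value assumed for illustration) keeps the
# printed result identical no matter when the example is run.
reference_year = 2022
birth_years = [1990, 2000]  # hypothetical sample data
ages = [reference_year - year for year in birth_years]
print("AGES", ages)
```

The first form is portable across the supported Python versions but still time-dependent; only the hardcoded form gives output that a doctest can compare against.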

AlenkaF · 2026-01-12T07:54:30Z

@github-actions crossbow submit preview-docs

github-actions · 2026-01-12T07:56:40Z

Revision: 1fb6f0a

Submitted crossbow builds: ursacomputing/crossbow @ actions-b1a47e0770

Task	Status
preview-docs

AlenkaF · 2026-01-12T09:43:55Z

Hi @rmnskb @tadeja @zhengruifeng @HyukjinKwon! In case anybody fancies giving a review, it would be much appreciated.
This PR looks like a big change, but it only unifies how we write code examples in the Python User Guide (code-block and not the ipython directive; note that >>> is needed in order for the examples to be tested).

Link to the preview: https://s3.amazonaws.com/arrow-data/pr_docs/48619/python/index.html

This PR also adds a doctest of the .rst files to the two existing documentation CI jobs. One job runs only with changes to the documentation, the other job runs with changes in the C++ and Python code. cc @raulcd in case you have time to look at the ci/ changes.

rmnskb

LGTM 🔥 Thanks for working on that! I can imagine it was a tremendous amount of work. Left some general comments about some smaller things that I picked up while looking at the PR, otherwise I think it's good to merge.

rmnskb · 2026-01-12T15:59:39Z

docs/source/python/compute.rst

+.. code-block:: python

-   >>> import pyarrow as pa
-   >>> import pyarrow.compute as pc


Did you decide to opt out from the explicit imports? Does the documentation still compile?

I decided to not duplicate imports per page. Meaning once on the top should suffice (see lines above, approx line 32). Yes, the compilation of docs and doctest should work.

Here are the doctest session logs:

https://github.com/apache/arrow/actions/runs/20845326372/job/59887467379#step:6:8946

https://github.com/apache/arrow/actions/runs/20845326375/job/59887488399#step:6:9089
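
A minimal sketch of why one import at the top of a page can be enough, assuming the whole .rst file is collected as a single doctest whose blocks share one namespace (which is how the standard doctest parser treats a text file). The page content and file name below are hypothetical.

```python
import doctest

# Two separate code-blocks on one hypothetical page: the second block
# relies on the import made in the first one.
PAGE = """
.. code-block:: python

   >>> import math

Some prose in between.

.. code-block:: python

   >>> math.sqrt(16)
   4.0
"""

# All examples in the text become one DocTest with shared globals.
test = doctest.DocTestParser().get_doctest(
    PAGE, {}, "compute_like.rst", "compute_like.rst", 0
)
results = doctest.DocTestRunner(verbose=False).run(test)
print(f"attempted={results.attempted} failed={results.failed}")
```

The trade-off raised in the review still holds: a reader who copies only the second block gets a NameError unless the import is repeated.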

nit: if we keep the explicit imports, the examples will be copy-paste able which might be more friendly to users.

docs/source/python/compute.rst

HyukjinKwon · 2026-01-13T08:40:37Z

ack. taking a look now

AlenkaF · 2026-01-13T08:44:43Z

Thanks for the review @rmnskb! 🎈

docs/source/python/getstarted.rst

HyukjinKwon · 2026-01-13T09:05:30Z

docs/source/python/timestamps.rst

    |              naive|              aware|
    +-------------------+-------------------+
-    |2019-01-01 00:00:00|2019-01-01 08:00:00|
+    |2018-12-31 23:00:00|2019-01-01 08:00:00|


I think 2019-01-01 00:00:00 became 2018-12-31 23:00:00 here cuz I suspect you or CI (?) is somewhere in GMT+1. datetime(2019, 1, 1, 0) is assumed as local time (yes it's up to the system to interpret but Spark thinks so). So, Spark thought that it's a local time but the timezone was set as UTC so it decreased one hour.

I think we should probably just skip all here cuz now it seems depending on local timezone.

Maybe it's better to just keep the original input/output here and skip all. I will take a separate look.
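
The one-hour shift described above can be reproduced with plain Python arithmetic. This sketch does not use Spark; it assumes a fixed GMT+1 offset for the machine's local timezone, which is only a guess about where the output was generated.

```python
from datetime import datetime, timedelta, timezone

# The naive value from the example: no timezone attached.
naive = datetime(2019, 1, 1, 0)

# Assumption for illustration: the local timezone is a fixed GMT+1.
assumed_local = timezone(timedelta(hours=1))

# Interpret the naive value as local time, then convert it to UTC.
as_utc = naive.replace(tzinfo=assumed_local).astimezone(timezone.utc)
shown = as_utc.strftime("%Y-%m-%d %H:%M:%S")
print(shown)  # 2018-12-31 23:00:00
```

With an assumed offset of zero the value would stay at 2019-01-01 00:00:00, which is why the rendered output depends on where the example is executed.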

Yeah, agree - these tests should already be skipped. I can remove the diff and keep it as it was before?
The example is good altogether and shows the timezone conversion behaviour. Maybe we could add a note? Claude suggests:

.. note:: The examples above demonstrate timezone conversion behaviour. The exact output may differ depending on your system's local timezone, as Spark interprets naive timestamps relative to the local timezone when converting to UTC.

I can remove the diff and keep it as it was before?

yupyup. whichever easier. I will take a separate look after this gets merged.

docs/source/python/parquet.rst

HyukjinKwon · 2026-01-13T09:19:49Z

docs/source/python/filesystems.rst

-   <FileInfo for 'test.arrow': type=FileType.File, size=3250>
+.. code-block:: python

+   >>> local.get_file_info('test.arrow')


This is also a really nitpick .. feel free to ignore it for now as I don't want this kind of comment to block the PR.

Should we remove the file after the test somewhere somehow? Seems like test.arrow file will be created but not removed?

All saved files should be cleaned up thanks to the fixture here: https://github.com/apache/arrow/pull/48619/files#diff-de7516be9fdc98a7bfc2fe897bd93bff7c7d0a5d62ea50759956c9745082a310.

I have tested this locally.

PS: not a nitpick, I think this is a very valid comment!
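
The cleanup idea can be sketched with the standard library only. This is not the fixture from the PR (its code is not reproduced in this thread); scratch_directory is a hypothetical helper showing the general pattern of running examples inside a temporary directory that is deleted afterwards.

```python
import contextlib
import os
import tempfile


@contextlib.contextmanager
def scratch_directory():
    """Run the enclosed code in a temporary directory, then clean up."""
    previous = os.getcwd()
    with tempfile.TemporaryDirectory() as tmp:
        os.chdir(tmp)
        try:
            yield tmp
        finally:
            # Leave the directory before it is removed.
            os.chdir(previous)


with scratch_directory() as workdir:
    # Stand-in for an example that writes a file such as test.arrow.
    with open("test.arrow", "wb") as f:
        f.write(b"placeholder bytes")
    created = os.path.exists(os.path.join(workdir, "test.arrow"))

# After the block exits, the directory and the file are gone.
leftover = os.path.exists(os.path.join(workdir, "test.arrow"))
print(created, leftover)
```

In a pytest setup the same effect is commonly achieved with an autouse fixture that changes into a temporary path, so individual examples do not need explicit cleanup code.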

HyukjinKwon

TL;DR: LGTM

How much does it save time BTW? Some might argue that IPython input/output are better and the speed of building could be considered as a secondary (e.g., PySpark doc build takes super duper long - it generates things a lot). For myself, I prefer faster build in any event so my take is on this change.

AlenkaF · 2026-01-13T09:33:30Z

How much does it save time BTW? Some might argue that IPython input/output are better and the speed of building could be considered as a secondary (e.g., PySpark doc build takes super duper long - it generates things a lot). For myself, I prefer faster build in any event so my take is on this change.

My main aim was to unify the docs and make it possible to run doctest on the examples separately. But I am curious whether there is any change in performance, so I will try it out now 😄 (not sure if the number of IPython directives was that big before this change, though).

zhengruifeng · 2026-01-13T09:57:10Z

docs/source/python/compute.rst

+.. code-block:: python

-   >>> import pyarrow as pa
-   >>> import pyarrow.compute as pc


nit: if we keep the explicit imports, the examples will be copy-paste able which might be more friendly to users.

AlenkaF requested review from assignUser, jonkeane, kou and raulcd as code owners December 22, 2025 15:25

github-actions bot added the Component: Documentation and awaiting review labels Dec 22, 2025

AlenkaF marked this pull request as draft December 23, 2025 15:06

AlenkaF added 17 commits January 9, 2026 09:03

Move to code-block in memory.rst

4aa58d2

Add conftest in python docs folder

b149ba1

Update getstarted.rst

6b2acf9

Update data.rst

fdc592c

Update dataset.rst

fafaa94

Update ipc.rst and pandas.rst

47eda37

Update parquet.rst

66b824f

Add doctest glob to CI and docs

5d0b46e

Update extending_types.rst

9d8d9df

Update compute.rst

9747b41

Update csv.rst

df299e0

Update dlpack.rst

7e42690

Update filesystems.rst

9a0e31c

Update install.rst

d8af83f

Update install.rst

fcc35df

Update substrait.rst

88581ef

Update interchange_protocol.rst

b893cc7

AlenkaF added 8 commits January 9, 2026 09:03

Try passing PYTEST_RST_ARGS in docs_light.yml

351645c

Try passing PYTEST_RST_ARGS sooner

9a07526

Revert "Try passing PYTEST_RST_ARGS sooner"

0fab4bc

This reverts commit 8101252.

Revert "Try passing PYTEST_RST_ARGS in docs_light.yml"

2f0f3cd

This reverts commit 63b3944.

Fix table1.join non-deterministic outcome

1d62830

Update Python in Sphinx & Numpydoc to 3.12

e07fad5

Revert "Update Python in Sphinx & Numpydoc to 3.12"

9bd80b8

This reverts commit 9c4cec7.

DO not use datetime.now() in examples

1fb6f0a

AlenkaF force-pushed the gh-28859-python-docs-examples-testing branch from f417177 to 1fb6f0a Compare January 9, 2026 08:03

AlenkaF marked this pull request as ready for review January 12, 2026 09:42

rmnskb suggested changes Jan 12, 2026

github-actions bot added the awaiting committer review label and removed the awaiting review label Jan 12, 2026

AlenkaF added 2 commits January 13, 2026 09:08

Add blanklines between function definitions

4b46325

Fix memory example - use context manager

38bbd79

HyukjinKwon reviewed Jan 13, 2026

docs/source/python/getstarted.rst (outdated)

Remove unused import

8b0d79f

HyukjinKwon reviewed Jan 13, 2026

docs/source/python/parquet.rst (outdated)

HyukjinKwon reviewed Jan 13, 2026

docs/source/python/parquet.rst (outdated)

HyukjinKwon reviewed Jan 13, 2026

HyukjinKwon approved these changes Jan 13, 2026

Add skip to the imports also

5363b69

zhengruifeng approved these changes Jan 13, 2026
