Skip to content

Conversation

@itholic
Copy link
Contributor

@itholic itholic commented Aug 26, 2022

What changes were proposed in this pull request?

This PR proposes to install the openpyxl for PySpark test environments to re-enable the to_excel tests.

Why are the changes needed?

For better test coverage

Does this PR introduce any user-facing change?

No, it's test only

How was this patch tested?

Enabling the existing skipping tests related to openpyxl.

@itholic itholic changed the title [WIP][SPARK-40229][PS][TEST] Re-enable excel I/O test for pandas API on Spark [SPARK-40229][PS][TEST] Re-enable excel I/O test for pandas API on Spark Sep 5, 2022
@itholic itholic marked this pull request as ready for review September 5, 2022 23:00
@itholic
Copy link
Contributor Author

itholic commented Sep 5, 2022

The read_excel tests still failed for some reason with error below:

======================================================================
FAIL [1.476s]: test_read_excel (pyspark.pandas.tests.test_dataframe_spark_io.DataFrameSparkIOTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/__w/spark/spark/python/pyspark/pandas/tests/test_dataframe_spark_io.py", line 278, in test_read_excel
    ps.read_excel(tmp, index_col=0).sort_index(),
  File "/__w/spark/spark/python/pyspark/pandas/namespace.py", line 1224, in read_excel
    return read_excel_on_spark(pdf_or_psers, sheet_name)
  File "/__w/spark/spark/python/pyspark/pandas/namespace.py", line 1213, in read_excel_on_spark
    psdf = DataFrame(psdf._internal.with_new_sdf(sdf))
  File "/__w/spark/spark/python/pyspark/pandas/internal.py", line 1223, in with_new_sdf
    return self.copy(
  File "/__w/spark/spark/python/pyspark/pandas/internal.py", line 1427, in copy
    return InternalFrame(
  File "/__w/spark/spark/python/pyspark/pandas/internal.py", line 755, in __init__
    assert all(
AssertionError: ([InternalField(dtype=float64, struct_field=StructField('__index_level_0__', DoubleType(), False))], [StructField('__index_level_0__', DoubleType(), True)])

----------------------------------------------------------------------
Ran 8 tests in 34.923s

I failed to reproduce this error although I use the same version of related envs (e.g. Python, pandas, openpyxl).

Let me leave the re-enabling read_excel tests as TODO, and just re-enable the to_excel tests here for now.

@itholic
Copy link
Contributor Author

itholic commented Sep 5, 2022

Just created a ticket for re-enabling the read_excel test: SPARK-40353.

@itholic
Copy link
Contributor Author

itholic commented Sep 5, 2022

cc @HyukjinKwon FYI

- name: Install Python packages (Python 3.9, PyPy3)
run: |
# To test excel I/O for pandas API on Spark.
python3.9 -m pip install openpyxl
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

@itholic itholic Sep 6, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good! Just moved

@github-actions github-actions bot added the BUILD label Sep 6, 2022
@HyukjinKwon
Copy link
Member

cc @xinrong-meng FYI (I will be off for the rest of this week)

}

@unittest.skip("openpyxl")
def test_to_excel(self):
Copy link
Member

@Yikun Yikun Sep 9, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about this case for pypy3? Will it affect developers run tests?

python_execs = [x for x in ["python3.9", "pypy3"] if which(x)]

(I remember we run pypy3 and python3.9 in CI....but not running, pls let me know if I missed something history changes...)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually PyPy is not tested with Pandas API on Spark per https://github.com/apache/spark/blob/master/dev/sparktestsupport/modules.py#L663-L667 because pyarrow, etc are not available with PyPy. Should probably docunent it somewhere but let's do that separately.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, good to know!

@HyukjinKwon
Copy link
Member

Merged to master.

Yikun pushed a commit that referenced this pull request Oct 28, 2022
### What changes were proposed in this pull request?

This is a follow-up of #37671.

### Why are the changes needed?

Since #37671 added `openpyxl` for PySpark test environments and re-enabled `test_to_excel` test, we need to add it to `requirements.txt` as PySpark test dependency explicitly.

### Does this PR introduce _any_ user-facing change?

No. This is a test dependency.

### How was this patch tested?

Manually.

Closes #38425 from dongjoon-hyun/SPARK-40229.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Yikun Jiang <[email protected]>
SandishKumarHN pushed a commit to SandishKumarHN/spark that referenced this pull request Dec 12, 2022
### What changes were proposed in this pull request?

This is a follow-up of apache#37671.

### Why are the changes needed?

Since apache#37671 added `openpyxl` for PySpark test environments and re-enabled `test_to_excel` test, we need to add it to `requirements.txt` as PySpark test dependency explicitly.

### Does this PR introduce _any_ user-facing change?

No. This is a test dependency.

### How was this patch tested?

Manually.

Closes apache#38425 from dongjoon-hyun/SPARK-40229.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Yikun Jiang <[email protected]>
@itholic itholic deleted the SPARK-40229 branch April 22, 2023 05:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants