- 
                Notifications
    You must be signed in to change notification settings 
- Fork 0
Implement multiple axis for dropna #13
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the addition! A few comments though.
| cp = ray_df.copy() | ||
| result = ray_df.dropna(how='all', axis=[0, 1]) | ||
| result2 = ray_df.dropna(how='all', axis=(0, 1)) | ||
| expected = pd_df.dropna(how='all').dropna(how='all', axis=1) | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You don't need this. For now, just comparing ray and pandas is fine.
|  | ||
| ray_df_equals_pandas(ray_df.dropna(axis=1, how='any'), | ||
| pd_df.dropna(axis=1, how='any')) | ||
| assert ray_df_equals_pandas(ray_df.dropna(axis=1, how='any'), | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great catch, thanks!
        
          
                python/ray/dataframe/dataframe.py
              
                Outdated
          
        
      | raise NotImplementedError( | ||
| "To contribute to Pandas on Ray, please visit " | ||
| "github.com/ray-project/ray.") | ||
| axis = set([pd.DataFrame()._get_axis_number(ax) for ax in axis]) | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't put it in a set here.
| "To contribute to Pandas on Ray, please visit " | ||
| "github.com/ray-project/ray.") | ||
| axis = set([pd.DataFrame()._get_axis_number(ax) for ax in axis]) | ||
| result = self | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
unnecessary
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it is necessary in this case b/c the for loop makes reference to result. this is consistent with the pandas source code.
| "github.com/ray-project/ray.") | ||
| axis = set([pd.DataFrame()._get_axis_number(ax) for ax in axis]) | ||
| result = self | ||
| for ax in axis: | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add a comment here that this is inefficient since it forces the DataFrame to be built as an intermediate and should be fixed later.
| if not inplace: | ||
| return result | ||
|  | ||
| return self._update_inplace( | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pass _block_partitions in instead, it's more efficient.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done!
| assert ray_df_equals_pandas(result, expected) | ||
| assert ray_df_equals_pandas(result2, expected) | ||
|  | ||
| assert ray_df_equals_pandas(result, expected) | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this code duplicated?
| assert ray_df_equals_pandas(result2, expected) | ||
| assert ray_df_equals(ray_df, cp) | ||
|  | ||
| inp = ray_df.copy() | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You already make a copy above, use it since none of those operations mutated the copy.
|  | ||
| inp = ray_df.copy() | ||
| inp.dropna(how='all', axis=(0, 1), inplace=True) | ||
| assert ray_df_equals_pandas(inp, expected) | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Split inplace test from other test and model after the other dropna tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Few more!
|  | ||
| @pytest.fixture | ||
| def test_dropna_multiple_axes_inplace(ray_df, pd_df): | ||
| ray_df = ray_df.copy() | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
change the name here to ray_df_copy
|  | ||
| assert ray_df_equals_pandas(ray_df, pd_df) | ||
|  | ||
| ray_df.dropna(how='all', axis=(0, 1), inplace=True) | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
create a new copy of the original dfs first, else this operates on the result of the previous dropna call.
        
          
                python/ray/dataframe/dataframe.py
              
                Outdated
          
        
      |  | ||
| return self._update_inplace( | ||
| block_partitions=result._block_partitions, | ||
| columns=result._col_metadata.index, | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
self.columns here
        
          
                python/ray/dataframe/dataframe.py
              
                Outdated
          
        
      | return self._update_inplace( | ||
| block_partitions=result._block_partitions, | ||
| columns=result._col_metadata.index, | ||
| index=result._row_metadata.index | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
self.index here
        
          
                python/ray/dataframe/dataframe.py
              
                Outdated
          
        
      | "To contribute to Pandas on Ray, please visit " | ||
| "github.com/ray-project/ray.") | ||
| result = self | ||
| for ax in axis: # TODO: inefficient, df built as intermediate | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
prefer comment: # TODO(kunalgosar): this builds an intermediate dataframe, which does unnecessary computation
Also, put comment on its own line.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great!! Thanks
* implement filter * begin implementation of dropna * implement dropna * docs and tests * resolving comments * resolving merge * add error checking to dropna * fix update inplace call * Implement multiple axis for dropna (#13) * Implement multiple axis for dropna * Add multiple axis dropna test * Fix using dummy_frame in dropna * Clean up dropna multiple axis tests * remove unnecessary axis modification * Clean up dropna tests * resolve comments * fix lint
What do these changes do?
Implement dropna when axis is tuple or list.