- 
                Notifications
    You must be signed in to change notification settings 
- Fork 28.9k
[SPARK-20098][PYSPARK] dataType's typeName fix #17435
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
        
          
                python/pyspark/sql/types.py
              
                Outdated
          
        
      | "ShortType": "short", | ||
| "ArrayType": "array", | ||
| "MapType": "map", | ||
| "StructField": "struct", | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems this problem only applies to StructField. Could we just override typeName with simply throwing an exception? I think users are not supposed to call typeName against StructField but simpleString against the instance.
BTW, It apparently seems a bit odd that it extends DataType though.. I guess probably some tests are broken if we change the parent as it seems it is dependent on the parent assuming from #1598. So, I guess minimised fix would be just to override it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I don't think it is valid to call typeName against a StructField. Actually, StructField is not a data type, strictly speaking...
I don't know why StructField inherits DataType in pyspark. In scala, it is not.
Throwing an exception when calling typeName on StructField seems good enough, instead of a map like this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@viirya The way I found this bug:
I wanted to figure out the schema of a dataset. I loaded it into a data frame  and asked its schema. Then I called the typeName on each column. I do not know this is/was best way to do this, but I think it is valid to call typeName against a dataType to get its real type.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is valid call for DataTypes but I don't think it is against StructField. If you look at the Scala-side codes, StructField is not a DataType (which is "The base type of all Spark SQL data types") but it seems the parent became DataType in Python for some reason in the PR I pointed out. In any way, It seems  "A field inside a StructType".
If you want to know the schema, you could simply
>>> spark.range(1).schema[0].simpleString()
'id:bigint'
>>> spark.range(1).schema.simpleString()
'struct<id:bigint>'
>>> type(spark.range(1).schema[0])
<class 'pyspark.sql.types.StructField'>
>>> str(spark.range(1).schema[0])
'StructField(id,LongType,false)'Could you elaborate your use-case and why simpleString is not enough?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@szalai1 I think @HyukjinKwon 's code snippets should address your request. Doesn't it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Btw, I don't think i.typeName() is a valid usage. We better let it throw an exception when calling typeName on StructField.
i.dataType.typeName() is more reasonable call to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@viirya You are right. I will use it that way. Thanks @HyukjinKwon !
But my bug report is still valid,  I think. Can I override the typeName function?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yup, I think we should still fix and overriding it is good enough.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
| @szalai1, could you update this PR please? Otherwise, we should close this as a stale PR. | 
| @HyukjinKwon sure, I will do it this week. I totally forgot this. Sorry. | 
| I think we need a test and @holdenk's review. | 
| Thanks for working on this. I feel like the error message could maybe be improved to suggest what the user should be doing? It would be nicer to eventually not have this depend on DataType since we don't have this in the Scala version as @HyukjinKwon pointed out, but I think this could be a good improvement for now. | 
| @holdenk I am happy to contribute to this project. I changed the error message and added a test case. | 
| Sorry for my delay finishing the review. This looks good pending Jenkins and if it passes I'll try and merge it when I get home. (so Jenkins looks good to test. Jenkins test this please. ) :) | 
| Test build #78097 has finished for PR 17435 at commit  
 | 
| Hey @szalai1 if you've got time to address the style issues its looking good otherwise :) | 
| @holdenk Thanks for reviewing it. I fixed the style issues. | 
| ok to test | 
| Test build #80013 has finished for PR 17435 at commit  
 | 
| @szalai1 Could you fix tests if you're still working on this please? | 
| @ueshin Sure, I'll do it next week. | 
| Gentle ping (going through old PySpark PRs) :) | 
| Test build #81511 has finished for PR 17435 at commit  
 | 
| @holdenk thanks for reminder. I fixed the problem. | 
| return self.dataType.fromInternal(obj) | ||
|  | ||
| def typeName(self): | ||
| raise TypeError( | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we do like ...
raise TypeError(
    "..."
    "...")if it doesn't bother you much?
        
          
                python/pyspark/sql/types.py
              
                Outdated
          
        
      | def typeName(self): | ||
| raise TypeError( | ||
| "StructField does not have typename. \ | ||
| You can use self.dataType.simpleString() instead.") | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd remove self here and just say something like  use typeName() on its type explicitly ....
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you  mean simpleString() and not typeName(). right ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it should be typeName on its datatype because typeName was called.
        
          
                python/pyspark/sql/types.py
              
                Outdated
          
        
      |  | ||
| def typeName(self): | ||
| raise TypeError( | ||
| "StructField does not have typename. \ | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Little nit: looks a typo, typename -> typeName.
| Test build #81556 has finished for PR 17435 at commit  
 | 
        
          
                python/pyspark/sql/types.py
              
                Outdated
          
        
      |  | ||
| def typeName(self): | ||
| raise TypeError( | ||
| "StructField does not have typeName." | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am sorry. Could you add a space at the end ..eName." -> ...eName. "?
| Test build #81559 has finished for PR 17435 at commit  
 | 
| @szalai1 I think @HyukjinKwon wanted the space on the previous line/sentence. | 
| Test build #81584 has finished for PR 17435 at commit  
 | 
| LGTM | 
## What changes were proposed in this pull request? `typeName` classmethod has been fixed by using type -> typeName map. ## How was this patch tested? local build Author: Peter Szalai <[email protected]> Closes #17435 from szalai1/datatype-gettype-fix. (cherry picked from commit 520d92a) Signed-off-by: hyukjinkwon <[email protected]>
| Merged to master and branch-2.2. | 
## What changes were proposed in this pull request? `typeName` classmethod has been fixed by using type -> typeName map. ## How was this patch tested? local build Author: Peter Szalai <[email protected]> Closes apache#17435 from szalai1/datatype-gettype-fix. (cherry picked from commit 520d92a) Signed-off-by: hyukjinkwon <[email protected]>
What changes were proposed in this pull request?
typeNameclassmethod has been fixed by using type -> typeName map.How was this patch tested?
local build