- 
                Notifications
    You must be signed in to change notification settings 
- Fork 28.9k
[SPARK-28151][SQL] Fix MsSqlServerDialect Byte/Short/Float type mappings ( DRAFT) #24969
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…ta sources.Relevant unit tests are added.
Issues:
Writing dataframe with column type BYTETYPE fails when using JDBC connector for SQL Server. Append and Read of tables also fail. The problem is due  to
1. (Write path) Incorrect mapping of BYTETYPE in getCommonJDBCType() in jdbcutils.scala where BYTETYPE gets mapped to BYTE text. It should be mapped to TINYINT
case ByteType => Option(JdbcType("BYTE", java.sql.Types.TINYINT))
In getCatalystType() ( JDBC to Catalyst type mapping) TINYINT is mapped to INTEGER, while it should be mapped to BYTETYPE. Mapping to integer is ok from the point of view of upcasting, but will lead to 4 byte allocation rather than 1 byte for BYTETYPE.
2. (Read path) Read path ends up calling makeGetter(dt: DataType, metadata: Metadata). The function sets the value in RDD row. The value is set per the data type. Here there is no mapping for BYTETYPE and thus results will result in an error when getCatalystType() is fixed.
Note : These issues were found when reading/writing with SQLServer.
Error seen when writing table
(JDBC Write failed,com.microsoft.sqlserver.jdbc.SQLServerException: Column, parameter, or variable apache#2: Cannot find data type BYTE.)
com.microsoft.sqlserver.jdbc.SQLServerException: Column, parameter, or variable apache#2: Cannot find data type BYTE.
com.microsoft.sqlserver.jdbc.SQLServerException.makeFromDatabaseError(SQLServerException.java:254)
com.microsoft.sqlserver.jdbc.SQLServerStatement.getNextResult(SQLServerStatement.java:1608)
com.microsoft.sqlserver.jdbc.SQLServerStatement.doExecuteStatement(SQLServerStatement.java:859)
 ..
    ShortType and FloatTypes are not correctly mapped to right JDBC types when using JDBC connector. This results in tables and
spark data frame being created with unintended types.
Some example issue
    Write from df with column type results in a SQL table of with column type as INTEGER as opposed to SMALLINT. Thus a larger
    table that expected. Read results in a dataframe with type INTEGER as opposed to ShortType
FloatTypes have a issue with read path. In the write path Spark data type 'FloatType' is correctly mapped to JDBC equivalent data type 'Real'.
But in the read path when JDBC data types need to be converted to Catalyst data types ( getCatalystType) 'Real' gets incorrectly gets mapped
to 'DoubleType' rather than 'FloatType'.
            
          
                sql/core/src/test/scala/org/apache/spark/sql/jdbc/JDBCSuite.scala
              
                Outdated
          
            Show resolved
            Hide resolved
        
      There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know SQL Server, so can't really evaluate it, but seems like a plausible change.
        
          
                sql/core/src/main/scala/org/apache/spark/sql/jdbc/MsSqlServerDialect.scala
              
                Outdated
          
            Show resolved
            Hide resolved
        
      | case BooleanType => Some(JdbcType("BIT", java.sql.Types.BIT)) | ||
| case BinaryType => Some(JdbcType("VARBINARY(MAX)", java.sql.Types.VARBINARY)) | ||
| case ByteType => Option(JdbcType("TINYINT", java.sql.Types.TINYINT)) | ||
| case ShortType => Option(JdbcType("SMALLINT", java.sql.Types.SMALLINT)) | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does FloatType need to map back to REAL or does that already happen?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, FloatType is correctly mapped to REAL in def getCommonJDBCType(dt: DataType)
…NT and FLOAT in MsSqlServerIntegrationSuite.scala. 2. code cleanup in MsSqlServerDialect.scala 3. Fixed issue in JdbcUtils where tinyInt was being written as Int and failing the 'basic write' test cases in IntegrationSuite.
| ok to test | 
| Test build #107268 has finished for PR 24969 at commit  
 | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi, @shivsood . Thank you for your contribution.
Could you run dev/scalastyle in your environment and fix all errors?
Currently, this PR has the followings.
========================================================================
Running Scala style checks
========================================================================
[info] Checking Scala style using SBT with these profiles:  -Phadoop-2.7 -Pkubernetes -Phive-thriftserver -Phadoop-cloud -Pkinesis-asl -Pyarn -Pspark-ganglia-lgpl -Phive -Pmesos
Scalastyle checks failed at following occurrences:
[error] /home/jenkins/workspace/SparkPullRequestBuilder@2/sql/core/src/test/scala/org/apache/spark/sql/jdbc/JDBCSuite.scala:909:68: No space after token ,
[error] /home/jenkins/workspace/SparkPullRequestBuilder@2/sql/core/src/test/scala/org/apache/spark/sql/jdbc/JDBCSuite.scala:909:78: No space after token ,
[error] /home/jenkins/workspace/SparkPullRequestBuilder@2/sql/core/src/test/scala/org/apache/spark/sql/jdbc/JDBCSuite.scala:909:80: No space after token ,
[error] /home/jenkins/workspace/SparkPullRequestBuilder@2/sql/core/src/test/scala/org/apache/spark/sql/jdbc/JDBCSuite.scala:911:69: No space after token ,
[error] /home/jenkins/workspace/SparkPullRequestBuilder@2/sql/core/src/test/scala/org/apache/spark/sql/jdbc/JDBCSuite.scala:911:80: No space after token ,
[error] /home/jenkins/workspace/SparkPullRequestBuilder@2/sql/core/src/test/scala/org/apache/spark/sql/jdbc/JDBCSuite.scala:911:82: No space after token ,
[error] /home/jenkins/workspace/SparkPullRequestBuilder@2/sql/core/src/test/scala/org/apache/spark/sql/jdbc/JDBCSuite.scala:913:65: No space after token ,
[error] /home/jenkins/workspace/SparkPullRequestBuilder@2/sql/core/src/test/scala/org/apache/spark/sql/jdbc/JDBCSuite.scala:913:72: No space after token ,
[error] /home/jenkins/workspace/SparkPullRequestBuilder@2/sql/core/src/test/scala/org/apache/spark/sql/jdbc/JDBCSuite.scala:913:74: No space after token ,
[error] Total time: 27 s, completed Jul 5, 2019 1:08:48 AM
        
          
                sql/core/src/test/scala/org/apache/spark/sql/jdbc/JDBCSuite.scala
              
                Outdated
          
            Show resolved
            Hide resolved
        
              
          
                sql/core/src/test/scala/org/apache/spark/sql/jdbc/JDBCSuite.scala
              
                Outdated
          
            Show resolved
            Hide resolved
        
      | Test build #107364 has finished for PR 24969 at commit  
 | 
| 
 Resolved this. I am not using this exact command below to build locally and thus did not see that. Have to see the right/strict way to build that shows all all error locally. [info] Done packaging. 
 | 
| Test build #4816 has finished for PR 24969 at commit  
 | 
…hould have seen this as a build failure earlier.
| Test build #107370 has finished for PR 24969 at commit  
 | 
        
          
                sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala
              
                Outdated
          
            Show resolved
            Hide resolved
        
              
          
                sql/core/src/main/scala/org/apache/spark/sql/jdbc/MsSqlServerDialect.scala
              
                Outdated
          
            Show resolved
            Hide resolved
        
              
          
                sql/core/src/test/scala/org/apache/spark/sql/jdbc/JDBCSuite.scala
              
                Outdated
          
            Show resolved
            Hide resolved
        
              
          
                sql/core/src/test/scala/org/apache/spark/sql/jdbc/JDBCSuite.scala
              
                Outdated
          
            Show resolved
            Hide resolved
        
              
          
                sql/core/src/test/scala/org/apache/spark/sql/jdbc/JDBCSuite.scala
              
                Outdated
          
            Show resolved
            Hide resolved
        
      | case ByteType => | ||
| (stmt: PreparedStatement, row: Row, pos: Int) => | ||
| stmt.setInt(pos + 1, row.getByte(pos)) | ||
| stmt.setByte(pos + 1, row.getByte(pos)) | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is orthogonal to MySqlServerDialect. Can we proceed this PR without this line?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or, we can proceed this first as another PR with a proper regression test.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@dongjoon-hyun JDBCUtils.scala fix is required for completeness of the ByteType Fix. Without this fix Integrations test in "external/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/MsSqlServerIntegrationSuite.scala" will fail.
I dont think the "MsSqlServerIntegrationSuite.scala" integration tests are running in CI/CD. I was running these locally following a review suggestion. Is that's true, i can still remove this fix and submit this as part of another PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Then, you need to follow the second option I gave. Please make another PR. We need to review that first before this PR.
Or, we can proceed this first as another PR with a proper regression test.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@dongjoon-hyun Can u please elaborate on proper regression test. I have test cases added/fixed for ByteType in JDBCSuite.scala and MsSqlServerIntegrationSuite.scala. Are you expecting some other test? Can u please point me to type of test.
Another way to cleanly split this PR would be as follows.
- Remove ByteType fix from this PR and jdbcutils.scala fix also get removed. This PR will only have fix for FloatType and Short.
- 2nd PR with complete ByteType fix with more elaborate integration test.
I prefer this for better seperation of concern in each of the PRs , while still retaining the completeness. Let me know what you think.
Thanks for your patience and very helpful guidance.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What I meant is like your 2nd option.
2nd PR with complete ByteType fix with more elaborate integration test.
Here, the proper regression test means a test causing failure for the other Dialects. (e.g. PostgreSQL/MySQL/...) because JdbcUtils is shared across Dialects.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@dongjoon-hyun Would create 2 PRs then. PR1 ( this PR) i will remove ByteType and JDBCUtilis fix. Will create PR2 for ByteType and JDBCUtilis fix with full integration test. PR2 will take time as i have to investigate how to test for other dialects.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep. Please try as you want. Then, ping me later when the PR is ready.
| @shivsood .  | 
| Test build #107412 has finished for PR 24969 at commit  
 | 
| @shivsood do you want to update this PR accordingly? | 
| 
 | 
| @shivsood did you open a new PR? | 
| Yes. We are working on #25146 . | 
| Was a PR ever submitted on the Byte/TinyInt problem? If not, does anyone know why? We are still getting burned by this. | 
| @dongjoon-hyun @shivsood:: I'm getting this exception below which makes me consider that maybe the fix was not applied?? Was the fix for ByteType ever submitted/applied? To which branch, please? Thanks | 
| #25146 ? see above | 
| @srowen : Feedback: As suspected, the issue was between the keyboard and the chair. Maybe I was not casting properly, or maybe a typo or maybe some equality test somewhere... whatever... when I cast like this  | 
Abandoning this PR. Will create new one that have specific fixes
Fix for ByteType, ShortType and FloatTypes are not correctly mapped for read/write of SQLServer tables
###ByteType issue
Writing dataframe with column type BYTETYPE fails when using JDBC connector for SQL Server. Append and Read of tables also fail. The problem is due
case ByteType => Option(JdbcType("BYTE", java.sql.Types.TINYINT))
In getCatalystType() ( JDBC to Catalyst type mapping) TINYINT is mapped to INTEGER, while it should be mapped to BYTETYPE. Mapping to integer is ok from the point of view of upcasting, but will lead to 4 byte allocation rather than 1 byte for BYTETYPE.
Note : These issues were found when reading/writing with SQLServer. Will be submitting a PR soon to fix these mappings in MSSQLServerDialect.
Error seen when writing table
(JDBC Write failed,com.microsoft.sqlserver.jdbc.SQLServerException: Column, parameter, or variable #2: Cannot find data type BYTE.)
com.microsoft.sqlserver.jdbc.SQLServerException: Column, parameter, or variable #2: Cannot find data type BYTE.
com.microsoft.sqlserver.jdbc.SQLServerException.makeFromDatabaseError(SQLServerException.java:254)
com.microsoft.sqlserver.jdbc.SQLServerStatement.getNextResult(SQLServerStatement.java:1608)
com.microsoft.sqlserver.jdbc.SQLServerStatement.doExecuteStatement(SQLServerStatement.java:859)
..
###ShortType and FloatType issue
ShortType and FloatTypes are not correctly mapped to right JDBC types when using JDBC connector. This results in tables and spark data frame being created with unintended types.
Some example issue
Write from df with column type results in a SQL table of with column type as INTEGER as opposed to SMALLINT. Thus a larger table that expected.
read results in a dataframe with type INTEGER as opposed to ShortType
FloatTypes have a issue with read path. In the write path Spark data type 'FloatType' is correctly mapped to JDBC equivalent data type 'Real'. But in the read path when JDBC data types need to be converted to Catalyst data types ( getCatalystType) 'Real' gets incorrectly gets mapped to 'DoubleType' rather than 'FloatType'.
What changes were proposed in this pull request?
(MsSqlServerDialect.scala ) BYTETYPE is mapped to "TINYINT", SHORTTYPE is mapped to small int
(JdbcUtils.scala) reading ByteTYPE is added to makeGetter to enable reading after ByteType is translated to TINYINT.
How was this patch tested?
Unit test - Added and passed.
Integration Test - Tested end to end with SQLServer using JDBC connector.
(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)
(If this patch involves UI changes, please attach a screenshot; otherwise, remove this)
Please review https://spark.apache.org/contributing.html before opening a pull request.