[SPARK-10857] [SQL] Block SQL injection vulnerabilities under DataFrameWriter.jdbc(). #9202

rick-ibm · 2015-10-21T19:10:44Z

This patch attempts to address both https://issues.apache.org/jira/browse/SPARK-10857 and https://issues.apache.org/jira/browse/SPARK-10977. This patch is my original work which I license to the ASF. My ICLA is already on file as a Derby committer: http://people.apache.org/committer-index.html

I have abandoned the approach discussed on those issues, viz., attempting to replace the user-supplied table name with a properly escaped and quoted chain of dot-separated delimited identifiers. I believe that JDBC does not supply enough information to attempt that substitution, particularly in the case of a Microsoft SQL Server running on a case-insensitive file system. An overview of the issues involved can be found here: https://github.com/ontop/ontop/wiki/Case-sensitivity-for-SQL-identifiers

Instead, this patch simply verifies that the table name is a valid SQL identifier before inserting it into a CREATE/DROP/INSERT statement or into a statement which tests whether the table exists. That is, a SQL injection attempt will fail because the injected text does not parse as a valid SQL identifier.

The validating code may not catch some bad names. This can happen for databases which don't allow embedded, escaped quotes inside quoted identifiers. Microsoft SQL Server and Oracle don't allow embedded quotes. For those databases, the bad table name will appear as an ungrammatical sequence of quoted identifiers. E.g.:

"foo""bar"

will appear as

"foo" "bar"

Instead of a "bad identifier" error, those databases will raise some other exception. I leave further polishing of this behavior to people who are experts on the affected databases.

This patch makes the following changes:

Introduces a new Java class, SqlIdentifierUtil. The code was cribbed from Derby, where it has undergone almost 2 decades of commodity testing in production applications. I thought that it was safer to re-use this code rather than translate it into Scala and run the risk of introducing translation errors. However, I can try my hand at translating this class into Scala if reviewers think that that is a better approach.

The SqlIdentifierUtil class has 2 entry points:

a) quoteString() - This method escapes embedded quotes when converting a string to a quoted identifier. JdbcDialect.quoteIdentifier() did not account for this subtlety. But the patch makes JdbcDialect.quoteIdentifier() call SqlIdentifier.quoteString() now.

b) parseMultiPartSQLIdentifier() - This method takes a dot separated sequence of SQL identifiers like databaseName.schemaName.tableName and returns an array of normalized identifiers like [ DATABASENAME, SCHEMANAME, TABLENAME ].

Introduces some new JdbcDialect methods:

a) vetSqlIdentifier() - This method takes a table name and raises an exception if the name isn't a legal SQL identifier as determined by SqlIdentifierUtil.parseMultiPartSQLIdentifier().

b) quoteChar() - This method returns the dialect-specific character which is used to frame quoted identifiers. This information is needed by SqlIdentifierUtil.parseMultiPartSQLIdentifier().

Calls JdbcDialect.vetSqlIdentifier() everywhere that we plug a table name into a SQL statement. I have placed the calls to JdbcDialect.vetSqlIdentifier() as close to the statement text as possible in order to minimize the chance that a new code path will be introduced in the future which circumvents the SQL injection check.

marmbrus · 2015-10-21T20:32:34Z

ok to test

rick-ibm · 2015-10-21T21:11:38Z

According to the log for the build/test cycle, a git pull request succeeded...

GitHub pull request #9202 of commit b3b845de9960e003637d26fecaf8dccaa0206f59 automatically merged.

...but a git fetch failed after 15 minutes...

git fetch --tags --progress https://github.com/apache/spark.git +refs/pull/9202/:refs/remotes/origin/pr/9202/ # timeout=15

Would appreciate some clues about what I've done wrong.

Thanks,
-Rick

marmbrus · 2015-10-21T21:13:54Z

Transient networking issues. test this please

SparkQA · 2015-10-21T21:21:32Z

Test build #44094 has finished for PR 9202 at commit b3b845d.

This patch fails Scala style tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):\n * public class SqlIdentifierUtil\n

rick-ibm · 2015-10-21T21:38:20Z

The following style tests ran cleanly for me:

build/sbt scalastyle

What other style tests should I run?

Thanks,
-Rick

JoshRosen · 2015-10-21T21:39:21Z

Scalastyle checks failed at following occurrences:
[error] /home/jenkins/workspace/SparkPullRequestBuilder@3/sql/core/src/test/scala/org/apache/spark/sql/jdbc/JDBCWriteSuite.scala:167: File line length exceeds 100 characters
[error] (sql/test:scalastyle) errors exist
[error] Total time: 8 s, completed Oct 21, 2015 2:20:49 PM
[error] /home/jenkins/workspace/SparkPullRequestBuilder@3/sql/core/src/test/scala/org/apache/spark/sql/jdbc/JDBCWriteSuite.scala:167: File line length exceeds 100 characters
[error] (sql/test:scalastyle) errors exist
[error] Total time: 8 s, completed Oct 21, 2015 2:21:32 PM
[error] running /home/jenkins/workspace/SparkPullRequestBuilder@3/dev/lint-scala ; received return code 1

rick-ibm · 2015-10-21T22:35:41Z

Thanks, Josh. My last commit should address that long line problem.

SparkQA · 2015-10-22T00:45:02Z

Test build #44096 has finished for PR 9202 at commit e7e401f.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):\n * public class SqlIdentifierUtil\n

srowen · 2015-10-22T12:52:21Z

sql/catalyst/src/main/java/org/apache/spark/sql/SqlIdentifierUtil.java

As an aside, does this need to be Java? it would be simpler in Scala. I may be missing a reason it must be.

Oh right. I read right past the reasoning. It's not just inspired by Derby but copied wholesale on purpose. OK, that makes some sense, more than making a dependency on Derby. It has some 'cost' though.

rick-ibm · 2015-10-22T15:46:08Z

Thanks for the review comments, Sean.

I did not polish the Java code in SqlIdentifierUtil, and I didn't translate it into Scala. This was in the interests of stability (see the header comment on my pull request above). The code came from Derby, where it has been proven by two decades of use in production applications. But I am happy to re-write it in Scala if you think that the benefits outweigh the risk of introducing bugs.

Concerning JdbcDialects.vetSqlIdentifier(), thanks for the suggestion about using a case class. That does sound like a better solution.

Thanks,
-Rick

SparkQA · 2015-10-22T17:51:23Z

Test build #44156 has finished for PR 9202 at commit 79514f3.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):\n * public class SqlIdentifierUtil\n

rick-ibm · 2015-10-27T18:50:34Z

Thanks for the advice about how to improve this patch, Sean. Hopefully, my last commit addresses your concerns. I have made the following changes:

Eliminated the Java code (SqlIdentifierUtil) and replaced it with a Scala class (SqlIdUtil).
Revamped the id parsing to use a Scala regular expression.
Introduced a case class (TableId) to represent a 3-part table identfier. This replaces the ordered triple which performed the same service in the previous rev of the patch.

Thanks,
-Rick

SparkQA · 2015-10-27T20:59:36Z

Test build #44457 has finished for PR 9202 at commit 7d90459.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):\n * case class TableId(database: String, schema: String, table: String)\n

rick-ibm · 2015-10-30T14:36:55Z

Any further refinements which I should make to this patch? Thanks.

marmbrus · 2015-10-30T17:08:24Z

@rick-ibm, thanks for working on this. With the Spark Summit just wrapping up and code freeze for 1.6 tomorrow I don't think there is going to be a lot of review bandwidth for a patch this large until after the release. While it would be nice to secure things, this doesn't seem like a particularly high priority vulnerability since any user that is creating a dataframe by querying JDBC must have the credentials already and could just open their own connection.

I haven't looked closely at the implementation, but one high level question is whether this is breaking the use case where a users gives a subquery instead of a table name (i.e. dbtable = (SELECT ...).) This is an important part of the API that we can't break.

rick-ibm · 2015-10-30T20:05:30Z

Thanks for that feedback, Michael. To answer your question:

MA> I haven't looked closely at the implementation, but one high level question is whether this is breaking the use case where a users gives a subquery instead of a table name (i.e. dbtable = (SELECT ...).) This is an important part of the API that we can't break.

Reynold posed the same question at https://issues.apache.org/jira/browse/SPARK-10857?focusedCommentId=14967827&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14967827 and I replied at https://issues.apache.org/jira/browse/SPARK-10857?focusedCommentId=14967930&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14967930

Can you give me an example of the syntax you have in mind which might trip this problem? I tried the following experiment...

val jdbcDF = sqlContext.read.format("jdbc").options(
Map("url" -> "jdbc:derby:/Users/rhillegas/derby/databases/derby1",
"dbtable" -> "select * from app.t")).load()

...but that statement doesn't tickle JdbcDialect.getTableExistsQuery(). Instead, the statement fails because the following ungrammatical query is sent to the database:

SELECT * FROM select * from app.t WHERE 1=0

That, in turn, is because JDBCRDD.resolveTable() has its own hard-coded query which probes for the existence of a table. I have logged https://issues.apache.org/jira/browse/SPARK-11426 to track this issue. The header comment on JDBCRDD.resolveTable() indicates that the method expects a table name and not an arbitrary query.

Thanks,
-Rick

marmbrus · 2015-10-30T20:54:04Z

You have to put the subquery in parentheses so it results in valid SQL.
On Oct 30, 2015 9:06 PM, "Rick Hillegas" [email protected] wrote:

Thanks for that feedback, Michael. To answer your question:

MA> I haven't looked closely at the implementation, but one high level
question is whether this is breaking the use case where a users gives a
subquery instead of a table name (i.e. dbtable = (SELECT ...).) This is an
important part of the API that we can't break.

Reynold posed the same question at
https://issues.apache.org/jira/browse/SPARK-10857?focusedCommentId=14967827&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14967827
and I replied at
https://issues.apache.org/jira/browse/SPARK-10857?focusedCommentId=14967930&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14967930

Can you give me an example of the syntax you have in mind which might trip
this problem? I tried the following experiment...

val jdbcDF = sqlContext.read.format("jdbc").options(
Map("url" -> "jdbc:derby:/Users/rhillegas/derby/databases/derby1",
"dbtable" -> "select * from app.t")).load()

...but that statement doesn't tickle JdbcDialect.getTableExistsQuery().
Instead, the statement fails because the following ungrammatical query is
sent to the database:

SELECT * FROM select * from app.t WHERE 1=0

That, in turn, is because JDBCRDD.resolveTable() has its own hard-coded
query which probes for the existence of a table. I have logged
https://issues.apache.org/jira/browse/SPARK-11426 to track this issue.
The header comment on JDBCRDD.resolveTable() indicates that the method
expects a table name and not an arbitrary query.

Thanks,
-Rick

—
Reply to this email directly or view it on GitHub
#9202 (comment).

rick-ibm · 2015-10-30T22:09:12Z

Thanks for the quick response, Michael. Simply parenthesizing the query will result in non-Standard syntax which an ANSI-compliant database will reject...

// Raises the unhelpful error: Syntax error: Encountered "WHERE" at line 1, column 37.
try {
val jdbcDF = sqlContext.read.format("jdbc").options(
Map("url" -> "jdbc:derby:/Users/rhillegas/derby/databases/derby1",
"dbtable" -> "(select * from app.t)")).load()
} catch {
case t: Throwable => println(t.getMessage)
}

However, you can go one step further and give the subquery a correlation name so that it will fit nicely in a FROM list...

// succeeds
try {
val jdbcDF = sqlContext.read.format("jdbc").options(
Map("url" -> "jdbc:derby:/Users/rhillegas/derby/databases/derby1",
"dbtable" -> "(select * from app.t) correlationName")).load()
} catch {
case t: Throwable => println(t.getMessage)
}

I am inclined to close SPARK-11426 as "not a problem". It seems that JDBCRDD.resolveTable() deliberately doesn't call JdbcDialect.getTableExistsQuery() because it is not really probing for the existence of the table. It is trying to compute the row-signature of a query expression.

In any event, the changes made by this patch do not seem to affect the code path you are concerned about.

Thanks,
-Rick

…erify that they are legal SQL identifiers in the underlying database.

… based on regular expressions; introduce a case object to represent a 3-part table identifier.

SparkQA · 2015-11-04T19:44:16Z

Test build #45027 has finished for PR 9202 at commit 6c906bb.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):\n * case class ExecutorLostFailure(\n * case class RegisteredExecutor(hostname: String) extends CoarseGrainedClusterMessage\n * public class JavaIsotonicRegressionExample\n * public class JavaNaiveBayesExample\n * case class TableId(database: String, schema: String, table: String)\n * case class DecimalLit(chars: String) extends Token\n * case class UnresolvedStar(target: Option[Seq[String]]) extends Star with Unevaluable\n * case class Corr(\n * case class Corr(left: Expression, right: Expression)\n * case class Exchange(\n * class CoalescedPartitioner(val parent: Partitioner, val partitionStartIndices: Array[Int])\n

AmplabJenkins · 2016-05-03T17:17:15Z

Can one of the admins verify this patch?

sameeragarwal · 2017-06-16T23:02:31Z

@rick-ibm would you still have time to bring this up to date? Thanks!

gatorsmile · 2017-06-27T06:39:30Z

We are closing it due to inactivity. please do reopen if you want to push it forward. Thanks!

srowen reviewed Oct 22, 2015
View reviewed changes

rick-ibm force-pushed the b_10857_sqlInjection branch from e7e401f to 79514f3 Compare October 22, 2015 15:49

rick-ibm force-pushed the b_10857_sqlInjection branch from 79514f3 to 7d90459 Compare October 27, 2015 18:49

rick-ibm added 6 commits November 4, 2015 07:09

SPARK-10857: Vet table names which are inserted into SQL strings to v…

2dde452

…erify that they are legal SQL identifiers in the underlying database.

SPARK-10857: Add and remove some comments.

2548175

SPARK-10857: Add and remove some comments.

2865416

DERBY-10857: Fix a scalastyle problem.

287e8f0

SPARK-10857: Fix a long line style problem.

1353820

SPARK-10857: Replace the Java id parser with a Scala version which is…

6c906bb

… based on regular expressions; introduce a case object to represent a 3-part table identifier.

rick-ibm force-pushed the b_10857_sqlInjection branch from 7d90459 to 6c906bb Compare November 4, 2015 17:45

HyukjinKwon mentioned this pull request Jun 25, 2017

[INFRA] Close stale PRs #18417

Closed

asfgit closed this in b32bd00 Jun 27, 2017

Uh oh!

[SPARK-10857] [SQL] Block SQL injection vulnerabilities under DataFrameWriter.jdbc(). #9202

[SPARK-10857] [SQL] Block SQL injection vulnerabilities under DataFrameWriter.jdbc(). #9202

Uh oh!

Conversation

rick-ibm commented Oct 21, 2015

Uh oh!

marmbrus commented Oct 21, 2015

Uh oh!

rick-ibm commented Oct 21, 2015

Uh oh!

marmbrus commented Oct 21, 2015

Uh oh!

SparkQA commented Oct 21, 2015

Uh oh!

rick-ibm commented Oct 21, 2015

Uh oh!

JoshRosen commented Oct 21, 2015

Uh oh!

rick-ibm commented Oct 21, 2015

Uh oh!

SparkQA commented Oct 22, 2015

Uh oh!

srowen Oct 22, 2015

Choose a reason for hiding this comment

Uh oh!

srowen Oct 23, 2015

Choose a reason for hiding this comment

Uh oh!

rick-ibm commented Oct 22, 2015

Uh oh!

SparkQA commented Oct 22, 2015

Uh oh!

rick-ibm commented Oct 27, 2015

Uh oh!

SparkQA commented Oct 27, 2015

Uh oh!

rick-ibm commented Oct 30, 2015

Uh oh!

marmbrus commented Oct 30, 2015

Uh oh!

rick-ibm commented Oct 30, 2015

Uh oh!

marmbrus commented Oct 30, 2015

Uh oh!

rick-ibm commented Oct 30, 2015

Uh oh!

SparkQA commented Nov 4, 2015

Uh oh!

AmplabJenkins commented May 3, 2016

Uh oh!

sameeragarwal commented Jun 16, 2017

Uh oh!

gatorsmile commented Jun 27, 2017

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

8 participants