Skip to content

Conversation

@rick-ibm
Copy link
Contributor

@marmbrus
@rxin
@JoshRosen
@srowen

This patch attempts to address both https://issues.apache.org/jira/browse/SPARK-10857 and https://issues.apache.org/jira/browse/SPARK-10977. This patch is my original work which I license to the ASF. My ICLA is already on file as a Derby committer: http://people.apache.org/committer-index.html

I have abandoned the approach discussed on those issues, viz., attempting to replace the user-supplied table name with a properly escaped and quoted chain of dot-separated delimited identifiers. I believe that JDBC does not supply enough information to attempt that substitution, particularly in the case of a Microsoft SQL Server running on a case-insensitive file system. An overview of the issues involved can be found here: https://github.com/ontop/ontop/wiki/Case-sensitivity-for-SQL-identifiers

Instead, this patch simply verifies that the table name is a valid SQL identifier before inserting it into a CREATE/DROP/INSERT statement or into a statement which tests whether the table exists. That is, a SQL injection attempt will fail because the injected text does not parse as a valid SQL identifier.

The validating code may not catch some bad names. This can happen for databases which don't allow embedded, escaped quotes inside quoted identifiers. Microsoft SQL Server and Oracle don't allow embedded quotes. For those databases, the bad table name will appear as an ungrammatical sequence of quoted identifiers. E.g.:

"foo""bar"

will appear as

"foo" "bar"

Instead of a "bad identifier" error, those databases will raise some other exception. I leave further polishing of this behavior to people who are experts on the affected databases.

This patch makes the following changes:

  1. Introduces a new Java class, SqlIdentifierUtil. The code was cribbed from Derby, where it has undergone almost 2 decades of commodity testing in production applications. I thought that it was safer to re-use this code rather than translate it into Scala and run the risk of introducing translation errors. However, I can try my hand at translating this class into Scala if reviewers think that that is a better approach.

The SqlIdentifierUtil class has 2 entry points:

a) quoteString() - This method escapes embedded quotes when converting a string to a quoted identifier. JdbcDialect.quoteIdentifier() did not account for this subtlety. But the patch makes JdbcDialect.quoteIdentifier() call SqlIdentifier.quoteString() now.

b) parseMultiPartSQLIdentifier() - This method takes a dot separated sequence of SQL identifiers like databaseName.schemaName.tableName and returns an array of normalized identifiers like [ DATABASENAME, SCHEMANAME, TABLENAME ].

  1. Introduces some new JdbcDialect methods:

a) vetSqlIdentifier() - This method takes a table name and raises an exception if the name isn't a legal SQL identifier as determined by SqlIdentifierUtil.parseMultiPartSQLIdentifier().

b) quoteChar() - This method returns the dialect-specific character which is used to frame quoted identifiers. This information is needed by SqlIdentifierUtil.parseMultiPartSQLIdentifier().

  1. Calls JdbcDialect.vetSqlIdentifier() everywhere that we plug a table name into a SQL statement. I have placed the calls to JdbcDialect.vetSqlIdentifier() as close to the statement text as possible in order to minimize the chance that a new code path will be introduced in the future which circumvents the SQL injection check.

@marmbrus
Copy link
Contributor

ok to test

@rick-ibm
Copy link
Contributor Author

According to the log for the build/test cycle, a git pull request succeeded...

GitHub pull request #9202 of commit b3b845de9960e003637d26fecaf8dccaa0206f59 automatically merged.

...but a git fetch failed after 15 minutes...

git fetch --tags --progress https://github.com/apache/spark.git +refs/pull/9202/:refs/remotes/origin/pr/9202/ # timeout=15

Would appreciate some clues about what I've done wrong.

Thanks,
-Rick

@marmbrus
Copy link
Contributor

Transient networking issues. test this please

@SparkQA
Copy link

SparkQA commented Oct 21, 2015

Test build #44094 has finished for PR 9202 at commit b3b845d.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):\n * public class SqlIdentifierUtil\n

@rick-ibm
Copy link
Contributor Author

The following style tests ran cleanly for me:

build/sbt scalastyle

What other style tests should I run?

Thanks,
-Rick

@JoshRosen
Copy link
Contributor

Scalastyle checks failed at following occurrences:
[error] /home/jenkins/workspace/SparkPullRequestBuilder@3/sql/core/src/test/scala/org/apache/spark/sql/jdbc/JDBCWriteSuite.scala:167: File line length exceeds 100 characters
[error] (sql/test:scalastyle) errors exist
[error] Total time: 8 s, completed Oct 21, 2015 2:20:49 PM
[error] /home/jenkins/workspace/SparkPullRequestBuilder@3/sql/core/src/test/scala/org/apache/spark/sql/jdbc/JDBCWriteSuite.scala:167: File line length exceeds 100 characters
[error] (sql/test:scalastyle) errors exist
[error] Total time: 8 s, completed Oct 21, 2015 2:21:32 PM
[error] running /home/jenkins/workspace/SparkPullRequestBuilder@3/dev/lint-scala ; received return code 1

@rick-ibm
Copy link
Contributor Author

Thanks, Josh. My last commit should address that long line problem.

@SparkQA
Copy link

SparkQA commented Oct 22, 2015

Test build #44096 has finished for PR 9202 at commit e7e401f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):\n * public class SqlIdentifierUtil\n

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As an aside, does this need to be Java? it would be simpler in Scala. I may be missing a reason it must be.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh right. I read right past the reasoning. It's not just inspired by Derby but copied wholesale on purpose. OK, that makes some sense, more than making a dependency on Derby. It has some 'cost' though.

@rick-ibm
Copy link
Contributor Author

Thanks for the review comments, Sean.

I did not polish the Java code in SqlIdentifierUtil, and I didn't translate it into Scala. This was in the interests of stability (see the header comment on my pull request above). The code came from Derby, where it has been proven by two decades of use in production applications. But I am happy to re-write it in Scala if you think that the benefits outweigh the risk of introducing bugs.

Concerning JdbcDialects.vetSqlIdentifier(), thanks for the suggestion about using a case class. That does sound like a better solution.

Thanks,
-Rick

@rick-ibm rick-ibm force-pushed the b_10857_sqlInjection branch from e7e401f to 79514f3 Compare October 22, 2015 15:49
@SparkQA
Copy link

SparkQA commented Oct 22, 2015

Test build #44156 has finished for PR 9202 at commit 79514f3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):\n * public class SqlIdentifierUtil\n

@rick-ibm rick-ibm force-pushed the b_10857_sqlInjection branch from 79514f3 to 7d90459 Compare October 27, 2015 18:49
@rick-ibm
Copy link
Contributor Author

Thanks for the advice about how to improve this patch, Sean. Hopefully, my last commit addresses your concerns. I have made the following changes:

  1. Eliminated the Java code (SqlIdentifierUtil) and replaced it with a Scala class (SqlIdUtil).

  2. Revamped the id parsing to use a Scala regular expression.

  3. Introduced a case class (TableId) to represent a 3-part table identfier. This replaces the ordered triple which performed the same service in the previous rev of the patch.

Thanks,
-Rick

@SparkQA
Copy link

SparkQA commented Oct 27, 2015

Test build #44457 has finished for PR 9202 at commit 7d90459.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):\n * case class TableId(database: String, schema: String, table: String)\n

@rick-ibm
Copy link
Contributor Author

Any further refinements which I should make to this patch? Thanks.

@marmbrus
Copy link
Contributor

@rick-ibm, thanks for working on this. With the Spark Summit just wrapping up and code freeze for 1.6 tomorrow I don't think there is going to be a lot of review bandwidth for a patch this large until after the release. While it would be nice to secure things, this doesn't seem like a particularly high priority vulnerability since any user that is creating a dataframe by querying JDBC must have the credentials already and could just open their own connection.

I haven't looked closely at the implementation, but one high level question is whether this is breaking the use case where a users gives a subquery instead of a table name (i.e. dbtable = (SELECT ...).) This is an important part of the API that we can't break.

@rick-ibm
Copy link
Contributor Author

Thanks for that feedback, Michael. To answer your question:

MA> I haven't looked closely at the implementation, but one high level question is whether this is breaking the use case where a users gives a subquery instead of a table name (i.e. dbtable = (SELECT ...).) This is an important part of the API that we can't break.

Reynold posed the same question at https://issues.apache.org/jira/browse/SPARK-10857?focusedCommentId=14967827&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14967827 and I replied at https://issues.apache.org/jira/browse/SPARK-10857?focusedCommentId=14967930&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14967930

Can you give me an example of the syntax you have in mind which might trip this problem? I tried the following experiment...

val jdbcDF = sqlContext.read.format("jdbc").options(
Map("url" -> "jdbc:derby:/Users/rhillegas/derby/databases/derby1",
"dbtable" -> "select * from app.t")).load()

...but that statement doesn't tickle JdbcDialect.getTableExistsQuery(). Instead, the statement fails because the following ungrammatical query is sent to the database:

SELECT * FROM select * from app.t WHERE 1=0

That, in turn, is because JDBCRDD.resolveTable() has its own hard-coded query which probes for the existence of a table. I have logged https://issues.apache.org/jira/browse/SPARK-11426 to track this issue. The header comment on JDBCRDD.resolveTable() indicates that the method expects a table name and not an arbitrary query.

Thanks,
-Rick

@marmbrus
Copy link
Contributor

You have to put the subquery in parentheses so it results in valid SQL.
On Oct 30, 2015 9:06 PM, "Rick Hillegas" [email protected] wrote:

Thanks for that feedback, Michael. To answer your question:

MA> I haven't looked closely at the implementation, but one high level
question is whether this is breaking the use case where a users gives a
subquery instead of a table name (i.e. dbtable = (SELECT ...).) This is an
important part of the API that we can't break.

Reynold posed the same question at
https://issues.apache.org/jira/browse/SPARK-10857?focusedCommentId=14967827&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14967827
and I replied at
https://issues.apache.org/jira/browse/SPARK-10857?focusedCommentId=14967930&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14967930

Can you give me an example of the syntax you have in mind which might trip
this problem? I tried the following experiment...

val jdbcDF = sqlContext.read.format("jdbc").options(
Map("url" -> "jdbc:derby:/Users/rhillegas/derby/databases/derby1",
"dbtable" -> "select * from app.t")).load()

...but that statement doesn't tickle JdbcDialect.getTableExistsQuery().
Instead, the statement fails because the following ungrammatical query is
sent to the database:

SELECT * FROM select * from app.t WHERE 1=0

That, in turn, is because JDBCRDD.resolveTable() has its own hard-coded
query which probes for the existence of a table. I have logged
https://issues.apache.org/jira/browse/SPARK-11426 to track this issue.
The header comment on JDBCRDD.resolveTable() indicates that the method
expects a table name and not an arbitrary query.

Thanks,
-Rick


Reply to this email directly or view it on GitHub
#9202 (comment).

@rick-ibm
Copy link
Contributor Author

Thanks for the quick response, Michael. Simply parenthesizing the query will result in non-Standard syntax which an ANSI-compliant database will reject...

// Raises the unhelpful error: Syntax error: Encountered "WHERE" at line 1, column 37.
try {
val jdbcDF = sqlContext.read.format("jdbc").options(
Map("url" -> "jdbc:derby:/Users/rhillegas/derby/databases/derby1",
"dbtable" -> "(select * from app.t)")).load()
} catch {
case t: Throwable => println(t.getMessage)
}

However, you can go one step further and give the subquery a correlation name so that it will fit nicely in a FROM list...

// succeeds
try {
val jdbcDF = sqlContext.read.format("jdbc").options(
Map("url" -> "jdbc:derby:/Users/rhillegas/derby/databases/derby1",
"dbtable" -> "(select * from app.t) correlationName")).load()
} catch {
case t: Throwable => println(t.getMessage)
}

I am inclined to close SPARK-11426 as "not a problem". It seems that JDBCRDD.resolveTable() deliberately doesn't call JdbcDialect.getTableExistsQuery() because it is not really probing for the existence of the table. It is trying to compute the row-signature of a query expression.

In any event, the changes made by this patch do not seem to affect the code path you are concerned about.

Thanks,
-Rick

@rick-ibm rick-ibm force-pushed the b_10857_sqlInjection branch from 7d90459 to 6c906bb Compare November 4, 2015 17:45
@SparkQA
Copy link

SparkQA commented Nov 4, 2015

Test build #45027 has finished for PR 9202 at commit 6c906bb.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):\n * case class ExecutorLostFailure(\n * case class RegisteredExecutor(hostname: String) extends CoarseGrainedClusterMessage\n * public class JavaIsotonicRegressionExample\n * public class JavaNaiveBayesExample\n * case class TableId(database: String, schema: String, table: String)\n * case class DecimalLit(chars: String) extends Token\n * case class UnresolvedStar(target: Option[Seq[String]]) extends Star with Unevaluable\n * case class Corr(\n * case class Corr(left: Expression, right: Expression)\n * case class Exchange(\n * class CoalescedPartitioner(val parent: Partitioner, val partitionStartIndices: Array[Int])\n

@AmplabJenkins
Copy link

Can one of the admins verify this patch?

@sameeragarwal
Copy link
Member

@rick-ibm would you still have time to bring this up to date? Thanks!

@gatorsmile
Copy link
Member

We are closing it due to inactivity. please do reopen if you want to push it forward. Thanks!

@asfgit asfgit closed this in b32bd00 Jun 27, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants