Skip to content

Conversation

@windpiger
Copy link
Contributor

@windpiger windpiger commented Mar 3, 2017

What changes were proposed in this pull request?

Currently we treat the location of table/partition/database as URI string.

It will be safer if we can make the type of location as java.net.URI.

In this PR, there are following classes changes:
1. CatalogDatabase

case class CatalogDatabase(
    name: String,
    description: String,
    locationUri: String,
    properties: Map[String, String])
--->
case class CatalogDatabase(
    name: String,
    description: String,
    locationUri: URI,
    properties: Map[String, String])

2. CatalogStorageFormat

case class CatalogStorageFormat(
    locationUri: Option[String],
    inputFormat: Option[String],
    outputFormat: Option[String],
    serde: Option[String],
    compressed: Boolean,
    properties: Map[String, String])
---->
case class CatalogStorageFormat(
    locationUri: Option[URI],
    inputFormat: Option[String],
    outputFormat: Option[String],
    serde: Option[String],
    compressed: Boolean,
    properties: Map[String, String])

Before and After this PR, it is transparent for user, there is no change that the user should concern. The String to URI just happened in SparkSQL internally.

Here list some operation related location:
1. whitespace in the location
e.g. /a/b c/d
For both table location and partition location,
After CREATE TABLE t... (PARTITIONED BY ...) LOCATION '/a/b c/d' ,
then DESC EXTENDED t show the location is /a/b c/d,
and the real path in the FileSystem also show /a/b c/d

2. colon(:) in the location
e.g. /a/b:c/d
For both table location and partition location,
when CREATE TABLE t... (PARTITIONED BY ...) LOCATION '/a/b:c/d' ,

In linux file system
DESC EXTENDED t show the location is /a/b:c/d,
and the real path in the FileSystem also show /a/b:c/d

in HDFS throw exception:
java.lang.IllegalArgumentException: Pathname /a/b:c/d from hdfs://iZbp1151s8hbnnwriekxdeZ:9000/a/b:c/d is not a valid DFS filename.

while After INSERT INTO TABLE t PARTITION(a="a:b") SELECT 1
then DESC EXTENDED t show the location is /xxx/a=a%3Ab,
and the real path in the FileSystem also show /xxx/a=a%3Ab

3. percent sign(%) in the location
e.g. /a/b%c/d
For both table location and partition location,
After CREATE TABLE t... (PARTITIONED BY ...) LOCATION '/a/b%c/d' ,
then DESC EXTENDED t show the location is /a/b%c/d,
and the real path in the FileSystem also show /a/b%c/d

4. encoded(%25) in the location
e.g. /a/b%25c/d
For both table location and partition location,
After CREATE TABLE t... (PARTITIONED BY ...) LOCATION '/a/b%25c/d' ,
then DESC EXTENDED t show the location is /a/b%25c/d,
and the real path in the FileSystem also show /a/b%25c/d

while After INSERT INTO TABLE t PARTITION(a="%25") SELECT 1
then DESC EXTENDED t show the location is /xxx/a=%2525,
and the real path in the FileSystem also show /xxx/a=%2525

Additionally, except the location, there are two other factors will affect the location of the table/partition. one is the table name which does not allowed to have special characters, and the other is partition name which have the same actions with partition value, and partition name with special character situation has add some testcase and resolve a bug in PR

Summary:

After CREATE TABLE t... (PARTITIONED BY ...) LOCATION path,
the path which we get from DESC TABLE and real path in FileSystem are all the same with the CREATE TABLE command(different filesystem has different action that allow what kind of special character to create the path, e.g. HDFS does not allow colon, but linux filesystem allow it ).

DataBase also have the same logic with CREATE TABLE

while if the partition value has some special character like % : # etc, then we will get the path with encoded partition value like /xxx/a=A%25B from DESC TABLE and real path in FileSystem

In this PR, the core change code is using new Path(str).toUri and new Path(uri).toString
which transfrom str to uri or uri to str.
for example:

val str = '/a/b c/d'
val uri = new Path(str).toUri  --> '/a/b%20c/d'
val strFromUri = new Path(uri).toString -> '/a/b c/d'

when we restore table/partition from metastore, or get the location from CREATE TABLE command, we can use it as above to change string to uri new Path(str).toUri

How was this patch tested?

unit test added.
The current master branch also passed all the test cases added in this PR by a litter change.
https://github.com/apache/spark/pull/17149/files#diff-b7094baa12601424a5d19cb930e3402fR1764
here toURI -> toString when test in master branch.

This can show that this PR is transparent for user.

@SparkQA
Copy link

SparkQA commented Mar 3, 2017

Test build #73840 has finished for PR 17149 at commit 69a1646.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Mar 3, 2017

Test build #73841 has finished for PR 17149 at commit 890327a.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Mar 3, 2017

Test build #73842 has finished for PR 17149 at commit e792cb6.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@windpiger windpiger changed the title [SPARK-19257][SQL][WIP]location for table/partition/database should be java.net.URI [SPARK-19257][SQL]location for table/partition/database should be java.net.URI Mar 3, 2017
@SparkQA
Copy link

SparkQA commented Mar 3, 2017

Test build #73855 has finished for PR 17149 at commit f2b9bd8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@windpiger
Copy link
Contributor Author

cc @cloud-fan @gatorsmile

}
validateName(dbName)
val qualifiedPath = makeQualifiedPath(dbDefinition.locationUri).toString
val qualifiedPath = makeQualifiedPath(new Path(dbDefinition.locationUri).toString).toUri
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

makeQualifiedPath should accept URI now

Map("partCol1" -> "1", "partCol2" -> "2")).location
val tableLocation = catalog.getTable("db1", "tbl").location
val defaultPartitionLocation = new Path(new Path(tableLocation, "partCol1=1"), "partCol2=2")
val tableLocationPath = new Path(catalog.getTable("db1", "tbl").location)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tableLocationPath sound weird, let's keep the previous name

val partition1 =
CatalogTablePartition(Map("partCol1" -> "1", "partCol2" -> "2"),
storageFormat.copy(locationUri = Some(newLocationPart1)))
storageFormat.copy(locationUri = Some(new Path(newLocationPart1).toUri)))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

isn't newLocationPart1 already a URI?

catalog.createTable(table, ignoreIfExists = false)

val tableLocation = catalog.getTable("db1", "tbl").location
val tableLocationPath = new Path(catalog.getTable("db1", "tbl").location)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's keep the previous name

val catalog = newBasicCatalog()
try {
val newLocation = newUriForDatabase()
val newLocationUri = new Path(newUriForDatabase()).toUri
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

keep the previous name

def newFunc(): CatalogFunction = newFunc("funcName")

def newUriForDatabase(): String = Utils.createTempDir().toURI.toString.stripSuffix("/")
def newUriForDatabase(): URI = new URI(Utils.createTempDir().toURI.toString.stripSuffix("/"))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we can simply it and write Utils.createTempDir().toURI

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Utils.createTempDir().toURI has a suffix '/', here we have to strip it

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cloud-fan after I try to use Path to Compare , I think stripSuffix here is the simple way.

Path with a String type constructor will be equal when one has a /, and another does not.

scala> val x = new Path("/ab/c/")
x: org.apache.hadoop.fs.Path = /ab/c

scala> val y = new Path("/ab/c")
y: org.apache.hadoop.fs.Path = /ab/c

scala> x == y
res0: Boolean = true

Path with a URI type constructor will be not equal when one has a /, and another does not.

scala> val x =new URI("/a/b/c/")
x: java.net.URI = /a/b/c/

scala> val y =new URI("/a/b/c")
y: java.net.URI = /a/b/c

scala> x == y
res1: Boolean = false

scala> val x1 =new Path(x)
x1: org.apache.hadoop.fs.Path = /a/b/c/

scala> val y1 =new Path(y)
y1: org.apache.hadoop.fs.Path = /a/b/c

scala> x1 == y1
res2: Boolean = false

So just the Path with String type can ignore the suffix /, then if we have a URI in hand, and we want to compare with another URI, we should first transform them to String , and use this String to constructor a Path, after this two actions, we can compare them with ignore the suffix /.

But I think it is more complicate, here we normalize the URI with stripSuffix from the Orignal then compare two URI directly, it is more simple.

should we must to convert it to Path to compare?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's ok if we always compare URI with Path, instead of converting it to string.


package org.apache.spark.sql.catalog

import java.net.URI
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unnecessary import?

identifier = table,
tableType = tableType,
storage = storage.copy(locationUri = customLocation),
storage = storage.copy(locationUri = customLocation.map{ loc =>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit:

storage.copy(
  locationUri = xxx)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also add comments to explain why we don't create URI directly.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this pattern appears many times in this PR, how about we create a util method toURI in CatalogUtils and add comments there?

dir.delete
val tableLocFile = new File(table.location)
assert(!tableLocFile.exists)
// val tableLocFile = new File(table.location)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why comment it out?

val storageWithPathOption = tableDefinition.storage.copy(
properties = tableDefinition.storage.properties ++ newLocation.map("path" -> _))
properties =
tableDefinition.storage.properties ++newLocation.map("path" -> new Path(_).toString))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should also add a util method URIToString and add comments

@SparkQA
Copy link

SparkQA commented Mar 6, 2017

Test build #73976 has started for PR 17149 at commit 109e2b5.

@SparkQA
Copy link

SparkQA commented Mar 6, 2017

Test build #73977 has started for PR 17149 at commit dc0a37b.


/**
* Convert URI to String.
* Since URI.toString does not decode for the uri string, we need to use
Copy link
Contributor

@cloud-fan cloud-fan Mar 6, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since URI.toString does not decode the uri, e.g. you may see `%21` instead of `!`.
Here we create a hadoop Path with the given URI, ad rely on Path.toString to decode the uri

@SparkQA
Copy link

SparkQA commented Mar 6, 2017

Test build #73979 has started for PR 17149 at commit b6bc466.

* @return the String of the path
*/
def URIToString(uri: Option[URI]): Option[String] = {
uri.map(new Path(_).toString)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you try java.net.URLDecoder? I'd like to avoid hadoop dependency on this kind of basic functionality.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what about StringToURI? it still import Path

* FileSystem is changed.
*/
private def makeQualifiedPath(path: String): Path = {
private def makeQualifiedPath(path: URI): Path = {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this method can return URI too

val catalog = newBasicCatalog()
try {
val newLocation = newUriForDatabase()
val newLocation = new Path(newUriForDatabase()).toUri
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unnecessary change?


import org.antlr.v4.runtime.{ParserRuleContext, Token}
import org.antlr.v4.runtime.tree.TerminalNode
import org.apache.hadoop.fs.Path
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we can remove this


val tableLocation = if (table.tableType == CatalogTableType.MANAGED) {
Some(sessionState.catalog.defaultTablePath(table.identifier))
Some(new Path(sessionState.catalog.defaultTablePath(table.identifier)).toUri)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the defaultTablePath should also return URI now.

Row("Database Name", dbMetadata.name) ::
Row("Description", dbMetadata.description) ::
Row("Location", dbMetadata.locationUri) :: Nil
Row("Location", new Path(dbMetadata.locationUri).toString) :: Nil
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use CatalogUtil.URIToString?

databaseName,
comment.getOrElse(""),
path.getOrElse(catalog.getDefaultDBPath(databaseName)),
new Path(path.getOrElse(catalog.getDefaultDBPath(databaseName))).toUri,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

getDefaultDBPath should return URI now.

None
} else {
Some(s"path '${escapeSingleQuotedString(location)}'")
Some(s"path '${escapeSingleQuotedString(new Path(location).toString)}'")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use URIToString?

name = metadata.name,
description = metadata.description,
locationUri = metadata.locationUri)
locationUri = new Path(metadata.locationUri).toString)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

URIToString?

{
val defaultDbDefinition = CatalogDatabase(
SessionCatalog.DEFAULT_DATABASE, "default database", warehousePath, Map())
SessionCatalog.DEFAULT_DATABASE, "default database", new Path(warehousePath).toUri, Map())
Copy link
Contributor

@cloud-fan cloud-fan Mar 6, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

StringToURI?

private def createDatabase(catalog: SessionCatalog, name: String): Unit = {
catalog.createDatabase(
CatalogDatabase(name, "", spark.sessionState.conf.warehousePath, Map()),
CatalogDatabase(name, "", new Path(spark.sessionState.conf.warehousePath).toUri, Map()),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

StringToURI?

val storage =
CatalogStorageFormat(
locationUri = Some(catalog.defaultTablePath(name)),
locationUri = Some(new Path(catalog.defaultTablePath(name)).toUri),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

StringToURI?

sql(s"CREATE DATABASE $dbName")
val db1 = catalog.getDatabaseMetadata(dbName)
val expectedLocation = makeQualifiedPath(s"spark-warehouse/$dbName.db")
val expectedLocation = new Path(makeQualifiedPath(s"spark-warehouse/$dbName.db")).toUri
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

StringToURI?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just implement this function with a option[String], I think for a String without a option, it is already simple, we still to implement one?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think StringToURI should convert String to URI, not Option[String].

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, maybe we should implement them all? option[string] also exists in many places, and more complicate

@SparkQA
Copy link

SparkQA commented Mar 6, 2017

Test build #74000 has finished for PR 17149 at commit 681db88.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Mar 6, 2017

Test build #74004 has finished for PR 17149 at commit 5b92620.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Mar 6, 2017

Test build #73995 has finished for PR 17149 at commit abfb6f5.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Mar 6, 2017

Test build #73996 has finished for PR 17149 at commit 80f2c40.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Mar 6, 2017

Test build #74012 has finished for PR 17149 at commit 5b423f5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

"you can only specify one of them.", ctx)
}
val customLocation = storage.locationUri.orElse(location)
val customLocation = storage.locationUri.orElse(location.map(CatalogUtils.stringToURI(_)))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: xx.map(CatalogUtils.stringToURI)

}
}

Seq("a b", "a:b", "a%b").foreach { specialChars =>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please add a TODO to remove these duplicated tests when we merge DDLSuite and HiveDDLSuite

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This introduces the conflict in my PR #16592. Let me remove the duplicate there.

@cloud-fan
Copy link
Contributor

thanks, merging to master!

@asfgit asfgit closed this in 096df6d Mar 6, 2017
}
val customLocation = storage.locationUri.orElse(location)
val customLocation = storage.locationUri.orElse(location.map(CatalogUtils.stringToURI(_)))

Copy link
Member

@HyukjinKwon HyukjinKwon Apr 5, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems this has a double-de/encoding problem when the input path is an URI.

scala> new org.apache.hadoop.fs.Path(new java.io.File("a b").toURI.toString).toUri.toString
res1: String = file:/.../a%2520b

A space character in URI is encoded as %20 and % character is encoded as %25. It seems a URI has a %20 and then is url-encoded from %20 to %2520.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, it seems we have a similar util with CatalogUtils.stringToURI in org.apache.spark.util.Utils.resolveURI. Could we consolidate them? I did not replace it when I identified this because some existing tests in org.apache.spark.util.UtilsSuite were broken but I guess it would be fine if there is a coherent reason. These broken cases might be bugs.

@windpiger could you double check my comments and open a followup if I was not wrong?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did not replace it when I identified this because some existing tests in org.apache.spark.util.UtilsSuite were broken but I guess it would be fine if there is a coherent reason. These broken cases might be bugs.

FYI.. cc @sarutak

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you reproduce it as a bug, please submit a PR to fix it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how could the input path be a URI string?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm.. is the input path always expected to be a path? I thought we support both URI and path forms.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's ambiguous to support both, what if users do wanna create a path /tmp/%25?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Up to my knowledge, HDFS's fully qualified path is a URI form. If we do not support this, that virtually means we are going to disallow the fully qualified path. I understand it sounds ambiguous but disallowing does not look a good solution.

Also, if users might want to access to local files or S3 when default scheme is hdfs, I guess it requires changing the default scheme every time.

Copy link
Member

@HyukjinKwon HyukjinKwon Apr 5, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In addition, I guess we already have a lot of tests with URI input paths and I did many of them to pass the tests on Windows, which I guess implicitly some committers do not disagree with this.

IMHO, I guess URIs should be supported first correctly because those local path form is abbreviation of the fully qualified path.

Copy link
Member

@sarutak sarutak Apr 5, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how could the input path be a URI string?

How about external tables on S3?

// compatible format, which means the data source is file-based and must have a `path`.
require(table.storage.locationUri.isDefined,
"External file-based data source table must have a `path` entry in storage properties.")
Some(new Path(table.location).toUri.toString)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@HyukjinKwon previously we also encode the table location string, I think we never supported uri path string before?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, it seems the problem itself seems existing before.

I found this while running related tests on Windows which it looks related with this PR (special character stuff) so I think this is a proper JIRA to make a followup with and PR to note this.

It seems we do have tests that use URIs already whether it was mistakenly supported or not. Should we ask it to dev-mailing list? I think this is an important decision.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yea, please go ahead

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for confirming. Let me send a mail soon.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, I did a bit of search more about what org.apache.hadoop.fs.Path expects. It seems the documentation says:

Construct a path from a String. Path strings are URIs, but with unescaped elements and some additional normalization.

It seems those strings are expected to be unescaped. So, it seems we support URI with unescaped characters, which is inherited from Path.

I want to be sure on this because I have fixed many tests to use URIs to pass on Windows and I am about to fix them further in this way. @steveloughran, do you mind if I cc you and ask to take a look and help to confirm this please? I know no one who knows better about Hadoop.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

aah, paths, especially on windows make me weep. Note also lots of fun related to encoded secrets in s3n/s3a/s3 urls (HADOOP-3733 and the associated regression HADOOP-14114)

Because of those patches, and the fact that if you want to embed a / in an S3A URI you must escape it, I don't know what to say here.

If you can come up with a simple set of questions to ask someone who doesn't know this code but could understand Path's logic, I could pass them on to some people I think may know the answer. Maybe @cnauroth might have some insight on what windows gets up to

BTW, Path in Hadoop 2.9 is Serializable. you can pass them down to closures working on RDDs, etc.

Copy link
Member

@HyukjinKwon HyukjinKwon Apr 9, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you so much @steveloughran. Let me try to produce a simple set of questions purely about org.apache.hadoop.fs.Path soon.

Copy link
Member

@HyukjinKwon HyukjinKwon Apr 10, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I have two questions as below:

  • Is it okay to use both URIs and local file paths for the input string for org.apache.hadoop.fs.Path in general (when they are expected to be unescaped)?

  • What org.apache.hadoop.fs.Path expects for input string in definition? It seems URI strings but with unescaped characters per https://hadoop.apache.org/docs/stable/api/org/apache/hadoop/fs/Path.html and there are special cases that requires escaped characters such as s3n/s3a/s3 urls. Is my understanding correct?

Copy link
Contributor

@steveloughran steveloughran Apr 12, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. there's actually some discussion about defining this properly, related to the problem of "colons in object store paths". That's not going to help directly, but a sign of something which has historically been underspecified
  2. regarding URI strings + escaping, that S3 encoded password is a bit of an outlier. I plan to cut it ASAP —we just need a release of hadoop with it printing out warnings first & supporting the per-bucket configuration people need for the use case "auth as somebody else on a specific bucket". Given Hadoop 2.8 meets these criteria, Hadoop 3.x is the obvious target

Copy link
Member

@HyukjinKwon HyukjinKwon Apr 12, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you. Your answer and the pointer helped me a lot to understand.

So, up to my understanding...

There are few exceptional cases, for example, special characters or encoded characters in those URIs. Also, there are some corner cases that are not clearly defined about what Path expects.

Nevertheless, the baseline is fully qualified URI with unescaped characters or the unescaped path part in URI as the documentation says but there may be some variants for now or in the future.

@cnauroth
Copy link
Contributor

@HyukjinKwon , nice to meet you! I see I got notified here for a bit of Hadoop Path knowledge, and particularly on Windows.

Is it okay to use both URIs and local file paths for the input string for org.apache.hadoop.fs.Path in general (when they are expected to be unescaped)?

Yes, this is correct.

Specifically on the topic of Windows, Path has special case logic for handling a Windows-specific local file path. (This logic is only triggered if it detects the runtime OS is Windows.) On Windows, I expect a call like new Path("C:\\foo\\bar").toUri to yield a correct URI pointing at that local file path, and further calling toString yields a correct String representation of the path. Hadoop code often needs to take a path string that is possibly a relative path and pass it through Path to make it absolute and escape it according to Hadoop code expectations.

The standard invocation for doing this in the Hadoop code is new Path(...).toUri(); or new Path(...).toUri().toString();. This works across all platforms. I don't have any knowledge of the Spark codebase, but I see this patch uses similar invocations, so I expect it's good.

@HyukjinKwon
Copy link
Member

HyukjinKwon commented Apr 15, 2017

@cnauroth Thank you so much for your help.

FWIW, actually, in Spark's case, it seems we can't use directly use local paths on Windows via the parser because it seems Spark's parser tries to escape \ characters. So, the path should be manually replaced (to something like \\). This is the reason why I had to replace the tests about this to URIs rather than manually replacing those characters.

@HyukjinKwon
Copy link
Member

(I think I should cc @srowen too FYI because he reviewed all my PRs fixing the tests on Windows. The point here seems to me okay to use URIs in general.)

@gatorsmile
Copy link
Member

Our parser might need a change regarding escape handling. We are having a related discussion in another PR: #15398

@HyukjinKwon
Copy link
Member

@gatorsmile, Thanks for your pointer. There is a good discussion there.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants