Original file line number Diff line number Diff line change
@@ -146,4 +146,22 @@ private CommonPathCapabilities() {
*/
public static final String ABORTABLE_STREAM =
"fs.capability.outputstream.abortable";

/**
* Does this FS support etags?
* That is: will FileStatus entries from listing/getFileStatus
* probes support EtagSource and return real values.
*/
public static final String ETAGS_AVAILABLE =
"fs.capability.etags.available";

/**
* Are etags guaranteed to be preserved across rename() operations?
* FileSystems MUST NOT declare support for this feature
* unless this holds.
*/
public static final String ETAGS_PRESERVED_IN_RENAME =
"fs.capability.etags.preserved.in.rename";


}
@@ -0,0 +1,38 @@
/*
* Licensed to the Apache Software Foundation (ASF) under one
* or more contributor license agreements. See the NOTICE file
* distributed with this work for additional information
* regarding copyright ownership. The ASF licenses this file
* to you under the Apache License, Version 2.0 (the
* "License"); you may not use this file except in compliance
* with the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/

package org.apache.hadoop.fs;

/**
* An optional interface for {@link FileStatus} subclasses to implement
* to provide access to etags.
* If etags are available, the FS SHOULD also declare the matching PathCapabilities
* -- etag supported: {@link CommonPathCapabilities#ETAGS_AVAILABLE}.
* -- etag consistent over rename:
* {@link CommonPathCapabilities#ETAGS_PRESERVED_IN_RENAME}.
*/
public interface EtagSource {

/**
* Return an etag of this file status.
* A return value of null or "" means "no etag".
* @return a possibly null or empty etag.
*/
String getEtag();

}
@@ -1240,7 +1240,7 @@ Renaming a file where the destination is a directory moves the file as a child
FS' where:
not exists(FS', src)
and exists(FS', dest)
and data(FS', dest) == data (FS, dest)
and data(FS', dest) == data (FS, source)
result = True


@@ -1698,3 +1698,92 @@ in:readahead | READAHEAD | CanSetReadahead | Set the readahead on the input st
dropbehind | DROPBEHIND | CanSetDropBehind | Drop the cache.
in:unbuffer | UNBUFFER | CanUnbuffer | Reduce the buffering on the input stream.

## <a name="etagsource"></a> Etag probes through the interface `EtagSource`

FileSystem implementations MAY support querying HTTP etags from `FileStatus`
entries. If so, the requirements are as follows:

### Etag support MUST BE provided across all list/`getFileStatus()` calls.

That is: when adding etag support, all operations which return `FileStatus` or `LocatedFileStatus`
entries MUST return subclasses which are instances of `EtagSource`.

### FileStatus instances MUST have etags whenever the remote store provides them.

To support etags, they MUST BE provided in both `getFileStatus()`
and list calls.

Implementors' note: the core APIs which MUST BE overridden to achieve this are as follows:

```java
FileStatus getFileStatus(Path)
FileStatus[] listStatus(Path)
RemoteIterator<FileStatus> listStatusIterator(Path)
RemoteIterator<LocatedFileStatus> listFiles(Path, boolean)
```


### Etags of files MUST BE consistent across all list/`getFileStatus()` operations.

The value of `EtagSource.getEtag()` MUST be the same in the results of list* queries as in the result of `getFileStatus()` for the same object.

```java
((EtagSource) getFileStatus(path)).getEtag()
    .equals(((EtagSource) listStatus(path)[0]).getEtag())
```

Similarly, the same value MUST BE returned by `listFiles()` and `listStatusIterator()` of the path,
and for every file in a listing of the parent path.
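
A caller that compares the etags of a probe and a listing has to cope with status entries which are not etag sources, and with null or empty etags. The following is a minimal sketch of such a comparison, not part of the Hadoop API: the class `EtagProbe` and its nested `Status` and `TaggedStatus` types are hypothetical stand-ins for `FileStatus` and a store-specific subclass, and the nested `EtagSource` only mirrors the interface of the same name so that the example compiles on its own.

```java
import java.util.Optional;

/** Sketch only: the nested types are stand-ins for the Hadoop classes. */
final class EtagProbe {

  /** Stand-in mirroring org.apache.hadoop.fs.EtagSource. */
  interface EtagSource {
    String getEtag();
  }

  /** Hypothetical stand-in for org.apache.hadoop.fs.FileStatus. */
  static class Status {
    private final String path;

    Status(String path) {
      this.path = path;
    }

    String getPath() {
      return path;
    }
  }

  /** Hypothetical store-specific status subclass which carries an etag. */
  static final class TaggedStatus extends Status implements EtagSource {
    private final String etag;

    TaggedStatus(String path, String etag) {
      super(path);
      this.etag = etag;
    }

    @Override
    public String getEtag() {
      return etag;
    }
  }

  /** Return the etag of a status entry; empty if it has none. */
  static Optional<String> etagOf(Object status) {
    if (status instanceof EtagSource) {
      String etag = ((EtagSource) status).getEtag();
      // null and "" both mean "no etag"
      if (etag != null && !etag.isEmpty()) {
        return Optional.of(etag);
      }
    }
    return Optional.empty();
  }

  /** True iff both entries carry an etag and the two etags are equal. */
  static boolean sameEtag(Object probed, Object listed) {
    Optional<String> expected = etagOf(probed);
    return expected.isPresent() && expected.equals(etagOf(listed));
  }
}
```

With real Hadoop classes, the two arguments of `sameEtag()` would be the result of `getFileStatus(path)` and the matching entry of a listing; the comparison uses `equals()` because etags are strings.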

### Etags MUST BE different for different file contents.

Two different arrays of data written to the same path MUST have different etag values when probed.
This is a requirement of the HTTP specification.

### Etags of files SHOULD BE preserved across rename operations

After a file is renamed, the value of `((EtagSource)getFileStatus(dest)).getEtag()`
SHOULD be the same as the value of `((EtagSource)getFileStatus(source)).getEtag()`
before the rename took place.

This is an implementation detail of the store; it does not hold for AWS S3.

If and only if the store consistently meets this requirement, the filesystem SHOULD
declare in `hasPathCapability()` that it supports
`fs.capability.etags.preserved.in.rename`.
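
On the client side, the capability tells a caller whether an etag recorded before a rename can be used to verify the destination afterwards. The following is a minimal sketch of that decision, under stated assumptions: the `Store` interface, the `Outcome` enum and the `verifyRenamed()` method are hypothetical names invented for this example, not Hadoop APIs. Only the capability string is taken from this specification.

```java
/** Sketch only: Store is a hypothetical, minimal view of a filesystem. */
final class RenameEtagCheck {

  /** Capability name as declared in CommonPathCapabilities. */
  static final String ETAGS_PRESERVED_IN_RENAME =
      "fs.capability.etags.preserved.in.rename";

  /** Hypothetical stand-in for the filesystem calls used here. */
  interface Store {
    boolean hasPathCapability(String path, String capability);

    /** Etag of the file at the path; may be null or empty. */
    String etagOf(String path);
  }

  /** Possible results of the check. */
  enum Outcome { VERIFIED, MISMATCH, UNVERIFIABLE }

  /**
   * Compare the etag recorded before a rename with that of the destination.
   * The comparison is only meaningful if the store declares that
   * etags are preserved in rename; otherwise nothing can be concluded.
   */
  static Outcome verifyRenamed(Store store, String dest, String sourceEtag) {
    if (sourceEtag == null || sourceEtag.isEmpty()
        || !store.hasPathCapability(dest, ETAGS_PRESERVED_IN_RENAME)) {
      return Outcome.UNVERIFIABLE;
    }
    return sourceEtag.equals(store.etagOf(dest))
        ? Outcome.VERIFIED
        : Outcome.MISMATCH;
  }
}
```

A store such as AWS S3, which does not declare the capability, always yields `UNVERIFIABLE` here, so a differing etag after rename is not treated as an error.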

### Directories MAY have etags

Directory entries MAY return etags in listing/probe operations; these etags MAY be preserved across renames.

Equally, directory entries MAY NOT provide such etags, MAY NOT preserve them across renames,
and MAY NOT guarantee consistency over time.

Note: the root path "/" deserves special mention.
As it is not a real "directory", nobody should expect it to have an etag.

### All etag-aware `FileStatus` subclasses MUST BE `Serializable`; MAY BE `Writable`

The base `FileStatus` class implements `Serializable` and `Writable` and marshalls its fields appropriately.

Subclasses MUST support Java serialization (some Apache Spark applications use it), preserving the etag.
This is a matter of making the etag field non-static and non-transient, and adding a `serialVersionUID`.
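
The effect can be checked with a serialization round trip. The sketch below uses only the standard `java.io` serialization classes; `BaseStatus` and `EtagStatus` are hypothetical stand-ins for `FileStatus` and an etag-aware subclass, and the `serialVersionUID` values are arbitrary.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

/** Sketch only: BaseStatus and EtagStatus are stand-ins for Hadoop classes. */
final class SerializableEtagDemo {

  /** Hypothetical stand-in for the Serializable base class FileStatus. */
  static class BaseStatus implements Serializable {
    private static final long serialVersionUID = 1L;
    private final String path;

    BaseStatus(String path) {
      this.path = path;
    }

    String getPath() {
      return path;
    }
  }

  /** Hypothetical etag-aware subclass. */
  static final class EtagStatus extends BaseStatus {
    private static final long serialVersionUID = 2L;

    /** Neither static nor transient, so default serialization writes it. */
    private final String etag;

    EtagStatus(String path, String etag) {
      super(path);
      this.etag = etag;
    }

    String getEtag() {
      return etag;
    }
  }

  /** Serialize a status to a byte array and read it back. */
  static EtagStatus roundTrip(EtagStatus status) {
    try {
      ByteArrayOutputStream bytes = new ByteArrayOutputStream();
      try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
        out.writeObject(status);
      }
      try (ObjectInputStream in = new ObjectInputStream(
          new ByteArrayInputStream(bytes.toByteArray()))) {
        return (EtagStatus) in.readObject();
      }
    } catch (IOException | ClassNotFoundException e) {
      throw new IllegalStateException("serialization round trip failed", e);
    }
  }
}
```

If the `etag` field were declared `transient`, the copy returned by `roundTrip()` would have a null etag, which is the failure this requirement guards against.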

The `Writable` support was used for marshalling status data over Hadoop IPC calls;
in Hadoop 3 that is implemented through `org/apache/hadoop/fs/protocolPB/PBHelper.java` and the `Writable` methods are deprecated.
Subclasses MAY override the deprecated methods to add etag marshalling.
However, there is no expectation of this, and such marshalling is unlikely to ever take place.

### Appropriate etag Path Capabilities SHOULD BE declared

1. `hasPathCapability(path, "fs.capability.etags.available")` MUST return true iff
the filesystem returns valid (non-empty) etags on file status/listing operations.
2. `hasPathCapability(path, "fs.capability.etags.preserved.in.rename")` MUST return
true if and only if etags are preserved across renames.
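
On the store side, the two probes reduce to a lookup on the capability name. The sketch below is one possible shape for that lookup, not the implementation of any Hadoop connector: the class `EtagCapabilities` and its two constructor flags are hypothetical, and the capability strings are the values of the `CommonPathCapabilities` constants `ETAGS_AVAILABLE` and `ETAGS_PRESERVED_IN_RENAME`.

```java
import java.util.Locale;

/** Sketch only: a hypothetical holder for a store's etag capabilities. */
final class EtagCapabilities {

  static final String ETAGS_AVAILABLE =
      "fs.capability.etags.available";
  static final String ETAGS_PRESERVED_IN_RENAME =
      "fs.capability.etags.preserved.in.rename";

  /** Does the store return non-empty etags in status/listing calls? */
  private final boolean etagsAvailable;

  /** Does the store keep the etag of a file when it is renamed? */
  private final boolean renamePreservesEtags;

  EtagCapabilities(boolean etagsAvailable, boolean renamePreservesEtags) {
    this.etagsAvailable = etagsAvailable;
    this.renamePreservesEtags = renamePreservesEtags;
  }

  /** Answer an etag capability probe; unknown capabilities are false. */
  boolean hasPathCapability(String path, String capability) {
    if (capability == null || capability.isEmpty()) {
      throw new IllegalArgumentException("null/empty capability");
    }
    // assumption of this sketch: capability names are matched case-insensitively
    switch (capability.toLowerCase(Locale.ENGLISH)) {
    case ETAGS_AVAILABLE:
      return etagsAvailable;
    case ETAGS_PRESERVED_IN_RENAME:
      // preservation can only be declared if etags exist at all
      return etagsAvailable && renamePreservesEtags;
    default:
      return false;
    }
  }
}
```

Under these assumptions, a store whose rename is a copy that yields a new etag would be built as `new EtagCapabilities(true, false)`, declaring the first capability but not the second.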

### Non-requirements of etag support

* There is no requirement/expectation that `FileSystem.getFileChecksum(Path)` returns
a checksum value related to the etag of an object, if any value is returned.
* If the same data is uploaded twice to the same or a different path,
the etag of the second upload MAY NOT match that of the first upload.

@@ -0,0 +1,194 @@
/*
* Licensed to the Apache Software Foundation (ASF) under one
* or more contributor license agreements. See the NOTICE file
* distributed with this work for additional information
* regarding copyright ownership. The ASF licenses this file
* to you under the Apache License, Version 2.0 (the
* "License"); you may not use this file except in compliance
* with the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/

package org.apache.hadoop.fs.contract;

import java.nio.charset.StandardCharsets;

import org.assertj.core.api.Assertions;
import org.junit.Assume;
import org.junit.Test;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import org.apache.hadoop.fs.EtagSource;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocatedFileStatus;
import org.apache.hadoop.fs.Path;

import static org.apache.hadoop.fs.CommonPathCapabilities.ETAGS_AVAILABLE;
import static org.apache.hadoop.fs.CommonPathCapabilities.ETAGS_PRESERVED_IN_RENAME;

/**
* For filesystems which support etags, validate correctness
* of their implementation.
*/
public abstract class AbstractContractEtagTest extends
AbstractFSContractTestBase {

private static final Logger LOG =
LoggerFactory.getLogger(AbstractContractEtagTest.class);

/**
* Etags must be non-empty and consistent across operations.
*/
@Test
public void testEtagConsistencyAcrossListAndHead() throws Throwable {
describe("Etag values must be non-empty and consistent across LIST and HEAD Calls.");
final Path path = methodPath();
final FileSystem fs = getFileSystem();

Assertions.assertThat(fs.hasPathCapability(path, ETAGS_AVAILABLE))
.describedAs("path capability %s of %s",
ETAGS_AVAILABLE, path)
.isTrue();

ContractTestUtils.touch(fs, path);

final FileStatus st = fs.getFileStatus(path);
final String etag = etagFromStatus(st);
LOG.info("etag of empty file is \"{}\"", etag);

final FileStatus[] statuses = fs.listStatus(path);
Assertions.assertThat(statuses)
.describedAs("List(%s)", path)
.hasSize(1);
final FileStatus lsStatus = statuses[0];
Assertions.assertThat(etagFromStatus(lsStatus))
.describedAs("etag of list status (%s) compared to HEAD value of %s", lsStatus, st)
.isEqualTo(etag);
}

/**
* Get an etag from a FileStatus which MUST BE
* an implementation of EtagSource and
* whose etag MUST NOT BE null/empty.
* @param st the status
* @return the etag
*/
String etagFromStatus(FileStatus st) {
Assertions.assertThat(st)
.describedAs("FileStatus %s", st)
.isInstanceOf(EtagSource.class);
final String etag = ((EtagSource) st).getEtag();
Assertions.assertThat(etag)
.describedAs("Etag of %s", st)
.isNotBlank();
return etag;
}

/**
* Overwritten data has different etags.
*/
@Test
public void testEtagsOfDifferentDataDifferent() throws Throwable {
describe("Verify that two different blocks of data written have different etags");

final Path path = methodPath();
final FileSystem fs = getFileSystem();
Path src = new Path(path, "src");

ContractTestUtils.createFile(fs, src, true,
"data1234".getBytes(StandardCharsets.UTF_8));
final FileStatus srcStatus = fs.getFileStatus(src);
final String srcTag = etagFromStatus(srcStatus);
LOG.info("etag of file 1 is \"{}\"", srcTag);

// now overwrite with data of same length
// (ensure that path or length aren't used exclusively as tag)
ContractTestUtils.createFile(fs, src, true,
"1234data".getBytes(StandardCharsets.UTF_8));

// validate
final String tag2 = etagFromStatus(fs.getFileStatus(src));
LOG.info("etag of file 2 is \"{}\"", tag2);

Assertions.assertThat(tag2)
.describedAs("etag of updated file")
.isNotEqualTo(srcTag);
}

/**
* If supported, rename preserves etags.
*/
@Test
public void testEtagConsistencyAcrossRename() throws Throwable {
describe("Verify that when a file is renamed, the etag remains unchanged");
final Path path = methodPath();
final FileSystem fs = getFileSystem();
Assume.assumeTrue(
"Filesystem does not declare that etags are preserved across renames",
fs.hasPathCapability(path, ETAGS_PRESERVED_IN_RENAME));
Path src = new Path(path, "src");
Path dest = new Path(path, "dest");

ContractTestUtils.createFile(fs, src, true,
"sample data".getBytes(StandardCharsets.UTF_8));
final FileStatus srcStatus = fs.getFileStatus(src);
LOG.info("source file status is {}", srcStatus);

final String srcTag = etagFromStatus(srcStatus);
LOG.info("etag of short file is \"{}\"", srcTag);

Assertions.assertThat(srcTag)
.describedAs("Etag of %s", srcStatus)
.isNotBlank();

// rename
fs.rename(src, dest);

// validate
FileStatus destStatus = fs.getFileStatus(dest);
final String destTag = etagFromStatus(destStatus);
Assertions.assertThat(destTag)
.describedAs("etag of renamed file status (%s) compared to source value of %s",
destStatus, srcStatus)
.isEqualTo(srcTag);
}

/**
* For effective use of etags, listLocatedStatus SHOULD return status entries
* with consistent values.
* This ensures that listing during query planning can collect and use the etags.
*/
@Test
public void testLocatedStatusAlsoHasEtag() throws Throwable {
describe("verify that listLocatedStatus() and listFiles() are etag sources");
final Path path = methodPath();
final FileSystem fs = getFileSystem();
Path src = new Path(path, "src");
ContractTestUtils.createFile(fs, src, true,
"sample data".getBytes(StandardCharsets.UTF_8));
final FileStatus srcStatus = fs.getFileStatus(src);
final String srcTag = etagFromStatus(srcStatus);
final LocatedFileStatus entry = fs.listLocatedStatus(path).next();
LOG.info("located file status string value {}", entry);
final String listTag = etagFromStatus(entry);
Assertions.assertThat(listTag)
.describedAs("etag of listLocatedStatus (%s) compared to HEAD value of %s",
entry, srcStatus)
.isEqualTo(srcTag);

final LocatedFileStatus entry2 = fs.listFiles(path, false).next();
Assertions.assertThat(etagFromStatus(entry2))
.describedAs("etag of listFiles (%s) compared to HEAD value of %s",
entry2, srcStatus)
.isEqualTo(srcTag);
}
}
@@ -19,6 +19,7 @@

import org.apache.hadoop.classification.InterfaceAudience;
import org.apache.hadoop.classification.InterfaceStability;
import org.apache.hadoop.fs.EtagSource;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.Path;

@@ -30,7 +31,7 @@
*/
@InterfaceAudience.Private
@InterfaceStability.Evolving
public class S3AFileStatus extends FileStatus {
public class S3AFileStatus extends FileStatus implements EtagSource {

private static final long serialVersionUID = -5955674081978903922L;

@@ -166,8 +167,16 @@ public void setIsEmptyDirectory(Tristate isEmptyDirectory) {

/**
* @return the S3 object eTag when available, else null.
* @deprecated use {@link EtagSource#getEtag()} for
* public access.
*/
@Deprecated
public String getETag() {
return getEtag();
}

@Override
public String getEtag() {
return eTag;
}
