[SPARK-35011][CORE] Avoid Block Manager registrations when StopExecutor msg is in-flight #32114
Conversation
It turns out "on-disk storage" ( However, the same encryption test passes on Github. JVM exits while dynamically loading
Lesson learnt: always perform a final sanity test run using
I had
You could check if the
I just realized I can now re-run the checks in my personal fork, instead of pushing empty commits.
cc @Ngone51 FYI
I fixed a typo in the title and description: registerations => registrations.
Any particular reason for such a high cache size?
Also, expire after some time?
The high cache size is to ensure the fix works for a large enough job with 30000 executors.
Sure, does an expiry of 10 min (or larger) sound good?
This should give the executor long enough to process the in-flight StopExecutor message and complete the shutdown.
The timeout should be modeled on the maximum expected delay for a heartbeat to come in from an executor.
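If the cache approach is kept, the size and expiry knobs could look roughly like the sketch below, using a Guava cache. The name `recentlyRemovedExecutors`, the 10-minute expiry, and the 30000 bound are illustrative assumptions taken from this discussion, not the PR's final values:

```scala
import java.util.concurrent.TimeUnit

import com.google.common.cache.CacheBuilder

// Hypothetical shape of the cache discussed above: keyed by executor id,
// bounded for very large jobs, and expiring entries some time after the
// maximum expected heartbeat delay.
val recentlyRemovedExecutors = CacheBuilder.newBuilder()
  .maximumSize(30000)                      // large enough for a 30000-executor job
  .expireAfterWrite(10, TimeUnit.MINUTES)  // assumed expiry; should track the heartbeat timeout
  .build[String, java.lang.Long]()

// On StopExecutor:
//   recentlyRemovedExecutors.put(executorId, System.currentTimeMillis())
// On a late heartbeat:
//   if (recentlyRemovedExecutors.getIfPresent(executorId) != null) skip the re-registration
```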
I think another possible solution is to extend the BlockManagerInfo with the timestamp of the removal. By modelling the removal as a new state we could avoid using this separate cache completely, and all the BM-related data would be in the same place.
Of course, in this case you would have to implement the cleanup.
For example, it could be just a simple Long var which is 0 by default, meaning the BlockManager is alive/active (this special value can be hidden behind a method of BlockManagerInfo like isAlive(currentTs)). The cleanup would be triggered after the delay plus some extra time, to avoid iterating over the blockManagerInfo collection too frequently.
WDYT?
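A minimal sketch of the extra state this would add, assuming an `executorRemovalTs`-style field. The standalone class and method names are illustrative; the real change would live inside `BlockManagerInfo`:

```scala
// Illustrative only: models "alive" vs "removal in progress" with a single Long.
class RemovableBlockManagerInfo {
  // 0L is the special value meaning the BlockManager is alive/active.
  private var executorRemovalTs: Long = 0L

  def markRemoved(now: Long): Unit = { executorRemovalTs = now }

  def isAlive: Boolean = executorRemovalTs == 0L

  // A periodic cleanup pass (running no more often than the delay itself)
  // can drop entries whose removal timestamp is older than `delayMs`.
  def canBeCleaned(now: Long, delayMs: Long): Boolean =
    !isAlive && (now - executorRemovalTs) > delayMs
}
```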
I thought about this a bit more but haven't checked the code yet: is it possible to separate driver-commanded, intentionally removed executors from unintentional executor loss?
Intentionally removed executors shouldn't be re-registered.
cc @Ngone51
When CoarseGrainedSchedulerBackend receives RemoveExecutor, it has the ExecutorLossReason. However, after processing the message, when it publishes SparkListenerExecutorRemoved, the reason is passed in the form of a string.
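For reference, the listener event carries the reason only as plain text; its shape is roughly as follows (reproduced from memory, so treat the exact signature as approximate):

```scala
// The structured ExecutorLossReason is flattened to a String before
// listeners ever see it, so the "intentional vs. unintentional" distinction
// is lost at this point.
case class SparkListenerExecutorRemoved(time: Long, executorId: String, reason: String)
  extends SparkListenerEvent
```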
Currently, on RemoveExecutor, we remove the corresponding BlockManagerInfo from the blockManagerInfo map...
As per #32114 (comment), I don't think there would be a BlockManagerMessages.RemoveExecutor raised in this PR's case.
Could you point out on which code path the BlockManagerMessages.RemoveExecutor is raised?
If no code path raises BlockManagerMessages.RemoveExecutor in this PR's case, then @attilapiros' suggestion definitely works. But I'd also suggest another idea in #32114 (comment).
Is it possible to separate driver-commanded, intentionally removed executors from unintentional executor loss?
@attilapiros It's possible to know whether an executor was removed intentionally by the driver or not. The problem is that, currently, the executor info is stored in many different places. So you would have to update many methods or messages to add the isIntentional field (for example) to let all the components know and make decisions on it, which could get messy.
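As a rough illustration of the distinction itself (not of the plumbing, which is the hard part): the decision could in principle be made from the loss reason. The reason types below are Spark-internal (`private[spark]`) and vary across versions, so this is only a hedged sketch:

```scala
import org.apache.spark.scheduler.{ExecutorKilled, ExecutorLossReason}

// Hypothetical helper: treat a driver-requested kill as "intentional".
// Other driver-initiated reasons (e.g. decommissioning) would need to be
// added; anything else (heartbeat loss, process exit) counts as unintentional.
def isIntentionalRemoval(reason: ExecutorLossReason): Boolean = reason match {
  case ExecutorKilled => true
  case _              => false
}
```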
I found a code path which will break @attilapiros's proposed solution, since the following holds true.
However, we will have to abstract `blockManagerInfo: mutable.Map[BlockManagerId, BlockManagerInfo]`.
Currently, ...
Let's consider the following set of events:
- A `CoarseGrainedClusterMessage.RemoveExecutor` is issued. `CoarseGrainedSchedulerBackend` issues an async `StopExecutor` on `executorEndpoint` and then invokes `executorLost` on `TaskSchedulerImpl` (spark/core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala, line 430 in e609395: `scheduler.executorLost(executorId, lossReason)`).
- `TaskSchedulerImpl`, in its `executorLost`, invokes `dagScheduler.executorLost` (spark/core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala, lines 998 to 1001 in e609395: `if (failedExecutor.isDefined) { dagScheduler.executorLost(failedExecutor.get, reason); backend.reviveOffers() }`). `DAGScheduler`, while handling executorLost, invokes `removeExecutorAndUnregisterOutputs`, which internally invokes `blockManagerMaster.removeExecutor(execId)` (as you pointed out in your comment below), which in turn clears the `blockManagerId` from `blockManagerInfo` in `BlockManagerMasterEndpoint` (`removeExecutorAndUnregisterOutputs(`).
- The Executor has not yet processed `StopExecutor`.
- The Executor reports its Heartbeat.
- `HeartbeatReceiver` invokes `scheduler.executorHeartbeatReceived` to check whether the BlockManager on the executor requires re-registration (`val unknownExecutor = !scheduler.executorHeartbeatReceived(`). `TaskSchedulerImpl` delegates this to `DAGScheduler` (`dagScheduler.executorHeartbeatReceived(execId, accumUpdatesWithTaskIds, blockManagerId,`). `DAGScheduler` asks `BlockManagerMasterHeartbeatEndpoint` whether it knows the BlockManager (`blockManagerMaster.driverHeartbeatEndPoint.askSync[Boolean](`). `BlockManagerMasterHeartbeatEndpoint` returns false since it cannot find the `blockManagerId` in `blockManagerInfo`, indicating the BlockManager should re-register (spark/core/src/main/scala/org/apache/spark/storage/BlockManagerMasterHeartbeatEndpoint.scala, line 51 in e609395: `if (!blockManagerInfo.contains(blockManagerId)) {`); a condensed sketch of this check follows after this list.
- The BlockManager re-registers, which publishes `SparkListenerBlockManagerAdded`, causing the inconsistent book-keeping in `AppStatusStore`.
- The Executor processes `StopExecutor` and exits.
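Condensed into a single predicate, the heartbeat-side decision above boils down to a membership test on `blockManagerInfo`. This is a simplified sketch with the map type reduced so it stands alone; the real map is keyed by `BlockManagerId` and holds `BlockManagerInfo`:

```scala
// Simplified view of the check quoted from BlockManagerMasterHeartbeatEndpoint:
// an unknown block manager is told to re-register, even if it is only unknown
// because StopExecutor is still in flight.
def needsReregistration(
    blockManagerInfo: collection.Map[String, AnyRef],
    blockManagerId: String): Boolean = {
  !blockManagerInfo.contains(blockManagerId)
}
```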
Move the cache into BlockManagerMasterEndpoint and hide the impl detail?
We can have a cleaner interface here... BlockManagerMasterHeartbeatEndpoint simply needs a way to validate whether an executor was recently removed - it does not need to know whether that is backed by a Cache/Set/etc.
Sure.
The same comment applies to the already present blockManagerInfo map as well.
I could refactor that too. I was thinking of creating a BlockManagerEndpointSharedState class which contains blockManagerInfo and recentlyRemovedExecutors. The class would expose corresponding methods for lookups and updates.
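A hedged sketch of what such a shared-state holder could look like. The class and method names are the working names from this thread plus my own assumptions, not an existing Spark API:

```scala
import scala.collection.mutable

// Illustrative only: one place that owns both the live block-manager map and
// the "recently removed" bookkeeping, exposing intent-revealing methods so
// callers never see the underlying Map/Cache/Set.
class BlockManagerEndpointSharedState[Id, Info] {
  private val blockManagerInfo = mutable.Map.empty[Id, Info]
  private val recentlyRemovedExecutors = mutable.Map.empty[String, Long]

  def register(id: Id, info: Info): Unit = blockManagerInfo(id) = info

  def get(id: Id): Option[Info] = blockManagerInfo.get(id)

  def remove(id: Id, executorId: String, now: Long): Unit = {
    blockManagerInfo.remove(id)
    recentlyRemovedExecutors(executorId) = now
  }

  def wasRecentlyRemoved(executorId: String): Boolean =
    recentlyRemovedExecutors.contains(executorId)
}
```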
Nit: for just checking whether an object is null you do not need to build an Option instance (Option is good when you pass the object to another method to emphasize that it can be null).
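In other words, a trivial illustration of the nit (not the PR's actual code):

```scala
// Allocation-free null check, preferred for a simple guard:
def hasInfo(info: AnyRef): Boolean = info != null

// Wrapping in Option just to test for null allocates an object for no benefit:
def hasInfoViaOption(info: AnyRef): Boolean = Option(info).isDefined
```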
Nit: same here (Option is not needed)
Hey guys, I'd like to propose a simpler (but maybe a little bit tricky) fix, if I understand the issue correctly. The idea is:
Instead of removing the executor directly, we set `executorLastSeen(executorId) = -1L` when we receive `ExecutorRemoved` in `HeartbeatReceiver`. And then,
1. if `Heartbeat` comes first, before `ExpireDeadHosts`, we remove the executor from `executorLastSeen` by checking the value "< 0" and avoid the re-register.
2. if `ExpireDeadHosts` comes first, before `Heartbeat`, we set `executorLastSeen(executorId) = -2L`. We can't remove it this time in `ExpireDeadHosts` because if `Heartbeat` comes later we'd have the same issue again.
2.1 if `Heartbeat` comes later, we remove the executor from `executorLastSeen` by checking the value "< 0" too, and also avoid the re-register.
2.2 if `Heartbeat` doesn't come (that means the executor stopped before sending the heartbeat), we remove the executor from `executorLastSeen` by checking the value == -2L in the next `ExpireDeadHosts`.
In this way, we can avoid the extra cache and all changes should be limited to `HeartbeatReceiver`.
Any thoughts?
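A hedged sketch of that state machine, using a plain map in place of `HeartbeatReceiver`'s internal state. The -1L/-2L sentinels are exactly the ones proposed above; everything else (names, return values) is illustrative:

```scala
import scala.collection.mutable

// executorId -> last-seen timestamp, or a negative sentinel for a removed executor.
val executorLastSeen = mutable.Map.empty[String, Long]

// On ExecutorRemoved: mark the executor instead of removing it.
def onExecutorRemoved(executorId: String): Unit =
  executorLastSeen(executorId) = -1L

// On Heartbeat: returns whether the executor should be asked to re-register.
def onHeartbeat(executorId: String): Boolean = {
  executorLastSeen.get(executorId) match {
    case Some(ts) if ts < 0L =>
      executorLastSeen.remove(executorId); false   // cases 1 and 2.1: drop, no re-register
    case Some(_) =>
      executorLastSeen(executorId) = System.currentTimeMillis(); false  // normal heartbeat
    case None =>
      true                                         // truly unknown executor
  }
}

// On ExpireDeadHosts: first pass demotes -1L to -2L (case 2),
// the next pass removes the -2L entries (case 2.2).
def onExpireDeadHosts(): Unit = {
  for ((id, ts) <- executorLastSeen.toSeq) {       // snapshot so mutation is safe
    if (ts == -2L) executorLastSeen.remove(id)
    else if (ts == -1L) executorLastSeen(id) = -2L
  }
}
```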
In essence, if I understood correctly, we are adding a
Did I miss anything? I am fine with this approach.
Thanks for the comment, @Ngone51.
Thanks for the comment, @mridulm. I believe @attilapiros' suggestion would take care of both cases where re-registration is triggered, without introducing another cache of
I am getting a little confused between the PR description and the subsequent discussion. An expiration of an executor from the heartbeat master not only sends a
My understanding was that there is a race here between the cluster manager notifying the application (after killing an executor) and the executor heartbeat/blockmanager re-registration, which ends up causing a dead executor to be marked live indefinitely. Is this the only case we are addressing? Or are there other paths that are impacted? (@Ngone51 Not sure if standalone has nuances that I am missing here.)
Standalone should be the same. @mridulm
You're right. I followed the PR description only so I thought I checked the code and surprisingly found that we don't remove
If that's the case (it seems not correct but has existed for a long time already), I think posting the spark/core/src/main/scala/org/apache/spark/storage/BlockManagerMasterEndpoint.scala Lines 531 to 562 in ee7d838
@Ngone51 I believe moving
Thanks!
I was wondering what the policy is for merging the latest upstream changes into a dev branch?
I tried this and the issue still exists. When we apply the sequence of events mentioned in #32114, the issue surfaces again.
Rebase is recommended whenever it's possible. @sumeetgajjar
I see. I think I missed the code path of
Thanks for the experiment.
So I think the solution now would be: For the heartbeat, using #32114 (review). For the
In
WDYT?
@Ngone51 Thank you for this suggestion. I understand the solution; however, I believe this might be slightly complex for keeping track of things, since the cleanup/removal is triggered from a different component, i.e.
I spoke to @attilapiros offline regarding his solution and it seems he forgot to mention one thing: that
spark/core/src/main/scala/org/apache/spark/storage/BlockManagerMasterEndpoint.scala Line 344 in 1b609c7
Instead, the cleanup thread would take care of removal. This will keep the whole logic in the same component, i.e.
I will proceed with @attilapiros' proposal where we model BlockManager removal as a new state, run some tests, and update more.
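A minimal sketch of such an in-component cleanup thread. The scheduling interval, timeout, map shape, and names are assumptions for illustration; the real change wires this into `BlockManagerMasterEndpoint`:

```scala
import java.util.concurrent.{Executors, TimeUnit}
import scala.collection.mutable

// Maps a block-manager key to the time its executor's removal started
// (0L meaning "still alive").
val removalTs = mutable.Map.empty[String, Long]
val staleTimeoutMs = 10 * 60 * 1000L   // assumed; should track the heartbeat timeout

val cleaner = Executors.newSingleThreadScheduledExecutor()
cleaner.scheduleWithFixedDelay(new Runnable {
  override def run(): Unit = removalTs.synchronized {
    val now = System.currentTimeMillis()
    // Drop only entries whose removal is old enough to be safely forgotten.
    val stale = removalTs.collect { case (k, ts) if ts != 0L && now - ts > staleTimeoutMs => k }
    stale.foreach(removalTs.remove)
  }
}, staleTimeoutMs, staleTimeoutMs, TimeUnit.MILLISECONDS)
```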
…oveRdd
### What changes were proposed in this pull request?
In `BlockManagerMasterEndpoint`, for the disk-persisted RDDs (when `spark.shuffle.service.fetch.rdd.enable` is enabled) we keep track of the block status entries by external shuffle service instances (so on YARN we are basically keeping them by nodes). This is the `blockStatusByShuffleService` member val. And when all the RDD blocks are removed for one external shuffle service instance, the key and the empty map can be removed from `blockStatusByShuffleService`.
### Why are the changes needed?
It is a small leak and I was asked to take care of it in apache/spark#32114 (comment).
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Manually, by adding a temporary log line to check the `blockStatusByShuffleService` value before and after the `removeRdd`, and running the `SPARK-25888: using external shuffle service fetching disk persisted blocks` test in `ExternalShuffleServiceSuite`.
Closes #32790 from attilapiros/SPARK-35543.
Authored-by: attilapiros <[email protected]>
Signed-off-by: attilapiros <[email protected]>
I just realized this bug does cause a real problem when working in conjunction with #24533. Basically, the re-registration issue leads the driver to think an executor is alive while it's actually dead, which in turn causes the client to retry the block on the dead executor, while it shouldn't. Could you @sumeetgajjar backport this fix to 3.1/3.0 as well?
@Ngone51, sure, I will backport it to 3.1 and 3.0 as well.
SPARK-34949 should also be backported to close any gaps. P.S. It is already in 3.1; we just need to backport it to 3.0.
…or msg is in-flight
### What changes were proposed in this pull request?
This patch proposes a fix to prevent triggering BlockManager reregistration while a `StopExecutor` msg is in-flight.
Here, on receiving the `StopExecutor` msg, we do not remove the corresponding `BlockManagerInfo` from the `blockManagerInfo` map; instead we mark it as dead by updating the corresponding `executorRemovalTs`. There's a separate cleanup thread running to periodically remove the stale `BlockManagerInfo` entries from the `blockManagerInfo` map.
Now if a recently removed `BlockManager` tries to register, the driver simply ignores it since the `blockManagerInfo` map already contains an entry for it. The same applies to `BlockManagerHeartbeat`: if the BlockManager belongs to a recently removed executor, the `blockManagerInfo` map would contain an entry and we shall not ask the corresponding `BlockManager` to re-register.
### Why are the changes needed?
These changes are needed since BlockManager reregistration while the executor is shutting down causes inconsistent bookkeeping of executors in Spark. Consider the following scenario:
- `CoarseGrainedSchedulerBackend` issues an async `StopExecutor` on the executorEndpoint
- `CoarseGrainedSchedulerBackend` removes that executor from the Driver's internal data structures and publishes `SparkListenerExecutorRemoved` on the `listenerBus`
- The Executor has still not processed `StopExecutor` from the Driver
- The Driver receives a heartbeat from the Executor; since it cannot find the `executorId` in its data structures, it responds with `HeartbeatResponse(reregisterBlockManager = true)`
- The `BlockManager` on the Executor reregisters with the `BlockManagerMaster` and `SparkListenerBlockManagerAdded` is published on the `listenerBus`
- The Executor starts processing the `StopExecutor` and exits
- `AppStatusListener` picks up the `SparkListenerBlockManagerAdded` event and updates `AppStatusStore`
- `statusTracker.getExecutorInfos` refers to `AppStatusStore` to get the list of executors, which returns the dead executor as alive
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
- Modified the existing unittests.
- Ran a simple test application on minikube that asserts that the number of executors is zero once the executor idle timeout is reached.
Closes apache#32114 from sumeetgajjar/SPARK-35011.
Authored-by: Sumeet Gajjar <[email protected]>
Signed-off-by: yi.wu <[email protected]>
…xecutor msg is in-flight
This PR backports #32114 to 3.1
<hr>
### What changes were proposed in this pull request?
This patch proposes a fix to prevent triggering BlockManager reregistration while a `StopExecutor` msg is in-flight.
Here, on receiving the `StopExecutor` msg, we do not remove the corresponding `BlockManagerInfo` from the `blockManagerInfo` map; instead we mark it as dead by updating the corresponding `executorRemovalTs`. There's a separate cleanup thread running to periodically remove the stale `BlockManagerInfo` entries from the `blockManagerInfo` map.
Now if a recently removed `BlockManager` tries to register, the driver simply ignores it since the `blockManagerInfo` map already contains an entry for it. The same applies to `BlockManagerHeartbeat`: if the BlockManager belongs to a recently removed executor, the `blockManagerInfo` map would contain an entry and we shall not ask the corresponding `BlockManager` to re-register.
### Why are the changes needed?
These changes are needed since BlockManager reregistration while the executor is shutting down causes inconsistent bookkeeping of executors in Spark. Consider the following scenario:
- `CoarseGrainedSchedulerBackend` issues an async `StopExecutor` on the executorEndpoint
- `CoarseGrainedSchedulerBackend` removes that executor from the Driver's internal data structures and publishes `SparkListenerExecutorRemoved` on the `listenerBus`
- The Executor has still not processed `StopExecutor` from the Driver
- The Driver receives a heartbeat from the Executor; since it cannot find the `executorId` in its data structures, it responds with `HeartbeatResponse(reregisterBlockManager = true)`
- The `BlockManager` on the Executor reregisters with the `BlockManagerMaster` and `SparkListenerBlockManagerAdded` is published on the `listenerBus`
- The Executor starts processing the `StopExecutor` and exits
- `AppStatusListener` picks up the `SparkListenerBlockManagerAdded` event and updates `AppStatusStore`
- `statusTracker.getExecutorInfos` refers to `AppStatusStore` to get the list of executors, which returns the dead executor as alive
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
- Modified the existing unittests.
- Ran a simple test application on minikube that asserts that the number of executors is zero once the executor idle timeout is reached.
Closes #33771 from sumeetgajjar/SPARK-35011-br-3.1.
Authored-by: Sumeet Gajjar <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
…xecutor msg is in-flight
This PR backports #32114 to 3.0
<hr>
### What changes were proposed in this pull request?
This patch proposes a fix to prevent triggering BlockManager reregistration while a `StopExecutor` msg is in-flight.
Here, on receiving the `StopExecutor` msg, we do not remove the corresponding `BlockManagerInfo` from the `blockManagerInfo` map; instead we mark it as dead by updating the corresponding `executorRemovalTs`. There's a separate cleanup thread running to periodically remove the stale `BlockManagerInfo` entries from the `blockManagerInfo` map.
Now if a recently removed `BlockManager` tries to register, the driver simply ignores it since the `blockManagerInfo` map already contains an entry for it. The same applies to `BlockManagerHeartbeat`: if the BlockManager belongs to a recently removed executor, the `blockManagerInfo` map would contain an entry and we shall not ask the corresponding `BlockManager` to re-register.
### Why are the changes needed?
These changes are needed since BlockManager reregistration while the executor is shutting down causes inconsistent bookkeeping of executors in Spark. Consider the following scenario:
- `CoarseGrainedSchedulerBackend` issues an async `StopExecutor` on the executorEndpoint
- `CoarseGrainedSchedulerBackend` removes that executor from the Driver's internal data structures and publishes `SparkListenerExecutorRemoved` on the `listenerBus`
- The Executor has still not processed `StopExecutor` from the Driver
- The Driver receives a heartbeat from the Executor; since it cannot find the `executorId` in its data structures, it responds with `HeartbeatResponse(reregisterBlockManager = true)`
- The `BlockManager` on the Executor reregisters with the `BlockManagerMaster` and `SparkListenerBlockManagerAdded` is published on the `listenerBus`
- The Executor starts processing the `StopExecutor` and exits
- `AppStatusListener` picks up the `SparkListenerBlockManagerAdded` event and updates `AppStatusStore`
- `statusTracker.getExecutorInfos` refers to `AppStatusStore` to get the list of executors, which returns the dead executor as alive
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
- Modified the existing unittests.
- Ran a simple test application on minikube that asserts that the number of executors is zero once the executor idle timeout is reached.
Closes #33782 from sumeetgajjar/SPARK-35011-br-3.0.
Authored-by: Sumeet Gajjar <[email protected]>
Signed-off-by: yi.wu <[email protected]>
…opExecutor msg is in-flight" This reverts commit b9e53f8. ### What changes were proposed in this pull request? Revert #32114 ### Why are the changes needed? It breaks the expected `BlockManager` re-registration (e.g., heartbeat loss of an active executor) due to deferred removal of `BlockManager`, see the check: https://github.com/apache/spark/blob/9cefde8db373a3433b7e3ce328e4a2ce83b1aca2/core/src/main/scala/org/apache/spark/storage/BlockManagerMasterEndpoint.scala#L551 ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Pass existing tests. Closes #33942 from Ngone51/SPARK-36700. Authored-by: yi.wu <[email protected]> Signed-off-by: Dongjoon Hyun <[email protected]>
…opExecutor msg is in-flight" This reverts commit b9e53f89379c34bde36b9c37471e21f037092749. ### What changes were proposed in this pull request? Revert apache/spark#32114 ### Why are the changes needed? It breaks the expected `BlockManager` re-registration (e.g., heartbeat loss of an active executor) due to deferred removal of `BlockManager`, see the check: https://github.com/apache/spark/blob/9cefde8db373a3433b7e3ce328e4a2ce83b1aca2/core/src/main/scala/org/apache/spark/storage/BlockManagerMasterEndpoint.scala#L551 ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Pass existing tests. Closes #33942 from Ngone51/SPARK-36700. Authored-by: yi.wu <[email protected]> Signed-off-by: Dongjoon Hyun <[email protected]>
…opExecutor msg is in-flight" This reverts commit b9e53f8. ### What changes were proposed in this pull request? Revert #32114 ### Why are the changes needed? It breaks the expected `BlockManager` re-registration (e.g., heartbeat loss of an active executor) due to deferred removal of `BlockManager`, see the check: https://github.com/apache/spark/blob/9cefde8db373a3433b7e3ce328e4a2ce83b1aca2/core/src/main/scala/org/apache/spark/storage/BlockManagerMasterEndpoint.scala#L551 ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Pass existing tests. Closes #33959 from Ngone51/revert-35011-3.2. Authored-by: yi.wu <[email protected]> Signed-off-by: Dongjoon Hyun <[email protected]>
…ockManager reregistration
### What changes were proposed in this pull request?
Also post the `SparkListenerExecutorRemoved` event when removing an executor that is known by `BlockManagerMaster` but unknown to `SchedulerBackend`.
### Why are the changes needed?
#32114 reports an issue where `BlockManagerMaster` could register a `BlockManager` from a dead executor due to the reregistration mechanism. The side effect is that the executor is shown on the UI as an active one, though it's actually already dead. In #32114, we tried to avoid such reregistration for a to-be-dead executor. However, I just realized that we can actually leave such reregistration alone, since `HeartbeatReceiver.expireDeadHosts` should clean up those `BlockManager`s in the end. The problem is that the corresponding executors in the UI can't be cleaned up along with the `BlockManager` cleanup, because executors in the UI can only be cleaned by `SparkListenerExecutorRemoved`, while the `BlockManager` cleanup only posts `SparkListenerBlockManagerRemoved` (which is ignored by `AppStatusListener`).
### Does this PR introduce _any_ user-facing change?
Yes, users would see the falsely active executor be removed in the end.
### How was this patch tested?
Pass existing tests.
Closes #34536 from Ngone51/SPARK-35011.
Lead-authored-by: wuyi <[email protected]>
Co-authored-by: yi.wu <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
…r has been lost
### What changes were proposed in this pull request?
This PR mainly proposes to reject the block manager re-registration if the executor has already been considered lost/dead by the scheduler backend.
Along with the main proposal, this PR also includes a few other changes:
* Only post the `SparkListenerBlockManagerAdded` event when the registration succeeds
* Return an "invalid" executor id when the re-registration fails
* Do not report all blocks when the re-registration fails
### Why are the changes needed?
BlockManager re-registration from a lost executor (terminated/terminating executor or orphan executor) has led to some known issues, e.g., a falsely active executor showing up in the UI (SPARK-35011) and [block fetching to the dead executor](#32114 (comment)). And since there's no re-registration from the lost executor itself, it's meaningless to allow BlockManager re-registration when the executor is already lost.
Regarding the corner case where the re-registration event comes in before the lost executor is actually removed from the scheduler backend, I think it is not possible. Re-registration will only be required when the block manager is not found in `blockManagerInfo`, and the block manager will only be removed from `blockManagerInfo` either when the executor is already known to be lost or when it is removed by the driver proactively. So the executor should always be removed from the scheduler backend first, before the re-registration event comes.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Unit test
Closes #38876 from Ngone51/fix-blockmanager-reregister.
Authored-by: Yi Wu <[email protected]>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>
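A condensed sketch of the rejection logic this later change describes. The alive-check predicate is passed in here only to keep the snippet self-contained; in Spark it would come from the scheduler backend's view of live executors, and the function/parameter names are assumptions:

```scala
// Returns Some(executorId) on success, None when registration is rejected
// because the scheduler backend no longer considers the executor alive.
def maybeRegisterBlockManager(
    executorId: String,
    isExecutorAlive: String => Boolean,
    register: String => Unit): Option[String] = {
  if (!isExecutorAlive(executorId)) {
    None                  // reject: no SparkListenerBlockManagerAdded, no block reports
  } else {
    register(executorId)  // normal path: register and post the listener event
    Some(executorId)
  }
}
```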
What changes were proposed in this pull request?
This patch proposes a fix to prevent triggering BlockManager reregistration while a `StopExecutor` msg is in-flight.
Here, on receiving the `StopExecutor` msg, we do not remove the corresponding `BlockManagerInfo` from the `blockManagerInfo` map; instead we mark it as dead by updating the corresponding `executorRemovalTs`. There's a separate cleanup thread running to periodically remove the stale `BlockManagerInfo` entries from the `blockManagerInfo` map.
Now if a recently removed `BlockManager` tries to register, the driver simply ignores it since the `blockManagerInfo` map already contains an entry for it. The same applies to `BlockManagerHeartbeat`: if the BlockManager belongs to a recently removed executor, the `blockManagerInfo` map would contain an entry and we shall not ask the corresponding `BlockManager` to re-register.
Why are the changes needed?
These changes are needed since BlockManager reregistration while the executor is shutting down causes inconsistent bookkeeping of executors in Spark.
Consider the following scenario:
- `CoarseGrainedSchedulerBackend` issues an async `StopExecutor` on the executorEndpoint
- `CoarseGrainedSchedulerBackend` removes that executor from the Driver's internal data structures and publishes `SparkListenerExecutorRemoved` on the `listenerBus`
- The Executor has still not processed `StopExecutor` from the Driver
- The Driver receives a heartbeat from the Executor; since it cannot find the `executorId` in its data structures, it responds with `HeartbeatResponse(reregisterBlockManager = true)`
- The `BlockManager` on the Executor reregisters with the `BlockManagerMaster` and `SparkListenerBlockManagerAdded` is published on the `listenerBus`
- The Executor starts processing the `StopExecutor` and exits
- `AppStatusListener` picks up the `SparkListenerBlockManagerAdded` event and updates `AppStatusStore`
- `statusTracker.getExecutorInfos` refers to `AppStatusStore` to get the list of executors, which returns the dead executor as alive
Does this PR introduce any user-facing change?
No
How was this patch tested?
- Modified the existing unittests.
- Ran a simple test application on minikube that asserts that the number of executors is zero once the executor idle timeout is reached.