Skip to content

Commit 498df81

Browse files
kokonguyen191zengqiang.xu
authored andcommitted
HDFS-17459. [FGL] Add documentation (#6737)
1 parent 404cb96 commit 498df81

File tree

1 file changed

+210
-0
lines changed

1 file changed

+210
-0
lines changed
Lines changed: 210 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,210 @@
1+
<!--
2+
Licensed to the Apache Software Foundation (ASF) under one or more
3+
contributor license agreements. See the NOTICE file distributed with
4+
this work for additional information regarding copyright ownership.
5+
The ASF licenses this file to You under the Apache License, Version 2.0
6+
(the "License"); you may not use this file except in compliance with
7+
the License. You may obtain a copy of the License at
8+
9+
http://www.apache.org/licenses/LICENSE-2.0
10+
11+
Unless required by applicable law or agreed to in writing, software
12+
distributed under the License is distributed on an "AS IS" BASIS,
13+
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
14+
See the License for the specific language governing permissions and
15+
limitations under the License.
16+
-->
17+
18+
HDFS Namenode Fine-grained Locking
19+
==================================
20+
21+
<!-- MACRO{toc|fromDepth=0|toDepth=3} -->
22+
23+
Overview
24+
--------
25+
26+
HDFS relies on a single master, the Namenode (NN), as its metadata center.
27+
From an architectural point of view, a few elements make NN the bottleneck of an HDFS cluster:
28+
* NN keeps the entire namespace in memory (directory tree, blocks, Datanode related info, etc.)
29+
* Read requests (`getListing`, `getFileInfo`, `getBlockLocations`) are served from memory.
30+
Write requests (`mkdir`, `create`, `addBlock`, `complete`) update the memory state and write a journal transaction into QJM.
31+
Both types of requests need a locking mechanism to ensure data consistency and correctness.
32+
* All requests are funneled into NN and have to go through the global FS lock.
33+
Each write operation acquires this lock in write mode and holds it until that operation is executed.
34+
This lock mode prevents concurrent execution of write operations even if they involve different branches of the directory tree.
35+
36+
NN fine-grained locking (FGL) implementation aims to alleviate this bottleneck by allowing concurrency of disjoint write operations.
37+
38+
JIRA: [HDFS-17366](https://issues.apache.org/jira/browse/HDFS-17366)
39+
40+
Design
41+
------
42+
In theory, fully independent operations can be processed concurrently, such as operations involving different subdirectory trees.
43+
As such, NN can split the global lock into the full path lock, just using the full path lock to protect a special subdirectory tree.
44+
45+
### RPC Categorization
46+
47+
Roughly, RPC operations handled by NN can be divided into 8 main categories
48+
49+
| Category | Operations |
50+
|----------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
51+
| Involving namespace tree | `mkdir`, `create` (without overwrite), `getFileInfo` (without locations), `getListing` (without locations), `setOwner`, `setPermission`, `getStoragePolicy`, `setStoragePolicy`, `rename`, `isFileClosed`, `getFileLinkInfo`, `setTimes`, `modifyAclEntries`, `removeAclEntries`, `setAcl`, `getAcl`, `setXAttr`, `getXAttrs`, `listXAttrs`, `removeXAttr`, `checkAccess`, `getErasureCodingPolicy`, `unsetErasureCodingPolicy`, `getQuotaUsage`, `getPreferredBlockSize` |
52+
| Involving only blocks | `reportBadBlocks`, `updateBlockForPipeline`, `updatePipeline` |
53+
| Involving only DNs | `registerDatanode`, `setBalancerBandwidth`, `sendHeartbeat` |
54+
| Involving both namespace tree & blocks | `getBlockLocation`, `create` (with overwrite), `append`, `setReplication`, `abandonBlock`, `addBlock`, `getAdditionalDatanode`, `complete`, `concat`, `truncate`, `delete`, `getListing` (with locations), `getFileInfo` (with locations), `recoverLease`, `listCorruptFileBlocks`, `fsync`, `commitBlockSynchronization`, `RedundancyMonitor`, `processMisReplicatedBlocks` |
55+
| Involving both DNs & blocks | `getBlocks`, `errorReport` |
56+
| Involving namespace tree, DNs & blocks | `blockReport`, `blockReceivedAndDeleted`, `HeartbeatManager`, `Decommission` |
57+
| Requiring locking the entire namespace | `rollEditLog`, `startCommonService`, `startActiveService`, `saveNamespace`, `rollEdits`, `EditLogTailer`, `rollingUpgrade` |
58+
| Requiring no locking | `getServerDefaults`, `getStats` |
59+
60+
For operations involving namespace tree, fully independent operations can be handled by NN concurrently. Almost all of them use the full path as a parameter, e.g. `create`, `mkdirs`, `getFileInfo`, etc. So we can use a full path lock to make them thread-safe.
61+
62+
For operations involving blocks, one block belongs to one and only one `INodeFile`, so NN can use the namespace tree to make these operations thread-safe.
63+
64+
For operations involving DNs, NN needs a separate DN lock because DNs operate separately from the namespace tree.
65+
66+
For operations requiring the entire namespace locked, the global lock can be used to make these operations thread-safe. In general, these operations have low frequency and thus low impact despite the global locking.
67+
68+
### Full Path Lock
69+
70+
Used to protect operations involving namespace tree.
71+
72+
All of these operations receive a path or INodeID as a parameter and can further be divided into 3 main subcategories:
73+
1. Parameters contain only one path (`create`, `mkdir`)
74+
2. Parameters contain multiple paths (`rename`, `concat`)
75+
3. Parameters contain INodeID (`addBlock`, `complete`)
76+
77+
For type 1, NN acquires a full path lock according to its semantics. Take `setPermission("/a/b/c/f.txt")` for example, the set of locks to acquire are ReadLock("/"), ReadLock("a"), ReadLock("b"), ReadLock("c") and WriteLock("f.txt"). Different lock patterns are explained in a later section.
78+
79+
For type 2, NN acquires full path locks in a predefined order, such as the lexicographic order, to avoid deadlocks.
80+
81+
For type 3, NN acquires a full path lock by in the following fashion:
82+
- Unsafely obtains full path recursively
83+
- Acquires the full path lock according to the lock mode
84+
- Rechecks whether the last node of the full path is equal to the INodeID given
85+
- If not, that means that the `INodeFile` might have been renamed or concatenated, need to retry
86+
- If the max retry attempts have been reached, throw a `RetryException` to client to let client retry
87+
88+
### `INodeFile` Lock
89+
90+
Used to protect operations involving blocks.
91+
92+
One block belongs to one and only one `INodeFile`, so NN can use the INodeFile lock to make operations thread-safe. Normally, there is no need to acquire the full path lock since changing the namespace tree structure does not affect the block.
93+
94+
`concat` might change the `INodeFile` a block belongs to. Since both block related operations and `concat` need to acquire the `INodeFile` write lock, only one of them can be processed at a time.
95+
96+
### DN Lock
97+
98+
Used to protect operations involving DNs.
99+
100+
NN uses a `DatanodeDescriptor` object to store the information of one DN and uses `DatanodeManager` to manage all DNs in the memory. `DatanodeDescriptor` uses `DatanodeStorageInfo` to store the information of one storage device on one DN.
101+
102+
DNs have nothing to do with the namespace tree, so NN uses a separate DN lock for these operations. Since DNs are independent of one another, NN can assign a lock to each DN.
103+
104+
### Global Lock
105+
106+
Used for operations requiring the entire namespace locked.
107+
108+
There are some operations that need to lock the entire namespace, e.g. safe mode related operations, HA service related, etc. NN uses the global lock to make these operations thread-safe. Outside of these infrequent operations that require the global write lock, all other operations have to acquire the global read lock. The only exception to this rule is JMX operations being allowed to bypass locking entirely to ensure that metrics can be collected regardless of long write lock holding.
109+
110+
### Lock Order
111+
112+
As mentioned above, there are the global lock, DN lock, and full path lock. NN acquires locks in this specific order to avoid deadlocks.
113+
114+
Locks are to be acquired in this order:
115+
- Global > DN > Full path
116+
- Global > DN > Last `INodeFile`
117+
118+
Possible lock combinations are as follows:
119+
- Global write lock
120+
- Global read lock > Full path lock
121+
- Global read lock > DN read/write lock
122+
- Global read lock > DN read/write lock > Read/Write lock of last `INodeFile`
123+
- Global read lock > DN read/write lock > Full path lock
124+
125+
### Lock Pools
126+
127+
NN allocates locks as needed to the INodes used by active threads, and deletes them after the locks are no longer in use. Locks for commonly accessed `INode`s like the root are cached.
128+
129+
NN uses an `INodeLockPool` to manage these locks. The lock pool:
130+
- Returns a closeable lock for an INode based on the lock type,
131+
- Removes this lock if it is no longer used by any threads.
132+
133+
Similar to `INodeLockPool`, a `DNLockPool` is used to manage the locks for DNs. Unlike `INodeLockPool`, `DNLockPool` keeps all locks in memory due to the comparatively lower number of locks.
134+
135+
### Lock Modes
136+
137+
Operations related to namespace tree have different semantics and may involve the modification or access of different INodes, for example: `getBlockLocation` only accesses the last iNodeFile, `delete` modifies both the parent and the last INode, `mkdir` may modify multiple ancestor INodes.
138+
139+
Four lock modes (plus no locking):
140+
- LOCK_READ
141+
- This lock mode acquires the read locks for all INodes down the path.
142+
- Example operations: `getBlockLocation`, `getFileInfo`.
143+
- LOCK_WRITE
144+
- This lock mode acquires the write lock for the last INode and the read locks for all ancestor INodes in the full path.
145+
- Example operations: `setPermission`, `setReplication`.
146+
- LOCK_PARENT
147+
- This lock mode acquires the write lock for the last two INodes and the read locks for all remaining ancestor INodes in the full path.
148+
- Example operations: `rename`, `delete`, `create` (when the parent directory exists).
149+
- LOCK_ANCESTOR
150+
- This lock mode acquires the write lock for the last existing INode and the read locks for all remaining ancestor INodes in the full path.
151+
- Example operations: `mkdir`, `create` (when the parent directory doesn't exist).
152+
- NONE
153+
- This lock mode does not acquire any locks for the given path.
154+
155+
Roadmap
156+
-------
157+
158+
#### Stage 1: Split the global lock into FSLock and BMLock
159+
160+
Split the global lock into two global locks, FSLock and BMLock.
161+
- FSLock for operations that relate to namespace tree.
162+
- BMLock for operations related to blocks and/or operations related to DNs.
163+
- Both FSLock and BMLock for HA related operations.
164+
After this step, FGL contains global FSLock and global BMLock.
165+
166+
No big logic changes in this step. The original logic with the global lock retains. This step aims to make the lock mode configurable.
167+
168+
JIRA: [HDFS-17384](https://issues.apache.org/jira/browse/HDFS-17384) [Progress: Done]
169+
170+
#### Stage 2: Split the global FSLock
171+
172+
After splitting the global lock into FSLock and BMLock, this step aims to split the global FSLock into full path locks so that fully independent operations that only involve namespace tree can be processed concurrently.
173+
In this step, NN still uses the global BMLock to protect block related operations and DN related operations.
174+
After this step, FGL contains global FSLock, full path lock, and global BMLock.
175+
176+
JIRA: [HDFS-17385](https://issues.apache.org/jira/browse/HDFS-17385) [Progress: Ongoing]
177+
178+
#### Stage 3: Split the global BMLock
179+
180+
This step aims to split the global BMLock into full path locks and DN locks.
181+
After this step, FGL contains global FSLock, DN lock, and full path lock.
182+
183+
JIRA: [HDFS-17386](https://issues.apache.org/jira/browse/HDFS-17386) [Progress: Ongoing]
184+
185+
Configuration
186+
------------
187+
188+
NN FGL implementation can be used by adding this configuration to `hdfs-site.xml`.
189+
190+
<property>
191+
<name>dfs.namenode.lock.model.provider.class</name>
192+
<value>org.apache.hadoop.hdfs.server.namenode.fgl.FineGrainedFSNamesystemLock</value>
193+
<description>
194+
An implementation class of FSNamesystem lock.
195+
Defaults to GlobalFSNamesystemLock.class
196+
</description>
197+
</property>
198+
199+
The lock manager class must implement the interface defined by `org.apache.hadoop.hdfs.server.namenode.fgl.FSNLockManager`. Currently, there are two implementations:
200+
* `org.apache.hadoop.hdfs.server.namenode.fgl.GlobalFSNamesystemLock`: the original lock mode that utilizes one global FS lock, also the default value for this config;
201+
* `org.apache.hadoop.hdfs.server.namenode.fgl.FineGrainedFSNamesystemLock`: FGL implementation.
202+
203+
Adding RPC
204+
----------
205+
206+
For developers adding a new RPC operation, the operation should follow FGL locking schematic to ensure data integrity:
207+
* Global FSLock should be acquired in read mode, unless it is an administrative operation (related to HA, edit logs, etc.)
208+
* If the operation requires access/modification of `DatanodeDescriptor` and/or `DatanodeStorageInfo`, DN lock should be acquired in read/write mode accordingly.
209+
* Only applicable in stage 3 once DN lock is implemented. During stage 1 and stage 2, global BMLock is to be used instead.
210+
* If the operation deals with one or more paths/blocks, the full path lock(s) should be acquired based on the implementation details described above. It is best to check an existing RPC operation that has a similar method of access to the new operation to consult the lock implementation.

0 commit comments

Comments
 (0)