HDDS-11650. ContainerId list to track all containers created in a datanode #7402
Conversation
long containerId = container.getContainerData().getContainerID();
State containerState = container.getContainerData().getState();
if (!overwriteMissingContainers && missingContainerSet.contains(containerId)) {
I don't understand why we cannot just check if the container is in the ContainerIDTable. Why do we need to use this missingContainerSet instead?
I traced the creation of the missing set to a ratis snapshot (I know nothing about it) and a diff against the containers found in memory. What about EC containers? Do they all become part of this missing set?
Prior to this change, the missingSet doesn't seem to be used for anything inside this class, but now it is used for these checks. What was missingSet used for before?
I would have thought the ContainerID table starts out empty after this change is committed. On first start, it gets populated with all containers in the ContainerSet (scanned from disk). On later startups, all containers can be added again, but they should already all be there. Then we could just track previously created containers in the containerID table?
The container set is first initialised with all the containers present on disk by performing an ls on the data volumes. Then we iterate through the containerIds table in RocksDB, and any container present in this table but not present on disk is added to the missing container set. This includes both Ratis and EC containers.
On container creation we then only have to check this in-memory set rather than going to RocksDB every time, which would incur an IO cost; however small, that cost is unnecessary in my opinion when we already have everything in memory.
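A minimal sketch of the startup reconciliation described above, assuming a simple map view of the containerIds table and a set of container IDs found on disk; the method and variable names here are illustrative, not the actual Ozone APIs.

    // Hypothetical sketch: containers recorded in the RocksDB-backed containerIds table
    // but not found on any data volume are treated as missing (Ratis and EC alike).
    import java.util.HashSet;
    import java.util.Map;
    import java.util.Set;

    final class StartupReconciliationSketch {
      static Set<Long> buildMissingContainerSet(Map<Long, String> containerIdsTable,
                                                Set<Long> onDiskContainerIds) {
        Set<Long> missing = new HashSet<>();
        for (Long containerId : containerIdsTable.keySet()) {
          if (!onDiskContainerIds.contains(containerId)) {
            missing.add(containerId); // persisted earlier, absent on disk now
          }
        }
        return missing;
      }
    }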
Ah ok, I misread the Javadoc and thought it was a ratis snapshot. However my question still stands - why not simply use the ContainerIDTable to check for previous container creation, rather than augmenting this missing list?
I am just reusing the data structure that was already there. This missing container set was initially used only for Ratis containers; I am now making it track EC containers as well. I didn't want to change a lot of code, so I reuse most of what is already there.
Moreover, an in-memory get is faster than fetching from RocksDB.
Prior to this change, the missingContainerSet variable in the ContainerSet was not used for anything within that class. Now we are potentially adding to it from the containerList - what impact does that have more widely? What was the missingContainerSet used for before?
We also have to open the new containerList table - it would be clearer if we just loaded that into memory and used the new "seenContainerList" for previously-existing checks.
However, backing up a bit - how can containers get created?
One way is via the write path - that is the problem we are trying to solve.
Another is via EC reconstruction - there is already logic there to deal with a duplicate container, I think, although it may not cover volume failure. EC reconstruction effectively uses the same write path as normal writes, so it is basically the same as the write path.
Then there is replication, which is a very different path into the system.
Maybe this problem is better solved not in ContainerSet, but at the higher level in the dispatcher where the writes are being handled?
I feel we should track all the containers getting added to the system. This flow also accommodates replication for both Ratis and EC. Barring the exception where we allow a replication job to overwrite missing containers, we should prevent overwriting any container coming through any flow. If we have this check in a central place, any future code changes would be foolproof. At the very least, things would break loudly instead of silently writing something that is wrong.
For me, this change is too big, and it's going to cause conflicts and be difficult to backport. It's also hard to review, as the number of changed files is massive, although many are test files with simple one-line changes. It feels like this change could be largely contained within the ContainerSet class, without changing a lot of generic classes and causing a ripple effect into the wider system. All it has to do is create a new overloaded constructor and leave the existing one unchanged, making the existing one call the new one with the correct runtime requirements. Then we just need to check whether a container previously existed or not, and add new containers to the "previously exists" list. This shouldn't require changing 90-something files in a single PR.
All of the tests use this containerSet. If we pass a null value, it makes the container set a read-only data structure (I have put guardrails in place so that we don't end up inadvertently writing containers without persisting the data). Since we have made the table a mandatory argument, we have to pass a mocked table in all the existing tests - another big change. We didn't have a pre-existing RocksDB on the master volume; DatanodeStore was only defined for data volumes. I wanted to reuse the code for creating a master volume, and that ended up adding generics everywhere, since a base interface had to be created.
long containerId = container.getContainerData().getContainerID();
State containerState = container.getContainerData().getState();
if (!overwriteMissingContainers && missingContainerSet.contains(containerId)) {
  throw new StorageContainerException(String.format("Container with container Id %d is missing in the DN " +
KeyValueHandler has similar logic - why is it needed in both places? It feels like ContainerSet (i.e. here) may not need to check this, if it is checked in the handler that creates the containers on the write path?
KeyValueHandler catches the exception and wraps it as an error response.
Yeah, you are right, this is redundant. Let me move this logic into a single function so that we don't have duplicated code. But we should still perform the missing-container check on add container.
There is a function which doesn't overwrite containers. We should be performing this check on every container add. In KeyValueHandler we have to do it there because creating KeyValueContainerData creates the required directory structure on the data volume, so we have to provide the check at that point. But whenever we are adding to the data structure from other paths, we have to ensure the container is not missing by default, unless we explicitly intend to overwrite the missing container.
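A rough sketch of the central check being argued for here, assuming in-memory structures similar to ContainerSet's containerMap and missingContainerSet; the method shape and exception messages are illustrative, not the actual class.

    // Hypothetical sketch: refuse to recreate a container known to be missing,
    // unless the caller (e.g. a replication job) explicitly opts in.
    import java.util.Map;
    import java.util.Set;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.ConcurrentSkipListSet;

    final class ContainerSetSketch {
      private final Map<Long, Object> containerMap = new ConcurrentHashMap<>();
      private final Set<Long> missingContainerSet = new ConcurrentSkipListSet<>();

      boolean addContainer(long containerId, Object container,
                           boolean overwriteMissingContainers) throws Exception {
        if (!overwriteMissingContainers && missingContainerSet.contains(containerId)) {
          throw new Exception("Container " + containerId + " is missing on this datanode");
        }
        if (containerMap.putIfAbsent(containerId, container) != null) {
          throw new Exception("Container " + containerId + " already exists");
        }
        missingContainerSet.remove(containerId); // the replica has been re-materialised
        return true;
      }
    }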
 * false
 */
public boolean addContainer(Container<?> container) throws
public boolean addContainer(Container<?> container, boolean overwriteMissingContainers) throws
I would suggest adding a new method here - addContainerIgnoringMissing (or something similar), rather than changing the existing interface. That way, it avoids changes in existing calling code, and only code that explicitly needs to pass the boolean can be changed to call the new method. I suspect this change would reduce the number of files impacted by this PR.
There is already a method for that.
Lines 106 to 108 in af0f757:
public boolean addContainer(Container<?> container) throws StorageContainerException {
  return addContainer(container, false);
}
 * DB handle abstract class.
 */
public abstract class DBHandle implements Closeable {
public abstract class DBHandle<STORE extends AbstractStore> implements Closeable {
Can we avoid making changes to this class and get the desired functionality another way (extending another base class, something else)? Changing this has a knock-on effect on many other classes, and this feels like refactoring that perhaps isn't needed.
Yeah, I have limited the changes to only the AbstractStore changes. I don't see a point in reducing the changes further, and I don't think there is going to be a lot of conflict while backporting the AbstractStore changes.
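For context, a minimal sketch of the generic handle shape being discussed; AbstractStore here is a stand-in interface and the accessor names are assumptions, not necessarily the exact code in the PR.

    import java.io.Closeable;

    interface AbstractStore extends Closeable {
      // common lifecycle operations shared by datanode metadata stores (placeholder)
    }

    abstract class DBHandleSketch<STORE extends AbstractStore> implements Closeable {
      private final STORE store;
      private final String containerDBPath;

      DBHandleSketch(STORE store, String containerDBPath) {
        this.store = store;
        this.containerDBPath = containerDBPath;
      }

      public STORE getStore() {
        return store;
      }

      public String getContainerDBPath() {
        return containerDBPath;
      }

      // close() left to subclasses, e.g. a reference-counted implementation
    }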
Then what is the purpose of the missing container set? We shouldn't have two data structures solving the same problem. We should remove the missing container set and rely only on the witnessed container list. If we have two of the three, we can compute the third. With the approach proposed above, the logic would be spread across all three sets, which would complicate things further in future. We shouldn't be adding redundant data structures if they are not required.
I had to do the reference-counted DB implementation just because of the test cases; the PR would have become much bigger otherwise. A simpler approach was to use reference counting, so that we can have only one instance of a writable DB.
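An illustrative sketch of the reference-counting idea: a single writable store instance shared by many callers and closed only when the last reference is released. The class and method names are hypothetical, not the actual WitnessedContainerMetadataStore API.

    import java.io.Closeable;
    import java.io.IOException;
    import java.util.concurrent.atomic.AtomicInteger;

    final class ReferenceCountedStore<STORE extends Closeable> implements Closeable {
      private final STORE store;
      private final AtomicInteger refCount = new AtomicInteger(1);

      ReferenceCountedStore(STORE store) {
        this.store = store;
      }

      ReferenceCountedStore<STORE> retain() {
        refCount.incrementAndGet();
        return this;
      }

      STORE get() {
        return store;
      }

      @Override
      public void close() throws IOException {
        // Only the last release actually closes the underlying RocksDB instance.
        if (refCount.decrementAndGet() == 0) {
          store.close();
        }
      }
    }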
MissingContainerSet in the ContainerSet was added as part of HDDS-935, which tried to solve the same problem we are tackling here (avoid recreating an already-created container on a datanode after a disk removal followed by a datanode restart). Unfortunately that solution only accounted for Ratis containers, because that was the only container type back then. It works by adding to the missing set all the containers that are present in the Ratis snapshot's containerBCSIdMap but not found on disk after scanning all the data volumes on the node.
kerneltime left a comment:
This looks better now. I want to go over the DB code refactoring and test code.
 * false
 */
public boolean removeContainer(long containerId) {
private boolean removeContainer(long containerId, boolean markMissing, boolean removeFromDB)
removeFromDB - do we expect it to be true outside of testing?
false, false: this happens on finding a duplicate container on startup, where the container info in memory gets swapped by removing and re-adding the updated container info.
true, false: this happens on volume failure, where we have to mark the container as missing. It is a purely in-memory operation: we remove the container from containerMap and add the containerId to the missingContainerSet.
false, true: this is the regular delete flow, which happens when SCM sends a delete container command. It only happens after the entire container data has been deleted from disk.
true, true: this is an invalid combination and doesn't happen anywhere.
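A small sketch of those three valid flag combinations, assuming in-memory structures and a map standing in for the RocksDB table; field names approximate the real ContainerSet internals rather than reproducing them.

    import java.util.Map;
    import java.util.Set;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.ConcurrentSkipListSet;

    final class RemoveContainerSketch {
      private final Map<Long, Object> containerMap = new ConcurrentHashMap<>();
      private final Set<Long> missingContainerSet = new ConcurrentSkipListSet<>();
      private final Map<Long, String> containerIdsTable = new ConcurrentHashMap<>(); // stand-in for the RocksDB table

      boolean removeContainer(long containerId, boolean markMissing, boolean removeFromDB) {
        Object removed = containerMap.remove(containerId);
        if (markMissing) {
          // volume failure: keep remembering the container so it is never silently recreated
          missingContainerSet.add(containerId);
        }
        if (removeFromDB) {
          // SCM-driven delete: the container data is already gone from disk, forget it entirely
          containerIdsTable.remove(containerId);
        }
        return removed != null;
      }

      // duplicate-container swap on startup:     removeContainer(id, false, false)
      // volume failure, mark replica missing:    removeContainer(id, true,  false)
      // SCM delete after data removed from disk: removeContainer(id, false, true)
    }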
I was under the impression that once a container is created, it will remain in the DB forever.
We don't have any use of that container information. Why keep it if it is not required?
}
}, 1000, 30000);

// Read Chunk
Should we add a second write chunk as well?
done
I have added a loop that tests creating an open container. Then, after restarting, I create a recovering container and perform the same set of operations, expecting them to succeed.
Recovering containers accepting writes should be EC-only? Do we want a write chunk for a non-EC container to fail? In this test it is a single-node cluster:
cluster = MiniOzoneCluster.newBuilder(conf)
    .setNumDatanodes(1)
    .build();
cluster.waitForClusterToBeReady();
I am testing both creating an open container and a recovering container. Regardless of whether it is an EC key or a Ratis key, we land in the same dispatcher flow, so we don't have to test this separately.
Lines 253 to 256 in 827bc86:
// Create Container
request = ContainerTestHelper.getCreateContainerRequest(testContainerID, pipeline);
response = client.sendCommand(request);
assertEquals(ContainerProtos.Result.CONTAINER_MISSING, response.getResult());
Lines 292 to 296 in 827bc86:
// Create Recovering Container
request = ContainerTestHelper.getCreateContainerRequest(testContainerID, pipeline,
    ContainerProtos.ContainerDataProto.State.RECOVERING);
response = client.sendCommand(request);
assertEquals(ContainerProtos.Result.SUCCESS, response.getResult());
In a follow-up JIRA, please accept write chunks only for EC pipelines when the container is recovering.
I have created a follow up jira for performing this validation:
https://issues.apache.org/jira/browse/HDDS-11761
The change seems reasonable now. Please ensure we are testing the APIs we care about for a container after the container goes missing. I think we can generate more test cases via fault injection. We can discuss whether that needs to be part of this PR or a separate one (to keep this PR small).
Should we keep the scope of this PR limited? We can look into adding a follow-up JIRA for the fault injection tests. The PR is already big as it is.
Thanks for the review on the patch @kerneltime @sodonnel @adoroszlai |
private static final ConcurrentMap<String, WitnessedContainerMetadataStore> INSTANCES =
    new ConcurrentHashMap<>();
@swamirishi, it seems that the size of this map is at most 1. Is that the case?
Ping @swamirishi
yeah you are right
If you look at the initial set of commits, this was originally a singleton. But there were too many existing test case failures, because all the tests run in the same JVM, so I had to change it to a concurrent map. The PR size had already increased a lot, so I took a shortcut to get one instance. We can set the initial size of the hash table to one; I guess that should bring some optimisation.
If it has at most one entry, we should lazily initialize it or use AtomicReference.
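A minimal sketch of that lazy single-instance alternative using AtomicReference instead of a ConcurrentMap keyed by path; the store type and factory are placeholders, not the real WitnessedContainerMetadataStore code.

    import java.util.concurrent.atomic.AtomicReference;
    import java.util.function.Supplier;

    final class SingleStoreHolder<S> {
      private final AtomicReference<S> instance = new AtomicReference<>();
      private final Supplier<S> factory;

      SingleStoreHolder(Supplier<S> factory) {
        this.factory = factory;
      }

      S getOrCreate() {
        S existing = instance.get();
        if (existing != null) {
          return existing;
        }
        S created = factory.get();
        // Another thread may win the race; a real implementation would close the losing instance.
        return instance.compareAndSet(null, created) ? created : instance.get();
      }
    }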
}
try {
  if (containerIdsTable != null) {
    containerIdsTable.put(containerId, containerState.toString());
@swamirishi, why not put the bcsId there? Then we can get it back (instead of setting it to zero) when rebuilding the container set.
We don't need the bcsId, since it is already stored in the ContainerTable, and we don't use the value anywhere today. Eventually we want to store the replica index for the EC case and use it for correctness. See https://issues.apache.org/jira/browse/HDDS-12722
Currently, it puts the containerState in this table. However, the entry won't be updated when the containerState changes, so the stored value could be confusing.
 * false
 */
public boolean addContainer(Container<?> container) throws
private boolean addContainer(Container<?> container, boolean overwrite) throws
@swamirishi, it won't actually overwrite when overwrite is true; the flag only controls ensureContainerNotMissing.
It does a putToMap, which overwrites the existing data in the map. Skipping ensureContainerNotMissing enables this, since ensureContainerNotMissing would otherwise throw an exception.
What is putToMap? Do you mean containerMap.putIfAbsent(..)?
When putIfAbsent(..) returns non-null, it will
throw new StorageContainerException("Container already exists ...");
Yeah, the container will be either in the missingContainerSet or in the containerMap; it cannot be in both. On overwrite we are saying we will skip the missingContainerSet check and forcefully overwrite the missing container replica.
What changes were proposed in this pull request?
We need to persist the list of containers created on a datanode to ensure that volume failures don't lead to containers being recreated on a write chunk, which would leave partial, inconsistent data on datanodes.
For Ratis containers this is taken care of by persisting the containerInfo-to-bcsId map in the Ratis snapshot file, but no such list exists for EC containers.
Details to follow in a doc.
https://docs.google.com/document/d/1k4Tc7P-Jszdwn4dooRQZiKctlcGWNlnjZO1fh_TBneM/edit?usp=sharing
What is the link to the Apache JIRA?
https://issues.apache.org/jira/browse/HDDS-11650
How was this patch tested?
Adding more unit tests & integration tests