Skip to content

Conversation

@swamirishi
Copy link
Contributor

@swamirishi swamirishi commented Nov 7, 2024

What changes were proposed in this pull request?

We need to persist the list of containers created on a datanode to ensure that volume failures doesn't lead to recreation of containers on write chunk leading to partial inconsistent data in datanodes.

For ratis containers this is taken care of by persisting the containerInfo to bcsiD map in ratis snapshot file, but this list won't be present for EC containers.
Details to follow in a doc.

https://docs.google.com/document/d/1k4Tc7P-Jszdwn4dooRQZiKctlcGWNlnjZO1fh_TBneM/edit?usp=sharing

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-11650

How was this patch tested?

Adding more unit tests & integration tests

…anode

Change-Id: I94fd413a2d778ac5d86a7da5126cf3d1cac8113a
Change-Id: I22654091edbd3a11c585aa95ca2b554eba0f9d95
Change-Id: Ibad0d380486c03dda6c61f87b329c98f90a01fd8

# Conflicts:
#	hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/ozoneimpl/OzoneContainer.java
Change-Id: Icaa5ae0b29ec0ffccf5914bec0fd6ed6ae117219
Change-Id: I995fc25b93f16aa859eeb8f0418aa774e3719330
Change-Id: Ibeadc9330185f699e4cf1d9c1c8631d1af52683e
Change-Id: I5ac6a685a49e79be5ea43717294dd649383433f2
Change-Id: I18f48f9d97b0cc16a3c97a3137ee01ebda4fcbec
Change-Id: I16cec52ea1c2853c80ee9a6e3279a23408d05651
Change-Id: Icf779e2bff8ace1721b529e3c89edbe9effa9989
Change-Id: I485646e86105a8a1bab6b638262669fc5f92d94d
Change-Id: I03ab7dd188ae39248ca889f40b9490eb2870579f
Change-Id: I82647ef09edc6fd9432652d911bf2ff4bccf25a5
Change-Id: I73091fc280dea5ad447b9df8bb0a1877d8f1ff35
Change-Id: Ife63a4ab2a69869cce9d1c407bfdeba2540d2482
Change-Id: Ic9fe75b9efe885080e3ad440f132eb0100c41a17
Change-Id: I5a8e092d8fb751a2ca69256740df59edd59b9b95
…containers

Change-Id: Ic67537ed852920d8945430665e22eeddc7350d6e
@swamirishi swamirishi marked this pull request as ready for review November 9, 2024 15:06

long containerId = container.getContainerData().getContainerID();
State containerState = container.getContainerData().getState();
if (!overwriteMissingContainers && missingContainerSet.contains(containerId)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand why we cannot just check if the container is in the ContainerIDTable? Why do we need to use this missingContainerSet instead?

I traced the creation of the missing set to a ratis snapshot (I know nothing about it) and a diff against the containers found in memory. What about EC containers? Do they all become part of this missing set?

Prior to this change, the missingSet doesn't seem to be used for anything inside this class, but not it is used for these checks. What was missingSet used for before?

I would have thought the ContainerID table starts initially empty after this change is committed. On first start, it gets populated with all containers in the ContainerSet (scanned from disk). After initial startup, or later startups, it can also have all containers added, but they should already all be there. Then we can just track previously created in the containerID table?

Copy link
Contributor Author

@swamirishi swamirishi Nov 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The container set is first initialised with all the containers present on disk by performing an ls on the data volumes. Then we iterate through the containerIDs table on the rocksdb, and any container present in this table and not present on disk will be added to the missing container set. This would include both ratis and ec containers.
Now on container creation we just have to check this in the in-memory set than going to the rocksdb everytime which would incur an io cost, however small it is not need in my opinion when we have everything in memory.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah ok, I mis-read the Java doc and thought it was a ratis snapshot. However my question still stands - why not simply use the ContainerIDTable to check for previous container creation, than augmenting this missing list?

Copy link
Contributor Author

@swamirishi swamirishi Nov 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am just reusing the data structure which was already there. This missing container set was just used for ratis container initially, I am now making it track even the EC containers. I didn't want to change a lot of code and reuse most of the stuff which is already there.
Moreover an in memory get is faster than fetching from rocksdb

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Prior to this change, the missingContainerSet variable in the ContainerSet was not used for anything within that class. Now, we are potentially adding to it from the containerList - what impact does that have more widely? What was the missingContainerSet used for before?

We also have to open the new containerList table - it would be more clear if we just loaded that into memory and use the new "seenContainerList" for previously existing checks.

However backing up a bit - how can containers get created?

One way, is via the write path - that is the problem we are trying to solve.

Another is via EC Reconstruction - there is already logic there to deal with a duplicate container I think, however it may not cover volume failure. However EC reconstruction effectively uses the same write path as normal writes, so it its basically the same as the write path.

Then there is replication, which is a very different path into the system.

Maybe this problem is better not solved in ContainerSet, but at the higher level in the dispatcher where the writes are being handled?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel we should track all the containers getting added to the system. This flow also accomodates for replication both RATIS & EC. Barring the exception we give replication job to overwrite missing containers we should prevent overwriting any container coming through any flow. If we have this check in a central place, any future code changes would be foolproof. At least things would break and not silently end up writing something which is wrong.

@sodonnel
Copy link
Contributor

For me, this change is too big, and its going to cause conflicts and be difficult to backport. Its also hard to review, as the number of changed files is massive, although many are test files and simple one line changes.

It feels like this change could be largely contained with the ContainerSet class, without changing a lot of generic classes and causing the ripple effect into the wider system. All it has to do, is create a new overloaded constructor and not change the existing one. Make the existing one call the new one with the correct runtime requirements. Then we just need to check if a container has previously existing or not, and add new containers to the "previously exists" list. This shouldn't require changing 90 something files in a single PR.

@swamirishi
Copy link
Contributor Author

swamirishi commented Nov 15, 2024

For me, this change is too big, and its going to cause conflicts and be difficult to backport. Its also hard to review, as the number of changed files is massive, although many are test files and simple one line changes.

It feels like this change could be largely contained with the ContainerSet class, without changing a lot of generic classes and causing the ripple effect into the wider system. All it has to do, is create a new overloaded constructor and not change the existing one. Make the existing one call the new one with the correct runtime requirements. Then we just need to check if a container has previously existing or not, and add new containers to the "previously exists" list. This shouldn't require changing 90 something files in a single PR.

All of the tests use these containerSet. If we pass a null value it would make the container set a read only data structure(I have put guardrails in place so that we don't end up inadvertently writing containers without persisting the data). Since we have made the table a must have arg we are having to pass a mocked table in all the existing tests.

Another big change. We didn't have a pre-existing rocksdb on the master volume. DatanodeStore was only defined for data volumes, I wanted to reuse the code for creating a master volume that ended up adding generic everywhere, having created a base interface everywhere.

long containerId = container.getContainerData().getContainerID();
State containerState = container.getContainerData().getState();
if (!overwriteMissingContainers && missingContainerSet.contains(containerId)) {
throw new StorageContainerException(String.format("Container with container Id %d is missing in the DN " +
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

KeyValueHandler has this similar logic - why is it needed in both places? Feels like containerSet (ie here) may not need to check this, if its checked in the handler that creates the containers on the write path?

Copy link
Contributor Author

@swamirishi swamirishi Nov 15, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

KeyValueHandler is catching the exception and wrapping the exception as an error response.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah you are right. This is redundant. Actually let me shift this logic to a single function so that we don't have redundant code. But we should perform the missing container check on add container.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a function which doesn't overwrite containers. We should be performing this check on every container add. In KeyValueHandler we are having to do it there because the creation of KeyValueContainerData creates the required directory structure in the data volume. So we are having to provide this check. But whenever we are adding to data structure from other paths we have to ensure the container is not missing by default unless we explicity intend to overwrite the missing container.

Change-Id: Icf8b45e0c2de6d353f3f880c441de7d7a6138009
* false
*/
public boolean addContainer(Container<?> container) throws
public boolean addContainer(Container<?> container, boolean overwriteMissingContainers) throws
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would suggest adding a new method here - addContainerIgnoringMissing (or something similar), rather than changing the existing interface. That way, it avoids changes in existing calling code, and only code that explicitly needs to pass the boolean can be changed to call the new method. I suspect this change would reduce the number of files impacted by this PR.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is already a method for that.

public boolean addContainer(Container<?> container) throws StorageContainerException {
return addContainer(container, false);
}

* DB handle abstract class.
*/
public abstract class DBHandle implements Closeable {
public abstract class DBHandle<STORE extends AbstractStore> implements Closeable {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we avoid making changes to this class and get the desired functionality another way (extending another base class, something else)? Changing this has a knock on effect on many other classes and this feels like refactoring that perhaps isn't needed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah I have limited the changes to only the AbstractStore changes. I don't see a point in bringing down the changes further & I don't think there is going to be a lot of conflict while backporting the AbstractStore changes

@swamirishi
Copy link
Contributor Author

swamirishi commented Nov 19, 2024

Why is the logic embedded into hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/impl/ContainerSet.java A cleaner approach would be expand via aggregation the container set logic.

  1. Create a new Java class (WitnessedContainers.java) that reads and writes to the DB and keeps a concurrent copy in memory to handle concurrency (the current code does this)
  2. Do not bother with missing Container logic at the level of WitnessedContainers.java. When the replication manager tries to create a container, it checks if it is in the witnessed list and then adds it. So we only need two methods for addition AddContainer and AddContainerIfNotPresent.

Then what is the purpose of missing container set? We shouldn't have 2 data structures solving the same stuff. We should remove the missing container set and only rely on the witnessed container list.
IMO we only need 2 out of the 3 data structures:
WitnessedContainerSet:
Containers with data on disk:
Missing Container List

If we have 2 out of the 3 we can compute the 3rd one. With the above approach proposed we would have the logic present spread across all the 3 set which would complicate things further in future. We shouldn't be adding redundant data structures if not required.
I took the approach of reusing missing container set since it's usage is spread across multiple places in the code. I didn't want to change the logic there.

Do we need reference counted code to be introduced? Can we get away with just embedding it within the ContainerSet?

I had to do the reference counted db implementation just because of the test cases. The PR would have gotten much bigger to fix the test case. A simpler approach was to use the reference counted, we can have only one instance of a writable db.

I think this code change can go through more than one round of simplification for what it is trying to solve.

Change-Id: I24693b81d1409120537f37289aec886bad845588
Change-Id: Ie1d1f7163a04990f0447517b0e4e2f89fc8bc48b
@swamirishi swamirishi marked this pull request as draft November 19, 2024 23:41
@swamirishi
Copy link
Contributor Author

swamirishi commented Nov 20, 2024

MissingContainerSet in the containerSet was added as part of solving HDDS-935, which is trying to solve the same problem (Avoid creating an already created container on a datanode in case of disk removal followed by datanode restart) we are trying to tackle here. But unfortunately the solution only accounted for ratis container which was because it was the only container type back then. The solution looks into adding all the containers that are present in the ratis snapshot's containerBCSIdMap but not found on disk after scanning through all the data volumes in the node.
HDDS-8448 caused a regression resulting in removing the containers on a failed volume instead of marking it as missing.
This patch is trying to do something along the same lines by extending the working for EC containers and is trying to fix the issues in the implementation for handling volume failures in the system. I believe this particular implementation was working fine for ratis container prior to HDDS-8448 which caused a regression. Creating a duplicate data structure performing the same functionalities is just redundant and could lead to future bugs because we would have to update 2 structures in place of one. I also don't see any value in implementing something all over again which would also involve thorough testing of that flow. IMO we should be reusing the existing frameworks in the system to achieve something rather than add more flows complicating the already existing complicated system further.
If you guys still strongly feel we should move this logic out of the ContainerSet class then my opinion would be to get rid of the missingContainerSet entirely from all of our code flows and have only one of the 2 but not both. @sodonnel @kerneltime @adoroszlai what do you think we should do here? Adding @ashishkumar50 since he was involved in HDDS-8448
I have addressed the other review comments on the patch (Getting rid of the Proto2EnumCodec, Removing ReferenceCountedDB for the WitnessedContainerMetadataStore)

…oid column family option close

Change-Id: I71ae4da25b0340a12eb6bbaab7338ceb0823ad8b
Copy link
Contributor

@kerneltime kerneltime left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks better now. I want to go over the DB code refactoring and test code.

Change-Id: Ie9c55a9111e215b4ae0fc58d80079ee71a5cdffb
* false
*/
public boolean removeContainer(long containerId) {
private boolean removeContainer(long containerId, boolean markMissing, boolean removeFromDB)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removeFromDB do we expect it to be true outside of testing?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

false, false : This happens on finding a duplicate container on startup where the container info in memory gets swapped by removing and adding the updated container info.
true, false : This happens on volume failure and we have to mark it as missing. This is a plain inmemory operation where in we remove the container from containerMap and add the containerId to the missingContainerSet.
false, true : This is a regular delete flow which happens when SCM sends a delete container command. This will only happen after deleting the entire container data from the disk.
true, true : This is an invalid case, this doesn't happen anywhere.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was under the impression that once a container is created, it will remain the DB forever.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't have any use of that container information. Why keep it if it is not required?

@swamirishi swamirishi marked this pull request as ready for review November 20, 2024 03:11
Change-Id: I5ba57cc8de58aed79ed41c9f1198cf9ee7fd3c27
Change-Id: I24bc0bce41a0a8cccec5b0e872359502835f1231
}
}, 1000, 30000);

// Read Chunk
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we add a second write chunk as well?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have added a loop for testing creating open container. Then post restarting I am creating a recovering container and performing the same set operations expecting it to succeed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Recovering container accepting writes should be for EC only? Do we want write chunk coming for a non EC container to fail? In this test this is a single node cluster

      cluster = MiniOzoneCluster.newBuilder(conf)
          .setNumDatanodes(1)
          .build();
      cluster.waitForClusterToBeReady();

Copy link
Contributor Author

@swamirishi swamirishi Nov 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am testing both creating an open container and recovering container. Agnostic of whether it is an EC key or a ratis key we land in the same dispatcher flow. So we don't have to test this separately

// Create Container
request = ContainerTestHelper.getCreateContainerRequest(testContainerID, pipeline);
response = client.sendCommand(request);
assertEquals(ContainerProtos.Result.CONTAINER_MISSING, response.getResult());

// Create Recovering Container
request = ContainerTestHelper.getCreateContainerRequest(testContainerID, pipeline,
ContainerProtos.ContainerDataProto.State.RECOVERING);
response = client.sendCommand(request);
assertEquals(ContainerProtos.Result.SUCCESS, response.getResult());

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in a follow up jira please accept write chunks only for EC pipelines if container is recovering.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have created a follow up jira for performing this validation:
https://issues.apache.org/jira/browse/HDDS-11761

@kerneltime
Copy link
Contributor

The change seems reasonable now. Please ensure we are testing the APIs we care about for a container after the container goes missing. I think we can generate more test cases via fault injection. We can discuss if it needs to be part of this PR or a separate PR (to keep this PR small).

Change-Id: I03aac18fd7547ea39519721798015adc3c09033c
Change-Id: I1967faced513f6a6439b66034661b2342ab0475b
Change-Id: Icb96283b7496a3c02b0a16d3b7410fd4e649c30c
Change-Id: Id25cc023de401c9033078fd8b37b7b28f89c04cc
Change-Id: I52a3676fa80fd87c218dc749acd4cf5e99a78308
@swamirishi
Copy link
Contributor Author

swamirishi commented Nov 20, 2024

The change seems reasonable now. Please ensure we are testing the APIs we care about for a container after the container goes missing. I think we can generate more test cases via fault injection. We can discuss if it needs to be part of this PR or a separate PR (to keep this PR small).

Should we keep the scope of this PR limited? We can look into adding a follow up jira for performing fault injection tests. The PR is already big as it is.

@swamirishi
Copy link
Contributor Author

Thanks for the review on the patch @kerneltime @sodonnel @adoroszlai

@swamirishi swamirishi merged commit fc6a2ea into apache:master Nov 20, 2024
40 checks passed
Comment on lines +40 to +41
private static final ConcurrentMap<String, WitnessedContainerMetadataStore> INSTANCES =
new ConcurrentHashMap<>();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@swamirishi , It seems that the size of this map is at most 1. Is it the case?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah you are right

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you see the initial set of commits this was originally a singleton. But there were too many existing test case failures where all the tests run in the same jvm. So I had to change it to a concurrent map. The PR size increased a lot, so just took a short cut to get one instance. We can make the initial size of the the hash table to be just one. I guess that should bring in some optimization.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it has at most one entry, we should lazily initialize it or use AtomicReference.

}
try {
if (containerIdsTable != null) {
containerIdsTable.put(containerId, containerState.toString());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@swamirishi , why not putting the bcsid ? Then, we can get it back (instead of setting it to zero) when rebuilding the container set.

Copy link
Contributor Author

@swamirishi swamirishi Apr 16, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't need the bcsid since bcsid is already stored in the ContainerTable. We don't use the value anywhere today. But eventually we want to store the replica index for EC case. We intend to use this for correctness. Look at https://issues.apache.org/jira/browse/HDDS-12722

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently, it put the containerState in this table. However, it won't be updated if the containerState is changed. The containerState value could be confusing.

* false
*/
public boolean addContainer(Container<?> container) throws
private boolean addContainer(Container<?> container, boolean overwrite) throws
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@swamirishi , it won't overwrite when overwrite is true. It will just ensureContainerNotMissing.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would do a putToMap which would overwrite the existing data in the Map. Skipping ensureContainerNotMissing will do this since ensureContainerNotMissing will throw an exception

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is putToMap? Do you mean containerMap.putIfAbsent(..)?

When putIfAbsent(..) returns non-null, it will

throw new StorageContainerException("Container already exists ...");

Copy link
Contributor Author

@swamirishi swamirishi Apr 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah the container would be either in the missingContainerSet or containerMap. It cannot be in both. On overwrite we are saying we will skip missingContainerSet check and forcefully overwrite the missing container replica.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants