Embedding-based near-duplicate detection, using FAISS or Annoy for the nearest-neighbor search, is important. A configurable threshold on the “similarity score” controls how aggressively redundant rows are removed.
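
As a rough illustration of the thresholding idea, here is a minimal sketch that uses only the Python standard library. It swaps in a brute-force cosine-similarity scan in place of a FAISS or Annoy index, so it is only suitable for small inputs. The function name `drop_near_duplicates`, the default threshold of 0.95, and the "first row seen wins" policy are assumptions made for the example, not something the note prescribes.

```python
import math
from typing import List, Sequence


def _normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        # A zero vector has no direction; leave it as zeros.
        return [0.0 for _ in vec]
    return [x / norm for x in vec]


def drop_near_duplicates(
    embeddings: Sequence[Sequence[float]],
    threshold: float = 0.95,  # hypothetical default; tune per dataset
) -> List[int]:
    """Return the indices of rows to keep.

    A row is dropped when its cosine similarity to any previously kept
    row is >= threshold (greedy, first-seen-wins).
    """
    kept_indices: List[int] = []
    kept_vectors: List[List[float]] = []
    for i, vec in enumerate(embeddings):
        unit = _normalize(vec)
        # Brute-force scan; an ANN index would replace this loop at scale.
        is_duplicate = any(
            sum(a * b for a, b in zip(unit, kept)) >= threshold
            for kept in kept_vectors
        )
        if not is_duplicate:
            kept_indices.append(i)
            kept_vectors.append(unit)
    return kept_indices


rows = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0]]
print(drop_near_duplicates(rows, threshold=0.95))    # [0, 2]
print(drop_near_duplicates(rows, threshold=0.9999))  # [0, 1, 2]
```

Raising the threshold keeps more rows and lowering it removes more, which is why it helps to expose it as a parameter. For larger datasets, the inner scan is the part a FAISS or Annoy nearest-neighbor query would presumably replace, with the same threshold applied to the returned scores.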