GroupDocs.Comparison for Node.js - Sample demonstrating high‑performance batch comparison of Word (DOCX) documents with sequential, parallel and progress‑tracking support.

groupdocs-comparison/batch-document-comparison-performance

Batch Word Document Comparison with Performance Optimization

Product Page Docs Blog Free Support Temporary License

Overview

This repository provides production‑ready code examples for batch Word document comparison using GroupDocs.Comparison for Node.js via Java. The examples are built with Node.js 20 LTS and demonstrate optimized batch processing, parallel execution, and performance monitoring. They are aimed at developers who need to compare large sets of Word files efficiently.

Technology Stack

  • Platform: Node.js 20 LTS
  • Product: GroupDocs.Comparison
  • Language: JavaScript (Node.js)
  • Framework: None (plain Node.js runtime)

Problem Statement

Developers working with large collections of Word documents often need to detect differences across versions, regulatory revisions, or content updates. Manual side‑by‑side review is error‑prone and does not scale. Line‑by‑line diff tools are built for plain text and cannot interpret the structure of DOCX files, so formatting and layout changes go undetected. Moreover, processing thousands of document pairs sequentially can take hours of compute time.

GroupDocs.Comparison simplifies this challenge by offering a high‑level API that understands Word document internals, automatically detects textual, structural, and formatting changes, and produces a visual comparison document. The library abstracts away low‑level XML handling, delivering reliable, repeatable results while exposing performance‑tuning options.

Solution Overview

GroupDocs.Comparison addresses these challenges with a Java‑based comparison engine accessed through a thin Node.js wrapper, which is why a Java runtime is listed under Prerequisites. The library provides a clear, promise‑based API that can be orchestrated in sequential or parallel workflows. Key technical features include:

  • High‑Fidelity Comparison: Detects text, style, images, tables, and footnotes.
  • Batch Processing API: Helper functions to locate matching file pairs and drive bulk operations.
  • Parallel Execution: Configurable concurrency to fully utilize multi‑core CPUs.
  • Progress Callbacks: Real‑time feedback for long‑running jobs.
  • Performance Options: Adjustable sensitivity, optional summary pages, and memory‑friendly modes.

Prerequisites

Before running these examples, ensure you have:

  • Node.js – 20 LTS or later (node --version)
  • Java Runtime – JRE/JDK 8+ (recommended 17 LTS) (java -version)
  • JAVA_HOME – Environment variable pointing to your JDK installation
  • GroupDocs.Comparison npm package – Installed via npm install
  • Temporary License – Obtain from the temporary‑license badge link if you do not have a permanent key

Getting Started

Installation

npm install

Configuration

  1. Set JAVA_HOME to point at your JDK directory.
  2. If you have a permanent license, replace the placeholder in src/utils/licenseHelper.js with your license string.
  3. Optionally adjust compareOptions in the Optimized Batch Comparison example to control sensitivity or enable summary pages.
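The configuration steps above can be checked before a batch run starts. The following is a minimal sketch of such a preflight check, assuming only the prerequisites listed in this README; checkEnvironment is a hypothetical helper and is not part of the repository.

```javascript
// Hypothetical preflight helper; not part of the repository.
const fs = require('fs');

function checkEnvironment(env = process.env, nodeVersion = process.versions.node) {
  const problems = [];

  // Node.js 20 or later is listed as a prerequisite.
  const major = Number.parseInt(nodeVersion.split('.')[0], 10);
  if (Number.isNaN(major) || major < 20) {
    problems.push(`Node.js 20+ required, found ${nodeVersion}`);
  }

  // JAVA_HOME must be set and point at an existing directory.
  const javaHome = env.JAVA_HOME;
  if (!javaHome) {
    problems.push('JAVA_HOME is not set');
  } else if (!fs.existsSync(javaHome)) {
    problems.push(`JAVA_HOME does not exist: ${javaHome}`);
  }

  return problems;
}

// Report any problems found in the current process environment.
const problems = checkEnvironment();
console.log(problems.length ? problems.join('\n') : 'Environment looks OK');
```

Returning a list of problems instead of throwing lets the caller print every issue at once rather than one per run.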

Repository Structure

groupdocs-comparison-batch-word-nodejs/
│
├── package.json
├── README.md
├── src/
│   ├── batchComparison.js            # Core batch comparison functions
│   ├── examples/
│   │   ├── basicBatchComparison.js      # Sequential processing demo
│   │   ├── parallelBatchComparison.js   # Parallel processing demo
│   │   ├── optimizedBatchComparison.js # Performance‑tuned demo
│   │   ├── batchWithProgress.js        # Progress‑tracking demo
│   │   └── performanceBenchmark.js     # Benchmarking demo
│   └── utils/
│       ├── fileHelper.js                # File utilities
│       ├── licenseHelper.js              # License handling
│       ├── performanceMonitor.js         # Monitoring helpers
│       └── constants.js                  # Shared constants
├── sample-files/                            # Input documents
│   ├── source/
│   └── target/
└── output/                                 # Generated comparison results

File Descriptions

  • package.json – Project metadata and npm dependencies.
  • src/batchComparison.js – Implements compareWordPair, batch sequential/parallel flows, pair discovery, and report generation.
  • src/examples/basicBatchComparison.js – Demonstrates a simple sequential batch run.
  • src/examples/parallelBatchComparison.js – Shows how to run comparisons in parallel with configurable concurrency.
  • src/examples/optimizedBatchComparison.js – Applies CompareOptions for speed/accuracy trade‑offs.
  • src/examples/batchWithProgress.js – Provides real‑time progress feedback via a console progress bar.
  • src/examples/performanceBenchmark.js – Benchmarks sequential vs parallel strategies and prints the best‑performing configuration.
  • src/utils/ – Helper modules for file I/O, license loading, performance timing, and constant definitions.

Code Implementation

Implementation: Compares a single pair of Word documents and generates a comparison result document.

This function validates the existence of the source and target files, ensures the output directory exists, creates a Comparer instance, adds the target file, runs the comparison (optionally with custom options), and returns a metadata object containing the operation duration, file size, and any error information.

  const startTime = Date.now();
  
  try {
    // Validate files exist
    if (!fs.existsSync(sourcePath)) {
      throw new Error(`Source file not found: ${sourcePath}`);
    }
    if (!fs.existsSync(targetPath)) {
      throw new Error(`Target file not found: ${targetPath}`);
    }

    // Ensure output directory exists
    const outputDir = path.dirname(outputPath);
    if (!fs.existsSync(outputDir)) {
      fs.mkdirSync(outputDir, { recursive: true });
    }

    // Initialize comparer
    const comparer = new groupdocs.Comparer(sourcePath);
    comparer.add(targetPath);

    // Perform comparison
    const compareOptions = options.compareOptions || null;
    if (compareOptions) {
      await comparer.compare(outputPath, compareOptions);
    } else {
      await comparer.compare(outputPath);
    }

    const duration = Date.now() - startTime;
    const fileSize = fs.existsSync(outputPath) ? fs.statSync(outputPath).size : 0;

    return {
      success: true,
      sourcePath,
      targetPath,
      outputPath,
      duration,
      fileSize,
      error: null
    };
  } catch (error) {
    const duration = Date.now() - startTime;
    return {
      success: false,
      sourcePath,
      targetPath,
      outputPath,
      duration,
      fileSize: 0,
      error: error.message
    };
  }

Technical Details

  • Comparer – Node.js wrapper around the GroupDocs.Comparison engine, which runs on the Java runtime; instantiated with the source document path.
  • add() – Registers the target document for comparison.
  • compare() – Executes the comparison and writes a result DOCX; can accept a CompareOptions object to tune sensitivity, generate summary pages, etc.
  • Error handling – Catches any I/O or API errors, returning a consistent result shape.
  • Performance – Measures elapsed time with Date.now() and reports the output file size.

Key Components:

  • compareWordPair: Core function handling a single comparison.
  • fs & path: Node.js built‑ins for file system interactions.
  • groupdocs.Comparer: Main API class from the npm package.

Parameters:

  • sourcePath – Full path to the original document.
  • targetPath – Full path to the document to compare against.
  • outputPath – Destination for the generated comparison file.
  • options – Optional object containing compareOptions for fine‑tuning.

Output: Returns a JSON‑compatible object summarising success, timing, and size.
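Because the result object has the same shape on success and on failure, callers can branch on the success flag without wrapping the call in try/catch. A small sketch of consuming that shape follows; describeResult and the sample values are illustrative assumptions, not repository code.

```javascript
const path = require('path');

// Hypothetical helper: turns one compareWordPair-style result into a log line.
function describeResult(result) {
  const name = path.basename(result.sourcePath);
  if (!result.success) {
    // On failure, fileSize is 0 and error carries the message.
    return `FAILED ${name}: ${result.error}`;
  }
  const kb = (result.fileSize / 1024).toFixed(1);
  return `OK ${name} -> ${path.basename(result.outputPath)} (${result.duration} ms, ${kb} KB)`;
}

// Sample object mirroring the documented result shape (values are made up).
const sample = {
  success: true,
  sourcePath: 'sample-files/source/contract.docx',
  targetPath: 'sample-files/target/contract.docx',
  outputPath: 'output/comparison_contract.docx',
  duration: 1500,
  fileSize: 2048,
  error: null
};

console.log(describeResult(sample));
```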


Implementation: Compares multiple Word document pairs sequentially, processing one pair at a time.

The function iterates over an array of document pair descriptors, invoking compareWordPair for each pair. It aggregates results, tracks processing statistics, optionally reports progress via a callback, and finally returns a summary object containing totals, average duration, and the individual results.

  const startTime = Date.now();
  const results = [];
  let processed = 0;
  let succeeded = 0;
  let failed = 0;

  for (const pair of documentPairs) {
    const result = await compareWordPair(
      pair.source,
      pair.target,
      pair.output,
      options
    );

    results.push(result);
    processed++;
    
    if (result.success) {
      succeeded++;
    } else {
      failed++;
      console.error(`✗ [${processed}/${documentPairs.length}] ${path.basename(pair.source)} - ${result.error}`);
    }

    if (progressCallback) {
      progressCallback({
        processed,
        total: documentPairs.length,
        succeeded,
        failed,
        percentage: Math.round((processed / documentPairs.length) * 100)
      });
    }
  }

  const totalDuration = Date.now() - startTime;
  // Guard against an empty batch, which would otherwise yield NaN
  const avgDuration = results.length
    ? results.reduce((sum, r) => sum + r.duration, 0) / results.length
    : 0;

  return {
    total: documentPairs.length,
    succeeded,
    failed,
    totalDuration,
    avgDuration,
    results
  };

Technical Details

  • Sequential loop – Guarantees only one comparison runs at a time, keeping memory usage minimal.
  • Progress callback – Drives UI or console progress bars; invoked once after each pair completes, whether it succeeded or failed.
  • Result aggregation – Collects each pair's metadata for later reporting or storing as JSON.
  • Error resilience – Failures are logged but do not abort the whole batch.

Key Components:

  • compareWordPair: Re‑used for each iteration.
  • progressCallback: Optional user‑supplied function for real‑time feedback.

Parameters:

  • documentPairs – Array of {source, target, output} objects.
  • options – Optional comparison options passed through to each pair.
  • progressCallback – Optional function invoked with processing stats.

Output: Summary object with totals, timing, and the list of per‑pair results.
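The progress callback receives an object with processed, total, succeeded, failed, and percentage. As an illustration, a console progress bar could be rendered from those fields as sketched below; renderProgressBar is a hypothetical helper, and the repository's batchWithProgress.js may render progress differently.

```javascript
// Hypothetical renderer for the stats object passed to progressCallback.
function renderProgressBar({ processed, total, succeeded, failed, percentage }, width = 20) {
  // Clamp so that an unexpected percentage never produces a negative repeat count.
  const pct = Math.min(100, Math.max(0, percentage));
  const filled = Math.round((pct / 100) * width);
  const bar = '#'.repeat(filled) + '-'.repeat(width - filled);
  return `[${bar}] ${pct}% (${processed}/${total}, ok: ${succeeded}, failed: ${failed})`;
}

// A callback that redraws the bar on a single console line.
const progressCallback = stats => {
  process.stdout.write('\r' + renderProgressBar(stats));
  if (stats.processed === stats.total) process.stdout.write('\n');
};
```

Keeping the rendering in a pure function makes it easy to reuse the same callback for the sequential and parallel flows.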


Implementation: Compares multiple Word document pairs in parallel for improved performance.

The function processes the input array in configurable batches, launching a Promise.all for each batch to run several comparisons concurrently. It respects a concurrency limit to avoid overwhelming the system, reports progress, and returns a detailed summary similar to the sequential version.

  const startTime = Date.now();
  const results = [];
  let processed = 0;
  let succeeded = 0;
  let failed = 0;

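  // Reject invalid limits up front: a concurrency of 0 would loop forever,
  // and an undefined value would silently process nothing
  if (!Number.isInteger(concurrency) || concurrency < 1) {
    throw new RangeError(`concurrency must be a positive integer, got ${concurrency}`);
  }

  // Process in batches to control concurrency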
  for (let i = 0; i < documentPairs.length; i += concurrency) {
    const batch = documentPairs.slice(i, i + concurrency);
    
    const batchResults = await Promise.all(
      batch.map(pair => compareWordPair(pair.source, pair.target, pair.output, options))
    );

    for (const result of batchResults) {
      results.push(result);
      processed++;
      
      if (result.success) {
        succeeded++;
      } else {
        failed++;
        console.error(`✗ [${processed}/${documentPairs.length}] ${path.basename(result.sourcePath)} - ${result.error}`);
      }

      if (progressCallback) {
        progressCallback({
          processed,
          total: documentPairs.length,
          succeeded,
          failed,
          percentage: Math.round((processed / documentPairs.length) * 100)
        });
      }
    }

    if (i + concurrency < documentPairs.length) {
      await new Promise(resolve => setTimeout(resolve, 100));
    }
  }

  const totalDuration = Date.now() - startTime;
  // Guard against an empty batch, which would otherwise yield NaN
  const avgDuration = results.length
    ? results.reduce((sum, r) => sum + r.duration, 0) / results.length
    : 0;

  return {
    total: documentPairs.length,
    succeeded,
    failed,
    totalDuration,
    avgDuration,
    concurrency,
    results
  };

Technical Details

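  • Batch concurrency control – The concurrency parameter caps the number of simultaneous compareWordPair promises, preventing excessive memory pressure. Because each batch is awaited with Promise.all, the next batch starts only after the slowest comparison in the current one finishes.
  • Small inter‑batch delay – Pauses 100 ms between batches (skipped after the last one), presumably to let pending I/O settle; the pause adds roughly 100 ms per additional batch to the total duration.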
  • Aggregated metrics – Includes concurrency in the final summary for traceability.
  • Error handling – Mirrors the sequential version; individual failures are logged without aborting other tasks.

Key Components:

  • Promise.all: Executes a batch of comparisons in parallel.
  • concurrency: User‑defined limit controlling parallelism.

Parameters: Same as the sequential version, plus concurrency.

Output: Provides total duration, average per‑document time, and the concurrency level used.
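The batch approach above waits for the slowest comparison in each batch before starting the next one. An alternative, shown here only as a sketch and not as the repository's implementation, is a small worker pool that keeps up to limit tasks in flight and starts the next task as soon as any slot frees up. runWithLimit is a hypothetical name; the worker is assumed to resolve rather than reject, as compareWordPair does.

```javascript
// Hypothetical worker pool: at most `limit` calls to `worker` run at once.
async function runWithLimit(items, limit, worker) {
  const results = new Array(items.length);
  let next = 0; // index of the next unclaimed item

  // Each runner claims items one at a time until none are left.
  async function runner() {
    while (next < items.length) {
      const index = next++;
      results[index] = await worker(items[index], index);
    }
  }

  // Start no more runners than there are items, and at least one.
  const runnerCount = Math.max(1, Math.min(limit, items.length));
  await Promise.all(Array.from({ length: runnerCount }, () => runner()));

  // Results keep the order of the input array, not completion order.
  return results;
}

// Usage sketch (compareWordPair as described earlier in this README):
// const results = await runWithLimit(documentPairs, 4,
//   pair => compareWordPair(pair.source, pair.target, pair.output, options));
```

Claiming an index synchronously before the first await is what keeps two runners from picking the same item, since Node.js runs this code on a single thread.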


Implementation: Finds matching Word document pairs in source and target directories by filename.

The function scans the supplied source directory, filters for .docx and .doc files, and builds an array of pair objects for every source file whose exact filename (including extension) also exists in the target directory. It also constructs output paths for the comparison results.

  if (!fs.existsSync(sourceDir)) {
    throw new Error(`Source directory not found: ${sourceDir}`);
  }
  if (!fs.existsSync(targetDir)) {
    throw new Error(`Target directory not found: ${targetDir}`);
  }

  const sourceFiles = fs.readdirSync(sourceDir)
    .filter(f => f.toLowerCase().endsWith('.docx') || f.toLowerCase().endsWith('.doc'))
    .map(f => {
      const baseName = path.basename(f, path.extname(f));
      return {
        name: f,
        source: path.join(sourceDir, f),
        target: path.join(targetDir, f),
        output: path.join(outputDir, `comparison_${baseName}.docx`)
      };
    })
    .filter(f => fs.existsSync(f.target)); // Only include pairs where target exists

  return sourceFiles;

Technical Details

  • Directory validation – Throws early if either input directory is missing.
  • Extension filter – Accepts both .docx and legacy .doc formats.
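  • Filename matching – Pairs documents whose filenames are identical, extension included, so report.doc is not paired with report.docx.
  • Automatic output naming – Prefixes comparison_ and always uses a .docx extension, so report.doc and report.docx in the same source folder would map to the same output file.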

Key Components:

  • fs.readdirSync: Reads directory entries.
  • path.basename / path.extname: Manipulate filenames.

Parameters:

  • sourceDir, targetDir, outputDir – Paths to the respective folders.

Output: Array of objects containing source, target, and output paths for each matched pair.
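If a legacy .doc file needs to be paired with its .docx successor, matching on the base name instead of the full filename is one option. The sketch below is a variant for illustration and not the repository's behavior; pairByBaseName is a hypothetical helper that works on plain filename arrays (for example, the output of fs.readdirSync) and assumes case‑insensitive matching is acceptable.

```javascript
const path = require('path');

// Hypothetical variant: pair Word files by base name, ignoring extension and case.
function pairByBaseName(sourceNames, targetNames) {
  const isWord = name => /\.docx?$/i.test(name);
  const key = name => path.basename(name, path.extname(name)).toLowerCase();

  // Index target files by base name; the first match wins on duplicates.
  const targetsByKey = new Map();
  for (const name of targetNames.filter(isWord)) {
    if (!targetsByKey.has(key(name))) {
      targetsByKey.set(key(name), name);
    }
  }

  // Keep only source files that have a counterpart in the target list.
  return sourceNames
    .filter(isWord)
    .filter(name => targetsByKey.has(key(name)))
    .map(name => ({ source: name, target: targetsByKey.get(key(name)) }));
}

console.log(pairByBaseName(['a.doc', 'b.docx', 'notes.txt'], ['a.docx', 'c.docx']));
```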


Implementation: Generates a formatted summary report from batch comparison results.

The function receives the aggregated batch result object and produces a multi‑line formatted string summarising total documents, successes, failures, success rate, total run time, average per‑document duration, throughput, and the concurrency strategy used.

  const { total, succeeded, failed, totalDuration, avgDuration, concurrency } = batchResults;
  
  const report = `
================================================================================
Batch Comparison Summary
================================================================================
Total Documents:     ${total}
Successful:          ${succeeded}
Failed:              ${failed}
Success Rate:        ${total ? ((succeeded / total) * 100).toFixed(2) : '0.00'}%

Performance Metrics:
  Total Duration:    ${(totalDuration / 1000).toFixed(2)}s
  Average Duration:  ${(avgDuration || 0).toFixed(2)}ms per document
  Throughput:        ${totalDuration > 0 ? (succeeded / (totalDuration / 1000)).toFixed(2) : '0.00'} documents/second
  ${concurrency ? `Concurrency:       ${concurrency}` : 'Processing:        Sequential'}

================================================================================
`;

  return report;

Technical Details

  • Template literals – Build a human‑readable report with aligned columns.
  • Dynamic concurrency label – Shows either the concurrency value or marks the run as sequential.
  • Metrics calculations – Derive success percentage, average duration, and throughput.

Key Components:

  • batchResults: Object produced by the sequential or parallel batch functions.

Parameters:

  • batchResults – Summary of the batch run.

Output: Multi‑line string suitable for console output or log files.
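Since the batch summary and the per‑pair results are plain objects, they can also be stored as JSON next to the text report. A minimal sketch, assuming a hypothetical saveBatchReport helper and a caller‑chosen file path:

```javascript
const fs = require('fs');
const path = require('path');

// Hypothetical helper: persists a batch summary object as pretty-printed JSON.
function saveBatchReport(batchResults, filePath) {
  // Create the destination folder if it does not exist yet.
  fs.mkdirSync(path.dirname(filePath), { recursive: true });

  // Add a timestamp so that stored reports can be told apart later.
  const payload = { generatedAt: new Date().toISOString(), ...batchResults };
  fs.writeFileSync(filePath, JSON.stringify(payload, null, 2), 'utf8');
  return filePath;
}

// Usage sketch:
// saveBatchReport(batchResults, path.join('output', 'batch-report.json'));
```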


Best Practices

When implementing batch Word document comparison, consider these best practices:

  • Validate Input Paths – Always check that source, target, and output directories exist before processing.
  • Control Concurrency – Start with a modest concurrency (e.g., 3‑5) and adjust based on CPU, memory, and I/O characteristics.
  • Use Progress Callbacks – Provide users with real‑time feedback to improve perceived performance.
  • Enable Summary Pages Sparingly – Generating visual summary pages adds overhead; enable only when needed.
  • Monitor Memory Usage – Large DOCX files can consume significant RAM; process in batches and release references after each comparison.
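For the concurrency advice above, a starting value can be derived from the number of CPU cores and then tuned by measurement. The heuristic below (one core left free, capped at 5) is an assumption for illustration, not a recommendation from the library; suggestConcurrency is a hypothetical helper.

```javascript
const os = require('os');

// Hypothetical heuristic: leave one core free and stay within [min, max].
function suggestConcurrency(cpuCount = os.cpus().length, { min = 1, max = 5 } = {}) {
  const candidate = cpuCount - 1;
  return Math.min(max, Math.max(min, candidate));
}

console.log(`Suggested starting concurrency: ${suggestConcurrency()}`);
```

The performanceBenchmark.js example described earlier can then be used to confirm or adjust the starting value on the target machine.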

Additional Resources

For more in‑depth information about batch Word document comparison, explore these technical resources:

  • [Document Comparison Using GroupDocs.Comparison] – A step‑by‑step guide covering API basics, options, and advanced scenarios: Read the article →

  • [Optimizing Performance for Large‑Scale Comparisons] – Techniques for concurrency, memory management, and tuning CompareOptions: Read the article →

  • [GroupDocs.Comparison API Reference] – Full reference of classes, methods, and enumeration values: Read the article →

Keywords

GroupDocs.Comparison, Node.js, Java, batch comparison, Word, docx, document diff, parallel processing, performance optimization, compareWordPair, compareBatchSequential, compareBatchParallel, findWordPairs, generateSummaryReport, progress tracking, temporary license, document automation, API, JavaScript, npm, high‑volume, scalable, concurrency

Support

For technical support, visit:
