A Python tool that converts CIS benchmark PDF documents into structured data formats (CSV or Excel) for easier analysis.
- Intelligent PDF Parsing: Automatically extracts and structures CIS recommendation data from PDF documents
- Multiple Output Formats: Supports both CSV and Excel (.xlsx) output formats
- Table of Contents Processing: Leverages TOC to accurately identify and extract individual recommendations
- Rich Data Extraction: Captures standard CIS fields
- Excel Enhancements: Creates formatted Excel files with:
  - Data validation dropdowns for compliance tracking (OK, KO, Partial, N/A, ?)
  - Conditional formatting for visual status indicators
  - Named tables for easy data manipulation
- Batch Processing: Process multiple CIS benchmark PDFs in a single operation
- Debugging Support: Comprehensive logging and debug output for troubleshooting
- Python 3.6 or higher
- pip package manager
pip install -r requirements.txt

- PyMuPDF: For PDF text extraction and processing
- xlsxwriter: For Excel file generation (required only for Excel output format)
Convert a CIS benchmark PDF to Excel format:
python cis-converter.py path/to/cis-benchmark.pdf

Convert to CSV format:
python cis-converter.py -f CSV path/to/cis-benchmark.pdf

Process multiple PDF files:
python cis-converter.py -f EXCEL -o output/ file1.pdf file2.pdf file3.pdf

Convert single PDF to Excel with custom output directory:
python cis-converter.py -f EXCEL -o ./results/ CIS_Ubuntu_Linux_20.04_Benchmark_v1.1.0.pdf

Convert to CSV with custom delimiter and debug logging:
python cis-converter.py -f CSV --csv-delimiter ";" -l DEBUG benchmark.pdf

Batch process multiple benchmarks:
python cis-converter.py -o ./compliance-data/ *.pdf

The tool extracts the following information from each CIS recommendation:
| Field | Description |
|---|---|
| Benchmark | Source PDF filename |
| CIS # | Recommendation number (e.g., 2.3.1.6) |
| Scored | Scoring type (Scored/Not Scored/Manual/Automated) |
| Type | Profile level (L1/L2) or applicability |
| Policy | Recommendation title/name |
| Profile Applicability | Target systems and environments |
| Description | Detailed explanation of the control |
| Rationale | Why this control is important |
| Audit | Steps to verify compliance |
| Result | Compliance status (for tracking) |
| Comments | Additional notes (for tracking) |
| Remediation | Steps to implement the control |
| Impact | Potential effects of implementation |
| Default Value | System default configuration |
| References | Related documentation and resources |
| Additional Information | Extra context and notes |
| CIS Controls | Mapping to CIS Controls framework |
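
When the extracted rows are post-processed, `CIS #` values sort incorrectly as plain strings (for example, 2.3.1.10 lands before 2.3.1.6). The following is a minimal sketch of a numeric sort key. The sample `rows` and the `cis_sort_key` helper are illustrative assumptions and are not part of the tool.

```python
def cis_sort_key(cis_number):
    """Turn '2.3.1.6' into (2, 3, 1, 6) so numbers compare numerically."""
    return tuple(int(part) for part in cis_number.split("."))


# Hypothetical rows using two of the column names from the field table.
rows = [
    {"CIS #": "2.3.1.10", "Policy": "Example policy C"},
    {"CIS #": "2.3.1.6", "Policy": "Example policy B"},
    {"CIS #": "1.1", "Policy": "Example policy A"},
]

# Sort by the numeric tuple rather than by the raw string.
sorted_rows = sorted(rows, key=lambda row: cis_sort_key(row["CIS #"]))
ordered_numbers = [row["CIS #"] for row in sorted_rows]
print(ordered_numbers)
```

The helper assumes every `CIS #` value consists only of dot-separated integers; a value with any other shape would raise a `ValueError` and would need separate handling.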
usage: cis-converter.py [-h] [-l {DEBUG,INFO,WARNING,ERROR,CRITICAL}] [--debug-file DEBUG_FILE] [-f {CSV,EXCEL}]
                        [-o OUTPUT_DIR] [--csv-quoting {ALL,MINIMAL,NONNUMERIC,NONE,NOTNULL,STRINGS}]
                        [--csv-delimiter CSV_DELIMITER] [--csv-quotechar CSV_QUOTECHAR]
                        [--csv-escapechar CSV_ESCAPECHAR]
                        input_files [input_files ...]

positional arguments:
  input_files           path to the input file(s)

options:
  -h, --help            show this help message and exit
  -l {DEBUG,INFO,WARNING,ERROR,CRITICAL}, --log-level {DEBUG,INFO,WARNING,ERROR,CRITICAL}
                        set the logging level (default: INFO)
  --debug-file DEBUG_FILE
                        output file for TXT extract from PDF, only if --log-level=DEBUG (default: cis-debug.txt)
  -f {CSV,EXCEL}, --format {CSV,EXCEL}
                        set the output format (default: EXCEL)
  -o OUTPUT_DIR, --output-folder OUTPUT_DIR
                        path to the folder for storing files generated by the script (default: ./)

CSV options:
  --csv-quoting {ALL,MINIMAL,NONNUMERIC,NONE,NOTNULL,STRINGS}
                        set the CSV quoting style (default: ALL)
  --csv-delimiter CSV_DELIMITER
                        set the CSV delimiter (default: ,)
  --csv-quotechar CSV_QUOTECHAR
                        set the CSV quote character (default: ")
  --csv-escapechar CSV_ESCAPECHAR
                        set the CSV escape character (default: \)
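
For readers who want to see how such an interface fits together, here is a rough `argparse` sketch reconstructed only from the help text above. It is not the tool's actual source: the `build_parser` function, the `QUOTING_NAMES` list, and the filtering of unavailable quoting styles are assumptions made for illustration.

```python
import argparse
import csv

# Quoting styles named in the help text. NOTNULL and STRINGS only exist in
# the csv module of newer Python versions, so keep the ones that are available.
QUOTING_NAMES = ["ALL", "MINIMAL", "NONNUMERIC", "NONE", "NOTNULL", "STRINGS"]
AVAILABLE_QUOTING = [n for n in QUOTING_NAMES if hasattr(csv, "QUOTE_" + n)]


def build_parser():
    """Hypothetical parser mirroring the documented options and defaults."""
    parser = argparse.ArgumentParser(
        prog="cis-converter.py",
        formatter_class=argparse.ArgumentDefaultsHelpFormatter,
    )
    parser.add_argument("input_files", nargs="+", help="path to the input file(s)")
    parser.add_argument("-l", "--log-level", default="INFO",
                        choices=["DEBUG", "INFO", "WARNING", "ERROR", "CRITICAL"],
                        help="set the logging level")
    parser.add_argument("--debug-file", default="cis-debug.txt",
                        help="output file for TXT extract from PDF")
    parser.add_argument("-f", "--format", default="EXCEL",
                        choices=["CSV", "EXCEL"], help="set the output format")
    parser.add_argument("-o", "--output-folder", dest="output_dir", default="./",
                        help="path to the folder for generated files")

    csv_group = parser.add_argument_group("CSV options")
    csv_group.add_argument("--csv-quoting", default="ALL",
                           choices=AVAILABLE_QUOTING,
                           help="set the CSV quoting style")
    csv_group.add_argument("--csv-delimiter", default=",",
                           help="set the CSV delimiter")
    csv_group.add_argument("--csv-quotechar", default='"',
                           help="set the CSV quote character")
    csv_group.add_argument("--csv-escapechar", default="\\",
                           help="set the CSV escape character")
    return parser


# Parse one of the documented example command lines.
args = build_parser().parse_args(
    ["-f", "CSV", "--csv-delimiter", ";", "benchmark.pdf"]
)
# Map the quoting name onto the matching csv module constant.
quoting = getattr(csv, "QUOTE_" + args.csv_quoting)
print(args.format, args.csv_delimiter, args.input_files)
```

Looking the quoting style up with `getattr` keeps the option names identical to the `csv.QUOTE_*` constants, which is one plausible reason the help text lists them in that form.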
- Main Worksheet: Contains all extracted CIS recommendations with formatted columns
- Data Worksheet: Provides validation lists for compliance tracking
- Features:
  - Data validation dropdowns in the "Result" column
  - Conditional formatting with color-coded compliance status
  - Text wrapping and proper cell formatting
  - Named tables for easy filtering and sorting
- CSV files are UTF-8 encoded with a Byte Order Mark (BOM) for proper character display
- Customizable delimiters and quoting options
- Compatible with spreadsheet applications and data analysis tools
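
The CSV characteristics listed above (UTF-8 with BOM, configurable delimiter and quoting) map onto Python's standard `csv` module roughly as follows. This is a sketch under assumptions rather than the tool's own writer: the sample rows, the three-column subset, and the output file name are made up for illustration.

```python
import csv
import os
import tempfile

# Hypothetical rows; the column names are taken from the field table.
rows = [
    {"CIS #": "1.1", "Policy": "Example policy", "Result": ""},
    {"CIS #": "1.2", "Policy": "Politique d'exemple (é)", "Result": ""},
]

out_path = os.path.join(tempfile.mkdtemp(), "example.csv")

# "utf-8-sig" prepends the BOM; newline="" lets the csv module control
# line endings, as the csv documentation recommends.
with open(out_path, "w", encoding="utf-8-sig", newline="") as handle:
    writer = csv.DictWriter(
        handle,
        fieldnames=["CIS #", "Policy", "Result"],
        delimiter=";",          # corresponds to --csv-delimiter
        quotechar='"',          # corresponds to --csv-quotechar
        escapechar="\\",        # corresponds to --csv-escapechar
        quoting=csv.QUOTE_ALL,  # corresponds to --csv-quoting ALL
    )
    writer.writeheader()
    writer.writerows(rows)

# Read the raw bytes back to confirm the BOM is present.
with open(out_path, "rb") as handle:
    raw = handle.read()
print(raw[:3])
```

The BOM lets spreadsheet applications that would otherwise guess a legacy encoding detect UTF-8, so accented characters display correctly.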
"Table of Contents could not be found"
- Ensure the PDF follows standard CIS benchmark format
- Check if the PDF has a proper Table of Contents section
- Run with --log-level=DEBUG to see detailed extraction information
Missing or incomplete data
- Some CIS PDFs may have formatting variations
- Use debug mode to examine the extracted text: --log-level=DEBUG
- Check the debug output file (default: cis-debug.txt) for parsing details
Enable debug logging to troubleshoot parsing issues:
python cis-converter.py --log-level=DEBUG --debug-file=debug.txt input.pdf

This will create a detailed log file showing:
- Raw text extraction from each page
- Formatted and cleaned text
- Table of contents parsing results
- Section identification and data extraction
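
To make the table of contents parsing step more concrete, here is a simplified sketch of how TOC lines could be matched once the text has been extracted. It assumes each entry has the shape "number, title, dot leader, page number"; the `TOC_LINE` pattern and the sample text are illustrative assumptions, and the tool's real parsing logic may differ.

```python
import re

# Assumed entry shape: "1.1.1 Title text (Automated) ........ 25"
TOC_LINE = re.compile(
    r"^(?P<number>\d+(?:\.\d+)+)\s+"  # dotted number such as 1.1.2
    r"(?P<title>.+?)\s*"              # title, matched lazily
    r"\.{3,}\s*"                      # dot leader
    r"(?P<page>\d+)\s*$"              # page number
)

# Made-up sample resembling extracted TOC text.
sample_toc = """\
1 Initial Setup ............ 20
1.1.1 Ensure example setting A is enabled (Automated) ........ 25
1.1.2 Ensure example setting B is disabled (Manual) .... 27
Appendix: Summary Table .... 300
"""

entries = []
for line in sample_toc.splitlines():
    match = TOC_LINE.match(line)
    if match:
        entries.append((match["number"], match["title"], int(match["page"])))

for number, title, page in entries:
    print(number, title, page)
```

Lines that do not start with a dotted number, such as top-level section titles or appendices, are skipped here. When a real PDF produces no matches at all, comparing the pattern against the debug text file is a reasonable first troubleshooting step.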
Contributions are welcome! Please feel free to submit pull requests or open issues for:
- Bug fixes and improvements
- Support for additional CIS benchmark formats
- Enhanced parsing algorithms
- New output formats
This project is licensed under the MIT License.
- Based on the original CISConverter by Fragtastic
- Enhanced for improved parsing performance and reliability