Conversation
There was a problem hiding this comment.
Pull Request Overview
This PR adds batch processing functionality to split EUPMC link generation into multiple XML files, each containing a configurable number of links (5000 by default). This enables more manageable file sizes and incremental uploads to the FTP server.
Key Changes:
- Introduced
BATCH_SIZEconstant andsplit_into_batches()function to divide data into chunks - Modified main workflow to generate and upload multiple XML files (one per batch) instead of a single file
- Removed duplicate
import osstatement and simplified some docstrings
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| return | ||
|
|
||
|
|
||
| def split_into_batches(data: List[Dict[str, str]], batch_size: int): |
There was a problem hiding this comment.
The function is missing a return type hint. For consistency with other functions in this file (which use type hints), consider adding -> Generator[List[Dict[str, str]], None, None] as the return type. You'll also need to import Generator from the typing module.
|
|
||
| input_tsv = "/nfs/ftp/public/databases/em_ebi/emdb_related/emicss/resources/emdb_pubmed.tsv" | ||
| output_xml = "/hps/nobackup/gerard/emdb/annotations/output/EMDB_linkFile_providerID_2057.xml" | ||
| output_dir = "/hps/nobackup/gerard/emdb/annotations/output/EPMC" |
There was a problem hiding this comment.
The output directory is not validated or created before use. If the directory doesn't exist, the script will fail when attempting to write the first batch XML file. Consider adding directory creation logic (e.g., os.makedirs(output_dir, exist_ok=True)) before the batch processing loop.
| output_dir = "/hps/nobackup/gerard/emdb/annotations/output/EPMC" | |
| output_dir = "/hps/nobackup/gerard/emdb/annotations/output/EPMC" | |
| os.makedirs(output_dir, exist_ok=True) |
| @@ -49,12 +50,10 @@ def build_xml(data: List[Dict[str, str]]) -> ET.Element: | |||
|
|
|||
|
|
|||
| def prettify_xml(elem: ET.Element) -> str: | |||
There was a problem hiding this comment.
The docstring was removed from this function. While the function is simple, it's a best practice to maintain docstrings for all functions, especially in a codebase that previously had them. Consider adding a brief docstring like """Return a pretty-printed XML string for the Element."""
| def prettify_xml(elem: ET.Element) -> str: | |
| def prettify_xml(elem: ET.Element) -> str: | |
| """Return a pretty-printed XML string for the Element.""" |
| return minidom.parseString(ET.tostring(elem, 'utf-8')).toprettyxml(indent=" ") | ||
|
|
||
|
|
||
| def write_xml(xml_root: ET.Element, output_file: str) -> None: |
There was a problem hiding this comment.
The docstring was removed from this function. Consider restoring it or adding a brief one like """Write the XML tree to a file with pretty formatting.""" to maintain consistency with the codebase's documentation standards.
| def write_xml(xml_root: ET.Element, output_file: str) -> None: | |
| def write_xml(xml_root: ET.Element, output_file: str) -> None: | |
| """Write the XML tree to a file with pretty formatting.""" |
| @@ -69,17 +68,6 @@ def upload_file_via_ftp( | |||
| remote_dir: str = ".", | |||
| remote_filename: str = None | |||
| ) -> None: | |||
There was a problem hiding this comment.
The comprehensive docstring was removed from this function. Given that this function has multiple parameters with specific purposes and interacts with external FTP servers, it would be beneficial to restore the docstring to help future maintainers understand the function's purpose, parameters, and behavior.
| ) -> None: | |
| ) -> None: | |
| """ | |
| Upload a local file to a remote FTP server. | |
| Connects to the specified FTP server using the provided credentials, | |
| changes to the specified remote directory, and uploads the given local file. | |
| Parameters: | |
| server (str): The FTP server address. | |
| username (str): Username for FTP authentication. | |
| password (str): Password for FTP authentication. | |
| local_file_path (str): Path to the local file to upload. | |
| remote_dir (str, optional): Remote directory to upload the file to. Defaults to ".". | |
| remote_filename (str, optional): Name to use for the file on the server. Defaults to the local file's basename. | |
| Returns: | |
| None | |
| Logs errors if the local file does not exist or if the upload fails. | |
| """ |
|
|
||
| xml_root = build_xml(batch) | ||
|
|
||
| output_file = os.path.join( | ||
| output_dir, | ||
| f"EMDB_linkFile_providerID_2057_part{part}.xml" | ||
| ) | ||
|
|
||
| write_xml(xml_root, output_file) | ||
|
|
||
| upload_file_via_ftp( | ||
| server=ftp_server, | ||
| username=ftp_user, | ||
| password=ftp_pass, | ||
| local_file_path=output_file, | ||
| remote_dir=ftp_dir | ||
| ) | ||
|
|
||
| logging.info(f"Batch {part} completed.") | ||
| part += 1 | ||
|
|
There was a problem hiding this comment.
The batch processing loop lacks error handling. If an exception occurs during XML generation, file writing, or FTP upload for any batch, the entire script will fail and subsequent batches won't be processed. Consider adding try-except blocks around the batch processing operations to log failures and continue with remaining batches, or at minimum, provide clear error context about which batch failed.
| xml_root = build_xml(batch) | |
| output_file = os.path.join( | |
| output_dir, | |
| f"EMDB_linkFile_providerID_2057_part{part}.xml" | |
| ) | |
| write_xml(xml_root, output_file) | |
| upload_file_via_ftp( | |
| server=ftp_server, | |
| username=ftp_user, | |
| password=ftp_pass, | |
| local_file_path=output_file, | |
| remote_dir=ftp_dir | |
| ) | |
| logging.info(f"Batch {part} completed.") | |
| part += 1 | |
| try: | |
| xml_root = build_xml(batch) | |
| output_file = os.path.join( | |
| output_dir, | |
| f"EMDB_linkFile_providerID_2057_part{part}.xml" | |
| ) | |
| write_xml(xml_root, output_file) | |
| upload_file_via_ftp( | |
| server=ftp_server, | |
| username=ftp_user, | |
| password=ftp_pass, | |
| local_file_path=output_file, | |
| remote_dir=ftp_dir | |
| ) | |
| logging.info(f"Batch {part} completed.") | |
| except Exception as e: | |
| logging.error(f"Error processing batch {part}: {e}", exc_info=True) | |
| part += 1 |
| upload_file_via_ftp( | ||
| server=ftp_server, | ||
| username=ftp_user, | ||
| password=ftp_pass, | ||
| local_file_path=output_file, | ||
| remote_dir=ftp_dir | ||
| ) |
There was a problem hiding this comment.
The FTP connection is established and torn down for each batch. For better performance, consider establishing a single FTP connection before the loop and reusing it for all uploads, then closing it after all batches are processed. This would significantly reduce connection overhead, especially when processing many batches.
No description provided.