Python scripts to perform the following file operations:
- Jump to a line number in a file and read a line.
- Read a large file as a stream of lines and filter only the lines that match some criteria.
- Read a large file, filter only the lines that match some criteria, redirect and write those filtered lines to another file.
- Read a JSON input and load it into an object.
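The streaming, redirecting, and JSON-loading operations listed above can be sketched as follows. This is a minimal illustration, not the repository's actual code: the function names `filter_lines`, `write_filtered`, and `load_json` are hypothetical, and UTF-8 input is assumed.

```python
import json
from typing import Callable, Iterator


def filter_lines(path: str, predicate: Callable[[str], bool]) -> Iterator[str]:
    """Yield matching lines one at a time without loading the whole file."""
    with open(path, "r", encoding="utf-8") as src:
        for line in src:  # a file object iterates lazily, line by line
            if predicate(line):
                yield line


def write_filtered(src_path: str, dst_path: str,
                   predicate: Callable[[str], bool]) -> int:
    """Copy matching lines from src_path to dst_path; return the count written."""
    count = 0
    with open(dst_path, "w", encoding="utf-8") as dst:
        for line in filter_lines(src_path, predicate):
            dst.write(line)
            count += 1
    return count


def load_json(path: str):
    """Parse a JSON file into the corresponding Python object (dict, list, ...)."""
    with open(path, "r", encoding="utf-8") as fh:
        return json.load(fh)
```

Because `filter_lines` is a generator, memory use stays roughly constant regardless of file size; only `load_json` reads its whole input at once, which is usually acceptable for small JSON files.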
The following timings were obtained by reading a Wikimedia abstracts dump file (an .xml file of 5.8 GB with almost 75.6M lines; the file can be downloaded from here).
- Adding line numbers to the file:
  - addLineNumber : 58.024850428 s
  - addLineNumber_inplace : 103.272668963 s
- Reading a line at a given line number:
  - getline from the linecache module is not practical for large files, because it reads the entire file into memory.
  - getLine uses enumerate() to read the file line by line until it reaches the target line number.
  - getLine_binarysearch searches for the given line number using binary search. The input file must have line numbers. The time spent adding line numbers is reported above.
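The two lookup strategies above can be sketched as follows. This is a hedged illustration rather than the repository's implementation: the snake_case function names are hypothetical, and the numbered-file layout (each line prefixed with its 1-based number and a single space) is an assumption made for this example.

```python
import os
from typing import Optional


def add_line_numbers(src_path: str, dst_path: str) -> None:
    """Write a copy of src_path with each line prefixed by '<n> ' (1-based)."""
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        for n, line in enumerate(src, start=1):
            dst.write(b"%d " % n + line)


def get_line_enumerate(path: str, target: int) -> Optional[bytes]:
    """Linear scan: read lines until the target (1-based) line is reached."""
    with open(path, "rb") as fh:
        for n, line in enumerate(fh, start=1):
            if n == target:
                return line
    return None


def get_line_binarysearch(numbered_path: str, target: int) -> Optional[bytes]:
    """Binary search over byte offsets in a file made by add_line_numbers."""
    with open(numbered_path, "rb") as fh:
        lo, hi = 0, fh.seek(0, os.SEEK_END)  # search window in bytes: [lo, hi)
        while lo < hi:
            mid = (lo + hi) // 2
            if mid == 0:
                fh.seek(0)
            else:
                # Step back one byte and finish that line, so the next
                # readline() returns the first line starting at or after mid.
                fh.seek(mid - 1)
                fh.readline()
            line = fh.readline()
            if not line:  # ran past the last line: target lies before mid
                hi = mid
                continue
            number, _, content = line.partition(b" ")
            n = int(number)
            if n == target:
                return content  # original line, without the number prefix
            if n < target:
                lo = mid + 1
            else:
                hi = mid
    return None
```

The linear scan touches every line before the target, whereas the binary search needs only a logarithmic number of seeks, at the one-time cost of writing the numbered copy. Files are opened in binary mode so that byte offsets passed to `seek` are exact.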
Use ./tools/timingplot.py to generate an interactive plotly plot. The timing data can be found at: ./data/
Test Data
shakespeare.txt : "As You Like It" by William Shakespeare.
exoplanets.json : list of potentially habitable exoplanets, source: Wikipedia (accessed: Mar. 2021), table converted into a .json file.
Resources
Documentation can be viewed at: https://seyedb.github.io/file-ops/
