GitHub - seyedb/file-ops: Efficient information extraction from large files

File Operations

Python scripts to perform the following file operations:

Jump to a given line number in a file and read that line.
Read a large file as a stream of lines and keep only the lines that match given criteria.
Read a large file, keep only the lines that match given criteria, and write those lines to another file.
Read a JSON input and load it into an object.
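
The streaming operations listed above can be sketched with a generator, so that only one line is held in memory at a time. This is a minimal illustration, not the repository's code: the names `filter_lines`, `filter_to_file` and `load_json` are hypothetical, and UTF-8 input is assumed.

```python
import json
from typing import Any, Callable, Iterator


def filter_lines(path: str, predicate: Callable[[str], bool]) -> Iterator[str]:
    """Lazily yield the lines of `path` for which `predicate` is true."""
    with open(path, "r", encoding="utf-8") as f:
        for line in f:  # a file object iterates line by line
            if predicate(line):
                yield line


def filter_to_file(src: str, dst: str, predicate: Callable[[str], bool]) -> int:
    """Write the matching lines of `src` to `dst`; return how many were written."""
    written = 0
    with open(dst, "w", encoding="utf-8") as out:
        for line in filter_lines(src, predicate):
            out.write(line)
            written += 1
    return written


def load_json(path: str) -> Any:
    """Parse a JSON file into the corresponding Python object."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)
```

A call such as `filter_to_file("dump.xml", "titles.xml", lambda s: s.startswith("<title>"))` would copy only the title lines; both file names are placeholders.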

Timing Results

The following timings were obtained on a Wikimedia abstracts dump file (an .xml file of 5.8 GB with almost 75.6M lines; the file can be downloaded from here).

Adding line numbers to the file:
addLineNumber : 58.024850428 s
addLineNumber_inplace : 103.272668963 s
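
As a rough sketch of what the two variants timed above might look like: one writes a numbered copy, the other rewrites the file in place. The prefix format (`N:` followed by the original line), the function names, and the use of `fileinput` for the in-place version are all assumptions made for illustration; the repository's own format and implementation may differ.

```python
import fileinput


def add_line_numbers(src: str, dst: str, sep: str = ":") -> int:
    """Copy `src` to `dst`, prefixing each line with its 1-based number."""
    count = 0
    with open(src, "r", encoding="utf-8") as fin, \
            open(dst, "w", encoding="utf-8") as fout:
        for count, line in enumerate(fin, start=1):
            fout.write(f"{count}{sep}{line}")
    return count  # number of lines processed


def add_line_numbers_inplace(path: str, sep: str = ":") -> None:
    """Prefix each line of `path` with its number, rewriting the file itself."""
    # With inplace=True, fileinput redirects print() output into the file.
    # The platform default encoding is used here.
    with fileinput.input(files=path, inplace=True) as f:
        for lineno, line in enumerate(f, start=1):
            print(f"{lineno}{sep}{line}", end="")
```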
Reading a line at a given line number:
getline from the linecache module is not practical for large files because it reads the whole file into memory.
getLine uses enumerate() to read the file line-by-line until it reaches the target line number.
getLine_binarysearch searches for the given line number using binary search. The input file must have line numbers. The time spent to add line numbers is reported above.
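
The two lookup strategies described above can be sketched as follows. The linear scan stops as soon as it reaches the target line. The binary search works on byte offsets: it seeks to the middle of the remaining range, aligns to the next line start, and compares that line's number with the target. The sketch assumes that every line starts with its number followed by `:` and that numbers increase through the file; these details and the function names are hypothetical, not taken from the repository.

```python
import os
from typing import BinaryIO, Optional


def get_line_scan(path: str, target: int) -> Optional[str]:
    """Return line `target` (1-based) by reading from the top, or None."""
    with open(path, "r", encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            if lineno == target:
                return line
    return None


def _line_at_or_after(f: BinaryIO, pos: int) -> bytes:
    """Read the first line that starts at byte offset >= pos (b'' at EOF)."""
    if pos == 0:
        f.seek(0)
    else:
        f.seek(pos - 1)
        f.readline()  # skip to the next line boundary
    return f.readline()


def get_line_bsearch(path: str, target: int, sep: bytes = b":") -> Optional[str]:
    """Find line `target` in a line-numbered file by bisecting byte offsets."""
    with open(path, "rb") as f:  # binary mode keeps seek/tell exact
        lo, hi = 0, os.path.getsize(path)
        while lo < hi:
            mid = (lo + hi) // 2
            line = _line_at_or_after(f, mid)
            if line and int(line.split(sep, 1)[0]) < target:
                lo = f.tell()  # target starts at or after the end of this line
            else:
                hi = mid       # target starts at or before this line
        line = _line_at_or_after(f, lo)
        if not line:
            return None
        number, _, text = line.partition(sep)
        return text.decode("utf-8") if int(number) == target else None
```

Each bisection step reads at most two lines, so the number of reads grows with the logarithm of the file size rather than with the target line number. The price is the one-off pass needed to add the line numbers.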

Use ./tools/timingplot.py to generate an interactive Plotly plot of the timings. The timing data can be found in ./data/

Test Data
shakespeare.txt : "As You Like It" by William Shakespeare.
exoplanets.json : list of potentially habitable exoplanets, source: Wikipedia (accessed: Mar. 2021), table converted into a .json file.

Resources
Documentation can be viewed at: https://seyedb.github.io/file-ops/

Folders and files
data
docs
src/fileOps
tests
tools
.gitignore
LICENSE
README.md
requirements.txt
setup.py
