Many data sets I encounter in practice could be categorized as biggish data: they do not fit into RAM, but can be processed with satisfactory performance on a single machine. The iterator capabilities of pandas make it possible to perform many common operations on such data sets, but they fall short when it comes to aggregations or joins on a key. While dask is often helpful in these cases, it does not offer the same level of maturity and versatility as pandas.
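To give an idea of those iterator capabilities, here is a minimal sketch of a streaming per-key sum using the chunksize option of read_csv; the file name, column names, and chunk size are placeholders for illustration:

import pandas as pd

# stream the CSV in chunks and accumulate a per-key sum without
# loading the whole file into memory
totals = None
for chunk in pd.read_csv('large_file.csv', chunksize=100000):
    partial = chunk.groupby('key')['value'].sum()
    totals = partial if totals is None else totals.add(partial, fill_value=0)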
pdpart is a small utility that helps with out-of-core and parallel operations by writing dataframes into partitioned CSV files. Partitions are defined by a deterministic hash of the key, so that all rows with the same key are guaranteed to end up in the same partition - even across dataframes.
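To illustrate what a deterministic key hash could look like (this is only a sketch, not pdpart's actual implementation), a stable digest such as md5 can be used, since Python's built-in hash() is randomized per process:

import hashlib

def partition_of(key, n_partition):
    # map a key to a partition index deterministically: identical keys
    # yield identical indices in every run and for every dataframe
    digest = hashlib.md5(str(key).encode('utf-8')).hexdigest()
    return int(digest, 16) % n_partition

partition_of('a', n_partition=3)  # same result on every call and in every process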
Partitions effectively act as a hash index, so that aggregations and joins on the key become embarrassingly parallel. pdpart does not perform any of these operations itself; it merely lays the data out so that other tools, such as multiprocessing, can be applied per partition (see the sketch after the usage example below).
Despite (or because of?) its narrow scope, I find that pdpart fits well into my workflow. Beware that, at present, I view it as a working example rather than a mature piece of software; unit tests, for example, are still to be written.
import numpy as np
import pandas as pd

from pdpart import Partitioned
# test data to be partitioned by key
data = pd.DataFrame({
    'key': np.random.choice(['a', 'b', 'c', 'd', 'e', 'f', 'g'], size=100),
    'value': np.arange(100),
})
# create new directory in which partitions are put
parts = Partitioned.create('/data/parts/', by='key', n_partition=3)
# add data to the partitions, each key is mapped to a partition deterministically
parts.append(data)
# do something with the partitions
for fn in parts.partitions:
    df = pd.read_csv(fn)
    # do something here ...
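To sketch the embarrassingly parallel part mentioned above, the partition files can be aggregated with multiprocessing; since each key lives in exactly one partition, the per-partition results can simply be concatenated without a further reduction step. The aggregation and the pool size are arbitrary examples, and the snippet reuses pd and parts from above:

import multiprocessing

def aggregate_partition(fn):
    # each key occurs in exactly one partition, so a per-partition
    # groupby already yields the final result for its keys
    df = pd.read_csv(fn)
    return df.groupby('key')['value'].sum()

with multiprocessing.Pool(4) as pool:
    results = pool.map(aggregate_partition, list(parts.partitions))

# keys do not overlap across partitions, so concatenation is enough
totals = pd.concat(results)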
Open items:

- add doit integration
- remove support for compressing partitioned files
- refactor interface