Added switch for legacy-data, new parallel processing and progress bars #19

Open
Alessi0X wants to merge 3 commits into ourresearch:main from Alessi0X:main

@Alessi0X Alessi0X commented Feb 10, 2026

As of late 2025, the openalex-snapshot contains two top-level subfolders: data and legacy-data.

By default, the flatten script operates on data. However, to keep older workflows working and for reproducibility, one might need to read from legacy-data instead.

This change adds an isDataLegacy variable to the flatten script: set it to True to flatten the legacy-data subfolder, or to False to flatten the default data subfolder.
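As a rough illustration of the switch, here is a minimal sketch of how a boolean could select the subfolder. Only `isDataLegacy`, `data` and `legacy-data` come from this PR; `SNAPSHOT_DIR`, `data_subfolder()`, `entity_glob()` and the `<entity>/*/*.gz` layout are assumptions made for the example, not necessarily what the script does.

```python
import os

# Assumption: snapshot root and per-entity layout are illustrative only.
SNAPSHOT_DIR = "openalex-snapshot"

# Switch described in this PR: True -> legacy-data, False -> data.
isDataLegacy = False


def data_subfolder(is_data_legacy: bool) -> str:
    """Return the snapshot subfolder selected by the switch."""
    return "legacy-data" if is_data_legacy else "data"


def entity_glob(entity: str, is_data_legacy: bool, root: str = SNAPSHOT_DIR) -> str:
    """Build a glob pattern for one entity's gzipped files (hypothetical layout)."""
    return os.path.join(root, data_subfolder(is_data_legacy), entity, "*", "*.gz")
```

For example, `entity_glob("works", isDataLegacy)` would point at the `data` subfolder unless the switch is set to True.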

Second, I have included a new script that performs the flattening in parallel: the seven flatten_*() functions run concurrently using the multiprocess library.
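The general pattern can be sketched as follows. Note that this sketch swaps in the standard library's `concurrent.futures` rather than the `multiprocess` library used in the PR, so that it runs without extra dependencies. `run_in_parallel()` and the two stub functions are hypothetical stand-ins for the real flatten_*() functions.

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor, as_completed


def flatten_authors_stub():
    """Hypothetical stand-in for one of the flatten_*() functions."""
    return "authors done"


def flatten_works_stub():
    """Hypothetical stand-in for another flatten_*() function."""
    return "works done"


def run_in_parallel(tasks, executor_cls=ProcessPoolExecutor, max_workers=None):
    """Run zero-argument callables concurrently and return {label: result}.

    With the default process pool, the callables must be picklable
    (module-level functions). Passing ThreadPoolExecutor as executor_cls
    can be handy when debugging.
    """
    results = {}
    with executor_cls(max_workers=max_workers) as pool:
        # Map each submitted future back to its label.
        futures = {pool.submit(fn): label for label, fn in tasks.items()}
        for future in as_completed(futures):
            # .result() re-raises any exception raised inside the worker.
            results[futures[future]] = future.result()
    return results
```

A call such as `run_in_parallel({"authors": flatten_authors_stub, "works": flatten_works_stub})` would start one worker per task. When using the process pool from a script, the call should sit under an `if __name__ == "__main__":` guard.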

Lastly, the original serial flattening script relied on many print() calls to monitor progress; these have been replaced with tqdm progress bars.
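To give an idea of the replacement, here is a minimal sketch of wrapping a record loop in `tqdm` instead of printing periodically. `count_records()` is a hypothetical helper, not a function from the script, and the `ImportError` fallback is only there so the sketch still runs when tqdm is not installed.

```python
try:
    from tqdm import tqdm
except ImportError:
    # Fallback (assumption for this sketch): no progress bar, plain iteration.
    def tqdm(iterable, **kwargs):
        return iterable


def count_records(lines, desc="flattening"):
    """Iterate over records with a progress bar instead of print() calls."""
    processed = 0
    for line in tqdm(lines, desc=desc, unit="rec"):
        if line.strip():  # skip blank lines
            processed += 1
    return processed
```

Because `tqdm` wraps any iterable, the loop body stays unchanged and the bar updates once per record.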

Alessi0X changed the title from "Added switch for legacy-data" to "Added switch for legacy-data & parallel processing" Feb 10, 2026
Alessi0X changed the title from "Added switch for legacy-data & parallel processing" to "Added switch for legacy-data, new parallel processing and progress bars" Feb 10, 2026