load-balance-lines
Parallelize newline-delimited data processing by load balancing lines between multiple processes
Install
# Make the executable accessible within your project npm scripts as load-balance-lines
# or, out of npm scripts, as ./node_modules/.bin/load-balance-lines
npm i load-balance-lines
# or globally
npm i -g load-balance-lines
Basic use
Take a huge pile of data whose atomic elements are separated by newlines, typically NDJSON, and pipe it to your executable:
# Make sure your executable is... executable
chmod +x /path/to/my/executable
# and let's go!
cat data.ndjson | load-balance-lines /path/to/my/executable some args
Or, without the cat command, using <:
load-balance-lines /path/to/my/executable some args < data.ndjson
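For illustration, here is a minimal sketch of what such an executable's logic might look like. It assumes that each sub-process receives its share of the lines on stdin, which this README does not spell out, and that whatever it prints to stdout is passed through. The `worker` function and the sample lines are hypothetical; the dry run at the end feeds the worker directly, without load-balance-lines.

```shell
#!/bin/sh
# Hypothetical worker logic: print the character count of each input line.
# Assumption: load-balance-lines delivers lines to the sub-process on stdin.
worker() {
  # The "|| [ -n "$line" ]" part handles a last line without a trailing newline.
  while IFS= read -r line || [ -n "$line" ]; do
    # Skip empty lines
    if [ -z "$line" ]; then
      continue
    fi
    printf '%s\n' "${#line}"
  done
}

# Local dry run on two sample NDJSON lines
result=$(printf '%s\n' '{"id":1}' '{"id":22}' | worker)
printf '%s\n' "$result"
```

In a real executable, you would drop the dry run and end the script with a bare `worker` call, so that it reads the script's own stdin.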
Simple demo
see test
Real case demo
For the needs of wikidata-rank, we need to parse a full dump of Wikidata:
- Get the latest dump (currently 31G gzipped)
wget -c https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.json.gz
- Use nice to take as much CPU as is available while leaving priority to other processes
- Use pigz to decompress the dump using threads (a drop-in replacement for the single-threaded gzip)
nice pigz -d < latest-all.json.gz | nice load-balance-lines /path/to/wikidata-rank/scripts/calculate_base_scores
Options
Number of processes
By default, there will be as many processes as CPU cores, but this can be changed by setting the LBL_PROCESSES environment variable:
export LBL_PROCESSES=4 ; cat data.ndjson | load-balance-lines ./my/script
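If you would rather derive the value from the machine than hard-code it, something like the following sketch may help. It keeps one core free for other work. The `cores` and `workers` variable names are hypothetical, and it assumes `getconf _NPROCESSORS_ONLN` is available, which is the case on most Linux and macOS systems.

```shell
# Count online CPU cores; fall back to 1 if getconf can't report them.
cores=$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1)

# Keep one core free, but never go below one worker.
if [ "$cores" -gt 1 ]; then
  workers=$((cores - 1))
else
  workers=1
fi

export LBL_PROCESSES="$workers"
# Then run as usual:
# cat data.ndjson | load-balance-lines ./my/script
```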
Verbose
By default, the load balancer is silent, leaving stdout free for the sub-processes' output, but you can get some basic information by setting LBL_VERBOSE:
export LBL_VERBOSE=true ; cat data.ndjson | load-balance-lines ./my/script