# jrf reads files (or STDIN) and emits NDJSON or pretty-prints JSON values
jrf 'STAGE >> STAGE >> STAGE ...' < file.ndjson
jrf --parallel 8 'STAGE >> STAGE >> STAGE ...' *.ndjson.gz
jrf --lax 'STAGE >> STAGE >> STAGE ...' < multiline.json
jrf --lax 'STAGE >> STAGE >> STAGE ...' < file.jsonseq
jrf -o pretty '_' file.json file.ndjson
jrf -o tsv 'group_by(_["status"]) { |row| average(row["latency"]) }'
jrf --require ./my_helpers.rb 'my_method(_["value"])'
jrf --help
# Extract
jrf '_["foo"]'
# Filter then extract
jrf 'select(_["x"] > 10) >> _["foo"]'
# Aggregate
jrf 'select(_["item"] == "Apple") >> sum(_["count"])'
jrf 'percentile(_["ttlb"], 0.50)'
jrf '_["msg"] >> reduce(nil) { |acc, v| acc ? "#{acc} #{v}" : v }'
# Transform array elements
jrf 'map { |x| select(x >= 1) }'
jrf 'map { |x| x + 1 }'
# Transform object values
jrf 'map_values { |v| select(v >= 1) }'
jrf 'map_values { |v| v * 10 }'
# Flatten arrays into rows
jrf '_["items"] >> flat'
# Sort rows by key expression
jrf 'sort(_["at"]) >> _["id"]'
# Group rows into arrays by key
jrf 'group_by(_["status"])'
# Group by key and aggregate
jrf 'group_by(_["item"]) { |row| sum(row["count"] * row["price"]) }'
# Group by key and aggregate, using a global as a stash
jrf '$perc ||= 0.005.step(0.995, 0.01); group_by(_["group"]) { |row| percentile(row["score"], $perc) }'

Need help writing a filter? Ask ChatGPT!
I had been using jq for years, but its unique DSL was always a pain — I could never remember the syntax without looking it up. It is also slow on large inputs and eats up a lot of memory.
Then one day, a carefully-written jq script started swapping and ground to a halt. That was the last straw.
What I wanted was:
- SQL-like syntax for aggregation, e.g., sum(cost * price)
- extensibility backed by a popular programming language
- speed and memory efficiency
Ruby turned out to be a natural fit. Any Ruby expression can be used as an argument to the built-in functions — no special DSL to learn:
jrf 'select(_["path"] =~ /^\/api/)'
jrf 'sort(_["name"].downcase)'

When built-ins alone aren't enough, Ruby blocks let you extend the logic naturally; custom Ruby code can be preloaded as well:
jrf 'group_by(_["status"]) { |row| average(row["latency"]) }'
jrf --require ./my_helpers.rb 'my_method(_)'

Ruby is also fast and memory-efficient: jrf’s core logic and user-supplied expressions are optimized together by the same JIT, strings are copied only when necessary, and Ruby comes with a heavily optimized JSON parser.
For example, calculating min over a large NDJSON file is 3x faster than jq [1]:
% jq -n 'reduce inputs as $x (null; ($x.tid) as $t | if . == null or $t < . then $t else . end)' < large.ldjson
327936
# elapsed: 4.919s
% jrf 'min(_["tid"])' < large.ldjson
327936
# elapsed: 1.470s

On many JSON files, including gzipped ones, jrf can parallelize processing across files with -P, making it over 20x faster here:
% (for i in large.ldjson.*.gz; do gzip -cd $i; done) | jq -n 'reduce inputs as $x (null; ($x.tid) as $t | if . == null or $t < . then $t else . end)'
327936
# elapsed: 49.837s
% jrf -P 10 'min(_["tid"])' large.ldjson.*.gz
327936
# elapsed: 2.260s

Give it a try. Install via RubyGems:
gem install jrf
- By default, input is NDJSON (one JSON value per line); empty lines are skipped. --lax allows multiline JSON texts and parses whitespace-delimited streams (also detects RS 0x1e for JSON-SEQ).
- If no filenames are provided, data is read from the standard input.
- If the provided filename ends with .gz, the file is decompressed automatically.
- Output format is controlled by -o/--output FORMAT:
  - json (default): one compact JSON value per line (NDJSON).
  - pretty: pretty-prints each output JSON value.
  - tsv: tab-separated values. Hashes become rows keyed by their keys; arrays of arrays become rows directly. Scalar and null cells are printed as-is; nested arrays and objects are rendered as compact JSON. Useful for pasting into spreadsheets or piping through column -t.
- Short outputs are grouped into atomic writes (4 KB by default; configurable via --atomic-write-bytes N), allowing safe use with parallel pipelines such as xargs -P.
- -P N opportunistically parallelizes compatible pipelines across N worker processes when multiple input files are provided. jrf looks for the first aggregation stage and executes all stages upfront in parallel. The first aggregation stage is also parallelized when it has a known shape (e.g., sum). If parallelization does not apply cleanly, execution falls back to single-process. Processing order is not guaranteed under -P, so order-sensitive reducers may produce different results than serial execution.
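The note about order-sensitive reducers can be illustrated in plain Ruby. The sketch below does not use jrf or real worker processes; partial_sums, partial_joins, merge_sums and merge_joins are hypothetical helpers that mimic per-file partial aggregation followed by a merge, and the data is made up. A sum merges to the same value whatever order the partial results arrive in, whereas a string-joining reducer does not.

```ruby
# Toy model of per-file partial aggregation (an illustration only,
# not how jrf is implemented; no worker processes are started).

# Each "worker" reduces the rows of one file on its own.
def partial_sums(files)
  files.map { |rows| rows.sum { |row| row["n"] } }
end

def partial_joins(files)
  files.map { |rows| rows.map { |row| row["msg"] }.join(" ") }
end

# The parent merges partial results in whatever order they arrive.
def merge_sums(partials)
  partials.sum
end

def merge_joins(partials)
  partials.join(" ")
end

# Made-up rows standing in for three parsed input files.
files = [
  [{ "n" => 1, "msg" => "a" }, { "n" => 2, "msg" => "b" }],
  [{ "n" => 3, "msg" => "c" }],
  [{ "n" => 4, "msg" => "d" }]
]
reordered = files.reverse # simulate workers finishing in a different order

puts merge_sums(partial_sums(files))       # 10
puts merge_sums(partial_sums(reordered))   # 10 (sum does not depend on order)
puts merge_joins(partial_joins(files))     # a b c d
puts merge_joins(partial_joins(reordered)) # d c a b (order-sensitive)
```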
jrf processes the input using a multi-stage pipeline whose stages are connected by top-level >>.
Within each stage, the current JSON value is available as _, and the following built-in functions are provided.
Inside nested block contexts such as map, map_values, and group_by, _ remains the surrounding row value, while implicit-input built-ins operate on the current target object for that block.
For aggregation functions, nil values are ignored.
Examples below use input → output comments, and those examples are intended to be testable.
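As a rough mental model, a pipeline such as select(_["x"] > 10) >> _["foo"] behaves like the chain of lazy enumerators below. This is a plain-Ruby sketch using only the standard json library, not jrf's implementation; each_value and run_pipeline are hypothetical names.

```ruby
require "json"
require "stringio"

# Produce one parsed JSON value per non-empty line (NDJSON input).
def each_value(io)
  Enumerator.new do |out|
    io.each_line do |line|
      next if line.strip.empty? # empty lines are skipped
      out << JSON.parse(line)
    end
  end
end

# Rough plain-Ruby equivalent of: select(_["x"] > 10) >> _["foo"]
def run_pipeline(io)
  each_value(io)
    .lazy
    .select { |row| row["x"] > 10 } # stage 1: drop rows failing the predicate
    .map { |row| row["foo"] }       # stage 2: extract a field
    .to_a
end

input = StringIO.new(<<~NDJSON)
  {"x": 5, "foo": "skipped"}

  {"x": 20, "foo": "kept"}
NDJSON

puts JSON.generate(run_pipeline(input)) # ["kept"]
```

Because the chain is lazy, each row flows through both stages before the next line is read, which is the property that keeps memory use flat for non-aggregating stages in this model.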
Filters rows. If predicate is true, the current value passes through; if false, the row is dropped.
jrf 'select(_["status"] == 200) >> _["path"]'
# {"status":200,"path":"/ok"}, {"status":404,"path":"/ng"} → "/ok"

Expands an Array into multiple rows, one output row per element.
jrf '_["items"] >> flat'
# {"items":[1,2]}, {"items":[3]}, {"items":[]} → 1, 2, 3

Collects values into one Array. This is the opposite of flat.
group (without arguments) collects the current target object as-is.
group(expr) first evaluates expr and collects that result instead.
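In plain Ruby terms, flat resembles flat_map over the rows, and group resembles collecting all rows back into one Array. A minimal sketch of that analogy follows; flat_rows and group_rows are hypothetical helpers, not part of jrf.

```ruby
# flat: each Array row expands into one output row per element.
def flat_rows(rows)
  rows.flat_map { |row| row }
end

# group: all rows are collected into a single Array, emitted as one row.
def group_rows(rows)
  [rows.to_a]
end

p flat_rows([[1, 2], [3], []]) # [1, 2, 3]
p group_rows([1, 2, 3])        # [[1, 2, 3]] (one output row holding the Array)
```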
jrf '_["id"] >> group'
# {"id":1}, {"id":2}, {"id":3} → [1,2,3]
jrf 'group(_["id"])'
# {"id":1}, {"id":2}, {"id":3} → [1,2,3]

Computes the average value across rows.
jrf '_["latency"] >> average(_)'
# {"latency":10}, {"latency":30} → 20.0

Computes the minimum value across rows.
jrf '_["latency"] >> min(_)'
# {"latency":10}, {"latency":30} → 10

Computes the maximum value across rows.
jrf '_["latency"] >> max(_)'
# {"latency":10}, {"latency":30} → 30

Computes the standard deviation across rows.
jrf '_["latency"] >> stdev(_)'
# {"latency":1}, {"latency":3} → 1.0

Computes the sum across rows.
jrf '_["price"] * _["unit"] >> sum(_)'
# {"price":10,"unit":2}, {"price":5,"unit":4} → 40

count() counts rows.
count(expr) counts non-nil values of expr.
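The difference between the two forms only shows up when the expression can be nil. A plain-Ruby sketch of the distinction, with made-up rows and hypothetical helper names (count_rows, count_non_nil):

```ruby
# count(): every row is counted.
def count_rows(rows)
  rows.size
end

# count(expr): only rows where the expression is non-nil are counted.
def count_non_nil(rows)
  rows.count { |row| !yield(row).nil? }
end

rows = [{ "status" => 200 }, { "status" => nil }, { "path" => "/x" }]

puts count_rows(rows)                            # 3
puts count_non_nil(rows) { |row| row["status"] } # 1 (nil and missing are skipped)
```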
jrf 'count()'
# {"status":200}, {"status":404}, {"status":200} → 3
jrf 'select(_["status"] == 200) >> count()'
# {"status":200}, {"status":404}, {"status":200} → 2

Counts rows where condition is truthy.
jrf 'count_if(_["status"] == 200)'
# {"status":200}, {"status":404}, {"status":200} → 2
jrf '[count_if(_["x"] > 0), count_if(_["x"] < 0)]'
# {"x":1}, {"x":-2}, {"x":3} → [2,1]

Computes percentiles for p in [0.0, 1.0].
If a scalar is given as a percentile, emits the value as a scalar.
If an enumerable of percentiles is given, emits one array of values in the same order as the requested percentiles.
For example, with [0.1, 0.5, 0.9], the output is [p10_value, p50_value, p90_value].
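The documented percentile examples are consistent with a nearest-rank rule, sketched below in plain Ruby. This is an assumption for illustration only: the README does not state which interpolation rule jrf uses, and nearest_rank and percentile_of are hypothetical names.

```ruby
# Nearest-rank percentile over an already sorted Array.
# ASSUMPTION: this rule agrees with the README's examples, but it is not
# confirmed to be the rule jrf implements.
def nearest_rank(sorted, fraction)
  rank = (fraction * sorted.size).ceil
  sorted[[rank - 1, 0].max]
end

# Accepts one percentile (returns a scalar) or an enumerable of percentiles
# (returns an Array in the same order as requested).
def percentile_of(values, fractions)
  sorted = values.compact.sort # nil values are ignored
  return fractions.map { |f| nearest_rank(sorted, f) } if fractions.respond_to?(:map)

  nearest_rank(sorted, fractions)
end

p percentile_of([10, 20, 30], 0.5)                  # 20
p percentile_of([10, 20, 30, 40], [0.25, 0.5, 1.0]) # [10, 20, 40]
```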
jrf 'percentile(_["latency"], 0.5)'
# {"latency":10}, {"latency":20}, {"latency":30} → 20
jrf 'percentile(_["latency"], [0.25, 0.5, 1.0])'
# {"latency":10}, {"latency":20}, {"latency":30}, {"latency":40} → [10,20,40]

Generic custom reducer API.
Most built-in aggregations are convenience wrappers around reduce, and many reshaping patterns can also be expressed with reduce.
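To make the "wrappers around reduce" idea concrete, here is a plain-Ruby sketch in which sum, min and count are each a one-line fold. The helper names (fold, sum_of, min_of, count_of) are hypothetical, and dropping nil before folding is an assumption that mirrors the nil rule stated for aggregation functions.

```ruby
# A generic fold with the same shape as reduce(initial) { |acc, v| ... }.
# ASSUMPTION: nil values are dropped first, mirroring the aggregation rule.
def fold(values, initial, &step)
  values.compact.reduce(initial, &step)
end

def sum_of(values)
  fold(values, 0) { |acc, v| acc + v }
end

def min_of(values)
  fold(values, nil) { |acc, v| acc.nil? || v < acc ? v : acc }
end

def count_of(values)
  fold(values, 0) { |acc, _v| acc + 1 }
end

data = [10, nil, 30, 20]
p sum_of(data)   # 60
p min_of(data)   # 10
p count_of(data) # 3
```

Each reducer keeps only its accumulator between rows, which is why this style works on a stream without buffering the input.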
jrf '_["msg"] >> reduce(nil) { |acc, v| acc ? "#{acc} #{v}" : v }'
# {"msg":"hello"}, {"msg":"world"} → "hello world"
jrf '_["count"] >> reduce(0) { |acc, v| acc + v }'
# {"count":10}, {"count":20} → 30

Sorts rows. With one argument, rows are sorted by a key expression. With a block, rows are sorted by a custom comparator. With no argument, rows are sorted by the current target value; this form is most useful when the target value is a number or a string.
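The three forms correspond closely to Ruby's own sorting methods. A plain-Ruby sketch of that correspondence, using hypothetical helpers (sort_by_key, sort_with, sort_plain) rather than jrf itself:

```ruby
# One argument: sort by a key expression.
def sort_by_key(rows)
  rows.sort_by { |row| yield(row) }
end

# Block form: sort with a custom comparator that returns -1, 0 or 1.
def sort_with(rows, &comparator)
  rows.sort(&comparator)
end

# No argument: sort the values themselves.
def sort_plain(rows)
  rows.sort
end

rows = [{ "id" => "b", "at" => 2 }, { "id" => "a", "at" => 1 }, { "id" => "c", "at" => 3 }]

ascending = sort_by_key(rows) { |row| row["at"] }
descending = sort_with(rows) { |a, b| b["at"] <=> a["at"] }

p ascending.map { |row| row["id"] }  # ["a", "b", "c"]
p descending.map { |row| row["id"] } # ["c", "b", "a"]
p sort_plain([3, 1, 2])              # [1, 2, 3]
```

In this model every row has to be seen before the first one can be emitted, so sorting holds the whole input in memory.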
jrf 'sort(_["at"]) >> _["id"]'
# {"id":"b","at":2}, {"id":"a","at":1}, {"id":"c","at":3} → "a", "b", "c"
jrf 'sort { |a, b| b["at"] <=> a["at"] } >> _["id"]'
# {"id":"b","at":2}, {"id":"a","at":1}, {"id":"c","at":3} → "c", "b", "a"
jrf 'sort'
# 3, 1, 2 → 1, 2, 3

Maps each element of an Array, or each entry of a Hash (yielding [key, value] pairs like Ruby's Hash#map), returning an Array.
By default operates on the current value; pass an explicit collection to operate on a different one.
Inside the block, _ remains the surrounding row value; use the block parameter for the element.
If the block is a plain expression, map transforms each element per row.
If the block uses aggregations (e.g. sum), each array position (or hash key) gets its own independent accumulator across rows.
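The two behaviours of map can be contrasted in plain Ruby. The sketch below is only a model of the described semantics (map_plus_one and map_sum are hypothetical names), and it assumes every row has the same length.

```ruby
# Plain expression: each element of each row is transformed independently,
# so one output row is produced per input row.
def map_plus_one(rows)
  rows.map { |row| row.map { |x| x + 1 } }
end

# Aggregating block: position i of every row feeds accumulator i,
# and a single Array is produced after the last row.
# ASSUMPTION: all rows have the same length.
def map_sum(rows)
  rows.reduce { |acc, row| acc.zip(row).map { |a, b| a + b } }
end

p map_plus_one([[1, 10], [2, 20]])     # [[2, 11], [3, 21]]
p map_sum([[1, 10], [2, 20], [3, 30]]) # [6, 60]
```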
jrf 'map { |x| x + 1 }'
# [1,10], [2,20] → [2,11], [3,21]
jrf 'map { |x| sum(x) }'
# [1,10], [2,20], [3,30] → [6,60]
jrf 'map { |(k, v)| "#{k}=#{v}" }'
# {"a":1,"b":10} → ["a=1","b=10"]
jrf 'map { |(k, v)| sum(v) }'
# {"a":1,"b":10}, {"a":2,"b":20} → [3,30]
jrf '_["values"] >> map { |x| min(x) }'
# {"values":[3,30]}, {"values":[1,10]}, {"values":[2,20]} → [1,10]
jrf 'map(_["items"]) { |x| x * 2 }'
# {"items":[1,2,3]} → [2,4,6]

Maps each value of a Hash and returns a Hash.
By default operates on the current value; pass an explicit collection to operate on a different one.
Inside the block, _ remains the surrounding row value; use the block parameter for the value.
If the block is a plain expression, map_values transforms each value per row.
If the block uses aggregations, each key gets its own independent accumulator across rows.
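The per-key accumulator behaviour can be modelled in plain Ruby with one running total per key. This is a sketch of the described semantics, not jrf's code; times_ten and sum_per_key are hypothetical names.

```ruby
require "json"

# Plain expression: values are transformed row by row.
def times_ten(row)
  row.transform_values { |v| v * 10 }
end

# Aggregating block: each key owns an accumulator that persists across rows.
def sum_per_key(rows)
  rows.each_with_object(Hash.new(0)) do |row, totals|
    row.each { |key, value| totals[key] += value }
  end
end

puts JSON.generate(times_ten({ "a" => 1, "b" => 2 }))
# {"a":10,"b":20}
puts JSON.generate(sum_per_key([{ "a" => 1, "b" => 10 }, { "a" => 2, "b" => 20 }]))
# {"a":3,"b":30}
```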
jrf 'map_values { |v| v * 10 }'
# {"a":1,"b":2} → {"a":10,"b":20}
jrf 'map_values { |v| sum(v) }'
# {"a":1,"b":10}, {"a":2,"b":20} → {"a":3,"b":30}

Runs an expression over the current value (an Array), processing all elements within that single value.
By default operates on the current value; pass an explicit collection to operate on a different one.
Unlike map, which accumulates across rows (the same position across multiple inputs), apply aggregates within one value (all elements of a single array) and completes immediately.
Inside the block, _ remains the surrounding row value; use the block parameter for each element.
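In plain Ruby, the within-one-value aggregation that apply performs looks like an ordinary reduction over a single Array. The sketch below models two typical uses; normalize and average_scores are hypothetical helpers and the data is made up.

```ruby
# Normalize the elements of ONE array by their own sum.
def normalize(row)
  total = row.sum                # aggregate within this single value
  row.map { |x| x.to_f / total } # then transform each element
end

# Aggregate a nested array inside each object of one row.
def average_scores(row)
  row.map do |object|
    scores = object["scores"]
    [object["name"], scores.sum.to_f / scores.size]
  end
end

p normalize([3, 7]) # [0.3, 0.7]
p average_scores([{ "name" => "a", "scores" => [1, 2] },
                  { "name" => "b", "scores" => [10, 20] }])
# [["a", 1.5], ["b", 15.0]]
```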
For example:
# normalize values by their sum
jrf 'total = apply { |x| sum(x) }; map { |x| x.to_f / total }'
# [3,7] → [0.3,0.7]
# aggregate a nested array
jrf 'map { |o| [o["name"], apply(o["scores"]) { |x| average(x) }] }'
# [{"name":"a","scores":[1,2]},{"name":"b","scores":[10,20]}] → [["a",1.5],["b",15.0]]

Groups rows by key expression and applies a reducer per group.
Without a block, collects rows into arrays (equivalent to group_by(key) { group }).
With a block, applies the given reducer independently per group.
Inside the block, _ still refers to the surrounding row, and the current row is also yielded as the block parameter.
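A plain-Ruby model of both forms: without a reducer the rows are collected per key, and with a reducer each key gets its own running accumulator. The helper names (collect_by, sum_by) are hypothetical and the rows are made up.

```ruby
require "json"

# No block: collect rows into one Array per key.
def collect_by(rows)
  rows.group_by { |row| yield(row) }
end

# With a reducer: one independent accumulator (here, a sum) per key.
def sum_by(rows, key)
  rows.each_with_object(Hash.new(0)) do |row, totals|
    totals[row[key]] += yield(row)
  end
end

orders = [
  { "item" => "Apple",  "count" => 2, "price" => 100 },
  { "item" => "Orange", "count" => 3, "price" => 50 },
  { "item" => "Apple",  "count" => 1, "price" => 100 }
]

puts JSON.generate(sum_by(orders, "item") { |row| row["count"] * row["price"] })
# {"Apple":300,"Orange":150}
puts collect_by(orders) { |row| row["item"] }.transform_values(&:size).to_a.inspect
# [["Apple", 2], ["Orange", 1]]
```

The reducer form never stores the rows themselves, only one accumulator per distinct key, so its memory use grows with the number of groups rather than the number of rows.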
jrf 'group_by(_["status"])'
# {"status":200,"path":"/a"}, {"status":404,"path":"/b"}, {"status":200,"path":"/c"} → {"200":[{"status":200,"path":"/a"},{"status":200,"path":"/c"}],"404":[{"status":404,"path":"/b"}]}
jrf 'group_by(_["item"]) { |row| sum(row["count"] * row["price"]) }'
# {"item":"Apple","count":2,"price":100}, {"item":"Orange","count":3,"price":50}, {"item":"Apple","count":1,"price":100} → {"Apple":300,"Orange":150}
jrf 'group_by(_["status"]) { |row| average(row["latency"]) }'
# {"status":200,"latency":10}, {"status":404,"latency":50}, {"status":200,"latency":30} → {"200":20.0,"404":50.0}

Aggregation built-ins accept ordinary Ruby expressions as arguments, but their results are not ordinary Ruby values during evaluation. They can appear as standalone values in reducer templates such as a stage result, an array, a hash, or a reducer-aware block, but they cannot be combined with operators or wrapped inside arbitrary Ruby expressions; doing so leads to an error or an incorrect result.
Good examples:
count()
sum(_["x"])
sum(2 * _["x"])
sum(_["count"] * _["price"])
average(_.abs)
{total: sum(_["x"]), n: count()}
[count(), sum(_["x"])]
group_by(_["k"]) { {total: sum(_["x"]), n: count()} }
map_values { |v| sum(v) }

Bad examples:
1 + count() # use: count() >> _ + 1
2 * sum(_["x"]) # use: sum(2 * _["x"])
sum(_["x"]).round # use: sum(_["x"]) >> _.round
[1 + count()] # use: count() >> [_ + 1]

jrf can also be used as a Ruby library. Create a pipeline with Jrf.new, passing one or more procs as stages. The returned object is callable.
require "jrf"
# Extract and filter
j = Jrf.new(
proc { select(_["status"] == 200) },
proc { _["path"] }
)
j.call(input_array) # => ["/a", "/c", "/d"]
# Aggregate
j = Jrf.new(proc { {total: sum(_["price"]), n: count()} })
j.call(input_array) # => [{total: 1250, n: 42}]
# Local variables are captured via closure
threshold = 10
j = Jrf.new(proc { select(_["x"] > threshold) })

Inside each proc, _ is the current value and all built-in functions documented above are available.
The pipeline streams output when a block is given:
j = Jrf.new(proc { _["id"] })
j.call(input_array) { |value| puts value }

License: MIT
Footnotes
1. In jq, calculating the minimum over multiple objects can also be written as jq -s 'map(.tid) | min', but that does not scale because it buffers the entire input.