jrf - a JSON filter with the power and speed of Ruby

SYNOPSIS

# jrf reads files (or STDIN) and emits NDJSON or pretty-prints JSON values
jrf 'STAGE >> STAGE >> STAGE ...' < file.ndjson
jrf --parallel 8 'STAGE >> STAGE >> STAGE ...' *.ndjson.gz
jrf --lax 'STAGE >> STAGE >> STAGE ...' < multiline.json
jrf --lax 'STAGE >> STAGE >> STAGE ...' < file.jsonseq
jrf -o pretty '_' file.json file.ndjson
jrf -o tsv 'group_by(_["status"]) { |row| average(row["latency"]) }'
jrf --require ./my_helpers.rb 'my_method(_["value"])'
jrf --help

# Extract
jrf '_["foo"]'

# Filter then extract
jrf 'select(_["x"] > 10) >> _["foo"]'

# Aggregate
jrf 'select(_["item"] == "Apple") >> sum(_["count"])'
jrf 'percentile(_["ttlb"], 0.50)'
jrf '_["msg"] >> reduce(nil) { |acc, v| acc ? "#{acc} #{v}" : v }'

# Transform array elements
jrf 'map { |x| select(x >= 1) }'
jrf 'map { |x| x + 1 }'

# Transform object values
jrf 'map_values { |v| select(v >= 1) }'
jrf 'map_values { |v| v * 10 }'

# Flatten arrays into rows
jrf '_["items"] >> flat'

# Sort rows by key expression
jrf 'sort(_["at"]) >> _["id"]'

# Group rows into arrays by key
jrf 'group_by(_["status"])'

# Group by key and aggregate
jrf 'group_by(_["item"]) { |row| sum(row["count"] * row["price"]) }'

# Group by key and aggregate, using a global as a stash
jrf '$perc ||= 0.005.step(0.995, 0.01); group_by(_["group"]) { |row| percentile(row["score"], $perc) }'

Need help writing a filter? Ask ChatGPT!

WHY JRF?

I had been using jq for years, but its unique DSL was always a pain — I could never remember the syntax without looking it up. It is also slow on large inputs and eats up a lot of memory.

Then one day, a carefully-written jq script started swapping and ground to a halt. That was the last straw.

What I wanted was:

SQL-like syntax for aggregation, e.g., sum(cost * price)
extensibility backed by a popular programming language
speed and memory efficiency

Ruby turned out to be a natural fit. Any Ruby expression can be used as an argument to the built-in functions — no special DSL to learn:

jrf 'select(_["path"] =~ /^\/api/)'
jrf 'sort(_["name"].downcase)'

When built-ins alone aren't enough, Ruby blocks let you extend the logic naturally; custom ruby code can be preloaded as well:

jrf 'group_by(_["status"]) { |row| average(row["latency"]) }'
jrf --require ./my_helpers.rb 'my_method(_)'

Ruby is also fast and memory-efficient: jrf’s core logic and user-supplied expressions are optimized together by the same JIT, strings are copied only when necessary, and Ruby comes with a heavily optimized JSON parser.

For example, calculating min over a large NDJSON file is 3x faster than jq¹:

% jq -n 'reduce inputs as $x (null; ($x.tid) as $t | if . == null or $t < . then $t else . end)' < large.ldjson
327936
# elapsed: 4.919s

% jrf 'min(_["tid"])' < large.ldjson
327936
# elapsed: 1.470s

On many JSON files, including gzipped ones, jrf can parallelize processing across files with -P, making it over 20x faster here:

% (for i in large.ldjson.*.gz; do gzip -cd $i; done) | jq -n 'reduce inputs as $x (null; ($x.tid) as $t | if . == null or $t < . then $t else . end)'
327936
# elapsed: 49.837s

% jrf -P 10 'min(_["tid"])' large.ldjson.*.gz
327936
# elapsed: 2.260s

Give it a try — install via RubyGems: gem install jrf

INPUT AND OUTPUT

By default, input is NDJSON (one JSON value per line); empty lines are skipped.
- --lax allows multiline JSON texts and parses whitespace-delimited streams (also detects RS 0x1e for JSON-SEQ).
- If no filenames are provided, data is read from the standard input.
- If the provided filename ends with .gz, the file is decompressed automatically.
Output format is controlled by -o/--output FORMAT:
- json (default) — one compact JSON value per line (NDJSON).
- pretty — pretty-prints each output JSON value.
- tsv — tab-separated values. Hashes become rows keyed by their keys; arrays of arrays become rows directly. Scalar and null cells are printed as-is; nested arrays and objects are rendered as compact JSON. Useful for pasting into spreadsheets or piping through column -t.
- Short outputs are grouped into atomic writes (4 KB by default; configurable via --atomic-write-bytes N), allowing safe use with parallel pipelines such as xargs -P.
-P N opportunistically parallelizes compatible pipelines across N worker processes when multiple input files are provided. jrf looks for the first aggregation stage and executes all stages upfront in parallel. The first aggregation stage is also parallelized when it has a known shape (e.g., sum). If parallelization does not apply cleanly, execution falls back to single-process. Processing order is not guaranteed under -P, so order-sensitive reducers may produce different results than serial execution.

BUILT-IN FUNCTIONS

jrf processes the input using a multi-stage pipeline that is connected by top-level >>.

Within each stage, the current JSON value is available as _, and the following built-in functions are provided. Inside nested block contexts such as map, map_values, and group_by, _ remains the surrounding row value, while implicit-input built-ins operate on the current target object for that block. For aggregation functions, nil values are ignored. Examples below use input → output comments, and those examples are intended to be testable.

select(predicate)

Filters rows. If predicate is true, the current value passes through; if false, the row is dropped.

jrf 'select(_["status"] == 200) >> _["path"]'
# {"status":200,"path":"/ok"}, {"status":404,"path":"/ng"} → "/ok"

flat

Expands an Array into multiple rows, one output row per element.

jrf '_["items"] >> flat'
# {"items":[1,2]}, {"items":[3]}, {"items":[]} → 1, 2, 3

group

group(expr)

Collects values into one Array. This is the opposite of flat. group (without arguments) collects the current target object as-is. group(expr) first evaluates expr and collects that result instead.

jrf '_["id"] >> group'
# {"id":1}, {"id":2}, {"id":3} → [1,2,3]
jrf 'group(_["id"])'
# {"id":1}, {"id":2}, {"id":3} → [1,2,3]

average(expr)

Computes the average value across rows.

jrf '_["latency"] >> average(_)'
# {"latency":10}, {"latency":30} → 20.0

min(expr)

Computes the minimum value across rows.

jrf '_["latency"] >> min(_)'
# {"latency":10}, {"latency":30} → 10

max(expr)

Computes the maximum value across rows.

jrf '_["latency"] >> max(_)'
# {"latency":10}, {"latency":30} → 30

stdev(expr)

Computes the standard deviation across rows.

jrf '_["latency"] >> stdev(_)'
# {"latency":1}, {"latency":3} → 1.0

sum(expr)

Computes the sum across rows.

jrf '_["price"] * _["unit"] >> sum(_)'
# {"price":10,"unit":2}, {"price":5,"unit":4} → 40

count()

count(expr)

count() counts rows. count(expr) counts non-nil values of expr.

jrf 'count()'
# {"status":200}, {"status":404}, {"status":200} → 3
jrf 'select(_["status"] == 200) >> count()'
# {"status":200}, {"status":404}, {"status":200} → 2

count_if(condition)

Counts rows where condition is truthy.

jrf 'count_if(_["status"] == 200)'
# {"status":200}, {"status":404}, {"status":200} → 2
jrf '[count_if(_["x"] > 0), count_if(_["x"] < 0)]'
# {"x":1}, {"x":-2}, {"x":3} → [2,1]

percentile(expr, 0.95)

percentile(expr, [0.1, 0.5, 0.9])

percentile(expr, 0.1.step(0.9, 0.4))

Computes percentiles for p in [0.0, 1.0].

If a scalar is given as a percentile, emits the value as a scalar.

If an enumerable of percentiles is given, emits one array of values in the same order as the requested percentiles. For example, with [0.1, 0.5, 0.9], the output is [p10_value, p50_value, p90_value].

jrf 'percentile(_["latency"], 0.5)'
# {"latency":10}, {"latency":20}, {"latency":30} → 20
jrf 'percentile(_["latency"], [0.25, 0.5, 1.0])'
# {"latency":10}, {"latency":20}, {"latency":30}, {"latency":40} → [10,20,40]

reduce(initial) { |acc, v| ... }

Generic custom reducer API. Most built-in aggregations are convenience wrappers around reduce, and many reshaping patterns can also be expressed with reduce.

jrf '_["msg"] >> reduce(nil) { |acc, v| acc ? "#{acc} #{v}" : v }'
# {"msg":"hello"}, {"msg":"world"} → "hello world"
jrf '_["count"] >> reduce(0) { |acc, v| acc + v }'
# {"count":10}, {"count":20} → 30

sort(key_expr)

sort(key_expr) { |a, b| ... }

sort

Sorts rows. With one argument, rows are sorted by key expression. With a block, rows are sorted by custom comparator. Wit no argument, rows are sorted by the current target value. This is most useful when the target value is a number or a string.

jrf 'sort(_["at"]) >> _["id"]'
# {"id":"b","at":2}, {"id":"a","at":1}, {"id":"c","at":3} → "a", "b", "c"
jrf 'sort { |a, b| b["at"] <=> a["at"] } >> _["id"]'
# {"id":"b","at":2}, {"id":"a","at":1}, {"id":"c","at":3} → "c", "b", "a"
jrf 'sort'
# 3, 1, 2 → 1, 2, 3

map { |x| ... }

map(collection) { |x| ... }

Maps each element of an Array, or each entry of a Hash (yielding [key, value] pairs like Ruby's Hash#map), returning an Array. By default operates on the current value; pass an explicit collection to operate on a different one. Inside the block, _ remains the surrounding row value; use the block parameter for the element.

If the block is a plain expression, map transforms each element per row. If the block uses aggregations (e.g. sum), each array position (or hash key) gets its own independent accumulator across rows.

jrf 'map { |x| x + 1 }'
# [1,10], [2,20] → [2,11], [3,21]

jrf 'map { |x| sum(x) }'
# [1,10], [2,20], [3,30] → [6,60]

jrf 'map { |(k, v)| "#{k}=#{v}" }'
# {"a":1,"b":10} → ["a=1","b=10"]

jrf 'map { |(k, v)| sum(v) }'
# {"a":1,"b":10}, {"a":2,"b":20} → [3,30]

jrf '_["values"] >> map { |x| min(x) }'
# {"values":[3,30]}, {"values":[1,10]}, {"values":[2,20]} → [1,10]

jrf 'map(_["items"]) { |x| x * 2 }'
# {"items":[1,2,3]} → [2,4,6]

map_values { |v| ... }

map_values(collection) { |v| ... }

Maps each value of a Hash and returns a Hash. By default operates on the current value; pass an explicit collection to operate on a different one. Inside the block, _ remains the surrounding row value; use the block parameter for the value.

If the block is a plain expression, map_values transforms each value per row. If the block uses aggregations, each key gets its own independent accumulator across rows.

jrf 'map_values { |v| v * 10 }'
# {"a":1,"b":2} → {"a":10,"b":20}

jrf 'map_values { |v| sum(v) }'
# {"a":1,"b":10}, {"a":2,"b":20} → {"a":3,"b":30}

apply { |x| ... }

apply(collection) { |x| ... }

Runs an expression over the current value (an Array), processing all elements within that single value. By default operates on the current value; pass an explicit collection to operate on a different one. Unlike map which accumulates across rows (the same position across multiple inputs), apply aggregates within one value (all elements of a single array), completing immediately. Inside the block, _ remains the surrounding row value; use the block parameter for each element.

For example:

# normalize values by their sum
jrf 'total = apply { |x| sum(x) }; map { |x| x.to_f / total }'
# [3,7] → [0.3,0.7]

# aggregate a nested array
jrf 'map { |o| [o["name"], apply(o["scores"]) { |x| average(x) }] }'
# [{"name":"a","scores":[1,2]},{"name":"b","scores":[10,20]}] → [["a",1.5],["b",15.0]]

group_by(key_expr)

group_by(key_expr) { ... }

Groups rows by key expression and applies a reducer per group.

Without a block, collects rows into arrays (equivalent to group_by(key) { group }).

With a block, applies the given reducer independently per group. Inside the block, _ still refers to the surrounding row, and the current row is also yielded as the block parameter.

jrf 'group_by(_["status"])'
# {"status":200,"path":"/a"}, {"status":404,"path":"/b"}, {"status":200,"path":"/c"} → {"200":[{"status":200,"path":"/a"},{"status":200,"path":"/c"}],"404":[{"status":404,"path":"/b"}]}

jrf 'group_by(_["item"]) { |row| sum(row["count"] * row["price"]) }'
# {"item":"Apple","count":2,"price":100}, {"item":"Orange","count":3,"price":50}, {"item":"Apple","count":1,"price":100} → {"Apple":300,"Orange":150}

jrf 'group_by(_["status"]) { |row| average(row["latency"]) }'
# {"status":200,"latency":10}, {"status":404,"latency":50}, {"status":200,"latency":30} → {"200":20.0,"404":50.0}

LIMITATIONS

Aggregation built-ins accept ordinary Ruby expressions as arguments, but their results are not ordinary Ruby values during evaluation. They can appear as standalone values in reducer templates such as a stage result, an array, a hash, or a reducer-aware block, but they cannot be combined with operators or wrapped inside arbitrary Ruby expressions, leading to an error or an incorrect result.

Good examples:

count()
sum(_["x"])
sum(2 * _["x"])
sum(_["count"] * _["price"])
average(_.abs)
{total: sum(_["x"]), n: count()}
[count(), sum(_["x"])]
group_by(_["k"]) { {total: sum(_["x"]), n: count()} }
map_values { |v| sum(v) }

Bad examples:

1 + count()            # use: count() >> _ + 1
2 * sum(_["x"])        # use: sum(2 * _["x"])
sum(_["x"]).round      # use: sum(_["x"]) >> _.round
[1 + count()]          # use: count() >> [_ + 1]

RUBY LIBRARY

jrf can also be used as a Ruby library. Create a pipeline with Jrf.new, passing one or more procs as stages. The returned object is callable.

require "jrf"

# Extract and filter
j = Jrf.new(
  proc { select(_["status"] == 200) },
  proc { _["path"] }
)
j.call(input_array)  # => ["/a", "/c", "/d"]

# Aggregate
j = Jrf.new(proc { {total: sum(_["price"]), n: count()} })
j.call(input_array)  # => [{total: 1250, n: 42}]

# Local variables are captured via closure
threshold = 10
j = Jrf.new(proc { select(_["x"] > threshold) })

Inside each proc, _ is the current value and all built-in functions documented above are available.

The pipeline streams output when a block is given:

j = Jrf.new(proc { _["id"] })
j.call(input_array) { |value| puts value }

LICENSE

MIT

In jq, calculating the minimum over multiple objects can be written as jq -s 'map(.tid) | min too, however that does not scale as it buffers the entire input. ↩

Name		Name	Last commit message	Last commit date
Latest commit History 254 Commits
.github/workflows		.github/workflows
exe		exe
lib		lib
test		test
.gitignore		.gitignore
Gemfile		Gemfile
README.md		README.md
Rakefile		Rakefile
jrf.gemspec		jrf.gemspec

Folders and files

Latest commit

History

Repository files navigation

jrf - a JSON filter with the power and speed of Ruby

SYNOPSIS

WHY JRF?

INPUT AND OUTPUT

BUILT-IN FUNCTIONS

select(predicate)

flat

group

group(expr)

average(expr)

min(expr)

max(expr)

stdev(expr)

sum(expr)

count()

count(expr)

count_if(condition)

percentile(expr, 0.95)

percentile(expr, [0.1, 0.5, 0.9])

percentile(expr, 0.1.step(0.9, 0.4))

reduce(initial) { |acc, v| ... }

sort(key_expr)

sort(key_expr) { |a, b| ... }

sort

map { |x| ... }

map(collection) { |x| ... }

map_values { |v| ... }

map_values(collection) { |v| ... }

apply { |x| ... }

apply(collection) { |x| ... }

group_by(key_expr)

group_by(key_expr) { ... }

LIMITATIONS

RUBY LIBRARY

LICENSE

Footnotes

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages