LazyGrids ndgrid

This page explains the ndgrid method(s) in the Julia package LazyGrids.

This page comes from a single Julia file: 1-ndgrid.jl.


Setup

Packages needed here.

using LazyGrids: ndgrid, ndgrid_array
using LazyGrids: btime, @timeo # not exported; just for timing tests here
using BenchmarkTools: @benchmark
using InteractiveUtils: versioninfo

Overview

We begin with simple illustrations.

The Julia method ndgrid_array in this package is comparable to Matlab's ndgrid function. It is given a long name here to discourage its use, because the lazy ndgrid version is preferable. The package provides ndgrid_array mainly for testing and timing comparisons.

(xa, ya) = ndgrid_array(1.0:3.0, 1:2)
([1.0 1.0; 2.0 2.0; 3.0 3.0], [1 2; 1 2; 1 2])
xa
3×2 Matrix{Float64}:
 1.0  1.0
 2.0  2.0
 3.0  3.0
ya
3×2 Matrix{Int64}:
 1  2
 1  2
 1  2

This package provides a "lazy" version of ndgrid that appears to the user to be the same, but under the hood it is not storing huge arrays.

(xl, yl) = ndgrid(1.0:3.0, 1:2)
([1.0 1.0; 2.0 2.0; 3.0 3.0], [1 2; 1 2; 1 2])
xl
3×2 LazyGrids.GridSL{Float64, 1, 2, Base.TwicePrecision{Float64}, Base.TwicePrecision{Float64}}:
 1.0  1.0
 2.0  2.0
 3.0  3.0
yl
3×2 LazyGrids.GridUR{Int64, 2, 2}:
 1  2
 1  2
 1  2

The following example illustrates the memory savings (thanks to Julia's powerful AbstractArray type):

(xl, yl) = ndgrid(1:100, 1:200)
(xa, ya) = ndgrid_array(1:100, 1:200)
sizeof(xl), sizeof(xa)
(24, 160000)
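Note that `sizeof` reports only the immediate size of each object's top-level structure. `Base.summarysize` also counts memory reachable through references, so it is a fairer "total footprint" comparison, and (re-creating the grids above as a sketch) it still shows that the lazy version is tiny:

```julia
using LazyGrids: ndgrid, ndgrid_array

(xl, yl) = ndgrid(1:100, 1:200)
(xa, ya) = ndgrid_array(1:100, 1:200)
# summarysize includes memory reachable through references,
# unlike sizeof, which counts only the top-level struct:
Base.summarysize(xl), Base.summarysize(xa)
```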

One can do everything with a lazy array that one would expect from a "normal" array, e.g., multiplication and summation:

sum(xl * xl'), sum(xa * xa')
(5100500000, 5100500000)
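To see why so little memory suffices, here is a minimal sketch (not the package's actual implementation) of how a lazy grid can subtype `AbstractArray`: it stores only the underlying vector and the grid dimensions, and computes each element on demand in `getindex`:

```julia
# Toy lazy grid that varies along dimension `d` of an N-dimensional grid.
# Illustrative only; LazyGrids uses more elaborate (and faster) types.
struct ToyGrid{T,d,N} <: AbstractArray{T,N}
    v::Vector{T}
    dims::NTuple{N,Int}
end
Base.size(g::ToyGrid) = g.dims
Base.getindex(g::ToyGrid{T,d,N}, i::Vararg{Int,N}) where {T,d,N} = g.v[i[d]]

xt = ToyGrid{Float64,1,2}([1.0, 2.0, 3.0], (3, 2))
xt == [1.0 1.0; 2.0 2.0; 3.0 3.0] # behaves like a full matrix
```

Because `ToyGrid` implements `size` and `getindex`, every generic `AbstractArray` operation (indexing, broadcast, reductions) works on it without ever materializing the full matrix.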

Using lazy ndgrid

Many applications with multiple variables involve evaluating functions over a grid of values.

As a simple example (for illustration), one can numerically approximate the area of the unit circle by sampling that circle over a grid of x,y values, corresponding to numerical evaluation of the double integral $∫ ∫ 1_{\{x^2 + y^2 < 1\}} \, dx \, dy$. There are many ways to implement this approximation in Julia, given a vector of x and y samples.

Δ = 1/2^10
x = range(-1, stop=1, step=Δ)
y = copy(x)

@inline circle(x::Real, y::Real) = abs2(x) + abs2(y) < 1
@inline circle(xy::NTuple{2}) = circle(xy...)
circle (generic function with 2 methods)

The documentation below has many timing comparisons. The times in the Julia comments are from a 2017 iMac running Julia 1.6.1; the times printed out are from whatever server GitHub Actions uses. Using a trick to capture the output, let's find out:

io = IOBuffer()
versioninfo(io)
split(String(take!(io)), '\n')
12-element Vector{SubString{String}}:
 "Julia Version 1.10.3"
 "Commit 0b4590a5507 (2024-04-30 10:59 UTC)"
 "Build Info:"
 "  Official https://julialang.org/ release"
 "Platform Info:"
 "  OS: Linux (x86_64-linux-gnu)"
 "  CPU: 4 × AMD EPYC 7763 64-Core Processor"
 "  WORD_SIZE: 64"
 "  LIBM: libopenlibm"
 "  LLVM: libLLVM-15.0.7 (ORCJIT, znver3)"
 "Threads: 1 default, 0 interactive, 1 GC (on 4 virtual cores)"
 ""

A basic double loop is the C/Fortran way. It uses minimal memory (only 48 bytes) but is somewhat slow.

function method0(x,y) # basic double loop
    sum = 0.0
    for x in x, y in y
        sum += circle(x,y)
    end
    return sum * Δ^2
end

area0 = method0(x,y)
t = @benchmark method0($x,$y) # 10.5 ms (3 allocations: 48 bytes)
btime(t)
"time=10.3ms mem=48 alloc=3"
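For reference, here is a self-contained variant of the double loop that passes the grid spacing as an argument instead of reading the global `Δ` (hypothetical names `method0b` and `circle0`; same computation):

```julia
# Variant of the basic double loop that takes the grid spacing as an
# argument, rather than reading a non-const global inside the function.
circle0(x::Real, y::Real) = abs2(x) + abs2(y) < 1

function method0b(x, y, Δ)
    s = 0.0
    for xi in x, yi in y
        s += circle0(xi, yi)
    end
    return s * Δ^2
end

Δ = 1/2^10
x = range(-1, 1; step=Δ)
method0b(x, x, Δ) # approximately π, the unit-circle area
```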

The loop version does not look much like the math. It often seems natural to think of a grid of x,y values and simply sum over that grid, accounting for the grid spacing, using a function like this:

area(xx,yy) = sum(circle.(xx,yy)) * Δ^2
area (generic function with 1 method)

Users coming from Matlab who are unfamiliar with its newer broadcast (implicit expansion) capabilities might use an ndgrid of arrays, as in the following code, to compute the area. But this array approach is much slower and uses much more memory, so it does not scale well to higher dimensions.

function area_array(x, y)
    (xa, ya) = ndgrid_array(x, y)
    return area(xa, ya)
end
@assert area_array(x, y) ≈ area0
t = @benchmark area_array($x, $y) # 21.4 ms (11 allocations: 64.57 MiB)
btime(t)
"time=41.0ms mem=67704816 alloc=20"

To be fair, one might have multiple uses of the grids xa,ya, so perhaps their creation should be excluded from the timing. Separating out that allocation makes the timing look faster, but it still uses a lot of memory, both for allocating the grids and for the circle. broadcast in the area function above:

(xa, ya) = ndgrid_array(x, y)
@assert area(xa, ya) ≈ area0
t = @benchmark area($xa, $ya) # 5.2 ms (7 allocations: 516.92 KiB)
btime(t)
"time=2.4ms mem=529296 alloc=7"

The main point of this package is to provide a lazy version of ndgrid that uses minimal memory.

(xl, yl) = ndgrid(x, y)
@assert xl == xa
@assert yl == ya
sizeof(xa), sizeof(xl)
(33587208, 64)

Now there is essentially no memory overhead for the grids, but memory is still used for the circle. broadcast.

@assert area(xl,yl) ≈ area0
t = @benchmark area($xl,$yl) # 3.7 ms (7 allocations: 516.92 KiB)
btime(t)
"time=12.1ms mem=529296 alloc=7"

Furthermore, creating this lazy ndgrid is so efficient that we can include its construction time and still have performance comparable to the array version that had pre-allocated arrays.

function area_lazy(x, y)
    (xl, yl) = ndgrid(x, y)
    return area(xl, yl)
end
@assert area_lazy(x, y) ≈ area0
t = @benchmark area_lazy($x, $y) # 3.7 ms (7 allocations: 516.92 KiB)
btime(t)
"time=12.1ms mem=529296 alloc=7"

More details

The comparisons below might be of more interest to the package developers than to most users...

One can preallocate memory to store the circle. array, to avoid additional memory during the area calculation:

out = Array{Float64}(undef, length(x), length(y))
function area!(xx, yy)
    global out .= circle.(xx,yy)
    return sum(out) * Δ^2
end
@assert area!(xl,yl) ≈ area0
t = @benchmark area!(xl,yl) # 4.8 ms (4 allocations: 128 bytes)
btime(t)
"time=14.7ms mem=192 alloc=4"

Interestingly, the lazy version is faster than the array version, presumably because of the overhead of moving data from RAM to the CPU:

@assert area!(xa,ya) ≈ area0
t = @benchmark area!(xa,ya) # 6.2 ms (4 allocations: 80 bytes)
btime(t)
"time=3.6ms mem=80 alloc=4"

One can avoid allocating the output array by using a loop with CartesianIndices:

function area_ci(xx, yy)
    size(xx) == size(yy) || throw("size")
    sum = 0.0
    @inbounds for c in CartesianIndices(xx)
        sum += circle(xx[c], yy[c])
    end
    return sum * Δ^2
end
area_ci (generic function with 1 method)

With this approach the lazy version is a bit faster than the array version:

@assert area_ci(xl,yl) ≈ area0
t = @benchmark area_ci(xl,yl) # 5.2 ms (3 allocations: 48 bytes)
btime(t)
"time=21.3ms mem=48 alloc=3"
@assert area_ci(xa,ya) ≈ area0
t = @benchmark area_ci(xa,ya) # 5.9 ms (3 allocations: 48 bytes)
btime(t)
"time=5.7ms mem=48 alloc=3"

Alternatively, one can use a linear-index for loop that also avoids the extra memory of the circle. broadcast above, but is slower, especially for the lazy grids, which are optimized for Cartesian indexing:

function area_for2(xx,yy)
    size(xx) == size(yy) || throw("size")
    sum = 0.0
    @inbounds for i in 1:length(xx)
        sum += circle(xx[i], yy[i])
    end
    return sum * Δ^2
end
@assert area_for2(xa, ya) ≈ area0
t = @benchmark area_for2($xa, $ya) # 5.9 ms (3 allocations: 48 bytes)
btime(t)
"time=5.5ms mem=48 alloc=3"
@assert area_for2(xl, yl) ≈ area0
t = @benchmark area_for2($xl, $yl) # 15.4 ms (3 allocations: 48 bytes)
btime(t)
"time=32.4ms mem=48 alloc=3"
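One way to see why linear indexing is slower here is to query each array's preferred `IndexStyle`. A dense `Matrix` prefers linear indexing, whereas the lazy grids presumably prefer Cartesian indexing (the default for wrapper types that compute elements from coordinates); a quick check with throwaway names to avoid clobbering the grids above:

```julia
using LazyGrids: ndgrid, ndgrid_array

# Small throwaway grids just for querying the trait:
(xls, _) = ndgrid(1:3, 1:2)
(xas, _) = ndgrid_array(1:3, 1:2)
Base.IndexStyle(typeof(xas)) # IndexLinear() for a dense Matrix
Base.IndexStyle(typeof(xls)) # expected: IndexCartesian() for the lazy grid
```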

Some Julia users would recommend using broadcast. In this case, broadcast is reasonably fast, but still uses a lot of memory for the circle. output in the simplest implementation.

areab(x,y) = sum(circle.(x,y')) * Δ^2
@assert areab(x,y) ≈ area0
t = @benchmark areab($x,$y) # 11.6 ms (7 allocations: 516.92 KiB)
btime(t)
"time=9.7ms mem=529296 alloc=7"

Using zip can avoid the "extra" memory beyond the grids, but seems to have some undesirable overhead, presumably because zip uses linear indexing:

areaz(xa, ya) = sum(circle, zip(xa,ya)) * Δ^2
@assert areaz(xa, ya) ≈ area0
t = @benchmark areaz($xa, $ya) # 3.9 ms (3 allocations: 48 bytes)
btime(t)
"time=1.7ms mem=48 alloc=3"
@assert areaz(xl, yl) ≈ area0
t = @benchmark areaz($xl, $yl) # 12.2 ms (3 allocations: 48 bytes)
btime(t)
"time=39.6ms mem=48 alloc=3"

One can also ensure low memory by using a product iterator, but at this point the code starts to look quite different from the math, and it is not much faster than broadcast here.

areap(x,y) = sum(circle, Iterators.product(x, y)) * Δ^2
@assert areap(x,y) ≈ area0
t = @benchmark areap($x, $y) # 9.9 ms (3 allocations: 48 bytes)
btime(t)
"time=10.0ms mem=48 alloc=3"
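The lazy ndgrid can be viewed as `Iterators.product` split into separate index grids: each output array broadcasts its input vector along the other dimensions, i.e., `xl[i,j] == x[i]` and `yl[i,j] == y[j]`. A quick sketch check (using throwaway names to avoid clobbering the vectors above):

```julia
using LazyGrids: ndgrid

xs, ys = 1.0:3.0, 1:2
(xls, yls) = ndgrid(xs, ys)
# each lazy grid repeats its vector along the other dimension:
all(xls[i,j] == xs[i] && yls[i,j] == ys[j]
    for i in eachindex(xs), j in eachindex(ys))
```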

3D case

A 3D example is finding (verifying) the volume of a unit sphere.

sphere(x::Real,y::Real,z::Real) = abs2(x) + abs2(y) + abs2(z) < 1
sphere(r::NTuple) = sum(abs2, r) < 1
sphere (generic function with 2 methods)

Storing three 3D arrays of size 2049^3 Float64 would take 192GB, so already we must greatly reduce the sampling to use either broadcast or ndgrid_array. Furthermore, the broadcast requires annoying reshape steps:

Δc = 1/2^8 # coarse grid
xc = range(-1, stop=1, step=Δc)
yc = xc
zc = xc
nc = length(zc)
3 * nc^3 * 8 / 1024^3 # GB prediction
3.0176124796271324

Here is broadcast in 3D (yuck!):

vol_br(x,y,z,Δ) = sum(sphere.(
        repeat(x, 1, length(y), length(z)),
        repeat(reshape(y, (1, :, 1)), length(x), 1, length(z)),
        repeat(reshape(z, (1, 1, :)), length(x), length(y), 1),
    )) * Δ^3
vol_br([0.],[0.],[0.],Δc) # warm-up
@timeo vol0 = vol_br(xc,yc,zc,Δc) # 2.7 sec, 3.0 GiB, roughly (4/3)π
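As an aside, broadcasting itself can expand singleton dimensions, so the repeat calls above are needed only to mimic full ndgrid-style arrays. A leaner (hypothetical) broadcast avoids materializing the three grids, though it still allocates the output of the sphere. broadcast; this sketch uses fresh names (sphere3, vol_br2, a coarser Δb) to stay self-contained:

```julia
# Hypothetical leaner broadcast: reshape creates singleton dimensions
# that broadcasting expands "for free", with no repeat allocations.
sphere3(x, y, z) = abs2(x) + abs2(y) + abs2(z) < 1

vol_br2(x, y, z, Δ) =
    sum(sphere3.(x, reshape(y, 1, :, 1), reshape(z, 1, 1, :))) * Δ^3

Δb = 1/2^6
xb = range(-1, 1; step=Δb)
vol_br2(xb, xb, xb, Δb) # roughly (4/3)π, the unit-sphere volume
```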

function vol_ci(xx, yy, zz, Δ)
    size(xx) == size(yy) == size(zz) || throw("size")
    sum = 0.0
    @inbounds for c in CartesianIndices(xx)
        sum += sphere(xx[c], yy[c], zz[c])
    end
    return sum * Δ^3
end
vol_ci (generic function with 1 method)

Here is the lazy version:

(xlc, ylc, zlc) = ndgrid(xc, yc, zc) # warm-up
@timeo (xlc, ylc, zlc) = ndgrid(xc, yc, zc); # 0.000022 sec (1.8 KiB)
vol_ci([0.], [0.], [0.], Δc) # warm-up
@timeo vol_ci(xlc, ylc, zlc, Δc) # 0.2 sec, 1.4 MiB
"time=1347.2ms mem=6163112 alloc=89010"

Creating the grid of arrays itself is quite slow, even for the coarse grid:

(xac, yac, zac) = ndgrid_array(xc, yc, zc) # warm-up
@timeo (xac, yac, zac) = ndgrid_array(xc, yc, zc) # 1.8 sec 3.0GiB
"time=2032.0ms mem=3240139312 alloc=24"

Once created, the array version is no faster than the lazy version:

@timeo vol_ci(xac, yac, zac, Δc) # 0.2 seconds (1.1 MiB)
"time=214.6ms mem=692752 alloc=9326"

Using zip is more concise (but slower):

vol_zip(xx, yy, zz, Δ) = sum(sphere, zip(xx,yy,zz)) * Δ^3
vol_zip (generic function with 1 method)
vol_zip([0.], [0.], [0.], Δc) # warm-up
@timeo vol_zip(xlc, ylc, zlc, Δc) # 1.0 sec, 26 MiB
"time=2102.3ms mem=17644480 alloc=275116"

Using zip with the array version seems to have less overhead, so reducing the lazy version's zip overhead is a potential future improvement:

@assert vol_zip(xac, yac, zac, Δc) ≈ vol0
@timeo vol_zip(xac, yac, zac, Δc) # 0.19 sec, 16 byte
"time=95.5ms mem=16 alloc=1"

Importantly, with the lazy ndgrid we can now return to the fine scale; it takes a while, but it is feasible because of the low memory use.

z = copy(x)
@timeo (xlf, ylf, zlf) = ndgrid(x, y, z) # 0.000023 sec
"time=0.0ms mem=2384 alloc=19"
@timeo vol_ci(xlf, ylf, zlf, Δ) # 12.7 sec, 16 bytes
"time=82374.1ms mem=16 alloc=1"

I was hoping that with a lazy grid we could explore higher-dimensional spheres, but with the current zip overhead it was too slow, even with the coarse grid.

@timeo (π^2/2, sum(sphere, zip(ndgrid(xc,xc,xc,xc)...)) * Δc^4)

Probably I need to learn more about constructs like pairs(IndexCartesian(), A); see, e.g., this PR. Another day...
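For reference, `pairs(IndexCartesian(), A)` iterates (CartesianIndex, value) pairs, which in principle allows Cartesian-style traversal of several arrays at once. A small illustration of the Base iterator (unrelated to LazyGrids internals):

```julia
A = reshape(1:6, 2, 3) # columns: [1 3 5; 2 4 6]
# sum the first row by filtering on the CartesianIndex of each pair:
s = sum(v for (c, v) in pairs(IndexCartesian(), A) if c[1] == 1)
s # 1 + 3 + 5 = 9
```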

Reproducibility

This page was generated with the following version of Julia:

using InteractiveUtils: versioninfo
io = IOBuffer(); versioninfo(io); split(String(take!(io)), '\n')
12-element Vector{SubString{String}}:
 "Julia Version 1.10.3"
 "Commit 0b4590a5507 (2024-04-30 10:59 UTC)"
 "Build Info:"
 "  Official https://julialang.org/ release"
 "Platform Info:"
 "  OS: Linux (x86_64-linux-gnu)"
 "  CPU: 4 × AMD EPYC 7763 64-Core Processor"
 "  WORD_SIZE: 64"
 "  LIBM: libopenlibm"
 "  LLVM: libLLVM-15.0.7 (ORCJIT, znver3)"
 "Threads: 1 default, 0 interactive, 1 GC (on 4 virtual cores)"
 ""

And with the following package versions:

import Pkg; Pkg.status()
Status `~/work/LazyGrids.jl/LazyGrids.jl/docs/Project.toml`
  [6e4b80f9] BenchmarkTools v1.5.0
  [e30172f5] Documenter v1.4.1
  [7031d0ef] LazyGrids v1.0.0 `~/work/LazyGrids.jl/LazyGrids.jl`
  [98b081ad] Literate v2.18.0
  [b77e0a4c] InteractiveUtils

This page was generated using Literate.jl.