Monday, March 26, 2012

Julia functions for the Rmath library

1.1 Signatures of the d-p-q-r functions in libRmath

Users of R are familiar with the functions for evaluating properties of and for sampling from probability distributions, the so-called d-p-q-r functions. The designation d-p-q-r reflects the fact that, for each distribution, there are up to 4 such functions, each beginning with one of these letters and followed by an abbreviated name of the distribution. The prefixes indicate:

d
the density of a continuous random variable or the probability mass function of a discrete random variable
p
the cumulative probability function, also called the cumulative distribution function or c.d.f.
q
the quantile function, which is the inverse of the c.d.f. defined on the interval (0, 1)
r
random sampling from the distribution

Members of R-core, notably Martin Maechler, have devoted considerable effort to ensuring that these functions are both reliable and general. Because they are part of an Open Source system, they can be used in other Open Source projects. In fact, there is the capability to collect these functions in a stand-alone library. One of the R packages for Debian Linux and distributions that are derived from it, such as Ubuntu, is r-mathlib which contains this stand-alone library. The header file for this library is $RHOME/include/Rmath.h on most systems (/usr/share/R/include/Rmath.h on Debian and derived distributions because of the rules for Debian packaging).

The names and arguments of these functions follow well-defined patterns. For example, those for the χ2 distribution have signatures

double  dchisq(double, double, int);
double  pchisq(double, double, int, int);
double  qchisq(double, double, int, int);
double  rchisq(double);

The first argument of dchisq is x, the abscissa, of pchisq is q, the quantile, and of qchisq is p, the probability. The next argument of the d-p-q functions, and the only argument of rchisq, is the distribution parameter, df, which is the degrees of freedom of the χ2. The last argument in the d-p-q functions controls whether the probability is on the logarithm scale. It is named give_log for the density function and log_p for the probability and quantile functions. (At first glance this argument may seem unnecessary as the probability density, for example, could be calculated on the probability scale then transformed to the logarithm scale. However, operations that are mathematically equivalent do not always produce the same result in floating point arithmetic. There are good reasons for these cases.)

The second last argument in the p and q functions determines whether probability is accumulated from the left, the lower tail, or from the right, the upper tail. The default is the lower tail. Again, there is an apparent redundancy because one can always subtract the probability in the lower tail from 1 to get the probability in the upper tail. However when x is extremely small but positive, 1 - x rounds to 1 and small upper-tail probabilities truncate too quickly. (Remember Kernighan and Plauger's aphorism that "10 times 0.1 is hardly ever 1" - that's what you live with in floating point computations.)

1.2 Calling these functions from Julia

It is surprisingly easy from within Julia to call C functions like these having simple signatures. First you call the Julia function dlopen to access the shared object file and save the handle

julia> _jl_libRmath = dlopen("libRmath")
Ptr{Void} @0x0000000003461050

I use an awkward, patterned name for this variable to avoid potential name conflicts. At present the Julia namespace is flat (but that is likely to change as namespaces are already being discussed).

Then a call to, say, pchisq needs only

julia> ccall(dlsym(_jl_libRmath,:pchisq),Float64,(Float64,Float64,Int32,Int32), 1.96^2, 1, true, false)
0.950004209703559

The first argument to the ccall function is the symbol extracted with dlsym, followed by the specific type of the function value, the argument signatures and the actual arguments. The actual arguments are converted to the signature shown before being passed to the C function. Thus the true is converted to 1 and the false to 0.

The value of this function call, 0.95000, is what we would expect, because a χ2 on 1 degree of freedom is the square of a standard normal distribution and the interval [-1.960, 1.960] contains approximately 95% of the probability of the standard normal.

1.3 Creating a Julia function for scalar arguments

Obviously, calling such a C function from the Rmath library is simpler if we wrap the call in a Julia function. We can define such a function as a one-liner (well, two lines but only because it would be a very long single line)

pchisq(q::Number, df::Number, lower_tail::Bool, log_p::Bool) =
   ccall(dlsym(_jl_libRmath,:pchisq),Float64,(Float64,Float64,Int32,Int32), q, df, lower_tail, log_p)

This illustrates one of the methods of defining a function in Julia,

fname(arg1, arg2, ...) = expression

This form, which reads much like the mathematical description f(x) = value, is useful when the function body is a single expression.

This code defines a particular method, with a reasonably general signature, for pchisq. The C function pchisq requires double precision values (called Float64 in Julia) for the first two arguments and integers for the last two. However, these integers are treated as Boolean values, so we require the arguments passed to the Julia function to be Boolean. The Number type in Julia is an abstract type incorporating all the integer and real number types (including complex and rational numbers, as it turns out). The Julia function, pchisq, can have methods of many different signatures. In a sense a function in Julia is just a name used to index into a method table.

We could have allowed unrestricted arguments, as in

pchisq(q, df, lower_tail, log_p) = <something>

but then we would need to define checks on the argument types, to ensure that they are appropriate and to check for vector versus scalar arguments, in the function body. It is better to use the method signature to check for general forms of arguments and to distinguish the scalar and vector cases. (Vector methods are defined in the next section).

This is one of the organizing principles of Julia; it is built around functions but actually all functions are methods chosen via multiple dispatch. So thinking in terms of signatures right off the bat is a good idea.

1.4 Default argument values

In R we can match arguments by names in a function call and we can define default values for arguments that are not given a value. At present neither of these capabilities is available in Julia. However, we can provide defaults for trailing arguments by creating methods with reduced signatures

pchisq(q::Number, df::Number, lower_tail::Bool) = pchisq(q, df, lower_tail, false)
pchisq(q::Number, df::Number) = pchisq(q, df, true, false)

The fact that the signature is recognized by position and not by using named actual arguments means that whenever log_p is true we must specify lower_tail, even if its value is the default value, true.

In the case of pchisq the distribution parameter, df, does not have a default value. For many other distributions there are default values of distribution parameters. For example, the distribution parameters for the logistic distribution are location and scale with defaults 0 and 1. Having defined the 5-argument scalar version of plogis

plogis(q::Number, l::Number, s::Number, lo::Bool, lg::Bool) = 
    ccall(dlsym(_jl_libRmath,:plogis),Float64,(Float64,Float64,Float64,Int32,Int32), q, l, s, lo, lg)

(for brevity we have used the names l, s, lo and lg for the last four arguments) we can define the methods with default values as

plogis(q::Number, l::Number, lo::Bool, lg::Bool) = plogis(q, l, 1, lo, lg)
plogis(q::Number, lo::Bool, lg::Bool) = plogis(q, 0, 1, true, false)
plogis(q::Number, l::Number, s::Number, lo::Bool) = plogis(q, l, s, lo, false)
plogis(q::Number, l::Number, lo::Bool) = plogis(q, l, 1, lo, false)
plogis(q::Number, lo::Bool) = plogis(q, 0, 1, lo, false)
plogis(q::Number, l::Number, s::Number) = plogis(q, l, s, true, false)
plogis(q::Number, l::Number) = plogis(q, l, 1, true, false)
plogis(q::Number) = plogis(q, 0, 1, true, false)

Because a Bool is not a Number, a signature such as (Number, Bool) can be distinguished from (Number, Number).

These method definitions allow for some combinations of default values but not all combinations. Having named arguments, possibly with default values, is still a more flexible system but these method signatures do handle the most important cases.

Defining all those methods can get tedious (not to mention providing the possibility of many transcription errors) so it is worthwhile scripting the process using the macro language for Julia. See the file extras/Rmath.jl in the Julia distribution for the macros that generate both these methods and the vectorized methods described in the next section.

1.5 Vectorizing the function

In R there are no scalars, only vectors of length 1, so, in that sense, all functions are defined for vector arguments. In addition, many functions, including the d-p-q functions, are vectorized in the sense that they create a result of the appropriate form from vector arguments whose length is greater than 1. Thus providing a vector of quantiles to pchisq will return a vector of probabilities. We would want the same to be true for the Julia functions.

In Julia scalars are distinct from vectors and we use the method dispatch to distinguish scalar and vector cases. I think a general programming paradigm for Julia is first to define a function for scalar arguments and then build the vectorized version from that. However, I don't yet have enough experience to say if this really is a general paradigm.

We can write methods for a vector q as

pchisq(q::Vector{Number}, df::Number, lower_tail::Bool, log_p::Bool) = 
    [ pchisq(q[i], df, lower_tail, log_p) | i=1:length(q) ]
pchisq(q::Vector{Number}, df::Number, lower_tail::Bool) = 
    [ pchisq(q[i], df, lower_tail, false) | i=1:length(q) ]
pchisq(q::Vector{Number}, df::Number) = 
    [ pchisq(q[i], df, true, false) | i=1:length(q) ]

These methods use a "comprehension" to specify the loop over the individual elements of the array producing another array. They could, of course, be written in a more conventional looping notation but the comprehension notation is powerful and compact, similar to the "apply" family of functions in R, so we use it here. Like the one-line function definition, the comprehension reads much like a mathematical expression to create an array of values of a certain form for i = 1,…,n.

By the way, experienced R programmers, who know not to write 1:length(q) because of unexpected results when length(q) is zero, need not worry. The sequence notation, a:b doesn't count down in Julia when a is greater than b so 1:0 has length zero. That is, 1:length(q) in Julia always produces the same answer as seq_along(q) does in R.

We could also vectorize the df argument which leads us to consider the case of vector df and scalar q and the case of vector df and vector q, etc. At this point we realize that we are seeing many variations on a theme and decide that it is best to script this vectorization. Fortunately the file base/operations.jl has macros _jl_vectorize_1arg and _jl_vectorize_2arg which we can adapt for this purpose.

1.6 The result

As of this writing, loading the file extras/Rmath.jl

julia> load("../extras/Rmath.jl")

defines

julia> whos()
R_pow                         Function
_jl_libRmath                  Ptr{None}
dbeta                         Function
dbinom                        Function
dcauchy                       Function
dchisq                        Function
dexp                          Function
df                            Function
dgamma                        Function
dgeom                         Function
dlnorm                        Function
dlogis                        Function
dnbinom                       Function
dnchisq                       Function
dnorm                         Function
dpois                         Function
dsignrank                     Function
dt                            Function
dunif                         Function
dweibull                      Function
dwilcox                       Function
pbeta                         Function
pbinom                        Function
pcauchy                       Function
pchisq                        Function
pexp                          Function
pf                            Function
pgamma                        Function
pgeom                         Function
plnorm                        Function
plogis                        Function
pnbinom                       Function
pnchisq                       Function
pnorm                         Function
ppois                         Function
psignrank                     Function
pt                            Function
punif                         Function
pweibull                      Function
pwilcox                       Function
qbeta                         Function
qbinom                        Function
qcauchy                       Function
qchisq                        Function
qexp                          Function
qf                            Function
qgamma                        Function
qgeom                         Function
qlnorm                        Function
qlogis                        Function
qnbinom                       Function
qnchisq                       Function
qnorm                         Function
qpois                         Function
qsignrank                     Function
qt                            Function
qunif                         Function
qweibull                      Function
qwilcox                       Function
rchisq                        Function
rgeom                         Function
rpois                         Function
rsignrank                     Function
rt                            Function
set_seed                      Function

and a function like plogis provides many method signatures

julia> plogis
Methods for generic function plogis
plogis(Number,Number,Number,Bool,Bool) at /home/bates/build/julia/extras/../extras/Rmath.jl:275
plogis(Number,Number,Bool,Bool) at /home/bates/build/julia/extras/../extras/Rmath.jl:281
plogis(Number,Number,Number,Bool) at /home/bates/build/julia/extras/../extras/Rmath.jl:277
plogis(Number,Bool,Bool) at /home/bates/build/julia/extras/../extras/Rmath.jl:287
plogis(Number,Number,Bool) at /home/bates/build/julia/extras/../extras/Rmath.jl:283
plogis(Number,Number,Number) at /home/bates/build/julia/extras/../extras/Rmath.jl:279
plogis(Number,Bool) at /home/bates/build/julia/extras/../extras/Rmath.jl:289
plogis(Number,Number) at /home/bates/build/julia/extras/../extras/Rmath.jl:285
plogis(Number,) at /home/bates/build/julia/extras/../extras/Rmath.jl:291
plogis{T1<:Number,T2<:Number}(T1<:Number,AbstractArray{T2<:Number,N}) at operators.jl:162
plogis{T1<:Number,T2<:Number}(AbstractArray{T1<:Number,N},T2<:Number) at operators.jl:165
plogis{T1<:Number,T2<:Number}(AbstractArray{T1<:Number,N},AbstractArray{T2<:Number,N}) at operators.jl:169
plogis{T1<:Number,T2<:Number,T3<:Number}(AbstractArray{T1<:Number,N},T2<:Number,T3<:Number) at /home/bates/build/julia/extras/../extras/Rmath.jl:7
plogis{T1<:Number,T2<:Number,T3<:Number}(T1<:Number,AbstractArray{T2<:Number,N},T3<:Number) at /home/bates/build/julia/extras/../extras/Rmath.jl:10
plogis{T1<:Number,T2<:Number,T3<:Number}(T1<:Number,T2<:Number,AbstractArray{T3<:Number,N}) at /home/bates/build/julia/extras/../extras/Rmath.jl:13
plogis{T1<:Number,T2<:Number,T3<:Number}(AbstractArray{T1<:Number,N},AbstractArray{T2<:Number,N},T3<:Number) at /home/bates/build/julia/extras/../extras/Rmath.jl:16
plogis{T1<:Number,T2<:Number,T3<:Number}(T1<:Number,AbstractArray{T2<:Number,N},AbstractArray{T3<:Number,N}) at /home/bates/build/julia/extras/../extras/Rmath.jl:20
plogis{T1<:Number,T2<:Number,T3<:Number}(AbstractArray{T1<:Number,N},T2<:Number,AbstractArray{T3<:Number,N}) at /home/bates/build/julia/extras/../extras/Rmath.jl:24

You will see that the vectorized methods have slightly more general signatures involving AbstractArrays of Numbers, which also handles cases such as matrix arguments.

The methods allow for calling plogis in a way that is natural to R programmers

julia> plogis(-1:0.2:1, 0)
[0.268941, 0.310026, 0.354344, 0.401312, 0.450166, 0.5, 0.549834, 0.598688, 0.645656, 0.689974, 0.731059]

julia> plogis(-1:0.2:1, 0, 2)
[0.377541, 0.401312, 0.425557, 0.450166, 0.475021, 0.5, 0.524979, 0.549834, 0.574443, 0.598688, 0.622459]

(In running that example I realized that I had omitted some cases from my macros. That will be repaired "in the fullness of time" as Bill Venables is wont to say.)

6 comments:

  1. Thank's for this useful post...julia is very interesting (my thir choice after R and C++)

    Just a small typo:
    pchisq(q::Number, df::Number, lower_tail::Bool, log_p::Bool) =
    ccall(dlsym(_jl_libRmath,:pchisq),Float64,(Float64,Float64,Int32,Int32), x, df, lower_tail, log_p)

    Should be :
    pchisq(q::Number, df::Number, lower_tail::Bool, log_p::Bool) =
    ccall(dlsym(_jl_libRmath,:pchisq),Float64,(Float64,Float64,Int32,Int32), q, df, lower_tail, log_p)

    "q" instead of "x" inside the function.

    ReplyDelete
  2. Yes, thanks for catching that. Having spent the day in various airports I had enough time to read over the posting and edit and expand it. So I am going to do a major update including fixing that typo.

    ReplyDelete
  3. For some reason, I cannot load the Rmath bundle:

    julia> load("Rmath.jl")
    dlopen(libRmath.bundle, 2): image not found
    could not load module libRmath
    in dlopen at base.jl:98
    in include at src/boot.jl:198
    in load at util.jl:176
    in load at util.jl:188
    at Rmath.jl:2
    in load at util.jl:199

    this is despite having R_HOME defined and Rmath.h in $R_HOME/include. Am I missing a step?

    ReplyDelete
  4. But do you have libRmath compiled as a separate library? On an Ubuntu system I can install the r-mathlib package which creates /usr/lib/libRmath.so. As a result I have forgotten how to create the separate library and I can't see it immediately. You may find it described in the R Administration manual available at any of the CRAN sites such as cran.us.R-project.org

    ReplyDelete
    Replies
    1. Installation of the separate math library is described in chapter 9 of the R Administration and Installation manual, available under the Manuals tab at the left side of any of the CRAN sites.

      Delete