Saturday, April 7, 2012

An R programmer looks at Julia

In January of this year I first saw mention of the Julia language in the release notes for LLVM. I mentioned this to Dirk Eddelbuettel and later we got in contact with Viral Shah regarding a Debian package for Julia.

There are many aspects of Julia that are quite intriguing to an R programmer. I am interested in programming languages for "Computing with Data", in John Chambers' term, or "Technical Computing", as the authors of Julia classify it. I believe that learning a programming language is somewhat like learning a natural language in that you need to live with it and use it for a while before you feel comfortable with it and with the culture surrounding it.

A common complaint for those learning R is finding the name of the function to perform a particular task. In writing a bit of Julia code for fitting generalized linear models, as described below, I found myself in exactly the same position of having to search through documentation to find how to do something that I felt should be simple. The experience is frustrating but I don't know of a way of avoiding it. One word of advice for R programmers looking at Julia, the names of most functions correspond to the Matlab/octave names, not the R names. One exception is the d-p-q-r functions for distributions, as I described in an earlier posting.

Evaluating a new language

There is a temptation to approach a new programming language with the idea that it should be exactly like the language I am currently using but better, faster, easier, ... This is, of course, ridiculous. If you want R syntax, semantics, function names, and packages then use R. If you consider another language you should accept that many tasks will be done differently. Some will be done better than in your current language and some will not be done as well. And the majority will just be done differently.

Two primary Julia developers, Jeff Bezanson and Stefan Karpinski, recently gave a presentation at the lang.NEXT conference. The slides and video can help you get a flavor of the language.

So, what does Julia provide that R doesn't?

  • A JIT. The first time that a function is called with a particular argument signature it is compiled to machine code using the LLVM compiler backend. This, naturally, provides a speed boost. It can also affect programming style. In R one instinctively writes vectorized code to avoid a performance hit. In Julia non-vectorized code can have similar and, under some circumstances, better performance than vectorized code.

  • A richer type system. Scalars, vectors/arrays, tuples, composite types and several others can be defined in Julia. In R the atomic types are actually atomic vectors and the composite types are either lists (which is actually a type of vector in R, not a lisp-like list), vectors with attributes (called structures in R but not to be confused with C's structs). The class systems, S3 and S4, are built on top of these structures.

    Currently it is difficult to work with 64-bit integers in R but not in Julia

  • Multiple dispatch. In Julia all functions provide multiple dispatch to handle different signatures of arguments. In some sense, functions are just names used to index into method tables. In R terminology all functions are generic functions with S4-like dispatch, not S3-like which only dispatches on the first argument.

  • Parallelism. A multiple process model and distributed arrays are a basic part of the Julia language.

  • A cleaner model for accessing libraries of compiled code. As Jeff and Stefan point out, languages like R and Matlab/octave typically implement the high-level operations in "glue" code that, in turn, accesses library code. In Julia one can short-circuit this process and call the library code from the high-level language. See the earlier blog posting on accessing the Rmath library from Julia. This may not seem like a big deal unless you have written thousands of lines of glue code, as I have.

There are some parts of R not present in Julia that programmers may miss.

  • Default arguments of functions and named actual arguments in function calls. The process of matching actual arguments to formal arguments in R function calls is richer. All matching in Julia is by position within function signature. Function design with many arguments requires more thought and care in Julia than in R.

  • The R package system and CRAN. One could also throw in namespaces and function documentation standards as part of the package system. These are important parts of R that are not yet available. However, that does not preclude simular facilities being developed for Julia. The R package system did not spring fully grown from the brow of Zeus.

  • Graphics. One of the big advantages of R over similar languages is the sophisticated graphics available in ggplot2 or lattice. Work is underway on graphics models for Julia but, again, it is early days still.

An example: Generalized Linear Models (GLMs)

Most R users are familiar with the basic model-fitting functions like lm() and glm(). These functions call model.frame and model.matrix to form the numerical representation of the model from the formula and data specification, taking into account optional arguments such as subset, na.action and contrasts. After that they call the numerical workhorses, lm.fit and glm.fit. What I will describe is a function like glm.fit (the name in Julia cannot include a period and I call it glmFit) and structures like the glm "family" in R.

A glm family is a collection of functions that describe the distribution of the response, given the mean, and the link between the linear predictor and the mean response. The link function takes the mean response, mu, to the "linear predictor", eta, of the form

X %*% beta  # in R

or

X * beta    # in Julia

where X is the model matrix and beta is the coefficient vector. In practice we more frequently evaluate the inverse link, from eta to mu, and its derivative.

We define a composite type

type Link
    name::String          # name of the link
    linkFun::Function     # link function  mu -> eta
    linkInv::Function     # inverse link  eta -> mu
    muEta::Function       # derivative    eta -> d mu/d eta
end

with specific instances

logitLink =
    Link("logit",
        mu  -> log(mu ./ (1. - mu)),
        eta -> 1. ./ (1. + exp(-eta)),
        eta -> (e = exp(-abs(eta)); f = 1. + e; e ./ (f .* f)))

logLink =
    Link("log",
        mu  -> log(mu),
        eta -> exp(eta),
        eta -> exp(eta))


identityLink =
    Link("identity",
        mu  -> mu,
        eta -> eta,
        eta -> ones(eltype(eta), size(eta)))

inverseLink =
    Link("inverse",
        mu  ->  1. ./ mu,
        eta ->  1. ./ eta,
        eta -> -1. ./ (eta .* eta))

for the common links. Here I have used the short-cut syntax

mu -> log(mu ./ (1. - mu))

to create anonymous functions. The muEta function for the logitLink shows the use of multiple statements within the body of the anonymous function. Like R, the last expression evaluated in a function is the value of a Julia function.

I also ensure that the functions in the Link object are safe for vector arguments by using the componentwise-forms of multiplicative operators ("./" and ".*"). These forms apply to scalars as well as vector. I use explicit floating-point constants, such as 1., to give the compiler a hint that the result should be a floating-point scalar or vector.

A distribution characterizes the variance as a function of the mean, and the "deviance residuals" used to evaluate the model-fitting criterion. For some distributions, including the Gaussian distribution the model-fitting criterion (residual sum of squares) is different from the deviance itself but minimizing this simpler criterion provides the maximum-likelihood (or minimum deviance) estimates of the coefficients.

When we eventually fit the model we want the value of the deviance itself, which may not be the same as the sum of squared deviance residuals. Other functions are used to check for valid mean vectors or linear predictors. Also each distribution in what is called the "exponential family", which includes many common distributions like Bernoulli, binomial, gamma, Gaussian and Poisson, has a canonical link function derived from the distribution itself.

The composite type representing the distribution is

type Dist
    name::String             # the name of the distribution
    canonical::Link          # the canonical link for the distribution
    variance::Function       # variance function mu -> var
    devResid::Function       # vector of squared deviance residuals
    deviance::Function       # the scalar deviance
    mustart::Function        # derive a starting estimate for mu
    validmu::Function        # check validity of the mu vector
    valideta::Function       # check validity of the eta vector
end

with specific distributions defined as

## utilities used in some distributions
logN0(x::Number) = x == 0 ? x : log(x)
logN0{T<:Number}(x::AbstractArray{T}) = reshape([ logN0(x[i]) | i=1:numel(x) ], size(x))
y_log_y(y, mu) = y .* logN0(y ./ mu)    # provides correct limit at y == 0

BernoulliDist =
    Dist("Bernoulli",
         logitLink,
         mu  -> max(eps(Float64), mu .* (1. - mu)),
         (y, mu, wt)-> 2 * wt .* (y_log_y(y, mu) +  y_log_y(1. - y, 1. - mu)),
         (y, mu, wt)-> -2. * sum(y .* log(mu) + (1. - y) .* log(1. - mu)),
         (y, wt)-> (wt .* y + 0.5) ./ (wt + 1.),
         mu  -> all((0 < mu) & (mu < 1)),
         eta -> true)
gammaDist =
    Dist("gamma",
        inverseLink,
        mu -> mu .* mu,
        (y, mu, wt)-> -2 * wt .* (logN0(y ./ mu) - (y - mu) ./ mu),
        (y, mu, wt)-> (n=sum(wt); disp=sum(-2 * wt .* (logN0(y ./ mu) - (y - mu) ./ mu))/n; invdisp(1/disp); sum(wt .* dgamma(y, invdisp, mu * disp, true))),
        (y, wt)-> all(y > 0) ? y : error("non-positive response values not allowed for gammaDist"),
        mu  -> all(mu > 0.),
        eta -> all(eta > 0.))

GaussianDist =
    Dist("Gaussian",
         identityLink,
         mu  -> ones(typeof(mu), size(mu)),
         (y, mu, wt)-> (r = y - mu; wt .* r .* r),
         (y, mu, wt)-> (n = length(mu); r = y - mu; n * (log(2*pi*sum(wt .* r .* r)/n) + 1) + 2 - sum(log(wt))),
         (y, wt)-> y,
         mu  -> true,
         eta -> true)

PoissonDist =
    Dist("Poisson",
         logLink,
         mu  -> mu,
         (y, mu, wt)-> 2 * wt .* (y .* logN0(y ./ mu) - (y - mu)),
         (y, mu, wt)-> -2 * sum(dpois(y, mu, true) * wt),
         (y, mu)-> y + 0.1,
         mu  -> all(mu > 0),
         eta -> true)

Finally the GlmResp type consists of the distribution, the link, the response, the mean and linear predictor. Occasionally we want to add an offset to the linear predictor expression or apply prior weights to the responses and these are included.

type GlmResp                            # response in a glm model
    dist::Dist                  
    link::Link
    eta::Vector{Float64}        # linear predictor
    mu::Vector{Float64}         # mean response
    offset::Vector{Float64}     # offset added to linear predictor (usually 0)
    wts::Vector{Float64}        # prior weights
    y::Vector{Float64}          # response
end

With just this definition we would need to specify the offset, wts, etc. every time we construct such an object. By defining an outer constructor (meaning a constructor that is defined outside the type definition) we can provide defaults

## outer constructor - the most common way of creating the object
function GlmResp(dist::Dist, link::Link, y::Vector{Float64})
    n  = length(y)
    wt = ones(Float64, (n,))
    mu = dist.mustart(y, wt)
    GlmResp(dist, link, link.linkFun(mu), mu, zeros(Float64, (n,)), wt, y)
end

## another outer constructor using the canonical link for the distribution
GlmResp(dist::Dist, y::Vector{Float64}) = GlmResp(dist, dist.canonical, y)

The second outer constructor allows us to use the canonical link without needing to specify it.

There are several functions that we wish to apply to the glmResp object.

deviance( r::GlmResp) = r.dist.deviance(r.y, r.mu, r.wts)
devResid( r::GlmResp) = r.dist.devResid(r.y, r.mu, r.wts)
drsum(    r::GlmResp) = sum(devResid(r))
muEta(    r::GlmResp) = r.link.muEta(r.eta)
sqrtWrkWt(r::GlmResp) = muEta(r) .* sqrt(r.wts ./ variance(r))
variance( r::GlmResp) = r.dist.variance(r.mu)
wrkResid( r::GlmResp) = (r.y - r.mu) ./ r.link.muEta(r.eta)
wrkResp(  r::GlmResp) = (r.eta - r.offset) + wrkResid(r)

Our initial definitions of these functions are actually method definitions because we declare that the argument r must be a GlmResp. However, defining a method also defines the function if it has not already been defined. In R we must separately specify S3 generics and methods. An S4 generic is automatically created from a method specification but the interactions of S4 generics and namespaces can become tricky.

Note that the period ('.') is used to access components or members of a composite type and cannot be used in a Julia identifier.

The function to update the linear predictor requires both a GlmResp and the linear predictor value to assign. Here we use an abstract type inheriting from Number to allow a more general specification

updateMu{T<:Number}(r::GlmResp, linPr::AbstractArray{T}) = (r.eta = linPr + r.offset; r.mu = r.link.linkInv(r.eta); drsum(r))

Updating the linear predictor also updates the mean response then returns the sum of the square deviance residuals. Note that this is not functional semantics in that the object r being passed as an argument is updated in place. Arguments to Julia functions are passed by reference. Those accustomed to the functional semantics of R (meaning that you can't change the value of an argument by passing it to a function) should beware.

Finally we create a composite type for a predictor

type predD                              # predictor with dense X
    X::Matrix{Float64}                  # model matrix
    beta0::Vector{Float64}              # base coefficient vector
    delb::Vector{Float64}               # increment
end

and an outer constructor

## outer constructor
predD(X::Matrix{Float64}) = (zz = zeros((size(X, 2),)); predD(X, zz, zz))

Given the current state of the predictor we create the weighted X'X matrix and the product of the weighted model matrix and the weighted working residuals in the accumulate function

function accumulate(r::GlmResp, p::predD)
    w   = sqrtWrkWt(r)
    wX  = diagmm(w, p.X)
    (wX' * (w .* wrkResp(r))), (wX' * wX)
end

This function doesn't need to be written separately but I hope that doing so will make it easier to apply these operations to distributed arrays, which, to me, seem like a natural way of applying parallel computing to many statistical computing tasks.

A function to calculate and apply the increment is

increment(r::GlmResp, p::predD) = ((wXz, wXtwX) = accumulate(r, p); bb = wXtwX \ wXz; p.delb = bb - p.beta0; updateMu(r, p.X * bb))

and finally we get to the glmFit function

function glmFit(p::predD, r::GlmResp, maxIter::Uint, minStepFac::Float64, convTol::Float64)
    if (maxIter < 1) error("maxIter must be positive") end
    if (minStepFac < 0 || 1 < minStepFac) error("minStepFac must be in (0, 1)") end
    cvg = false

    devold = typemax(Float64)           # Float64 version of Inf
    for i=1:maxIter
        dev = increment(r, p)
        if (dev < devold)
            p.beta0 = p.beta0 + p.delb
        else
            error("code needed to handle the step-factor case")
        end
        if abs((devold - dev)/dev) < convTol
            cvg = true
            break
        end
        devold = dev
    end
    if !cvg
        error("failure to converge in $maxIter iterations")
    end
end

glmFit(p::predD, r::GlmResp) = glmFit(p, r, uint(30), 0.001, 1.e-6)

Defining two methods allows for default values for some of the arguments. Here we have a kind of all-or-none approach to defaults. It is possible to create other method signatures for defaults applied to only some of the arguments. In retrospect I think I was being too cute in declaring the maxIter argument as an unsigned int. It only makes sense for this to be positive but it might make life simpler to allow it to be an integer as positivity is checked anyway.

Note the use of string interpolation in the last error message. The numeric value of maxIter will be substituted in the error message.

Checking that it works

The script

load("../extras/glm.jl")

## generate a Bernoulli response

srand(123454321)
n    = 10000
X    = hcat(fill(1, (n,)), 3. * randn((n, 2)))
beta = randn((3,))
eta  = X * beta
mu   = logitLink.linkInv(eta)
y    = float64(rand(n) < mu)
println("Coefficient vector for simulation: $(beta)")

## establish the glmResp object

rr   = GlmResp(BernoulliDist, y)
pp   = predD(X)

glmFit(pp, rr)
println("Converged to coefficient vector: $(pp.beta0 + pp.delb)")
println("Deviance at estimates: $(deviance(rr))")

## reinitialize objects for timing
rr   = GlmResp(BernoulliDist, y)
pp   = predD(X)
println("Elapsed time for fitting binary response with X $(size(X,1)) by $(size(X,2)): $(@elapsed glmFit(pp, rr)) seconds")

produces

Coefficient vector for simulation: [-0.54105, 0.308874, -1.02009]
Converged to coefficient vector: [-0.551101, 0.301489, -1.02574]
Deviance at estimates: 6858.466827205941
Elapsed time for fitting binary response with X 10000 by 3: 0.09874677658081055 seconds

1 comment: