In January of this year I first saw mention of the
Julia language in the release notes for
LLVM. I mentioned this to
Dirk Eddelbuettel and later we got in
contact with Viral Shah
regarding a Debian package for Julia.
There are many aspects of Julia that are quite intriguing to an
R programmer. I am interested in
programming languages for "Computing with Data", in John Chambers'
term, or "Technical Computing", as the authors of Julia classify it.
I believe that learning a programming language is somewhat like
learning a natural language in that you need to live with it and use
it for a while before you feel comfortable with it and with the
culture surrounding it.
A common complaint from those learning R is the difficulty of finding the name of the function that performs a particular task. In writing a bit of Julia
code for fitting generalized linear models, as described below, I
found myself in exactly the same position of having to search through
documentation to find how to do something that I felt should be
simple. The experience is frustrating but I don't know of a way of
avoiding it. One word of advice for R programmers looking at Julia: the names of most functions correspond to the Matlab/Octave names, not the R names. One exception is the
d-p-q-r functions for distributions, as I described in an earlier
posting.
Evaluating a new language
There is a temptation to approach a new programming language with the
idea that it should be exactly like the language I am currently using
but better, faster, easier, ... This is, of course, ridiculous. If
you want R syntax, semantics, function names, and packages then use
R. If you consider another language you should accept that many
tasks will be done differently. Some will be done better than in your
current language and some will not be done as well. And the majority
will just be done differently.
Two of the primary Julia developers, Jeff Bezanson and Stefan Karpinski, recently gave a presentation at the Lang.NEXT conference. The
slides and
video
can help you get a flavor of the language.
So, what does Julia provide that R doesn't?
A JIT (just-in-time) compiler. The first time that a function is called with a particular argument
signature it is compiled to machine code using the
LLVM compiler backend. This, naturally, provides
a speed boost. It can also affect programming style. In R one
instinctively writes vectorized code to avoid a performance hit. In
Julia non-vectorized code can have similar and, under some circumstances, better performance than vectorized code (a sketch of a devectorized loop appears after this list).
A richer type system. Scalars, vectors/arrays, tuples, composite
types and several others can be defined in Julia. In R the
atomic types are actually atomic vectors, and the composite types are either lists (which are actually a type of vector in R, not lisp-like lists) or vectors with attributes (called structures in R, but not to be confused with C's structs). The class systems, S3
and S4, are built on top of these structures.
Currently it is difficult to work with 64-bit integers in R but not in Julia.
Multiple dispatch. In Julia all functions provide multiple
dispatch to handle different signatures of arguments. In some
sense, functions are just names used to index into method tables.
In R terminology all functions are generic functions with S4-like dispatch, not S3-like dispatch, which uses only the first argument (a sketch of dispatch on two arguments appears after this list).
Parallelism. A multiple process model and distributed arrays are a
basic part of the Julia language.
A cleaner model for accessing libraries of compiled code. As Jeff
and Stefan point out, languages like R and Matlab/Octave
typically implement the high-level operations in "glue" code that,
in turn, accesses library code. In Julia one can short-circuit
this process and call the library code from the high-level
language. See the earlier blog posting on accessing the Rmath library from Julia, and the ccall sketch after this list. This may not seem like a big deal unless you have written thousands of lines of glue code, as I have.
There are some parts of R not present in Julia that programmers
may miss.
Default arguments of functions and named actual arguments in
function calls. The process of matching actual arguments to formal
arguments in R function calls is richer. All matching in Julia is by position within the function signature (see the sketch of positional matching after this list). Function design with many arguments requires more thought and care in Julia than in R.
The R package system and CRAN. One could also throw in namespaces
and function documentation standards as part of the package system.
These are important parts of R that are not yet available in Julia. However, that does not preclude similar facilities being developed
for Julia. The R package system did not spring fully grown from
the brow of Zeus.
Graphics. One of the big advantages of R over similar languages
is the sophisticated graphics available in ggplot2 or lattice. Work
is underway on graphics models for Julia but, again, it is early
days still.
An example: Generalized Linear Models (GLMs)
Most R users are familiar with the basic model-fitting functions
like lm() and glm(). These functions call model.frame and
model.matrix to form the numerical representation of the model from
the formula and data specification, taking into account optional
arguments such as subset, na.action and contrasts. After that they
call the numerical workhorses, lm.fit and glm.fit. What I will describe is a function like glm.fit (a Julia name cannot include a period, so I call it glmFit) and structures like the glm
"family" in R.
A glm family is a collection of functions that describe the
distribution of the response, given the mean, and the link between the
linear predictor and the mean response. The link function takes the
mean response, mu, to the "linear predictor", eta, of the form
X %*% beta # in R
or
X * beta # in Julia
where X is the model matrix and beta is the coefficient vector. In
practice we more frequently evaluate the inverse link, from eta to mu,
and its derivative.
We define a composite type
type Link
name::String # name of the link
linkFun::Function # link function mu -> eta
linkInv::Function # inverse link eta -> mu
muEta::Function # derivative eta -> d mu/d eta
end
with specific instances
logitLink =
Link("logit",
mu -> log(mu ./ (1. - mu)),
eta -> 1. ./ (1. + exp(-eta)),
eta -> (e = exp(-abs(eta)); f = 1. + e; e ./ (f .* f)))
logLink =
Link("log",
mu -> log(mu),
eta -> exp(eta),
eta -> exp(eta))
identityLink =
Link("identity",
mu -> mu,
eta -> eta,
eta -> ones(eltype(eta), size(eta)))
inverseLink =
Link("inverse",
mu -> 1. ./ mu,
eta -> 1. ./ eta,
eta -> -1. ./ (eta .* eta))
for the common links. Here I have used the short-cut syntax
mu -> log(mu ./ (1. - mu))
to create anonymous functions. The muEta function for the logitLink
shows the use of multiple statements within the body of the anonymous
function. As in R, the value of a Julia function is the last expression evaluated in its body.
I also ensure that the functions in the Link object are safe for vector arguments by using the componentwise forms of the multiplicative operators ("./" and ".*"). These forms apply to scalars as well as vectors. I use explicit floating-point constants, such as 1., to give
the compiler a hint that the result should be a floating-point scalar
or vector.
A distribution characterizes the variance as a function of
the mean, and the "deviance residuals" used to evaluate the
model-fitting criterion. For some distributions, including the Gaussian, the model-fitting criterion (the residual sum of squares) differs from the deviance itself, but minimizing this simpler criterion still provides the maximum-likelihood (or minimum-deviance) estimates of the coefficients.
When we eventually fit the model we want the value of the deviance
itself, which may not be the same as the sum of squared deviance
residuals. Other functions are used to check for valid mean vectors
or linear predictors. Also each distribution in what is called the
"exponential family", which includes many common distributions like
Bernoulli, binomial, gamma, Gaussian and Poisson, has a canonical link
function derived from the distribution itself.
The composite type representing the distribution is
type Dist
name::String # the name of the distribution
canonical::Link # the canonical link for the distribution
variance::Function # variance function mu -> var
devResid::Function # vector of squared deviance residuals
deviance::Function # the scalar deviance
mustart::Function # derive a starting estimate for mu
validmu::Function # check validity of the mu vector
valideta::Function # check validity of the eta vector
end
with specific distributions defined as
## utilities used in some distributions
logN0(x::Number) = x == 0 ? x : log(x)
logN0{T<:Number}(x::AbstractArray{T}) = reshape([ logN0(x[i]) | i=1:numel(x) ], size(x))
y_log_y(y, mu) = y .* logN0(y ./ mu) # provides correct limit at y == 0
BernoulliDist =
Dist("Bernoulli",
logitLink,
mu -> max(eps(Float64), mu .* (1. - mu)),
(y, mu, wt)-> 2 * wt .* (y_log_y(y, mu) + y_log_y(1. - y, 1. - mu)),
(y, mu, wt)-> -2. * sum(y .* log(mu) + (1. - y) .* log(1. - mu)),
(y, wt)-> (wt .* y + 0.5) ./ (wt + 1.),
mu -> all((0 < mu) & (mu < 1)),
eta -> true)
gammaDist =
Dist("gamma",
inverseLink,
mu -> mu .* mu,
(y, mu, wt)-> -2 * wt .* (logN0(y ./ mu) - (y - mu) ./ mu),
(y, mu, wt)-> (n=sum(wt); disp=sum(-2 * wt .* (logN0(y ./ mu) - (y - mu) ./ mu))/n; invdisp = 1/disp; -2. * sum(wt .* dgamma(y, invdisp, mu * disp, true))),
(y, wt)-> all(y > 0) ? y : error("non-positive response values not allowed for gammaDist"),
mu -> all(mu > 0.),
eta -> all(eta > 0.))
GaussianDist =
Dist("Gaussian",
identityLink,
mu -> ones(eltype(mu), size(mu)),
(y, mu, wt)-> (r = y - mu; wt .* r .* r),
(y, mu, wt)-> (n = length(mu); r = y - mu; n * (log(2*pi*sum(wt .* r .* r)/n) + 1) + 2 - sum(log(wt))),
(y, wt)-> y,
mu -> true,
eta -> true)
PoissonDist =
Dist("Poisson",
logLink,
mu -> mu,
(y, mu, wt)-> 2 * wt .* (y .* logN0(y ./ mu) - (y - mu)),
(y, mu, wt)-> -2 * sum(dpois(y, mu, true) * wt),
(y, wt)-> y + 0.1,
mu -> all(mu > 0),
eta -> true)
Finally, the GlmResp type collects the distribution, the link, the response, the mean, and the linear predictor. Occasionally we want to add an offset to the linear predictor or to apply prior weights to the responses, so these are included as well.
type GlmResp # response in a glm model
dist::Dist
link::Link
eta::Vector{Float64} # linear predictor
mu::Vector{Float64} # mean response
offset::Vector{Float64} # offset added to linear predictor (usually 0)
wts::Vector{Float64} # prior weights
y::Vector{Float64} # response
end
With just this definition we would need to specify the offset, wts,
etc. every time we construct such an object. By defining an outer
constructor (meaning a constructor that is defined outside the type
definition) we can provide defaults
## outer constructor - the most common way of creating the object
function GlmResp(dist::Dist, link::Link, y::Vector{Float64})
n = length(y)
wt = ones(Float64, (n,))
mu = dist.mustart(y, wt)
GlmResp(dist, link, link.linkFun(mu), mu, zeros(Float64, (n,)), wt, y)
end
## another outer constructor using the canonical link for the distribution
GlmResp(dist::Dist, y::Vector{Float64}) = GlmResp(dist, dist.canonical, y)
The second outer constructor allows us to use the canonical link
without needing to specify it.
There are several functions that we wish to apply to a GlmResp object.
deviance( r::GlmResp) = r.dist.deviance(r.y, r.mu, r.wts)
devResid( r::GlmResp) = r.dist.devResid(r.y, r.mu, r.wts)
drsum( r::GlmResp) = sum(devResid(r))
muEta( r::GlmResp) = r.link.muEta(r.eta)
sqrtWrkWt(r::GlmResp) = muEta(r) .* sqrt(r.wts ./ variance(r))
variance( r::GlmResp) = r.dist.variance(r.mu)
wrkResid( r::GlmResp) = (r.y - r.mu) ./ r.link.muEta(r.eta)
wrkResp( r::GlmResp) = (r.eta - r.offset) + wrkResid(r)
Our initial definitions of these functions are actually method
definitions because we declare that the argument r must be a
GlmResp. However, defining a method also defines the function if it
has not already been defined. In R we must separately specify S3
generics and methods. An S4 generic is automatically created from a
method specification but the interactions of S4 generics and
namespaces can become tricky.
Note that the period ('.') is used to access components or members of
a composite type and cannot be used in a Julia identifier.
The function to update the linear predictor requires both a GlmResp and the linear predictor value to assign. Here we use a type parameter restricted to subtypes of the abstract type Number to allow a more general specification:
updateMu{T<:Number}(r::GlmResp, linPr::AbstractArray{T}) = (r.eta = linPr + r.offset; r.mu = r.link.linkInv(r.eta); drsum(r))
Updating the linear predictor also updates the mean response and then returns the sum of the squared deviance residuals. Note that this is
not functional semantics in that the object r being passed as an
argument is updated in place. Arguments to Julia functions are
passed by reference. Those accustomed to the functional semantics of
R (meaning that you can't change the value of an argument by passing
it to a function) should beware.
Finally we create a composite type for a predictor
type predD # predictor with dense X
X::Matrix{Float64} # model matrix
beta0::Vector{Float64} # base coefficient vector
delb::Vector{Float64} # increment
end
and an outer constructor
## outer constructor
predD(X::Matrix{Float64}) = (n = size(X, 2); predD(X, zeros(n), zeros(n))) # separate vectors, so beta0 and delb do not alias
Given the current state of the predictor we create the weighted X'X matrix and the product of the weighted model matrix and the weighted working response in the accumulate function
function accumulate(r::GlmResp, p::predD)
w = sqrtWrkWt(r)
wX = diagmm(w, p.X)
(wX' * (w .* wrkResp(r))), (wX' * wX)
end
This function doesn't need to be written separately but I hope that
doing so will make it easier to apply these operations to distributed
arrays, which, to me, seem like a natural way of applying parallel
computing to many statistical computing tasks.
A function to calculate and apply the increment, one step of iteratively reweighted least squares (IRLS), is
increment(r::GlmResp, p::predD) = ((wXz, wXtwX) = accumulate(r, p); bb = wXtwX \ wXz; p.delb = bb - p.beta0; updateMu(r, p.X * bb))
and finally we get to the glmFit function
function glmFit(p::predD, r::GlmResp, maxIter::Uint, minStepFac::Float64, convTol::Float64)
if (maxIter < 1) error("maxIter must be positive") end
    if (minStepFac <= 0 || 1 <= minStepFac) error("minStepFac must be in (0, 1)") end
cvg = false
devold = typemax(Float64) # Float64 version of Inf
for i=1:maxIter
dev = increment(r, p)
if (dev < devold)
p.beta0 = p.beta0 + p.delb
else
error("code needed to handle the step-factor case")
end
if abs((devold - dev)/dev) < convTol
cvg = true
break
end
devold = dev
end
if !cvg
error("failure to converge in $maxIter iterations")
end
end
glmFit(p::predD, r::GlmResp) = glmFit(p, r, uint(30), 0.001, 1.e-6)
Defining two methods allows for default values for some of the
arguments. Here we have a kind of all-or-none approach to defaults.
It is possible to create other method signatures for defaults applied
to only some of the arguments. In retrospect I think I was being too cute in declaring the maxIter argument as an unsigned integer. It only makes sense for this value to be positive, but it might make life simpler to allow it to be a plain integer, as positivity is checked anyway.
Note the use of string interpolation in the last error message: the numeric value of maxIter is substituted into the string.
Checking that it works
The script
load("../extras/glm.jl")
## generate a Bernoulli response
srand(123454321)
n = 10000
X = hcat(fill(1, (n,)), 3. * randn((n, 2)))
beta = randn((3,))
eta = X * beta
mu = logitLink.linkInv(eta)
y = float64(rand(n) < mu)
println("Coefficient vector for simulation: $(beta)")
## establish the GlmResp object
rr = GlmResp(BernoulliDist, y)
pp = predD(X)
glmFit(pp, rr)
println("Converged to coefficient vector: $(pp.beta0 + pp.delb)")
println("Deviance at estimates: $(deviance(rr))")
## reinitialize objects for timing
rr = GlmResp(BernoulliDist, y)
pp = predD(X)
println("Elapsed time for fitting binary response with X $(size(X,1)) by $(size(X,2)): $(@elapsed glmFit(pp, rr)) seconds")
produces
Coefficient vector for simulation: [-0.54105, 0.308874, -1.02009]
Converged to coefficient vector: [-0.551101, 0.301489, -1.02574]
Deviance at estimates: 6858.466827205941
Elapsed time for fitting binary response with X 10000 by 3: 0.09874677658081055 seconds