In January of this year I first saw mention of the Julia language in the release notes for LLVM. I mentioned this to Dirk Eddelbuettel and later we got in contact with Viral Shah regarding a Debian package for Julia.

There are many aspects of *Julia* that are quite intriguing to an
R programmer. I am interested in
programming languages for "Computing with Data", in John Chambers'
term, or "Technical Computing", as the authors of *Julia* classify it.
I believe that learning a programming language is somewhat like
learning a natural language in that you need to live with it and use
it for a while before you feel comfortable with it and with the
culture surrounding it.

A common complaint for those learning *R* is finding the name of the
function to perform a particular task. In writing a bit of *Julia*
code for fitting generalized linear models, as described below, I
found myself in exactly the same position of having to search through
documentation to find how to do something that I felt should be
simple. The experience is frustrating but I don't know of a way of
avoiding it. One word of advice for *R* programmers looking at
*Julia*, the names of most functions correspond to the
*Matlab*/*octave* names, not the *R* names. One exception is the
d-p-q-r functions for distributions, as I described in an earlier
posting.

## Evaluating a new language

There is a temptation to approach a new programming language with the
idea that it should be exactly like the language I am currently using
but better, faster, easier, ... This is, of course, ridiculous. If
you want *R* syntax, semantics, function names, and packages then use
*R*. If you consider another language you should accept that many
tasks will be done differently. Some will be done better than in your
current language and some will not be done as well. And the majority
will just be done differently.

Two primary *Julia* developers, Jeff Bezanson and Stefan Karpinski,
recently gave a presentation at the lang.NEXT conference. The
slides and
video
can help you get a flavor of the language.

So, what does *Julia* provide that *R* doesn't?

A JIT. The first time that a function is called with a particular argument signature it is compiled to machine code using the LLVM compiler backend. This, naturally, provides a speed boost. It can also affect programming style. In

*R*one instinctively writes vectorized code to avoid a performance hit. In*Julia*non-vectorized code can have similar and, under some circumstances, better performance than vectorized code.A richer type system. Scalars, vectors/arrays, tuples, composite types and several others can be defined in

*Julia*. In*R*the atomic types are actually atomic vectors and the composite types are either lists (which is actually a type of vector in*R*, not a lisp-like list), vectors with attributes (called structures in*R*but not to be confused with*C*'s structs). The class systems, S3 and S4, are built on top of these structures.Currently it is difficult to work with 64-bit integers in

*R*but not in*Julia*Multiple dispatch. In

*Julia*all functions provide multiple dispatch to handle different signatures of arguments. In some sense, functions are just names used to index into method tables. In*R*terminology all functions are generic functions with S4-like dispatch, not S3-like which only dispatches on the first argument.Parallelism. A multiple process model and distributed arrays are a basic part of the

*Julia*language.A cleaner model for accessing libraries of compiled code. As Jeff and Stefan point out, languages like

*R*and*Matlab*/*octave*typically implement the high-level operations in "glue" code that, in turn, accesses library code. In*Julia*one can short-circuit this process and call the library code from the high-level language. See the earlier blog posting on accessing the Rmath library from Julia. This may not seem like a big deal unless you have written thousands of lines of glue code, as I have.

There are some parts of *R* not present in *Julia* that programmers
may miss.

Default arguments of functions and named actual arguments in function calls. The process of matching actual arguments to formal arguments in

*R*function calls is richer. All matching in*Julia*is by position within function signature. Function design with many arguments requires more thought and care in*Julia*than in*R*.The

*R*package system and CRAN. One could also throw in namespaces and function documentation standards as part of the package system. These are important parts of*R*that are not yet available. However, that does not preclude simular facilities being developed for*Julia*. The*R*package system did not spring fully grown from the brow of Zeus.Graphics. One of the big advantages of

*R*over similar languages is the sophisticated graphics available in ggplot2 or lattice. Work is underway on graphics models for*Julia*but, again, it is early days still.

## An example: Generalized Linear Models (GLMs)

Most *R* users are familiar with the basic model-fitting functions
like lm() and glm(). These functions call model.frame and
model.matrix to form the numerical representation of the model from
the formula and data specification, taking into account optional
arguments such as subset, na.action and contrasts. After that they
call the numerical workhorses, lm.fit and glm.fit. What I will
describe is a function like glm.fit (the name in *Julia* cannot
include a period and I call it glmFit) and structures like the glm
"family" in *R*.

A glm family is a collection of functions that describe the distribution of the response, given the mean, and the link between the linear predictor and the mean response. The link function takes the mean response, mu, to the "linear predictor", eta, of the form

```
X %*% beta # in R
```

or

```
X * beta # in Julia
```

where X is the model matrix and beta is the coefficient vector. In practice we more frequently evaluate the inverse link, from eta to mu, and its derivative.

We define a composite type

```
type Link
name::String # name of the link
linkFun::Function # link function mu -> eta
linkInv::Function # inverse link eta -> mu
muEta::Function # derivative eta -> d mu/d eta
end
```

with specific instances

```
logitLink =
Link("logit",
mu -> log(mu ./ (1. - mu)),
eta -> 1. ./ (1. + exp(-eta)),
eta -> (e = exp(-abs(eta)); f = 1. + e; e ./ (f .* f)))
logLink =
Link("log",
mu -> log(mu),
eta -> exp(eta),
eta -> exp(eta))
identityLink =
Link("identity",
mu -> mu,
eta -> eta,
eta -> ones(eltype(eta), size(eta)))
inverseLink =
Link("inverse",
mu -> 1. ./ mu,
eta -> 1. ./ eta,
eta -> -1. ./ (eta .* eta))
```

for the common links. Here I have used the short-cut syntax

```
mu -> log(mu ./ (1. - mu))
```

to create anonymous functions. The muEta function for the logitLink
shows the use of multiple statements within the body of the anonymous
function. Like *R*, the last expression evaluated in a function is
the value of a *Julia* function.

I also ensure that the functions in the Link object are safe for vector arguments by using the componentwise-forms of multiplicative operators ("./" and ".*"). These forms apply to scalars as well as vector. I use explicit floating-point constants, such as 1., to give the compiler a hint that the result should be a floating-point scalar or vector.

A distribution characterizes the variance as a function of the mean, and the "deviance residuals" used to evaluate the model-fitting criterion. For some distributions, including the Gaussian distribution the model-fitting criterion (residual sum of squares) is different from the deviance itself but minimizing this simpler criterion provides the maximum-likelihood (or minimum deviance) estimates of the coefficients.

When we eventually fit the model we want the value of the deviance itself, which may not be the same as the sum of squared deviance residuals. Other functions are used to check for valid mean vectors or linear predictors. Also each distribution in what is called the "exponential family", which includes many common distributions like Bernoulli, binomial, gamma, Gaussian and Poisson, has a canonical link function derived from the distribution itself.

The composite type representing the distribution is

```
type Dist
name::String # the name of the distribution
canonical::Link # the canonical link for the distribution
variance::Function # variance function mu -> var
devResid::Function # vector of squared deviance residuals
deviance::Function # the scalar deviance
mustart::Function # derive a starting estimate for mu
validmu::Function # check validity of the mu vector
valideta::Function # check validity of the eta vector
end
```

with specific distributions defined as

```
## utilities used in some distributions
logN0(x::Number) = x == 0 ? x : log(x)
logN0{T<:Number}(x::AbstractArray{T}) = reshape([ logN0(x[i]) | i=1:numel(x) ], size(x))
y_log_y(y, mu) = y .* logN0(y ./ mu) # provides correct limit at y == 0
BernoulliDist =
Dist("Bernoulli",
logitLink,
mu -> max(eps(Float64), mu .* (1. - mu)),
(y, mu, wt)-> 2 * wt .* (y_log_y(y, mu) + y_log_y(1. - y, 1. - mu)),
(y, mu, wt)-> -2. * sum(y .* log(mu) + (1. - y) .* log(1. - mu)),
(y, wt)-> (wt .* y + 0.5) ./ (wt + 1.),
mu -> all((0 < mu) & (mu < 1)),
eta -> true)
gammaDist =
Dist("gamma",
inverseLink,
mu -> mu .* mu,
(y, mu, wt)-> -2 * wt .* (logN0(y ./ mu) - (y - mu) ./ mu),
(y, mu, wt)-> (n=sum(wt); disp=sum(-2 * wt .* (logN0(y ./ mu) - (y - mu) ./ mu))/n; invdisp(1/disp); sum(wt .* dgamma(y, invdisp, mu * disp, true))),
(y, wt)-> all(y > 0) ? y : error("non-positive response values not allowed for gammaDist"),
mu -> all(mu > 0.),
eta -> all(eta > 0.))
GaussianDist =
Dist("Gaussian",
identityLink,
mu -> ones(typeof(mu), size(mu)),
(y, mu, wt)-> (r = y - mu; wt .* r .* r),
(y, mu, wt)-> (n = length(mu); r = y - mu; n * (log(2*pi*sum(wt .* r .* r)/n) + 1) + 2 - sum(log(wt))),
(y, wt)-> y,
mu -> true,
eta -> true)
PoissonDist =
Dist("Poisson",
logLink,
mu -> mu,
(y, mu, wt)-> 2 * wt .* (y .* logN0(y ./ mu) - (y - mu)),
(y, mu, wt)-> -2 * sum(dpois(y, mu, true) * wt),
(y, mu)-> y + 0.1,
mu -> all(mu > 0),
eta -> true)
```

Finally the GlmResp type consists of the distribution, the link, the response, the mean and linear predictor. Occasionally we want to add an offset to the linear predictor expression or apply prior weights to the responses and these are included.

```
type GlmResp # response in a glm model
dist::Dist
link::Link
eta::Vector{Float64} # linear predictor
mu::Vector{Float64} # mean response
offset::Vector{Float64} # offset added to linear predictor (usually 0)
wts::Vector{Float64} # prior weights
y::Vector{Float64} # response
end
```

With just this definition we would need to specify the offset, wts, etc. every time we construct such an object. By defining an outer constructor (meaning a constructor that is defined outside the type definition) we can provide defaults

```
## outer constructor - the most common way of creating the object
function GlmResp(dist::Dist, link::Link, y::Vector{Float64})
n = length(y)
wt = ones(Float64, (n,))
mu = dist.mustart(y, wt)
GlmResp(dist, link, link.linkFun(mu), mu, zeros(Float64, (n,)), wt, y)
end
## another outer constructor using the canonical link for the distribution
GlmResp(dist::Dist, y::Vector{Float64}) = GlmResp(dist, dist.canonical, y)
```

The second outer constructor allows us to use the canonical link without needing to specify it.

There are several functions that we wish to apply to the glmResp object.

```
deviance( r::GlmResp) = r.dist.deviance(r.y, r.mu, r.wts)
devResid( r::GlmResp) = r.dist.devResid(r.y, r.mu, r.wts)
drsum( r::GlmResp) = sum(devResid(r))
muEta( r::GlmResp) = r.link.muEta(r.eta)
sqrtWrkWt(r::GlmResp) = muEta(r) .* sqrt(r.wts ./ variance(r))
variance( r::GlmResp) = r.dist.variance(r.mu)
wrkResid( r::GlmResp) = (r.y - r.mu) ./ r.link.muEta(r.eta)
wrkResp( r::GlmResp) = (r.eta - r.offset) + wrkResid(r)
```

Our initial definitions of these functions are actually method
definitions because we declare that the argument r must be a
GlmResp. However, defining a method also defines the function if it
has not already been defined. In *R* we must separately specify S3
generics and methods. An S4 generic is automatically created from a
method specification but the interactions of S4 generics and
namespaces can become tricky.

Note that the period ('.') is used to access components or members of a composite type and cannot be used in a Julia identifier.

The function to update the linear predictor requires both a GlmResp and the linear predictor value to assign. Here we use an abstract type inheriting from Number to allow a more general specification

```
updateMu{T<:Number}(r::GlmResp, linPr::AbstractArray{T}) = (r.eta = linPr + r.offset; r.mu = r.link.linkInv(r.eta); drsum(r))
```

Updating the linear predictor also updates the mean response then
returns the sum of the square deviance residuals. Note that this is
not functional semantics in that the object r being passed as an
argument is updated in place. Arguments to *Julia* functions are
passed by reference. Those accustomed to the functional semantics of
*R* (meaning that you can't change the value of an argument by passing
it to a function) should beware.

Finally we create a composite type for a predictor

```
type predD # predictor with dense X
X::Matrix{Float64} # model matrix
beta0::Vector{Float64} # base coefficient vector
delb::Vector{Float64} # increment
end
```

and an outer constructor

```
## outer constructor
predD(X::Matrix{Float64}) = (zz = zeros((size(X, 2),)); predD(X, zz, zz))
```

Given the current state of the predictor we create the weighted X'X matrix and the product of the weighted model matrix and the weighted working residuals in the accumulate function

```
function accumulate(r::GlmResp, p::predD)
w = sqrtWrkWt(r)
wX = diagmm(w, p.X)
(wX' * (w .* wrkResp(r))), (wX' * wX)
end
```

This function doesn't need to be written separately but I hope that doing so will make it easier to apply these operations to distributed arrays, which, to me, seem like a natural way of applying parallel computing to many statistical computing tasks.

A function to calculate and apply the increment is

```
increment(r::GlmResp, p::predD) = ((wXz, wXtwX) = accumulate(r, p); bb = wXtwX \ wXz; p.delb = bb - p.beta0; updateMu(r, p.X * bb))
```

and finally we get to the glmFit function

```
function glmFit(p::predD, r::GlmResp, maxIter::Uint, minStepFac::Float64, convTol::Float64)
if (maxIter < 1) error("maxIter must be positive") end
if (minStepFac < 0 || 1 < minStepFac) error("minStepFac must be in (0, 1)") end
cvg = false
devold = typemax(Float64) # Float64 version of Inf
for i=1:maxIter
dev = increment(r, p)
if (dev < devold)
p.beta0 = p.beta0 + p.delb
else
error("code needed to handle the step-factor case")
end
if abs((devold - dev)/dev) < convTol
cvg = true
break
end
devold = dev
end
if !cvg
error("failure to converge in $maxIter iterations")
end
end
glmFit(p::predD, r::GlmResp) = glmFit(p, r, uint(30), 0.001, 1.e-6)
```

Defining two methods allows for default values for some of the arguments. Here we have a kind of all-or-none approach to defaults. It is possible to create other method signatures for defaults applied to only some of the arguments. In retrospect I think I was being too cute in declaring the maxIter argument as an unsigned int. It only makes sense for this to be positive but it might make life simpler to allow it to be an integer as positivity is checked anyway.

Note the use of string interpolation in the last error message. The numeric value of maxIter will be substituted in the error message.

## Checking that it works

The script

```
load("../extras/glm.jl")
## generate a Bernoulli response
srand(123454321)
n = 10000
X = hcat(fill(1, (n,)), 3. * randn((n, 2)))
beta = randn((3,))
eta = X * beta
mu = logitLink.linkInv(eta)
y = float64(rand(n) < mu)
println("Coefficient vector for simulation: $(beta)")
## establish the glmResp object
rr = GlmResp(BernoulliDist, y)
pp = predD(X)
glmFit(pp, rr)
println("Converged to coefficient vector: $(pp.beta0 + pp.delb)")
println("Deviance at estimates: $(deviance(rr))")
## reinitialize objects for timing
rr = GlmResp(BernoulliDist, y)
pp = predD(X)
println("Elapsed time for fitting binary response with X $(size(X,1)) by $(size(X,2)): $(@elapsed glmFit(pp, rr)) seconds")
```

produces

```
Coefficient vector for simulation: [-0.54105, 0.308874, -1.02009]
Converged to coefficient vector: [-0.551101, 0.301489, -1.02574]
Deviance at estimates: 6858.466827205941
Elapsed time for fitting binary response with X 10000 by 3: 0.09874677658081055 seconds
```

so in summary...? benchmarks? do you like it more?

ReplyDelete