Java technology zone technical podcast series: Season 2

David Champagne on the R programming language and its relevance in the world of big data Episode date: 05-24-2011

>> This is the developerWorks podcasts Java technical series hosted by Andy Glover. Here's Andy.

GLOVER: David Champagne is the CTO of Revolution Analytics, which is a company obviously playing in the analytics big-data space and recently has come out with an open source language, I believe, called the R. So, David, why don't you tell us what is Revolution Analytics? And then let's find out about R.

CHAMPAGNE: Revolution Analytics is a commercial open source company based on the R programming language. And R is a programming language written by statisticians for statisticians.

GLOVER: Sounds really exciting.

CHAMPAGNE: It was co-authored by two gentlemen in New Zealand, Ross Ihaka and Robert Gentleman. They built this program in 1993, and it has taken off and become viral in the statistical community and the data-analytic community. It is a thriving community of open source driven statisticians all over the world contributing and extending R's capability. And it has become the lingua franca of statistical analysis.

GLOVER: Let's take a step back real quick. When you talk about statistical analysis, some people may be thrown right back into college or something like that where they had to take statistics as part of a CS degree. Why is that important now?

CHAMPAGNE: Well, everybody is data-driven now. And understanding your data -- understanding how to use your data in an effective way -- is basically how you run your business. If you can't process your data and you can't do analysis on that data in a meaningful way to predict behavior, to understand behavior, and to discover new things about the data that you're working with, you're not going to be able to run your business very effectively.

So where R comes in in this is that it is the most flexible statistical language that is available for use in this market. And it drives people to innovate. A perfect example is there's a project called Bioconductor, which is an open source project that is focused on discovery of genomic data and doing sophisticated modeling on complex models on genomic data.

Scientists and statisticians are analyzing that data using R, and they have built over 400 sophisticated packages on top of R to do that analysis. There's no other statistical package or language out there that can claim anywhere near that kind of flexibility and ability to innovate.

Being able to have the tools and the technology to do that is the main reason why a lot of the innovations in science and finance are occurring. They're occurring because of R.

GLOVER: Obviously, this is a podcast that people are listening to, and they may be listening to it as they're driving or whatever, but let's see if we can paint a mental picture, if you will, of what R looks like from a programming-language standpoint.

Something that comes to mind when you talk about statistics and just kind of -- although this language has nothing to do with statistics -- but it harkens memories of Prolog. As a Java developer, what does R look and feel like?

CHAMPAGNE: R looks like an object-oriented language. Everything in R is an object. This provides great flexibility when programming.

GLOVER: Okay.

CHAMPAGNE: The other thing you have to realize about R: R is not a compiled language; it's an interpreted language.

GLOVER: Okay. So Java developers would feel comfortable with that.

CHAMPAGNE: Right. For example, you could have a variable x, which in R is an object, and one pass through the interpreter it can be an integer. And the next pass through the interpreter it could be a vector. And the next time it could be a function. So it's completely malleable in the way that objects are represented in the language. It provides an enormous flexibility.

You have all of the kinds of higher-level language capabilities for writing sophisticated loops and if statements and whatnot. But then you have all the built-in functions that are statistical-driven for creating models, for plotting and visualizing data, and for creating summaries and doing summary statistics on data, with just simple commands that are already built into the language that give the statistician or the developer an easy and flexible way to create meaningful analysis on data.
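
For Java listeners, the malleable x described above can be loosely mimicked with a variable typed as Object. The following is a minimal sketch in Java, not R code, and only an analogy: in R no declaration or type check is needed at all. The class name `MalleableX` and the helper `kindOf` are hypothetical names introduced for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.IntUnaryOperator;

final class MalleableX {
    // Report what kind of value x currently holds (hypothetical helper).
    static String kindOf(Object x) {
        if (x instanceof Integer) return "integer";
        if (x instanceof List) return "vector";
        if (x instanceof IntUnaryOperator) return "function";
        return "other";
    }

    public static void main(String[] args) {
        Object x = 42;                      // first pass: x holds an integer
        System.out.println(kindOf(x));

        x = Arrays.asList(1, 2, 3);         // next pass: x holds a vector-like list
        System.out.println(kindOf(x));

        IntUnaryOperator square = n -> n * n;
        x = square;                         // next pass: x holds a function
        System.out.println(kindOf(x));
    }
}
```

In Java the price of this flexibility is the explicit instanceof checks; the point made in the interview is that R treats every value as an object and lets the same name take on each of these roles without ceremony.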

GLOVER: If I'm a Java developer and I've heard this, and it sounds like somewhat of a loosely typed language, maybe not unrelated to something like Ruby. And if I am a Java developer, though, how do I make use of R?

CHAMPAGNE: Well, there's a couple ways if you're a Java developer. There's a package in R that's been developed called rJava. With the rJava package, you can access R in two ways.

You can, from your Java code, invoke R commands and R functions directly and return the results back into your Java environment. And vice versa: you can also from R, with the package, you can invoke Java functions and bring those back into R.

GLOVER: Oh. Wow.

CHAMPAGNE: For example, if you're trying to bring R into a BI environment or some other third-party application, you have the ability to drive it natively from Java using a package like rJava: you invoke a predefined collection of R scripts or an arbitrary block of R code and then get the artifacts of those things back, whether they be plots or tables or vectors or whatever. And then being able to use that in your Java application is a pretty powerful idea.

GLOVER: Yes. And I think it's also illustrative of the long-running question about where is Java going in the future. Is the language dead? That whole debate really is misnomered in that the key aspect here between R and Java is the JVM itself, right?

You have this other language that is far more powerful and effective in the case of statistical analysis. Yet you have, let's say, enterprise Java applications that live on something like Hadoop or whatever that is built in Java, but the bridge between the two is the JVM itself.

Hence you have the proliferation of other languages, like Groovy and Scala, which are more general-purpose as opposed to R. So I think this is fascinating. I specifically mentioned Hadoop because you talked about data analysis, and big data is key.

I think it's always been key, but it's becoming more and more, shall we say, in the forefront given the proliferation or, I guess, the lower cost of storage, which means we can store more and more data. So now we can do something with that data. Tell me more about the relationship between R and, as you said, BI and some of these tools.

CHAMPAGNE: Let's start with R and BI. I think that having advanced analytic capabilities to do things like sales forecasting or other kinds of analysis on data that's presented through dashboards or other interactive apps is important for understanding the business.

So one of the things that we've done at Revolution Analytics, in terms of making that more accessible, is to provide a collection of web services on top of R, where a programmer doesn't have to know about the analytics.

All they're worried about is the inputs and the outputs coming from the analytic routine and using a RESTful API, being able to invoke some predefined script that does analytics and then returns the artifacts of that execution back to the application.

So being able to plot out a sales forecast for a dashboard and have the user be able to interact with that sales forecast and predict out 12 months, 24 months, whatever it might be. All the analytic capability is what R brings to the table.

And the flexibility in the language, and the flexibility in the extensibility framework, makes things like adding on a web services layer, which is very important, doable. And then it enables any kind of application to take advantage of advanced analytics.

GLOVER: Sure.

CHAMPAGNE: Not having to go through some expensive third-party commercial tool like one of the big statistical players in the market to get that capability. So R, being open source, having all of this thriving community around it, really drives that.

GLOVER: You mentioned a web services layer. Does that mean that there's also a server component to R in some way?

CHAMPAGNE: Well, there are packages that have been built that offer some capabilities to build server-like functionality. What we've done at Revolution Analytics is built a server itself that contains a repository, has a security model, exposes the web services.

But it's all underneath, utilizing capabilities that are in the open source for R and for the extensibility around the language itself. R, in and of itself, is pretty fundamental in terms of what it offers. But its extensibility makes it flexible enough to build things like a server on top of it without having to jump through hoops to do it.

GLOVER: And then specifically back to Hadoop. Could you explain at a high level what's cool, what's hot these days? There's obviously the cloud, and then this NoSQL movement, if you will -- data stored in nonrelational systems. How does R play in that space? Who, for example, is using R in these environments, and how are they leveraging it?

CHAMPAGNE: Well, Hadoop is basically a big distributed computing framework for data and for programming. And one of the things that a lot of companies that have their data stored in Hadoop want is to be able to do some more-advanced analytics on that data: run a giant regression, run a massive crosstab with billions of rows of data and thousands of columns of variables.

And being able to do that, cross tabulations are something that ... that's a step of statistical function, and that's built into R. Marrying those two capabilities of being able to do that on a very large collection of data -- terabyte-class or even petabyte-class data -- and distribute it across nodes is where we're adding some development effort in trying to make that marriage happen.

There are efforts in the open source community to do that as well. The potential is really there to try to scale these kinds of advanced analytics such that R can be a major player in providing beyond the kind of simple MapReduce algorithms that are out there and really writing MapReduce algorithms directly in R and having them flow through the Hadoop framework.

So I think that it has a big future, and it remains to be seen how well it's played out. It's on the cutting edge, at this point. But it is going where analytics needs to go, which is with massive data stuff.
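
To make the crosstab mentioned above concrete for Java developers, here is a minimal sketch of a cross tabulation: counting rows by two categorical variables. It is an illustration in plain Java, not R's built-in function and not Revolution Analytics code; the class name `CrossTab`, the method `tabulate`, and the sample rows are all made up for this example.

```java
import java.util.Map;
import java.util.TreeMap;

final class CrossTab {
    // Count how many rows fall into each (row category, column category) cell.
    // Each input row is a two-element array: {rowCategory, columnCategory}.
    static Map<String, Map<String, Integer>> tabulate(String[][] rows) {
        Map<String, Map<String, Integer>> table = new TreeMap<>();
        for (String[] row : rows) {
            table.computeIfAbsent(row[0], k -> new TreeMap<>())
                 .merge(row[1], 1, Integer::sum);   // add 1 to this cell
        }
        return table;
    }

    public static void main(String[] args) {
        // Hypothetical data: {region, product} for four sales records.
        String[][] sales = {
            {"East", "Widget"},
            {"East", "Gadget"},
            {"West", "Widget"},
            {"East", "Widget"},
        };
        System.out.println(tabulate(sales));
    }
}
```

Counting by key in this way has the same general shape as a MapReduce job (emit a key with a count of 1, then sum per key), which is one reason a crosstab is a natural candidate for distributing across nodes when the data runs to billions of rows.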

GLOVER: One thing that comes to mind -- kind of to take a left turn here -- is you talked about analytics, and I think you even mentioned predictability. You had mentioned sales forecasts and being able to predict out, let's say, the next year of sales forecasts based on the data that we've collected thus far.

I'm just curious, from an overall statistical standpoint, what's the reality of actually doing that? If we could predict out the future easily, then why aren't some of us billionaires on the stock market?

[LAUGHTER]

CHAMPAGNE: Technically you're only predicting within a band of confidence.

[LAUGHTER]

Given the data that you have in the past, if the past is a prediction of the future, which often it is not, you're going to be able to at least have some idea of where you're going. But it's always within some sort of band of confidence. It's only as good as the data that you collected is going to be.

So to say that you're going to be able to say with any kind of certainty what the future is going to hold is obviously foolish. But there are definitely things, statistically, that can give you indications of where things are going based on historical data.

And that is one of the powers of being able to do something more meaningful with your data that you've collected and that you're collecting from your customers and all the transactions that you're doing to give you a leg up against your competition. And using R in that context is really powerful.
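
The "band of confidence" described above can be sketched numerically. The example below is a minimal illustration in Java under stated assumptions, not how R or Revolution Analytics computes forecasts: it fits a straight-line trend by least squares to observations taken at times 1..n and uses a normal approximation (z = 1.96) for a roughly 95% prediction band, where a t quantile would be more accurate for small samples. The class name `ForecastBand`, the method `predict`, and the sales figures are hypothetical.

```java
final class ForecastBand {
    // Returns {lower, forecast, upper} at time t0, given observations y[0..n-1]
    // taken at times 1..n. Requires at least three observations.
    static double[] predict(double[] y, double t0) {
        int n = y.length;
        double tBar = (n + 1) / 2.0;            // mean of the times 1..n
        double yBar = 0;
        for (double v : y) yBar += v;
        yBar /= n;

        double sxx = 0, sxy = 0;
        for (int i = 0; i < n; i++) {
            double dt = (i + 1) - tBar;
            sxx += dt * dt;
            sxy += dt * (y[i] - yBar);
        }
        double slope = sxy / sxx;               // least-squares slope
        double intercept = yBar - slope * tBar; // least-squares intercept

        double sse = 0;                         // sum of squared residuals
        for (int i = 0; i < n; i++) {
            double r = y[i] - (intercept + slope * (i + 1));
            sse += r * r;
        }
        double s = Math.sqrt(sse / (n - 2));    // residual standard error

        // Standard error of a new observation at t0; grows as t0 moves
        // away from the centre of the historical data.
        double se = s * Math.sqrt(1 + 1.0 / n + (t0 - tBar) * (t0 - tBar) / sxx);
        double forecast = intercept + slope * t0;
        return new double[] { forecast - 1.96 * se, forecast, forecast + 1.96 * se };
    }

    public static void main(String[] args) {
        double[] monthlySales = { 10, 12, 11, 14, 13, 16 }; // made-up figures
        for (double month : new double[] { 7, 12, 18 }) {
            double[] band = predict(monthlySales, month);
            System.out.printf("month %.0f: %.2f to %.2f (forecast %.2f)%n",
                    month, band[0], band[2], band[1]);
        }
    }
}
```

Running it for months further out shows the band widening, which matches the point made in the interview: the further you predict beyond the data you collected, the less certain the forecast becomes.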

GLOVER: Does R facilitate various distributions -- I think Poisson or Gaussian? But what about as well as a Monte Carlo analysis? Where does R play in that kind of world?

CHAMPAGNE: R has all of that. Whatever statistical capability you can think of, either R has that built into it or there are extensions via what R calls packages that provide that kind of functionality. So if you're doing Monte Carlo simulations or random forest or machine learning or optimization algorithms, all of that stuff exists either in R or someone else has built it in the community.

GLOVER: R is obviously very helpful in running these analyses and coming up with hard-core numbers. Earlier you'd mentioned something about dashboards and whatnot. Does R have a graphing component?

CHAMPAGNE: R has an extremely powerful visualization capability.

GLOVER: Wow. Tell me more about that.

CHAMPAGNE: One of its strengths is to be able to take data and quickly visualize it through a few simple commands and render all different kinds of views of the data. It's one of its most powerful features, and it's one of the things that allows you to have insight into your data and then do more ad hoc analysis to be able not to try to fit everything to a linear model -- to be able to really visualize the data, identify those outlier cases, do something meaningful with them, do additional analysis on it, and then really make your data meaningful in the context of your business or your organization.

So the plotting capability. I mean, just one command. There's a plot command in R. You just take "plot" and you take whatever you're plotting and it renders that very effectively for you. There are add-on packages like lattice and ggplot2 that provide even more-powerful concepts, in terms of plotting and graphing.

These are heavily used in the statistical world. Anybody who's producing meaningful graphs from statistical analysis is immediately drawn to that feature in R.

GLOVER: You talk about analyzing data. One of the challenges is getting at that data, whether it be in a file system or a relational database or a NoSQL database. Does R come with adapters or drivers for all these? How does one actually plug R into, let's say, pulling data out of Oracle or pulling data out of HBase or something?

CHAMPAGNE: R is basically set up to deal with delimited files. That's its origin. But there are a lot of add-on packages. Like there's an RODBC package for connecting to the database. There are specific packages that deal with Oracle and MySQL and whatnot.

R has a lot of that capability built into it. The connector capability for standard ways in which data is stored is provided with R. When you're talking about Hadoop and you're talking about HDFS and HBase, one of the things that Revolution Analytics has built is a series of connectors for those, to bring the ability to grab data directly from those sources in Hadoop and bring them into the R environment so that statisticians can work on that data directly as if it were native to R.

The number of connectors continues to grow as corporations and organizations have different data sources that they need to connect to. It's all part of the flexibility of R, in terms of the extensibility framework, to make new connectors and being able to bring data directly into the R programming environment.

GLOVER: Got it. And then, as we talked about earlier, R is an open source language or platform. But that begs the question: You're the CTO of a commercial company. What's the business model here? How do you all make money?

CHAMPAGNE: Well, we're kind of focused on trying to bring a couple of different things. One is to bring support to bringing R into the enterprise just like Red Hat does it with Linux, providing an enterprise version -- a distribution that we provide support around -- that we work on doing packaging that's suitable for the enterprise. And then adding extensions on top of R that help the enterprise integrate it into their environment.

So being able to deal with very large data sets, being able to integrate R into business-intelligence applications, and having those kinds of commercial-grade capabilities that companies and organizations are looking for because they're already sold on R. It's not a question of whether R is the tool that they want to use. Most companies are already sold on that. They're just looking for those additional things that can help them make it enterprise-ready and business-ready.

GLOVER: Got it. So most importantly, where do we find out more?

CHAMPAGNE: The open source version of R is available from CRAN, which is the Comprehensive R Archive Network. That's what CRAN stands for. So just type in CRAN to Google and the top link will get you to the open source project, and you can just download it from there.

GLOVER: Okay.

CHAMPAGNE: You can also download a community version from our website, and anyone is welcome to do that. In addition, for academics, if you're affiliated with an academic institution, all you need to do is fill out a form, and you get everything we have in our enterprise edition and it's available for you to use within the context of your academic work.

Those are two ways to get it and get it for free. And then, of course, we sell our enterprise edition for organizations and enterprises that are interested in bringing R into business.

GLOVER: Excellent. And I'm sure there's lots of documentation and examples and whatnot?

CHAMPAGNE: There's lots of mailing lists, local user groups, lots of examples out on the Net. If you start looking, you'll be amazed at how viral R has grown and what a great project it has become.

GLOVER: Awesome.

CHAMPAGNE: It's really awesome.

GLOVER: Well, David, thank you very much for spending some time out of your day and educating us on R. I think it's pretty neat that you can leverage R on the JVM. And I think that's a theme that's reverberated throughout the Java community: choosing the right language for the right job, if you will, or right tool for the right job.

And it certainly sounds like R is obviously built with statistics in mind. So it seems like a no-brainer if you're doing additional analytics or want to dig deeper into data, you would use something like R.

CHAMPAGNE: Absolutely.

GLOVER: Again, thank you for your time.

CHAMPAGNE: Thank you.

GLOVER: I'll look forward to hearing more about Revolution Analytics.

CHAMPAGNE: Okay. Thank you.

>> Andy Glover talking with David Champagne. Find more episodes in Andy's Java technical series of the developerWorks podcasts at ibm.com/developerworks/java.

[END OF SEGMENT]
