I typically use Python for my data analysis. However, there are many packages which exist only in R. RPy2 allows interaction between R and Python. The following post is an attempt to show how we can leverage the awesomeness of R's packages in Python. Python meets R-Pythor. The following is the Github link to the project: https://github.com/nipunbatra/pythor
The installation on OSX wasn't straighforward. The first part of this post explains how I was able to install RPy2 on OSX. In the next part, I give some sample usage.
I assume that R was installed to the default location. First, we need to edit
/opt/local/Library/Frameworks/R.framework/Resources/etc/Makeconf
and change the following line
LIBS = -lpcre -lbz2 -lz -lm -liconv -licuuc -licui18n
TO
LIBS = -lpcre -lbz2 -lz -lm -liconv
Next, I used Conda to install rpy2 following instructions on this Stack Overflow question.
conda skeleton pypi rpy2
conda build rpy2
conda install rpy2 --use-local
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
df = pd.DataFrame(np.random.randn(10,5))
df.head()
df.describe()
from rpy2.robjects import pandas2ri
pandas2ri.activate()
Creating an R data drame corresponding to df
rdf = pandas2ri.py2ri(df)
rdf
print(rdf)
Now, let us perform some R operations on the dataframe
import rpy2.robjects as robjects
From the rpy2 documentation,
The object r in rpy2.robjects represents the running embedded R process. If familiar with R and the R console, r is a little like a communication channel from Python to R.
R functions¶head of our R dataframe rdfhead_df = robjects.r['head']
print(head_df(rdf, 2))
r_summary = robjects.r['summary'](rdf)
print(r_summary)
Let us take the example of linear regression on faithful geyser data set. First, let us load the R dataset into Pandas.
faithful_pandas_df = robjects.r('faithful')
type(faithful_pandas_df)
faithful_pandas_df.head()
Now, let us create an R dataframe from this Python dataframe. It should be noted that we could have directly used the faithful geyser dataset in R. But, I chose to do this extra conversion for purpose of illustration.
faithful_r_df = pandas2ri.py2ri(faithful_pandas_df)
Let us take the head of the R dataframe and confirm that it is the same as the pandas dataframe.
print(robjects.r['head'](faithful_r_df, 5))
Great, now let us fit a linear relationship bewteen #eruptions and waiting as follows:
$$\mathrm{eruptions} = \alpha + \beta\times\mathrm{waiting} + \epsilon$$
For the purpose of this illustration, we will assume that there exists no package for linear regression in Python. So, we will be using the lm R stats package. Our aim would be to create a function in Python that does the following:
lm package from R to fit the linear relationshipSince we have already done step 2, we'll jump to step 3
from rpy2.robjects.packages import importr
stats = importr('stats')
fit = stats.lm('eruptions ~ waiting', data=faithful_r_df)
print(fit.names)
Let us extract the residuals from the fit R object now.
residuals = fit.rx2('residuals')
Viewing the first five residuals
print(residuals[:5])
Getting a NumPy array from the residuals
residuals_numpy = pandas2ri.ri2py(residuals)
residuals_numpy[:5]
Let us plot a histogram of these residuals using matplotlib now. For a good fit, we would expect a normal distributed residual set.
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
%matplotlib inline
plt.hist(residuals_numpy)
The histogram looks reasonably good.
Now, let us extract the coeffiencts from the linear fit
coeffs = fit.rx2('coefficients')
print(coeffs)
type(coeffs)
coeffs_python = pandas2ri.ri2py(coeffs)
coeffs_python
Looks like we lost the names of columns (waiting, etc.)
coeffs_python_names = pandas2ri.ri2py(coeffs.names).tolist()
coeffs_python_names
Let us now extract the fitted value of eruptions
fit_eruptions = pandas2ri.ri2py(fit.rx2('fitted.values'))
Now, let us plot the true and the fitted eruptions in matplotlib.
plt.scatter(faithful_pandas_df['waiting'], fit_eruptions, label='Fitted')
plt.scatter(faithful_pandas_df['waiting'], faithful_pandas_df['eruptions'], label='True eruptions', color='r')
plt.legend(loc='upper left')
plt.xlabel("Waiting")
plt.ylabel("Eruptions")
Now, let us predict using the learnt linear model on last five entries in the dataset.
last_5 = robjects.r['tail'](faithful_r_df, 5)
print(last_5)
pred_r = stats.predict(fit, newdata=last_5)
print(pred_r)
pred_python = pandas2ri.ri2py(pred_r)
pred_python
Great! Now that we are able to fit a model and also predict on it, let us now package this function similar to the scikit-learn API
class PYLM(object):
from rpy2.robjects import pandas2ri
pandas2ri.activate()
from rpy2.robjects.packages import importr
stats = importr('stats')
def convert_fit_to_python(self, fit):
coeffs_r = fit.rx2('coefficients')
coeffs= pandas2ri.ri2py(coeffs_r)
coeff_names = pandas2ri.ri2py(coeffs_r.names).tolist()
coeff_series = pd.Series({k:v for k,v in zip(coeff_names, coeffs)})
fitted_values = pandas2ri.ri2py(fit.rx2('fitted.values'))
return coeff_series
def fit(self, relationship, df):
"""
relationship: string of the form: a~b+c
df: Pandas Dataframe
"""
# Get R dataframe
r_df = pandas2ri.py2ri(df)
# Create linear fit
fit = stats.lm(relationship, data=df)
self.fit = fit
python_fit = self.convert_fit_to_python(fit)
return python_fit
def predict(self, df):
pred_r = stats.predict(self.fit, newdata=df)
pred_python = pandas2ri.ri2py(pred_r)
return pred_python
pylm = PYLM()
relationship='eruptions~waiting'
pylm.fit(relationship, faithful_pandas_df)
pylm.predict(faithful_pandas_df.tail(5))
Great, this means that we've now been able get the functionality of R's lm package! Now, let us take a more complex example
I'll be following this simple to follow tutorial for the R code.
r_stl = robjects.r['stl']
r_ts = robjects.r['ts']
import datetime
data = np.arange(85.) / 12.
data = np.sin(data * (2*np.pi))
data += np.arange(85.) / 12. * .5
data += .1 * np.random.randn(85)
idx = pd.DatetimeIndex(start=datetime.datetime(1999,1,1), freq='1M', periods=len(data))
data = pd.Series(data, index=idx)
data.plot()
Pandas uses human understandable frequency formatting. We'll need to convert that to a suitable periodicty for stl.
def convert_pd_freqstr(df):
"""
df: pd.DataFrame with pd.DatetimeIndex type index
"""
freqstr = df.index.freq.freqstr
freq_interval = freqstr[-1]
if len(freqstr)>1:
freq_number = int(freqstr[:-1])
else:
freq_number=1
if freq_interval is 'Y':
return 1.0/freq_number
elif freq_interval is 'M':
return 12.0/freq_number
elif freq_interval is 'D':
return 365.0/freq_number
convert_pd_freqstr(data)
r_ts_data = r_ts(robjects.FloatVector(np.asarray(data)), start= robjects.IntVector([data.index[0].year, data.index[0].month, data.index[0].day]), frequency=convert_pd_freqstr(data))
print(r_ts_data)
Let us plot this time series. The next code snippet is also a good example of how to invoke custom R code in Python.
r_plot = robjects.r("""
function(data, filename){
png(filename=filename)
plot(data)
dev.off()}
""")
r_plot(r_ts_data, "/Users/nipunbatra/Desktop/ts_1.png")
from IPython.display import Image
Image("/Users/nipunbatra/Desktop/ts_1.png")
r_ts_decomposed = r_stl(r_ts_data, 12)
r_plot(r_ts_decomposed, "/Users/nipunbatra/Desktop/decomposed.png")
Image("/Users/nipunbatra/Desktop/decomposed.png")
This looks great. Let us now write a small Python class which takes in a Pandas timeseries and uses R's stl for decomposition and returns a dataframe consisting of data and three other timeseries.
class PYSTL(object):
def convert_pd_freqstr(df):
freqstr = df.index.freq.freqstr
freq_interval = freqstr[-1]
if len(freqstr)>1:
freq_number = int(freqstr[:-1])
else:
freq_number=1
if freq_interval is 'Y':
return 1.0/freq_number
elif freq_interval is 'M':
return 12.0/freq_number
elif freq_interval is 'D':
return 365.0/freq_number
def decompose(self, ser, np=12):
from rpy2 import robjects
from numpy import asarray
r_stl = robjects.r['stl']
r_ts = robjects.r['ts']
start = robjects.IntVector([ser.index[0].year, ser.index[0].month, ser.index[0].day])
freq = convert_pd_freqstr(ser)
r_ts_data = r_ts(robjects.FloatVector(asarray(ser)), start=start, frequency=freq)
r_decomposed = r_stl(r_ts_data, freq)
res_ts = asarray(r_decomposed[0])
res_ts = pd.DataFrame({"data":data,
"seasonal" : pd.Series(res_ts[:,0],
index=data.index),
"trend" : pd.Series(res_ts[:,1],
index=data.index),
"remainder" : pd.Series(res_ts[:,2],
index=data.index)})
res_ts = res_ts[['data','seasonal','trend','remainder']]
self.decomposed = res_ts
return res_ts
def plot(self, **kwargs):
ax = self.decomposed.plot(subplots=True, legend=False, **kwargs)
plt.tight_layout()
ax[0].set_ylabel("data")
ax[1].set_ylabel("seasonal")
ax[2].set_ylabel("trend")
ax[3].set_ylabel("remainder")
ax[3].set_xlabel("Time")
pystl = PYSTL()
pystl.decompose(data).head()
pystl.plot(figsize=(12,6))
So, there you go. RPY2 makes our life easy! With a minimal amount of effort, one can effectively open up the entire set of R libraries to be used in Python. Please feel free to contribute to the project and provide your feedback.