Tuesday, 21 October 2014

R: A first attempt at linear regression

I’ve been working through the videos that accompany the book An Introduction to Statistical Learning with Applications in R and thought it’d be interesting to try out linear regression against my meetup data set.
I wanted to see how well a linear regression algorithm could predict how many people were likely to RSVP to a particular event. I started with the following code to build a data frame containing some potential predictors:
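library(RNeo4j)

# Assumption: Neo4j is running locally on the default port ('graph' was not defined in the original snippet)
graph <- startGraph("http://localhost:7474/db/data/")

# Assumption: helper that converts epoch milliseconds to a date-time (used but not defined in the original snippet)
timestampToDate <- function(x) as.POSIXct(x / 1000, origin = "1970-01-01", tz = "GMT")

# Past events hosted by the London user group at the two office venues, with a count of 'yes' RSVPs
officeEventsQuery <- "MATCH (g:Group {name: \"Neo4j - London User Group\"})-[:HOSTED_EVENT]->(event)<-[:TO]-({response: 'yes'})<-[:RSVPD]-(),
(event)-[:HELD_AT]->(venue)
WHERE (event.time + event.utc_offset) < timestamp() AND venue.name IN [\"Neo Technology\", \"OpenCredo\"]
RETURN event.time + event.utc_offset AS eventTime, event.announced_at AS announcedAt, event.name, COUNT(*) AS rsvps"

# Drop events with no announcement time, then derive the date-based columns
events <- subset(cypher(graph, officeEventsQuery), !is.na(announcedAt))
events$eventTime <- timestampToDate(events$eventTime)
events$day <- format(events$eventTime, "%A")
events$monthYear <- format(events$eventTime, "%m-%Y")
events$month <- format(events$eventTime, "%m")
events$year <- format(events$eventTime, "%Y")
events$announcedAt <- timestampToDate(events$announcedAt)
events$timeDiff <- as.numeric(events$eventTime - events$announcedAt, units = "days")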
If we preview ‘events’ it contains the following columns:
> head(events)
            eventTime         announcedAt                                        event.name rsvps       day monthYear month year  timeDiff
1 2013-01-29 18:00:00 2012-11-30 11:30:57                                   Intro to Graphs    24   Tuesday   01-2013    01 2013 60.270174
2 2014-06-24 18:30:00 2014-06-18 19:11:19                                   Intro to Graphs    43   Tuesday   06-2014    06 2014  5.971308
3 2014-06-18 18:30:00 2014-06-08 07:03:13                         Neo4j World Cup Hackathon    24 Wednesday   06-2014    06 2014 10.476933
4 2014-05-20 18:30:00 2014-05-14 18:56:06                                   Intro to Graphs    53   Tuesday   05-2014    05 2014  5.981875
5 2014-02-11 18:00:00 2014-02-05 19:11:03                                   Intro to Graphs    35   Tuesday   02-2014    02 2014  5.950660
6 2014-09-04 18:30:00 2014-08-26 06:34:01 Hands On Intro to Cypher - Neo4j's Query Language    20  Thursday   09-2014    09 2014  9.497211
We want to predict ‘rsvps’ from the other columns so I started off by creating a linear model which took all the other columns into account:
> summary(lm(rsvps ~ ., data = events))

Call:
lm(formula = rsvps ~ ., data = events)

Residuals:
    Min      1Q  Median      3Q     Max
-8.2582 -1.1538  0.0000  0.4158 10.5803

Coefficients: (14 not defined because of singularities)
                                                                    Estimate Std. Error t value Pr(>|t|)
(Intercept)                                                       -9.365e+03  3.009e+03  -3.113  0.00897 **
eventTime                                                          3.609e-06  2.951e-06   1.223  0.24479
announcedAt                                                        3.278e-06  2.553e-06   1.284  0.22339
event.nameGraph Modelling - Do's and Don'ts                        4.884e+01  1.140e+01   4.286  0.00106 **
event.nameHands on build your first Neo4j app for Java developers  3.735e+01  1.048e+01   3.562  0.00391 **
event.nameHands On Intro to Cypher - Neo4j's Query Language        2.560e+01  9.713e+00   2.635  0.02177 *
event.nameIntro to Graphs                                          2.238e+01  8.726e+00   2.564  0.02480 *
event.nameIntroduction to Graph Database Modeling                 -1.304e+02  4.835e+01  -2.696  0.01946 *
event.nameLunch with Neo4j's CEO, Emil Eifrem                      3.920e+01  1.113e+01   3.523  0.00420 **
event.nameNeo4j Clojure Hackathon                                 -3.063e+00  1.195e+01  -0.256  0.80203
event.nameNeo4j Python Hackathon with py2neo's Nigel Small         2.128e+01  1.070e+01   1.989  0.06998 .
event.nameNeo4j World Cup Hackathon                                5.004e+00  9.622e+00   0.520  0.61251
dayTuesday                                                         2.068e+01  5.626e+00   3.676  0.00317 **
dayWednesday                                                       2.300e+01  5.522e+00   4.165  0.00131 **
monthYear01-2014                                                  -2.350e+02  7.377e+01  -3.185  0.00784 **
monthYear02-2013                                                  -2.526e+01  1.376e+01  -1.836  0.09130 .
monthYear02-2014                                                  -2.325e+02  7.763e+01  -2.995  0.01118 *
monthYear03-2013                                                  -4.605e+01  1.683e+01  -2.736  0.01805 *
monthYear03-2014                                                  -2.371e+02  8.324e+01  -2.848  0.01468 *
monthYear04-2013                                                  -6.570e+01  2.309e+01  -2.845  0.01477 *
monthYear04-2014                                                  -2.535e+02  8.746e+01  -2.899  0.01336 *
monthYear05-2013                                                  -8.672e+01  2.845e+01  -3.049  0.01011 *
monthYear05-2014                                                  -2.802e+02  9.420e+01  -2.975  0.01160 *
monthYear06-2013                                                  -1.022e+02  3.283e+01  -3.113  0.00897 **
monthYear06-2014                                                  -2.996e+02  1.003e+02  -2.988  0.01132 *
monthYear07-2014                                                  -3.123e+02  1.054e+02  -2.965  0.01182 *
monthYear08-2013                                                  -1.326e+02  4.323e+01  -3.067  0.00976 **
monthYear08-2014                                                  -3.060e+02  1.107e+02  -2.763  0.01718 *
monthYear09-2013                                                          NA         NA      NA       NA
monthYear09-2014                                                  -3.465e+02  1.164e+02  -2.976  0.01158 *
monthYear10-2012                                                   2.602e+01  1.959e+01   1.328  0.20886
monthYear10-2013                                                  -1.728e+02  5.678e+01  -3.044  0.01020 *
monthYear11-2012                                                   2.717e+01  1.509e+01   1.800  0.09704 .
month02                                                                   NA         NA      NA       NA
month03                                                                   NA         NA      NA       NA
month04                                                                   NA         NA      NA       NA
month05                                                                   NA         NA      NA       NA
month06                                                                   NA         NA      NA       NA
month07                                                                   NA         NA      NA       NA
month08                                                                   NA         NA      NA       NA
month09                                                                   NA         NA      NA       NA
month10                                                                   NA         NA      NA       NA
month11                                                                   NA         NA      NA       NA
year2013                                                                  NA         NA      NA       NA
year2014                                                                  NA         NA      NA       NA
timeDiff                                                                  NA         NA      NA       NA
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 5.287 on 12 degrees of freedom
Multiple R-squared:  0.9585,  Adjusted R-squared:  0.8512
F-statistic: 8.934 on 31 and 12 DF,  p-value: 0.0001399
As I understand it we can look at the R-squared value to understand how much of the variance in the data has been explained by the model. In this case the multiple R-squared is 96%, which drops to 85% once it’s adjusted for the number of predictors.
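The gap between the two R-squared figures in the summary comes from the penalty for the number of coefficients. Here is a minimal sketch of that arithmetic using the standard adjusted R-squared formula. The function name is my own, and the number of observations is inferred from the degrees of freedom in the output rather than stated anywhere in the post:

```r
# Adjusted R-squared from the multiple R-squared reported by summary(lm(...)).
# n = observations used in the fit, p = estimated coefficients excluding the intercept.
adjusted_r_squared <- function(r_squared, n, p) {
  1 - (1 - r_squared) * (n - 1) / (n - p - 1)
}

# Read off the output above: F-statistic on 31 and 12 DF,
# so n = 31 + 12 + 1 (intercept) = 44.
adj <- adjusted_r_squared(r_squared = 0.9585, n = 44, p = 31)
print(round(adj, 2))
```

With 31 coefficients fitted to 44 observations there are only 12 residual degrees of freedom left, which is why the adjustment is so large.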
A lot of the coefficients are tied to individual event names, which seems too specific to generalise, so I wanted to see what would happen if I derived a feature which indicated whether a session was practical:
events$practical <- grepl("Hackathon|Hands on|Hands On", events$event.name)
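A quick check of what that pattern picks up, sketched on a few event names copied from the output above. The `ignore.case = TRUE` variant is a suggestion of mine rather than what the post used. It covers both spellings of "Hands on" without listing each one:

```r
# Event names taken from the summary output above
event_names <- c(
  "Intro to Graphs",
  "Neo4j World Cup Hackathon",
  "Hands On Intro to Cypher - Neo4j's Query Language",
  "Hands on build your first Neo4j app for Java developers"
)

# Pattern as written in the post
original <- grepl("Hackathon|Hands on|Hands On", event_names)

# Suggested alternative: one case-insensitive pattern
practical <- grepl("hackathon|hands on", event_names, ignore.case = TRUE)

print(data.frame(event_names, original, practical))
```

Both versions flag the hackathon and the two hands-on sessions and leave "Intro to Graphs" unflagged.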
We can now run the model again with the new column, having excluded the ‘event.name’ field:
> summary(lm(rsvps ~ ., data = subset(events, select = -c(event.name))))

Call:
lm(formula = rsvps ~ ., data = subset(events, select = -c(event.name)))

Residuals:
    Min      1Q  Median      3Q     Max
-18.647  -2.311   0.000   2.908  23.218

Coefficients: (13 not defined because of singularities)
                   Estimate Std. Error t value Pr(>|t|)
(Intercept)      -3.980e+03  4.752e+03  -0.838   0.4127
eventTime         2.907e-06  3.873e-06   0.751   0.4621
announcedAt       3.336e-08  3.559e-06   0.009   0.9926
dayTuesday        7.547e+00  6.080e+00   1.241   0.2296
dayWednesday      2.442e+00  7.046e+00   0.347   0.7327
monthYear01-2014 -9.562e+01  1.187e+02  -0.806   0.4303
monthYear02-2013 -4.230e+00  2.289e+01  -0.185   0.8553
monthYear02-2014 -9.156e+01  1.254e+02  -0.730   0.4742
monthYear03-2013 -1.633e+01  2.808e+01  -0.582   0.5676
monthYear03-2014 -8.094e+01  1.329e+02  -0.609   0.5496
monthYear04-2013 -2.249e+01  3.785e+01  -0.594   0.5595
monthYear04-2014 -9.230e+01  1.401e+02  -0.659   0.5180
monthYear05-2013 -3.237e+01  4.654e+01  -0.696   0.4952
monthYear05-2014 -1.015e+02  1.509e+02  -0.673   0.5092
monthYear06-2013 -3.947e+01  5.355e+01  -0.737   0.4701
monthYear06-2014 -1.081e+02  1.604e+02  -0.674   0.5084
monthYear07-2014 -1.110e+02  1.678e+02  -0.661   0.5163
monthYear08-2013 -5.144e+01  6.988e+01  -0.736   0.4706
monthYear08-2014 -1.023e+02  1.784e+02  -0.573   0.5731
monthYear09-2013 -6.057e+01  7.893e+01  -0.767   0.4523
monthYear09-2014 -1.260e+02  1.874e+02  -0.672   0.5094
monthYear10-2012  9.557e+00  2.873e+01   0.333   0.7430
monthYear10-2013 -6.450e+01  9.169e+01  -0.703   0.4903
monthYear11-2012  1.689e+01  2.316e+01   0.729   0.4748
month02                  NA         NA      NA       NA
month03                  NA         NA      NA       NA
month04                  NA         NA      NA       NA
month05                  NA         NA      NA       NA
month06                  NA         NA      NA       NA
month07                  NA         NA      NA       NA
month08                  NA         NA      NA       NA
month09                  NA         NA      NA       NA
month10                  NA         NA      NA       NA
month11                  NA         NA      NA       NA
year2013                 NA         NA      NA       NA
year2014                 NA         NA      NA       NA
timeDiff                 NA         NA      NA       NA
practicalTRUE    -9.388e+00  5.289e+00  -1.775   0.0919 .
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 10.21 on 19 degrees of freedom
Multiple R-squared:  0.7546,  Adjusted R-squared:  0.4446
F-statistic: 2.434 on 24 and 19 DF,  p-value: 0.02592
Now the adjusted R-squared is down to 44% (75% unadjusted) and none of our coefficients are significant at the 5% level, so this wasn’t such a good change.
I also noticed that we’ve got a bit of overlap in the date-related features: there’s one column for monthYear and then separate ones for month and year. Let’s strip out the combined one:
> summary(lm(rsvps ~ ., data = subset(events, select = -c(event.name, monthYear))))

Call:
lm(formula = rsvps ~ ., data = subset(events, select = -c(event.name,
    monthYear)))

Residuals:
     Min       1Q   Median       3Q      Max
-16.5745  -4.0507  -0.1042   3.6586  24.4715

Coefficients: (1 not defined because of singularities)
                Estimate Std. Error t value Pr(>|t|)
(Intercept)   -1.573e+03  4.315e+03  -0.364   0.7185
eventTime      3.320e-06  3.434e-06   0.967   0.3425
announcedAt   -2.149e-06  2.201e-06  -0.976   0.3379
dayTuesday     4.713e+00  5.871e+00   0.803   0.4294
dayWednesday  -2.253e-01  6.685e+00  -0.034   0.9734
month02        3.164e+00  1.285e+01   0.246   0.8075
month03        1.127e+01  1.858e+01   0.607   0.5494
month04        4.148e+00  2.581e+01   0.161   0.8736
month05        1.979e+00  3.425e+01   0.058   0.9544
month06       -1.220e-01  4.271e+01  -0.003   0.9977
month07        1.671e+00  4.955e+01   0.034   0.9734
month08        8.849e+00  5.940e+01   0.149   0.8827
month09       -5.496e+00  6.782e+01  -0.081   0.9360
month10       -5.066e+00  7.893e+01  -0.064   0.9493
month11        4.255e+00  8.697e+01   0.049   0.9614
year2013      -1.799e+01  1.032e+02  -0.174   0.8629
year2014      -3.281e+01  2.045e+02  -0.160   0.8738
timeDiff              NA         NA      NA       NA
practicalTRUE -9.816e+00  5.084e+00  -1.931   0.0645 .
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 10.19 on 26 degrees of freedom
Multiple R-squared:  0.666, Adjusted R-squared:  0.4476
F-statistic: 3.049 on 17 and 26 DF,  p-value: 0.005187
Again none of the coefficients are statistically significant at the 5% level, which is disappointing. I think the main problem may be that I have very few data points (only 44, going by the degrees of freedom above), making it difficult to come up with a general model.
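The timeDiff row still shows NA in that last summary. As far as I can tell, that is what "not defined because of singularities" means: the column is an exact linear combination of other columns (timeDiff is just eventTime minus announcedAt), so lm() cannot estimate a separate coefficient for it. Here is a small sketch of the same effect on made-up data. None of these numbers come from the meetup data set:

```r
# Made-up data: 'diff' is exactly 'end - start', mirroring how timeDiff
# is derived from eventTime and announcedAt.
set.seed(1)
toy <- data.frame(start = runif(20, 0, 10))
toy$end  <- toy$start + runif(20, 1, 5)
toy$diff <- toy$end - toy$start
toy$y    <- 3 + 2 * toy$diff + rnorm(20)

fit <- lm(y ~ start + end + diff, data = toy)

# lm() drops the redundant column and reports its coefficient as NA
print(coef(fit))

# Leaving the derived column out gives a model with no aliased terms
fit_reduced <- lm(y ~ start + end, data = toy)
print(coef(fit_reduced))
```

The two fits make identical predictions, so dropping the redundant column loses nothing. It only removes the NA rows from the summary.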
I think my next step is to look for some other features that could impact the number of RSVPs e.g. other events on that day, the weather.
I’m a novice at this but trying to learn more so if you have any ideas of what I should do next please let me know.

Hadoop and the mystery of the version number

When I’m working with people on Hadoop I ask what you would think is a simple question: what version of Hadoop are you using? The answer is normally one of several attempts to explain what’s installed, including:
Answer                  Translation
Hortonworks/Cloudera    This is my Hadoop distribution.
Hortonworks 2           I know we aren’t using version 1.
Hadoop 2                I don’t know my distro but I’m using Hadoop 2.
Apache                  Someone else is working on this. I have no idea.
In reality though it’s not as straightforward as you might think. The easiest way to get the most bang for your buck is to look at the version number of the installed package. On yum-based systems you could simply run:
yum list hive\*
Loaded plugins: fastestmirror, priorities
Determining fastest mirrors
Installed Packages
hive.noarch                    0.13.0.2.1.1.0-385.el6    @HDP-2.1
hive-hcatalog.noarch           0.13.0.2.1.1.0-385.el6    @HDP-2.1
hive-jdbc.noarch               0.13.0.2.1.1.0-385.el6    @HDP-2.1
hive-webhcat.noarch            0.13.0.2.1.1.0-385.el6    @HDP-2.1
Available Packages
hive-hcatalog-server.noarch    0.13.0.2.1.1.0-385.el6    HDP-2.1
hive-metastore.noarch          0.13.0.2.1.1.0-385.el6    HDP-2.1
hive-server.noarch             0.13.0.2.1.1.0-385.el6    HDP-2.1
hive-server2.noarch            0.13.0.2.1.1.0-385.el6    HDP-2.1
hive-webhcat-server.noarch     0.13.0.2.1.1.0-385.el6    HDP-2.1
hivex.i686                     1.3.3-4.2.el6             base
hivex.x86_64                   1.3.3-4.2.el6             base
hivex-devel.i686               1.3.3-4.2.el6             base
hivex-devel.x86_64             1.3.3-4.2.el6             base
and get back a list of what’s installed and what’s available. You could also simply query the rpm database:
rpm -qa | grep hadoop
hadoop-2.4.0.2.1.1.0-385.el6.x86_64
hadoop-yarn-proxyserver-2.4.0.2.1.1.0-385.el6.x86_64
hadoop-hdfs-2.4.0.2.1.1.0-385.el6.x86_64
hadoop-yarn-2.4.0.2.1.1.0-385.el6.x86_64
hadoop-mapreduce-2.4.0.2.1.1.0-385.el6.x86_64
hadoop-yarn-resourcemanager-2.4.0.2.1.1.0-385.el6.x86_64
hadoop-libhdfs-2.4.0.2.1.1.0-385.el6.x86_64
hadoop-client-2.4.0.2.1.1.0-385.el6.x86_64
hadoop-mapreduce-historyserver-2.4.0.2.1.1.0-385.el6.x86_64
hadoop-yarn-nodemanager-2.4.0.2.1.1.0-385.el6.x86_64
hadoop-lzo-0.6.0-1.x86_64
hadoop-lzo-native-0.6.0-1.x86_64
If you run SLES you will need to use zypper, and on Windows you can look at the add/remove programs dialog on most recent versions. In the end you are still left with this cryptic string to decode. If you look closely there is a method to the madness, and it helps to know this level of detail when working in an area like Hadoop, where a minor version number or a build number can make all the difference.
For example:
package name-version-architecture
hadoop-2.4.0.2.1.1.0-385.el6.x86_64
The version number in this case is from a Hortonworks distribution, so it has seven dot-separated fields followed by a build number (eight parts in all):
package version-HDP version-build number
2.4.0-2.1.1.0-build 385
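The breakdown above can be sketched as a small parser. This assumes every HDP package string follows the layout shown in this post (a three-part Apache version, a four-part HDP version, then the build number, distribution tag and architecture). The function name and field names are my own:

```r
# Split an HDP-style RPM name into its parts, assuming the layout
# <name>-<x.y.z Apache version>.<a.b.c.d HDP version>-<build>.<dist>.<arch>
parse_hdp_package <- function(pkg) {
  pattern <- paste0(
    "^(.+)-",                                  # package name
    "([0-9]+\\.[0-9]+\\.[0-9]+)\\.",           # Apache component version
    "([0-9]+\\.[0-9]+\\.[0-9]+\\.[0-9]+)-",    # HDP version
    "([0-9]+)\\.([^.]+)\\.([^.]+)$"            # build, dist tag, architecture
  )
  parts <- regmatches(pkg, regexec(pattern, pkg))[[1]]
  if (length(parts) == 0) stop("not an HDP-style package string: ", pkg)
  list(name = parts[2], version = parts[3], hdp = parts[4],
       build = parts[5], dist = parts[6], arch = parts[7])
}

info <- parse_hdp_package("hadoop-2.4.0.2.1.1.0-385.el6.x86_64")
cat(info$name, info$version, "HDP", info$hdp, "build", info$build, "\n")

# Package names that contain hyphens work the same way
yarn <- parse_hdp_package("hadoop-yarn-proxyserver-2.4.0.2.1.1.0-385.el6.x86_64")
cat(yarn$name, yarn$version, "\n")
```

Strings that don't fit the layout, such as hadoop-lzo-0.6.0-1.x86_64 from the rpm listing, raise an error instead of being guessed at.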
It’s important to know both the version of Hadoop and the version of the package you are working on. For example, if someone says “I’m working on Hive”, you really need to know which Hive version AND which Hadoop version, because the two are intimately linked. If someone gives you the Hive package string:
hive-0.13.0.2.1.1.0-385.el6.noarch
That’s really not enough information to tell which version of Hadoop they are using. You know they are on HDP 2.1.1.0, so you either ask for the same information about the installed Hadoop package OR go to the release notes for the distro to decode the distribution version number into the Apache Hadoop version. Each distribution uses a different combination of packages and it pays to know EXACTLY what you are getting when you download a distro. Cloudera has exactly the same issue, and their packaging is arguably more forthcoming in that it tells you how many patches were applied. Hortonworks gives this information in their release notes.
package name-package version+CDH version+patches
hadoop-2.3.0+cdh5.1+384
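The Cloudera layout splits cleanly on the plus signs. Here is a minimal sketch, assuming the string always has exactly the three plus-separated fields shown above. The function and field names are hypothetical:

```r
# Assumed layout, following the line above:
# <name>-<version>+cdh<CDH version>+<number of patches>
parse_cdh_package <- function(pkg) {
  fields <- strsplit(pkg, "+", fixed = TRUE)[[1]]
  if (length(fields) != 3) stop("not a CDH-style package string: ", pkg)
  list(
    name    = sub("-[^-]+$", "", fields[1]),  # everything before the last hyphen
    version = sub("^.*-", "", fields[1]),     # everything after the last hyphen
    cdh     = sub("^cdh", "", fields[2]),     # CDH release without the prefix
    patches = as.integer(fields[3])           # patches applied on top of the base version
  )
}

info_cdh <- parse_cdh_package("hadoop-2.3.0+cdh5.1+384")
str(info_cdh)
```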
Hopefully now you have a better understanding of Hadoop package versions.