I’ve been working through the videos that accompany the Introduction to Statistical Learning with Applications in R book and thought it’d be interesting to try out the linear regression algorithm against my meetup data set.
I wanted to see how well a linear regression algorithm could predict how many people were likely to RSVP to a particular event. I started with the following code to build a data frame containing some potential predictors:
library(RNeo4j)

# Assumes `graph` is an existing RNeo4j connection and `timestampToDate` is a helper,
# defined elsewhere, that converts Neo4j's millisecond timestamps into date-times.
officeEventsQuery = "MATCH (g:Group {name: \"Neo4j - London User Group\"})-[:HOSTED_EVENT]->(event)<-[:TO]-({response: 'yes'})<-[:RSVPD]-(),
(event)-[:HELD_AT]->(venue)
WHERE (event.time + event.utc_offset) < timestamp() AND venue.name IN [\"Neo Technology\", \"OpenCredo\"]
RETURN event.time + event.utc_offset AS eventTime, event.announced_at AS announcedAt, event.name, COUNT(*) AS rsvps"

# Keep only the events that have an announcement time
events = subset(cypher(graph, officeEventsQuery), !is.na(announcedAt))

# Derive date-based candidate predictors from the event time
events$eventTime <- timestampToDate(events$eventTime)
events$day <- format(events$eventTime, "%A")
events$monthYear <- format(events$eventTime, "%m-%Y")
events$month <- format(events$eventTime, "%m")
events$year <- format(events$eventTime, "%Y")
events$announcedAt <- timestampToDate(events$announcedAt)

# Number of days between announcing the event and the event itself
events$timeDiff = as.numeric(events$eventTime - events$announcedAt, units = "days")
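The code above relies on a `timestampToDate` helper that isn't shown in this post. Here is a minimal sketch of what it might look like, assuming the timestamps are milliseconds since the Unix epoch (which is what Cypher's `timestamp()` returns). The UTC time zone and the sample values are assumptions made for illustration, not taken from the real data:

```r
# Hypothetical helper: convert milliseconds since the Unix epoch to POSIXct.
# The time zone is an assumption; the post doesn't say which one was used.
timestampToDate <- function(x) as.POSIXct(x / 1000, origin = "1970-01-01", tz = "UTC")

# Made-up sample values: an event and an announcement 36 hours before it
eventTime   <- timestampToDate(1359482400000)   # 2013-01-29 18:00:00 UTC
announcedAt <- timestampToDate(1359482400000 - 36 * 60 * 60 * 1000)

format(eventTime, "%m-%Y")   # month and year, as used for monthYear
format(eventTime, "%A")      # weekday name (the text depends on the locale)

# Subtracting two POSIXct values gives a difftime, which as.numeric()
# can convert to a plain number in the requested units
as.numeric(eventTime - announcedAt, units = "days")
```

Passing `units = "days"` explicitly matters because a difftime otherwise picks its own units based on the size of the gap.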
If we preview ‘events’ it contains the following columns:
> head(events)
            eventTime         announcedAt                                        event.name rsvps       day monthYear month year  timeDiff
1 2013-01-29 18:00:00 2012-11-30 11:30:57                                   Intro to Graphs    24   Tuesday   01-2013    01 2013 60.270174
2 2014-06-24 18:30:00 2014-06-18 19:11:19                                   Intro to Graphs    43   Tuesday   06-2014    06 2014  5.971308
3 2014-06-18 18:30:00 2014-06-08 07:03:13                         Neo4j World Cup Hackathon    24 Wednesday   06-2014    06 2014 10.476933
4 2014-05-20 18:30:00 2014-05-14 18:56:06                                   Intro to Graphs    53   Tuesday   05-2014    05 2014  5.981875
5 2014-02-11 18:00:00 2014-02-05 19:11:03                                   Intro to Graphs    35   Tuesday   02-2014    02 2014  5.950660
6 2014-09-04 18:30:00 2014-08-26 06:34:01 Hands On Intro to Cypher - Neo4j's Query Language    20  Thursday   09-2014    09 2014  9.497211
We want to predict ‘rsvps’ from the other columns so I started off by creating a linear model which took all the other columns into account:
> summary(lm(rsvps ~., data = events))

Call:
lm(formula = rsvps ~ ., data = events)

Residuals:
    Min      1Q  Median      3Q     Max
-8.2582 -1.1538  0.0000  0.4158 10.5803

Coefficients: (14 not defined because of singularities)
                                                                    Estimate Std. Error t value Pr(>|t|)
(Intercept)                                                       -9.365e+03  3.009e+03  -3.113  0.00897 **
eventTime                                                          3.609e-06  2.951e-06   1.223  0.24479
announcedAt                                                        3.278e-06  2.553e-06   1.284  0.22339
event.nameGraph Modelling - Do's and Don'ts                        4.884e+01  1.140e+01   4.286  0.00106 **
event.nameHands on build your first Neo4j app for Java developers  3.735e+01  1.048e+01   3.562  0.00391 **
event.nameHands On Intro to Cypher - Neo4j's Query Language        2.560e+01  9.713e+00   2.635  0.02177 *
event.nameIntro to Graphs                                          2.238e+01  8.726e+00   2.564  0.02480 *
event.nameIntroduction to Graph Database Modeling                 -1.304e+02  4.835e+01  -2.696  0.01946 *
event.nameLunch with Neo4j's CEO, Emil Eifrem                      3.920e+01  1.113e+01   3.523  0.00420 **
event.nameNeo4j Clojure Hackathon                                 -3.063e+00  1.195e+01  -0.256  0.80203
event.nameNeo4j Python Hackathon with py2neo's Nigel Small         2.128e+01  1.070e+01   1.989  0.06998 .
event.nameNeo4j World Cup Hackathon                                5.004e+00  9.622e+00   0.520  0.61251
dayTuesday                                                         2.068e+01  5.626e+00   3.676  0.00317 **
dayWednesday                                                       2.300e+01  5.522e+00   4.165  0.00131 **
monthYear01-2014                                                  -2.350e+02  7.377e+01  -3.185  0.00784 **
monthYear02-2013                                                  -2.526e+01  1.376e+01  -1.836  0.09130 .
monthYear02-2014                                                  -2.325e+02  7.763e+01  -2.995  0.01118 *
monthYear03-2013                                                  -4.605e+01  1.683e+01  -2.736  0.01805 *
monthYear03-2014                                                  -2.371e+02  8.324e+01  -2.848  0.01468 *
monthYear04-2013                                                  -6.570e+01  2.309e+01  -2.845  0.01477 *
monthYear04-2014                                                  -2.535e+02  8.746e+01  -2.899  0.01336 *
monthYear05-2013                                                  -8.672e+01  2.845e+01  -3.049  0.01011 *
monthYear05-2014                                                  -2.802e+02  9.420e+01  -2.975  0.01160 *
monthYear06-2013                                                  -1.022e+02  3.283e+01  -3.113  0.00897 **
monthYear06-2014                                                  -2.996e+02  1.003e+02  -2.988  0.01132 *
monthYear07-2014                                                  -3.123e+02  1.054e+02  -2.965  0.01182 *
monthYear08-2013                                                  -1.326e+02  4.323e+01  -3.067  0.00976 **
monthYear08-2014                                                  -3.060e+02  1.107e+02  -2.763  0.01718 *
monthYear09-2013                                                          NA         NA      NA       NA
monthYear09-2014                                                  -3.465e+02  1.164e+02  -2.976  0.01158 *
monthYear10-2012                                                   2.602e+01  1.959e+01   1.328  0.20886
monthYear10-2013                                                  -1.728e+02  5.678e+01  -3.044  0.01020 *
monthYear11-2012                                                   2.717e+01  1.509e+01   1.800  0.09704 .
month02                                                                   NA         NA      NA       NA
month03                                                                   NA         NA      NA       NA
month04                                                                   NA         NA      NA       NA
month05                                                                   NA         NA      NA       NA
month06                                                                   NA         NA      NA       NA
month07                                                                   NA         NA      NA       NA
month08                                                                   NA         NA      NA       NA
month09                                                                   NA         NA      NA       NA
month10                                                                   NA         NA      NA       NA
month11                                                                   NA         NA      NA       NA
year2013                                                                  NA         NA      NA       NA
year2014                                                                  NA         NA      NA       NA
timeDiff                                                                  NA         NA      NA       NA
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 5.287 on 12 degrees of freedom
Multiple R-squared:  0.9585,  Adjusted R-squared:  0.8512
F-statistic: 8.934 on 31 and 12 DF,  p-value: 0.0001399
As I understand it, we can look at the R-squared value to see how much of the variance in the data the model explains. Here the multiple R-squared is 96%, and the adjusted R-squared, which penalises the model for the number of predictors it uses, is 85%.
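The 85% figure is the adjusted R-squared rather than the multiple R-squared. As a rough check, it can be recomputed from the numbers printed in the summary. The sketch below uses the standard adjustment formula; `adjustedRSquared` is a name made up for this example:

```r
# Adjusted R-squared from the multiple R-squared, the number of
# observations n and the number of estimated predictors p.
adjustedRSquared <- function(rSquared, n, p) {
  1 - (1 - rSquared) * (n - 1) / (n - p - 1)
}

# Values read off the summary above: 31 predictors and 12 residual
# degrees of freedom, so n = 31 + 12 + 1 = 44 observations.
adjustedRSquared(0.9585, n = 44, p = 31)   # roughly 0.85
```

With 31 predictors and only 44 observations the adjustment is large, which is why the two R-squared values are so far apart.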
A lot of the coefficients are tied to specific event names, which seems too specific to generalise, so I wanted to see what would happen if I derived a feature indicating whether a session was practical:
events$practical = grepl("Hackathon|Hands on|Hands On", events$event.name)
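To see what that pattern picks up, here is a small sketch run against a few of the event names shown earlier. The `ignore.case` variant is just an alternative way of writing the same match, not what this post uses:

```r
# A few event names taken from the data above
eventNames <- c("Intro to Graphs",
                "Neo4j World Cup Hackathon",
                "Hands On Intro to Cypher - Neo4j's Query Language",
                "Hands on build your first Neo4j app for Java developers")

# The pattern from the post: both capitalisations spelled out
practical <- grepl("Hackathon|Hands on|Hands On", eventNames)

# Alternative: match case-insensitively instead
practicalIgnoreCase <- grepl("hackathon|hands on", eventNames, ignore.case = TRUE)

data.frame(eventNames, practical, practicalIgnoreCase)
```

Both versions flag the hackathon and the two hands-on sessions, and leave "Intro to Graphs" as not practical.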
We can now run the model again with the new column, having excluded the ‘event.name’ field:
> summary(lm(rsvps ~., data = subset(events, select = -c(event.name))))

Call:
lm(formula = rsvps ~ ., data = subset(events, select = -c(event.name)))

Residuals:
    Min      1Q  Median      3Q     Max
-18.647  -2.311   0.000   2.908  23.218

Coefficients: (13 not defined because of singularities)
                   Estimate Std. Error t value Pr(>|t|)
(Intercept)      -3.980e+03  4.752e+03  -0.838   0.4127
eventTime         2.907e-06  3.873e-06   0.751   0.4621
announcedAt       3.336e-08  3.559e-06   0.009   0.9926
dayTuesday        7.547e+00  6.080e+00   1.241   0.2296
dayWednesday      2.442e+00  7.046e+00   0.347   0.7327
monthYear01-2014 -9.562e+01  1.187e+02  -0.806   0.4303
monthYear02-2013 -4.230e+00  2.289e+01  -0.185   0.8553
monthYear02-2014 -9.156e+01  1.254e+02  -0.730   0.4742
monthYear03-2013 -1.633e+01  2.808e+01  -0.582   0.5676
monthYear03-2014 -8.094e+01  1.329e+02  -0.609   0.5496
monthYear04-2013 -2.249e+01  3.785e+01  -0.594   0.5595
monthYear04-2014 -9.230e+01  1.401e+02  -0.659   0.5180
monthYear05-2013 -3.237e+01  4.654e+01  -0.696   0.4952
monthYear05-2014 -1.015e+02  1.509e+02  -0.673   0.5092
monthYear06-2013 -3.947e+01  5.355e+01  -0.737   0.4701
monthYear06-2014 -1.081e+02  1.604e+02  -0.674   0.5084
monthYear07-2014 -1.110e+02  1.678e+02  -0.661   0.5163
monthYear08-2013 -5.144e+01  6.988e+01  -0.736   0.4706
monthYear08-2014 -1.023e+02  1.784e+02  -0.573   0.5731
monthYear09-2013 -6.057e+01  7.893e+01  -0.767   0.4523
monthYear09-2014 -1.260e+02  1.874e+02  -0.672   0.5094
monthYear10-2012  9.557e+00  2.873e+01   0.333   0.7430
monthYear10-2013 -6.450e+01  9.169e+01  -0.703   0.4903
monthYear11-2012  1.689e+01  2.316e+01   0.729   0.4748
month02                  NA         NA      NA       NA
month03                  NA         NA      NA       NA
month04                  NA         NA      NA       NA
month05                  NA         NA      NA       NA
month06                  NA         NA      NA       NA
month07                  NA         NA      NA       NA
month08                  NA         NA      NA       NA
month09                  NA         NA      NA       NA
month10                  NA         NA      NA       NA
month11                  NA         NA      NA       NA
year2013                 NA         NA      NA       NA
year2014                 NA         NA      NA       NA
timeDiff                 NA         NA      NA       NA
practicalTRUE    -9.388e+00  5.289e+00  -1.775   0.0919 .
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 10.21 on 19 degrees of freedom
Multiple R-squared:  0.7546,  Adjusted R-squared:  0.4446
F-statistic: 2.434 on 24 and 19 DF,  p-value: 0.02592
Now the adjusted R-squared is only 44% and none of our coefficients are significant at the 5% level, so this wasn’t such a good change.
I also noticed that we’ve got a bit of overlap in the date-related features: we’ve got one column for monthYear and then separate ones for month and year. Let’s strip out the combined one:
> summary(lm(rsvps ~., data = subset(events, select = -c(event.name, monthYear))))

Call:
lm(formula = rsvps ~ ., data = subset(events, select = -c(event.name,
    monthYear)))

Residuals:
     Min       1Q   Median       3Q      Max
-16.5745  -4.0507  -0.1042   3.6586  24.4715

Coefficients: (1 not defined because of singularities)
                Estimate Std. Error t value Pr(>|t|)
(Intercept)   -1.573e+03  4.315e+03  -0.364   0.7185
eventTime      3.320e-06  3.434e-06   0.967   0.3425
announcedAt   -2.149e-06  2.201e-06  -0.976   0.3379
dayTuesday     4.713e+00  5.871e+00   0.803   0.4294
dayWednesday  -2.253e-01  6.685e+00  -0.034   0.9734
month02        3.164e+00  1.285e+01   0.246   0.8075
month03        1.127e+01  1.858e+01   0.607   0.5494
month04        4.148e+00  2.581e+01   0.161   0.8736
month05        1.979e+00  3.425e+01   0.058   0.9544
month06       -1.220e-01  4.271e+01  -0.003   0.9977
month07        1.671e+00  4.955e+01   0.034   0.9734
month08        8.849e+00  5.940e+01   0.149   0.8827
month09       -5.496e+00  6.782e+01  -0.081   0.9360
month10       -5.066e+00  7.893e+01  -0.064   0.9493
month11        4.255e+00  8.697e+01   0.049   0.9614
year2013      -1.799e+01  1.032e+02  -0.174   0.8629
year2014      -3.281e+01  2.045e+02  -0.160   0.8738
timeDiff              NA         NA      NA       NA
practicalTRUE -9.816e+00  5.084e+00  -1.931   0.0645 .
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 10.19 on 26 degrees of freedom
Multiple R-squared:  0.666, Adjusted R-squared:  0.4476
F-statistic: 3.049 on 17 and 26 DF,  p-value: 0.005187
Again none of the coefficients are statistically significant, which is disappointing. I think the main problem may be that I have very few data points (going by the degrees of freedom, the model was fitted on only 44 observations), making it difficult to come up with a general model.
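Separately from the sample size, the NA rows in the summaries above ("not defined because of singularities") are what `lm` reports when a predictor is an exact linear combination of other predictors. That would explain `timeDiff`, which is just `eventTime` minus `announcedAt` rescaled to days. The sketch below reproduces the behaviour on made-up data; the column names and numbers are invented for illustration:

```r
set.seed(1)  # made-up data, only to illustrate the behaviour

n <- 30
toy <- data.frame(start = runif(n, 0, 100), end = runif(n, 100, 200))
toy$duration <- toy$end - toy$start        # exact linear combination of two other columns
toy$y <- 3 + 0.5 * toy$start + rnorm(n)

fit <- lm(y ~ start + end + duration, data = toy)

coef(fit)    # the coefficient for duration comes back as NA
alias(fit)   # shows duration expressed in terms of the other columns
```

If this reading is right, dropping `timeDiff` from the model (or dropping one of the two timestamps instead) should leave the fit unchanged and simply remove the NA row.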
I think my next step is to look for some other features that could impact the number of RSVPs e.g. other events on that day, the weather.
I’m a novice at this but trying to learn more so if you have any ideas of what I should do next please let me know.
 
 
