Drinking from the Firehose: using R to keep up with current ML Research - part 1
Jun 12, 2018
Ernest Omane-Kodie
7 minute read

In this post, I will explore how we can use resources available in the R package ecosystem to keep up with state-of-the-art machine learning research.

Motivation

Recently, I have been reading books about creativity. A recurring theme in these books is that the best way to start mastering a skill is to reproduce other people’s proven ideas and figure out their inner workings - a form of apprenticeship.

The 19th-century French writer Emile Zola captured the essence of this idea when he described art as “a corner of creation seen through a temperament”.

Will Gompertz neatly paraphrases this in Think Like an Artist as:

“Creativity is the presentation of pre-existing elements and ideas filtered through the perceptions and feelings of an individual.”

The idea permeates all creative endeavours. Would-be singer-songwriters, scientists, painters, and so on start off by attempting to replicate some established piece of work. Andrew Ng captures the same idea in a response to a Quora question I stumbled upon while writing this post:

“To go even further, read research papers (follow ML leaders on twitter to see what papers they’re excited about). Even better, try to replicate the results in the research papers. Trying to replicate others’ results is one of the most effective but under-appreciated ways to get good at AI”

The problem

The challenge with applying this concept to machine learning practice is the sheer pace at which the field is moving. We are witnessing the relentless creation of new algorithms, drastic improvements to existing algorithms, an explosion of data, significant strides in computer hardware technology, and unprecedented collaboration among researchers from different fields.

Although the internet has made it much easier to source materials and information, the sheer volume of research literature emerging from all the moving parts of machine learning can be overwhelming.

How do you discover new research literature? How do you find the most relevant research for your work? In the following sections, I will attempt to answer these questions. The aim is to use a few examples to shed a little light on how to stay plugged in to the research community without leaving the R software environment1.

Academic databases and search engines

There is a comprehensive list of academic databases and search engines on this Wikipedia page. I will look at one of the most popular databases in this category.

arXiv

arXiv is a repository of scientific papers, created in the spirit of the open access movement, where mathematicians and scientists upload their papers for worldwide access, and often for feedback, before they are published in peer-reviewed journals.

I will use aRxiv, an R interface to the arXiv API, and a few tidyverse tools to explore arXiv papers.

library(aRxiv)
library(tidyverse)
library(kableExtra)

# Helper to render a data frame as a styled HTML table
pretty_print <- function(df){
  df %>% 
    kable() %>% 
    kable_styling(font_size = 14) %>% 
    row_spec(0, bold = TRUE)
}

To start off, we need to construct a search query. To do this, we need terms to use as query arguments. The following options are available:

query_terms %>% pretty_print
| term            | description                                       |
|-----------------|---------------------------------------------------|
| ti              | Title                                             |
| au              | Author                                            |
| abs             | Abstract                                          |
| co              | Comment                                           |
| jr              | Journal Reference                                 |
| cat             | Subject Category                                  |
| rn              | Report Number                                     |
| all             | All of the above                                  |
| submittedDate   | Date/time of initial submission, as YYYYMMDDHHMM  |
| lastUpdatedDate | Date/time of last update, as YYYYMMDDHHMM         |
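
To get a feel for the query syntax before building anything elaborate, here is a minimal single-term search (the query string and limit are arbitrary choices for illustration):

# A minimal title search; the query string and limit are arbitrary choices
arxiv_search(query = 'ti:"generative adversarial"', limit = 5) %>% 
  select(submitted, id, title) %>% 
  pretty_print()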

Machine learning and related research papers are typically submitted under the following categories:

ai_ml_categories <- c("Artificial", "Intelligence", "Learning", 
                   "Robotics", "Vision") %>% 
  str_flatten("|")

arxiv_cats %>% 
  dplyr::filter(str_detect(description, ai_ml_categories)) %>% 
  pretty_print
| abbreviation | description                                                |
|--------------|------------------------------------------------------------|
| stat.ML      | Statistics - Machine Learning                              |
| cs.AI        | Computer Science - Artificial Intelligence                 |
| cs.CV        | Computer Science - Computer Vision and Pattern Recognition |
| cs.LG        | Computer Science - Learning                                |
| cs.RO        | Computer Science - Robotics                                |

We can search for papers under specific categories. For example, we can inspect papers submitted under robotics in the first two months of 2018 using:

c("cat:cs.RO", 
  "submittedDate:[201801010000 TO 201802302400]") %>% 
  str_flatten(" AND ") %>% 
  arxiv_search(limit = 12, sort_by = "updated", ascending = FALSE) %>% 
  select(submitted, id, title) %>% 
  head(5) %>% 
  pretty_print()
| submitted           | id           | title                                                                                                               |
|---------------------|--------------|---------------------------------------------------------------------------------------------------------------------|
| 2018-02-16 18:57:57 | 1802.06070v5 | Diversity is All You Need: Learning Skills without a Reward Function                                                 |
| 2018-02-12 17:47:13 | 1802.04205v3 | Efficient Hierarchical Robot Motion Planning Under Uncertainty and Hybrid Dynamics                                   |
| 2018-02-25 04:47:31 | 1802.08953v2 | Robust Target-relative Localization with Ultra-Wideband Ranging and Communication                                    |
| 2018-01-06 14:50:07 | 1801.02025v2 | Robot Localisation and 3D Position Estimation Using a Free-Moving Camera and Cascaded Convolutional Neural Networks  |
| 2018-01-01 23:41:50 | 1801.00527v3 | Freeform Assembly Planning                                                                                           |

We can search for papers by topic. For example, the following are a few of the most recent papers on adversarial AI submitted from the beginning of the year to 7th June 2018.

c("ti:Adversarial", 
  "submittedDate:[201801010000 TO 201806072400]") %>% 
  str_flatten(" AND ") %>% 
  arxiv_search(limit = 12, sort_by = "updated", ascending = FALSE) %>% 
  select(submitted, id, title) %>% 
  head(5) %>% 
  pretty_print()  
| submitted           | id           | title                                                                        |
|---------------------|--------------|------------------------------------------------------------------------------|
| 2018-02-15 17:13:18 | 1802.05666v2 | Adversarial Risk and the Dangers of Evaluating Against Weak Attacks           |
| 2018-05-21 10:58:10 | 1805.07984v3 | Adversarial Attacks on Neural Networks for Graph Data                         |
| 2018-06-05 17:04:37 | 1806.02299v2 | DPatch: Attacking Object Detectors with Adversarial Patches                   |
| 2018-06-07 23:27:16 | 1806.02924v2 | On Adversarial Risk and Training                                              |
| 2018-05-30 00:05:53 | 1805.11752v2 | Multi-turn Dialogue Response Generation in an Adversarial Learning Framework  |

We can build complex queries tailored to specific needs by chaining multiple queries together with AND, OR, and ANDNOT, and by combining fields such as title, author and abstract.
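
As a rough sketch of what a chained query might look like (the category, search terms and limit below are arbitrary choices for illustration):

# Papers filed under cs.LG with "Adversarial" in the title, excluding titles
# that also mention "survey"; terms and limit are arbitrary illustrations
"cat:cs.LG AND ti:Adversarial ANDNOT ti:survey" %>% 
  arxiv_search(limit = 10, sort_by = "submitted", ascending = FALSE) %>% 
  select(submitted, id, title) %>% 
  pretty_print()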

GitHub

Much of the new research coming out of the machine learning community is accompanied by open-source code hosted on GitHub. This is great news in the sense that you can easily obtain a copy of the source code of a project of interest and tinker with it to your heart’s content.

In this section, I will explore research projects hosted on GitHub. I will use the httr package to access the GitHub search API, and then use a few tidyverse packages to extract the necessary data points.

Suppose we want to find the most starred repositories about generative adversarial networks. We can start by creating a URL with a search query and passing it to httr.

library(httr)
library(magrittr)

url <- glue::glue("https://api.github.com/search/repositories?",
                  "q=generative+adversarial+network+in:name,description&",
                  "sort=stars&", 
                  "order=desc")

gh_repos <- GET(url) %>% 
  content(encoding = "UTF-8")

Let’s inspect all the metadata available on each repository:

gh_repos %>%
  use_series("items") %>% 
  extract2(1) %>% 
  names()
##  [1] "id"                "node_id"           "name"             
##  [4] "full_name"         "owner"             "private"          
##  [7] "html_url"          "description"       "fork"             
## [10] "url"               "forks_url"         "keys_url"         
## [13] "collaborators_url" "teams_url"         "hooks_url"        
## [16] "issue_events_url"  "events_url"        "assignees_url"    
## [19] "branches_url"      "tags_url"          "blobs_url"        
## [22] "git_tags_url"      "git_refs_url"      "trees_url"        
## [25] "statuses_url"      "languages_url"     "stargazers_url"   
## [28] "contributors_url"  "subscribers_url"   "subscription_url" 
## [31] "commits_url"       "git_commits_url"   "comments_url"     
## [34] "issue_comment_url" "contents_url"      "compare_url"      
## [37] "merges_url"        "archive_url"       "downloads_url"    
## [40] "issues_url"        "pulls_url"         "milestones_url"   
## [43] "notifications_url" "labels_url"        "releases_url"     
## [46] "deployments_url"   "created_at"        "updated_at"       
## [49] "pushed_at"         "git_url"           "ssh_url"          
## [52] "clone_url"         "svn_url"           "homepage"         
## [55] "size"              "stargazers_count"  "watchers_count"   
## [58] "language"          "has_issues"        "has_projects"     
## [61] "has_downloads"     "has_wiki"          "has_pages"        
## [64] "forks_count"       "mirror_url"        "archived"         
## [67] "open_issues_count" "license"           "forks"            
## [70] "open_issues"       "watchers"          "default_branch"   
## [73] "score"

What we are looking for in our example is the GitHub URL and the number of stars for each repository (stargazers_count).

Let’s look at the top 5 most starred GitHub repositories in this category.

# Helper to pull the URL and star count of the repository at a given index
extract_info <- function(index){
  repository <- gh_repos %>% 
    use_series("items") %>% 
    extract2(index) %>% 
    extract2("html_url")
  
  stars_count <- gh_repos %>% 
    use_series("items") %>% 
    extract2(index) %>% 
    extract2("stargazers_count")
  
  return(data.frame(repository = repository, 
                    stars_count = stars_count))
}

1:5 %>% map_dfr(extract_info) %>% 
  pretty_print()
| repository                                      | stars_count |
|--------------------------------------------------|-------------|
| https://github.com/carpedm20/DCGAN-tensorflow     | 4271        |
| https://github.com/junyanz/iGAN                   | 2785        |
| https://github.com/Newmu/dcgan_code               | 2762        |
| https://github.com/eriklindernoren/Keras-GAN      | 1895        |
| https://github.com/nightrome/really-awesome-gan   | 1715        |

Journals & RSS feeds

We can explore papers being published in scientific journals by extracting their RSS feeds. I will use tidyRSS, an R package for extracting tidy dataframes from RSS, Atom and JSON feeds. For feeds that do not play nice with tidyRSS, I will use the feedeR package.

Here are two examples:

We can scan new research papers published in Science magazine to see if anything has appeared on a specific topic of interest.

library(tidyRSS)
science_magazine <- tidyfeed("http://science.sciencemag.org/rss/twis.xml")
names(science_magazine)
## [1] "feed_title"       "feed_link"        "feed_description"
## [4] "item_title"       "item_link"        "item_description"

item_title contains a brief description of each paper. We can search this for specific keywords.

science_magazine %>% 
  filter(str_detect(item_title, "some_regex")) 

We can search publications in the Journal of Machine Learning Research using:

library(feedeR)
jmlr <- feed.extract("http://jmlr.org/jmlr.xml") %>% 
  use_series("items")
names(jmlr)
## [1] "title" "date"  "link"  "hash"

Again, we can search the title column for specific keywords.
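
For instance, something along these lines (the keyword is an arbitrary illustration):

# Keep only JMLR entries whose titles mention a keyword of interest
jmlr %>% 
  filter(str_detect(str_to_lower(title), "bayesian"))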

Wrapping up

Using logic similar to what we have stepped through above, we could set up an automated process that regularly scrapes these sources for relevant information, publishes the results to a webpage, or sends an email notification whenever something of interest is published.
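
As a rough sketch of what such a process might look like, the function below checks arXiv for papers submitted in the last 24 hours whose titles match a keyword and appends any hits to a CSV file. The keyword, file name and scheduling mechanism (cron, the cronR package, Windows Task Scheduler, and so on) are all assumptions for illustration.

# Sketch of a daily check for new arXiv papers on a topic of interest.
# The keyword and output file are arbitrary choices for illustration.
library(aRxiv)
library(tidyverse)

check_new_papers <- function(keyword, results_file = "new_papers.csv") {
  # Window covering the last 24 hours, in arXiv's YYYYMMDDHHMM format
  window <- format(c(Sys.time() - 24 * 60 * 60, Sys.time()), "%Y%m%d%H%M")
  
  query <- str_c("ti:", keyword, 
                 " AND submittedDate:[", window[1], " TO ", window[2], "]")
  
  new_papers <- arxiv_search(query, limit = 50)
  
  # Append any hits to a running CSV; a real set-up might send an email instead
  if (nrow(new_papers) > 0) {
    write_csv(select(new_papers, submitted, id, title), 
              results_file, append = file.exists(results_file))
  }
  invisible(new_papers)
}

check_new_papers("adversarial")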

In a future post, I will explore social media and other resources.


  1. You can achieve this same end through a combination of various mobile apps, email subscriptions, keeping an eye on machine learning tags on social media and so on, but where is the fun in that?