In this post, I will explore how we can use resources available in the R package ecosystem to keep up with state-of-the-art machine learning research.
Motivation
Recently, I have been reading books about creativity. A recurring theme in these books is that the best place to start when mastering a skill is by reproducing other people’s proven ideas and figuring out their inner workings - a form of apprenticeship.
The 19th-century French writer Émile Zola captured the essence of this idea in his description of art as “a corner of creation seen through a temperament”.
Will Gompertz neatly paraphrases this in Think Like an Artist as:
“Creativity is the presentation of pre-existing elements and ideas filtered through the perceptions and feelings of an individual.”
The idea permeates all creative endeavours. Would-be singer-songwriters, scientists, painters, and so on start off by attempting to replicate some established piece of work. Andrew Ng captures the same idea in a response to a question on Quora that I stumbled upon while writing this post:
“To go even further, read research papers (follow ML leaders on twitter to see what papers they’re excited about). Even better, try to replicate the results in the research papers. Trying to replicate others’ results is one of the most effective but under-appreciated ways to get good at AI”
The problem
The challenge with applying this concept to machine learning practice is that, in many respects, the field is going through exciting times. We are witnessing the relentless creation of new algorithms, drastic improvements in existing algorithms, an explosion of data, significant strides in computer hardware technology, and unprecedented collaboration among researchers from different fields.
Although the internet has made it much easier to source materials and information, the sheer volume of research literature emerging from all the moving parts of machine learning can be overwhelming.
How do you discover new research literature? How do you find the most relevant research for your work? In the following sections, I will attempt to answer these questions. The aim is to use a few examples to shed a little light on how to stay plugged in to the research community without leaving the R software environment1.
Academic databases and search engines
There is a comprehensive list of academic databases and search engines on this Wikipedia page. I will look at one of the most popular databases in this category.
arXiv
arXiv is a repository of scientific papers, created in the spirit of the open-access movement, where mathematicians and scientists upload their papers for worldwide access, often before they are published in peer-reviewed journals and sometimes to gather feedback on work in progress.
I will use aRxiv, an R interface to the arXiv API, and a few tidyverse tools to explore arXiv papers.
library(aRxiv)
library(tidyverse)
library(kableExtra)

# Helper: render a data frame as a styled HTML table
pretty_print <- function(df) {
  result <- df %>%
    kable() %>%
    kable_styling(font_size = 14) %>%
    row_spec(0, bold = TRUE)
  return(result)
}
To start off, we need to construct a search query. To do this, we need terms to use as query arguments. The following options are available:
query_terms %>% pretty_print()
| term | description |
|---|---|
| ti | Title |
| au | Author |
| abs | Abstract |
| co | Comment |
| jr | Journal Reference |
| cat | Subject Category |
| rn | Report Number |
| all | All of the above |
| submittedDate | Date/time of initial submission, as YYYYMMDDHHMM |
| lastUpdatedDate | Date/time of last update, as YYYYMMDDHHMM |
Machine learning and related research papers are typically submitted under the following categories:
ai_ml_categories <- c("Artificial", "Intelligence", "Learning",
                      "Robotics", "Vision") %>%
  str_flatten("|")

arxiv_cats %>%
  dplyr::filter(str_detect(description, ai_ml_categories)) %>%
  pretty_print()
| abbreviation | description |
|---|---|
| stat.ML | Statistics - Machine Learning |
| cs.AI | Computer Science - Artificial Intelligence |
| cs.CV | Computer Science - Computer Vision and Pattern Recognition |
| cs.LG | Computer Science - Learning |
| cs.RO | Computer Science - Robotics |
We can search for papers under specific categories. For example, we can inspect papers submitted under robotics in the first two months of 2018 using:

c("cat:cs.RO",
  "submittedDate:[201801010000 TO 201802282400]") %>%
  str_flatten(" AND ") %>%
  arxiv_search(limit = 12, sort_by = "updated", ascending = FALSE) %>%
  select(submitted, id, title) %>%
  head(5) %>%
  pretty_print()
| submitted | id | title |
|---|---|---|
| 2018-02-16 18:57:57 | 1802.06070v5 | Diversity is All You Need: Learning Skills without a Reward Function |
| 2018-02-12 17:47:13 | 1802.04205v3 | Efficient Hierarchical Robot Motion Planning Under Uncertainty and Hybrid Dynamics |
| 2018-02-25 04:47:31 | 1802.08953v2 | Robust Target-relative Localization with Ultra-Wideband Ranging and Communication |
| 2018-01-06 14:50:07 | 1801.02025v2 | Robot Localisation and 3D Position Estimation Using a Free-Moving Camera and Cascaded Convolutional Neural Networks |
| 2018-01-01 23:41:50 | 1801.00527v3 | Freeform Assembly Planning |
We can search for papers by topic. For example, the following are a few of the most recent papers on adversarial AI submitted between the beginning of 2018 and 7 June 2018.
c("ti:Adversarial",
"submittedDate:[201801010000 TO 201806072400]") %>%
str_flatten(" AND ") %>%
arxiv_search(limit = 12, sort_by = "updated", ascending = FALSE) %>%
select(submitted, id, title) %>%
head(5) %>%
pretty_print()
| submitted | id | title |
|---|---|---|
| 2018-02-15 17:13:18 | 1802.05666v2 | Adversarial Risk and the Dangers of Evaluating Against Weak Attacks |
| 2018-05-21 10:58:10 | 1805.07984v3 | Adversarial Attacks on Neural Networks for Graph Data |
| 2018-06-05 17:04:37 | 1806.02299v2 | DPatch: Attacking Object Detectors with Adversarial Patches |
| 2018-06-07 23:27:16 | 1806.02924v2 | On Adversarial Risk and Training |
| 2018-05-30 00:05:53 | 1805.11752v2 | Multi-turn Dialogue Response Generation in an Adversarial Learning Framework |
We can build complex queries based on specific needs by chaining multiple query terms together with AND, OR, and ANDNOT. We can also combine searches across titles, authors, abstracts, and so on.
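For example, here is a hedged sketch of a chained query; the search term and categories are arbitrary choices for illustration, and the grouping parentheses follow the arXiv API query syntax:

# Illustrative only: titles containing "reinforcement", in the machine
# learning or robotics categories, excluding computer vision
"ti:reinforcement AND (cat:stat.ML OR cat:cs.RO) ANDNOT cat:cs.CV" %>%
  arxiv_search(limit = 10, sort_by = "submitted", ascending = FALSE) %>%
  select(submitted, id, title)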
GitHub
Much of the new research in the machine learning community is released as open-source projects hosted on GitHub. This is great news in the sense that you can easily obtain a copy of the source code of a project of interest and tinker with it to your heart’s content.
In this section, I will explore research projects hosted on GitHub. I will use the httr package to access the GitHub search API, and then use a few tidyverse packages to extract the necessary data points.
Suppose we want to find the most starred repositories about generative adversarial networks. We can start by creating a URL with a search query and passing it to httr.
library(httr)
library(magrittr)

url <- glue::glue("https://api.github.com/search/repositories?",
                  "q=generative+adversarial+network+in:name,description&",
                  "sort=stars&",
                  "order=desc")

gh_repos <- GET(url) %>%
  content(encoding = "UTF-8")
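As a small optional addition (my own assumption, not part of the original flow), we could check that the request succeeded before parsing the body; httr::stop_for_status() raises an error for any non-2xx response, which helps when the unauthenticated GitHub API rate limit is hit:

# Optional sanity check: fail loudly if the API call did not succeed
response <- GET(url)
stop_for_status(response)
gh_repos <- content(response, encoding = "UTF-8")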
Let’s inspect all the metadata available on each repository:
gh_repos %>%
  use_series("items") %>%
  extract2(1) %>%
  names()
## [1] "id" "node_id" "name"
## [4] "full_name" "owner" "private"
## [7] "html_url" "description" "fork"
## [10] "url" "forks_url" "keys_url"
## [13] "collaborators_url" "teams_url" "hooks_url"
## [16] "issue_events_url" "events_url" "assignees_url"
## [19] "branches_url" "tags_url" "blobs_url"
## [22] "git_tags_url" "git_refs_url" "trees_url"
## [25] "statuses_url" "languages_url" "stargazers_url"
## [28] "contributors_url" "subscribers_url" "subscription_url"
## [31] "commits_url" "git_commits_url" "comments_url"
## [34] "issue_comment_url" "contents_url" "compare_url"
## [37] "merges_url" "archive_url" "downloads_url"
## [40] "issues_url" "pulls_url" "milestones_url"
## [43] "notifications_url" "labels_url" "releases_url"
## [46] "deployments_url" "created_at" "updated_at"
## [49] "pushed_at" "git_url" "ssh_url"
## [52] "clone_url" "svn_url" "homepage"
## [55] "size" "stargazers_count" "watchers_count"
## [58] "language" "has_issues" "has_projects"
## [61] "has_downloads" "has_wiki" "has_pages"
## [64] "forks_count" "mirror_url" "archived"
## [67] "open_issues_count" "license" "forks"
## [70] "open_issues" "watchers" "default_branch"
## [73] "score"
What we are looking for in our example are the GitHub URL (html_url) and the number of stars (stargazers_count) for each repository.
Let’s look at the top 5 most starred GitHub repositories in this category.
# Pull the URL and star count of the repository at a given position
# in the search results
extract_info <- function(index){
  repository <- gh_repos %>%
    use_series("items") %>%
    extract2(index) %>%
    extract2("html_url")

  stars_count <- gh_repos %>%
    use_series("items") %>%
    extract2(index) %>%
    extract2("stargazers_count")

  return(data.frame(repository = repository,
                    stars_count = stars_count))
}
1:5 %>%
  map_dfr(extract_info) %>%
  pretty_print()
| repository | stars_count |
|---|---|
| https://github.com/carpedm20/DCGAN-tensorflow | 4271 |
| https://github.com/junyanz/iGAN | 2785 |
| https://github.com/Newmu/dcgan_code | 2762 |
| https://github.com/eriklindernoren/Keras-GAN | 1895 |
| https://github.com/nightrome/really-awesome-gan | 1715 |
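Staying within R, we could then grab a local copy of any repository that catches our eye. Here is a minimal sketch using the git2r package; the destination path is just a placeholder:

# Clone the top-ranked repository into a local folder (illustrative path)
library(git2r)

top_repo_url <- gh_repos %>%
  use_series("items") %>%
  extract2(1) %>%
  extract2("clone_url")

clone(url = top_repo_url, local_path = "~/sandbox/top-gan-repo")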
Journals & RSS feeds
We can explore papers being published in scientific journals by extracting their RSS feeds. I will use tidyRSS, an R package for extracting tidy dataframes from RSS, Atom and JSON feeds. For feeds that do not play nice with tidyRSS, I will use the feedeR package.
Here are two examples:
We can scan new research papers published in Science magazine to see whether anything has appeared on a specific topic of interest.
library(tidyRSS)
science_magazine <- tidyfeed("http://science.sciencemag.org/rss/twis.xml")
names(science_magazine)
## [1] "feed_title" "feed_link" "feed_description"
## [4] "item_title" "item_link" "item_description"
item_title contains a brief description of each paper. We can search this column for specific keywords.
science_magazine %>%
  filter(str_detect(item_title, "some_regex"))
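For instance, a quick illustrative filter with an arbitrary keyword (the pattern is my own example, not from the feed):

# Keep only items whose title mentions neural networks (case-insensitive)
science_magazine %>%
  filter(str_detect(item_title, regex("neural network", ignore_case = TRUE)))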
We can search publications in the Journal of Machine Learning Research using:
library(feedeR)
jmlr <- feed.extract("http://jmlr.org/jmlr.xml") %>%
  use_series("items")
names(jmlr)
## [1] "title" "date" "link" "hash"
Again, we can search title for specific keywords.
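As before, a brief illustrative filter (the keyword is arbitrary):

# Titles mentioning "Bayesian", as an example keyword
jmlr %>%
  filter(str_detect(title, "Bayesian"))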
Wrapping up
Using logic similar to what we have stepped through above, we could set up an automated process that regularly queries these portals for relevant information and then publishes the results to a webpage, or sends an email notification whenever something of potential interest is published.
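As a rough sketch of what that automation might look like (the query, the file path, and the scheduling mechanism are all assumptions for illustration; the script could be run daily with cron or the cronR package):

# fetch_new_papers.R - a minimal sketch, intended to be run on a schedule
library(aRxiv)
library(dplyr)
library(readr)

# Arbitrary example query: the latest machine learning submissions
new_papers <- arxiv_search("cat:stat.ML", limit = 50,
                           sort_by = "submitted", ascending = FALSE) %>%
  select(submitted, id, title)

# Append the results to a running log for later review or emailing
write_csv(new_papers, "~/arxiv_watchlist.csv", append = TRUE)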
In a future post, I will explore social media and other resources.
You can achieve the same end through a combination of mobile apps, email subscriptions, and keeping an eye on machine learning tags on social media, but where is the fun in that?↩