Books and APIs - Adamistics

It all started with my sister. She’s a science writer. And, as you might imagine, she and many of her friends are avid readers. So her recent request for book recommendations on Facebook quickly grew to an unmanageably long list.

Books, people. I need book recommendations. Please reply with one book that’s stuck with you long after you closed it for the last time.

Wouldn’t it be handy if someone pulled all these recommendations together? Who would be crazy enough to do that? That’s where I come in. I copied the titles and authors into a spreadsheet, and shared it on GitHub.

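library(knitr)
# Read the shared list from GitHub; keep titles as character strings, not factors
dat <- read.csv(url("https://github.com/JVAdams/vishy/raw/gh-pages/JillsFBBookList.csv"),
  stringsAsFactors = FALSE)
kable(head(dat[, c("Title", "Author")]))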

Title	Author
A Little Life	Hanya Yanagihara
A House in the Sky	Amanda Lindhout
The Master and Margarita	Mikhail Bulgakov
The Shadow of the Wind	Carlos Ruiz Zafón
The Blind Assassin	Margaret Atwood
A Man Called Ove	Frederic Backman

But the list wasn’t very informative on its own. What year were the books published? Were they fiction or nonfiction?

I went looking for a database of books that was accessible through an API (application programming interface) and had an R package associated with it. I found rgoodreads, an R package for the Goodreads API.

I had to sign up to get an access key. Then, it was easy to get information on each book by supplying its title to the function book_by_title().

Sys.setenv(GOODREADS_KEY = "YOUR_KEY_HERE")

library(tidyverse)
library(rgoodreads)

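# Look up each title on Goodreads, one API call per book
books <- lapply(dat$Title, book_by_title)
books2 <- bind_rows(books) %>%
  rename(
    year = publication_year,
    rating = average_rating
  ) %>%
  mutate(
    # Keep the first listed author (the text before any ":")
    # and append " +" when a book has more than one author
    author = paste0(sapply(strsplit(sapply(authors, "[", 1), ":"), "[", 1),
      if_else(sapply(authors, length) > 1, " +", ""))
  ) %>%
  select(title, author, year, rating, isbn)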
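One caveat with the lapply() call above: if a single lookup throws an error, the whole loop stops. Here is a minimal sketch of a more forgiving version, assuming the lookup function signals an error for a title it can't find (I haven't checked that this is how book_by_title() behaves). Both safe_lookup() and fake_lookup() are names I made up for illustration; fake_lookup() stands in for the real API call so the example runs offline.

```r
# Hypothetical wrapper: return NULL instead of stopping when a lookup fails
safe_lookup <- function(title, lookup) {
  tryCatch(
    lookup(title),
    error = function(e) {
      message("Lookup failed for '", title, "': ", conditionMessage(e))
      NULL
    }
  )
}

# Stand-in for the real API call (an assumption, not part of rgoodreads)
fake_lookup <- function(title) {
  if (title == "No Such Book") stop("not found")
  data.frame(title = title, stringsAsFactors = FALSE)
}

titles <- c("A Little Life", "No Such Book", "The Blind Assassin")
results <- lapply(titles, safe_lookup, lookup = fake_lookup)

# Drop the failed lookups and stack the rest into one data frame
found <- Filter(Negate(is.null), results)
found_df <- do.call(rbind, found)
found_df
```

With the real package, the same wrapper could be called as lapply(dat$Title, safe_lookup, lookup = book_by_title).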

The Goodreads API provided information on the title, authors, date of publication, reviewer rating, and the ISBN (International Standard Book Number). But, much to my surprise, there was no indication of book genre, so I still couldn’t tell if a book was fiction or nonfiction.

kable(head(books2))

title	author	year	rating	isbn
A Little Life	Hanya Yanagihara	2015	4.27	0385539258
A House in the Sky	Amanda Lindhout +	2013	4.20	1451651694
The Master and Margarita	Mikhail Bulgakov +	1996	4.32	0679760806
The Shadow of the Wind (The Cemetery of Forgotten Books, #1)	Carlos Ruiz Zafón +		4.24	0143034901
The Blind Assassin	Margaret Atwood	2001	3.94	1860498809
A Man Called Ove	Fredrik Backman +	2014	4.34	1476738017

Back to the drawing board. I found another API from WorldCat, the world’s largest network of library content and services. WorldCat has several APIs; the one that seemed to meet my needs was Classify. This API takes an ISBN and returns a data summary that includes the book’s number in the Dewey Decimal System (DDS). I could use the DDS numbers to assign the books to one of the 10 main classes, which I found on Wikipedia.

# Dewey Decimal System classes from 0 to 900
ddc09 <- c("General works, Computer science and Information", 
  "Philosophy and psychology", "Religion", "Social sciences", "Language", 
  "Pure Science", "Technology", "Arts & recreation", "Literature", 
  "History & geography")

WorldCat did not have DDS numbers for all the book titles in my list, but for those that it did have, it was very helpful.
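To make the mapping concrete, here is a small sketch of how a DDS number turns into a class: the hundreds digit picks the class, and a missing number becomes "Unknown". The helper name ddc_class() and the example numbers are mine, chosen only for illustration; the class names are repeated so the snippet runs on its own.

```r
# The 10 main classes, in order from the 000s to the 900s
ddc_classes <- c("General works, Computer science and Information",
  "Philosophy and psychology", "Religion", "Social sciences", "Language",
  "Pure Science", "Technology", "Arts & recreation", "Literature",
  "History & geography")

# Hypothetical helper: map DDS numbers to class names
ddc_class <- function(ddc) {
  # The hundreds digit (0 to 9) plus 1 gives the position in ddc_classes
  out <- ddc_classes[floor(ddc / 100) + 1]
  # A missing DDS number has no class
  out[is.na(out)] <- "Unknown"
  out
}

ddc_class(c(813.54, 5.1, 940.53, NA))
```

For example, 813.54 has a hundreds digit of 8, so it lands in the ninth class, "Literature".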

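library(httr)
library(XML)
# For each ISBN, ask Classify for a summary and pull out the DDS number
ddc <- sapply(books2$isbn, function(isbn) {
  query <- paste0("http://classify.oclc.org/classify2/Classify?isbn=",
    isbn, "&summary=true")
  qraw <- GET(query)
  # Convert the raw XML response into a nested list
  qlist <- xmlToList(rawToChar(qraw$content))
  if ("recommendations" %in% names(qlist)) {
    # "sfa" holds the most recently assigned DDS number
    out <- as.numeric(qlist$recommendations$ddc$mostRecent["sfa"])
  } else {
    # No classification available for this ISBN
    out <- NA
  }
  out
})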
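The query itself is plain string concatenation, so it can be checked without touching the network. Below is a minimal sketch of a helper that builds the same URL and returns NA for a missing or empty ISBN rather than sending a useless request. The name classify_url() is hypothetical; the URL pattern is the one used above.

```r
# Hypothetical helper: build the Classify query URL for one or more ISBNs
classify_url <- function(isbn) {
  # ISBNs should stay character strings so leading zeros are preserved
  missing <- is.na(isbn) | !nzchar(isbn)
  ifelse(missing,
    NA_character_,
    paste0("http://classify.oclc.org/classify2/Classify?isbn=",
      isbn, "&summary=true"))
}

classify_url(c("0385539258", NA, ""))
```

Inside the loop, an is.na() check on the URL would be one way to skip the request and return NA directly.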

books3 <- books2 %>%
  mutate(
    ddc = unname(ddc),
    class = if_else(is.na(ddc), "Unknown", ddc09[floor(ddc/100) + 1])
  ) %>%
  arrange(class, desc(rating)) %>%
  select(title, author, year, rating, class)

library(DT)
datatable(books3)