I published Beyond Spreadsheets with R with Manning Publications partly because they approached me, but largely because they seem to do a good job of publishing technical books and making them DRM-free - if you buy a physical copy you get access to the ebooks to download (pdf, epub, …) and use as you see fit. I’ve received free (well-known) ebooks from other large publishers that I just don’t read, because that involves installing and opening a proprietary piece of software just to read the book, and I can’t integrate it with any of my notetaking tools.

After going through the process, I can say that Manning also do a really good job of getting eyes on drafts and providing technical proofreading. I had a panel of reviewers who gave me great feedback on draft chapters, and the book was better for it. I’ve been learning a lot of languages and deep-diving into more structured programming lately, so I find myself buying a lot of their technical books - it’s possible that I’ve given back a lot of what I made on Beyond Spreadsheets with R, and I’m not necessarily disappointed by that.

I was recently offered a draft (Manning calls these Manning Early Access Program or MEAP editions) of Data Analysis with AI and R by Ulrich Matter and asked if I could share my thoughts. Word of mouth can do great things, and many eyes are very valuable for early drafts. What follows wasn’t paid for; I was given an ebook copy to review.

I’m apprehensive but optimistic when it comes to what we’re calling AI now (whatever definition we’ve had since the 1960s is now irrelevant) and I do wish we’d come up with a better term that didn’t use the word “intelligence” so early in the game. I’m apprehensive because it’s almost certainly being rushed into places it doesn’t need to be, at great cost to users, companies, and the environment, with results that overlook the human aspect of the problem being solved (the human aspect typically being the most complex). I’m optimistic because in the few places where it can be used somewhat “safely” (low impact if it’s entirely wrong) it’s capable of doing some things that a text-matching search can’t - I treat a chat LLM like an overconfident coworker and ask it to explain how something I don’t understand works. I still need to check to make sure it isn’t wrong, but a web search engine can’t answer “what is this snippet of code doing?”

The first example of Data Analysis with AI and R demonstrates having an LLM try to fix some R code - a simple function that adds two input parameters as a + b, which fails when called with a character argument for b, e.g. b = "5". The demonstrated LLM offers a new version of the function with a + as.numeric(b) in the body, along with a decent explanation of why it failed:

Your code is attempting to add a numeric value to a character string, which is causing the error. In R, you need to convert the string to a numeric type first. Here’s the corrected code:
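To make that concrete, here’s my own sketch of the before and after - the names and the exact listing are mine, not necessarily the book’s:

# The kind of function being fixed (my naming, not the book's)
add_two <- function(a, b) {
  a + b
}
add_two(5, "5")
# Error in a + b : non-numeric argument to binary operator

# The suggested fix coerces the second argument before adding
add_two_fixed <- function(a, b) {
  a + as.numeric(b)
}
add_two_fixed(5, "5")
# [1] 10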

I think this is a fantastic teachable moment. Keep in mind that we’re using R, a dynamically typed language, so there’s no type system to prevent this sort of thing up front. Someone bitten by this enough times will start programming defensively and add assertions to the top of their functions, enforcing type constraints based on how the function works internally plus any business logic about how it expects to be used. LLMs don’t (typically) have any of that context, and sometimes that context is a human component.
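As a sketch of what that defensive style can look like (my own example, not the book’s):

add_two_safe <- function(a, b) {
  # Fail fast if either argument isn't the type the body relies on
  stopifnot(is.numeric(a), is.numeric(b))
  a + b
}
add_two_safe(5, "5")
# Error in add_two_safe(5, "5") : is.numeric(b) is not TRUE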

An experienced programmer will recognise the symmetry in the arguments as presented - two values, to be used on either side of a + infix function. The LLM “fixes” the issue as presented, which is great, but if you stop there and move on you’re leaving behind a potential bug that will bite when someone calls the function with a="5" and b=5. If you’re stuck on something, an LLM can potentially get you unstuck, but it’s not a magic bug eraser. It can generate some code that appears suitable as a response, and that code may solve the immediate problem, but it’s a tool to help you improve, not an actual coworker.
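The leftover bug is easy to demonstrate with the same “fixed” sketch from above, arguments swapped:

# Only b is coerced, so a character a still breaks it
add_two_fixed <- function(a, b) a + as.numeric(b)
add_two_fixed("5", 5)
# Error in a + as.numeric(b) : non-numeric argument to binary operator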

With that said, I fed the example verbatim into the latest (free) ChatGPT and it identified the benefit of converting both arguments, but said in its explanation

In R, you need to ensure that both arguments are of the same type

which isn’t strictly true…

# R will happily add numeric values of different types
typeof(a <- 5)
# [1] "double"
typeof(b <- 5L)
# [1] "integer"
typeof(c <- 5 + 5i)
# [1] "complex"
a + b + c
# [1] 15+5i

Again, ask for advice, but treat it with suspicion.

The rest of the three available chapters covers a lot of interesting topics, many of which I wasn’t familiar with, such as setting system messages (the internal prompt) via the API, the PAL (Program-Aided Language Models) approach, and generating embeddings of some input text.
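For a flavour of the system-message idea, here’s a minimal sketch of my own using httr2 against the OpenAI chat completions endpoint - the endpoint, model name, and helper function are my assumptions, not necessarily what the book uses:

library(httr2)

# A hypothetical helper: the system message steers how the model responds
ask_llm <- function(user_prompt,
                    system_prompt = "You are a terse R programming assistant.",
                    model = "gpt-4o-mini") {
  resp <- request("https://api.openai.com/v1/chat/completions") |>
    req_auth_bearer_token(Sys.getenv("OPENAI_API_KEY")) |>
    req_body_json(list(
      model = model,
      messages = list(
        list(role = "system", content = system_prompt), # the "internal" prompt
        list(role = "user",   content = user_prompt)
      )
    )) |>
    req_perform()
  resp_body_json(resp)$choices[[1]]$message$content
}

ask_llm("What is this snippet of code doing? a + as.numeric(b)")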

I had a handful of suggestions based on what I’d read which, for a regular, already-published book, might get sent to the author to be listed on an errata page somewhere. For a MEAP book, Manning has a discussion forum where readers can ask questions, submit suggestions/corrections, and so on, so that the book can be improved before it’s published. I was very happy to have this available for Beyond Spreadsheets with R (forum), so I submitted my notes/questions there and hoped the author might have had a chance to respond before I wrote this, but not yet, it seems.

Overall, I’m pleased with what I’ve read so far - there’s a lot in there about how to interact with the models via the API in an R way, including having them help write tests, documentation, and entire functions. I’m looking forward to the rest of the book.

I think it’s brave to try to write a book about a field that’s changing quite so fast, but most of the principles should still apply for at least a while, even if the screenshots of website menus are quickly outdated.

So far, this one has my recommendation. I don’t think I’ve seen any R-related resources that have anywhere near this level of detail. If you’re interested in a copy, I believe Manning has a sale at the moment and all products sitewide are 50% off (including Beyond Spreadsheets with R, FYI).