The R and JavaScript code to reproduce the results in this post is available from https://github.com/nanxstats/r-readability-parser.

Photo by Nick Hillier.

Readability.js

Maybe you have used tools like rvest to harvest text data from web pages. Naturally, this often requires considerable manual effort up front to understand the structure of the target website. The picture looks quite different when we think at web scale. To parse the content of many more sites and many more types of pages, we need to make our tool adaptive enough to extract the most relevant text instead of purely relying …