Our simple task will be to extract the list of links on the CL Cookbook’s index page and check if they are reachable.
We’ll use the following libraries:
- Dexador - an HTTP client (that aims at replacing the venerable Drakma),
- Plump - a markup parser, that works on malformed HTML,
- Lquery - a DOM manipulation library, to extract content from our Plump result,
- lparallel - a library for parallel programming (read more in the process section).
Before starting, let’s install those libraries with Quicklisp:
(ql:quickload '("dexador" "plump" "lquery" "lparallel"))
HTTP Requests
Easy things first. With Dexador loaded, we use its get function:
(defvar *url* "https://lispcookbook.github.io/cl-cookbook/")
(defvar *request* (dex:get *url*))
This returns multiple values: the whole page content, the return code (200), the response headers, the uri and the stream. Note that defvar only stores the first one, the page content, in *request*.
"<!DOCTYPE html>
<html lang=\"en\">
<head>
<title>Home – the Common Lisp Cookbook</title>
[…]
"
200
#<HASH-TABLE :TEST EQUAL :COUNT 19 {1008BF3043}>
#<QURI.URI.HTTP:URI-HTTPS https://lispcookbook.github.io/cl-cookbook/>
#<CL+SSL::SSL-STREAM for #<FD-STREAM for "socket 192.168.0.23:34897, peer: 151.101.120.133:443" {100781C133}>>
Remember, in Slime we can inspect the objects with a right-click on them.
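Since a plain defvar keeps only the first of those values, here is a minimal sketch of capturing several of them at once with multiple-value-bind. The function name page-status is hypothetical, and the lowercase "content-type" key is an assumption about how Dexador stores header names.

```lisp
;; Assumes Quicklisp is available and Dexador can be loaded.
(ql:quickload "dexador" :silent t)

(defun page-status (url)
  "Request URL and return its status code and content type as two values."
  (multiple-value-bind (body status headers)
      (dex:get url)
    (declare (ignore body))
    ;; HEADERS is a hash table; the lowercase key is an assumption.
    (values status (gethash "content-type" headers))))

;; Usage (performs a real HTTP request):
;; (page-status "https://lispcookbook.github.io/cl-cookbook/")
```

Binding the values by name this way avoids counting positions with nth-value when more than one of them is needed.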
Parsing and extracting content with CSS selectors
We’ll use lquery to parse the html and extract the content.
We first need to parse the html into an internal data structure. Use (lquery:$ (initialize <html>)):
(defvar *parsed-content* (lquery:$ (initialize *request*)))
;; => #<PLUMP-DOM:ROOT {1009EE5FE3}>
lquery uses Plump internally.
Now we’ll extract the links with CSS selectors.
Note: to find out what should be the CSS selector of the element I’m interested in, I right click on an element in the browser and I choose “Inspect element”. This opens up the inspector of my browser’s web dev tool and I can study the page structure.
So the links I want to extract are inside an element with an id of value “content”, and they are in regular list elements (li).
Let’s try something:
(lquery:$ *parsed-content* "#content li")
;; => #(#<PLUMP-DOM:ELEMENT li {100B3263A3}> #<PLUMP-DOM:ELEMENT li {100B3263E3}>
;; #<PLUMP-DOM:ELEMENT li {100B326423}> #<PLUMP-DOM:ELEMENT li {100B326463}>
;; #<PLUMP-DOM:ELEMENT li {100B3264A3}> #<PLUMP-DOM:ELEMENT li {100B3264E3}>
;; #<PLUMP-DOM:ELEMENT li {100B326523}> #<PLUMP-DOM:ELEMENT li {100B326563}>
;; #<PLUMP-DOM:ELEMENT li {100B3265A3}> #<PLUMP-DOM:ELEMENT li {100B3265E3}>
;; #<PLUMP-DOM:ELEMENT li {100B326623}> #<PLUMP-DOM:ELEMENT li {100B326663}>
;; […]
Wow, it works! We get a vector of Plump elements.
I’d like to easily check what those elements are. To see the entire html, we can end our lquery line with (serialize):
(lquery:$ *parsed-content* "#content li" (serialize))
#("<li><a href=\"license.html\">License</a></li>"
"<li><a href=\"getting-started.html\">Getting started</a></li>"
"<li><a href=\"editor-support.html\">Editor support</a></li>"
[…]
And to see their textual content (the user-visible text inside the html), we can use (text) instead:
(lquery:$ *parsed-content* "#content li" (text))
#("License" "Editor support" "Strings" "Dates and Times" "Hash Tables"
"Pattern Matching / Regular Expressions" "Functions" "Loop" "Input/Output"
"Files and Directories" "Packages" "Macros and Backquote"
"CLOS (the Common Lisp Object System)" "Sockets" "Interfacing with your OS"
"Foreign Function Interfaces" "Threads" "Defining Systems"
[…]
"Pascal Costanza’s Highly Opinionated Guide to Lisp"
"Loving Lisp - the Savy Programmer’s Secret Weapon by Mark Watson"
"FranzInc, a company selling Common Lisp and Graph Database solutions.")
All right, so we see we are manipulating what we want. Now to get their href, a quick look at lquery’s doc and we’ll use (attr "some-name"):
(lquery:$ *parsed-content* "#content li a" (attr :href))
;; => #("license.html" "editor-support.html" "strings.html" "dates_and_times.html"
;; "hashes.html" "pattern_matching.html" "functions.html" "loop.html" "io.html"
;; "files.html" "packages.html" "macros.html"
;; "/cl-cookbook/clos-tutorial/index.html" "os.html" "ffi.html"
;; "process.html" "systems.html" "win32.html" "testing.html" "misc.html"
;; […]
;; "http://www.nicklevine.org/declarative/lectures/"
;; "http://www.p-cos.net/lisp/guide.html" "https://leanpub.com/lovinglisp/"
;; "https://franz.com/")
Note: using (serialize) after attr leads to an error, since attr returns strings rather than nodes.
Nice, we now have the list (well, a vector) of links of the page. We’ll now write an async program to check that they are reachable.
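To experiment with the selectors without fetching anything, the same calls can be tried on a small inline document. This is only a sketch: the HTML string and the names *sample-html*, *sample-doc*, sample-links and sample-texts are made up for illustration.

```lisp
;; Assumes Quicklisp is available and lquery can be loaded.
(ql:quickload "lquery" :silent t)

;; A made-up document that mimics the structure of the index page.
(defparameter *sample-html*
  "<div id=\"content\"><ul>
     <li><a href=\"license.html\">License</a></li>
     <li><a href=\"https://franz.com/\">FranzInc</a></li>
   </ul></div>")

(defparameter *sample-doc* (lquery:$ (initialize *sample-html*)))

(defun sample-links ()
  "Return the href of every link found in the list items."
  (lquery:$ *sample-doc* "#content li a" (attr :href)))

(defun sample-texts ()
  "Return the visible text of every link found in the list items."
  (lquery:$ *sample-doc* "#content li a" (text)))

;; (sample-links) ; => #("license.html" "https://franz.com/")
;; (sample-texts) ; => #("License" "FranzInc")
```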
Async requests
In this example we’ll take the list of urls from above and check if they are reachable. We want to do this asynchronously, but to see the benefits we’ll first do it synchronously!
We need a bit of filtering first to exclude the email addresses (maybe that was doable in the CSS selector?).
We put the vector of urls in a variable:
(defvar *urls* (lquery:$ *parsed-content* "#content li a" (attr :href)))
We remove the elements that start with “mailto:” (a quick look at the strings page will help):
(remove-if (lambda (it)
             (string= it "mailto:" :start1 0
                                   :end1 (length "mailto:")))
           *urls*)
;; => #("license.html" "editor-support.html" "strings.html" "dates_and_times.html"
;; […]
;; "process.html" "systems.html" "win32.html" "testing.html" "misc.html"
;; "license.html" "http://lisp-lang.org/"
;; "https://github.com/CodyReichert/awesome-cl"
;; "http://www.lispworks.com/documentation/HyperSpec/Front/index.htm"
;; […]
;; "https://franz.com/")
Actually, before writing the remove-if (which works on any sequence, including vectors), I tested with a (map 'vector …) to see that the results were indeed nil or t.
As a side note, there is a handy starts-with-p function in the “str” library available in Quicklisp. So we could do:
(map 'vector (lambda (it)
               (str:starts-with-p "mailto:" it))
     *urls*)
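One caveat with the string= form: passing an :end1 larger than the length of the string may signal an error, which would happen for a link shorter than the prefix (for example "a.html" against "mailto:"). If pulling in another library is not wanted, a prefix test can be sketched in plain Common Lisp with mismatch. The names prefixp and web-urls are hypothetical helpers, not part of any library.

```lisp
(defun prefixp (prefix string)
  "Return true if STRING starts with PREFIX, whatever their lengths."
  (let ((pos (mismatch prefix string)))
    ;; MISMATCH returns NIL when the sequences are equal, otherwise the
    ;; index of the first difference; PREFIX matched if that is its end.
    (or (null pos) (= pos (length prefix)))))

(defun web-urls (urls)
  "Keep only the elements of URLS that start with \"http\"."
  (remove-if-not (lambda (url) (prefixp "http" url)) urls))

;; (web-urls #("a.html" "http://lisp-lang.org/" "mailto:x@y.z"))
;; => #("http://lisp-lang.org/")
```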
While we’re at it, we’ll only consider links starting with “http”, in order not to write too much stuff irrelevant to web scraping:
(remove-if-not (lambda (it)
                 (string= it "http" :start1 0 :end1 (length "http")))
               *)
All right, we put this result in another variable:
(defvar *filtered-urls* *)
And now to the real work. For every url, we want to request it and check that its return code is 200. We have to ignore certain errors: a request can time out, be redirected (we don’t want that) or return an error code.
To be in real conditions, we’ll add a link that times out to our list:
(setf (aref *filtered-urls* 0) "http://lisp.org") ;; :/
We’ll take the simple approach: ignore errors and return nil in that case. If all goes well, we return the return code, which should be 200.
As we saw at the beginning, dex:get returns many values, including the return code. We’ll access only this one with nth-value (instead of all of them with multiple-value-bind) and we’ll use ignore-errors, which returns nil in case of an error. We could also use handler-case and handle specific error types (see examples in dexador’s documentation).
(ignore-errors has the caveat that when there’s an error, we cannot return the element it comes from. We’ll get to our ends though.)
(map 'vector (lambda (it)
               (ignore-errors
                 (nth-value 1 (dex:get it))))
     *filtered-urls*)
We get:
#(NIL 200 200 200 200 200 200 200 200 200 200 NIL 200 200 200 200 200 200 200
200 200 200 200)
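The handler-case alternative mentioned earlier also lets us keep the url next to its result, which ignore-errors alone does not. Below is a minimal sketch of that idea. The names check-url and fake-get are hypothetical, and fake-get is only a stand-in for dex:get so that the example runs offline; it catches the generic error condition rather than any Dexador-specific one.

```lisp
(defun check-url (url fetch)
  "Call FETCH on URL and return (URL . STATUS), or (URL . NIL) on error.
FETCH is any function returning the body and the status code as its
first two values, as dex:get does."
  (handler-case
      (cons url (nth-value 1 (funcall fetch url)))
    (error ()
      (cons url nil))))

;; A hypothetical stand-in for dex:get, so the sketch runs offline.
(defun fake-get (url)
  (if (string= url "http://down.example/")
      (error "Connection timed out: ~a" url)
      (values "<html></html>" 200)))

(map 'vector (lambda (url) (check-url url #'fake-get))
     #("http://up.example/" "http://down.example/"))
;; => #(("http://up.example/" . 200) ("http://down.example/"))
```

With real requests, one would pass #'dex:get instead of #'fake-get.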
The synchronous version works, but it took a very long time. How much time precisely? We measure it with (time …):
Evaluation took:
21.554 seconds of real time
0.188000 seconds of total run time (0.172000 user, 0.016000 system)
0.87% CPU
55,912,081,589 processor cycles
9,279,664 bytes consed
21 seconds! Obviously this synchronous method isn’t efficient. We wait 10 seconds for links that time out. It’s time to write and measure an async version.
After installing lparallel and looking at its documentation, we see that the parallel map pmap seems to be what we want. And it’s only a one word edit. Note that lparallel needs lparallel:*kernel* to be set (with lparallel:make-kernel) before pmap will run. Let’s try:
(time (lparallel:pmap 'vector
                      (lambda (it)
                        (ignore-errors
                          (let ((status (nth-value 1 (dex:get it)))) status)))
                      *filtered-urls*))
;; Evaluation took:
;; 11.584 seconds of real time
;; 0.156000 seconds of total run time (0.136000 user, 0.020000 system)
;; 1.35% CPU
;; 30,050,475,879 processor cycles
;; 7,241,616 bytes consed
;;
;;#(NIL 200 200 200 200 200 200 200 200 200 200 NIL 200 200 200 200 200 200 200
;; 200 200 200 200)
Bingo. It still takes more than 10 seconds because we wait 10 seconds for one request that times out. But otherwise it processes all the http requests in parallel and so it is much faster.
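The effect of pmap can also be observed without any network access. The sketch below creates a kernel and maps a deliberately slow function over a vector. The worker count of 4 and the names slow-square and parallel-squares are arbitrary choices made for this illustration.

```lisp
;; Assumes Quicklisp is available and lparallel can be loaded.
(ql:quickload "lparallel" :silent t)

;; pmap needs a kernel; 4 workers is an arbitrary choice for this sketch.
(setf lparallel:*kernel* (lparallel:make-kernel 4))

(defun slow-square (x)
  "Simulate a slow request by sleeping before returning X squared."
  (sleep 0.2)
  (* x x))

(defun parallel-squares (numbers)
  "Square every element of NUMBERS, spreading the work over the kernel."
  (lparallel:pmap 'vector #'slow-square numbers))

;; (time (parallel-squares #(1 2 3 4))) ; => #(1 4 9 16)
```

Replacing lparallel:pmap with map in parallel-squares and timing both should show the sequential version taking roughly the sum of the sleeps.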
Shall we get the urls that aren’t reachable, remove them from our list and measure the execution time in the sync and async cases?
What we do is: instead of returning only the return code, we check it is valid and we return the url:
... (if (and status (= 200 status)) it) ...
(defvar *valid-urls* *)
We get a vector of urls with a couple of nils: indeed, I thought I would have only one unreachable url but I discovered another one. Hopefully I have pushed a fix before you try this tutorial.
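For reference, the elided check could be spelled out along these lines. This is a sketch, not necessarily the exact form used to produce the results: url-if-ok is a hypothetical name, and the fetch function is passed as an argument so that the logic can be tried without the network.

```lisp
(defun url-if-ok (url fetch)
  "Return URL if FETCH reports status 200 for it, NIL otherwise.
FETCH must return the status code as its second value, as dex:get does."
  (let ((status (ignore-errors (nth-value 1 (funcall fetch url)))))
    (if (and status (= 200 status)) url)))

;; With real requests (assuming an lparallel kernel has been created):
;; (lparallel:pmap 'vector
;;                 (lambda (it) (url-if-ok it #'dex:get))
;;                 *filtered-urls*)
```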
But which urls are they? We saw the status codes but not the urls :S We have a vector with all the urls and another with the valid ones. We’ll simply treat them as sets and compute their difference. This will show us the bad ones. We must transform our vectors to lists for that.
(set-difference (coerce *filtered-urls* 'list)
                (coerce *valid-urls* 'list))
;; => ("http://lisp-lang.org/" "http://www.psg.com/~dlamkins/sl/cover.html")
Gotcha!
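A word of caution: set-difference compares elements with eql by default, so the call above works because both vectors hold the very same string objects. A slightly more defensive sketch drops the nils and compares with string=. The function name unreachable-urls and the example.com-style urls are made up for illustration.

```lisp
(defun unreachable-urls (all-urls valid-urls)
  "Return the list of ALL-URLS that do not appear in VALID-URLS.
VALID-URLS may contain NILs; both arguments can be any sequence."
  (set-difference (coerce all-urls 'list)
                  (remove nil (coerce valid-urls 'list))
                  :test #'string=))

(unreachable-urls
 #("http://a.example/" "http://b.example/" "http://c.example/")
 (vector "http://a.example/" nil "http://c.example/"))
;; => ("http://b.example/")
```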
BTW, it takes me 8.280 seconds of real time to check the list of valid urls synchronously, and 2.857 seconds async.
Have fun doing web scraping in CL!
More helpful libraries:
- we could use VCR, a store and replay utility, to set up repeatable tests or to speed up our experiments in the REPL a bit.
- cl-async, carrier and other network, parallelism and concurrency libraries can be found on the awesome-cl list, Cliki or Quickdocs.