Effortlessly Extract Data from Websites with Crawly YML
- Oleg Tarasenko
- 14th Jul 2023
- 10 min of reading time
In a previous article, we resolved issues with our Erlang Solutions blog spider using Crawly v0.15.0 and the Management interface. Since then, we have taken another exciting step towards simplifying scraping. Some spiders are more straightforward than others, and in those cases we may not need to write any code at all! Sounds interesting? Let’s see how it works!
So, in an ideal world, it should work like this: instead of writing spider code, we describe the spider in a simple YML file, and Crawly does the rest.
The detailed documentation and the example can be found on HexDocs here: https://hexdocs.pm/crawly/spiders_in_yml.html#content
This article will follow all the steps from scratch, so it’s self-contained. But if you have any questions, please don’t hesitate to refer to the original docs or to ping us on the Discussions board.
1. First of all, we will pull Crawly from DockerHub:
docker pull oltarasenko/crawly:0.15.0
2. We will re-use the configuration from our previous article, since we want to extract the same data. So let’s create a file called `crawly.config` with the same content as before:
[{crawly, [
{closespider_itemcount, 100},
{closespider_timeout, 5},
{concurrent_requests_per_domain, 15},
{middlewares, [
'Elixir.Crawly.Middlewares.DomainFilter',
'Elixir.Crawly.Middlewares.UniqueRequest',
'Elixir.Crawly.Middlewares.RobotsTxt',
{'Elixir.Crawly.Middlewares.UserAgent', [
{user_agents, [
<<"Mozilla/5.0 (Macintosh; Intel Mac OS X x.y; rv:42.0) Gecko/20100101 Firefox/42.0">>,
<<"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36">>
]
}]
}
]
},
{pipelines, [
{'Elixir.Crawly.Pipelines.Validate', [{fields, [<<"title">>, <<"author">>, <<"publishing_date">>, <<"url">>, <<"article_body">>]}]},
{'Elixir.Crawly.Pipelines.DuplicatesFilter', [{item_id, <<"title">>}]},
{'Elixir.Crawly.Pipelines.JSONEncoder'},
{'Elixir.Crawly.Pipelines.WriteToFile', [{folder, <<"/tmp">>}, {extension, <<"jl">>}]}
]
}]
}].
3. Start the container with the following command:
docker run --name yml_spiders_example \
-it -p 4001:4001 \
-v $(pwd)/crawly.config:/app/config/crawly.config \
oltarasenko/crawly:0.15.0
Once started, you should see debug messages like these in your console. That is a good sign: it worked!
09:13:05.047 [info] Opening/checking dynamic spiders storage
09:13:05.049 [debug] Using the following folder to load extra spiders: ./spiders
09:13:05.054 [debug] Starting data storage
Now you can open `localhost:4001` in your browser, and your journey starts here!
[Image: Crawly Management Tool / Management interface]
4. Building a spider
Once you click Create New Spider, you will see the following basic page where you can input your spider code:
[Image: Crawly Management Tool / Creating a new spider]
One might say that these interfaces look quite simple and basic for something built in 2023. That’s fair. We’re backend developers, and we do what we can; this approach achieves the needed results with minimal frontend effort. If you have a passion for improving it, or want to contribute in any other way, you are more than welcome to do so!
The interface above expects you to write valid YML, so you need to know what format is expected. Let’s start with a basic example, add some explanations, and improve it later.
I suggest starting by inputting the following YML there:
name: ErlangSolutionsBlog
base_url: "https://www.erlang-solutions.com"
start_urls:
  - "https://www.erlang-solutions.com/blog/web-scraping-with-elixir/"
fields:
  - name: title
    selector: "title"
links_to_follow:
  - selector: "a"
    attribute: "href"
Now, if you click the Preview button, you will see what the spider is going to extract from your start URLs:
[Image: Crawly Management Tool / Previewing spider results before running anything]
So, what you can see here: the spider will extract only one field, `title`, which equals “Erlang Solutions”. Besides that, your spider is going to follow these links after the start page:
"https://www.erlang-solutions.com/",
"https://www.erlang-solutions.com#",
"https://www.erlang-solutions.com/services/consulting/",
"https://www.erlang-solutions.com/services/development/",
....
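To build some intuition for what `links_to_follow` combined with the `DomainFilter` middleware does, here is a small, hypothetical pure-Python sketch. Crawly itself is written in Elixir and uses a real HTML parser rather than regular expressions; the HTML snippet and function name below are illustrative assumptions, not Crawly internals.

```python
import re
from urllib.parse import urljoin, urlparse

# Hypothetical HTML snippet standing in for a fetched blog page
html = '''
<a href="/services/consulting/">Consulting</a>
<a href="https://www.erlang-solutions.com/services/development/">Development</a>
<a href="https://example.com/elsewhere">External</a>
'''

def follow_links(doc: str, base_url: str) -> list[str]:
    """Collect href attributes, resolve them against the base URL,
    and keep only links on the spider's own domain."""
    domain = urlparse(base_url).netloc
    links = []
    for href in re.findall(r'href="([^"]+)"', doc):
        absolute = urljoin(base_url, href)
        # A DomainFilter-like check: drop off-domain links
        if urlparse(absolute).netloc == domain:
            links.append(absolute)
    return links

print(follow_links(html, "https://www.erlang-solutions.com"))
# ['https://www.erlang-solutions.com/services/consulting/',
#  'https://www.erlang-solutions.com/services/development/']
```

The external link is filtered out, while the relative link is resolved into an absolute URL, which mirrors the list of followed links shown above.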
As in the original article, we plan to extract the following fields:
title
author
publishing_date
url
article_body
The selectors are copied from the previous article and can be found using Google Chrome’s inspect-and-copy approach:
fields:
  - name: title
    selector: ".page-title-sm"
  - name: article_body
    selector: ".default-content"
  - name: author
    selector: ".post-info__author"
  - name: publishing_date
    selector: ".header-inner .post-info .post-info__item span"
You may have noticed that the `url` field is not listed here. That’s because Crawly automatically adds the URL to every item.
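For reference, here is the complete YML spider with all the pieces from above combined: the name, base URL, start URL, final selectors, and the link-following rule.

```yaml
name: ErlangSolutionsBlog
base_url: "https://www.erlang-solutions.com"
start_urls:
  - "https://www.erlang-solutions.com/blog/web-scraping-with-elixir/"
fields:
  - name: title
    selector: ".page-title-sm"
  - name: article_body
    selector: ".default-content"
  - name: author
    selector: ".post-info__author"
  - name: publishing_date
    selector: ".header-inner .post-info .post-info__item span"
links_to_follow:
  - selector: "a"
    attribute: "href"
```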
Now, if you hit Preview again, you should see the full scraped item:
[Image: Crawly Management Tool / The item now contains full results]
Now, if you click Save, you will be able to run the full spider and see the actual results.
[Image: Crawly Management Tool / The spider is running and already extracting data]
We hope you like our work! We hope it helps reduce the need to write spider code, or perhaps engages non-Elixir people who can now play with scraping without coding!
Please don’t be too critical of the product; everything in this world has bugs, and we plan to keep improving it over time (as we have for the last 4+ years). If you have ideas or improvement suggestions, please drop us a message!