Effortlessly Extract Data from Websites with Crawly YML
- Oleg Tarasenko
- 14th Jul 2023
- 10 min of reading time
In a previous article, we resolved issues with our Erlang Solutions blog spider using Crawly v0.15.0 and the Management interface. Since then, we have taken another exciting step towards simplifying scraping. Some spiders are more straightforward than others, and in those cases we may not need to write any code at all! Sounds interesting? Let’s see how it works!
So, in an ideal world, it should work like this: instead of writing spider code, we describe the spider in a simple YML file, and Crawly does the rest.
The detailed documentation and the example can be found on HexDocs here: https://hexdocs.pm/crawly/spiders_in_yml.html#content
This article will follow all the steps from scratch, so it’s self-contained. But if you have any questions, please don’t hesitate to refer to the original docs or to ping us on the Discussions board.
1. First of all, we will pull Crawly from DockerHub:
docker pull oltarasenko/crawly:0.15.0
2. We will re-use the configuration from our previous article, since we want to extract the same data. So let’s create a file called `crawly.config` with the same content as before:
[{crawly, [
{closespider_itemcount, 100},
{closespider_timeout, 5},
{concurrent_requests_per_domain, 15},
{middlewares, [
'Elixir.Crawly.Middlewares.DomainFilter',
'Elixir.Crawly.Middlewares.UniqueRequest',
'Elixir.Crawly.Middlewares.RobotsTxt',
{'Elixir.Crawly.Middlewares.UserAgent', [
{user_agents, [
<<"Mozilla/5.0 (Macintosh; Intel Mac OS X x.y; rv:42.0) Gecko/20100101 Firefox/42.0">>,
<<"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36">>
]
}]
}
]
},
{pipelines, [
{'Elixir.Crawly.Pipelines.Validate', [{fields, [<<"title">>, <<"author">>, <<"publishing_date">>, <<"url">>, <<"article_body">>]}]},
{'Elixir.Crawly.Pipelines.DuplicatesFilter', [{item_id, <<"title">>}]},
{'Elixir.Crawly.Pipelines.JSONEncoder'},
{'Elixir.Crawly.Pipelines.WriteToFile', [{folder, <<"/tmp">>}, {extension, <<"jl">>}]}
]
}]
}].
3. Start the container with the following command:
docker run --name yml_spiders_example \
-it -p 4001:4001 \
-v $(pwd)/crawly.config:/app/config/crawly.config \
oltarasenko/crawly:0.15.0
Once started, you should see debug messages like these in your console. That is a good sign: it worked!
09:13:05.047 [info] Opening/checking dynamic spiders storage
09:13:05.049 [debug] Using the following folder to load extra spiders: ./spiders
09:13:05.054 [debug] Starting data storage
Now you can open `localhost:4001` in your browser, and your journey starts here!
[Image: Crawly Management Tool / Management interface]
4. Building a spider
Once you click Create New Spider, you will see the following basic page where you can input your spider code:
[Image: Crawly Management Tool / Creating a new spider]
One might say that these interfaces look quite simple and basic for something built in 2023. That’s fair. We’re backend developers, and we do what we can; this approach achieves the needed results with minimal frontend effort. If you have a passion for improving it, or want to contribute in any other way, you are more than welcome to do so!
The interface above expects you to write valid YML, so you need to know what format is expected. Let’s start with a basic example, add some explanations, and improve it later.
I suggest starting by inputting the following YML there:
name: ErlangSolutionsBlog
base_url: "https://www.erlang-solutions.com"
start_urls:
  - "https://www.erlang-solutions.com/blog/web-scraping-with-elixir/"
fields:
  - name: title
    selector: "title"
links_to_follow:
  - selector: "a"
    attribute: "href"
Now, if you click the Preview button, you will see what the spider is going to extract from your start URLs:
[Image: Crawly Management Tool / Previewing spider results before running anything]
So, what you can see here: the spider will extract only one field, `title`, which equals “Erlang Solutions”. Besides that, your spider is going to follow these links after the start page:
"https://www.erlang-solutions.com/",
"https://www.erlang-solutions.com#",
"https://www.erlang-solutions.com/services/consulting/",
"https://www.erlang-solutions.com/services/development/",
....
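To build some intuition for what `links_to_follow` combined with the `DomainFilter` middleware does, here is a small, hypothetical pure-Python sketch. Crawly itself is written in Elixir and uses a real HTML parser rather than regular expressions; the HTML snippet and function name below are illustrative assumptions, not Crawly internals.

```python
import re
from urllib.parse import urljoin, urlparse

# Hypothetical HTML snippet standing in for a fetched blog page
html = '''
<a href="/services/consulting/">Consulting</a>
<a href="https://www.erlang-solutions.com/services/development/">Development</a>
<a href="https://example.com/elsewhere">External</a>
'''

def follow_links(doc: str, base_url: str) -> list[str]:
    """Collect href attributes, resolve them against the base URL,
    and keep only links on the spider's own domain."""
    domain = urlparse(base_url).netloc
    links = []
    for href in re.findall(r'href="([^"]+)"', doc):
        absolute = urljoin(base_url, href)
        # A DomainFilter-like check: drop off-domain links
        if urlparse(absolute).netloc == domain:
            links.append(absolute)
    return links

print(follow_links(html, "https://www.erlang-solutions.com"))
# ['https://www.erlang-solutions.com/services/consulting/',
#  'https://www.erlang-solutions.com/services/development/']
```

The external link is filtered out, while the relative link is resolved into an absolute URL, which mirrors the list of followed links shown above.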
As in the original article, we plan to extract the following fields:
title
author
publishing_date
url
article_body
The selectors are copied from the previous article and can be found using Google Chrome’s inspect-and-copy approach:
fields:
  - name: title
    selector: ".page-title-sm"
  - name: article_body
    selector: ".default-content"
  - name: author
    selector: ".post-info__author"
  - name: publishing_date
    selector: ".header-inner .post-info .post-info__item span"
You may have noticed that the `url` field is not listed here. That’s because Crawly automatically adds the URL to every item.
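For reference, here is the complete YML spider with all the pieces from above combined: the name, base URL, start URL, final selectors, and the link-following rule.

```yaml
name: ErlangSolutionsBlog
base_url: "https://www.erlang-solutions.com"
start_urls:
  - "https://www.erlang-solutions.com/blog/web-scraping-with-elixir/"
fields:
  - name: title
    selector: ".page-title-sm"
  - name: article_body
    selector: ".default-content"
  - name: author
    selector: ".post-info__author"
  - name: publishing_date
    selector: ".header-inner .post-info .post-info__item span"
links_to_follow:
  - selector: "a"
    attribute: "href"
```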
Now, if you hit Preview again, you should see the full scraped item:
[Image: Crawly Management Tool / The item now contains full results]
Now, if you click Save, you will be able to run the full spider and see the actual results.
[Image: Crawly Management Tool / The spider is running and already extracting data]
We hope you like our work! We hope it helps reduce the need to write spider code, or perhaps engages non-Elixir people who can now play with scraping without coding!
Please don’t be too critical of the product; everything in this world has bugs, and we plan to keep improving it over time (as we have for the last 4+ years). If you have ideas or improvement suggestions, please drop us a message!