hello friends on the internet
today i want to show you how easy it is
to build a web scraper using under 20
lines of code but not only that show you
how to adapt the web scraper in order to
scrape whatever you need from a web page
i will be building this project using
javascript node.js and express and i
will be doing this with a beginner's
mindset so if you don't know
anything about node or express please do
not be worried i will be taking you
through everything step by step and
explaining everything we are doing along
the way
my aim for this video is to make it as
accessible to as many of you as possible
a basic understanding of javascript is
advised but not a hard prerequisite as i
am giving you my full permission to take
the 20 lines of code so just copy and
paste them and use them as you wish
after of course understanding what the
code does by watching this tutorial but
before we get started what exactly is
web scraping and what is it useful for
web scraping refers to the extraction of
data from a website quickly and
accurately imagine for example you are
working at a company that has asked you
to make a list of all the companies
working at a particular trade show and
not only that but their contact name and
email addresses well most people would
probably open up the website of the
trade show and start writing down the
first company starting at a then the
name and then the email associated with
that company and then move on to the
next one and so on and so on and it
could literally take you days to get all
the details that you need and most
likely some spelling mistakes would be
made with web scraping you can have all
that information in seconds many people
move on to selling their web scraping
tools for money either by building them
as a chrome extension or api or selling
them to data capturing companies so the
option to make money off this tool is
there for you too
okay
so now that we understand what a web
scraper is and what it can be used for
it's time to get building one
so here we are i'm just going to create
a blank project using webstorm please
feel free to use whatever code editor or
ide you wish and just create an empty
directory so i'm just going to go ahead
and click here and just call this
web
scraper
just like so so that we can start
completely from scratch so as you can
see here is my directory there are
currently no files in it before we get
going i just want to make sure that
everyone watching has node.js installed
on their machines node.js is essentially
an open source server environment and we
will be using it to create our own
server or in other words our own backend
it's free and allows us to use the
javascript language in order to create
it so i am a big fan
so i'm just going to head over to
node.js
now i am using a mac so i would of
course click here in order to download
this onto my computer however here are
all the other options you have for
installing the source code so please go
ahead and choose the one that you need
now i already have this downloaded so i'm
not going to go ahead and click here but
please go ahead and click whichever
version or option is required for you
okay
great
now let's carry on
so back in our project it's time to get
coding the first thing i'm going to do
is just open up my terminal right here
and i'm going to type a command the
command is npm init okay
this will trigger initialization and
spin up a package.json file we are
creating a package.json file so that we
can install packages or modules into our
project to use if you want to have a
look at all the packages that are
available to us please go ahead and
visit npmjs.com
so here are all the packages available
at our disposal if you go ahead and just
type one say axios and click it you get all
the information on how to install it as
well as how many weekly downloads it
gets
so there we go you can literally search
through all the packages that are
available to you right here on this
registry
as a general rule any project that uses
node.js as we will be using will need to
have a package.json file so let's go
ahead and create one so i'm just going
to go ahead and press enter
and these prompts will be shown
now i'm just going to go through and
press enter
version 1 enter is fine description i'm
going to leave blank entry point is
index.js that is fine and then i'm just
going to leave all these blank like so
and click ok
so there we have it now if we go into
here you will see that a package.json
file has been generated for us based on
the prompts that we just answered
so once again here is our name web scraper
the version is one because this is the
first version of the app that we are
building the description we left blank
and the main file that we are going to
be reading is index.js so let's go ahead
and create that index.js file i'm just
going to go ahead and create it like so
and there we go
the package.json file actually does a
lot more than just hold our packages and
the versions of them that we need so if
you'd like to know more about it please
pause here and google beginner's guide
using npm but for now let's carry on
so
wonderful now that we have that let's
get to installing some packages the
first package that we are going to need
is a package called express express is
essentially a back-end framework for
node.js okay we're going to install it
in order to listen to paths and listen
out to our port to make sure that
everything is working okay
what i mean by this is that if we visit
a certain path or url it will execute
some code and it will listen out to the
port that we define but enough talking
let me show you how
so as i said the package that we need is
called express
so i'm just going to show you it on here
let's search for the package express
and it will give us the instructions on
how to install it so i'm just going to
copy that and go back to my project
and whack the command in
here
so npm i the i is essentially for install
it's a shorthand and i'm going to click
enter and wait for that to install as a
dependency to my project
so that is now done and you should
suddenly see a dependency show up here
and there we go so express is our first
dependency and it has shown up here with
a version
now what is quite important for you to
know is that if this project is not
working for you for any reason it could be
it doesn't have to be but it could be
because of the version so if that is the
case make sure to delete whatever's in
here and write the version that i am
using and just install the package again
by running npm i for short okay so that
will reinstall the package and will
generate a package-lock.json file so as
you can see here this file has been
generated since we installed the
dependency and if we look here we will
find the express package
so i'm just going to find that in here
by typing
express and there we go so you will see
the version as well as which registry it
has been installed from
wonderful
another reason that this project could
not be working is that the node version
that you installed could be incompatible
to check your node version all you have
to do so i'm just going to press command
k to clear this
down here
all you would have to do is type node -v
to check the version and make sure that
it's the same as mine
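If you would rather check the version from inside a script than from the terminal, a minimal sketch along these lines should print the same information. The variable names are only illustrative; `process.version` itself is part of Node.js.

```javascript
// process.version is a string that starts with 'v', for example 'v14.17.6'
const version = process.version

// take the part before the first dot to get the major version as a number
const major = Number(version.slice(1).split('.')[0])

console.log(`running node ${version} (major version ${major})`)
```

Comparing `major` against the version used in the video is one quick way to spot a mismatch before digging any deeper.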
now if you want to change the node
version you can do so it will require
some extra configuration and you can use
the nvm command to essentially install
different versions so i'm going to show
you how to do this this might not work
for you if you haven't configured your
computer correctly but essentially you
can install a certain version onto your
computer so i can install version
0.10.31 for example and press enter
so now i'm essentially installing this
version as well as having this version
okay and once that has loaded i'm
going to show you how to use that
version
so let's just wait for that to finish
and i can use that version
by typing nvm use
and then this version right here even
though by default it has now switched to
this version so instead i'm going to use
this version with nvm use to switch back
to using the node version that we
installed
and
there we go we are now using node
version 14.7.6
wonderful so those are two reasons that
your project might not work if you are
watching this in the future perhaps
there have been newer versions of express
or newer versions of node that have come
out that have made something break so
that is just something you need to know
that is a bit of knowledge because that
is not only applicable to this project
but in general is applicable to many
projects that you will come across as a
developer
okay
so we now have the package express as a
reminder the express package is a
back-end framework for
node.js okay
now another package that we need to use
i'm just going to clear this again is a
package called cheerio
so once again i'm just going to go here
and search for the package cheerio
and there we go
cheerio is a package that we will be
using to essentially pick out html
elements on a web page
it works by parsing markup and provides
an api for traversing and manipulating
the resulting data structure
cheerio's selector implementation is
nearly identical to jquery so if you
know jquery this might be familiar to
you
so now that we know what we will be
using this for let's get to using it to
pick out elements from a web page okay
and we're going to be doing that from
this webpage right here
so let's go ahead and install it i'm
simply going to copy this
and in webstorm just install the package
cheerio just like we did with express
and once again it should appear in our
dependencies
right
here so here we go there is cheerio and
the version of cheerio that we installed
wonderful we have one more package to
install and that is axios
so once again let's go in here and find
axios
axios is a promise based http client for
the browser and node.js
axios essentially makes it easy to send
http requests to rest endpoints and
perform crud operations this means that
we can use it to get post put and delete
data it is a very popular package and
one that i use quite a lot as a
developer on a day-to-day basis so once
again let's install it i'm going to show
you how to use it in a bit
so once again i'm just going to put that
in here and wait for that to install as
a dependency
okay
wonderful
so there we have it there we have all
three of the packages that we're going
to need for this project
now that we have that i'm just going to
do one more thing and that is write a
script so i'm just gonna get rid
of that one because we're not gonna need
it and i'm gonna write a start script so
that if i use the command npm run and
then start as that is what i have
called the script i'm going to
essentially run
nodemon index.js and listen out to
changes on the index.js file so that is
what nodemon does it listens out for any
changes made to our index.js file
so that is now done for the setup for
our package.json file please feel free
to take this from the code that i have
shared with you in the source code
hopefully you understand what all of
this means for now and exactly what we
need to get going so now let's head over
to our index.js file
the first thing that i'm going to do is
actually use all the packages that we
have just installed so if we go to the
documentation you will see that the
first thing we need to do in order to
use these packages is to
require them in the index.js file so i'm
just going to copy that line and in here
i'm just going to paste the line like so
and i'm actually going to do it for all
the packages so we've got axios we also
have
cheerio
and the package is again called
cheerio
and then we also have
the package
express
so there we go there's all three of our
packages that we need
now the next thing that i'm going to do
is actually initialize express
so to do this i'm actually going to get
express so what i'm doing here is
essentially getting the package and
getting all this wonderfulness
everything that comes with it and storing
it as express but we need to actually call
express in order to release all this
wonderfulness so i can do so by grabbing
express and calling it
and now that we have called it let's save
that as something else i'm going to save
it as const app you can call it whatever
you wish
so express essentially comes with great
stuff like use
get
or listen
and because we've saved it all under app
i'm going to use app
listen to listen out to a port so listen
out to the port that we decide let's
decide that our port is going to be
const port 8000 so we are saying that
we want to listen out to port 8000 to
see if any changes are made and
essentially we want our server to run on
port 8000 again this can be whatever
port you wish that is totally up to you
so i'm going to listen out to port 8000
uh what the syntax for this looks like
is like this
app.listen and then i'm going to
pass through
a
callback and i'm just going to say so if
this is working i want it to say server
running because this is my server on
port
and then pass through
whatever port we defined up here
so this is looking
good server running
on port let's get to starting our app to
see if this has worked so all i'm going
to do is use this script
and this script is npm run and then i've
chosen to call it start
so there we go
and wonderful our server is indeed
running on port 8000 and that will
essentially listen out for any changes
we make to this file so if i make a
change to this file let's just go ahead
and call this
bob
and call this bob for example and click
save
it will restart due to changes and start
again
by running node index.js okay and then we
we get the message server running on
port 8000 so let's change that back to
app just to make things more readable
and carry on
so great that is step one now step two
let's get to actually doing some
scraping
so
to do this i am gonna start using some
packages and the first package i'm
going to use is axios okay and axios
works by passing through a url and it
visits the url and then i get the
response from it and in this case i'm
going to get the response data and save
it as some html that we can work with
so in this case let's pass through the
url that we want to work with so we know
that this is
the guardian so i'm just going to copy
that and i'm just going to paste it in
here like so
we can of course make this much more
readable so i'm just going to save this
as a url as i don't plan on it changing
and save this string and then just pass
through the url just
like
so
okay so now that we've passed through
that url i'm going to do some chaining
if you don't know much about chaining i
do have an asynchronous javascript
miniseries that i really do recommend
you watch uh for now just please
carry on coding along with me anyway so
this will return a promise and once that
promise has resolved then we get the
response of whatever's come back so
response
and then
well we're going to get the response
data
and let's save this as html okay so you
can call this whatever you wish now if i
console log
html
and i am just going to click save
you will see all this html come back to
me this is essentially the html that is
from the guardian home page okay you
will see it here guardian
all guardian related stuff so this is
great but how do we start picking out
certain elements okay like what if i
want to pick up this button for example
well we do so with cheerio
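If the chaining part feels abstract, here is a small sketch of the same pattern that runs without any network access. `fakeGet` and its sample markup are hypothetical stand-ins made up for this example; in the real project the call would be `axios(url)`, which likewise returns a promise that resolves with a response object whose `data` property holds the page's html.

```javascript
// Hypothetical stand-in for axios(url): resolves with an object shaped like
// an axios response, where the body of the page lives on the data property.
function fakeGet(url) {
  return Promise.resolve({
    data: `<html><body><h1>page at ${url}</h1></body></html>`
  })
}

const url = 'https://example.com'

const result = fakeGet(url)
  // then runs once the promise has resolved
  .then(response => {
    const html = response.data
    console.log(html)
    return html
  })
  // catch runs if anything above rejected or threw
  .catch(err => console.log(err))
```

The shape is the point here: request, then handle the response, then catch anything that went wrong.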
so
let's go ahead and do that i'm just
going to delete this for now and i'm
going to use cheerio so the package we
just installed and it comes with
something called load that will allow us
to pass through the html so all of this
and then we're gonna save it as let's
just do a dollar sign
okay so there we go so now whenever we
use the dollar sign we're essentially
using all of this html and now i can
essentially
find so i'm going to use the dollar sign
and i can essentially look through all
of the html elements and look for
something with the
let's go ahead and see what we want to
pick out so i'm just going to inspect
this page
if we want to pick out for example all
the
titles in here so i can do so i can pick
out each of the articles title and
perhaps the url that comes with them i
could look for let's go ahead and
inspect something let's inspect this one
we could look for something that has the
uh
cfc maybe not
this one
maybe let's make it bigger to have a
better
view of what we can and can't use
so for example if we inspect this h3 tag
right here we can see that it has the
class of fc item title so let's go ahead
and use that because in it we also see
which is a url so i'm just going to copy
this as the class name that we want to
look out for
so here we go
and i'm just going to paste it like so
making sure to put a dot in front of it
as we are looking for a class
name
so that is what we are looking for in
the html so don't forget to put that
that is the syntax that you need and for
each item that you find like this well
what do i want to happen let's write a
function so this is a callback function
and for each item that we find that has
the class fc item title i want to get
that item so this is the syntax for
doing so this i want to grab its
text
so we know this is an h3 tag so it will
have some text if you want to have a
look here there is some text in here
so if we look in here there we go there
is some text and that is what we are
grabbing essentially and i also want to
grab the href so i can do so once again
by grabbing so
this
and getting the attribute
of
href that exists inside it if i want to
be more precise and i think that might
be a good thing to do i can also find
the a tag that exists in that item and
then get the attribute of href from it
okay so there we go that is the syntax
for doing so let's go ahead and save
this as title and let's save this as the
url that we are looking for
and there we go
so for each element that we are finding
we're getting a title we're getting
something that is the url and now i'm
actually going to create an array
so where shall we create this array
let's go ahead and just create it up
here so i'm just going to do it here
const articles
and an empty array now for each item
that we create i want to get a title i
want to get this url and i'm going to
get the articles array which is
currently empty and use a javascript
method called push to push something
into it and i'm going to create an
object and this object is going to have
the title that we just picked out and
the url
okay
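The push step on its own can be sketched like this. The `found` list is hypothetical sample data standing in for the titles and urls that the scraper picks out.

```javascript
const articles = []

// hypothetical sample data in place of what the scraper finds on the page
const found = [
  ['First headline', 'https://example.com/first'],
  ['Second headline', 'https://example.com/second']
]

found.forEach(([title, url]) => {
  // { title, url } is shorthand for { title: title, url: url }
  articles.push({ title, url })
})

console.log(articles)
```

Each entry ends up as one object per article, which is a convenient shape to log, filter or send back from a route later on.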
so that's all we really need to do the
next thing i'm going to do just to show
you this is working is just console log
and then
uh console log out the articles
just like so and just for good measure
we're gonna catch any errors so this is
how you catch errors i'm just gonna
catch uh the errors so catch
error
console log
error
okay
great so now let's check it out i'm just
going to save that
and let's see what comes back
there we go so we are indeed getting the
array that is coming back we have
literally scraped the webpage and we are
getting back so here is the results of
our scrape we are getting back the title
and the url of all the articles that
exist on the guardian homepage okay and
there is a lot so there we go
we have now successfully scraped a
webpage
and that's really all there is to it so
hopefully that was easy enough again if
you want to just take this code so let's
maybe make it a bit smaller this is all it
is these are all the lines that you need
along with the setup you can of course
adjust this to scrape whatever you wish
so as long as you know what you're
looking for on the web page you can pick
out those elements you can search for
a tags you can search for h3 tags you
can search for things by class name it
is completely up to you
so hopefully this has helped you in
creating your own web scraping app
please do hit me up if you have any
questions or if you just want to chat do
so in the description
below
thanks very much