
Tuesday, November 11, 2025

Build a Web Scraper (super simple!)

 

Hello friends on the internet! Today I want to show you how easy it is to build a web scraper in under 20 lines of code, and not only that, but also how to adapt the web scraper to scrape whatever you need from a web page.

I will be building this project using JavaScript, Node.js and Express, as well as two packages called Axios and Cheerio. I will be doing this with a beginner's mindset in mind, so if you don't know anything about Node or Express, please do not be worried: I will be taking you through everything step by step and explaining everything we are doing along the way.

My aim for this video is to make it accessible to as many of you as possible. A basic understanding of JavaScript is advised but not a hard prerequisite, as I am giving you my full permission to take the 20 lines of code, copy and paste them, and use them as you wish, after of course understanding what the code does by watching this tutorial. But before we get started: what exactly is web scraping, and what is it useful for?

Web scraping refers to the extraction of data from a website, quickly and accurately. Imagine, for example, that you are working at a company that has asked you to make a list of all the companies working at a particular trade show, and not only that, but their contact names and email addresses. Well, most people would probably open up the website of the trade show and start writing down the first company starting at A, then the name, and then the email associated with that company, and then move on to the next one, and so on and so on. It could literally take you days to get all the details that you need, and most likely some spelling mistakes would be made. With web scraping, you can have all that information in seconds. Many people move on to selling their web scraping tools for money, either by building them as a Chrome extension or API, or selling them to data-capturing companies, so the option to make money off this tool is there for you too.

Okay, so now that we understand what a web scraper is and what it can be used for, it's time to get building one.

So here we are. I'm just going to create a blank project using WebStorm; please feel free to use whatever code editor or IDE you wish, and just create an empty directory. I'm going to go ahead and click here and call it web-scraper, just like so, so that we can start completely from scratch. As you can see, here is my directory, and there are currently no files in it.

Before we get going, I just want to make sure that everyone watching has Node.js installed on their machine. Node.js is essentially an open-source server environment, and we will be using it to create our own server, or in other words our own backend. It's free and it allows us to use the JavaScript language to do it, so I am a big fan.

So I'm just going to head over to the Node.js website. Now, I am using a Mac, so I would of course click here to download it onto my computer; however, here are all the other options you have for installing it, so please go ahead and choose the one that you need. I already have this downloaded, so I'm not going to click it here, but please go ahead and click whichever version or option is required for you. Okay, great, now let's carry on.

Back in our project, it's time to get coding. The first thing I'm going to do is open up my terminal right here and type a command, and the command is npm init. This will trigger the initialization and spin up a package.json file. We are creating a package.json file so that we can install packages, or modules, into our project to use. If you want to have a look at all the packages that are available to us, please go ahead and visit npmjs.com.

So here are all the packages available at our disposal. If you go ahead and just type one in, axios, and click it, you get all the information on how to install it, as well as how many weekly downloads it gets. So there we go: you can literally search through all the packages that are available to you right here on this registry.

As a general rule, any project that uses Node.js, as we will be, needs to have a package.json file, so let's go ahead and create one. I'm just going to press enter, and these prompts will be shown. Now I'm just going to go through them and press enter: version 1 is fine, the description I'm going to leave blank, the entry point is index.js and that is fine, and then I'm just going to leave all of these blank, like so, and confirm. So there we have it: if we go in here, you will see that a package.json file has been generated for us based on the answers that we just gave.

So once again, here is our web-scraper: the version is 1 because this is the first version of the app that we are building, the description we left blank, and the main file that we are going to be reading is index.js. So let's go ahead and create that index.js file. I'm just going to create it like so, and there we go.
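
For reference, the generated package.json looks roughly like this at this point (the exact values depend on the answers you give to the prompts, so treat this as a sketch rather than something to copy exactly):

    {
      "name": "web-scraper",
      "version": "1.0.0",
      "description": "",
      "main": "index.js",
      "scripts": {
        "test": "echo \"Error: no test specified\" && exit 1"
      },
      "author": "",
      "license": "ISC"
    }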

The package.json file actually does a lot more than just hold our packages and the versions of them that we need, so if you'd like to know more about it, please pause here and google a beginner's guide to using npm. But for now, let's carry on.

So, wonderful, now that we have that, let's get to installing some packages. The first package that we are going to need is a package called Express. Express is essentially a back-end framework for Node.js, and we're going to install it in order to listen to paths and listen out on our port, to make sure that everything is working. What I mean by this is that if we visit a certain path or URL, it will execute some code, and it will listen out on the port that we define. But enough talking, let me show you how.
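
(Just so you can picture where we are heading, the pattern looks roughly like this once Express is installed. This is only a sketch to illustrate the idea; we will build the real version step by step below.)

    const express = require('express');
    const app = express();

    // run some code when a certain path is visited
    app.get('/', (req, res) => {
      res.send('Hello from our server');
    });

    // listen out on a port of our choosing
    app.listen(8000, () => console.log('Server running on port 8000'));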

So, as I said, the package that we need is called Express, and I'm just going to show you it on here: let's search for the package express, and it will give us the instructions on how to install it. I'm just going to copy that, go back to my project, and whack the command in here. So, npm i express: the i is essentially shorthand for install. I'm going to press enter and wait for that to install as a dependency of my project.

So that is now done, and you should suddenly see a dependency show up here. And there we go: express is our first dependency, and it has shown up here with a version.

Now, something that is quite important for you to know: if this project is not working for you, it could be (it doesn't have to be, but it could be) because of the version. If that is the case, make sure to delete whatever is in here, write the version that I am using, and install the package again by running npm i for short. That will reinstall the package and generate a package-lock.json file. As you can see here, this file has been generated since we installed the dependency, and if we look in it we will find the express package. I'm just going to find it in here by typing 'express', and there we go: you will see the version as well as which registry it has been installed from.

Wonderful. Another reason this project could not be working is that the Node version you have installed could be incompatible. To check your Node version (I'm just going to press Command-K to clear this down here), all you have to do is type node -v and make sure that it's the same as mine.

Now, if you want to change the version, you can do so. It will require some extra configuration, and you can use the nvm command to essentially install different Node versions. I'm going to show you how to do this; it might not work for you if you haven't configured your computer correctly, but essentially you can install a certain version onto your computer. So I can install version 0.10.31, for example, and press enter. Now I'm essentially installing this version as well as keeping the version I already have, and once that has finished loading, I'm going to show you how to use it. So let's just wait for that to finish, and then I can use that version by typing nvm use and the version right here. Since it has now switched to this version by default, I'm instead going to run nvm use again to switch back to the Node version that we installed. And there we go: we are now using Node version 14.7.6.

Wonderful. So those are two reasons that your project might not work: if you are watching this in the future, perhaps newer versions of Express or newer versions of Node have come out that have made something break. That is just a bit of knowledge you need to have, because it is not only applicable to this project but, in general, to many of the projects that you will come across as a developer.
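
(For reference, the version-related commands used above are roughly the following; the version numbers are simply the ones mentioned in the video, so substitute whichever versions you need, and note that nvm has to be set up on your machine for the last two to work.)

    node -v              # check which Node version is currently active
    nvm install 0.10.31  # install another Node version
    nvm use 14.7.6       # switch to the version you want to use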

Okay, so we now have the package Express; as a reminder, Express is a back-end framework for Node.js. Now, another package that we need to use (I'm just going to clear this again) is a package called Cheerio. So once again I'm just going to go here and search for the package cheerio, and there we go. Cheerio is a package that we will be using to essentially pick out HTML elements on a web page. It works by parsing markup and provides an API for traversing and manipulating the resulting data structure. Cheerio's selector implementation is nearly identical to jQuery's, so if you know jQuery, this might be familiar to you.
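
(As a tiny illustration of that jQuery-style API, here is a standalone snippet; the markup in it is made up and it is not part of the project yet.)

    const cheerio = require('cheerio');

    // load some markup, then query it with jQuery-style selectors
    const $ = cheerio.load('<h2 class="title">Hello world</h2>');
    console.log($('h2.title').text()); // prints: Hello world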

So now that we know what we will be using it for, let's get to using it to pick out elements from a web page, and we're going to be doing that from this web page right here. So let's go ahead and install it: I'm simply going to copy this and, in WebStorm, install the package cheerio, just like we did with Express. Once again, it should appear in our dependencies, right here. So here we go: there is cheerio, and the version of cheerio that we installed.

Wonderful. We have one more package to install, and that is Axios, so once again let's go in here and find axios. Axios is a promise-based HTTP client for the browser and Node.js. Axios essentially makes it easy to send HTTP requests to REST endpoints and perform CRUD operations; this means that we can use it to get, post, put and delete data. It is a very popular package, and one that I use quite a lot as a developer on a day-to-day basis. So once again, let's install it; I'm going to show you how to use it in a bit.
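
(As a quick preview of what that promise-based API looks like, here is a generic sketch; it is not code from this project and the URL is just a placeholder.)

    const axios = require('axios');

    // GET some data from an endpoint, then work with the response
    axios.get('https://example.com/api/items')
      .then(response => console.log(response.data))
      .catch(error => console.log(error));

    // axios.post(url, body), axios.put(url, body) and axios.delete(url)
    // cover the other CRUD operations in the same promise-based style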

So once again, I'm just going to put that in here and wait for that to install as a dependency. Okay, wonderful. So there we have it: there we have all three of the packages that we're going to need for this project.
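
(To recap, the terminal commands run so far from inside the project directory look roughly like this.)

    npm init         # answer the prompts to generate package.json
    npm i express    # back-end framework for Node.js
    npm i cheerio    # parses HTML and lets us pick out elements
    npm i axios      # promise-based HTTP client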

Now that we have that, I'm just going to do one more thing, and that is write a script. To write it, I'm just going to get rid of this one, because we're not going to need it, and I'm going to write a start script, so that if I use the command npm run start (as 'start' is what I have called the script), it will essentially run nodemon index.js and listen out for changes on the index.js file. That is what nodemon does: it listens out for any changes made to our index.js file.
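
(The scripts section of package.json then looks roughly like this. Note that nodemon itself has to be available; if it is not already installed on your machine, you may need to add it, for example with npm i nodemon.)

    "scripts": {
      "start": "nodemon index.js"
    }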

So that is now done for the setup of our package.json file. Please feel free to take this from the code that I have shared with you in the source code. Hopefully you understand what all of this means for now, and exactly what we need to get going, so now let's head over to our index.js file.

The first thing that I'm going to do is actually use all the packages that we have just installed. If we go to the documentation, you will see that the first thing we need to do in order to use these packages is to require them in the index.js file. So I'm just going to copy that line and paste it in here, like so, and I'm actually going to do the same for all the packages: we've got axios, we also have cheerio (the package, again, is called cheerio), and then we also have the package express. So there we go, there are all three of the packages that we need.

Now, the next thing that I'm going to do is actually initialize Express. To do this, I'm going to take express: what I'm doing here is essentially getting the package, and all the wonderfulness that comes with it, and storing it as express, but we need to actually call express in order to release all that wonderfulness. I can do so by grabbing express and calling it, and now that we have called it, let's save it as something else: I'm going to save it as const app. You can call it whatever you wish.

Express essentially comes with great stuff like use, get or listen, and because we've saved it all under app, I'm going to use app.listen to listen out on a port, the port that we decide. Let's decide that our port is going to be const PORT = 8000. So we are saying that we want to listen out on port 8000, and essentially we want our server to run on port 8000. Again, this can be whatever port you wish; that is totally up to you. So I'm going to listen out on port 8000, and the syntax for this looks like this: app.listen, and then I'm going to pass through a callback, and I'm just going to say, if this is working, I want it to log 'server running on port' and then whatever port we defined up here. So this is looking good: server running on port.

Let's get to starting our app to see if this has worked. All I'm going to do is use the script we wrote, which is npm run start, as I've chosen to call it start. So there we go, and wonderful: our server is indeed running on port 8000, and it will essentially listen out for any changes we make to this file. So if I make a change to this file (let's just go ahead and call this 'bob', and this 'bob', for example) and click save, it will restart due to changes, start again by running node index.js, and then we get the message 'server running on port 8000'. So let's change that back to app, just to make things more readable, and carry on.
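
(Put together, the index.js so far looks roughly like this; it is a sketch of the code written on screen, using the same names as in the video.)

    const express = require('express');
    const cheerio = require('cheerio');
    const axios = require('axios');

    const PORT = 8000;
    const app = express();

    // confirm the server is up and listening on our chosen port
    app.listen(PORT, () => console.log(`server running on PORT ${PORT}`));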

So great, that is step one. Now, step two: let's get to actually doing some scraping.

To do this, I'm going to start using some of the packages, and the first package I'm going to use is axios. Axios works by being passed a URL: it visits the URL, and then I get the response back from it, and in this case I'm going to take the response data and save it as some HTML that we can work with. So let's pass through the URL that we want to work with. We know that this is the Guardian, so I'm just going to copy that and paste it in here, like so. We can of course make this much more readable, so I'm just going to save it as a const url, as I don't plan on it changing, and then just pass the url through, just like so.

Okay, so now that we've passed through that URL, I'm going to do some chaining. If you don't know much about chaining, I do have an asynchronous JavaScript mini-series that I really do recommend watching; for now, please just carry on coding along with me. Anyway, this will return a promise, and once that promise has resolved, we get the response of whatever has come back. So: response, and then, well, we're going to get the response data, and let's save this as html; you can call it whatever you wish. Now, if I console.log the html and just click save, you will see all of this HTML come back to me. This is essentially the HTML of the Guardian homepage; you will see it here, Guardian, all Guardian-related stuff.
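
(In code, that step looks roughly like this. The URL is simply the Guardian homepage address used in the video, so swap in whatever page you want to scrape.)

    const url = 'https://www.theguardian.com/';

    axios.get(url)
      .then(response => {
        const html = response.data; // the raw HTML of the page
        console.log(html);
      });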

So this is great, but how do we start picking out certain elements? Like, what if I want to pick out this button, for example? Well, we do so with Cheerio.

So let's go ahead and do that. I'm just going to delete this for now, and I'm going to use Cheerio, the package we just installed. It comes with something called load, which allows us to pass through the HTML, so all of this, and then we're going to save the result as, let's just do, a dollar sign. Okay, so there we go: now, whenever we use the dollar sign, we're essentially working with all of this HTML, and I can use it to look through all of the HTML elements and look for something with a particular selector.

Let's go ahead and see what we want to pick out, so I'm just going to inspect this page. If we want to pick out, for example, all the titles in here, I can do so: I can pick out each of the article titles and perhaps the URL that comes with them. I could look for, let's go ahead and inspect something, let's inspect this one; we could look for something that has, hmm, maybe not that one, this one maybe. Let's make it bigger to have a better view of what we can and can't use.

So, for example, if we inspect this h3 tag right here, we can see that it has the class fc-item__title, so let's go ahead and use that, because inside it we also see an a tag with an href, which is a URL. I'm just going to copy this as the class name that we want to look out for, and paste it in like so, making sure to put a dot in front of it, as we are looking for a class name. That is what we are looking for in the HTML, so don't forget the dot; that is the syntax that you need.

And for each item that we find like this, well, what do I want to happen? Let's write a function, so this is a callback function, and for each item that we find with the class fc-item__title, I want to get that item, and the syntax for doing so is this. I want to grab its text: we know this is an h3 tag, so it will have some text, and if you have a look here, there is indeed some text in there, and that is what we are grabbing, essentially. I also want to grab the href, and I can do so, once again, by grabbing this and getting the attribute href that exists inside it. If I want to be more precise, and I think that might be a good thing to do, I can also find the a tag that exists in that item and then get the attribute href from it. Okay, so there we go, that is the syntax for doing so. Let's save this one as title, and let's save this one as the url that we are looking for. And there we go.
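
(That selection step, carried on inside the .then() from the axios call, looks roughly like this in code. The class name fc-item__title is just what the Guardian's markup happened to use at the time of recording, so inspect your own target page and substitute whatever selector you find there; we will push these values into an array in a moment.)

    // inside .then(response => { ... }) from the axios call above
    const html = response.data;
    const $ = cheerio.load(html);

    $('.fc-item__title').each(function () {
      const title = $(this).text();               // the text inside the h3
      const url = $(this).find('a').attr('href'); // the link inside it
      console.log(title, url);
    });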

So, for each element that we are finding, we're getting a title, and we're getting something that is the url. Now I'm actually going to create an array. Where shall we create this array? Let's go ahead and just create it up here, so I'm just going to do it here: const articles, and an empty array. Now, for each item that we find, I want to get the title, I want to get the url, and I'm going to go to the articles array, which is currently empty, and use a JavaScript method called push to push something into it. I'm going to push an object, and this object is going to have the title that we just picked out, and the url. Okay, so that's all we really need to do.

so that's all we really need to do the

21:37

next thing i'm going to do just to show

21:38

you this is working is just console log

21:42

and then

21:45

uh console log out the articles

21:48

just like so and just for good measure

21:50

we're gonna catch any errors so this is

21:52

how you catch errors i'm just gonna

21:54

catch uh the errors so catch

21:58

error

22:00

console log

22:02

error

22:04

okay

22:07

Great, so now let's check it out. I'm just going to save that, and let's see what comes back. There we go: we are indeed getting the array coming back. We have literally scraped the web page, and here are the results of our scrape: we are getting back the title and the url of all the articles that exist on the Guardian homepage, and there are a lot of them. So there we go: we have now successfully scraped a web page.

And that's really all there is to it, so hopefully that was easy enough. Again, if you want to just take this code, let's maybe make it a bit smaller: this is all it is, these are all the lines that you need, along with the setup.
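
(For reference, the finished index.js comes out looking roughly like this. Treat it as a sketch: the Guardian URL and the .fc-item__title class are simply what was used in the video, and the site's markup may well have changed since, so inspect the page you want to scrape and adjust the selector accordingly.)

    const express = require('express');
    const cheerio = require('cheerio');
    const axios = require('axios');

    const PORT = 8000;
    const url = 'https://www.theguardian.com/';
    const app = express();
    const articles = [];

    axios.get(url)
      .then(response => {
        const html = response.data;
        const $ = cheerio.load(html);

        // grab the title text and link of every matching element
        $('.fc-item__title').each(function () {
          const title = $(this).text();
          const url = $(this).find('a').attr('href');
          articles.push({ title, url });
        });

        console.log(articles);
      })
      .catch(error => console.log(error));

    app.listen(PORT, () => console.log(`server running on PORT ${PORT}`));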

You can of course adjust this to scrape whatever you wish. As long as you know what you're looking for on the web page, you can pick out certain elements, you can search for particular tags such as h3 tags, and you can search for things by class name; it is completely up to you.

So hopefully this has helped you in creating your own web scraping app. Please do hit me up if you have any questions, or if you just want to chat; do so in the description below. Thanks very much!
