Bash Web Scraping

hc · 5 min read · Sep 21, 2021

With built-in tools and easily available utilities like curl, xargs, convert, and pup, web scraping from the shell is straightforward. Let's go through an example with these tools and see how we can use them to scrape a manga website.

Manga Scraper

We are going to scrape the website https://manganato.com/. First we choose a manga we want to download; for our case, let's just choose Naruto. Looking at the page https://readmanganato.com/manga-ng952689, we can see there's a chapter list which contains all the chapters.

Clicking on any chapter brings us to a page with all of the chapter's pages loaded within a single HTML file, which makes our job easier. If that weren't the case, we could try finding another site.

First, let's get the links to every chapter by inspecting the HTML source. Using Chrome's built-in developer tools, it's very easy to view the source.

We see that the chapters are within a ul with a class name of row-content-chapter. Each chapter is then in a li element containing an a element which holds the link. Now, we have to extract this link. In this scraping project, let's use pup (https://github.com/ericchiang/pup) instead of BeautifulSoup. If you know jq, you will know how to use pup. It's a command-line tool that reads HTML from stdin and lets you filter it with CSS selectors to produce output. In our case, we are piping in the HTML source and filtering out the elements we want.
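To get a feel for pup before touching the real site, we can pipe it a tiny hand-written snippet shaped like the chapter list (the HTML below is a made-up stand-in, not the site's actual markup):

echo '<ul class="row-content-chapter"><li><a href="/chapter-1">Chapter 1</a></li><li><a href="/chapter-2">Chapter 2</a></li></ul>' | pup '.row-content-chapter a attr{href}'

# Output:
# /chapter-1
# /chapter-2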

Hence our first stage, getting all the chapter links, is just the following.

curl -s https://readmanganato.com/manga-ng952689 | pup '.row-content-chapter a attr{href}'
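As a quick sanity check that the selector actually matched something, we can count the links it returns:

curl -s https://readmanganato.com/manga-ng952689 | pup '.row-content-chapter a attr{href}' | wc -l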

With that, we can proceed to our second stage, which is extracting all images from a chapter. Let's take https://readmanganato.com/manga-ng952689/chapter-700.1 as our experiment chapter.

Looking at the HTML source, we see all the images are inside a div with the class container-chapter-reader.

Very simply, we can extract the images with the following command.

curl -s https://readmanganato.com/manga-ng952689/chapter-700.1 | pup '.container-chapter-reader img attr{src}'

The next step is to download every chapter, from the first to the last. We also need to store the files nicely in their correct folders. To do so, we will use the chapter links we obtained in stage one. For example, in https://readmanganato.com/manga-ng952689/chapter-0 we can see the chapter number in the link.

Let's use the grep command to extract the chapter.

echo "https://readmanganato.com/manga-ng952689/chapter-0" | grep -Eo 'chapter-[0-9]+'

With that, we can write a downloader.sh script which takes in the chapter link and performs the following:

  1. Create chapter folder
  2. Get chapter images
  3. Download chapter images
#!/bin/sh

# Obtain our chapter number (including an optional decimal part such as 700.1)
CHAPTER=$(echo "$1" | grep -Eo 'chapter-[0-9]+(\.[0-9]+)?')
echo "Downloading $CHAPTER"

# Create our chapter folder
mkdir -p "$CHAPTER"

# Enter our chapter folder
cd "$CHAPTER" || exit 1

# Extract all image URLs from the chapter and download each image under its remote name,
# sending a referer header so the image host serves the files
curl -s "$1" | pup '.container-chapter-reader img attr{src}' | xargs -n 1 -P 5 -I {} bash -c 'curl -sO {} -H "referer: https://readmanganato.com/"'

In the script we used xargs, which lets us pass the previous command's output (the individual image URLs) to curl. It also lets us run several processes in parallel with the -P flag. This speeds up our download by roughly five times, as we are doing 5 downloads at once instead of 1.
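To see the effect of -P in isolation, here's a toy run where sleep stands in for a slow download; five one-second tasks finish in roughly one second instead of five:

time (printf '%s\n' 1 2 3 4 5 | xargs -n 1 -P 5 -I {} bash -c 'sleep 1; echo "finished {}"')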

With that we are done! Here’s our final command to run.

curl -s https://readmanganato.com/manga-ng952689 | pup '.row-content-chapter a attr{href}' | xargs -n 1 -I {} -P 10 bash -c './downloader.sh {}'

With all our images downloaded into each chapter's folder, let's merge the images into a PDF! To do so, we can use the convert command, which requires installing ImageMagick.

The usage of this command is quite simple: convert img1.jpg img2.jpg output.pdf. Let's try to generate our command in that format.

First we list our folder using ls. One thing to take note of is that ls doesn't sort in natural (version) order by default, so page 10 would come before page 2 and mess up the order of the pages in the PDF.

Instead, we have to add the -v flag.
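For example, with pages named 1.jpg, 2.jpg and 10.jpg (illustrative names; the real files keep whatever names the site used), plain ls sorts lexicographically while ls -v sorts numerically. Note that -v is a GNU ls option; on macOS you would need the coreutils version (gls).

touch 1.jpg 2.jpg 10.jpg
ls     # 1.jpg 10.jpg 2.jpg  (lexicographic)
ls -v  # 1.jpg 2.jpg 10.jpg  (natural order)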

With our image file names, we can pass it into the convert command.

convert $(ls -v) chapter.pdf

# We can see our images merged into chapter.pdf
4091982 Aug 8 09:42 chapter.pdf

With that done, we can now iterate through every folder and do the conversion from images to PDF!

ls | xargs -n 1 -I {} -P 5 bash -c 'cd {} && convert $(ls -v) {}.pdf'
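One caveat: ls also lists downloader.sh and any other files sitting next to the chapter folders. The cd {} simply fails on those and && short-circuits, so nothing breaks, but if you prefer to iterate over directories only, a sketch using find (assuming the chapter-* folder names created by downloader.sh) looks like this:

find . -maxdepth 1 -type d -name 'chapter-*' | xargs -n 1 -I {} -P 5 bash -c 'cd "{}" && convert $(ls -v) "$(basename {}).pdf"'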
