Bash script to download images from a website

Shell Scripting

Image crawlers are very useful when we need to download all the images that appear on a web page. Instead of going through the HTML source and picking out the images by hand, we can use a script to parse out the image URLs and download the files automatically. Let's see how to do it.

#!/bin/bash
#Description: Images downloader
#Filename: img_downloader.sh
if [ $# -ne 3 ];
then
  echo "Usage: $0 URL -d DIRECTORY"
  exit 1
fi
for i in {1..4}
do
  case $1 in
    -d) shift; directory=$1; shift ;;
    *) url=${url:-$1}; shift ;;
  esac
done
mkdir -p "$directory";
baseurl=$(echo $url | egrep -o "https?://[a-z.]+")
curl -s "$url" | egrep -o "<img src=[^>]*>" | sed 's/<img src=\"\([^"]*\).*/\1/g' > /tmp/$$.list
sed -i "s|^/|$baseurl/|" /tmp/$$.list
cd "$directory";
while read filename;
do
  curl -s -O "$filename"
done < /tmp/$$.list

An example usage is as follows:

$ ./img_downloader.sh http://www.flickr.com/search/?q=linux -d images


How it works…
The above image downloader script parses an HTML page, extracts all the <img> tags, then pulls the URL out of each tag's src="URL" attribute and downloads the images to the specified directory.
The script accepts a web page URL and the destination directory path as command-line arguments. The first part of the script is a tricky way to parse those arguments.
The [ $# -ne 3 ] statement checks whether the total number of arguments to the script is three; if not, it prints a usage message and exits.
If there are three arguments, the script parses the URL and the destination directory. To do that, a tricky hack is used:

for i in {1..4}
do
  case $1 in
    -d) shift; directory=$1; shift ;;
    *) url=${url:-$1}; shift ;;
  esac
done

The for loop iterates four times (there is no significance to the number four; it simply runs the case statement enough times to consume all the arguments).
The case statement evaluates the first argument ($1) and matches either -d or any other string. Because of this, we can place the -d argument anywhere on the command line, as follows:

$ ./img_downloader.sh -d DIR URL

Or:

$ ./img_downloader.sh URL -d DIR

shift moves each argument one position to the left: after one shift, $1 takes the value of $2; after another, $1 takes the value of $3, and so on. Hence we can evaluate every argument through $1 itself. When -d is matched (the -d) branch), the next argument is obviously the value for the destination directory, so we shift and store it. *) is the default match; it catches anything other than -d. During the iterations, $1 in the default branch may be the URL or an empty string, and we must keep the URL without letting "" overwrite it. Hence we use the url=${url:-$1} trick: it keeps the existing value of url if one is already set, and otherwise assigns $1.
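The parsing trick above can be tried out in isolation. The following is a minimal sketch that wraps the same shift/case loop in a function so both argument orders can be tested; parse_args is a hypothetical name (not part of the original script), and a small guard is added on the final shift so the demo also behaves under stricter shells:

```shell
#!/bin/bash
# Hypothetical wrapper around the script's argument-parsing loop.
parse_args() {
  url="" directory=""
  for i in 1 2 3 4
  do
    case $1 in
      -d) shift; directory=$1; shift ;;
      # Guard the shift so iterating past the last argument is harmless.
      *) url=${url:-$1}; [ $# -gt 0 ] && shift ;;
    esac
  done
  echo "url=$url dir=$directory"
}

# Both orders yield the same result:
parse_args http://example.com -d images
parse_args -d images http://example.com
```

Running it prints `url=http://example.com dir=images` twice, confirming that -d can appear before or after the URL.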

egrep -o "<img src=[^>]*>" prints only the matching strings, which are the <img> tags including their attributes. [^>]* matches any run of characters except the closing >, that is, <img src="image.jpg" … >.
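To see the extraction pipeline in action without fetching a real page, we can feed it a sample HTML fragment (the fragment below is made up for illustration):

```shell
#!/bin/bash
# A made-up HTML fragment with two image tags, one relative and one absolute.
html='<p>hi</p><img src="/photos/cat.jpg" alt="cat"><img src="http://example.com/dog.png">'

# Same pipeline as the script: extract the <img> tags, then strip
# everything except the value of the src attribute.
echo "$html" | egrep -o "<img src=[^>]*>" | sed 's/<img src=\"\([^"]*\).*/\1/g'
# → /photos/cat.jpg
# → http://example.com/dog.png
```

The relative URL (/photos/cat.jpg) is what the later sed -i "s|^/|$baseurl/|" line fixes up by prefixing the base URL.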


12 Responses

  1. Harsh says:

    Can it be done to download all the executable files or files with a specific extension?

    • admin says:

      Yes, it can be done, but you have to tweak the script to match the tags that contain files with the specific extension. Here, for images, we parse the HTML for the <img src= tag.
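      For example, one possible version of that tweak matches <a href=…> tags ending in a given extension instead of <img> tags; the fragment and the .pdf extension below are just examples:

```shell
#!/bin/bash
# Made-up HTML fragment: one .pdf link and one non-matching link.
html='<a href="/docs/manual.pdf">manual</a><a href="/index.html">home</a>'

# Match anchor tags whose href ends in .pdf, then strip the tag syntax.
echo "$html" | egrep -o '<a href="[^"]*\.pdf"' | sed 's/<a href="\([^"]*\)"/\1/'
# → /docs/manual.pdf
```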

  2. razgorov prikazka says:

    Great job! Well done. Elegant.
    I was wondering though, is there a way to use it for multiple sites? In other words, could I feed the script a CSV (or some other sort of list) with n URLs so it could do this with n websites?
    That would be awesome man!

    • admin says:

      Hi Razgorov,

      It's easy to do. You can write a wrapper script (masterimgdl.sh) which runs the image downloader script again and again until the list of URLs in your url_list.txt file ends. So you need to do the following:

      1. Create a .txt file (url_list.txt) with all the URLs you want to crawl.
      2. Create a new bash script (masterimgdl.sh) with the following content:

      #!/bin/bash
      exec < url_list.txt
      while read line
      do
      ./img_downloader.sh ${line} -d images
      done

      3. Create a new bash script, name it img_downloader.sh, using the script/code in the post.
      4. Put all 3 files in a single folder and run chmod 0755 * on them.
      5. Run the master wrapper script you created in step 2 (./masterimgdl.sh).

      HTH, Admin

  3. Lee Smith says:

    Hi,

    When I try this I get the following error:
    % Total % Received % Xferd Average Speed Time Time Time Current
    Dload Upload Total Spent Left Speed
    100 954 0 954 0 0 3538 0 --:--:-- --:--:-- --:--:-- 12230
    100 360k 0 360k 0 0 182k 0 --:--:-- 0:00:01 --:--:-- 186k
    curl: Remote file name has no length!
    curl: try 'curl --help' or 'curl --manual' for more information
    GIF89a����!�,D;curl: Remote file name has no length!
    curl: try 'curl --help' or 'curl --manual' for more information
    curl: Remote file name has no length!
    curl: try 'curl --help' or 'curl --manual' for more information

    Any help on this as I require some images for a Computer Vision assignment

    • admin says:

      It seems you have some curl version issues; that's why it's throwing these errors. Try:

      curl -O -J -L $url

      I used the version below and it works fine:

      $ curl -V
      curl 7.15.5 (x86_64-redhat-linux-gnu) libcurl/7.15.5 OpenSSL/0.9.8b zlib/1.2.3 libidn/0.6.5
      Protocols: tftp ftp telnet dict ldap http file https ftps
      Features: GSS-Negotiate IDN IPv6 Largefile NTLM SSL libz

  4. Eugene says:

    It returns the following error; I also tried other URLs and got the same:

    sh-3.2# ./img_downloader.sh http://www.flickr.com/search/?q=linux -d images
    sh: ./img_downloader.sh: Permission denied
    sh-3.2#

  5. g4rp says:

    replace this:

    while read filename;
    do
    curl -s -O "$filename"
    done < /tmp/$$.list

    with this:

    wget -i /tmp/$$.list

  6. g4rp says:

    or just use smth like this:

    #!/bin/bash
    #Description: Images downloader
    #Filename: imgparser.sh
    if [ $# -ne 1 ];
    then
    echo "Usage: $0 URL"
    exit 1
    fi
    rm -rf images/;
    mkdir images/;
    url=$1;
    baseurl=$(echo $url | egrep -o "https?://[a-z.]+")
    # baseurl=$1;
    curl -s $url > ./tmp/list.txt;
    cat ./tmp/list.txt | egrep -o "<img src=[^>]*>" | sed 's/<img src=\"\([^"]*\).*/\1/g' > ./tmp/list_parsed.txt
    sed -i.bak "s|^[\/]*|$baseurl/|g" ./tmp/list_parsed.txt;
    cd images/
    wget -i ../tmp/list_parsed.txt
