4chan download script

Thursday, 24. December 2009 14:49 - daniel - Other - 10 Comments


A few days ago 4chan changed their links and my old download script stopped working. Here is the updated version.

#!/bin/sh

if [ -z "$1" ]; then
    echo "Usage: $(basename "$0") <4chan thread url>"
    exit 1
fi

echo "4chan downloader"
echo "Downloading until canceled or 404'd"

# Use the trailing thread number of the URL as the download directory
LOC=$(echo "$1" | sed 's/\.html$//' | egrep -o '[0-9]*$')
if [ -z "$LOC" ]; then
    echo "Could not find a thread number in the URL"
    exit 1
fi
echo "Downloading to $LOC"

if [ ! -d "$LOC" ]; then
    mkdir "$LOC"
fi

cd "$LOC" || exit 1

while true; do
    TMP=$(mktemp)
    TMP2=$(mktemp)

    # Fetch the thread page; stop once the thread 404s
    if ! wget -O "$TMP" "$1"; then
        rm -f "$TMP" "$TMP2"
        exit 1
    fi

    # Extract the full-size image URLs from the thread page
    egrep -o 'http://images\.4chan\.org/[a-z0-9]+/src/[0-9]*\.(jpg|png|gif)' "$TMP" > "$TMP2"

    # Download the images, skipping files we already have (-nc)
    wget -nc -i "$TMP2"

    rm -f "$TMP" "$TMP2"

    echo "Waiting 30 seconds before next run"
    sleep 30
done
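
To use it, save the script, make it executable, and pass it a thread URL; the filename and thread number below are just made-up examples:

chmod +x 4chan-dl.sh
./4chan-dl.sh http://boards.4chan.org/g/res/12345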



Comments

Gosha - Monday, 28. December 2009 16:07

Very cute~

Gosha - Tuesday, 29. December 2009 6:09

I wanted to be able to save a whole thread, not just the images, so I edited your script: http://liten.besvikel.se/perma/chandl

Emil - Monday, 4. January 2010 19:21

Is there anyone who can make this script or Gosha's script work on a Mac with OS X 10.6? That would really be helpful :D

Jamy - Thursday, 7. January 2010 20:29

Pardon, how do I use this? I am using Win XP.

Daniel - Thursday, 7. January 2010 20:54

You don't really use it on Windows. Maybe if you get (e)grep, wget, and bash for Windows, it might work.

sam2332 posted an AutoIt port of the old script for Windows. You have to fix it for the new 4chan URLs, but that's easier than getting my script working on Windows.

Here is sam2332's script: http://dl.getdropbox.com/u/226498/script/4chan_img_downloader.au3

Daniel - Friday, 8. January 2010 0:26

@Gosha: If you want to download the whole thread (e.g. mirror it), you can use wget like this

wget -e robots=off -E -nd -nc -np -r -k -H -Dimages.4chan.org,thumbs.4chan.org [4chan url]

OK, that's a shitload of options, and it took me some time to look them up, so here is the explanation:

-e robots=off makes wget ignore robots.txt

-E renames html files to .html, so you don't get a file named 12345 but 12345.html instead

-nd no directory structure, downloads all files into the current folder

-nc no clobber, don't create a 12345.jpg.1 file if 12345.jpg already exists

-np no parent, don't go to the parent directory, i.e. it will only download the thread, not the whole board

-r recursive, download all files that are referenced in the originally downloaded file

-k convert links in html files to point to the locally downloaded files

-H span hosts, don't just download from boards.4chan.org, but from every host

-Dimages.4chan.org,thumbs.4chan.org only download from these hosts (includes subdomains of thumbs.4chan.org, e.g. 0.thumbs.4chan.org)
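
Putting it together, a full invocation looks like this (the thread URL here is just a made-up placeholder):

wget -e robots=off -E -nd -nc -np -r -k -H -Dimages.4chan.org,thumbs.4chan.org http://boards.4chan.org/g/res/12345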

Ok, that's it. pew.

Daniel - Friday, 8. January 2010 0:36

By the way, I think it's also easier to use this approach to download just the images. Use wget like this:

wget -e robots=off -E -nd -nc -np -r -H -Dimages.4chan.org -Rhtml [4chan url]

New options:

-Rhtml reject files with html extension

-k is no longer necessary because we delete the html file anyway.

This is cool, because there is no need for bash, grep, or anything else. You can just download wget for Windows and run this on Windows too.

Gosha - Tuesday, 17. August 2010 12:34

That's really neat. What's missing in the wget-only version is that it downloads the thread only once, not until the thread 404s. Also, the regex magic in the bash script makes it work on practically any Wakaba board.
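
A minimal sketch of such a loop, assuming wget's --spider check exits non-zero once the thread page 404s (the URL is a made-up placeholder):

# re-check the thread page; keep mirroring until it 404s
URL=http://boards.4chan.org/g/res/12345
while wget -q --spider "$URL"; do
    wget -e robots=off -E -nd -nc -np -r -H -Dimages.4chan.org -Rhtml "$URL"
    sleep 30
done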

Redlegion - Tuesday, 21. December 2010 16:50

Hey, thanks for the awesome script. I updated it a bit myself to download five images concurrently. Pretty handy for 150-image threads where each image is roughly 2 MB in size. Check it out here:

http://pastebin.com/8zqRpKkY

NOTE: I had to change a switch for mktemp to suit FreeBSD's version of mktemp. Might want to revert that if you aren't using FreeBSD.
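
The general idea, as a sketch rather than Redlegion's actual code, is to replace the single "wget -nc -i $TMP2" call in the script above with xargs spawning several wget processes in parallel (this assumes an xargs that supports -P, as in GNU and BSD):

# download the collected image URLs five at a time
xargs -P 5 -n 1 wget -nc < "$TMP2"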

James - Thursday, 20. January 2011 6:15

Awesome script, thanks.