4chan download script
Thu, 24 Dec 2009 16:49 - Daniel - Other - Comments (10)
A few days ago 4chan changed their links and my old download script stopped working. Here is the updated version.
#!/bin/sh
if [ "$1" = "" ]; then
echo "Usage: $(basename "$0") <4chan thread url>"
exit 1
fi
echo "4chan downloader"
echo "Downloading until canceled or 404'd"
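# Use the trailing thread number as the directory name (a ".html" suffix is dropped first)
LOC=$(echo "$1" | sed 's/\.html$//' | egrep -o '[0-9]+$')
[ -n "$LOC" ] || { echo "No thread number found in $1"; exit 1; }
echo "Downloading to $LOC"
mkdir -p "$LOC"
cd "$LOC" || exit 1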
while [ "1" = "1" ]; do
# An explicit template works with both GNU and BSD mktemp
TMP=$(mktemp /tmp/4chan.XXXXXX)
TMP2=$(mktemp /tmp/4chan.XXXXXX)
wget -O "$TMP" "$1"
if [ "$?" != "0" ]; then
rm "$TMP" "$TMP2"
exit 1
fi
egrep -o 'http://images\.4chan\.org/[a-z0-9]+/src/[0-9]+\.(jpg|png|gif)' "$TMP" > "$TMP2"
#cat "$TMP2" | sed 's!/cb-nws!!g' > "$TMP"
wget -nc -i "$TMP2"
rm "$TMP" "$TMP2"
echo "Waiting 30 seconds before next run"
sleep 30
done;
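The extraction step is the heart of the script, so here is a minimal sketch of it in isolation that can be tried without touching the network. The function name `extract_image_urls` and the sample markup are made up for illustration, and the `sort -u` is an addition (the script itself relies on `wget -nc` to skip duplicates).

```shell
#!/bin/sh
# Print every full-size image URL found in the HTML read from stdin.
# The host name and path layout are the ones used in the script above.
extract_image_urls() {
    grep -Eo 'http://images\.4chan\.org/[a-z0-9]+/src/[0-9]+\.(jpg|png|gif)' | sort -u
}

# Made-up sample markup, only for illustration.
sample='<a href="http://images.4chan.org/b/src/1261670000001.jpg">img</a>
<img src="http://0.thumbs.4chan.org/b/thumb/1261670000001s.jpg">
<a href="http://images.4chan.org/b/src/1261670000002.png">img</a>
<a href="http://images.4chan.org/b/src/1261670000001.jpg">dup</a>'

# Thumbnails are ignored and the repeated link is listed only once.
printf '%s\n' "$sample" | extract_image_urls
```

The dots in the host name and before the extension are escaped so they only match a literal dot.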
Tags: 4chan download shell bash wget grep
Comments
Gosha, Sweden - Mon, 28 Dec 2009 18:07
Very cute~
Gosha, Sweden - Tue, 29 Dec 2009 08:09
I wanted to be able to save a whole thread, not just the images, so I edited your script: http://liten.besvikel.se/perma/chandl
Emil, Denmark - Mon, 04 Jan 2010 21:21
Is there anyone that can make this script or Gosha's script work on a Mac with OS X 10.6? That would really be helpful :D
Jamy, Unknown - Thu, 07 Jan 2010 22:29
Pardon, how do I use this? I am using Win XP.
Daniel, Austria - Thu, 07 Jan 2010 22:54
You don't really use it on Windows. Maybe if you get (e)grep, wget and bash for Windows it might work.
sam2332 posted an AutoIt port of the old script for Windows. You have to fix it for the new 4chan URLs, but that is easier than getting my script working on Windows.
Here is sam2332's script: http://dl.getdropbox.com/u/226498/script/4chan_img_downloader.au3
Daniel, Austria - Fri, 08 Jan 2010 02:26
@Gosha: If you want to download the whole thread (e.g. mirror it), you can use wget like this
wget -e robots=off -E -nd -nc -np -r -k -H -Dimages.4chan.org,thumbs.4chan.org [4chan url]
Ok, that's a shitload of options, and it took me some time to look them up, so here is the explanation:
-e robots=off makes wget ignore robots.txt
-E appends .html to downloaded html files that lack it, so you don't get a file named 12345 but 12345.html instead
-nd no directory structure, downloads all files into the current folder
-nc no clobber, skip files that already exist instead of saving a second copy as 12345.jpg.1
-np no parent, don't go to parent directory, e.g. it will only download the thread but not the whole board
-r recursive, download all files that are referenced in the originally downloaded file
-k convert links in html files to point to the locally downloaded files
-H span hosts, don't just download from boards.4chan.org, but from every host
-Dimages.4chan.org,thumbs.4chan.org only download from these hosts (includes subdomains of thumbs.4chan.org, e.g. 0.thumbs.4chan.org)
Ok, that's it. Phew.
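The mirror command can also be wrapped in a small function that puts each thread into its own directory, named the same way the script in the post does it. This is only a sketch: `mirror_thread`, the `DRY_RUN` switch, and the example URL are made up for illustration and are not part of wget.

```shell
#!/bin/sh
# Mirror one thread into a directory named after its thread number.
# DRY_RUN is a hypothetical switch: when it is non-empty the wget
# command is only printed, so nothing is downloaded or created.
mirror_thread() {
    url=$1
    dir=$(printf '%s\n' "$url" | sed 's/\.html$//' | grep -Eo '[0-9]+$')
    [ -n "$dir" ] || { echo "no thread number in: $url" >&2; return 1; }
    # Same options as in the comment above.
    set -- wget -e robots=off -E -nd -nc -np -r -k -H \
        -Dimages.4chan.org,thumbs.4chan.org "$url"
    if [ -n "$DRY_RUN" ]; then
        printf '%s\n' "$*"
        return 0
    fi
    mkdir -p "$dir" && ( cd "$dir" && "$@" )
}

# Preview the command for a made-up thread URL without touching the network.
DRY_RUN=1
mirror_thread "http://boards.4chan.org/b/res/12345"
```

With `DRY_RUN` left empty, the same call creates the directory `12345` and runs wget inside it.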
Daniel, Austria - Fri, 08 Jan 2010 02:36
By the way, I think it's also easier to use this to just download the images. Use wget like this:
wget -e robots=off -E -nd -nc -np -r -H -Dimages.4chan.org -Rhtml [4chan url]
New options:
-Rhtml reject files with html extension
-k is no longer necessary because we delete the html file anyway.
This is cool, because there is no need for bash, grep, or anything else. You can just download wget for Windows and run this on Windows too.
Gosha, Sweden - Tue, 17 Aug 2010 15:34
That's really neat.
What's missing in the wget-only version is that it only downloads once instead of repeating until the thread 404s. Also, the regex magic in the bash script makes it work on practically any wakaba board.
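One way to get the repeat-until-404 behaviour back is to wrap the wget-only command in a loop. This is a rough sketch, not a tested replacement for the script: the function names `thread_alive`, `fetch_images`, and `watch_thread` are made up, and any failure to reach the thread page (not only a 404) ends the loop, just as in the original script.

```shell
#!/bin/sh
# Succeeds while the thread page can still be fetched.
# --spider only checks the page and saves nothing.
thread_alive() {
    wget -q --spider "$1"
}

# The image-only command from the earlier comment.
fetch_images() {
    wget -e robots=off -E -nd -nc -np -r -H -Dimages.4chan.org -Rhtml "$1"
}

# Fetch new images every $2 seconds (default 30) until the thread is gone.
watch_thread() {
    url=$1
    delay=${2:-30}
    while thread_alive "$url"; do
        fetch_images "$url"
        sleep "$delay"
    done
    echo "Thread is gone, stopping."
}

# Usage (made-up thread URL):
#   watch_thread "http://boards.4chan.org/b/res/12345"
```

Checking the page separately avoids stopping early when a single image link fails while the thread itself is still up.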
Redlegion, United States - Tue, 21 Dec 2010 18:50
Hey, thanks for the awesome script. I updated it a bit myself to download five images concurrently. Pretty handy for 150-image threads where each image is roughly 2 MB in size. Check it out here:
http://pastebin.com/8zqRpKkY
NOTE: I had to change a switch for mktemp to suit FreeBSD's version of mktemp. Might want to revert that if you aren't using FreeBSD.
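For readers who cannot reach the pastebin link, concurrent downloads can be sketched with `xargs -P`, which both GNU and BSD xargs provide. This is not the script linked above, only an illustration of the idea; `download_parallel` and the `JOBS` and `FETCH` variables are made-up names.

```shell
#!/bin/sh
# Read URLs from stdin, one per line, and fetch up to JOBS of them at once.
# FETCH is a hypothetical override so the fan-out can be tried offline,
# for example with FETCH=echo.
download_parallel() {
    njobs=${JOBS:-5}
    # Intentionally unquoted: FETCH may hold a command plus its options.
    xargs -n 1 -P "$njobs" ${FETCH:-wget -nc -q}
}

# Usage with the URL list the script writes to "$TMP2":
#   download_parallel < "$TMP2"
```

Each URL is handed to its own wget process, so `-nc` still prevents files that already exist from being fetched again.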
James, United States - Thu, 20 Jan 2011 08:15
Awesome script, thanks.