Broadcatch

I have rewritten the whole broadcatch script. Of course the basic ideas and concepts live on in the new version. This new version has now been running for 4 months and it is comfortable to use: I just receive an email report when something was downloaded. There is also an xml/xsl file to check over the web if I'm interested (I was not interested for the last 3 months; email notification is great). This script can be used with any torrent software which monitors a directory for torrent files.

Requirements

  • Unix environment (sh, grep, sed, etc. available)
  • wget or curl
  • ssmtp or msmtp (for email notifications if desired)

Usage

The basics

The steps below should be enough to get you going:

  • You have to get the files (download will be coming).
  • Unpack the files somewhere on the disk
  • If required, edit the PATH statement in the broadcatch.sh script to point at your bin directory where sed/grep etc. reside
  • Set-up the broadcatch.conf file in the same directory as the broadcatch.sh script
  • Insert the invocation of the broadcatch.sh script into the crontab (see the example after this list)
  • If you don't use mlnet, you can comment out all lines which contain S80mlnet.
  • If you use mlnet but don't care / know / have a clue what is going on, you can remove the S80mlnet lines too.
  • If you don't care / know / have a clue what msmtp or ssmtp is, you can comment out the lines containing those strings.
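
A crontab entry invoking the script might look like this; the installation path and the two-hour schedule here are just assumptions:

# run broadcatch every two hours
0 */2 * * * /opt/broadcatch/broadcatch.sh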

Config file

Here you can read about the configuration variables which should be set by you in the broadcatch.conf configuration file.

  • TORRENTDIR - directory to move the torrent files into (read the manual of your torrent client)
  • WEBTOOL - set to wget if you want to use wget, otherwise curl will be used

This is a list of variables which can be set, but they already have defaults inside the script:

  • TORRENTTIMEOUT - how long to wait until the download of a torrent file times out
  • TORRENTRETRY - how many retries to do before giving up on a torrent file
  • DIFFORMAT - context/normal/unified (C/N/U). This is the format the diff tool uses to present its results. Set it if you are one of those unlucky guys like me whose diff supports only one of the formats.

An example config file might look something like this:

$DIFFORMAT=u
$TMPDIR="/opt/etc/mldonkey/tmp"
$TRACE="on"
$WEBTOOL="wget"
# seirei no moribito
easy|http://www.anime-kraze.org/torrent/series.php?series=61|mkv|<link>|<
# karas
easy|http://www.anime-kraze.org/torrent/series.php?series=44|torrent|<link>|<
# angel heart
easy|http://www.anime-kraze.org/torrent/series.php?series=54|torrent|<link>|<
# dennou coil
easy|http://www.anime-kraze.org/torrent/series.php?series=60|torrent|<link>|<
# Mobile Suit Gundam 00
easy|http://www.mininova.org/rss/conclave+mendoi+mobile+suit+gundam+00+H%20264+1280|tor|<link>|<
# Legend of Galactic Heroes
easy|http://rss.a.scarywater.net/tag/ca.rss|logh.*h264|<enclosure url=\"|\"
# Kaizoku fansubs
easy|http://xdcc.discoveryhosting.nl/tracker/|_One_Piece_[0-9].*\\.mp4\\.torrent|<a%20href=\\"|\\"
# the IT crowd season 2
easy|http://www.eztv.it/index.php?main=show&id=596|mininova|<a%20href='|'
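
Each non-comment, non-variable line defines one feed. Based on how the main script and the plugins parse these lines, the pipe-separated fields are: parser type (selects plugins/parse_<type>.sh), feed URL, pattern to grep for, start tag and end tag (with spaces encoded as %20). A generic line therefore has this shape (the URL is a placeholder):

easy|http://example.org/feed.rss|torrent|<link>|<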

Commented source

Well, this is intended for myself, should the thing stop working and I have to fix it. You are welcome to read it too if you want.

There are several basic components to the broadcatch script:

  • The common module which contains some widely used functions.
  • The broadcatch script itself which does the torrent handling
  • The plugins which handle the extraction of the torrent files from rss/html etc.
  • The mlnet startup/shutdown handling
  • crontab entry for the broadcatch file itself

Common module

This handles all the small things which are used over and over in the broadcatch and plugin script(s).

Load configuration file

This module loads the configuration file. In this configuration file variables can be set to influence the overall behaviour of the script. The variables must start with a $ (dollar) sign so this function can distinguish them from comments and torrent feeds. Parameters for this function are as follows:

  • $1 - original configuration file, preferably with full path
  • $2 - temporary file for variables
  • $3 - names of variables to read from config file

From the user's point of view, only the invocations in the main broadcatch script are interesting. These invocations contain the names of the variables which can be used in the configuration file.

loadConfigFile()
{
  grep '^\$' "$1" | sed -e 's/^.//' > "$2"
  while read CONFIGLINE
  do
    for VAR in $3
    do
        # the ^ anchor keeps e.g. LOG from also matching WEBLOG
        GREPVAR=`echo "${CONFIGLINE}" | grep "^$VAR=" | awk -F'=' '{ print $2 }'`
        if [ "$GREPVAR" != "" ]
        then
            eval "$VAR=\"$GREPVAR\""
        fi
    done
  done < "$2"
}
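
For example, given a config file containing $TRACE="on" and $WEBTOOL="wget", an invocation like this one (file names hypothetical) sets the corresponding shell variables:

loadConfigFile "./broadcatch.conf" "/tmp/broadcatch.var" "TRACE WEBTOOL"
# afterwards: TRACE=on and WEBTOOL=wget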

Replace restricted xml chars with metachars

This is not a big deal. It will replace all characters which can't be used in an XML file with the corresponding metacharacters. If interested, see the W3C standard for which these are (or the source below, of course). If the supplied parameter is an existing file, the metacharacters in that file's content are replaced (the result goes to stdout). If the supplied string is not a regular file, then the metacharacters in the string itself are replaced.

xmlify()
{
  if [ -f "${1}" ]
  then
    sed -e 's/\&/\&amp;/g' \
        -e 's/</\&lt;/g' \
        -e 's/>/\&gt;/g' \
        -e 's/"/\&quot;/g' "${1}"
  else
    echo "${1}" | sed -e 's/\&/\&amp;/g' \
      -e 's/</\&lt;/g' \
      -e 's/>/\&gt;/g' \
      -e 's/"/\&quot;/g'
  fi
}
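
For example, quoting a string containing XML-unsafe characters:

xmlify '<title>Tom & Jerry</title>'
# prints: &lt;title&gt;Tom &amp; Jerry&lt;/title&gt;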

Print text to file

There was a time when I thought this was a good idea. Maybe it still is, if I ever want to change how a line gets printed into a file in some twisted way. Well, whatever.

printXmlText()
{
  echo "$1" >> "$2"
}

Print single XML trace entry

This one prints a trace entry. It helps me reduce the typing in the main broadcatch script. The supplied text will be xmlified by the xmlify function and then written as an XML entry to the output file.

printXmlSingleTraceEntry()
{
  SAFETXT=`xmlify "${1}"`
  printXmlText "  <ENTRY TYPE=\"TRACE\" FROM=\"$MYNAME\">${SAFETXT}</ENTRY>" "${TMPTRC}"
}

Print single XML log entry

This is almost the same as the above; it just sets the entry type to FAILED or SUCCESS instead of TRACE, which saves typing effort in the main broadcatch script. In addition to the text which should be logged, you also have to supply the log status, either FAILED or SUCCESS.

printXmlSingleLogEntry()
{
  SAFETXT=`xmlify "${1}"`
  printXmlText "  <ENTRY TYPE=\"$2\" FROM=\"$MYNAME\">${SAFETXT}</ENTRY>" "${TMPXML}"
}
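
A call like this (assuming MYNAME is broadcatch) appends the following line to ${TMPXML}:

printXmlSingleLogEntry "No torrents to download!" "SUCCESS"
#   <ENTRY TYPE="SUCCESS" FROM="broadcatch">No torrents to download!</ENTRY>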

Print text to trace file

These print functions somehow got out of control, but then again they save typing effort, so I kept adding them. Here you simply supply a text which will be written into the trace file as is. No xmlification or such.

# Will print text to trace file
# $1 - text to print
printXmlTraceText()
{
  printXmlText "$1" "${TMPTRC}"
}

Print text to log file

Same as above. Just the output goes to the log file.

printXmlLogText()
{
  printXmlText "$1" "${TMPXML}"
}

Add the log/trace to the web log

Those log/trace printing functions write to temporary files. When the broadcatch script is finished it will append the written temporary logs/traces to the global xml log file. This global xml file is then accessible from the web. The parameters are:

  • $1 - log file where the xml output should be appended
  • $2 - temporary file which holds the new log entries to be added
  • $3 - temporary file to store the original log
addToWebLog()
{
  # remove the closing tag 
  sed -e '/^\s*$/d' -e '$,$d' "$1" > "$3"
  CLOSINGTAG=`sed -n '$p' "$1"`
  # equivalent to above CLOSINGTAG=`sed -e '$!d' "$1"`
  # equivalent to above CLOSINGTAG=`tail -1 "$1"`
  # add the newest log/trace results to the web log
  cat "$2" >> "$3"
  echo "${CLOSINGTAG}" >> "$3"
  mv "$3" "$1"
}
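
A usage sketch with hypothetical file names; the function splices the new entries in just before the closing root tag:

# log.xml ends with </BROADCATCH>; new.xml holds freshly written entries
addToWebLog "log.xml" "new.xml" "scratch.tmp"
# log.xml now contains the old body, then new.xml's lines, then </BROADCATCH>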

Print an empty web log

Sometimes there is no previous global xml log file and we have to create a new one.

  • $1 - the root tag
  • $2 - log file
  • $3 - xsl stylesheet (optional)
emptyWebLog()
{
  XSLSHEET="${3}"
  ROOTAG=`echo "${1}" | tr '[a-z]' '[A-Z]'`
  echo '<?xml version="1.0" encoding="UTF-8"?>' > "${2}"
  if [ -n "${XSLSHEET}" ]
  then
    echo "<?xml-stylesheet type=\"text/xsl\" href=\"${XSLSHEET}.xsl\"?>" > "${2}"
  fi
  echo "<${ROOTAG}>" >> "$2"
  echo "</${ROOTAG}>" >> "$2"
}
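
Called like this (paths hypothetical), it produces a minimal styled log skeleton:

emptyWebLog "broadcatch" "/tmp/broadcatch.xml" "broadcatch"
# /tmp/broadcatch.xml now contains:
#   <?xml version="1.0" encoding="UTF-8"?>
#   <?xml-stylesheet type="text/xsl" href="broadcatch.xsl"?>
#   <BROADCATCH>
#   </BROADCATCH>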

Sort and uniq the given file

Another helper function. Used to keep the log of already downloaded torrents and the list of discovered torrents in shape.

  • $1 - file to process
  • $2 - temporary file
sortAndUniqFile()
{
  sort "$1" | uniq > "$2"
  rm -f "$1"
  mv "$2" "$1"
}

Download a file from web

Probably the only essential function here. It is used to retrieve the HTML pages, RSS feeds and the torrents which should be downloaded. There are quite a number of parameters, but as an end user you don't have to worry about them. You can set some of them, like the timeout and the retry counter, but you don't have to; defaults are already in place.

  • $1 - torrent file URL to download
  • $2 - where to save the torrent file
  • $3 - timeout value for http connection
  • $4 - number of retries
  • $5 - what to use, wget/curl (curl is the default)
  • $6 - file for cookies

The returned value is the exit status of the wget/curl process, to diagnose failures if any.

getFileFromWeb()
{
  # if webtool specified use that, fallback option is curl
  case "$5" in
    wget)
      # -q quiet
      # -O <filename>
      # -T connect-timeout
      # -t retry
      wget -q "$1" -O "$2" -T "$3" -t "$4" --save-cookies "$6" --load-cookies "$6"
      ;;
    *)
      # -q ignore .curlrc
      # -s silence (quiet)
      # -g no globbing (don't treat {} and [] specially)
      # -o <filename> write output to <filename>
      # -b load cookies from this file
      # -c cookie jar save cookies after session here
      curl -q -s -g -o "$2" -c "$6" -b "$6" --retry "$4" --connect-timeout "$3" "$1"
      ;;
  esac
  return $?
}
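
For example, fetching a feed with curl, a 120 second timeout and 6 retries (URL and paths are placeholders):

if getFileFromWeb "http://example.org/feed.rss" "/tmp/feed.rss" 120 6 curl "/tmp/cookies.txt"
then
  echo "feed downloaded"
fi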

Broadcatch itself

This script handles the downloading of pages for information scraping, be it RSS, HTML, whatever. The scraping itself is then done by various plugins. The torrent downloading and the check against the already downloaded torrents also happen here. The successfully retrieved torrent files are moved to the torrent directory of the torrent downloader itself. The notification mail is sent out here too.

Remove temporary files

removeTmpFiles()
{
  # this must be cleaned up explicitly
  rm -f "${TMPMAIL}"
  rm -f "${TMPVAR}"
  rm -f "${PID}"
  # these deletions can be expressed with a single glob
  rm -f ${TMPDIR}/${MYNAME}_${MYPID}*
}

Clean-up

Does a basic cleanup after the broadcatch script. The xml log is properly closed. Temporary files are removed.

cleanUp()
{
  printXmlSingleLogEntry "Interrupted!" "${FAILED}"
  printXmlLogText "</RUN>"
  removeTmpFiles
  echo "Interrupted!" >&2
  exit 1
}

Set temporary variables

Here you supply the list of variables which should be created. The variables get the prefix TMP.

setTmpVariables()
{
  for VAR in $1
  do
    EXT=`echo "$VAR" | tr '[A-Z]' '[a-z]'`
    eval "TMP${VAR}=\"${TMPDIR}/${MYNAME}_${MYPID}.${EXT}\""
  done
}
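
For example, with MYNAME=broadcatch, MYPID=1234 and TMPDIR=/tmp (hypothetical values), this call creates two variables:

setTmpVariables "CFG XML"
# TMPCFG=/tmp/broadcatch_1234.cfg
# TMPXML=/tmp/broadcatch_1234.xml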

Preparation steps

  • Get the absolute path to script
WORKDIR=`echo "$0" | sed -e 's/\/[^\/]*$//'`
WORKDIR=`cd "${WORKDIR}" 2>/dev/null && pwd || echo "${WORKDIR}"`
cd "${WORKDIR}"
  • Source common functions
. "${WORKDIR}/common/common.sh"
  • derive name
MYNAME=`echo "$0" | awk -F'/' '{ print $NF }' | sed -e 's/\.[a-z]*$//'`
  • get PID number
MYPID=$$
  • set hardcoded filenames/values
CFG="${WORKDIR}/${MYNAME}.conf"
PID="/var/run/${MYNAME}.pid"
FAILED="FAILED"
SUCCESS="SUCCESS"
NOW=`date`
  • check if configuration is present
if [ ! -r "$CFG" ]
then
  echo "The configuration file $CFG not found or not readable!" >&2
  exit 3
fi
  • check if already running (the pid file is used as a lock file too; locking is usually done using directory names, see the note below)
if [ -f "${PID}" ]
then
  echo "Already running!" >&2
  exit 2
fi
echo "${MYPID}" > "${PID}"
  • if interrupted, do cleanup
trap cleanUp 1 2 3 15
  • set "not so much hardcoded" filenames
if [ -z "$TMP" ]
then
  TMPDIR="/tmp"
else
  TMPDIR="${TMP}"
fi
TMPVAR="${TMPDIR}/${MYNAME}_${MYPID}.var"
  • set default files/values which can be changed by conf file
OLD="${WORKDIR}/${MYNAME}.old"
LOG="/volume1/public/debian/chroottarget/var/www/${MYNAME}.xml"
TMPMAIL="/volume1/public/debian/chroottarget/tmp/${MYNAME}.txt"
COOKIES="${WORKDIR}/${MYNAME}.cos"
TORRENTDIR="/volume1/public/debian/chroottarget/root/.mldonkey/torrents/incoming"
TORRENTTIMEOUT=120
TORRENTRETRY=6

Load parameters from the conf file

  • OLD - text file which stores the list of already downloaded torrents
  • LOG - file where the finished XML file will be stored with proper header and <BROADCATCH> tag
  • TORRENTDIR - directory where to move the torrent files
  • TMPDIR - directory to store temporary files (exception is the *.var file)
  • TRACE - if set to on traces will be written, otherwise no traces will be written to LOG file
  • TORRENTTIMEOUT - how long to wait until the download of a torrent file times out
  • TORRENTRETRY - how many retries to do before giving up on the torrent file
  • WEBTOOL - set to wget if you want to use wget, otherwise curl will be used
  • DIFFORMAT - context/normal/unified C/N/U
loadConfigFile "${CFG}" "${TMPVAR}" "COOKIES OLD LOG TORRENTDIR TMPDIR TRACE TORRENTTIMEOUT TORRENTRETRY WEBTOOL DIFFORMAT"
  • set names of temporary files

Variables created are: $TMPCFG, ${TMPTMP}, ${TMPNEW} etc. (see list below)

setTmpVariables "CFG TMP NEW DOW XML DIF"

Traces

Decide where to write traces: this should be equal either to ${TMPXML} or /dev/null. We don't write traces to a separate file; that's why TRC doesn't go into the previous setTmpVariables parameter list.

case "${TRACE}" in
  [oO][nN])
    TMPTRC="${TMPXML}"
    ;;
  *)
    TMPTRC="/dev/null"
    ;;
esac
  • start trace
printXmlLogText "<RUN AT=\"${NOW}\" PID=\"${MYPID}\" NAME=\"${MYNAME}\">"
  • print parameters to trace
printXmlSingleTraceEntry "CFG=${CFG}"
printXmlSingleTraceEntry "OLD=${OLD}"
printXmlSingleTraceEntry "LOG=${LOG}"
printXmlSingleTraceEntry "COOKIES=${COOKIES}"
printXmlSingleTraceEntry "TMPDIR=${TMPDIR}"
printXmlSingleTraceEntry "TMPCFG=${TMPCFG}"
printXmlSingleTraceEntry "TMPTMP=${TMPTMP}"
printXmlSingleTraceEntry "TMPNEW=${TMPNEW}"
printXmlSingleTraceEntry "TMPDOW=${TMPDOW}"
printXmlSingleTraceEntry "TMPXML=${TMPXML}"
printXmlSingleTraceEntry "TMPMAIL=${TMPMAIL}"
printXmlSingleTraceEntry "TMPTRC=${TMPTRC}"
printXmlSingleTraceEntry "TORRENTDIR=${TORRENTDIR}"
printXmlSingleTraceEntry "Start $0 at ${NOW}"
  • print working directory
printXmlSingleTraceEntry "Working directory `pwd`"

Extract the feeds

This sequence greps out only the lines where RSS feeds are specified (and HTML, and whatever else you have a parser for):

grep -v "^#" "$CFG" |
  grep -v "^$" |
  grep -v '^\$' > "$TMPCFG"
  • dump the feeds into the trace file
printXmlTraceText "  <ENTRY TYPE=\"TRACE\" FROM=\"$MYNAME\">Preprocessed configuration ${CFG}"
xmlify "$TMPCFG" >> "${TMPTRC}"
printXmlTraceText "  </ENTRY>"

Download and Process the defined feeds

The next code downloads all the defined feeds to local files. These files are then handed over to the specified plugins to extract the torrent file links. The extracted links are stored in a temporary file which is processed further on. I have also added evaluation of the fourth and fifth parameters; this was required to pass additional parameters to the RSS parsing script.

printXmlSingleTraceEntry "Remove temporary file ${TMPNEW}"
rm -f "${TMPNEW}"
touch "${TMPNEW}"
printXmlSingleTraceEntry "Reading configuration ${TMPCFG}"
while read COMMANDLINE
do
  printXmlSingleTraceEntry "Commandline: ${COMMANDLINE}"
  # parse command line
  # quote the line so glob characters in patterns don't get expanded
  PARSETYPE=`echo "${COMMANDLINE}" | awk -F'|' '{print $1}'`
  PARSEURL=`echo "${COMMANDLINE}" | awk -F'|' '{print $2}'`
  PARSEPATTERN=`echo "${COMMANDLINE}" | awk -F'|' '{print $3}'`
  P3=`echo "${COMMANDLINE}" | awk -F'|' '{print $4}'`
  P4=`echo "${COMMANDLINE}" | awk -F'|' '{print $5}'`
  if getFileFromWeb "${PARSEURL}" "${TMPDIR}/${MYNAME}_${MYPID}.${PARSETYPE}" ${TORRENTTIMEOUT} ${TORRENTRETRY} ${WEBTOOL} ${COOKIES}
  then
    # call plugin
    "${WORKDIR}/plugins/parse_${PARSETYPE}.sh" "${TMPDIR}/${MYNAME}_${MYPID}" "${PARSEPATTERN}" "$P3" "$P4" >> "${TMPNEW}"
    printXmlTraceText "  <ENTRY TYPE=\"TRACE\" FROM=\"$MYNAME\">Result from parse_${PARSETYPE}.sh"
    xmlify "${TMPNEW}" >> "${TMPTRC}"
    printXmlTraceText "  </ENTRY>"
  else
    printXmlSingleLogEntry "Failed downloading ${PARSEURL}" "${FAILED}"
  fi
done < "$TMPCFG"

Find out what to download

This code compares the already downloaded torrents with the list of newly acquired ones. The resulting list contains the torrents which haven't been seen yet.

sortAndUniqFile "${TMPNEW}" "${TMPTMP}"
diff "${OLD}" "${TMPNEW}" > "${TMPDIF}"
case "${DIFFORMAT}" in
  [uU])
    # sort and uniq the torrents found, uses the unified output format
    sortAndUniqFile "${TMPDIF}" "${TMPTMP}"
    grep '^+' "${TMPDIF}" | grep -v '^+++' | sed -e 's/^\+//' > "${TMPDOW}"
    ;;
  *)
    # use this when the diff supports standard output
    grep '^>' "${TMPDIF}" | sed -e 's/^> //' > "${TMPDOW}"
    ;;
esac
printXmlTraceText "  <ENTRY TYPE=\"TRACE\" FROM=\"$MYNAME\">Files to download"
xmlify "${TMPDOW}" >> "${TMPTRC}"
printXmlTraceText "  </ENTRY>"

Prepare the mail notification

OK, this is hardcoded. Maybe in the future I'll make it configurable.

echo "From: broadcatch@script.run" > "$TMPMAIL"
echo "To: whoever@wherever.is" >> "$TMPMAIL"
echo "Subject: broadcatch report" >> "$TMPMAIL"
echo "" >> "$TMPMAIL"
echo "These torrents will be downloaded:" >> "$TMPMAIL"

Download and send files to mldonkey

Or any other torrent client for that matter. The only requirement is that there is a directory which is regularly monitored by the torrent software for torrent files. The new torrent files are downloaded by the script and stored in the mlnet torrent directory. Also the database of already downloaded torrents is extended here.

NEWITEMS=0
rm -f "${TMPTMP}"
touch "${TMPTMP}"
if [ `wc -l "${TMPDOW}" | awk '{print $1}'` -gt 0 ]
then
  COUNT=0
  while read TORRENT
  do
    # derive the cookie site (scheme://host/) from the torrent URL
    COOKIESITE=`echo "${TORRENT}" | awk -F'/' '{ print $1"//"$3"/" }'`
    if [ `grep -c "${COOKIESITE}" "${TMPTMP}"` -eq 0 ]
    then
      # go to main site and get cookies
      printXmlSingleTraceEntry "Reading cookies from ${COOKIESITE}"
      getFileFromWeb "$COOKIESITE}" "/dev/null" ${TORRENTTIMEOUT} ${TORRENTRETRY} ${WEBTOOL} ${COOKIES}
      echo "${COOKIESITE}" >> "${TMPTMP}"
    fi
    STATUS="${FAILED}"
    if getFileFromWeb "${TORRENT}" "${TMPDIR}/${MYNAME}_${MYPID}_${COUNT}.torrent" ${TORRENTTIMEOUT} ${TORRENTRETRY} ${WEBTOOL} ${COOKIES}
    then
      mv "${TMPDIR}/${MYNAME}_${MYPID}_${COUNT}.torrent" "${TORRENTDIR}"
      STATUS="${SUCCESS}"
      NEWITEMS=1
      # add the torrent to the old ones
      echo "${TORRENT}" >> "${OLD}"
      echo "${TORRENT}" >> "${TMPMAIL}"
    fi
    printXmlSingleLogEntry "${TORRENT}" "${STATUS}"
    COUNT=`expr $COUNT + 1`
  done < "${TMPDOW}"
else
  printXmlSingleLogEntry "No torrents to download !" "${SUCCESS}"
fi
printXmlSingleTraceEntry "End $0 at ${NOW}"
printXmlLogText "</RUN>"

Finalization

In case new torrents have been downloaded, try to start mldonkey, send the notification mail and clean up the database of downloaded torrents.

if [ "${NEWITEMS}" -gt 0 ]
then
  # there are torrents to download try to start mlnet
  "/opt/etc/init.d/S80mlnet start"
  # try to send email notification
  echo "" >> "$TMPMAIL"
  #chroot /volume1/public/debian/chroottarget /bin/bash -c "/usr/local/sbin/ssmtp -t < /tmp/${MYNAME}.txt"
  chroot /volume1/public/debian/chroottarget /bin/bash -c "/usr/local/bin/msmtp -t < /tmp/${MYNAME}.txt"
  sortAndUniqFile "${OLD}" "${TMPTMP}"
fi
  • print the XML file into the web directory; if the file doesn't exist or is empty, write a new empty log with XML header and root tag.
if [ ! -f "${LOG}" -o ! -s "${LOG}" ]
then
  emptyWebLog "${MYNAME}" "${LOG}" "${MYNAME}"
fi
  • add the log/trace to the web log file
if [ "${TMPXML}" != "/dev/null" ]
then
  addToWebLog "${LOG}" "${TMPXML}" "${TMPTMP}"
fi
  • remove temporary files and exit
removeTmpFiles
exit 0

RSS Parsing

I now have two basic "plugin" scripts used to parse RSS: the easy one and the more flexible one. The scripts should be able to parse almost all RSS feeds. Actually, the scripts are also able to parse HTML and whatever else there is that can be processed by regexps.

Easy parsing

The code below was developed to extract the link and enclosure tags in the first place. It turned out I can also handle HTML pages with torrent links this way. Quite convenient. You specify the start tag (usually <link> or <enclosure url=" ) and the end tag ( < resp. " ) and you are all set. See the example configuration files for more details. Use this one when you have a specific RSS feed and you just need to extract the links, or for scraping the HTML pages of trackers.

#!/bin/sh
# $1 - temporary path + file name without extension
# $2 - pattern to grep for
# $3 - start tag (spaces are encoded with %20)
# $4 - end tag (spaces are encoded with %20)

# set internal variables
PATH="/opt/bin"
EXT=$(echo $0|sed -e 's/^.*parse_//' -e 's/\.sh$//')
TMPFILE="${1}.${EXT}"
TMPEXT="${1}.extract"
TMPTRC="${1}.xml"
GREPURL="${2}"
TAGSTART=$(echo $3|sed -e 's/%20/ /g')
TAGEND=$(echo $4|sed -e 's/%20/ /g')

. ./common/common.sh

# extract the data
cat "${TMPFILE}" | tr -d '\n' | sed -e 's/>\s*</></g' -e 's/@/\&#64;/g' -e "s/${TAGSTART}/@/g" | tr '@' '\n' | sed -e '1,1d' -e "s/${TAGEND}.*$//" -e 's/\&#64;/@/g' > "$TMPEXT"
printXmlTraceText "  <ENTRY TYPE=\"TRACE\" FROM=\"$0\">Extracted links from feed"
xmlify "$TMPEXT" >> "${TMPTRC}"
printXmlTraceText "  </ENTRY>"
# finally get what you want
if [ -n "$GREPURL" ]; then
  grep -i "$GREPURL" "$TMPEXT" | sort | uniq | sed -e 's|http://www.mininova.org/tor/|http://www.mininova.org/get/|'
else
  sort "$TMPEXT" | uniq | sed -e 's|http://www.mininova.org/tor/|http://www.mininova.org/get/|'
fi

# remove rubbish
rm -f "$TMPEXT" "$TMPFILE"

As you can see in the code above, I have not avoided the mininova syndrome. Those darn people tend to use tor to indicate the page where the torrent is described, but the torrent itself is stored under the link where tor is replaced by get. It is not really a problem; it just makes my script look uglier than it really is.
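
To illustrate the splitting trick (join all lines, protect literal @ characters, turn each start tag into @, split on @, cut everything from the end tag onwards) on a hypothetical two-item feed:

echo '<item><link>http://example.org/a.torrent</link></item>
<item><link>http://example.org/b.torrent</link></item>' |
  tr -d '\n' | sed -e 's/>\s*</></g' -e 's/@/\&#64;/g' -e 's/<link>/@/g' |
  tr '@' '\n' | sed -e '1,1d' -e 's/<.*$//' -e 's/\&#64;/@/g'
# prints:
# http://example.org/a.torrent
# http://example.org/b.torrent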

Eztv parsing

Well, the name is a bit deceptive. Yes, it was inspired by eztv because I tested the script on their feeds, but its use is not that specific. Basically it extracts all items from the RSS feed, and then the filtering is done. That means you could use a filter for a show on eztv in the form Show name: Heffalumpa;.*Season: 4; to get all episodes of Heffalumpa from season 4. You could also grep for quality in the title, maybe <title>.*PDTV or whatever. Use this script when you have an RSS feed with many different torrents and you want to filter out only some of them.

#!/bin/sh
# $1 - temporary path + file name without extension
# $2 - pattern to grep for
# $3 - start tag (spaces are encoded with %20)
# $4 - end tag (spaces are encoded with %20)

# set internal variables
PATH="/opt/bin"
EXT=$(echo $0|sed -e 's/^.*parse_//' -e 's/\.sh$//')
TMPFILE="${1}.${EXT}"
TMPEXT="${1}.extract"
TMPTRC="${1}.xml"
GREPURL="${2}"
TAGSTART=$(echo $3|sed -e 's/%20/ /g')
TAGEND=$(echo $4|sed -e 's/%20/ /g')

. ./common/common.sh

# extract the data
cat "${TMPFILE}" | tr -d '\n' | sed -e 's/>\s*</></g' -e 's/@/\&#64;/g' -e "s/<item>/@/g" | tr '@' '\n' | sed -e '1,1d' -e "s/<\/item>.*$//" -e 's/\&#64;/@/g' > "$TMPEXT"
printXmlTraceText "  <ENTRY TYPE=\"TRACE\" FROM=\"$0\">Extracted links from feed"
xmlify "$TMPEXT" >> "${TMPTRC}"
printXmlTraceText "  </ENTRY>"
# finally get what you want
if [ -n "$GREPURL" ]; then
  grep -i "$GREPURL" "$TMPEXT" | sort | uniq > "$TMPFILE"
else
  sort "$TMPEXT" | uniq > "$TMPFILE"
fi
sed -e "s/^.*${TAGSTART}//g" -e "s/${TAGEND}.*$//" "${TMPFILE}"

# remove rubbish
rm -f "$TMPEXT" "$TMPFILE"

ToDo List

  • Add some form of caching for the downloaded RSS feeds, in case the same feed is used more than once in the same run.
  • Configure the mail agent properties in the config file instead of hardcoding them.