Search a website for a particular text string

September 04, 2019 - posted by Josh Smith

Sometimes you need to find a word or text string on a website. You could open every page and scan it manually, but on most sites that simply isn’t a good use of your time.

Recently I was asked to find a specific class within a site’s markup. Sometimes the class was added in templates and other times it was loaded in dynamically. I needed to find the pages on which this class, basically a text string, appeared. I had to automate.

I found some scripts and some expensive tools online that could do it, but I thought I’d make my own to do exactly what I needed.

I’ve left the script up for anyone to use and adapt as needed. Find the findstrings.sh script on my GitHub.

To use it, copy or download the script to your own computer. I like to put these little scripts in a directory called scripts in my home directory (~/scripts). I have a few of these types of tools there.

When you want to use the script, open up a Terminal window and type cd. That command will take you to your home directory. Then type cd scripts, followed by bash findstrings.sh. You will be prompted for the website’s URL and the string for which you want to search. Then note the output and fix whatever needs fixing on the pages it lists.
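A session might look something like this (the URL and the search string here are just placeholders; substitute the site and string you actually care about):

```bash
cd ~/scripts
bash findstrings.sh
# Let's search a website for a specific string of text
# Enter the full URL: https://example.com
# Cool, I have the URL of: https://example.com
# Now input the string you want to search for: my-old-class
```

The script then crawls the site, writes the page URLs it finds to log.txt, and reports each page where the string appears.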

findstrings.sh

```bash
#!/bin/bash
# -----------  SET COLORS  -----------
COLOR_RED=$'\e[31m'
COLOR_CYAN=$'\e[36m'
COLOR_YELLOW=$'\e[33m'
COLOR_GREEN=$'\e[32m'
COLOR_RESET=$'\e[0m'
COLOR_HIGHLIGHT_FOUND=$'\e[0;30;42m'
#DOMAIN=https://www.efficiencyofmovement.com


echo "Let's search a website for a specfici string of text"
read -rp "Enter the full URL: " DOMAIN
echo "Cool, I have the URL of: $DOMAIN"
read -rp "Now input the string you want to search for: " STRING


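# spin: print a rotating spinner character, backspacing over it (\010)
# after each frame, until the spinner process is killed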
spin()
{
  spinner="/|\\-/|\\-"
  while :
  do
    for i in $(seq 0 7)
    do
      echo -n "${spinner:$i:1}"
      echo -en "\010"
      sleep 1
    done
  done
}

spin &
SPIN_PID=$!
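# clean up the spinner when the script exits or is interrupted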
trap 'kill "$SPIN_PID" 2>/dev/null' EXIT
## crawl the site and collect all working (HTTP 200) page URLs
wget --spider --force-html -r "$DOMAIN" 2>&1 |
  grep '^--' | awk '{ print $3 }' |
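  # drop static assets and other non-HTML resources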
  grep -E -v '\.(css|js|json|map|xml|png|gif|jpg|jpeg|JPG|bmp|txt|pdf|webmanifest)(\?.*)?$' |
  grep -E -v '\?(p|replytocom)=' |
  grep -E -v '\/wp-content\/uploads\/' |
  grep -E -v '\/feed\/' |
  grep -E -v '\/category\/' |
  grep -E -v '\/tag\/' |
  grep -E -v '\/page\/' |
  grep -E -v '\/widgets.php$' |
  grep -E -v '\/wp-json\/' |
  grep -E -v '\/xmlrpc' |
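  # keep only directory-style URLs ending in a slash (likely HTML pages)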
  grep -E '[/]$' |
  sort -u \
    > log.txt

#input="log.txt"
while IFS= read -r line; do

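  # fetch each page and grep it case-insensitively, with one line of context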
  if wget -q "$line" -O - | grep --color=always -ni -C 1 "$STRING"; then
    echo "${COLOR_GREEN}Found match(es) on: ${COLOR_HIGHLIGHT_FOUND}${line}${COLOR_RESET}"
  else
    echo "${COLOR_CYAN}Nothing found on: ${line}${COLOR_RESET}"
  fi
done < log.txt
kill "$SPIN_PID" 2>/dev/null

```
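One design note: the crawl results are saved to log.txt before the search loop runs, so if you want to search the same site for a different string, you don’t have to re-crawl it. Here is a minimal sketch of that reuse, assuming log.txt from a previous run is still in the same directory:

```bash
#!/bin/bash
# Re-search previously crawled pages (log.txt) for a new string,
# skipping the wget --spider crawl entirely.
read -rp "Now input the string you want to search for: " STRING

while IFS= read -r line; do
  # -q: we only need grep's exit status, not the matching lines
  if wget -q "$line" -O - | grep -qi "$STRING"; then
    echo "Found match(es) on: $line"
  fi
done < log.txt
```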
