To perform the first crawl, you just need to edit the configuration file, which resides
in the configuration directory under the name 'htcheck.conf' (you may use another
file as the configuration file, but then you must run htcheck with the '-c' option).
Just change the 'start_url' attribute to whatever you want, for example:
start_url: http://www.foo.com
Remember that every URL must start with the service name, that is to say 'http://'.
Then set the 'limit_urls_to' attribute to $(start_url), in order to
scan only the 'http://www.foo.com' website.
You may change many other attributes (the database name included), but for a first test of whether it works, that is enough.
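The two settings above can be sketched as a small shell snippet that writes a test configuration file. The file location is an assumption made for illustration (it is not the default configuration directory), and the file contains only the two attributes discussed here.

```shell
# Hypothetical location for a test configuration file; adjust as needed.
CONF="${TMPDIR:-/tmp}/htcheck-test.conf"

# The quoted 'EOF' delimiter stops the shell from expanding $(start_url),
# so the reference is written to the file exactly as shown in the text.
cat > "$CONF" <<'EOF'
start_url: http://www.foo.com
limit_urls_to: $(start_url)
EOF

# Show what was written.
cat "$CONF"
```

A file written this way is not the default 'htcheck.conf', so it would have to be passed to htcheck with the '-c' option.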
You can finally enter the bin directory inside the 'htcheck' installation directory (by
default /opt/htcheck) and run:
htcheck -vs
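If you keep the configuration in a non-default file, the same run can be sketched with the '-c' option. The binary path below assumes the default installation directory mentioned above, and the configuration path is a hypothetical example; the guard simply avoids an error when htcheck is not installed there.

```shell
# Assumed default location of the binary; override HTCHECK_BIN if yours differs.
HTCHECK_BIN="${HTCHECK_BIN:-/opt/htcheck/bin/htcheck}"

# Hypothetical alternate configuration file.
CONF="${CONF:-$HOME/htcheck-test.conf}"

if [ -x "$HTCHECK_BIN" ]; then
    # Verbose crawl with statistics, reading the alternate configuration file.
    "$HTCHECK_BIN" -vs -c "$CONF"
else
    echo "htcheck not found at $HTCHECK_BIN" >&2
fi
```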
Here are the available options (just run htcheck --help and you will get this):
usage: htcheck [-isvkhr] [-c configfile] [-D dbname] [--help] [--version]
Options:
-v              Verbose mode (more 'v's increment verbosity)
-s              Statistics (broken links, etc...) available
-i              Initialize the database (drop a previous db)
-k              Initialize the database (drop tables, keep the db)
-c configfile   Configuration file
-D dbname       Name of the database
--help          Display this
-h              Same as --help
--version       Display version
-r              Same as --version
Remember that htcheck always checks whether the database already exists in the MySQL server. If it
does not exist, it is created from scratch. If htcheck is launched with
the '-i' option, the database is initialized again (this means that a new crawl is performed); otherwise
the program just uses the previous database, which is useful for getting reports such as
broken links and anchors or content-type summaries (in this case you must set the '-s' option).
Since version 1.2.0 it is possible to keep the database instead of dropping it, and recreate only its structure: in technical terms, the ht://Check tables are dropped and then recreated (the '-k' option in the list above). This feature was proposed by Patrick Guillot (<pguillot@paanjaru.com>) and makes it possible to use ht://Check within a database that is also used for other purposes.
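As a rough summary of the three ways of running htcheck described above, here is a minimal shell sketch. The mode names (fresh, tables, report) and the helper function are hypothetical, introduced only for illustration; the flags are the ones from the usage message. The snippet only prints the command it would run.

```shell
# Map a hypothetical mode name to the htcheck flags described in the text.
htcheck_flags() {
    case "$1" in
        fresh)  echo "-i -vs" ;;  # drop the previous database and crawl again
        tables) echo "-k -vs" ;;  # drop and recreate only the tables (1.2.0 and later)
        report) echo "-s" ;;      # reuse the existing database, statistics only
        *)      echo "unknown mode: $1" >&2; return 1 ;;
    esac
}

# Print the command line for each mode instead of executing it.
for mode in fresh tables report; do
    echo "$mode: htcheck $(htcheck_flags "$mode")"
done
```

Printing rather than executing keeps the sketch safe to try on a machine where a real database could otherwise be dropped by '-i' or '-k'.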