This repository was archived by the owner on Apr 10, 2025. It is now read-only.
Merged
33 changes: 33 additions & 0 deletions devel/loadtest_collect/loadtest_collect.conf
@@ -0,0 +1,33 @@
ModPagespeedFileCachePath "#HOME/apache2/pagespeed_cache/"
AddOutputFilterByType MOD_PAGESPEED_OUTPUT_FILTER text/html

<VirtualHost localhost:8080>
  ModPagespeed on

  <Location /pagespeed_admin>
    Order allow,deny
    Allow from localhost
    Allow from 127.0.0.1
    SetHandler pagespeed_admin
  </Location>
  <Location /pagespeed_global_admin>
    Order allow,deny
    Allow from localhost
    Allow from 127.0.0.1
    SetHandler pagespeed_global_admin
  </Location>

  KeepAlive On
  KeepAliveTimeout 60

  <Directory "#HOME/apache2/htdocs/">
    # This is enabled to make sure we don't crash mod_negotiation.
    Options +MultiViews
  </Directory>

  ModPagespeedRewriteLevel AllFilters
  ModPagespeedSlurpDirectory #SLURP_DIR
  ModPagespeedSlurpReadOnly off
  ModPagespeedRewriteDeadlinePerFlushMs -1
  CustomLog "#LOG_PATH" %r
</VirtualHost>
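
A note on the CustomLog line above: with the bare %r format each log entry is just the HTTP request line, and because the corpus script below drives phantomjs through this vhost as a proxy, an entry looks roughly like the following (hypothetical URL):

GET http://www.example.com/index.html HTTP/1.1

This is what lets the script recover the full URL list afterwards with grep ^GET | cut -d ' ' -f 2.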
67 changes: 67 additions & 0 deletions devel/loadtest_collect/loadtest_collect_corpus.sh
@@ -0,0 +1,67 @@
#!/bin/bash
#
# Copyright 2017 Google Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
# This script collects slurps and URLs (post-optimization, if possible) of
# some websites with the help of phantomjs.

function usage {
  echo "Usage: loadtest_collect/loadtest_collect_corpus.sh pages.txt out.tar.bz2"
Contributor: add comment that pages.txt is a file of URLs, one per line?

Contributor Author: Done

echo "Where pages.txt has a URL (including http://) per line"
}
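
For concreteness, the pages.txt argument is nothing more than a plain list of URLs, for example (hypothetical entries):

http://www.example.com/
http://news.example.org/index.html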

set -u # exit the script if any variable is uninitialized
Contributor: WDYT about an option to scrape the alexa 500? Something like this is a start, but we need to do a bit more sedding to strip out the site name properly:

wget -q -O - www.alexa.com/topsites/global:0 | grep 'href="/siteinfo/'

The '0' in global:0 can be paged up to 19, giving you the alexa 500.

Contributor Author: Seems brittle...

Contributor: Agreed, and not required for this CL. I still think it'd be useful as a follow-up even if a "this is brittle" comment was put on it.

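As a rough sketch of the follow-up idea discussed above (not part of this change, and admittedly brittle), the paging and cleanup might look something like this; the topsites URL and the global:0..19 paging come from the comment, while the exact grep/sed expressions are assumptions:

for page in $(seq 0 19); do
  # Hypothetical Alexa-500 scrape; the page markup can change at any time.
  wget -q -O - "www.alexa.com/topsites/global:$page"
done \
  | grep -o 'href="/siteinfo/[^"]*"' \
  | sed -e 's|^href="/siteinfo/||' -e 's|"$||' -e 's|^|http://|' \
  > pages.txt
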
set -e

if [ $# -ne 2 ]; then
  usage
  exit 1
fi

if [ -d devel ]; then
  cd devel
fi
if [ ! -d loadtest_collect ]; then
  echo "Run this script from the top or devel/ directories"
  exit 1
fi

if [ ! $(which phantomjs) ]; then
  echo "phantomjs not found, trying to install it with apt-get"
  sudo apt-get install phantomjs
Contributor: About echoing what you are doing first: if you run this from CentOS I think it'll just fail saying "apt-get command not found", and the user won't know it's just trying to install phantom without debugging the script. With the echo at least the CentOS user could just install it manually.

Contributor Author: It's not quite that bad since you would get something like "sudo: apt-getttt: command not found"...
... but there is still a problem in that it just has sudo asking for a password for no good reason, so I've added an informational message. (Or rather, will have added once I push this, not at the time of this comment.)

(I am not a fan of the echo foo \n foo pattern since it duplicates the 'foo'.)

fi

SLURP_TOP_DIR=$(mktemp -d)
SLURP_DIR=$SLURP_TOP_DIR/slurp
mkdir $SLURP_DIR
LOG_PATH=$SLURP_TOP_DIR/log.txt
URLS_PATH=$SLURP_TOP_DIR/corpus_all_urls.txt

make clean_slate_for_tests
make apache_debug_stop

sed -e "s^#HOME^$HOME^" -e "s^#SLURP_DIR^$SLURP_DIR^" \
    -e "s^#LOG_PATH^$LOG_PATH^" \
  < loadtest_collect/loadtest_collect.conf > ~/apache2/conf/pagespeed.conf
make -j8 apache_debug_restart

for site in $(cat $1); do
  echo $site
  phantomjs --proxy=127.0.0.1:8080 loadtest_collect/script.js $site
done
cat $LOG_PATH | grep ^GET | cut -d ' ' -f 2 > $URLS_PATH
cd $SLURP_TOP_DIR
tar cvjf $2 .

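For reference, a hypothetical invocation from the top of a checkout would be:

./devel/loadtest_collect/loadtest_collect_corpus.sh /tmp/pages.txt /tmp/corpus.tar.bz2

Absolute paths are the safer choice here: the script cd's into devel/ before reading the URL list and into its temporary slurp directory before running tar, so relative arguments can end up resolved against the wrong directory. The resulting tarball contains slurp/ (the slurped resources), log.txt (the %r access log), and corpus_all_urls.txt (the GET URLs extracted from that log).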
12 changes: 12 additions & 0 deletions devel/loadtest_collect/script.js
@@ -0,0 +1,12 @@
// This is basically just the stripped down http://phantomjs.org/quick-start.html
// example.
var page = require('webpage').create();
var system = require('system');
if (system.args.length === 1) {
  console.log('Usage: script.js <some URL>');
  phantom.exit();
}

page.open(system.args[1], function(status) {
  phantom.exit();
});