A crawler for the Gemini network, easily extendable into a "wayback machine" for Gemini.
- Concurrent downloading with a configurable number of workers (see the sketch after this list)
- Save image/* and text/* files
- Connection limit per host
- URL Blacklist
- URL Whitelist (overrides blacklist and robots.txt)
- Follow robots.txt, see gemini://geminiprotocol.net/docs/companion/robots.gmi
- Configuration via command-line flags
- Storing capsule snapshots in PostgreSQL
- UTF-8 and format validation of response headers and bodies
- Proper URL normalization
- Handle redirects (3X status codes)
- Crawl Gopher holes
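As an illustration of the worker model mentioned above, here is a minimal worker-pool sketch in Go. It is not the crawler's actual code; the fetch function and the seed URLs are placeholders for illustration.

```go
package main

import (
	"fmt"
	"sync"
)

// fetch is a stand-in for the real download logic (Gemini/Gopher request,
// response validation, storage). It is hypothetical.
func fetch(url string) {
	fmt.Println("crawling", url)
}

func main() {
	workers := 4 // corresponds to the -workers flag
	queue := make(chan string)

	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for url := range queue {
				fetch(url)
			}
		}()
	}

	// Seed the queue; in the real crawler URLs come from the database
	// and from links discovered in crawled pages.
	for _, u := range []string{"gemini://example.com/", "gemini://example.org/"} {
		queue <- u
	}
	close(queue)
	wg.Wait()
}
```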
This crawler sets InsecureSkipVerify: true in its TLS configuration to accept all certificates. This is a common approach for crawlers, but it makes the application vulnerable to MITM attacks. The trade-off is made to support the self-signed certificates widely used in the Gemini ecosystem.
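For reference, the TLS setup described above looks roughly like this in Go. This is a minimal sketch rather than the crawler's exact code; the host and request are just examples.

```go
package main

import (
	"crypto/tls"
	"fmt"
)

func main() {
	// Accept any certificate, including the self-signed ones common on Gemini.
	// Verification is skipped entirely, so MITM attacks are possible.
	conf := &tls.Config{
		InsecureSkipVerify: true,
		MinVersion:         tls.VersionTLS12,
	}

	conn, err := tls.Dial("tcp", "geminiprotocol.net:1965", conf)
	if err != nil {
		fmt.Println("dial failed:", err)
		return
	}
	defer conn.Close()

	// A Gemini request is the URL followed by CRLF; the response would be
	// read from conn afterwards.
	fmt.Fprint(conn, "gemini://geminiprotocol.net/\r\n")
}
```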
make build
./dist/crawler --help
Check misc/sql/initdb.sql to create the PostgreSQL tables.
Available command-line flags:
-blacklist-path string
File that has blacklist regexes
-dry-run
Dry run mode
-gopher
Enable crawling of Gopher holes
-log-level string
Logging level (debug, info, warn, error) (default "info")
-max-db-connections int
Maximum number of database connections (default 100)
-max-response-size int
Maximum size of response in bytes (default 1048576)
-pgurl string
Postgres URL
-response-timeout int
Timeout for network responses in seconds (default 10)
-seed-url-path string
File with seed URLs that should be added to the queue immediately
-skip-if-updated-days int
Skip re-crawling URLs updated within this many days (0 to disable) (default 60)
-whitelist-path string
File with URLs that should always be crawled regardless of blacklist
-workers int
Number of concurrent workers (default 1)
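As a rough sketch of how a file of blacklist regexes (the -blacklist-path flag) might be loaded and matched against URLs: the one-regex-per-line format and the function names below are assumptions for illustration, not the crawler's actual implementation.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"regexp"
)

// loadBlacklist reads one regex per line and compiles it.
// The one-regex-per-line format is an assumption in this sketch.
func loadBlacklist(path string) ([]*regexp.Regexp, error) {
	f, err := os.Open(path)
	if err != nil {
		return nil, err
	}
	defer f.Close()

	var patterns []*regexp.Regexp
	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		line := scanner.Text()
		if line == "" {
			continue
		}
		re, err := regexp.Compile(line)
		if err != nil {
			return nil, err
		}
		patterns = append(patterns, re)
	}
	return patterns, scanner.Err()
}

// blacklisted reports whether any pattern matches the URL.
func blacklisted(url string, patterns []*regexp.Regexp) bool {
	for _, re := range patterns {
		if re.MatchString(url) {
			return true
		}
	}
	return false
}

func main() {
	patterns, err := loadBlacklist("./blacklist.txt")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(blacklisted("gemini://example.com/spam", patterns))
}
```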
Example:
./dist/crawler \
-pgurl="postgres://test:[email protected]:5434/test?sslmode=disable" \
-log-level=info \
-workers=10 \
-blacklist-path="./blacklist.txt" \
-whitelist-path="./whitelist.txt" \
-max-response-size=10485760 \
-response-timeout=10 \
-max-db-connections=100 \
-skip-if-updated-days=7 \
-gopher \
-seed-url-path="./seed_urls.txt"
Install linters. Check the versions first.
go install mvdan.cc/[email protected]
go install github.com/golangci/golangci-lint/cmd/[email protected]
The crawler now supports versioned snapshots, storing multiple snapshots of the same URL. This lets you view how content changes over time, similar to the Internet Archive's Wayback Machine.
You can access the snapshot history using the included snapshot_history.sh script:
# Get the latest snapshot
./snapshot_history.sh -u gemini://example.com/
# Get a snapshot from a specific point in time
./snapshot_history.sh -u gemini://example.com/ -t 2023-05-01T12:00:00Z
# Get all snapshots for a URL
./snapshot_history.sh -u gemini://example.com/ -a
# Get snapshots in a date range
./snapshot_history.sh -u gemini://example.com/ -r 2023-01-01T00:00:00Z 2023-12-31T23:59:59Z
- Add snapshot history
- Add a web interface
- Provide a client TLS certificate to servers that require it, like Astrobotany
- Use pledge/unveil on OpenBSD hosts
- More protocols? http://dbohdan.sdf.org/smolnet/
Good starting points:
gemini://warmedal.se/~antenna/
gemini://tlgs.one/
gopher://i-logout.cz:70/1/bongusta/
gopher://gopher.quux.org:70/