Use the create-your-internet package to conveniently scaffold out a new application using this package.
$ yarn create your-internet
Or, if you prefer npm:
$ npm create your-internet
Feel free to file an issue or submit a PR. More will be added to this SDK over time.
- Installation
- Authenticating
- Quickstart
- Codegen under the hood
- Working with the local browser
- Imitating a trip
- How much does this cost?
Important: See the examples/ folder for complete code examples.
Want internet data in your TypeScript application? Just install it with npm.
$ npm i internetdata
Or add it with yarn.
$ yarn add internetdata
When you authenticate to LSD, you are connecting to our Postgres-compatible database, which is why the terms "user" and "password" are used.
The values below refer to your email and API key, which you can obtain from your profile.
When calling drop.tab(), you can provide a ConnectionConfiguration object with two properties, user and password.
const trip = drop.tab({
user: "<[email protected]>",
password: "<api_key>",
});
Alternatively, you can authenticate by writing a JSON object with the properties user and password to an .lsd file in your home directory.
{
"user": "<[email protected]>",
"password": "<api_key>"
}
If you authenticate this way, you can omit the connection configuration when calling drop.tab().
const trip = drop.tab();
A third option is to set the LSD_USER and LSD_PASSWORD environment variables.
$ export LSD_USER='<[email protected]>'
$ export LSD_PASSWORD='<api_key>'
If you authenticate this way, you can omit the connection configuration when calling drop.tab().
const trip = drop.tab();
Note: If you're running into difficulties with this approach, check that the environment variables are set in the terminal or application you're attempting to run from.
$ echo $LSD_USER
$ echo $LSD_PASSWORD
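The three authentication options can be thought of as a single lookup. The sketch below is hypothetical and not part of the SDK: the resolveCredentials helper and its precedence order (explicit object first, then environment variables, then the contents of the .lsd file) are assumptions made for illustration, and the SDK may resolve credentials differently. The ConnectionConfiguration interface is redeclared locally so the snippet stands alone.

```typescript
// Hypothetical helper, not exported by the internetdata SDK.
interface ConnectionConfiguration {
  user: string;
  password: string;
}

type Env = Record<string, string | undefined>;

// Assumed precedence: explicit config, then env vars, then the .lsd file contents.
function resolveCredentials(
  explicit: ConnectionConfiguration | undefined,
  env: Env,
  lsdFileContents: string | undefined,
): ConnectionConfiguration {
  if (explicit) {
    return explicit;
  }

  const user = env["LSD_USER"];
  const password = env["LSD_PASSWORD"];
  if (user && password) {
    return { user, password };
  }

  if (lsdFileContents !== undefined) {
    // The .lsd file is described as JSON with "user" and "password" properties.
    const parsed = JSON.parse(lsdFileContents) as Partial<ConnectionConfiguration>;
    if (typeof parsed.user === "string" && typeof parsed.password === "string") {
      return { user: parsed.user, password: parsed.password };
    }
  }

  throw new Error(
    "No LSD credentials found: pass a configuration, set LSD_USER and LSD_PASSWORD, or create ~/.lsd",
  );
}

// Placeholder values for illustration only.
const fromEnv = resolveCredentials(
  undefined,
  { LSD_USER: "me@example.com", LSD_PASSWORD: "key-from-env" },
  undefined,
);
console.log(fromEnv.user);
```

In a real program you would pass process.env as the second argument and the text of the .lsd file, if it exists, as the third. Failing early with a descriptive error makes a missing variable easier to spot than a rejected connection.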
Shown below are the necessary pieces for getting started after installing. The guide also assumes you've created an API key.
Note: You must use either the API key provided to the local browser after logging in, or one you create on your profile after logging in to the browser. This lets us match your instructions to the correct browser. If you're still running into problems, feel free to schedule a call.
Go here for the full code example.
- Import the default export from internetdata, as well as z from zod.
import drop from "internetdata";
import { z } from "zod";
- Call the tab(connectionConfiguration?: ConnectionConfiguration) method to get a Promise for a trip.
const trip = await drop.tab(); // Promise<Trip>
Note: The code snippet above assumes you've saved the username and API key to the LSD_USER and LSD_PASSWORD environment variables respectively. If you'd like to pass in a connection configuration object, you can do so as shown below:
const trip = await drop.tab({
user: "[email protected]",
password: "<api key>",
}); // Promise<Trip>
- Declare a zod schema describing the shape of the data you want back from the web.
const hnSchema = z.array(
z.object({
post: z.string(),
}),
);
Additionally, you can infer a strong type definition for the objects you're interested in.
type HNType = typeof hnSchema;
Note: If you're running into confusing Zod-related errors, see this related guide on working with generic functions and Zod.
- Now you can effectively pipeline the web data you're looking to retrieve:
const frontPage = await trip
.navigate("https://news.ycombinator.com")
.group("span.titleline")
.select("a@href", "post")
.extrapolate<HNType>(hnSchema);
Breaking this down line by line:
Calling extrapolate at the end returns a promise for the results, which is why the whole chain is awaited.
const frontPage = await trip;
The URL we're interested in retrieving data from is https://news.ycombinator.com.
.navigate('https://news.ycombinator.com')
On the page there is a repeating container for each post that can be matched with the CSS selector span.titleline.
.group('span.titleline')
The href attribute is what we'd understand as being the "post".
.select('a@href', 'post')
Then finally we're going to extrapolate the object or list of objects from the trip.
.extrapolate<HNType>(hnSchema);
- Now you have a strongly typed collection for the front page of Hacker News!
console.log("What are the posts on the front page of HN?");
console.log(frontPage);
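One way to picture the chain in the Hacker News example is as a deferred pipeline: each call records a step, and nothing is resolved until the final call. The toy model below is only a sketch under that assumption. ToyTrip, Step, and describe() are hypothetical names, describe() stands in for extrapolate(), and none of this reflects how the SDK is actually implemented.

```typescript
// Toy model of a deferred pipeline. ToyTrip and Step are hypothetical names,
// not types exported by the internetdata SDK.
type Step =
  | { kind: "navigate"; url: string }
  | { kind: "group"; selector: string }
  | { kind: "select"; selector: string; alias: string };

class ToyTrip {
  private readonly steps: readonly Step[];

  constructor(steps: readonly Step[] = []) {
    this.steps = steps;
  }

  // Each method returns a new ToyTrip, so a partially built chain can be reused.
  navigate(url: string): ToyTrip {
    return new ToyTrip([...this.steps, { kind: "navigate", url }]);
  }

  group(selector: string): ToyTrip {
    return new ToyTrip([...this.steps, { kind: "group", selector }]);
  }

  // When no alias is given, the selector doubles as the field name.
  select(selector: string, alias: string = selector): ToyTrip {
    return new ToyTrip([...this.steps, { kind: "select", selector, alias }]);
  }

  // Stand-in for extrapolate(): only this final call turns the steps into output.
  describe(): string[] {
    return this.steps.map((step) => {
      switch (step.kind) {
        case "navigate":
          return `navigate to ${step.url}`;
        case "group":
          return `group by ${step.selector}`;
        case "select":
          return `select ${step.selector} as ${step.alias}`;
      }
    });
  }
}

const plan = new ToyTrip()
  .navigate("https://news.ycombinator.com")
  .group("span.titleline")
  .select("a@href", "post")
  .describe();

console.log(plan.join("\n"));
```

Note that the alias passed to select ("post") is the same key declared in the schema, which is what lets the final result line up with the declared shape.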
Go here for the full code example.
- As with Hacker News, import the default export from internetdata, as well as z from zod.
import drop from "internetdata";
import { z } from "zod";
- Call the tab(connectionConfiguration?: ConnectionConfiguration) method to get a Promise for a trip.
const trip = await drop.tab(); // Promise<Trip>
Note: The code snippet above assumes you've saved the username and API key to the LSD_USER and LSD_PASSWORD environment variables respectively. If you'd like to pass in a connection configuration object, you can do so as shown below:
const trip = await drop.tab({
user: "[email protected]",
password: "<api key>",
}); // Promise<Trip>
- Declare a zod schema describing the shape of the data you want back from the web.
const docsSchema = z.array(
z.object({
title: z.string(),
}),
);
- Now you can effectively pipeline the web data you're looking to retrieve:
const docsTitle = await trip
.on("TRAVERSER")
.navigate(`https://lsd.so/docs`)
.click('a[href="/docs/database"]')
.select("title")
.extrapolate<typeof docsSchema>(docsSchema);
Curious what the "Traverser" means?
- Now you have a strongly typed collection for the title of the docs page!
console.log("What is the title of the database docs page?");
console.log(docsTitle);
There may be times when you just want to name the thing you're after without working out exactly which selector matches it. For those cases, AI is natively embedded in the language for SELECT statements.
See the Lobsters codegen example for the full code example.
const trip = await drop.tab();
const lobstersSchema = z.array(
z.object({
author: z.string(),
}),
);
const authors = await trip
.on("BROWSER")
.navigate("https://lobste.rs")
.group("ol.stories li")
.select("author")
.extrapolate<typeof lobstersSchema>(lobstersSchema);
console.log("Who are the authors on the front page of Lobsters?");
console.log(authors);
Most of the above code matches what you'll find in the other tutorials within this README, except for the .select() call:
.select("author")
As you may notice, the word "author" is not a valid CSS selector, yet the program still returns the requested data. This works because of two ingredients. One, the page HTML must be accessible (by default, LSD will not retrieve a page's HTML solely to resolve an invalid selector). Two, the page HTML is available at the step of selecting "author" because it was channeled through the local browser.
If you were to request data through a cloud browser and then attempt to codegen a CSS selector like the one above, it would then work thanks to the default 15-minute cache.
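To make the role of a time-limited cache concrete, here is a generic time-to-live cache. It is a sketch of the general technique only: how LSD implements its cache is not described in this README, and TtlCache and the key format used here are hypothetical. The only figure taken from the text is the 15-minute default.

```typescript
// Generic TTL cache sketch; TtlCache is a hypothetical name and says nothing
// about how LSD's own cache is built.
const FIFTEEN_MINUTES_MS = 15 * 60 * 1000;

class TtlCache<V> {
  private readonly entries = new Map<string, { value: V; storedAt: number }>();
  private readonly ttlMs: number;

  constructor(ttlMs: number = FIFTEEN_MINUTES_MS) {
    this.ttlMs = ttlMs;
  }

  // The current time is passed in so the behaviour is easy to test.
  set(key: string, value: V, now: number): void {
    this.entries.set(key, { value, storedAt: now });
  }

  get(key: string, now: number): V | undefined {
    const entry = this.entries.get(key);
    if (entry === undefined) {
      return undefined;
    }
    // Entries at or past the TTL are treated as expired and evicted.
    if (now - entry.storedAt >= this.ttlMs) {
      this.entries.delete(key);
      return undefined;
    }
    return entry.value;
  }
}

const pageCache = new TtlCache<string>();
pageCache.set("https://lobste.rs", "<html>...</html>", 0);

// Ten minutes later the page HTML is still available; twenty minutes later it is not.
console.log(pageCache.get("https://lobste.rs", 10 * 60 * 1000) !== undefined);
console.log(pageCache.get("https://lobste.rs", 20 * 60 * 1000) !== undefined);
```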
There are a variety of reasons why you might want to work with a local browser. It is best understood as covering the "last mile" of web scraping, since the LSD language accommodates both headless cloud browsers and our own local "Bicycle" browser.
After you've downloaded and logged in to the local browser, either copy the credentials offered or go to your profile (for the same Gmail account) and create an API key. All that's left is to indicate you're interested in tripping .on() the "BROWSER".
A while back we incorporated local browser control into the LSD language. Here's how that looks using the SDK.
- Import the default export from internetdata, as well as z from zod.
import drop from "internetdata";
import { z } from "zod";
- Call the tab(connectionConfiguration?: ConnectionConfiguration) method to get a Promise for a trip.
const trip = await drop.tab(); // Promise<Trip>
- Declare a zod schema describing the shape of the data you want back from the web.
const googleSchema = z.array(
z.object({
result: z.string(),
}),
);
type GoogleType = typeof googleSchema;
- Now you can effectively pipeline the web data you're looking to retrieve:
const googleResults = await trip
.on("BROWSER")
.navigate(`https://www.google.com/search?q=what+is+lsd.so%3F`)
.group("div#search a")
.select("div#search a@href", "result")
.extrapolate<GoogleType>(googleSchema);
- Now you have a strongly typed collection of Google search result links!
console.log("What is LSD.so according to Google?");
console.log(googleResults);
- Import the default export from internetdata, as well as z from zod.
import drop from "internetdata";
import { z } from "zod";
- Call the tab(connectionConfiguration?: ConnectionConfiguration) method to get a Promise for a trip.
const trip = await drop.tab(); // Promise<Trip>
- Declare a zod schema describing the shape of the data you want back from the web.
const mcmasterSchema = z.array(
z.object({
name_of_screw: z.string(),
}),
);
type McMasterType = typeof mcmasterSchema;
- Now you can effectively pipeline the web data you're looking to retrieve:
const screwResults = await trip
.on("BROWSER")
.navigate(`https://www.mcmaster.com/products/screws/`)
.group('div[class*="TileLayout_textContainer"]')
.select('div[class*="TileLayout_titleContainer"]', "name_of_screw")
.extrapolate<McMasterType>(mcmasterSchema);
- Now you have a strongly typed collection for the names of screws on McMaster-Carr!
console.log("What screws are available on McMaster Carr?");
console.log(screwResults);
There may be flows that can be accomplished in the LSD language but not yet via the SDK. For these scenarios, you can "imitate" a trip that has already been defined.
For example, the trip yev/hacker_news has the following definition:
hn <| https://news.ycombinator.com/ |
container <| span.titleline |
post <| a |
post_link <| post@href |
front_page_of_hn <|> <|
FROM hn
|> GROUP BY container
|> SELECT post, post_link |
front_page_of_hn
Therefore, we can define the following Zod schema:
const hnSchema = z.array(
z.object({
post: z.string(),
post_link: z.string(),
}),
);
And then imitate the trip detailed above:
const frontPage = await trip
.imitate("yev/hacker_news")
.extrapolate<typeof hnSchema>(hnSchema);
console.log("What are the posts on the front page of HN?");
console.log(frontPage);
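Once the rows come back, they are plain objects you can post-process. As one example, links scraped from href attributes may be relative to the page they came from, so you might normalize them before storing them. The sketch below uses the standard URL constructor. The HNRow interface mirrors the schema above, while absolutizeLinks and the sample rows are made up for illustration and are not real output.

```typescript
// Row shape matching the post / post_link schema above.
interface HNRow {
  post: string;
  post_link: string;
}

// Hypothetical helper: resolves each link against the page it was scraped from.
// Absolute links pass through unchanged; relative ones gain the base URL.
function absolutizeLinks(rows: HNRow[], baseUrl: string): HNRow[] {
  return rows.map((row) => ({
    ...row,
    post_link: new URL(row.post_link, baseUrl).href,
  }));
}

// Made-up sample rows, not actual results from the trip.
const sampleRows: HNRow[] = [
  { post: "An external article", post_link: "https://example.com/article" },
  { post: "A discussion thread", post_link: "item?id=1" },
];

const resolvedRows = absolutizeLinks(sampleRows, "https://news.ycombinator.com/");
console.log(resolvedRows);
```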
We care about enabling and empowering developers to program the web they want to. In short, we're planning to be free as a developer-friendly Wayback Machine unless you're interested in receiving support or prioritization for working on specific features. Reach out if this interests you.