lsd-so/internetdata


internetdata

Quickstart

Use the create-your-internet package to conveniently scaffold out a new application using this package.

$ yarn create your-internet

Or, if you prefer npm.

$ npm create your-internet

Issues/Contributing

Feel free to file an issue or submit a PR. More will be added to this SDK over time.

Contents

Important: See the examples/ folder for complete code examples.

Installation

Want internet data in your TypeScript application? Just install it with npm.

$ npm i internetdata

Or add it with yarn.

$ yarn add internetdata

Authenticating

When authenticating to LSD, you are connecting to our Postgres-compatible database, hence the terms "user" and "password".

The values below refer to your email and API key, which you can obtain from your profile.

Configuration object

When calling drop.tab(), you can provide a ConnectionConfiguration object with two properties: user and password.

const trip = await drop.tab({
  user: "<[email protected]>",
  password: "<api_key>",
});

Configuration file

Alternatively, you can authenticate by writing a .lsd file in your home directory containing a JSON object with the properties user and password.

{
  "user": "<[email protected]>",
  "password": "<api_key>"
}

If you authenticate this way, you can exclude the connection configuration when calling drop.tab().

const trip = await drop.tab();
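
If you want to sanity-check the file before relying on it, something like the sketch below may help. It is not part of the SDK: readLsdFile and LsdCredentials are hypothetical names, and the only assumption taken from this README is that the file holds a JSON object with string user and password properties.

```typescript
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

// Shape described in this README: a JSON object with `user` and `password`.
interface LsdCredentials {
  user: string;
  password: string;
}

// Hypothetical helper (not an SDK function): read and validate a credentials file.
// Defaults to ~/.lsd, the location this README describes.
function readLsdFile(
  filePath: string = path.join(os.homedir(), ".lsd"),
): LsdCredentials {
  const parsed: unknown = JSON.parse(fs.readFileSync(filePath, "utf8"));
  if (typeof parsed !== "object" || parsed === null) {
    throw new Error(`${filePath} must contain a JSON object`);
  }
  const { user, password } = parsed as Record<string, unknown>;
  if (typeof user !== "string" || typeof password !== "string") {
    throw new Error(`${filePath} must define string "user" and "password" properties`);
  }
  return { user, password };
}
```

Calling readLsdFile() with no argument checks the default location; passing a path is handy for testing against a throwaway file.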

Environment variables

Another alternative is to set the LSD_USER and LSD_PASSWORD environment variables.

$ export LSD_USER='<[email protected]>'
$ export LSD_PASSWORD='<api_key>'

If you authenticate this way, you can exclude the connection configuration when calling drop.tab().

const trip = await drop.tab();

Note: If you're running into difficulties with this approach, check that the environment variables were properly set in the terminal or application you're attempting to run from.

$ echo $LSD_USER
$ echo $LSD_PASSWORD
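
The same check can be done from inside a Node application. This is only a sketch: missingLsdVars is a hypothetical helper, not something the SDK exports, and it treats an empty string as missing.

```typescript
// Names of the environment variables this README describes.
const REQUIRED_VARS = ["LSD_USER", "LSD_PASSWORD"] as const;

// Hypothetical helper: list which required variables are unset or empty.
function missingLsdVars(env: Record<string, string | undefined>): string[] {
  return REQUIRED_VARS.filter((name) => !env[name]);
}

// Warn early instead of failing later when connecting.
const missing = missingLsdVars(process.env);
if (missing.length > 0) {
  console.warn(`Missing environment variables: ${missing.join(", ")}`);
}
```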

Quickstart

Shown below are the necessary pieces for getting started after installing. The guide also assumes you've created an API key.

Note: You must either use the API key provided to the local browser after logging in, or create one on your profile after you've logged in to the browser. This lets us match your requests to the correct browser when carrying out instructions. If you're still running into problems, feel free to schedule a call.

Hacker News

Go here for the full code example.

  1. Import the default export from internetdata as well as zod.
import drop from "internetdata";
import { z } from "zod";
  1. Call the tab(connectionConfiguration?: ConnectionConfiguration) method to get a Promise for a trip.
const trip = await drop.tab(); // tab() returns Promise<Trip>; await yields the Trip

Note: The code snippet above assumes you've saved the username and API key to the LSD_USER and LSD_PASSWORD environment variables respectively. If you'd like to pass in a connection configuration object you can do so like below:

const trip = await drop.tab({
  user: "[email protected]",
  password: "<api key>",
}); // tab() returns Promise<Trip>; await yields the Trip
  1. Declare the Zod schema describing the shape in which you want the web data returned.
const hnSchema = z.array(
  z.object({
    post: z.string(),
  }),
);

Additionally, you can capture the schema's type in an alias to pass as the type argument to extrapolate.

type HNType = typeof hnSchema;

Note: If you're running into confusing Zod-related errors, see this related guide on working with generic functions and Zod.

  1. Now you can effectively pipeline the web data you're looking to retrieve:
const frontPage = await trip
  .navigate("https://news.ycombinator.com")
  .group("span.titleline")
  .select("a@href", "post")
  .extrapolate<HNType>(hnSchema);

Breaking this down line by line:

The chain ends with extrapolate, which returns a promise for the results, so the whole expression is awaited.

const frontPage = await trip

The URL we're interested in retrieving data from is https://news.ycombinator.com.

    .navigate('https://news.ycombinator.com')

On the page there is a repeating container for each post that can be matched with the CSS selector span.titleline.

    .group('span.titleline')

The href attribute of each link (a@href) is what we treat as the "post", so it is selected under the name post.

    .select('a@href', 'post')

Then finally we're going to extrapolate the object or list of objects from the trip.

    .extrapolate<HNType>(hnSchema);
  1. Now you have a strongly typed collection for the front page of Hacker News!
console.log("What are the posts on the front page of HN?");
console.log(frontPage);
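
The select step above uses a selector@attribute notation. The sketch below is not SDK code and says nothing about how LSD parses it internally; it only illustrates one way to read the notation, assuming the last @ separates the CSS selector from the attribute name. parseSelectTarget and SelectTarget are hypothetical names.

```typescript
interface SelectTarget {
  selector: string; // CSS selector part, e.g. "a"
  attribute: string | null; // attribute part, e.g. "href"; null when absent
}

// Hypothetical helper: split "selector@attribute" at the last "@".
function parseSelectTarget(target: string): SelectTarget {
  const at = target.lastIndexOf("@");
  if (at === -1) {
    return { selector: target, attribute: null };
  }
  return { selector: target.slice(0, at), attribute: target.slice(at + 1) };
}

console.log(parseSelectTarget("a@href"));
console.log(parseSelectTarget("span.titleline"));
```

Read this way, "a@href" means "the href attribute of the a element", while a plain selector such as "span.titleline" has no attribute part.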

Interacting with LSD docs

Go here for the full code example.

  1. Like with Hacker News, import the default export from internetdata as well as zod.
import drop from "internetdata";
import { z } from "zod";
  1. Call the tab(connectionConfiguration?: ConnectionConfiguration) method to get a promise for a trip.
const trip = await drop.tab(); // tab() returns Promise<Trip>; await yields the Trip

Note: The code snippet above assumes you've saved the username and API key to the LSD_USER and LSD_PASSWORD environment variables respectively. If you'd like to pass in a connection configuration object you can do so like below:

const trip = await drop.tab({
  user: "[email protected]",
  password: "<api key>",
}); // tab() returns Promise<Trip>; await yields the Trip
  1. Declare the Zod schema describing the shape in which you want the web data returned.
const docsSchema = z.array(
  z.object({
    title: z.string(),
  }),
);
  1. Now you can effectively pipeline the web data you're looking to retrieve:
const docsTitle = await trip
  .on("TRAVERSER")
  .navigate(`https://lsd.so/docs`)
  .click('a[href="/docs/database"]')
  .select("title")
  .extrapolate<typeof docsSchema>(docsSchema);

Curious what the "Traverser" means?

  1. Now you have a strongly typed collection for the title of the docs page!
console.log("What is the title of the database docs page?");
console.log(docsTitle);
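
If you want a type for the rows themselves rather than for the schema object, Zod's z.infer can derive it. The sketch below redeclares docsSchema from this section so it runs on its own; the sample row is made up for illustration and is not real output from LSD.

```typescript
import { z } from "zod";

const docsSchema = z.array(
  z.object({
    title: z.string(),
  }),
);

// The type of the validated data: an array of objects with a string `title`.
type DocsRows = z.infer<typeof docsSchema>;

// Made-up sample data (an assumption, not real LSD output).
const sampleRows: unknown = [{ title: "Example docs title" }];

// parse() throws if the data does not match the schema.
const rows: DocsRows = docsSchema.parse(sampleRows);
console.log(rows.map((row) => row.title));
```

Note that typeof docsSchema is the type of the schema, while z.infer<typeof docsSchema> is the type of the data it validates.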

Codegen under the hood

There may be times when you want to refer to a piece of data by name without working out the exact CSS selector for it. We currently have AI natively embedded in the language for SELECT statements.

See the Lobsters codegen example for the full code example.

const trip = await drop.tab();

const lobstersSchema = z.array(
  z.object({
    author: z.string(),
  }),
);

const authors = await trip
  .on("BROWSER")
  .navigate("https://lobste.rs")
  .group("ol.stories li")
  .select("author")
  .extrapolate<typeof lobstersSchema>(lobstersSchema);

console.log("Who are the authors on the front page of Lobsters?");
console.log(authors);

Most of the above code matches what you'll find in the other tutorials within this README except for the .select() call:

  .select("author")

As you may notice, the word "author" is not a usable CSS selector for this page, yet the program still returns the requested data. Two things make this work. First, the page HTML must already be accessible: by default, LSD will not retrieve a page's HTML solely to resolve an invalid selector. Second, the page HTML is available at the step of selecting "author" because it was channeled through the local browser.

If you first request data through a cloud browser and then attempt to codegen a CSS selector as above, it would also work, thanks to the default 15-minute cache.

Working with the local browser

There are a variety of reasons why you'd be interested in working with a local browser. It is best understood as covering the "last mile" of web scraping: the LSD language accommodates both headless cloud browsers and our own local "Bicycle" browser.

After you've downloaded and logged in to the local browser, either copy the credentials offered or go to your profile (for the same Gmail account) and create an API key. All that's left is to indicate you're interested in tripping .on() the "BROWSER".

Google

A while back we incorporated local browser control into the LSD language. Here's how that looks using the SDK.

  1. Import the default export from internetdata as well as zod.
import drop from "internetdata";
import { z } from "zod";
  1. Call the tab(connectionConfiguration?: ConnectionConfiguration) method to get a Promise for a trip.
const trip = await drop.tab(); // tab() returns Promise<Trip>; await yields the Trip
  1. Declare the Zod schema describing the shape in which you want the web data returned.
const googleSchema = z.array(
  z.object({
    result: z.string(),
  }),
);

type GoogleType = typeof googleSchema;
  1. Now you can effectively pipeline the web data you're looking to retrieve:
const googleResults = await trip
  .on("BROWSER")
  .navigate(`https://www.google.com/search?q=what+is+lsd.so%3F`)
  .group("div#search a")
  .select("div#search a@href", "result")
  .extrapolate<GoogleType>(googleSchema);
  1. Now you have a strongly typed collection of Google search results!
console.log("What is LSD.so according to Google?");
console.log(googleResults);
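
Because the selector above matches every link under div#search, the results may include links back to Google itself as well as duplicates. A small post-processing step like the sketch below can tidy that up. externalLinks and parseAbsolute are hypothetical helpers, and the sample rows are made up rather than real results.

```typescript
interface GoogleRow {
  result: string;
}

// Returns a URL object for absolute URLs, or null for relative or malformed hrefs.
function parseAbsolute(href: string): URL | null {
  try {
    return new URL(href);
  } catch {
    return null;
  }
}

// Hypothetical helper: keep unique absolute links that do not point at google.com.
function externalLinks(rows: GoogleRow[]): string[] {
  const seen = new Set<string>();
  for (const row of rows) {
    const url = parseAbsolute(row.result);
    if (url === null) continue;
    const isGoogle =
      url.hostname === "google.com" || url.hostname.endsWith(".google.com");
    if (!isGoogle) seen.add(url.href);
  }
  return Array.from(seen);
}

// Made-up sample rows in the shape of googleSchema (not real search results).
const sampleResults: GoogleRow[] = [
  { result: "https://lsd.so/" },
  { result: "https://www.google.com/preferences" },
  { result: "/search?q=lsd" },
  { result: "https://lsd.so/" },
];

console.log(externalLinks(sampleResults));
```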

McMaster-Carr

  1. Import the default export from internetdata as well as zod.
import drop from "internetdata";
import { z } from "zod";
  1. Call the tab(connectionConfiguration?: ConnectionConfiguration) method to get a Promise for a trip.
const trip = await drop.tab(); // tab() returns Promise<Trip>; await yields the Trip
  1. Declare the Zod schema describing the shape in which you want the web data returned.
const mcmasterSchema = z.array(
  z.object({
    name_of_screw: z.string(),
  }),
);
  1. Now you can effectively pipeline the web data you're looking to retrieve:
const screwResults = await trip
  .on("BROWSER")
  .navigate(`https://www.mcmaster.com/products/screws/`)
  .group('div[class*="TileLayout_textContainer"]')
  .select('div[class*="TileLayout_titleContainer"]', "name_of_screw")
  .extrapolate<typeof mcmasterSchema>(mcmasterSchema);
  1. Now you have a strongly typed collection for the names of screws on McMaster-Carr!
console.log("What screws are available on McMaster Carr?");
console.log(screwResults);

Imitating a trip

There may be flows that can be accomplished in the LSD language but not yet via the SDK. For these scenarios, we allow you to "imitate" a trip that has already been defined.

For example, the trip yev/hacker_news has the following definition:

hn <| https://news.ycombinator.com/ |
container <| span.titleline |
post <| a |
post_link <| post@href |

front_page_of_hn <|> <|
FROM hn
|> GROUP BY container
|> SELECT post, post_link |

front_page_of_hn

Therefore, we can define the following Zod schema:

const hnSchema = z.array(
  z.object({
    post: z.string(),
    post_link: z.string(),
  }),
);

And then imitate the trip detailed above:

const frontPage = await trip
  .imitate("yev/hacker_news")
  .extrapolate<typeof hnSchema>(hnSchema);

console.log("What are the posts on the front page of HN?");
console.log(frontPage);
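
Since post_link comes straight from an href attribute, some values may be relative to the site. The sketch below shows one way to resolve them with the standard URL constructor. absolutizeLinks is a hypothetical helper, and the sample rows are made up to match the schema above rather than taken from a real run.

```typescript
interface HnRow {
  post: string;
  post_link: string;
}

// Hypothetical helper: resolve each post_link against a base URL.
// Absolute links are left pointing at their original location.
function absolutizeLinks(rows: HnRow[], base: string): HnRow[] {
  return rows.map((row) => ({
    ...row,
    post_link: new URL(row.post_link, base).href,
  }));
}

// Made-up rows in the shape of hnSchema (not real output).
const sampleFrontPage: HnRow[] = [
  { post: "An external story", post_link: "https://example.com/story" },
  { post: "A discussion thread", post_link: "item?id=1" },
];

console.log(absolutizeLinks(sampleFrontPage, "https://news.ycombinator.com/"));
```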

How much does this cost?

We care about enabling and empowering developers to program the web they want to. In short, we're planning to be free as a developer-friendly Wayback Machine unless you're interested in receiving support or prioritization for working on specific features. Reach out if this interests you.
