CHAI is an attempt at an open-source data pipeline for package managers. The goal is to have a pipeline that can use the data from any package manager and provide a normalized data source for a wide range of use cases.
Use Docker
- Install Docker
- Clone the chai repository (https://docs.github.com/en/repositories/creating-and-managing-repositories/cloning-a-repository)
- Using a terminal, navigate to the cloned repository directory
- Run
docker compose build
to create the latest Docker images
- Then, run
docker compose up
to launch.
Note
This will run CHAI for all package managers. For example, crates by itself will take over an hour and consume more than 5 GB of storage.
Currently, we support only two package managers:
- crates
- Homebrew
You can run a single package manager by running
docker compose up -e ... <package_manager>
We are planning on supporting NPM, PyPI, and rubygems next.
Specify these arguments, e.g. docker compose -e FOO=bar up:
- FREQUENCY: Sets how often (in hours) the pipeline should run.
- TEST: Runs the loader in test mode when set to true, skipping certain data insertions.
- FETCH: Determines whether to fetch new data from the source when set to true.
- NO_CACHE: When set to true, deletes temporary files after processing.
Note
The flag NO_CACHE does not mean that files will not get downloaded to your local storage (specifically, the ./data directory). It only means that we'll delete these temporary files from ./data once we're done processing them.
These arguments are all configurable in the docker-compose.yml file.
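As a rough illustration of how a loader might consume these arguments, the sketch below reads the four environment variables in Python. The names Settings, load_settings, and _as_bool, as well as the default values, are assumptions made for this example and are not taken from CHAI's code.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    frequency_hours: int  # FREQUENCY: hours between pipeline runs
    test: bool            # TEST: run the loader in test mode
    fetch: bool           # FETCH: fetch new data from the source
    no_cache: bool        # NO_CACHE: delete temporary files after processing


def _as_bool(value: str) -> bool:
    # Treat "true" (any casing, surrounding whitespace ignored) as True.
    return value.strip().lower() == "true"


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    # The defaults here are assumed for this sketch; check docker-compose.yml
    # for the values CHAI actually uses.
    return Settings(
        frequency_hours=int(env.get("FREQUENCY", "24")),
        test=_as_bool(env.get("TEST", "false")),
        fetch=_as_bool(env.get("FETCH", "true")),
        no_cache=_as_bool(env.get("NO_CACHE", "false")),
    )


if __name__ == "__main__":
    print(load_settings())
```

Passing a plain dict instead of os.environ, for example load_settings({"TEST": "true"}), makes it easy to try different combinations without touching the shell.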
- db: PostgreSQL database for the reduced package data
- alembic: handles migrations
- package_managers: fetches and writes data for each package manager
- api: a simple REST API for reading from the db
Stuff happens. Start over:
- rm -rf ./data: removes all the data the fetcher has written.
Our goal is to build a data schema that looks like this:
You can read more about specific data models in the db readme.
Our specific application extracts the dependency graph to understand which pieces of the open-source graph are critical. We also built a simple example that displays SBOM metadata for your repository.
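To make the idea of "critical pieces" concrete, here is a minimal sketch that counts direct dependents per package from a list of (package, dependency) edges. The function name count_dependents and the toy edge list are hypothetical, and the direct-dependent count is only one simple proxy for criticality, not necessarily the measure CHAI uses.

```python
from collections import Counter
from typing import Iterable, Tuple


def count_dependents(edges: Iterable[Tuple[str, str]]) -> Counter:
    """Count how many edges point at each dependency.

    Each edge is (package, dependency). Duplicate edges are counted twice,
    so deduplicate first if the input may contain repeats.
    """
    return Counter(dependency for _package, dependency in edges)


# Placeholder package names for illustration only, not real CHAI data.
toy_edges = [
    ("app-a", "lib-x"),
    ("app-b", "lib-x"),
    ("app-b", "lib-y"),
    ("lib-y", "lib-x"),
]

if __name__ == "__main__":
    # Print packages from most to least depended upon.
    for name, dependents in count_dependents(toy_edges).most_common():
        print(name, dependents)
```

In practice the edges would come from the dependency data in the database rather than a hard-coded list.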
There are many other potential use cases for this data:
- License compatibility checker
- Developer publications
- Package popularity
- Dependency analysis vulnerability tool (requires translating semver)
Tip
Help us add the above to the examples folder.
- The database URL is postgresql://postgres:s3cr3t@localhost:5435/chai, and is used as CHAI_DATABASE_URL in the environment. psql "$CHAI_DATABASE_URL" will connect you to the database.
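If you want to check which host, port, and database a given CHAI_DATABASE_URL points at, the sketch below splits the URL using Python's standard library. The helper name describe_database_url is an assumption for this example, and the password is deliberately left out of the result.

```python
import os
from urllib.parse import urlsplit

# The URL documented above, used as a fallback when the variable is unset.
DEFAULT_URL = "postgresql://postgres:s3cr3t@localhost:5435/chai"


def describe_database_url(url: str) -> dict:
    # Split the URL into its connection components, omitting the password.
    parts = urlsplit(url)
    return {
        "user": parts.username,
        "host": parts.hostname,
        "port": parts.port,
        "database": parts.path.lstrip("/"),
    }


if __name__ == "__main__":
    url = os.environ.get("CHAI_DATABASE_URL", DEFAULT_URL)
    print(describe_database_url(url))
```

Note the non-default port 5435: a client pointed at the usual PostgreSQL port 5432 will not reach this database.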
These are tasks that can be run using xcfile.dev. If you use pkgx, typing dev loads the environment. Alternatively, run them manually.
rm -rf db/data data .venv
docker compose build
Requires: build
docker compose up -d
Env: TEST=true
Env: DEBUG=true
docker compose up
Requires: build
Env: TEST=true
Env: DEBUG=true
docker compose up
docker compose down
docker compose logs
Requires: stop
rm -rf db/data
Inputs: MIGRATION_NAME
Env: CHAI_DATABASE_URL=postgresql://postgres:s3cr3t@localhost:5435/chai
cd alembic
alembic revision --autogenerate -m "$MIGRATION_NAME"
Env: CHAI_DATABASE_URL=postgresql://postgres:s3cr3t@localhost:5435/chai
cd alembic
alembic upgrade head
Inputs: STEP
Env: CHAI_DATABASE_URL=postgresql://postgres:s3cr3t@localhost:5435/chai
cd alembic
alembic downgrade -$STEP
psql "postgresql://postgres:s3cr3t@localhost:5435/chai"
psql "postgresql://postgres:s3cr3t@localhost:5435/chai" -c "SELECT count(id) FROM packages;"
psql "postgresql://postgres:s3cr3t@localhost:5435/chai" -c "SELECT * FROM load_history;"
Refreshes table knowledge from the db.
docker compose restart api
docker compose down --remove-orphans