I recommend downloading the raw dataset into a directory at data/
(relative to this repo's root). Usually this is done with aws s3 sync.
For the FIREBALL project, you can extract the dataset release here.
The pipeline requires Python 3.10+.
I recommend creating a virtual environment to install the Python requirements:
# installing Python requirements
$ python --version
Python 3.10.2
$ python -m venv venv
$ source venv/bin/activate
# If the venv is already set up, you can skip to this step
(venv) $ pip install -r requirements.txt
To reproduce the FIREBALL data processing, run these Python scripts in this order:
distill1_time_group.py
distill2_authors.py
distill3a_ic_regex.py
distill3b_ic_classifier_gpt.py
distill4_normalize.py
  (the results of this step are included in the FIREBALL release for all instances)
finetune_prep.py
The resulting data will be in the extract/ directory.
originally AWS Kinesis Dataset Exploration Tool
NOTE: This code is not part of the FIREBALL preprocessing pipeline; it was used to explore the dataset and iterate on heuristics early in the project. You can use this code to explore the dataset as well.
This repo contains a set of tools for exploring datasets collected via AWS Kinesis Firehose quickly and intuitively, along with a framework for iterating on heuristics and visualizing raw data.
- Operates directly on gzipped JSONL files output by AWS Kinesis Firehose, no extraction needed
- Memory efficient (streaming heuristic applicator)
- High-throughput and horizontally scalable (multiprocessing out of the box)
- Low-latency (streaming API & client)
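To illustrate the streaming approach from the list above, here is a minimal sketch that reads a gzipped JSONL file one line at a time, so only one event is held in memory at once. The function name iter_events is hypothetical and this is not necessarily how the repo implements it; it only assumes one JSON object per line.

```python
import gzip
import json
from typing import Any, Dict, Iterator


def iter_events(path: str) -> Iterator[Dict[str, Any]]:
    """Lazily yield one event dict per non-empty line of a gzipped JSONL file.

    Hypothetical helper for illustration; not taken from the repo.
    """
    # "rt" decompresses on the fly and decodes to text, so nothing is
    # extracted to disk and the file is never loaded into memory whole.
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)
```

Because iter_events is a generator, a consumer can stop early or aggregate on the fly without ever building a full list of events.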
I built this tool for the FIREBALL project (https://www.cis.upenn.edu/~ccb/language-to-avrae.html), and this repo is implemented for that dataset, but the tool is designed with some degree of dataset-agnosticism in mind.
For help using this tool in your own data science project, contact me at [email protected] or view "Customizing the Explorer" below.
The first step of exploring the dataset is to define and apply heuristics to the dataset.
To define a heuristic, add a function to the heuristics module that takes an iterator of event dicts (a combat
session) and returns a single float (we'll use this later - the scale and meaning can be fairly arbitrary). Make sure to
import any added heuristics in heuristics/__init__.py.
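As a rough sketch of what such a heuristic might look like, the two functions below follow the signature described above (an iterator of event dicts in, a single float out). Both function names and the "event_type" field are assumptions made for illustration; substitute whatever fields your events actually carry.

```python
from typing import Any, Dict, Iterable


def event_count(events: Iterable[Dict[str, Any]]) -> float:
    """Hypothetical heuristic: the number of events in a combat session."""
    # Count lazily so the session is never materialized as a list.
    return float(sum(1 for _ in events))


def distinct_event_types(events: Iterable[Dict[str, Any]]) -> float:
    """Hypothetical heuristic: how many distinct event types a session has.

    Assumes each event may carry an "event_type" key (an illustrative
    field name, not one confirmed by the dataset).
    """
    return float(len({e["event_type"] for e in events if "event_type" in e}))
```

Each function consumes the iterator exactly once, which matters if sessions are streamed rather than held in memory.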
Next, compute each heuristic over the dataset. To do this efficiently, run python heuristic_worker.py.
This will compute each defined heuristic for each combat instance in parallel and save the results
to heuristic_results/.
If a heuristic has been computed for the dataset previously (based on heuristic name and dataset checksum), it will not
be recomputed. Make sure to delete any prior result from your output directory or run with --force-recompute after
modifying heuristic code.
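The sketch below shows one way a cache keyed on heuristic name and dataset checksum could work, and why editing a heuristic's code does not invalidate it. The function names, the choice of SHA-256, and the marker-file layout are all assumptions for illustration; the worker's real checksum and file layout may differ.

```python
import hashlib
from pathlib import Path


def dataset_checksum(data_dir: str) -> str:
    """Hash the relative path and contents of every file under data_dir."""
    digest = hashlib.sha256()
    root = Path(data_dir)
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        digest.update(path.relative_to(root).as_posix().encode("utf-8"))
        with path.open("rb") as f:
            # Read in 1 MiB chunks so large files are never loaded whole.
            for chunk in iter(lambda: f.read(1 << 20), b""):
                digest.update(chunk)
    return digest.hexdigest()


def needs_recompute(name: str, data_dir: str, output_dir: str, force: bool = False) -> bool:
    """Return True if no cached result exists for this name and checksum."""
    # Hypothetical layout: <output_dir>/<heuristic name>/<checksum>.done
    marker = Path(output_dir) / name / f"{dataset_checksum(data_dir)}.done"
    return force or not marker.exists()
```

Note that the key covers only the heuristic's name and the data, not the heuristic's source code, so a changed implementation looks identical to the cache. That is the situation in which deleting old results or forcing a recompute is needed.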
The heuristic worker includes some additional arguments for more fine-grained control. You can view these arguments
with python heuristic_worker.py --help:
usage: heuristic_worker.py [-d DATA_DIR] [-o OUTPUT_DIR] [-h HEURISTIC] [--force-recompute] [--help]
Applies defined heuristics to a dataset.
options:
  -d DATA_DIR, --data-dir DATA_DIR
                        the directory containing the raw data (default: data/)
  -o OUTPUT_DIR, --output-dir OUTPUT_DIR
                        the directory to save the heuristic results to (default: heuristic_results/)
  -h HEURISTIC, --heuristic HEURISTIC
                        the heuristic(s) to run (defaults to all)
  --force-recompute     forces the worker to recompute regardless of prior computation
  --help                displays CLI help
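One detail worth noticing in the help text above is that -h selects a heuristic rather than printing help. If you build a similar CLI with argparse, a parser along these lines would produce that interface. This is a sketch, not the worker's actual source; in particular, allowing -h to be repeated is an assumption based on the "heuristic(s)" wording.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Sketch of a parser matching the documented options (not the real one)."""
    # add_help=False frees up -h, which argparse normally reserves for help.
    parser = argparse.ArgumentParser(
        prog="heuristic_worker.py",
        description="Applies defined heuristics to a dataset.",
        add_help=False,
    )
    parser.add_argument("-d", "--data-dir", default="data/",
                        help="the directory containing the raw data")
    parser.add_argument("-o", "--output-dir", default="heuristic_results/",
                        help="the directory to save the heuristic results to")
    # Assumption: -h may be given several times; None means "run all".
    parser.add_argument("-h", "--heuristic", action="append",
                        help="the heuristic(s) to run (defaults to all)")
    parser.add_argument("--force-recompute", action="store_true",
                        help="recompute regardless of prior computation")
    # Re-add help under the long flag only.
    parser.add_argument("--help", action="help", help="displays CLI help")
    return parser
```

With this layout, build_parser().parse_args(["-h", "event_count"]) selects a single (hypothetically named) heuristic, while passing no -h at all leaves the value as None.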
After defining and computing some heuristics, the next step is to open up the dataset in the dataset explorer and view each recording instance empirically alongside the computed heuristics.
Building the explorer app locally is optional - a prebuilt distribution is available from this repo's CI pipeline.
To build the explorer web app locally, Node.js 16+ is required.
The explorer app uses some modern web technologies that are not yet supported by all browsers; Chrome 71+, Firefox 105+, Edge 79+, Safari 14.1+, or Opera 58+ is required (IE is not supported).
# installing Node requirements (optional)
$ node --version
v16.17.0
$ npm --version
8.17.0
$ cd explorer
$ npm install
The explorer app is a Vue app that lives in explorer/. To build it, run npm run build
from the explorer directory.
Alternatively, you can download a prebuilt distribution from this repo's CI pipeline
(click on the latest run and download the explorer-dist artifact).
To use the prebuilt distribution, create the explorer/dist directory and extract the artifact into it. The project
file structure should look like this:
aws-kinesis-dataset-exploration-tool/
  explorer/
    dist/
      assets/
        index.***.css
        index.***.js
      index.html
This project provides a simple local web app for exploring the dataset. Run python explorer_server.py and the explorer will
be served at http://127.0.0.1:31415/explorer.
Similarly to the heuristic worker, you can point the explorer to an alternate dataset directory and heuristic results
directory by setting the DATA_DIR and HEURISTIC_DIR environment variables, respectively.
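A minimal sketch of how a server might resolve these two variables is shown below. The function name resolve_dirs is hypothetical, and the fallback values are assumed to match the heuristic worker's defaults (data/ and heuristic_results/); the real server may resolve them differently.

```python
import os
from typing import Mapping, Tuple


def resolve_dirs(env: Mapping[str, str] = os.environ) -> Tuple[str, str]:
    """Return (data_dir, heuristic_dir), falling back to assumed defaults."""
    # Assumed defaults, mirroring the heuristic worker's documented ones.
    data_dir = env.get("DATA_DIR", "data/")
    heuristic_dir = env.get("HEURISTIC_DIR", "heuristic_results/")
    return data_dir, heuristic_dir
```

In a POSIX shell, both can be set for a single run with something like DATA_DIR=/path/to/data HEURISTIC_DIR=/path/to/results python explorer_server.py, where the paths are placeholders.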
The implementation of the explorer in this repo is built for the Avrae NLP project. To use this tool for your own project, you will need to change your event visualizer and models.
This step provides typed interfaces in order to make building the custom event visualizer easier. You can skip this step
by setting type AnyEvent = any in explorer/src/events.ts if you do not need static typing.
Otherwise, create a directory in explorer/src for your own dataset, and define your event type(s) as a TypeScript
interface. Once you've done that, set the AnyEvent type to your newly defined type(s) in explorer/src/events.ts.
In order to use this tool's event annotation features, each event should have a unique ID. To define how to extract this
ID from an event, implement getEventId(event: AnyEvent): string | null in explorer/src/events.ts.
If the function returns null for an event, that event will not support annotations.
To visualize an event, define a Vue component in your dataset-specific directory that takes your event as a prop. Use this template:
<!-- explorer/src/(dataset)/EventComponent.vue -->
<script setup lang="ts">
// don't change this!
import type {AnyEvent} from "@/events";
defineProps<{event: AnyEvent}>();
// any custom JS logic goes here
</script>
<template>
<!-- by default, this displays the event as JSON - update the template to your liking -->
<pre>
{{ event }}
</pre>
</template>
<style scoped>
/* css goes here */
</style>
Then, update the import in explorer/src/views/InstanceViewer.vue to use your dataset-specific component:
<!-- explorer/src/views/InstanceViewer.vue -->
<script setup lang="ts">
[...]
import EventComponent from "@/(dataset)/EventComponent.vue"; // change this line!
[...]
</script>
As JavaScript's number type loses precision for integers greater than 2^53, the explorer will automatically parse
any integer that would otherwise cause rounding as a BigNumber.
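The rounding itself is easy to reproduce outside the browser, because JavaScript's number type is an IEEE 754 double, the same representation as a Python float. The sketch below checks whether an integer survives a round trip through a double; the helper name survives_double is hypothetical, and this only demonstrates the precision limit, not the explorer's own parsing logic.

```python
def survives_double(n: int) -> bool:
    """Return True if n round-trips through a 64-bit float unchanged."""
    return int(float(n)) == n


# 2**53 is still exact, but 2**53 + 1 is not representable as a double
# and comes back as 2**53.
for n in (2**53 - 1, 2**53, 2**53 + 1):
    print(n, survives_double(n))
```

Any integer that fails this check would be silently altered by a plain number parse, which is the case where an arbitrary-precision representation is needed.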