This is the guide to using Data-Forge.
Data-Forge is a open-source toolkit for data transformation and analysis in JavaScript and TypeScript. This guide will help you dive deeper into Data-Forge and equip you with skills to build more complex data processing pipelines.
This document is a constant work in progress. If you notice any problems please log an issue.
Please also consider forking this repo, then add or update this document based on your own experimentation or experience with Data-Forge and then submit a pull-request. I'm happy to accept contributions to both code and documentation!
You can also read about Data-Forge for data science in the book JavaScript for Data Science.
For other resources and documentation please visit the home page or view the repo's main README.
Sponsor my open-source work and receive a free licence for Data-Forge Notebook .
Happy data wrangling!
Table of Contents generated with DocToc
- Project Overview
- Installation
- Key Concepts
- A note about JavaScript anonymous functions
- Immutability and Chained Functions
- Lazy evaluation through iterators
- Basic Usage
- Working with data
- Data exploration and visualization
- Sorting
- Transformation
- Filtering
- Data subsets
- Combining
- Collapsing unique values
- Groups and windows
- Summarization and Aggregation
- Filling gaps and missing data
- Node.js examples
- Browser examples
This section is a short overview of the aims, principles and implementation methods for this project.
The aims of this project:
- To combine the best aspects of Pandas and LINQ and make them available in JavaScript.
- To be able to load, transform and save data.
- To be able to visualize data (see Data-Forge Plot)
- To be able to work with massive data files (although it's not quite there yet - want to add streaming in the future).
The principles that drive decision making and tradeoffs:
- The API should be simple, easy to learn and easy to use.
- Simple tools should be composable to build solutions to complex data problems.
- Intellisense in Visual Studio Code (or other IDE) should make using the API a breeze.
- Minimize the magic: everything should be understandable, the API should be orthogonal.
- The library should have high performance.
- The TypeScript or JavaScript code you build during exploratory coding should be easily transplantable to a webapp, server or microservice.
General implementation goals:
- Immutable: every operation generates a new immutable data set.
- Lazy evaluation, to make the performance of immutability acceptable.
- Should be easily extensible.
- Core code is created through test driven development.
Install via NPM:
npm install --save data-forge
If you are on Node.js and want the file system functions please also install the file system module Data-Forge FS:
npm install --save data-forge-fs
Require the module into your JavaScript code:
const dataForge = require('data-forge');
Or import the entire module in TypeScript:
import * as dataForge from 'data-forge';
Or just import classes and functions in TypeScript:
import { readFile, Series, DataFrame } from 'data-forge';
To use the file system functions also import the fs module in JavaScript:
require('data-forge-fs');
Or in TypeScript:
import 'data-forge-fs';
The file system module auguments the core Data-Forge API and adds readFile and writeFile function to support access to data files under Node.js.
To visualize data, also install Data-Forge Plot:
npm install --save data-forge-plot
And require it in JavaScript:
require('data-forge-plot');
Or in TypeScript:
import 'data-forge-plot';
Data-Forge can be used from the browser and has been tested with Browserify and Webpack.
Please see links to examples on the main page.
If you aren't using Browserify or Webpack, the npm package includes a pre-packed browser distribution that you can install and included in your HTML as follows:
<script language="javascript" type="text/javascript" src="node_modules/data-forge/dist/web/index.js"></script>
This gives you the data-forge package mounted under the global variable dataForge
.
Please remember that you can't use data-forge-fs or data-forge-plot from the browser.
Install via NPM as described in previous sections or clone, fork or download the code from GitHub:
- https://github.com/data-forge/data-forge-ts
- https://github.com/data-forge/data-forge-fs
- https://github.com/data-forge/data-forge-plot
The concepts section has been moved to a separate file.
Use of Data-Forge benefits heavily on the use of JavaScript anonymous functions. For most of the examples in this guide I use the JavaScript arrow syntax that was introduced in ES6.
An example of an arrow syntax anonymous function:
(param1, param2) => 1 + 1
However this can't be consisently used in the browser or in older versions of node, so please substitute the traditional syntax where necessary:
function (param1, param2) {
return 1 + 1;
}
You will notice in all examples here that multiple functions are often chained together using the Data-Forge fluent API.
Data-Forge supports only immutable operations. Each operation returns a new immutable dataframe or series. No in place operations are supported (one of the things I found confusing about Pandas).
This is why, in the following example, the final dataframe is always captured after all operations have been applied:
const df = new dataForge.DataFrame(config).setIndex("Col3").dropSeries("Col3");
Consider an alternate structure:
const df1 = new dataForge.DataFrame(config);
const df2 = df1.setIndex("Col3");
const df3 = df2.dropSeries("Col3");
Here df1, df2 and df3 are separate dataframes with the results of the previous operations applied. These dataframes are all immutable and cannot be changed. Any function that transforms a dataframe returns a new and independent dataframe. If you are not used to this sort of thing, it may require some getting used to!
Most operations in Data-Forge happen only through lazy evaluation with results computed just in time when they are needed.
For example in the following example you might think that the index has been set and Col3 has been dropped:
const df = new dataForge.DataFrame(config).setIndex("Col3").dropSeries("Col3");
But that's not the case due to lazy evaluation which ensures that the work to modify the series or dataframe is only done when it needs to be. For example converting the dataframe to a JavaScript array will force lazy evaluation and bake all our changes into the resulting array:
const array = df.toArray();
Lazy evaluation in Data-Forge is implemented through JavaScript iterators. This means you can use for..of
on any series or dataframe to iterate its values. Series and dataframes can be passed to any JavaScript function that expects an iterator.
You can force a dataframe or series to be evaluated using the bake
function, this can sometimes help improve performance:
const df = new dataForge.DataFrame(config)
.setIndex("Col3")
.dropSeries("Col3")
.bake();
A dataframe represents a sequence of tabular data. That is to say that it contains multiple named series, or columns of data. You can think of it like a JavaScript array but with extra operations and support for columns. I like to think of a dataframe as a spreadsheet in memory.
The DataFrame
constructor is passed a config object that specifies the initial contents of the dataframe and additional options.
Create a data frame from column names and rows:
const df = new dataForge.DataFrame({
columnNames: ["Col1", "Col2", "Col3"],
rows: [
[1, 'hello', new Date(...)],
[5, 'computer', new Date(...)],
[10, 'good day', new Date(...)]
]
});
A dataframe can also be created from an array of JavaScript objects and the column names are inferred from the fields in the first object that it sees:
const df = new dataForge.DataFrame({
values: [
{
Col1: 1,
Col2: 'hello',
Col3: new Date(....)
},
{
Col1: 5,
Col2: 'computer',
Col3: new Date(....)
},
{
Col1: 10,
Col2: 'good day',
Col3: new Date(....)
}
]
});
If you don't need additional options, you can simply pass in an array of objects:
const df = new dataForge.DataFrame([
{
Col1: 1,
Col2: 'hello',
Col3: new Date(....)
},
{
Col1: 5,
Col2: 'computer',
Col3: new Date(....)
},
{
Col1: 10,
Col2: 'good day',
Col3: new Date(....)
}
]);
If you have irregular data you can enable considerAllRows, but be warned that this can be expensive as every row must be examined to determine column names:
const df = new dataForge.DataFrame({
values: [
{
Col1: 1,
Col2: 'hello',
Col3: new Date(....)
},
{
Col1: 5,
Col5: 'these are irregular columns',
Col6: new Date(....)
},
{
Col5: 10,
Col7: 'another irregular column',
Col10: new Date(....)
}
],
considerAllRows: true, // Examine all rows to determine column names.
});
A dataframe can also be constructed from separate columns as follows using arrays or Series objects:
const df = new dataForge.DataFrame({
columns: {
Col1: [1, 2, 3, 4],
Col2: ['a', 'b', 'c', 'd'],
Col3: new dataForge.Series(...),
},
});
A series is a sequence of values, again a bit like a JavaScript array but with more options. DataFrame
and Series
are very similar and share many functions. A DataFrame
can be actually be thought of as a collection of named series.
Very similar to creating a DataFrame
, pass a configuration object to the Series
constructor with values and additional options:
const series = new dataForge.Series({
values: [1, 2, 3]
});
If you don't need additional options you can simply pass in an array of values:
const series = new dataForge.Series([1, 2, 3]);
To get back the names of columns:
const columnNames = df.getColumnNames();
To get back an array of objects (with column names as field names):
const objects = df.toArray();
To get back an array of rows (arrays of values in column order):
const rows = df.toRows();
To get back index and value pairs:
const pairs = df.toPairs();
Because a dataframe is also an iterable you can iterate it using for...of
:
for (const value of df) {
// ... do something with value ...
}
There's also a forEach function:
df.forEach(value => {
// ... do something with value ...
});
To retreive the data from Series as an array:
const values = series.toArray();
To get back index and value pairs:
const pairs = series.toPairs();
A series can also be iterated:
for (const value of series) {
// ... do something with value ...
}
And of course, there's a forEach function:
series.forEach(value => {
// ... do something with value ...
});
So why is DataFrame and Series special? How they are different from a JavaScript array?
Well besides both having many useful functions for slicing and dicing your data, they can both be indexed by values other than numbers. Indexes support efficient lookup, merging and aggregation of your data.
In the previous examples of creating dataframes and series no index was specified, in those cases a default zero-based index was generated (just like with a normal JavaScript array).
An index can be set explicitly when creating a series or dataframe:
const df = new dataForge.DataFrame({
values: [ ... ] ,
index: [5, 10, 100, ...and so on...]
});
const series = new dataForge.Series({
values: [ ... ],
index: [5, 10, 100, ...and so on...]
});
An index can be any type that you want. For example time-series data will be indexed by Date
:
const timeSeries = new dataForge.Series({
values: [ ... ],
index: [new Date(...), new Date(...), ...and so on...]
});
A new index can easily be assigned to either series or dataframe using the withIndex
function:
const dataFrameWithNewIndex = df.withIndex([1, 2, 3, ...]);
Most likely when using a dataframe you will want to promote an existing column to an index:
const df = new dataForge.DataFrame(someConfig).setIndex("Column3");
You can also use withIndex
to assign or compute an index for each data value:
const indexedDf = df.withIndex(row => row.Column3);
Be aware that promoting a column to an index doesn't remove the column (as it does in Pandas), however you can easily achieve this by calling dropSeries
:
const df = new dataForge.DataFrame(someConfig).setIndex("Column3").dropSeries("Column3");
Note an index is required for certain operations like join
and withSeries
.
To reset the dataframe's index to the default zero-based index, you can use resetIndex
.
const reindexedDf = df.resetIndex();
WARNING: You must have Data-Forge FS installed to use the file system functions. This uses the NodeJS fs
module, so reading and writing files can't work in the browser which has no access to the local file system. Only try this under Node.js!
If your CSV has a header with column names:
const df = dataForge.readFileSync('some-csv-file.csv')
.parseCSV();
If your CSV doesn't have a header you must specify the column names:
const df = dataForge.readFileSync('some-csv-file.csv')
.parseCSV({ columnNames: ["some", "explicit", "column", "names"] });
dataForge.readFile('some-csv-file.csv')
.parseCSV()
.then(df => {
// Use your dataframe after the promise resolves.
})
.catch(err => {
// Some error occurred.
});
async function myAsyncFunction() {
const df = await dataForge.readFile('some-csv-file.csv')
.parseCSV();
// Use your dataframe.
}
The options object passed to parseCSV
is passed directly to PapaParse.parse, please see PapaParse docs for additional options.
df.asCSV().writeFileSync('some-other-csv-file.csv');
df.asCSV().writeFile('some-other-csv-file.csv')
.then(() => {
// File has been written.
})
.catch(err => {
// Some error occurred.
});
async function myAsyncFunction(df) {
await df.asCSV().writeFile('some-other-csv-file.csv');
}
CSV output options can be provided to asCSV
like this:
df.asCSV(options)
The options object is passed directly to PapaParse.unparse, please see PapaParse docs for additional options.
NOTE: The fromCSV
and toCSV
functions do work in the browser, because they are in-memory operations and don't deal directly with any files.
If you already have CSV data (loaded into a string) you can parse it into a dataframe via fromCSV
:
const inputCsvData = // ... some string with CSV data ...
const df = dataForge.fromCSV(inputCsvData);
And if you need to specify column names:
const inputCsvData = // ... some string with CSV data ...
const df = dataForge.fromCSV(inputCsvData, { columnNames: ["some", "explicit", "column", "names"] });
The options object passed to fromCSV
is passed directly to PapaParse.parse, please see PapaParse docs for additional options.
You can stringify a dataframe to CSV by calling toCSV
:
const outputCsvData = df.toCSV();
CSV output options can be provided to toCSV
like this:
df.toCSV(options)
The options object is passed directly to PapaParse.unparse, please see PapaParse docs for additional options.
WARNING: You must have Data-Forge FS installed to use the file system functions. This uses the NodeJS fs
module, so reading and writing files can't work in the browser which has no access to the local file system. Only try this under Node.js!
const df = dataForge.readFileSync('some-json-file.json')
.parseJSON();
dataForge.readFile('some-json-file.json')
.parseJSON()
.then(df => {
// Use your dataframe after the promise resolves.
})
.catch(err => {
// Some error occurred.
});
async function myAsyncFunction() {
const df = await dataForge.readFile('some-json-file.json')
.parseJSON();
// Use your
}
df.asJSON().writeFileSync('some-json-file.json');
df.asJSON().writeFile('some-other-json-file.json')
.then(() => {
// File has been written.
})
.catch(err => {
// Some error occurred.
});
async function myAsyncFunction(df) {
await df.asJSON().writeFile('some-other-json-file.json');
}
NOTE: The fromJSON
and toJSON
functions do work in the browser, because they are in-memory operations and don't deal directly with any files.
If you already have JSON data (loaded into a string) you can parse it into a dataframe via fromJSON
:
const inputJsonData = // ... some string with JSON data ...
const df = dataForge.fromJSON(inputJsonData);
You can stringify a dataframe to jSON by calling toJSON
:
const outputJsonData = df.toJSON();
Often when you parse a data set from a text format you will need to parse string values in specific columns to particular types. This is especially true for CSV data which contain only string data once loaded. It is less true for JSON data which can store values as numbers, although the JSON format has no native date format, so when you load JSON files you will still need to parse the dates.
Data-Forge has various helper functions for parsing string values: parseInts
, parseFloats
and parseDates
.
You can call these on a series, for example:
const stringSeries = new dataForge.Series(["15", "16"]);
assert.isString(stringSeries.first());
const parsedSeries = stringSeries.parseInts();
assert.isNumber(parsedSeries.first());
To call these functions on a dataframe you must pass in the name of the column that is to be parsed, for example say you load from a CSV (which loads in string data) and want to parse a particular column:
const stringDataFrame = dataForge.readFileSync("some-csv-file.csv").parseCSV();
assert.isString(stringDataFrame.first().Column1);
const parsedDataFrame = stringDataFrame.parseInts("Column1");
assert.isNumber(parsedDataFrame.first().Column1);
For each of these functions you can also specify an array of column names to be parsed:
const parsedDataFrame = stringDataFrame.parseInts(["Column1", "Column2"]);
assert.isNumber(parsedDataFrame.first().Column1);
assert.isNumber(parsedDataFrame.first().Column2);
When parsing dates you can specify an optional format string that specifies the format of the dates to be parsed:
const stringDataFrame = dataForge.fromCSV("Column1\n2016-09-25\n2016-10-25");
const parsedDataFrame = stringDataFrame.parseDates("Column1", "YYYY-MM-DD");
Data-Forge uses Moment.js under the hood, please see its docs for valid formatting syntax.
When parsing columns using CSV we can have values parsed automatically depending on what type they appear to be.
const df = dataForge.readFileSync("some-csv-file.csv").parseCSV({ dynamicTyping: true });
assert.isNumber(df.first().SomeNumberColumn);
Please note that this will parse numbers and booleans, but it won't parse dates.
When you are saving out data in a text format or displaying data on screen you will often want to transform values in specific columns to particular types. For numbers this happens automatically, but this is essential when formatting dates for output, for example:
const df = ...
assert.instanceof(df.first().Column1, Date);
const stringifiedDataFrame = df.toStrings("Column1", "YYYY-MM-DD");
assert.isString(stringifiedDataFrame.first().Column1);
Data-Forge uses Moment.js under the hood, please see its docs for valid formatting syntax.
Values can be extracted from a dataframe in several ways.
NOTE: the following functions cause lazy evaluation to complete (like the toArray function in LINQ). This can be performance intensive.
To extract rows as arrays of data (ordered by column):
const arrayOfArrays = df.toRows();
To extract rows as objects (with column names as fields):
const arrayOfObjects = df.toArray();
To extracts index + row pairs:
const arrayOfPairs = df.toPairs();
A new dataframe can also be created from the values between a pair of indicies:
const startIndex = // ... starting row index to include in subset.
const endIndex = // ... ending row index to include in subset.
const subset = df.between(startIndex, endIndex);
NOTE: To use between
and similar functions your index must already be in sorted order.
Invoke a callback for each row in a dataframe using forEach
:
df.forEach(row => {
// Callback function invoked for each row.
});
Because dataframe is implemented as an iterable we can also use for..of
:
for (const row of df) {
// Do something with row.
}
Get the names of the columns:
const arrayOfColumnNames = df.getColumnNames();
Iterate a series of all columns:
for (const column of df.getColumns()) {
const name = column.name;
const series = column.series;
// ... do something with the series ...
}
The advantage to having a series of columns, rather than a normal JavaScript array is that you can access all the tools that series offers for slicing and dicing a sequence, for example:
const sortedColumnsSubject = df.getColumns()
.where(column => column.name !== "Date")
.skip(2)
.take(3)
.orderBy(column => column.name);
Extract a named series from the dataframe:
const series = df.getSeries('MyColumn');
Or to avoid magic strings:
const series = df.deflate(row => row.MyColumn);
Extract a series and compute a new series from it:
const computedSeries = df.deflate(row => row.MyColumn * 10);
Create a new dataframe from a subset of columns:
const subset = df.subset(["Some-Column", "Some-Other-Column"]);
NOTE: the follow functions cause lazy evaluation to complete (like the toArray function in LINQ). This can be performance intensive.
Extract the values from the series as an array:
const arrayOfValues = series.toArray();
Extract index + value pairs from the series as an array:
const arrayOfPairs = series.toPairs();
Invoke a callback for each value in the series using forEach
:
series.forEach(value => {
// Callback function invoked for each value.
});
Series is also implemented as an iterable so we can use for..of
:
for (const value of series) {
// Do something with value.
}
Retrieve the index from a dataframe:
const index = dataFrame.getIndex();
Retrieve the index from a series:
const index = series.getIndex();
An index is actually basically just a series so you can call the toArray
function or anything else that normally works for a series:
const arrayOfIndexValues = index.toArray();
New columns can be added to a dataframe. This doesn't change the original dataframe, it generates a new one with the additional column.
const newDf = df.withSeries("Some-New-Column", someNewSeries);
You can also use the column spec format to avoid magic strings and also to add multiple columns at once:
const newDf = df.withSeries({
NewColumn1: seriesA,
NewColumn2: seriesB
});
withSeries
can also replace an existing column:
const newDf = df.withSeries({
SomeExistingColumn: someNewSeries
});
Again note that it is only the new dataframe that includes the modified column.
withSeries
can be used to generate a new column from existing columns as follows:
const newDf = df.withSeries({
SomeNewColumn: df => df.select(row => row.InputColumn1 + row.InputColumn2)
);
This same thing can be done more slighlty conveniently with the generateSeries
function:
const newDf = df.generateSeries({
SomeNewColumn: row => row.InputColumn1 + row.InputColumn2
});
withSeries
can be used to transform an existing column by passing in a function:
const newDf = df.withSeries({
SomeExistingColumn:
df => df.deflate(row => row.SomeExistingColumn)
.select(value => transformValue(value)
});
But this operation can be done more conveniently with the transformSeries
function:
const newDf = df.transformSeries({
SomeExistingColumn: value => transformValue(value)
});
This syntax can be used to add, generate and transform any number of colums at once.
One or more columns can easily be removed:
const newDf = df.dropSeries(['col1', 'col2']);
Also works for single columns:
const newDf = df.dropSeries('Column-to-be-dropped');
Alternatively you can select the columns to keep and drop the rest:
const newDf = df.subset(["Column-to-keep", "Some-other-column-to-keep"]);
A particular value of a series or a row of a dataframe can be retrieved by specifying the index using the at
function:
const df = // ...
const row = df.at(10); // Get a row at index 10.
This also works when the index is a different type, eg a time-series index:
const row = df.at(new Date(2016, 5, 22));
The at
function also works for a series.
In order to understand the data we are working with we must explore it, understand the data types involved and the composition of the values.
DataFrame
and Series
both provide a toString
function that can be used to dump data to the console in a readable format.
Use the LINQ functions skip
and take
to preview a subset of the data (more on LINQ functions soon):
// Skip 10 rows, take 20 rows, print the result.
console.log(df.skip(10).take(20).toString());
Or more conveniently:
// Get a range of rows starting at row index 10 and ending at (but not including) row index 20.
console.log(df.between(10, 20).toString());
Use the head
function to view the rows at the start of a series or dataframe:
console.log(df.head(5).toString());
Use the tail
function to view the rows at the end of a series or dataframe:
console.log(df.tail(5).toString());
As you explore a data set you may want to understand what data types you are working with. You can use the detectTypes
function to produce a new dataframe with information on the data types in the dataframe you are exploring:
// Create a data frame with details of the types from the source data frame.
const typesDf = df.detectTypes();
console.log(typesDf.toString());
For example, here is the output with some stock market data:
__index__ Type Frequency Column
---------------- ------ --------- ---------
0 date 100 Date
1 number 100 Open
2 number 100 High
3 number 100 Low
4 number 100 Close
5 number 100 Volume
6 number 100 Adj Close
You also probably want to understand the composition of values in the dataframe. This can be done using detectValues
that examines the values and reports on their frequency:
// Create a data frame with the information on the frequency of values from the source data frame.
const valuesDf = df.detectValues();
console.log(valuesDf.toString());
Series and dataframes can be sorted using the LINQ-style functions: orderBy
and orderByDescending
.
const sortedAscending = df.orderBy(row => row.SomeColumn);
const sortedDescending = df.orderByDescending(row => row.SomeColumn);
Use thenBy
and thenByDescending
to specify additional sorting criteria:
const sorted = df.orderBy(row => row.SomeColumn)
.thenByDescending(row => row.AnotherColumn)
.thenBy(row => row.SomeOtherColumn);
A dataframe can be transformed using the LINQ-style select
function:
const transformedDataFrame = df.select(row => {
return {
NewColumn: row.OldColumn * 2, // <-- Transform existing column to create a new column.
AnotherNewColumn: rand(0, 100) // <-- Create a new column (in this cause just use random data).
};
});
This produces an entirely new immutable dataframe. However the new dataframe has the same index as the source dataframe, so both can be merged back together, if required.
Note that the powerful selectMany
function is also available and works the same as LINQ's SelectMany.
A series can be transformed using select
:
const series = df.deflate(row => row.SomeColumn);
const newSeries = series.select(value => transform(value)); // Apply a transformation to each value in the column.
const newDf = df.withSeries({ SomeColumn: newSeries }); // Plug the modified series back into the dataframe.
The source index is preserved to the transformed series which is what allows the resulting series to be merged back into the dataframe.
The result of select
is a completely new immutable series.
The selectMany
function is also available for series.
Data-Frame offers a convenience function transformSeries
for when you need a simple convenient mechanism to extract, transform and plug back in one or more series at once. For example to simplify the previous code example:
const newDf = df.transformSeries({
SomeColumn: value => transform(value), // Apply a transformation to each value in the series.
});
Dataframes and series can be filtered using the LINQ-style where
function:
const filteredDf = df.where(somePredicateFunction);
The predicate function must return truthy to keep the row, or falsy to filter it out, for example:
const filteredDf = df.where(row => row.SomeColumn > 10);
There are multiple ways to extract a subset of data from a series or dataframe.
At the most basic skip
and take
allow a specified number of values to be skipped or taken.
const subset = series.skip(10).take(15);
head
and tail
are handy functions that can extract X elements at the start or end of the sequence:
const head = series.head(10);
const tail = series.tail(5);
A bit more advanced are skipWhile
, takeWhile
, skipUntil
and takeUntil
. These all skip or take values according to the boolean result of a predicate function:
const subset = series.skipWhile(row => somePredicate(row));
More sophisticated again are startAt
, endAt
, after
, before
and between
. These functions intelligently filter values based on the index. Note that your index must already be sorted to use these functions.
startAt
retreives all values starting at a particular index.
endAt
retreives all values ending at a particular index (inclusive).
after
retreives all values after a particluar index (exclusive).
before
retreives all values before a particular index (exclusive).
Finally between
retreives all values between two indicies (inclusive).
Series and dataframes can be concatenated:
const df1 = // ... some dataframe ...
const df2 = // ... some other dataframe ...
const concatenated = df1.concat(df2);
Multiple series or dataframes may be passed to concat:
const concatenated = df1.concat(df2, df3, df4, etc);
Or an array may be used:
const toConcat = [df2, df3, df4, etc];
const concatenated = df1.concat(toConcat);
You can also concatenate by passing an array of series or dataframes to the global data-forge functions Series.concat
or DataFrame.concat
:
const concatenated = dataForge.DataFrame.concat([df1, df2, df3, df4, etc]);
Note that the new index is a concatenation of the indices of the input dataframes and can contain duplicate values - especially when concatenating dataframes with zero-based indices. Performing operations on such a dataframe might lead to unexpected results.
To avoid duplicate index values when concatenating dataframes with zero-based indices, use resetIndex
after concatenating.
const df1 = // ... a dataframe with a zero-based index ...
const df2 = // ... another dataframe with a zero-based index ...
const concatenated = df1.concat(df2).resetIndex();
A simple merge of series or dataframes by index can be done using the merge
function:
const mergedSeries = Series.merge([series1, series2, etc]);
The merged series that is output is a series of arrays. Each array contains the values, aligned by index, across each of the input series.
You can also use the instance version of merge
to merge 1 or more series into another series:
const mergedSeries = series1.merge(series2, series3, etc);
The merge
function can also be applied to dataframes:
const mergedDf = DataFrame.merge([df1, df2, etc]);
Each row in the merged dataframe contains, aligned by index, the merged fields of each of the input dataframes. When dataframes have overlapping column names, values for those columns are overwritten by later dataframes in the list.
You can also use the instance version of merge
for dataframes:
const mergedDf = df1.merge(df2, df3, etc);
The output of a merge is always sorted ascending by index values.
Series and dataframes can be joined using the join
function as in LINQ. This performs an inner join. Data-Forge also has additional functions for outer joins: joinOuter
, joinOuterLeft
and joinOuterRight
. Thanks to Ryan Hatch for the implementation.
Following is an example translated from Pandas code on Chris Albon's blog:
const df_a = new dataForge.DataFrame({
columnNames: [
'subject_id',
'first_name',
'last_name',
],
rows: [
[1, 'Alex', 'Anderson'],
[2, 'Amy', 'Ackerman'],
// ... and more.
],
});
const df_b = new dataForge.DataFrame({
columnNames: [
'subject_id',
'first_name',
'last_name',
],
rows: [
[4, 'Billy', 'Bonder'],
[5, 'Brian', 'Black'],
// ... and more.
],
});
const df_n = new dataForge.DataFrame({
columnNames: [
"subject_id",
"test_id",
],
rows: [
[1, 51],
[2, 15],
// .. and more.
],
});
const df_new = df_a.concat(df_b);
const df_merged = df_new.join(
df_n,
left => left.subject_id,
right => right.subject_id,
(left, right) => {
return {
subject_id: left.subject_id,
first_name: left.first_name,
last_name: left.last_name,
test_id: right.test_id,
};
}
);
Series and dataframes can be zipped together in the same was in LINQ.
One or more additional series or dataframes can be passed to the zip
function. You must provide a selector that combines the values from each series or dataframe:
const zipped = df1.zip(df2, df3, (df1_row, df2_row) => mergeRows(df1_row, df2_row));
The distinct
function for Series
and DataFrame
works very much like LINQ Distinct.
The dataframe version of this function must be supplied a selector that selects which column to use for comparison:
const distinctDf = df.distinct(row => row.SomeColumn); // Compare 'SomeColumn' for distinct values.
The result is a DataFrame
with duplicate rows removed. The first index for each group of duplicates is preserved.
The series version of this function takes no parameters:
const distinctSeries = series.distinct(); // Return only distinct values in the series.
The result is a series with duplicate values removed. The first index for each group of duplicates is preserved.
The sequentialDistinct
function for Series
and DataFrame
is similar to distinct
, but only operates on sequentially distinct values.
The resulting series or dataFrame has duplicate values or rows removed, but only where the duplicates where adjacent to each other in the data sequence. The first index for each group of sequential duplicates is preserved.
Data-Forge provides various methods for grouping data. All of these methods return a series of windows. Each window is a series or dataFrame containing grouped data.
Use any of the data transformation or aggregation functions to transform a series of windows into something else.
The groupBy
function groups series or dataFrame based on the output of the user-defined selector. This works in the same way as LINQ GroupBy.
For example, grouping a dataframe with sales data by client:
const salesByClient = salesData.groupBy(row => row.ClientName);
This returns a series of data windows. Each window contains one group, each of which is a separate dataframe with only those rows that are part of the group as specified by the selector.
This can also be done with series:
const outputSeries = series.groupBy(value => value); // Can potentially select a different value here.
The output is still a series of data windows. Each group contains a separate series with only those values that are part of the group as specified by selector.
The groupSequentialBy
function for Series
and DataFrame
is similar to groupBy
, except that it only groups adjacent values or rows in the data sequence.
const outputSeries = series.groupSequentialBy(value => {
return // ... your grouping criteria ...
});
The window
function groups a Series
or DataFrame
into equally sized batches. The window passes over the series or dataframe batch-by-batch, taking the first N rows for the first window, then the second N rows for the next window and so on.
The output is a series of data windows. Each window contains the values or rows for that group.
const windowSize = 5; // Looking at 5 rows at a times.
const newSeries = series.window(windowSize); // Series that chunks the input series into groups of 5.
Use any of the data transformation functions to transform the series of windows into something else.
An example that summarizes weekly sales data:
const salesData = // ... series containing amount of sales for each day ...
const weeklySales = salesData.window(7) // Group into lots of 7 (for 7 days).
.select(window => {
return [ // Return index and value.
window.lastIndex(), // Week ending.
window.sum() // Total the amount sold during the week.
];
})
.withIndex(pair => pair[0]) // Promote index.
.select(pair => pair[1]); // Restore values.
Note that it's only necessary to 'promote the index' when we need to merge the resulting series with another series or dataframe that has the same index.
If you don't need to merge your data, the previous code snippet can be simplified drastically:
const weeklySales = salesData.window(7).select(window => window.sum());
The rollingWindow
function groups a Series
or DataFrame
into batches, this function differs from window
in that it rolls the window across the sequence value-by-value rather than batch-by-batch.
The percentChange
function that is included in Data-Forge is probably the simplest example use of rollingWindow
. It computes a new series with the percentage increase of each subsquent value in the original series.
The implementation of percentChange
looks a bit like this:
function computePctChange (window) {
const amountChange = window.last() - window.first(); // Compute amount of change.
const pctChange = (amountChange / window.first()) * 100; // Compute % change.
return pctChange;
}
const pctChangeSeries = series.rollingWindow(2).select(computePctChange);
percentChange
is simple because it only considers a window size of 2 (eg it considers each adjacent pair of values).
Now consider an example that requires a configurable window size. Here is some code that computes a simple moving average:
const smaPeriod = // ... configurable moving average period ...
const smaSeries = series.rollingWindow(smaPeriod).select(window => window.sum() / smaPeriod);
Who would have thought that computing a rolling average in JavaScript was so simple?
The variableWindow
function groups a Series
or DataFrame
into windows that have a variable amount of values per window. Adjacent values and rows are compared using a user-defined comparer. When the comparer returns true
(or truthy) adjacent data items are combined into the same group.
An example:
const outputSeries = series.variableWindow((a, b) => {
return // ... compare a and b for equality, return true if they are equal ...
};
Aggregation, reduction or summarization works as in LINQ.
Here's an example of the aggregate
function to sum a series:
const sum = series.aggregate(0, (prevValue, nextValue) => prevValue + nextValue);
Fortunately (as with LINQ) there is actually a sum
function (among other helper functions) that can do this for you, it is actually built on aggregate
so it's a nice (and simple) example. Using the sum
function we rewrite the previous example as:
const sum = series.sum();
Another good example is averaging a series where the first element in the series is used as the seed:
const average = series
.skip(1)
.average(
series.first(), // The seed
(prevValue, nextValue) => (prevValue + nextValue) / 2
);
This can be simplified by building on sum
:
const average = series.sum() / series.count();
Again though there is already an average
helper function that do this for us:
const average = series.average();
Also check out the functions for min
, max
and median
. These all help to summarise values in a series.
A dataframe can be aggregated in the same way, for example summarizing sales data:
const salesData = // ... today's sales, including Price and Revenue ...
const seed = {
TotalSales: 0,
AveragePrice: salesData.first().AveragePrice,
TotalRevenue: salesData.first().Revenue,
};
const summary = salesData
.skip(1)
.aggregate(seed,
(agg, row) => {
return {
TotalSales: agg.TotalSales + 1,
AveragePrice: (agg.AveragePrice + row.Price) / 2,
TotalRevenue: agg.TotalRevenue + row.Revenue,
};
}
);
If you are looking to summarize the data in a dataframe you can use the summarize
function which is a bit like the aggregate
function but much simpler.
Summarization takes an input dataframe and produces a single object that is a summary of all the values in the dataframe.
To summarize all values in a dataframe, simply call summarize
with no arguments:
const summary = df.summarize();
console.log(summary);
We can add parameters to restrict which columns are summarized and how their values are aggregated.
Summarize a single column as follows:
const summary = df.summarize({
ColumnToBeSummed: series => series.sum(),
});
console.log(summary);
But you probably want to summarize multiple columns:
const summary = df.summarize({
ColumnToBeSummed: series => series.sum(),
ColumnToBeAveraged: series => series.average(),
});
We can simplify our code substantially by specifying the aggregation method as a named function:
const summary = df.summarize({
ColumnToBeSummed: Series.sum,
ColumnToBeAveraged: Series.average,
ColumnToBeCounted: Series.count,
});
By the way, you aren't limited to Data-Forge's pre-defined functions, you can easily create your own:
function myAggregationFunction(someSeries) {
return // ... your own summary of the series ...
}
const summary = df.summarize({
SomeColumn: myAggregationFunction,
});
We can generalize the structure so far to the following pattern:
const summary = df.summarize({
Column1: functionThatAggregatesSeries,
Column2: functionThatAggregatesSeries,
ColumnN: ...,
});
We can take this further and produce multiple output columns from each input column:
const summary = df.summarize({
Column1: {
Column1_Sum: series => series.sum(),
Colum1_Avg: series => series.average(),
Column1_Count: series => series.count(),
},
Column2: {
Column2_Sum: series => series.sum(),
Colum2_Avg: series => series.average(),
},
});
Again we can simply provide references to functions that do the series aggregation for us:
const summary = df.summarize({
Column1: {
Column1_Sum: Series.sum,
Colum1_Avg: Series.average,
Column1_Count: Series.count,
},
Column2: {
Column2_Sum: Series.sum,
Colum2_Avg: Series.average,
},
});
Let's generalize a full pattern as follows:
const summary = df.summarize({
InputColumn1: {
OutputField1: functionThatAggregatesSeries,
OutputField2: functionThatAggregatesSeries,
...
},
InputColumn2: {
OutputField3: functionThatAggregatesSeries,
OutputField4: functionThatAggregatesSeries,
...
},
InputColumnN: ...
});
Now go forth and summarize your dataframes!
This an example of using groupBy
and aggregate
to summarize a dataframe:
// Group by client.
const summarized = salesData
.groupBy(row => row.ClientName)
.select(group => ({
ClientName: group.first().ClientName,
Amount: group.deflate(row => row.Sales).sum(), // Sum sales per client.
}))
.inflate() // Series -> dataframe.
.toArray(); // Convert to regular JS array.
Please see example 12 in the Data-Forge examples repo for a working version of this.
You can use the pivot
function for advanced reshaping and transformation of a dataframe.
Pivoting combines grouping, aggregation and sorting operations into one simple function call.
Before you try and understand how to pivot a dataframe make sure you understand how to summarize a dataframe by reading the earlier Summarize section.
Here is the simplest example of a pivot that sums groups of values:
const pivotted = df.pivot("ColumnToGroupBy", "ColumnToSum", series => series.sum());
This groups the dataframe by "ColumnToGroupBy" then for each group it aggregates the column "ColumnToSum" using the function series => series.sum()
.
As with the summarize
function we can simplify our code and specify the aggregation method using named functions, for example:
const pivotted = df.pivot("ColumnToGroupBy", "ColumnToSum", Series.sum);
We can generalize this pattern of usage as follows:
function functionThatAggregatesSeries(someSeries) {
return // ... your own summary of the series ...
}
const pivotted = df.pivot("ColumnToGroupBy", "ColumnToAggregate", functionThatAggregatesSeries);
Perhaps the best way to understand what is happening is to expand out the pivot
function, so we can see the underlying group, aggregation and sorting that pivot
is doing for you:
const pivottedDf = df.groupBy(row => row.PivotColumn)
.select(group => ({
PivotColumn: group.first().PivotColumn,
ValueColumn: group.deflate(row => row.ValueColumn).average()
}))
.orderBy(row => row.PivotColumn);
Just remember that we are still only looking at the simplest example of pivotting. We are about to go deeper. I won't show you further expanded examples of pivot, but just remember that all it's doing under the hood is group, aggregate and sort.
Now let's up the ante and use pivot to produce multiple output columns:
const pivotted = df.pivot("ColumnToGroupBy", {
ColumnToSum: series => series.sum(),
ColumnToAverage: series => series.average(),
});
We can also use named functions to specify the aggregation method for each output:
const pivotted = df.pivot("ColumnToGroupBy", {
ColumnToSum: Series.sum,
ColumnToAverage: Series.average,
});
The general pattern now look like this:
const pivotted = df.pivot("ColumnToGroupBy", {
Column1: functionThatAggregatesSeries,
Column2: functionThatAggregatesSeries,
ColumnN: ...
});
We can take this further still and produce multiple output columns for each input column:
const pivotted = df.pivot("ColumnToGroupBy", {
InputColumn1: {
OutputColumn1: series => series.sum(),
OutputColumn2: series => series.average(),
...
},
InputColumn1: {
OutputColumn3: series => series.count(),
...
},
...
});
Pivot also supports nested multi-level grouping by passing in an array of column names:
const columnsToGroupBy = ["Column1", "Column2", "etc"];
const pivotted = df.pivot(columnsToGroupBy, {
... pivot spec ...
});
The function fillGaps
works the same for both series and dataframes:
const sequenceWithGaps = // ...
// Predicate that determines if there is a gap.
const gapExists = (pairA, pairB) => {
// Returns true if there is a gap.
return true;
};
// Generator function that produces new rows to fill the game.
const gapFiller = (pairA, pairB) => {
// Create an array of index, value pairs that fill the gaps between pairA and pairB.
return [
newPair1,
newPair2,
newPair3,
];
};
const sequenceWithoutGaps = sequenceWithGaps.fillGaps(gapExists, gapFiller);
For a more concrete example, let's fill gaps in daily stock-market data (with some help from Moment.js):
const moment = require('moment');
const sequenceWithGaps = // ...
const gapExists = (pairA, pairB) => {
// Return true if there is a gap longer than a day.
const startDate = pairA[0];
const endDate = pairB[0];
const gapSize = moment(endDate).diff(moment(startDate), 'days');
return gapSize > 1;
};
const gapFiller = (pairA, pairB) => {
// Fill values forward.
const startDate = pairA[0];
const endDate = pairB[0];
const gapSize = moment(endDate).diff(moment(startDate), 'days');
const numEntries = gapSize - 1;
const startValue = pairA[1];
const newEntries = [];
for (let entryIndex = 0; entryIndex < numEntries; ++entryIndex) {
newEntries.push([
moment(pairA[0]).add(entryIndex + 1, 'days').toDate(), // New index
startValue // New value, copy the start value forward to fill the gaps.
]);
}
return newEntries;
}
const sequenceWithoutGaps = sequenceWithGaps.fillGaps(gapExists, gapFiller);
npm install --save data-forge
const mongodb = require('mongodb');
const dataForge = require('data-forge');
mongodb.MongoClient.connect("mongodb://localhost")
.then(client => {
const db = client.db("some-db");
const inputCollection = db.collection("some-collection");
const outputCollection = db.collection("some-other-collection");
return inputCollection.find()
.toArray()
.then(documents => {
const input = new dataForge.DataFrame(documents);
const output = input.select(row => transform(row));
return outputCollection.insert(output.toArray());
})
.then(() => client.close());
})
.then(() => {
console.log("Done");
})
.catch(err => {
console.error(err);
});
const request = require('request-promise');
const dataForge = require('data-forge');
request({
method: 'GET',
uri: "http://some-host/a/rest/api",
json: true,
})
.then(data {
const input = new dataForge.DataFrame(data);
const output = input.select(row => transform(row));
return request({
method: 'POST',
uri: "http://some-host/another/rest/api",
body: {
data: output.toArray()
},
json: true,
});
})
.then(function () {
console.log('Done!');
})
.catch(function (err) {
console.error(err);
});
bower install --save data-forge
This example depends on the jQuery get function.
Include in HTML
<script src="bower_components/jquery/dist/jquery.js"></script>
<script src="bower_components/data-forge/data-forge.js"></script>
Javascript for JSON
const url = "http://somewhere.com/rest/api";
$.get(url, function (data) {
const df = new dataForge.DataFrame(data);
// ... work with the data frame ...
});
const df = ...
$.post(url, df.toArray(), function (data) {
// ...
});
Javascript for CSV
const url = "http://somewhere.com/rest/api";
$.get(url, function (data) {
const df = dataForge.fromCSV(data);
// ... work with the data frame ...
});
const df = ...
$.post(url, df.toCSV(), function (data) {
// ...
});
Install
bower install --save data-forge
Include in HTML
<script src="bower_components/angular/angular.js"></script>
<script src="bower_components/data-forge/data-forge.js"></script>
Javascript
// Assume [$http](https://docs.angularjs.org/api/ng/service/$http) is injected into your controller.
const url = "http://somewhere.com/rest/api";
$http.get(url)
.then(function (data) {
const df = new dataForge.DataFrame(data);
// ... work with the data frame ...
})
.catch(function (err) {
// ... handle error ...
});
const df = ...
$http.post(url, df.toArray())
.then(function () {
// ... handle success ...
})
.catch(function (err) {
// ... handle error ...
});