MDS (Multiple Dimension Spread) is a schema-less columnar storage format. It provides a flexible, JSON-like data representation together with efficient reads comparable to other columnar storage formats.
In the big data era, data has become too large to simply compress and store as is. To improve compression ratio and read performance, several columnar data formats (for example, Apache ORC and Apache Parquet) were proposed. They achieve a high compression ratio because similar values are stored together in a column, and good read performance because data is grouped by column and only the columns that are used need to be read.
However, these formats require the structure of a row (or record) to be defined before the data is saved. How the data will be used therefore has to be decided at storage time, and it is often difficult to decide in advance what kind of data will be needed.
This project provides a new columnar format that does not require a schema at storage time, with compression and read performance equal to (or in some cases better than) that of other formats.
Analyzing big data requires storing data compactly and reading it efficiently. As a columnar format, MDS meets these needs.
A data lake is a data pool that does not require the structure (schema) of a row to be defined at storage time. The schema is applied to the stored data when it is analyzed. See DataLake.
First, get the MDS-related repositories by following the section named "How to get source".
The MDS format can be used without a Hadoop environment. However, since it is intended for big data, using it efficiently requires a Hadoop environment for storage and Hive for reading.
We plan to provide a Docker environment with Hadoop and Hive for testing, but for now you need to set up Hadoop and Hive yourself first.
The CLI is a command-line tool for working with MDS. The following tools are provided.
- bin/setup.sh # for gathering MDS related jars
- bin/mds.sh # create mds data, and show data
mds.sh needs some jars, so please build the jar files before using it.
$ mvn package
To prepare, get the MDS jars and store them in the proper directories.
$ bin/setup.sh # get MDS jars from Maven repository (bin/setup.sh -h for help)
Then put the MDS-related jars on HDFS.
$ cp -r jars/mds /tmp/mds_lib
$ hdfs dfs -put /tmp/mds_lib /mds_lib
Convert JSON data to MDS format.
$ bin/mds.sh create -i src/example/src/main/resources/sample_json.txt -f json -o /tmp/sample.mds
$ bin/mds.sh cat -i /tmp/sample.mds -o '-' # show whole data
{"summary":{"total_price":550,"total_weight":412},"number":5,"price":110,"name":"apple","class":"fruits"}
{"summary":{"total_price":800,"total_weight":600},"number":10,"price":80,"name":"orange","class":"fruits"}
$ bin/mds.sh cat -i /tmp/sample.mds -o '-' -p '[ ["name"] ]' # show part of data
{"name":"apple"}
{"name":"orange"}
Copy the MDS file to HDFS.
$ hdfs dfs -mkdir -p /tmp/ss
$ hdfs dfs -put /tmp/sample.mds /tmp/ss/sample.mds
Start Hive and add the jar files needed to use the MDS format.
$ hive -i jars/mds/add_jar.hql
> create database test;
> use test;
> create external table sample_json (
summary struct<total_price: bigint, total_weight: bigint>,
number bigint,
price bigint,
name string,
class string
)
ROW FORMAT SERDE
'jp.co.yahoo.dataplatform.mds.hadoop.hive.MDSSerde'
STORED AS INPUTFORMAT
'jp.co.yahoo.dataplatform.mds.hadoop.hive.io.MDSHiveLineInputFormat'
OUTPUTFORMAT
'jp.co.yahoo.dataplatform.mds.hadoop.hive.io.MDSHiveParserOutputFormat'
location '/tmp/ss';
> select * from sample_json;
{"total_price":550,"total_weight":412} 5 110 apple fruits
{"total_price":800,"total_weight":600} 10 80 orange fruits
See the Hive document for further details.
Support and discussion of MDS take place on the mailing list. Please refer to the following subsection named "How to contribute".
We plan to provide support and host discussion of MDS on the mailing list. Until the mailing list is opened, please contact us via GitHub.
We welcome a wide range of contributions to this project.
See document MDS
This project is released under the Apache License. Please use it in accordance with this license.
User support and discussion of MDS development take place on the following mailing list. To subscribe or unsubscribe, send a blank e-mail to the corresponding address.
- subscribe: [email protected]
- unsubscribe: [email protected]
The archive is useful for seeing what has been discussed in this project.
Please accept the Contributor License Agreement when participating as a developer.
Once you have posted to the above mailing list, we will invite you to JIRA, which is used for bug tracking.
The following environments are required.
- Mac OS X or Linux
- Java 8 Update 92 or higher (8u92+), 64-bit
- Maven 3.3.9 or later (for building)
- Hadoop 2.7.3 or later
- Hive 2.0 or later (for reading data)
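The requirements above can be partly checked before building. The following is a small sketch that only reports whether each tool is on `PATH`; it does not compare version numbers. The helper name `check_tool` is introduced here for illustration and is not part of MDS.

```shell
#!/bin/sh
# Report whether a command is available on PATH.
# check_tool is a hypothetical helper, not something shipped with MDS.
check_tool() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "found: $1"
  else
    echo "missing: $1"
    return 1
  fi
}

# Tool names correspond to the requirements listed above.
missing=0
for tool in java mvn hadoop hive; do
  check_tool "$tool" || missing=$((missing + 1))
done
echo "missing tools: $missing"
```

For the tools that are found, `java -version`, `mvn -v`, `hadoop version`, and `hive --version` print the installed versions to compare against the list.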
The MDS library builds its jar files from the following modules.
- multiple-dimension-spread
- dataplatform-config
- dataplatform-schema-lib
MDS sources are there.
Install gpg and create a gpg key, which the Maven plugin requires when you build from a git clone.
gpg --gen-key
gpg --list-keys
Add the following gpg setting to maven-local-repository-home/conf/settings.xml. Usually, maven-local-repository-home is $HOME/.m2.
<profiles>
<profile>
<id>sign</id>
<activation>
<activeByDefault>true</activeByDefault>
</activation>
<properties>
<gpg.passphrase>***YOUR-PASSPHRASE***</gpg.passphrase>
</properties>
</profile>
</profiles>
The following MDS modules can be obtained from the Maven repository.
- multiple-dimension-spread-arrow
- multiple-dimension-spread-common
- multiple-dimension-spread-hadoop
- multiple-dimension-spread-hive
- multiple-dimension-spread-schema
Compile each source by following the instructions below.
$ cd /local/mds/home
$ git clone https://github.com/yahoojapan/multiple-dimension-spread.git
$ cd multiple-dimension-spread
$ mvn clean install
$ cd /local/mds/home
$ git clone https://github.com/yahoojapan/dataplatform-schema-lib.git
$ cd dataplatform-schema-lib
$ mvn clean install
$ cd /local/mds/home
$ git clone https://github.com/yahoojapan/dataplatform-config.git
$ cd dataplatform-config
$ mvn clean install
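The three clone-and-build steps above follow the same pattern, so they can be scripted as a loop. This is a sketch under stated assumptions: the repository names and URL pattern come from the commands above, while the `MDS_HOME` and `DRY_RUN` variables and the `run` and `build_all` helpers are hypothetical names introduced here. By default it only prints the commands it would run.

```shell
#!/bin/sh
# Repositories in the same order as the manual steps above.
REPOS="multiple-dimension-spread dataplatform-schema-lib dataplatform-config"

# DRY_RUN and MDS_HOME are hypothetical variables for this sketch.
# DRY_RUN defaults to 1 so that nothing is cloned or built by accident.
DRY_RUN=${DRY_RUN:-1}

run() {
  # Print the command in dry-run mode, otherwise execute it.
  if [ "$DRY_RUN" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

build_all() {
  base=${MDS_HOME:-/local/mds/home}
  for repo in $REPOS; do
    # Clone into $base/<repo>, then build using that repository's pom.xml.
    run git clone "https://github.com/yahoojapan/$repo.git" "$base/$repo" || return 1
    run mvn -f "$base/$repo/pom.xml" clean install || return 1
  done
}

build_all
```

Running the script with `DRY_RUN=0` set in the environment executes the commands instead of printing them, stopping at the first failure.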