Skip to content

FormalLanguageConstrainedPathQuerying/GLL4Graph

 
 

Repository files navigation

example workflow

GLL4Graph

This is the implementation of the GLL-based context-free path querying (CFPQ) algorithm. It is based on a high-performance GLL parsing algorithm implemented in Iguana project. Then, it was modificated to support graph-structured input data. Proposed solution solves both reachability and all paths problems for all pairs and multiple sources cases.

Performance

The proposed solution has been evaluated on several real-world graphs for both the all pairs and the multiple sources scenarios. The evaluation shows that the proposed solution is more than 45 times faster than the previous solution for Neo4j and is comparable, in some scenarios and cases, with the linear algebra based solution.

Machine configuration: PC with Ubuntu 20.04, Intel Core i7-6700 3.40GHz CPU, DDR4 64Gb RAM.

Enviroment configuration:

  • Java HotSpot(TM) 64-Bit server virtual machine (build 15.0.2+7-27, mixed mode, sharing).
  • JVM heap configuration: 55Gb both xms and xmx.
  • Neo4j 4.0.3 is used. Almost all configurations are default except two:
    • total off-heap transaction memory (tx_state_max_off_heap_memory parameter) is set to 24Gb
    • pagecache_warmup_enabled is set to true.

Graphs

The graph data is selected from CFPQ_Data dataset. There are two types of graph:

  • Graphs related to RDF analysis problems
  • Graphs related to static code analysis problems

A detailed description of the graphs is listed below.

RDF analysis

Graph name |V| |E| #subClassOf #type #broaderTransitive
Core 1 323 2 752 178 0 0
Pathways 6 238 12 363 3 117 3 118 0
Go hierarchy 45 007 490 109 490 109 0 0
Enzyme 48 815 86 543 8 163 14 989 8 156
Eclass_514en 239 111 360 248 90 962 72 517 0
Geospecies 450 609 2 201 532 0 89 065 20 867
Go 582 929 1 437 437 94 514 226 481 0
Taxonomy 5 728 398 14 922 125 2 112 637 2 508 635 0

Static code analysis

Graph name |V| |E| #a #d
Apache 1 721 418 1 510 411 362 799 1 147 612
Block 3 423 234 2 951 393 669 238 2 282 155
Fs 4 177 416 3 609 373 824 430 2 784 943
Ipc 3 401 022 2 931 498 664 151 2 267 347
Lib 3 401 355 2 931 880 664 311 2 267 569
Mm 2 538 243 2 191 079 498 918 1 692 161
Net 4 039 470 3 500 141 807 162 2 692 979
Postgre 5 203 419 4 678 543 1 209 597 3 468 946
Security 3 479 982 3 003 326 683 339 2 319 987
Sound 3 528 861 3 049 732 697 159 2 352 573
Init 2 446 224 2 112 809 481 994 1 630 815
Arch 3 448 422 2 970 242 6 712 95 2 298 947
Crypto 3 464 970 2 988 387 678 408 2 309 979
Drivers 4 273 803 3 707 769 858 568 2 849 201
Kernel 11 254 434 9 484 213 1 981 258 7 502 955

Field-sensitive points-to analysis

Graph name |V| |E|
sunflow 15 464 15 957
lusearch 15 774 14 994
luindex 18 532 17 375
avrora 24 690 25 196
eclipse 41 383 40 200
h2 44 717 56 683
pmd 54 444 59 329
xalan 58 476 62 758
batik 60 175 63 089
fop 86 183 83 016
tomcat 111 327 110 884
jython 191 895 260 034
tradebeans 439 693 466 969
tradesoap 440 680 468 263

Grammars

All queries used in evaluation are variants of same-generation query. The inverse of an x relation and the respective edge is denoted as x_r.


Grammars used for RDF graphs:

G1

S -> subClassOf_r S subClassOf | subClassOf_r subClassOf 
     | type_r S type | type_r type

G2

S -> subClassOf_r S subClassOf | subClassOf

Geo

S -> broaderTransitive S broaderTransitive_r
     | broaderTransitive broaderTransitive_r 

Grammar used for static code analysis graphs:

PointsTo

M -> d_r V d
V -> (M? a_r)* M? (a M?)* 

Grammar template used for field-sensitive points-to analysis:

PointsTo -> (assign | load_f Alias store_f )* alloc
Alias -> PointsTo FlowsTo
FlowsTo -> alloc_r (assign_r | store_f_r Alias load_f_r)*

TermiFor all fields f the terminals load_f, store_f are generated

Results

The results of the all pairs reachability queries evaluation on graphs related to RDF analysis are listed below.

The sign ’–’ in cells means that the respective query is not applicable to the graph, so time is not measured.

Graph name G1 G2 Geo
time (sec) #answer time (sec) #answer time (sec) #answer
Core 0,02 204 0,01 214
Pathways 0,07 884 0,04 3117
Go hierarchy 3,68 588 976 5,42 738 937
Enzyme 0,22 396 0,17 8163 5,7 14 267 542
Eclass 1,5 90 994 0,97 96 163
Geospecies 2,89 85 2,65 0 145,8 226 669 749
Go 5,56 640 316 4,24 659 501
Taxonomy 45,47 151 706 36,06 2 112 637

The evaluation results for single source CFPQ for graphs related to RDF analysis and G1, G2, Geo grammars respectively in reachability and all paths scenarious:

time time time

The results for graphs related to static code analysis are compared to results of Azimov’s CFPQ algorithm based on matrix operations. The implementation from CFPQ_PyAlgo was taken as the implementation of the matrix CFPQ algorithm. This library contains the implementation for both scenarios, all pairs reachability and single source reachability. To perform matrix operations pygraphblas is used. Pygraphblas is a python wrapper over the SuiteSparse library, which based on the GraphBLAS framework.

The results of the all pairs reachability queries evaluation on graphs related to static code analysis are listed below.

The sign ’–’ in cells means that the respective query and graph require a considerable amount of memory during algorithm execution that leads to unpredictable time to get the result.

Graph name PointsTo
Neo4j time (sec) GraphBLAS time (sec) #answer
Apache 536,7 92 806 768
Block 113,01 123,88 5 351 409
Fs 167,73 105,72 9 646 475
Ipc 109,43 79,52 5 249 389
Lib 111,09 121,79 5 276 303
Mm 77,92 84,15 3 990 305
Net 160,64 206,29 8 833 403
Postgre 969,88 90 661 446
Security 115,75 181,7 5 593 387
Sound 120,14 133,64 6 085 269
Init 87,25 45,84 3 783 769
Arch 130,77 119,92 5 339 563
Crypto 128,8 122,09 5 428 237
Drivers 371,18 279,39 18 825 025
Kernel 614,047 378,05 16 747 731

The evaluation results for single source CFPQ for graphs related to static code analysis and pointsTo grammar in reachability and all paths scenarious:

time

The results of the all pairs reachability queries evaluation on graphs related to field-sensitive points-to analysis. The results are compared to results of .NET GLL implementation.

Graph name Reachability All paths
#answer Neo4j time (sec) In memory time (sec) .NET time (sec) Neo4j time (sec) In memory time (sec) .NET time (sec)
avrora 21 532 5.022 2.168 5.022 7.879 5.042 2.984
batik 45 968 12.659 5.656 2.087 30.282 30.137 6.349
eclipse 21 830 5.712 2.223 1.245 14.095 13.418 3.557
fop 76 615 19.400 9.394 4.649 53.721 45.088 15.353
h2 92 038 19.643 13.507 19.065 30.414 23.375 108.554
jython 561 720 OOM OOM 1283.518 OOM OOM OOM
luindex 9 677 2.121 0.874 0.438 3.782 2.326 2.361
lusearch 9 242 1.417 0.547 0.285 2.481 1.614 1.419
pmd 60 518 32.193 23.034 45.487 46.522 36.269 OOT
sunflow 16 354 0.878 0.397 0.340 2.074 1.854 1.733
tomcat 82 424 58.637 26.187 3.872 124.182 88.235 24.168
tradebeans 696 316 OOT OOM 134.641 OOT OOM OOM
tradesoap 698 567 OOT OOM 134.321 OOT OOM OOM
xalan 52 382 24.140 10.991 4.490 39.168 34.247 62.832

Download and build

The project is build with Maven.

git clone https://github.com/JetBrains-Research/GLL4Graph
cd GLL4Graph
mvn compile

Usage

To replicate experiments:

  • make the directory to save results
mkdir results
  • run the following command with arguments
mvn exec:java -Dexec.mainClass="benchmark.GraphBenchmark" -Dexec.args="aguments"
  • print help
mvn exec:java -Dexec.mainClass="benchmark.GraphBenchmark" -Dexec.args="-h"
usage: GraphBenchmark [-h] -d <dataset name> -gm <path> -gp <path> -gs <storage type> -m <number> -p <problem type> -S <s=value1 a=value2> -w <number>
 -d,--dataset <dataset name>            The name of the dataset, an important component of the file name with the results
 -gm,--grammar <path>                   Path to JSON file contains context-free grammar
 -gp,--graph <path>                     Path to directory contains files nodes.csv and edges.csv
 -gs,--graph_storage <storage type>     Graph storage type, allowed values: NEO4J, IN_MEMORY
 -h,--help                              Print help message
 -m,--measurement_iterations <number>   Number of measurement iterations
 -p,--problem <problem type>            Benchmarking algorithm, allowed values: REACHABILITY, ALL_PATHS
 -S,--scenario <s=value1 a=value2>      Benchmarking scenario and its argument, 's' property allowed values: ALL_PAIRS, MULTIPLE_SOURCES, 'a' property
                                        contains number of nodes if 's' equals ALL_PAIRS or path to file with vertices chunks if 's' equals
                                        MULTIPLE_SOURCES
 -w,--warmup_iterations <number>        Number of warm-up iterations

Example

Here is an example which can be run right after project is downloaded and results directory is created without any additional preparation.

To run all pairs reachability algorithm on Core graph on G1 grammar use the following command:

mvn exec:java -Dexec.mainClass="benchmark.GraphBenchmark" -Dexec.args="-d Core -gm test/resources/grammars/graph/g1/grammar.json -gp data/core -gs NEO4J -p REACHABILITY -S -s ALL_PAIRS -a $(( $(cat "data/core/nodes.csv" | wc -l)-1 )) -w 1 -m 10"

Data

To get more graph data examples use Python script:

graph_loader.py --graph {graph_name} --relationships {comma_separated_relationships_list}

For example:

graph_loader.py --graph core --relationships subClassOf,type

License

This project is licensed under OpenBSD License. License text can be found in the license file.

Languages

  • Java 95.1%
  • Scala 3.1%
  • Rascal 1.0%
  • RouterOS Script 0.5%
  • OCaml 0.1%
  • Python 0.1%
  • Other 0.1%