
Speed up RITA parser #654

Merged
merged 24 commits into master from speed-up-parser on Aug 20, 2021

Conversation

@Zalgo2462 Zalgo2462 (Contributor) commented Jul 7, 2021

Goal

This PR speeds up the RITA parser for both JSON and TSV logs. In one benchmark parsing a large set of JSON files, this PR executes in a quarter of the time needed by the master branch.

Note to the reviewer

I recommend reviewing this PR commit by commit, as it contains a large number of changes. If that is overwhelming, we can split the individual commits off into their own PRs.

Changes

  • In activecm/rita@bfed5a9, I refactored the dependencies for the FSImporter. This was important because any profiling done on the FSImporter simply reported that parseFiles was taking a long time. parseFiles contained the parsing logic for each type of log file. In order to break up this method, I needed to simplify the method's dependencies.

  • In activecm/rita@05627e5, the logic for parsing each type of file has been moved to its own file. Additionally, breaking up this logic allowed me to break up the mutex that was used to make writing out the parsing results threadsafe. In this revision, each map within the parse results has its own mutex, allowing for greater parallelism (a sketch of this per-map locking pattern follows this list).

  • activecm/rita@741a9cd adds a printed message telling the user how long it took to read in a batch of logs. Additionally, this commit embeds profiling logic as comments in the code to make it easier to enable CPU and RAM profiling. We can revert this commit if need be.

  • In activecm/rita@ae7cf37, I replaced calls to strings.Split with calls to strings.Index and slice operations in order to reduce string allocations in the TSV parsing code (sketched after this list). Since this section of code was called for each line we parsed, the targeted allocations created a good deal of garbage that the GC had to constantly clean up.

  • In activecm/rita@ddef24b, strconv.ParseInt was replaced with strconv.Atoi because the Go standard library has a fast path for small numbers in the latter but not the former (sketched after this list). The ParseInt method was identified as a major time sink in the TSV parser via go tool pprof.

  • In activecm/rita@a0cfecd, the map which tells the TSV parser which struct field each sequential token maps to was replaced with an array (sketched after this list). The profiler indicated that the TSV parser spent the majority of its time hashing field names. Since the fields were always accessed sequentially, it was possible to map from the sequential index directly to the target struct fields.

  • In activecm/rita@0e911ee and activecm/rita@dbf8ce9, the Go gzip library was replaced with calls to the system pigz or gzip commands (sketched after this list). The profiler indicated that much of the time in the parse loop was spent decompressing files. After a Google search, I found that Docker has run into similar issues (Gzip performance moby/moby#10181) and implemented a fix similar to theirs. This saved a great deal of time for both the TSV and JSON parsers.

  • In activecm/rita@0616f3e, I replaced the standard JSON parser with a faster library (sketched after this list). The profiler indicated that much of the time spent by the JSON parser was deep in the Go standard library. After a few searches, I found that Prometheus, Docker, Grafana, and many other well-known Go projects use the json-iter package to speed up their JSON marshalling. It was a drop-in replacement and yielded a massive speedup.

  • In activecm/rita@cd8dd38 and activecm/rita@180de84, I converted the UniqueIPSet implementation from a slice to a hashmap (sketched after this list). The profiler indicated that the parsing code spent a great deal of time running linear searches against the UniqueIPSets when processing hostnames. Switching the implementation to a hashmap sped up the process considerably at the expense of around 4x the RAM for the sets themselves; however, overall RAM usage has not grown considerably.

  • In activecm/rita@1cf8e14, I changed the batching limit to take the greater of the old default (4GB) or half of the system's total memory. This allows us to run the MongoDB analysis phase quicker when the system can support it.

  • In activecm/rita@c4b5b3f and activecm/rita@12215e0, I replaced array-based string sets which used calls to StringInSlice with map-based string sets similar to the UniqueIP sets introduced in activecm/rita@cd8dd38.

  • In activecm/rita@43a7dc7, I added a bit of code in util.ContainsIP and util.IPIsPubliclyRoutable to cache the result of net.IP.To4(), which is called by many of the net API methods (sketched after this list). It turns out that when you parse an IPv4 address in Go, the IP is stored using 16 bytes, and the net API then slices out the 4 bytes it needs via To4() in each of its methods.

  • In activecm/rita@4f78c1e, I reordered the checks in data.UniqueIP to prevent parsing empty UUIDs. In previous versions, we'd call uuid.Parse("") at least twice for every log entry. Each of these calls would result in allocating an error object on the heap that we'd immediately throw away, creating a good deal of work for the garbage collector.

  • In activecm/rita@3028e62, I fixed up the SSL parser to increment the hostMap source/destination counters, addressing issue #684 (host collection connection counts are undercounted).
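
A minimal sketch of the per-map locking pattern described in activecm/rita@05627e5, assuming a simplified results struct; the type and field names here are illustrative, not RITA's actual ones.

```go
package sketch

import "sync"

// Illustrative only: each results map is paired with its own mutex, so a goroutine
// updating host results no longer blocks a goroutine updating connection results
// (previously a single mutex guarded every map).
type parseResults struct {
	hostLock sync.Mutex
	hosts    map[string]int

	uconnLock sync.Mutex
	uconns    map[string]int
}

func newParseResults() *parseResults {
	return &parseResults{
		hosts:  make(map[string]int),
		uconns: make(map[string]int),
	}
}

func (r *parseResults) addHost(key string) {
	r.hostLock.Lock()
	defer r.hostLock.Unlock()
	r.hosts[key]++
}

func (r *parseResults) addUconn(key string) {
	r.uconnLock.Lock()
	defer r.uconnLock.Unlock()
	r.uconns[key]++
}
```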
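
A minimal sketch of the strings.Split → strings.Index change from activecm/rita@ae7cf37; the helper name and the tab separator are assumptions for illustration.

```go
package sketch

import "strings"

// Illustrative only: walk the line with strings.Index and hand out sub-slices of the
// original string, rather than allocating a fresh []string for every line with
// strings.Split.
func forEachField(line string, fn func(field string)) {
	for {
		i := strings.Index(line, "\t")
		if i < 0 {
			fn(line)
			return
		}
		fn(line[:i]) // sub-slice shares the line's backing bytes; no new allocation
		line = line[i+1:]
	}
}
```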
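
A minimal sketch of the strconv.ParseInt → strconv.Atoi swap from activecm/rita@ddef24b; the token value is made up.

```go
package main

import (
	"fmt"
	"strconv"
)

func main() {
	tok := "443"

	// Before: always goes through the general-purpose base/bit-size parser.
	before, _ := strconv.ParseInt(tok, 10, 64)

	// After: Atoi takes a fast path for small decimal numbers and returns an int,
	// which can be widened to int64 where a struct field requires it.
	after, _ := strconv.Atoi(tok)

	fmt.Println(before, int64(after))
}
```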
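
A minimal sketch of the hashmap → array field mapping from activecm/rita@a0cfecd; the function and parameter names are hypothetical.

```go
package sketch

// Illustrative only: instead of hashing each Zeek field name for every token on
// every line, build a token-index -> struct-field-index table once per file header
// and index into it while parsing.
func buildFieldOffsets(headerFields []string, nameToStructField map[string]int) []int {
	offsets := make([]int, len(headerFields))
	for i, name := range headerFields {
		if idx, ok := nameToStructField[name]; ok {
			offsets[i] = idx
		} else {
			offsets[i] = -1 // token i does not map onto the target struct
		}
	}
	return offsets
}
```

The per-line loop then becomes a plain `offsets[i]` lookup with no hashing.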
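
A minimal sketch of shelling out to pigz/gzip as in activecm/rita@0e911ee and activecm/rita@dbf8ce9, in the spirit of the Docker fix linked above; the exact flags, fallback logic, and error handling in the PR may differ.

```go
package sketch

import (
	"io"
	"os/exec"
)

// Illustrative only: decompress by piping the file through pigz, falling back to
// gzip when pigz is not installed. In real code the caller must also drain the
// reader and call cmd.Wait to reap the child process.
func openGzip(path string) (io.ReadCloser, error) {
	bin := "pigz"
	if _, err := exec.LookPath(bin); err != nil {
		bin = "gzip"
	}
	cmd := exec.Command(bin, "-d", "-c", path) // decompress to stdout
	stdout, err := cmd.StdoutPipe()
	if err != nil {
		return nil, err
	}
	if err := cmd.Start(); err != nil {
		return nil, err
	}
	return stdout, nil
}
```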
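
A minimal sketch of the drop-in JSON library swap from activecm/rita@0616f3e, assuming github.com/json-iterator/go; the Zeek entry struct here is a trimmed-down stand-in.

```go
package sketch

import (
	jsoniter "github.com/json-iterator/go"
)

// Illustrative only: json-iterator exposes a config that mimics encoding/json,
// so existing Unmarshal call sites can stay unchanged.
var json = jsoniter.ConfigCompatibleWithStandardLibrary

type connEntry struct {
	TS    float64 `json:"ts"`
	UID   string  `json:"uid"`
	Proto string  `json:"proto"`
}

func parseLine(line []byte) (connEntry, error) {
	var entry connEntry
	err := json.Unmarshal(line, &entry) // same signature as encoding/json
	return entry, err
}
```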
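
A minimal sketch of the slice → map set conversion from activecm/rita@cd8dd38 and activecm/rita@180de84; the UniqueIP fields and the key function are simplified stand-ins for RITA's actual types.

```go
package sketch

// Illustrative only: a map-keyed set replaces the slice that was searched linearly
// on every insert.
type UniqueIP struct {
	IP          string
	NetworkUUID string
}

func (u UniqueIP) mapKey() string { return u.IP + u.NetworkUUID }

// UniqueIPSet holds at most one instance of each UniqueIP.
type UniqueIPSet map[string]UniqueIP

// Insert adds the IP to the set; a duplicate insert is an O(1) no-op instead of an O(n) scan.
func (s UniqueIPSet) Insert(ip UniqueIP) {
	s[ip.mapKey()] = ip
}

// Items converts the set back to a slice for callers that need indexed access.
func (s UniqueIPSet) Items() []UniqueIP {
	items := make([]UniqueIP, 0, len(s))
	for _, ip := range s {
		items = append(items, ip)
	}
	return items
}
```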
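
A minimal sketch of the To4() caching from activecm/rita@43a7dc7; the private-block parameter is passed in here only to keep the example self-contained.

```go
package sketch

import "net"

// Illustrative only: net.IP stores a parsed IPv4 address as 16 bytes, and each
// ip.IsXXX call internally re-derives the 4-byte form. Converting once up front
// lets every later check operate on the already-sliced value.
func isPubliclyRoutable(ip net.IP, privateBlocks []*net.IPNet) bool {
	if ip4 := ip.To4(); ip4 != nil {
		ip = ip4 // cache the IPv4 view so the checks below don't redo the conversion
	}
	if ip.IsLoopback() || ip.IsLinkLocalUnicast() || ip.IsLinkLocalMulticast() {
		return false
	}
	for _, block := range privateBlocks {
		if block.Contains(ip) {
			return false
		}
	}
	return true
}
```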

Performance Testing

I've tested the changes on a large JSON dataset as well as several small TSV datasets.
The JSON dataset consisted of 3414 gzipped Zeek logs of which 2063 were conn, http, ssl, and dns logs. The 2063 logs that needed to be processed by RITA totaled 70.385 GB.

While testing for performance, I used Go's pprof tool to find the sections of code which took up the majority of the execution time.
I commented out the lines in fsimporter.go which build the MongoDB collections from the parse results, beginning with

// Set chunk before we continue so if process dies, we still verify with a delete if
// any data was written out.
fs.metaDB.SetChunk(fs.config.S.Rolling.CurrentChunk, fs.database.GetSelectedDB(), true)

down to

// record file+database name hash in metadatabase to prevent duplicate content
fmt.Println("\t[-] Indexing log entries ... ")
err := fs.metaDB.AddNewFilesToIndex(indexedFileBatch)
if err != nil {
	fs.log.Error("Could not update the list of parsed files")
}

I added a line to the original master branch to print out how long it took to import each batch of files, and ran it against the large JSON dataset on a DO droplet with (virtualized) SSD storage, 16GB of RAM, and 8 CPU cores. The results were as follows:

Master branch
        [-] Finished parsing logs in 5m28.344s
        [-] Finished parsing logs in 8m5.327s
        [-] Finished parsing logs in 10m8.033s
        [-] Finished parsing logs in 15m42.579s
        [-] Finished parsing logs in 20m5.076s
        [-] Finished parsing logs in 19m41.801s
        [-] Finished parsing logs in 19m4.292s
        [-] Finished parsing logs in 15m17.52s
        [-] Finished parsing logs in 15m12.626s
        [-] Finished parsing logs in 15m22.514s
        [-] Finished parsing logs in 15m52.611s
        [-] Finished parsing logs in 12m2.84s
        [-] Finished parsing logs in 7m47.568s
        [-] Finished parsing logs in 9m22.946s
        [-] Finished parsing logs in 7m12.611s
        [-] Finished parsing logs in 6m58.84s
        [-] Finished parsing logs in 9m58.732s
        [-] Finished parsing logs in 7m56.549s
        TOTAL: 3h41m20.809s

[image: customer-cpu-parse-master-pprof]

Compare this to the results after the PR changes:

        [-] Finished parsing logs in 3m7.787s
        [-] Finished parsing logs in 6m45.158s
        [-] Finished parsing logs in 6m5.804s
        [-] Finished parsing logs in 6m16.351s
        [-] Finished parsing logs in 6m14.18s
        [-] Finished parsing logs in 5m57.668s
        [-] Finished parsing logs in 6m30.06s
        [-] Finished parsing logs in 5m37.612s
        [-] Finished parsing logs in 5m23.952s
        TOTAL: 51m58.572s

[image: customer-cpu-parse-4f78c1e-pprof]

I've attached high-resolution SVGs of the images above, as well as the raw pprof data, which can be opened via go tool pprof.
pprof_data.zip

Logan L added 8 commits July 15, 2021 13:02
…with open connections where the host map's counters for unexpected protocol port service tuples weren't being incremented
…ted extra strings, function runs ~1.2x faster now, and GC is doing a small bit better
…eld offsets using an array. We previously mapped from each Zeek field's name to the offsets using a hashmap. This took a lot of time since the code was executed a lot.
…r specifically uses pigz for this purpose.
Logan L added 4 commits July 15, 2021 18:10
… either 4GB (as before) or half of system RAM. Note that RAM usage is much lower than the batch limit since we don't store every line we read.
@Zalgo2462 Zalgo2462 marked this pull request as ready for review July 17, 2021 01:39
@Zalgo2462 (Contributor Author) commented:

The latest master merge commit activecm/rita@93d094a implements #665

@Zalgo2462 Zalgo2462 left a comment

Left some comments noting where the different commits pop up in the PR.

@@ -232,12 +232,12 @@ func (i *Importer) run() error {
}
i.res.Config.S.Rolling = rollingCfg

-importer := parser.NewFSImporter(i.res, i.threads, i.threads, i.importFiles)
+importer := parser.NewFSImporter(i.res)
@Zalgo2462 (Contributor Author) commented:

FSImporter was refactored such that the threads and file set are now passed in via importer.CollectFileDetails and importer.Run.

activecm/rita@bfed5a9

@@ -259,7 +259,20 @@ func (i *Importer) run() error {
fmt.Printf("\t[+] Non-rolling database %v will be converted to rolling\n", i.targetDatabase)
}

importer.Run(indexedFiles)
/*
@Zalgo2462 (Contributor Author) commented:

This could be removed if desired.

activecm/rita@741a9cd

@@ -288,7 +301,7 @@ func (i *Importer) handleDeleteOldData() error {

// Remove the analysis results for the chunk
targetChunk := i.res.Config.S.Rolling.CurrentChunk
-removerRepo := remover.NewMongoRemover(i.res)
+removerRepo := remover.NewMongoRemover(i.res.DB, i.res.Config, i.res.Log)
@Zalgo2462 (Contributor Author) commented:

All of the packages under pkg have been refactored to take in the DB, Config, and Log structures separately in their constructors.

activecm/rita@bfed5a9

)

// https://gist.github.com/harshavardhana/327e0577c4fed9211f65#gistcomment-2557682
func duration(d time.Duration) string {
@Zalgo2462 (Contributor Author) commented:

This has been moved into the util package. I have added a line which prints out how long it took to read in the files from the filesystem and I wanted to reuse the code here for formatting.

activecm/rita@741a9cd

@@ -6,7 +6,7 @@ import (
"time"

"github.com/activecm/rita/config"
fpt "github.com/activecm/rita/parser/fileparsetypes"
@Zalgo2462 (Contributor Author) commented:

I've moved around some of the files in the parser/ directory in order to separate logic dependent on the file system from the logic involved in aggregating log entries.

In this case IndexedFile has moved from parser/indexedfile.go to parser/files/indexing.go.

activecm/rita@bfed5a9

@@ -60,21 +60,23 @@ func (a *analyzer) start() {
// set up writer output
var output update

if len(datum.OrigIps) > 10 {
datum.OrigIps = datum.OrigIps[:10]
origIPs := datum.OrigIps.Items()
@Zalgo2462 (Contributor Author) commented:

Note that .Items() must be called to convert the map into an array.

activecm/rita@cd8dd38
activecm/rita@180de84

-indexingThreads int, parseThreads int, importFiles []string) *FSImporter {
+func NewFSImporter(res *resources.Resources) *FSImporter {
// set batchSize to the max of 4GB or a half of system RAM to prevent running out of memory while importing
batchSize := int64(util.MaxUint64(4*(1<<30), (memory.TotalMemory() / 2)))
@Zalgo2462 (Contributor Author) commented:

In 1cf8e14, I changed the batching limit to take the greater of the old default (4GB) or half of the system's total memory. This allows us to run the MongoDB analysis phase quicker when the system can support it.

@@ -54,6 +54,11 @@ func ParseSubnets(subnets []string) (parsedSubnets []*net.IPNet) {

//IPIsPubliclyRoutable checks if an IP address is publicly routable. See privateIPBlocks.
func IPIsPubliclyRoutable(ip net.IP) bool {
// cache IPv4 conversion so it is not performed in every ip.IsXXX method
@Zalgo2462 (Contributor Author) commented:

In 43a7dc7, I added a bit of code in util.ContainsIP and util.IPIsPubliclyRoutable to cache the result of net.IP.To4(), which is called by many of the net API methods. It turns out that when you parse an IPv4 address in Go, the IP is stored using 16 bytes, and the net API then slices out the 4 bytes it needs via To4() in each of its methods.

return
}

func updateHostsBySSL(srcIP, dstIP net.IP, srcUniqIP, dstUniqIP data.UniqueIP, srcKey, dstKey string,
@Zalgo2462 (Contributor Author) commented:

Fixes issue #684.

@@ -88,3 +99,40 @@ func StringInSlice(value string, list []string) bool {
}
return false
}

//Int64InSlice returns true if the int64 is an element of the array
func Int64InSlice(value int64, list []int64) bool {
@Zalgo2462 (Contributor Author) commented:

It seemed the int64 linear searches did not take up too much time (mainly used when unique'ing unique connection timestamps). My guess is that the Go compiler is smart enough to optimize the linear scan using SIMD or that the CPU is smart enough to speculatively execute the checks in parallel. (Moved from fsimporter.)

@Zalgo2462 (Contributor Author) commented:

Merged master back into this branch.
Integrated #690 and double checked that the rolling proxy beacons were still being reported correctly.

@fullmetalcache fullmetalcache left a comment

On a first pass, everything looks good to me. Looks like the majority of changes were splitting out the parsers into different files, changing variable types/functions called, other optimization tricks, and some renaming/reorganizing of other variable types to improve readability. Just one minor question on the renaming of bro to zeek. Not a big deal though.

I will test this out and give it a second pass tomorrow. If everything else checks out, I am good with approving this.

Nice work man, so many good changes and learned so much here! Incredible how some of those variable types and function choices can make such a huge difference. If we don't already have something, we should consider having a doc or something else with some of these efficiency findings for types and functions so that we can refer to them in the future.

}

func mapZeekHeaderToParseType(header *BroHeader, broDataFactory func() pt.BroData, logger *log.Logger) (ZeekHeaderIndexMap, error) {
broData := broDataFactory()
@fullmetalcache (Contributor) commented:

Kind of a tough one, as it goes along with the comment above, but would we want to rename this to zeekData? Either way, at some point I think we should change all Bro references to Zeek for consistency. I will only mention the naming once here so as not to clutter the review with comments about renaming Bro to Zeek. If we decide to do that, we can just do some find-and-replacing rather than me commenting on every instance.

@Zalgo2462 (Contributor Author) commented:

I kept this one as broData since the type was kept as pt.BroData from previous versions. I've added an issue to the tracker to generally rename the references from Bro to Zeek (#693).

res *resources.Resources
min int64
max int64
database *database.DB
@fullmetalcache (Contributor) commented:

Nice, I like how this removes a level from a lot of the places where we use this struct. Makes things a bit cleaner.

@@ -202,23 +209,24 @@ func (p UniqueIPPair) BSONKey() bson.M {
//UniqueIPSet is a set of UniqueIPs which contains at most one instance of each UniqueIP
//this implementation is based on a slice of UniqueIPs rather than a map[string]UniqueIP
//since it requires less RAM.
-type UniqueIPSet []UniqueIP
+type UniqueIPSet map[string]UniqueIP
@fullmetalcache (Contributor) commented:

Very nice on the type replacement

@fullmetalcache fullmetalcache left a comment

Found just one very minor thing on a second pass. Tested on a semi-large dataset on a system with 16GB RAM and 8 cores allocated, running on a server with eight 10K disks in a RAID 10 config. Got around 8 minutes of parse time on current master. Got around 4 minutes with these changes. Nice!

err := errors.New("type mismatch found in log")
logger.WithFields(log.Fields{
"error": err,
"type in log": header.Types[index],
@fullmetalcache (Contributor) commented:

Minor thing but can we please remove the spaces and replace with underscores for the field names?

@Zalgo2462 (Contributor Author) commented:

Will do

@fullmetalcache fullmetalcache left a comment

LGTM!

@fullmetalcache fullmetalcache merged commit 13b2e77 into master Aug 20, 2021
@fullmetalcache fullmetalcache deleted the speed-up-parser branch August 20, 2021 21:28
@Zalgo2462 Zalgo2462 mentioned this pull request Apr 19, 2022