The Road to Streaming Graph Analysis

AJ Heller
Published in Insight
Mar 21, 2017

A story about an Insight Data Engineering project that nearly wasn’t, and the beginning of an unexpected new career.

I vaguely recall my first encounter with a trampoline — the physical, acrobatic sort of trampoline, not the Lispy control flow concept. I remember the joy I found in the novelty of such a gigantic toy. The exhilaration of launching myself into the air, unsupported, defying gravity. The discovery that I can get a lot higher with some well-timed help from a friend. And noticing that I wanted to keep jumping long after I got tired, because it was fun. And I wasn’t about to let tired get in the way of fun.

My time as an Insight Data Engineering Fellow went much the same way. Maybe it’s because I got to learn something new and exciting, or maybe because I really enjoyed being around the other people who found themselves here, too: through an intense 7 weeks, there’s something about the Insight experience that drove me to work very hard and to be grateful for the chance to do so. Now, as a Program Director, I consider myself a custodian of that experience, and hope to share it with all who come through our doors.

I came to Insight in September 2016 with a Chromebook and a bad haircut — a new Fellow looking to show the big-data world the beauty that lies in simple, efficient solutions. “Necessity is the mother of invention” was my mantra. That’s how I justified not upgrading to an actual laptop, anyway. I already had Arch Linux installed on my Chromebook, and the 16GB SSD held the JDK with room to spare. I had a machine, and I was convinced I could make it work. The hard part, interestingly enough, was figuring out what to work on.

I went through five project ideas before I settled on the one that I ended up presenting. For context, five ideas is not a lot for a Fellow these days; there are many things you need to understand before you can wrap your head around a modern data engineering challenge. Five iterations on your ideas is fine and expected, but for many reasons my ideas took a long time to invalidate, and I was left without a project at the end of the second week of the program. The end of week two is just beyond the point where we were expected to have our project MVP built and usable. I had burned two weeks of a three-week project, and I had just buried my fourth idea.

There was a decision I had to make over and over again, and it went a bit like this:

  • I knew I could quit; leave the program and thank them for the opportunity.
  • Or I could accept that things are as they are, and choose to keep moving forward.

Put in such simple terms, the decision was easy: I was going to keep working until they kicked me out. Thankfully, they didn’t.

Through all my “failed” project ideas, I ended up with a decent understanding of Flink’s model and internals. One of my Program Directors, Ronak Nathani, recommended I actually play around with the tool (as I hadn’t yet) and see if anything inspired me. I took an interest in the streaming graph library, gelly-streaming, and learned it had an algorithm that could find connected components incrementally from the beginning of time. It couldn’t, however, show you the connected components that had developed in, say, just the past 2 hours. I didn’t have a great application in mind, but intuitively, it seemed there must be value in isolating a view of the graph that captures how its disjoint sets change over time.
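To make "incrementally from the beginning of time" concrete, here is a minimal sketch in plain Python. It is not gelly-streaming's code or API; the `DisjointSet` class and the sample edges are made up for illustration. It uses the textbook union-find structure, which absorbs one edge at a time but has no way to remove an edge once it has been merged in. That is why a "past 2 hours" view needs a different approach.

```python
class DisjointSet:
    """Union-find over hashable vertex ids (hypothetical helper for illustration)."""

    def __init__(self):
        self.parent = {}

    def find(self, v):
        # Register unseen vertices as their own root.
        self.parent.setdefault(v, v)
        root = v
        while self.parent[root] != root:
            root = self.parent[root]
        # Path compression: point every vertex on the path at the root.
        while self.parent[v] != root:
            self.parent[v], v = root, self.parent[v]
        return root

    def union(self, a, b):
        # Merge the components containing a and b; there is no inverse operation.
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[rb] = ra

    def components(self):
        groups = {}
        for v in list(self.parent):
            groups.setdefault(self.find(v), set()).add(v)
        return sorted(sorted(g) for g in groups.values())


# Feed edges one at a time, as an unbounded stream would.
ds = DisjointSet()
for a, b in [("a", "b"), ("c", "d"), ("b", "c")]:
    ds.union(a, b)
    print(ds.components())
```

After the third edge, all four vertices sit in one component, and they stay there no matter how old those edges become.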

To get more concrete, take GitHub for example. You make pull requests to people’s repos, comment on others’ issues, and fork someone’s project to work on it yourself. If you treat these interactions as connections in a network (and squint a bit), you can see a developing social network, right? Flink’s gelly-streaming could tell you if there’s ever been a connection in your extended network of GitHub interactions to someone like Brendan Gregg, but it couldn’t tell you if such a connection happened recently, or if it happens periodically. It couldn’t show the ephemeral nature of groups as novel repos come and go, or identify the people who are most likely to bridge disparate groups as they are growing and changing.

I decided to implement this feature in the gelly-streaming framework. The concept isn’t too heavy-duty. My Insight project and demo consisted of a pipeline that identified groups (connected components) in the GitHub event stream, within a sliding window. At ~20,000 events per second, this pipeline could chew through 5 years of GitHub data in about 8 hours. I titled the project Network Pulse, but GitHub offered the repo name “literate-garbanzo”, and I just couldn’t bring myself to change it.
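For a feel of what "connected components within a sliding window" means, here is a rough single-machine sketch in plain Python. It is not the Network Pulse code and does not use Flink. The function names, the `(timestamp, a, b)` event shape, and the window sizes are all assumptions made for illustration. It takes the simplest route: keep only the edges inside the current window and recompute the components each time the window slides.

```python
from collections import deque


def components(edges):
    """Connected components of an edge list, via a small union-find."""
    parent = {}

    def find(v):
        parent.setdefault(v, v)
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    for a, b in edges:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    groups = {}
    for v in parent:
        groups.setdefault(find(v), set()).add(v)
    return sorted(sorted(g) for g in groups.values())


def sliding_components(events, window, slide):
    """Yield (window_end, components) for windows [end - window, end).

    Assumes events are (timestamp, a, b) tuples with non-negative timestamps.
    """
    events = sorted(events)
    if not events:
        return
    last_ts = events[-1][0]
    buf = deque()  # edges currently inside the window
    i, end = 0, slide
    while True:
        # Admit events that have entered the window.
        while i < len(events) and events[i][0] < end:
            buf.append(events[i])
            i += 1
        # Evict events that have aged out of the window.
        while buf and buf[0][0] < end - window:
            buf.popleft()
        yield end, components((a, b) for _, a, b in buf)
        if end > last_ts:
            break
        end += slide


# Hypothetical interaction events: (timestamp, user_a, user_b)
events = [(0, "a", "b"), (1, "c", "d"), (5, "b", "c"), (9, "e", "f")]
for end, groups in sliding_components(events, window=4, slide=2):
    print(end, groups)
```

In this toy stream, "a" through "d" would form one group if you counted every edge ever seen, but no single window shows them connected, which is exactly the kind of recency the all-time view hides. Recomputing per window is the naive choice here; a production pipeline would likely want something smarter.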

Lamentably, my contribution wasn’t in line with gelly-streaming’s goals. I also bumbled the implementation at first, and I didn’t make a good case for its inclusion, since this sort of algorithm deviates a bit from the rest of the library. Either way, my work is open source but not immediately consumable, so if you think you have a use for this, let me know and I’ll see if I can help!

And throughout it all, my workhorse was a $150 Acer C710 Chromebook. In my late-night stupor, I even spent hours building JARs locally, swapping memory to a USB 2.0 flash drive when my SSD ran out of space, before I realized it’d take mere minutes to build the same things on any of the AWS EC2 nodes I had up and running. It was a hard lesson on the false economy of neglecting adequate rest. Live and learn.

Insight was my trampoline into a new career. I once called it “the shortest path between me and a really cool job.” I didn’t expect that job would be here at Insight, but I’m grateful that it is. Now, to deal with that bad haircut …

Given some free AWS resources, 7 dedicated weeks, a bit of instruction, and an amazing professional network … if you have the drive and perseverance to turn those ingredients into a career, I highly encourage you to apply. The deadline for applications is March 29th. I hope to see you here!
