Graphs, pyramids and organic growth

For the last couple of months I have been thinking, researching, trying and failing to build a different approach to software architecture. As I am getting closer to my forties, it is high time to summarize what I have learned so far. I spent all those years following others, learning from people, adopting paradigms, testing them in real life, and throwing the majority of them in the trash can, as they were idealistic, utopian visions of reality.
Does it mean I want to conquer the world with yet another manifesto? Give consultants a chance to write one more book? Does it mean I want to start a revolution? Ask you to forget all you have learned? For God’s sake, no. If I ever do this, it means that I was drunk, or somebody forced me to do it.
What I have in mind is more like a librarian's work, or that of a village shaman who decided to carve all the collective knowledge and wisdom of the tribe in stone. It is rather a collection of articles, blog posts, discussions, and things so obvious they don’t even have a name, which makes them hard to talk about.
So I named this thing: I called it Patterns of Organic Architecture. Nice name, it would look nice on the cover of the book I will never have time to write.

What’s it all about? Over the years I tried many approaches to software architecture. We, as an industry, have written tons of worthless material about it, and we still struggle with what it really means “to do software architecture” or “to be a software architect”. We tried big design upfront, we tried bottom-up and top-down, we tried “don’t give a f..k about it” (so-called Agile, or rather how we interpreted it). And no, this post is NOT against Agile, so stop whining about how bad I am, that I don’t understand it, that I don’t believe in its values. Again, f..k it.
I am talking about reality here, not about Agile as an idea. I am talking about us: about us who had to implement what was said by Mr. Senior Principal Enterprise Architect, about us asked to implement one of these “world class” monolithic architecture frameworks, about us being asked if we can improve the overall architecture while adding some sexy features here and there. About us hearing every day that there is no time for architecture, because we need to deliver some mythical business value, which usually turns into customers leaving the product because they had to wait too long to log in, or too long to get a new feature. Whatever you do to build your architecture, or even if you don’t care, you can still use what I call Patterns of Organic Architecture. Nature’s way of building simple and beautiful stuff.
Let it grow, go with the flow, but don’t leave it alone. Watch it. Measure it. Dig in, and when necessary take action. Understand the forces that drive your system, and use them for your own good.
One of the patterns I have identified, “grow and harvest”, assumes that over time some pieces of your system will become stable; by stable I mean that the rate of change over the last months is close to zero.
The problem is how to identify these pieces, so we can “harvest” them from the code base: seal them in a separate repository, release them as a binary artifact, or even remove them because they are not used, and then enjoy a smaller code base, faster build times and so on.
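As a rough illustration of what “stable” could mean in code, here is a minimal sketch in plain Java. It assumes you have already extracted, for each file, the dates of the commits that touched it. The class name StableFiles, the method find and the maxRecentChanges threshold are all made up for this example; they are not part of any SCM tool.

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of any SCM tool: flags files with (almost) no recent changes.
final class StableFiles {

    /**
     * @param changeDates      file path -> dates of every commit that touched it
     * @param cutoff           start of the "recent" window, e.g. six months ago
     * @param maxRecentChanges how many recent changes a file may have and still count as stable
     * @return stable file paths in alphabetical order
     */
    static List<String> find(Map<String, List<LocalDate>> changeDates, LocalDate cutoff, int maxRecentChanges) {
        List<String> stable = new ArrayList<>();
        // Copy into a TreeMap to get a deterministic, alphabetical iteration order.
        for (Map.Entry<String, List<LocalDate>> entry : new TreeMap<>(changeDates).entrySet()) {
            // Count the changes made on or after the cutoff date.
            long recent = entry.getValue().stream().filter(d -> !d.isBefore(cutoff)).count();
            if (recent <= maxRecentChanges) {
                stable.add(entry.getKey());
            }
        }
        return stable;
    }

    public static void main(String[] args) {
        // Invented sample history, for illustration only.
        Map<String, List<LocalDate>> history = Map.of(
                "core/Parser.java", List.of(LocalDate.of(2012, 3, 1), LocalDate.of(2012, 5, 9)),
                "web/LoginPage.java", List.of(LocalDate.of(2013, 11, 2), LocalDate.of(2013, 12, 20)));
        // Everything untouched since 1 July 2013 is a harvest candidate.
        System.out.println(find(history, LocalDate.of(2013, 7, 1), 0)); // prints [core/Parser.java]
    }
}
```

A threshold of zero is the strictest reading of “close to zero”; on a real code base you would probably tune both the window and the threshold.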
The idea is not new; this is something Michael Feathers talks about quite often in his posts. Your SCM has all the information you need, you just need to dig in and reach out. The problem I have is that I tend to find myself in situations where I need to deal with gargantuan code bases, years of history, tons of files, and technology and architecture changes. While in the majority of cases a simple Perl script will do the job, it is hard, really hard, to reason about such beasts without better tools.
Recently a friend of mine (known to some as @LAFK_pl) asked me for help with an interesting problem he was trying to solve with a graph database, and this is how I found out that Neo4j had just been updated to version 2.0. I worked with Neo4j a couple of years back, trying to replace a legacy airline system with a lovely graph model of airports and flight connections. Since then I hadn’t had time to really track what was happening in this space, until last week. And suddenly I realized that a graph is all I was looking for. What if I push all the information I have about files, Maven modules, packages and such, plus SCM change sets, into one graph, so that I can ask for anything that comes to my mind? A couple of minutes later I had come up with this dirty code snippet, which reads data from a Mercurial log exported to XML and puts its content into a Neo4j database. Beware!!! This is not OOP, DDD, TDD code, this is just a few minutes’ hack.

import static com.google.common.collect.FluentIterable.from;

import java.io.FileReader;
import java.util.List;

import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;

import net.sf.saxon.om.NodeInfo;

import org.neo4j.graphdb.Direction;
import org.neo4j.graphdb.DynamicLabel;
import org.neo4j.graphdb.DynamicRelationshipType;
import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.Node;
import org.neo4j.graphdb.Relationship;
import org.neo4j.graphdb.Transaction;
import org.neo4j.graphdb.factory.GraphDatabaseFactory;
import org.neo4j.graphdb.index.Index;
import org.neo4j.graphdb.index.IndexHits;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.uncommons.maths.combinatorics.CombinationGenerator;
import org.xml.sax.InputSource;

import com.google.common.base.Optional;
import com.google.common.base.Predicate;

public class App {

	private static final Logger LOGGER = LoggerFactory.getLogger(App.class);

	public static void main(String[] args) throws Exception {

		GraphDatabaseService graphDatabase = new GraphDatabaseFactory().newEmbeddedDatabaseBuilder("db")
		        .newGraphDatabase();

		XPathFactory factory = XPathFactory.newInstance();
		XPath xpath = factory.newXPath();

		@SuppressWarnings("unchecked")
		List<NodeInfo> changesets = (List<NodeInfo>) xpath.evaluate("/changes/changeset", new InputSource(
		        new FileReader("out.xml")), XPathConstants.NODESET);

		for (NodeInfo node : changesets) {

			@SuppressWarnings("unchecked")
			List<NodeInfo> files = (List<NodeInfo>) xpath.evaluate("file/text()", node, XPathConstants.NODESET);

			if (files.size() >= 2) {
				CombinationGenerator<NodeInfo> generator = new CombinationGenerator<NodeInfo>(files, 2);

				// One transaction per changeset; try-with-resources closes it even if an exception is thrown.
				try (Transaction tx = graphDatabase.beginTx()) {
					Index<Node> nodes = graphDatabase.index().forNodes("files");
					for (List<NodeInfo> pair : generator) {

						NodeInfo first = pair.get(0);
						NodeInfo second = pair.get(1);

						LOGGER.info("changeset pair {}<->{}", first.getStringValue(), second.getStringValue());

						Node firstNode = addNode(graphDatabase, first, nodes);
						final Node secondNode = addNode(graphDatabase, second, nodes);

						// Neo4j relationships are directed, so record the co-change both ways.
						createChangeset(firstNode, secondNode);
						createChangeset(secondNode, firstNode);
					}
					tx.success();
				}
			}

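		}

		// Shut the embedded database down cleanly once all changesets are imported.
		graphDatabase.shutdown();
	}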

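	// Creates a "changeset" relationship from startNode to endNode, or bumps its "times" counter if it already exists.
	private static void createChangeset(Node startNode, final Node endNode) {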
		Iterable<Relationship> relationships = startNode.getRelationships(
		        DynamicRelationshipType.withName("changeset"), Direction.OUTGOING);

		Optional<Relationship> firstMatch = from(relationships).firstMatch(new Predicate<Relationship>() {

			public boolean apply(Relationship r) {
				return r.getEndNode().getId() == endNode.getId();
			}

		});

		Relationship relationship = firstMatch.orNull();
		if (relationship == null) {
			Relationship relationshipTo = startNode.createRelationshipTo(endNode,
			        DynamicRelationshipType.withName("changeset"));
			relationshipTo.setProperty("times", 1L);
		} else {
			Long property = (Long) relationship.getProperty("times");
			relationship.setProperty("times", ++property);
		}
	}

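	// Looks the file up in the "files" index by name and creates a node labelled with the file name if none exists yet.
	private static Node addNode(GraphDatabaseService graphDatabase, NodeInfo df, Index<Node> nodes) {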
		IndexHits<Node> indexHits = nodes.get("filename", df.getStringValue());
		Node single = indexHits.getSingle();
		if (single == null) {
			single = graphDatabase.createNode();
			single.addLabel(DynamicLabel.label(df.getStringValue()));
			nodes.putIfAbsent(single, "filename", df.getStringValue());
		}
		return single;
	}
}
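Stripped of the Neo4j and XPath plumbing, the core of the snippet above is a pair counter. The following is only a sketch of that idea in plain Java, not a replacement for the graph. The class CoChange and its methods are invented for this example, and unlike the Neo4j version it stores each pair once, as an unordered key, instead of as two directed relations.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

// Hypothetical stand-in for the graph: counts how often two files share a changeset.
final class CoChange {

    /** Counts every unordered pair of files that appear in the same changeset. */
    static Map<String, Long> countPairs(List<List<String>> changesets) {
        Map<String, Long> times = new HashMap<>();
        for (List<String> files : changesets) {
            // Deduplicate and sort so that (A, B) and (B, A) map to the same key.
            List<String> sorted = new ArrayList<>(new TreeSet<>(files));
            for (int i = 0; i < sorted.size(); i++) {
                for (int j = i + 1; j < sorted.size(); j++) {
                    times.merge(sorted.get(i) + " <-> " + sorted.get(j), 1L, Long::sum);
                }
            }
        }
        return times;
    }

    /** Returns the n most frequently co-changed pairs, highest count first. */
    static List<Map.Entry<String, Long>> top(Map<String, Long> times, int n) {
        List<Map.Entry<String, Long>> entries = new ArrayList<>(times.entrySet());
        entries.sort((x, y) -> {
            int byCount = Long.compare(y.getValue(), x.getValue()); // descending count
            return byCount != 0 ? byCount : x.getKey().compareTo(y.getKey()); // ties by name
        });
        return new ArrayList<>(entries.subList(0, Math.min(n, entries.size())));
    }

    public static void main(String[] args) {
        // Invented sample changesets, for illustration only.
        List<List<String>> changesets = List.of(
                List.of("A.java", "B.java", "pom.xml"),
                List.of("B.java", "A.java"));
        for (Map.Entry<String, Long> pair : top(countPairs(changesets), 3)) {
            System.out.println(pair.getKey() + " changed together " + pair.getValue() + " time(s)");
        }
    }
}
```

A map like this is enough for a top-N list, but it cannot answer follow-up questions about modules, packages or longer chains of files, which is where the graph earns its keep.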

This piece of code reads all change sets from the XML and, for each file (which in this case is a node in the graph), creates a relation to every other file in the same change set. If such a relation already exists, it simply increases a counter stored on the edge (relation) between the nodes. Of course I am lazy enough not to write my own implementation of combinations, so I use Uncommons Maths and its CombinationGenerator, which generates all pairs of files within one change set. One thing I spotted later on, while working with this graph, is that because Neo4j stores directed graphs, I had to generate two relations for each pair, incoming and outgoing. Which in fact matches reality: file A was changed together with file B, and file B was changed together with file A. Thanks to this I could simplify my Cypher queries. For those of you who are new to Neo4j, Cypher is a language which allows you to work with graphs, including queries and modifications. So what kind of information can I get from the graph?

neo4j-sh (?)$ MATCH (a)-[c:`changeset`]->(b) RETURN labels(a),c.times,labels(b) order by c.times desc limit 5;
+-----------------------------------------------------------------------------------------------------------+
| labels(a)                                      | c.times | labels(b)                                      |
+-----------------------------------------------------------------------------------------------------------+
| [".../listeners/SummaryScenarioListener.java"] | 13      | [".../listeners/LoggingScenarioListener.java"] |
| [".../listeners/LoggingScenarioListener.java"] | 13      | [".../listeners/SummaryScenarioListener.java"] |
| ["pom.xml"]                                    | 12      | ["roadrunner-core/pom.xml"]                    |
| [".../cli/BenchTest.java"]                     | 12      | [".../cli/RunTest.java"]                       |
| [".../cli/RunTest.java"]                       | 12      | [".../cli/BenchTest.java"]                     |
+-----------------------------------------------------------------------------------------------------------+

This is the simplest query. It shows the pairs of files which were modified together, ordered by the number of times each pair occurred in a change set. What is important to understand is that this kind of analysis shows not only code level dependencies but feature/function level dependencies as well. It can also show cross-technology dependencies, between your JavaScript, CSS and Java files, which is pretty hard to get even with all the modern IDEs we have at our disposal.
Of course this type of SCM “big data” 🙂 analysis can sometimes be misleading; there can and will be a lot of “falsy truths” about your code base, especially when your teams favor large commits at the end of each sprint (which I hope doesn’t happen in any organization taking code quality and continuous integration seriously). But compare this kind of information with no information at all. At least you have places in the system from which you can start your travel back in time. Of course you can use other tools, like Gephi or Graphviz, to visualize your code.
In my past I did many such “back in time” travels, and I always came back with interesting and precious information. In many cases we ended up reorganizing the code base, cutting out new libraries, Maven modules and so on. Better build times, less complex code. It is worth looking back. It is worth it to “grow and harvest” your code. Enjoy!


One thought on “Graphs, pyramids and organic growth”

  1. jexp says:

    Software analytics with graphs is really a lot of fun as you can just pour data from different parts (code, scm, bug reports, test-failures, modules) in and get a lot of insights by just running a few cypher queries.

    Others have enjoyed that too, in case you want to see what they did:
    http://jexp.de/blog/2014/06/geekout-software-analytics-with-graphs/
