
Welcome. I am Jack Histon. My career would not be what it is today without dedication and hard work from software bloggers. My purpose is to give back to that online community.

Trust your Data, Not your Mind

Wednesday, 27 September 2017


When you add decor to your home, previous decor can look worse. You paint one wall a bright blue, but at the same time, the shine of the wall next to it dulls a little.

Your codebase is a never-ending cycle of refactoring. A small improvement to one part makes you see small improvements for another. Shining a light on one piece of your code exposes problems in the code around it. It makes you pick up your paintbrush and bring your new bright blue to the adjacent wall.

Of course, your auxiliary code hasn't become worse. It doesn't become inefficient, it doesn't become confusing, it doesn't become fragile. It is unchanged. Your mind is playing tricks on you. It makes you feel uneasy about new refactorings living alongside old ones.

Your mind can, and will, compare everything to anything. It leads you down paths that are unnecessary, cumbersome, and wasteful. On rare occasions, it can provide a happy accident. You stumble around like a toddler finding its feet. For a split second you make it, until you tumble to the floor.

Your mind cannot be trusted.

How can you navigate your way through your codebase, and make necessary, straightforward, and frugal brush strokes?

Version Control

Version control systems are the answer. If you don't use version control, then refactoring is like painting your wall bright blue, while your house is on fire and your family has left you.

Adam Tornhill is a fantastic programmer who designs tools for software analysis. One such tool is Code Maat:

Code Maat is a command line tool used to mine and analyze data from version-control systems.

Adam sees your code as a crime scene. A crime scene leaves clues to what could have occurred prior to the crime. Analyses of blood spatter, time of death, physical evidence, and other clues can all help a detective solve the case and catch the killer.

To catch a serial killer, patterns need to emerge. What is the serial killer's modus operandi? Is there a specific time of day, area, and method to the serial killer's crimes?

Adam has taken the skill set of a detective and applied it to modern-day software development. Clues such as time of day, areas frequented, and methodology used can all be found inside your version control system.

Detectives find patterns about crimes committed using data analysis. In a similar way, you can find patterns within your code that reveal problem areas.

Firstly, we need data. I love git, and so for me, the correct command to use with Code Maat is:


git log --all --numstat --date=short --pretty=format:'--%h--%ad--%aN' --no-renames > logfile.log

Like organising the killer's crimes by when they occur, we arrive at a format that opens a door into our git activity. From here, we can run different analyses to get different results.
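To make the shape of that log concrete, here is a minimal Python sketch that reads it back into one record per changed file. It assumes the header layout produced by the `--pretty=format:'--%h--%ad--%aN'` option above, followed by tab-separated `--numstat` lines. The `Change` record, the `parse_log` helper, and the sample commits are hypothetical, made up for illustration; Code Maat has its own parser.

```python
from dataclasses import dataclass


@dataclass
class Change:
    rev: str
    date: str
    author: str
    added: str    # kept as text: git prints "-" for binary files
    deleted: str
    entity: str


def parse_log(text):
    """Turn the log text into a flat list of per-file changes."""
    changes = []
    rev = date = author = None
    for line in text.splitlines():
        if line.startswith("--"):
            # Header line: --<hash>--<date>--<author>
            _, rev, date, author = line.split("--", 3)
        elif line.strip():
            # numstat line: <added>\t<deleted>\t<path>
            added, deleted, entity = line.split("\t", 2)
            changes.append(Change(rev, date, author, added, deleted, entity))
    return changes


# Hypothetical sample, not taken from a real repository.
sample = (
    "--a1b2c3d--2017-09-20--Alice Example\n"
    "10\t2\tsrc/Foo.cs\n"
    "3\t0\tsrc/Bar.cs\n"
    "\n"
    "--e4f5a6b--2017-09-21--Bob Example\n"
    "1\t1\tsrc/Foo.cs\n"
)
changes = parse_log(sample)
```

Every analysis that follows is, in essence, a different way of grouping and counting records like these.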

First, we need statistics on our killer. We can run:


java -jar code-maat-0.9.0.jar -l logfile.log -c git2 -a summary

This command gives us a feel for the amount of work that has happened over a given time.

I am a .NET enthusiast, and for my job I use the ASP.NET Core repository on GitHub, so let's look for potential problem areas in it. Running the summary against it gives:


statistic,value
number-of-commits,3934
number-of-entities,9518
number-of-entities-changed,54952
number-of-authors,160

Here we see that there have been 160 participants in this repository. 160 people who could have been at the scene of the crime. So now, let us find some crime scenes in the code:


java -jar code-maat-0.9.0.jar -l logfile.log -c git2

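With no analysis specified, Code Maat runs its default authors analysis, which gives us the number of people who have worked on each file and the number of revisions it has had. The more killings there are in one area of a city, the more likely it is that the serial killer has visited it: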


entity,n-authors,n-revs
src/Microsoft.AspNet.Mvc/MvcServices.cs,26,134
src/Microsoft.AspNet.Mvc.Core/Controller.cs,24,69
src/Microsoft.AspNet.Mvc.Razor/Compilation/RoslynCompilationService.cs,19,88
src/Microsoft.AspNetCore.Mvc.Core/ControllerBase.cs,19,46
src/Microsoft.AspNet.Mvc/MvcServiceCollectionExtensions.cs,18,59
src/Microsoft.AspNet.Mvc.Core/MvcRouteHandler.cs,18,37
src/Microsoft.AspNet.Mvc.Razor.Host/MvcRazorHost.cs,17,72

...

What is this telling us? It is telling us that there might be a problem in the MvcServices file. It has had 134 revisions, with 26 different people changing it over time. That is a lot of churn.

Interestingly, MvcServices is a class that no longer exists in the repository. What I have done here is run the command all the way back to the beginning of the repository's existence. Like pulling murder records for places that no longer exist, some records may be irrelevant. But crimes committed by your serial killer could have happened many years ago.

One interesting result is the ControllerBase class. This does exist in the current state of the repository. We have found an area of interest, which might be a problem we need to fix.
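One way to separate live files from historical ones is to keep only the rows whose path still exists. The sketch below is a hedged illustration: `live_hotspots` is a hypothetical helper, the two rows are copied from the output above, and the `still_present` set stands in for a real check such as `os.path.exists` run from the repository root.

```python
import csv
import io

# Two rows copied from the authors analysis output above.
AUTHORS_CSV = """entity,n-authors,n-revs
src/Microsoft.AspNet.Mvc/MvcServices.cs,26,134
src/Microsoft.AspNetCore.Mvc.Core/ControllerBase.cs,19,46
"""


def live_hotspots(csv_text, exists):
    """Keep rows whose file still exists, most authors and revisions first."""
    rows = csv.DictReader(io.StringIO(csv_text))
    live = [r for r in rows if exists(r["entity"])]
    live.sort(
        key=lambda r: (int(r["n-authors"]), int(r["n-revs"])),
        reverse=True,
    )
    return [(r["entity"], int(r["n-authors"]), int(r["n-revs"])) for r in live]


# Stand-in for os.path.exists: only the file described above as still
# present in the repository is treated as existing.
still_present = {"src/Microsoft.AspNetCore.Mvc.Core/ControllerBase.cs"}
hotspots = live_hotspots(AUTHORS_CSV, still_present.__contains__)
```

Dropping deleted files is a choice, not a rule. If you care about the history of a module that was renamed or split, you may want to keep those old records.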

The key here is that the file might be a problem. There are no guarantees that it is one. Before you paint a wall bright blue, you replaster it to get a smooth surface; that is a lot of work on one wall, and none of it is a repair. In the same way, a file with a lot of churn may be changing for other reasons, and may not be an area people spend hours fixing bugs in.

So you have found your crime scenes, but how do you link your serial killer to them?

Two files are logically coupled if they change together. Another way of analysing your repository is with logical coupling:


java -jar code-maat-0.9.0.jar -l logfile.log -c git2 -a coupling

Giving us a result of:


entity,coupled,degree,average-revs
PhysicalFileResultExecutor.cs,VirtualFileResultExecutor.cs,100,13
XmlDataContractSerializerInputFormatter.cs,XmlSerializerInputFormatter.cs,100,7

...

This is telling us that every single time someone changes the PhysicalFileResultExecutor, they also have to change the VirtualFileResultExecutor. Every time we paint a wall bright blue, we first plaster the wall with a nice smooth finish; both actions are logically coupled.

Like a serial killer and his modus operandi, there could be false positives: copycats, coincidences, et cetera. However, this could be a clear sign that there is high coupling between two files, and we should investigate further.
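To show what "change together" means in numbers, here is a simplified Python sketch. It assumes the degree of coupling is the number of commits two files share, divided by the average of their revision counts, as a percentage. That formula is an assumption made for illustration and may not match Code Maat's exact calculation; the `coupling` helper and the file names are hypothetical.

```python
from collections import Counter
from itertools import combinations


def coupling(commits):
    """commits: an iterable of sets of file paths changed in one commit."""
    revs = Counter()     # how often each file changed
    shared = Counter()   # how often each pair changed in the same commit
    for files in commits:
        revs.update(files)
        shared.update(combinations(sorted(files), 2))

    degrees = {}
    for (a, b), together in shared.items():
        average_revs = (revs[a] + revs[b]) / 2
        degrees[(a, b)] = round(100 * together / average_revs)
    return degrees


# Hypothetical history: A.cs and B.cs usually change together.
history = [
    {"A.cs", "B.cs"},
    {"A.cs", "B.cs"},
    {"A.cs", "C.cs"},
]
degrees = coupling(history)
```

With only a handful of commits, a high percentage means very little, which is one reason the average revision count is worth reading alongside the degree.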

To link a serial killer to a crime, there needs to be evidence to put them in the correct time and place for it to occur. You have found your problem file, but who works on that file the most?


java -jar code-maat-0.9.0.jar -l logfile.log -c git2 -a entity-ownership

With this command, we can find out who spends the most time changing an area of code. Here are the results of this command:


...

entity,author,added,deleted
src/Microsoft.AspNetCore.Mvc.Core/ControllerBase.cs,Pranav K,201,60
src/Microsoft.AspNetCore.Mvc.Core/ControllerBase.cs,Kristian Hellang,39,0
src/Microsoft.AspNetCore.Mvc.Core/ControllerBase.cs,Jaspreet Bagga,217,0
src/Microsoft.AspNetCore.Mvc.Core/ControllerBase.cs,ryanbrandenburg,30,16
src/Microsoft.AspNetCore.Mvc.Core/ControllerBase.cs,Charlie Daly,60,180
src/Microsoft.AspNetCore.Mvc.Core/ControllerBase.cs,matt kocaj,6,6
src/Microsoft.AspNetCore.Mvc.Core/ControllerBase.cs,Hao Kung,10,10

...

These are the authors who have added to or deleted from a specific file, with their line counts. Among the rows shown, Pranav K has created the most churn within the ControllerBase class. But what does this mean?

What we can take from these results is who is best placed to clean up the scene of the crime. Like locals who live near a crime scene, they are good people to question about recent happenings in the neighbourhood. They can give you key insights, and perhaps unveil an issue that none of you knew existed.
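Reading the ranking off the table by eye works for seven rows, but not for a whole repository. As a rough sketch, the Python below totals added plus deleted lines per author for one file, using the rows shown above. The `churn_by_author` helper is hypothetical, and treating "added + deleted" as churn is an assumption about how to combine the two columns.

```python
import csv
import io
from collections import defaultdict

# Rows copied from the entity-ownership output above.
OWNERSHIP_CSV = """entity,author,added,deleted
src/Microsoft.AspNetCore.Mvc.Core/ControllerBase.cs,Pranav K,201,60
src/Microsoft.AspNetCore.Mvc.Core/ControllerBase.cs,Kristian Hellang,39,0
src/Microsoft.AspNetCore.Mvc.Core/ControllerBase.cs,Jaspreet Bagga,217,0
src/Microsoft.AspNetCore.Mvc.Core/ControllerBase.cs,ryanbrandenburg,30,16
src/Microsoft.AspNetCore.Mvc.Core/ControllerBase.cs,Charlie Daly,60,180
src/Microsoft.AspNetCore.Mvc.Core/ControllerBase.cs,matt kocaj,6,6
src/Microsoft.AspNetCore.Mvc.Core/ControllerBase.cs,Hao Kung,10,10
"""


def churn_by_author(csv_text, entity):
    """Total added + deleted lines per author for one file, largest first."""
    totals = defaultdict(int)
    for row in csv.DictReader(io.StringIO(csv_text)):
        if row["entity"] == entity:
            totals[row["author"]] += int(row["added"]) + int(row["deleted"])
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


ranking = churn_by_author(
    OWNERSHIP_CSV, "src/Microsoft.AspNetCore.Mvc.Core/ControllerBase.cs"
)
```

The first few names in `ranking` are the people worth talking to before you change the file.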

Summary

Painting a house is a large task. Good advice is to work on one room at a time. Better advice is to work on one problem at a time. The room is not the problem.

Having legacy code in your system is always going to be the case. It is tempting to jump in, and start polishing everything you can.

Teams can waste months, if not years, on rewrites that may not have had to happen at all. Identifying key problem areas, and refactoring those, will give you data-driven evidence of improvements over time.

You and your company have a finite set of resources to solve your problems. You should not monitor an entire city for a serial killer's movements. But you should monitor the problem areas of that city, increasing your chances of finding them.

This article is not here to improve your detective skills. It is not here to help you interpret your data. It is here to help you realise the importance of data.

Stop listening to your gut; analyse the crime scene with Code Maat.




© 2017 - Jack Histon - Blog