An alternative way to visualize graph data with pandas and ipycytoscape
Goal
The goal of this article/notebook is to visualize graphs with ipycytoscape when the data arrives in tabular form (Excel, CSV, Google Sheets, etc.), which is a fairly common situation. It follows part 1 and part 2 of the series and builds on their example: rendering a small European rail network.
What do you need to know?
You are expected to have a basic knowledge of Python programming and some familiarity with pandas. You should also have read the previous two notebooks/articles in the series (part 1 and part 2).
Pandas: Another approach
Let’s face it, most of the data we manipulate is stored in some tabular form. Excel is arguably the most widely used data manipulation tool on the planet.
Recap: we are plotting (part of) the European rail network, and I would like you to imagine that the graph built in this notebook is part of a real user interface. The interface is meant for people sitting at a computer, watching and manipulating the graph, while the graph itself may be updated continuously from real-time data, much like an air traffic controller’s screen. The colours and appearance of the graph might therefore change continuously depending on the number of passengers on trains, the number of trains in stations, the speed of each train, and so on. If you were the developer on the project, you would have to reflect all that data in the controller’s GUI; we have chosen Jupyter notebooks, Voilà, ipycytoscape and pandas for that.
Imagine that you are given the data in two tables:
- “stations.xls” -> all the data about the train stations.
- “railconnections.xls” -> all the data about the connections between stations.
The most natural way to work with tabular data in Python is the pandas library. With pandas you can import the Excel tables directly into a DataFrame.
Note 1: The graph and interface can be deployed with Voilà. Going deeper into Voilà is beyond the scope of this article; you just need to know that there are tools to render (i.e. display) ipycytoscape graphs in a browser outside of a Jupyter notebook.
Note 2: Ipycytoscape already includes an API that allows passing a pandas DataFrame to a graph constructor (see here: https://ipycytoscape.readthedocs.io/en/latest/examples/pandas.html). Nevertheless, my approach is slightly different; I would like to show a way to manipulate and change graphs using strictly pandas. The use case will become apparent over the course of this article/notebook.
So that you can follow along without reading an external file, I have pasted the data here as a dictionary, which keeps the article self-contained.
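As a minimal sketch of what such a dictionary could look like for the stations table: the column names (`id`, `label`, `passengers`) and all the values below are hypothetical placeholders chosen for illustration, not the original data of this series.

```python
import pandas as pd

# Hypothetical sample data: the ids, labels and passenger figures are
# made up for illustration and are not real statistics.
stations = {
    "id": ["BER", "MUN", "FRA", "PAR", "MIL"],
    "label": ["Berlin", "Munich", "Frankfurt", "Paris", "Milan"],
    "passengers": [400000, 350000, 450000, 500000, 150000],
}

# One row per station; every later layout change is just a new column.
stations_df = pd.DataFrame(stations)
print(stations_df)
```

With a real file, the same DataFrame would typically come from `pd.read_excel` or `pd.read_csv` instead of a literal dictionary.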
And here is the file “railconnections.xls”:
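A connections table can be sketched the same way. The column names (`source`, `target`, `speed_kmh`) and the values are again assumptions made for illustration only.

```python
import pandas as pd

# Hypothetical sample data: each row is one rail connection (an edge)
# between two station ids, with an assumed line speed in km/h.
rail_connections = {
    "source": ["BER", "BER", "MUN", "FRA", "MUN"],
    "target": ["MUN", "FRA", "FRA", "PAR", "MIL"],
    "speed_kmh": [200, 250, 300, 320, 160],
}

rails_df = pd.DataFrame(rail_connections)
print(rails_df)
```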
Let’s now add the necessary layout data to the table. The class of the new stations is ‘EU’ (see parts 1 and 2 of this series for an explanation of classes).
Every EU rail station should be displayed in orange and the rest of the stations in blue, except for the German capital, which will be yellow.
Let’s add this data to the DataFrame.
The tables above contain the attributes of the stations and rail connections. Now we need a method to pass all this data to ipycytoscape.
The following method takes both tables above and produces a JSON structure that can then be passed to ipycytoscape. The method itself returns the ipycytoscape graph, so afterwards it only needs to be plotted. The following is a big chunk of code. You don’t need to dig into it. Just get the idea that the method transforms two tables, one with nodes (rail stations) and one with edges (rail connections), into the actual graph.
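One possible way to express these three colouring rules with pandas is sketched below. The small table, the `classes` column and the `background-color` column name are assumptions for this example; the station ids are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical stations: an empty class for the original German stations,
# 'EU' for the newly added ones.
stations_df = pd.DataFrame({
    "id": ["BER", "MUN", "PAR", "MIL"],
    "label": ["Berlin", "Munich", "Paris", "Milan"],
    "classes": ["", "", "EU", "EU"],
})

# Rule 1 and 2: EU stations orange, all other stations blue.
stations_df["background-color"] = np.where(
    stations_df["classes"] == "EU", "orange", "blue"
)

# Rule 3: the German capital overrides the rules above.
stations_df.loc[stations_df["id"] == "BER", "background-color"] = "yellow"

print(stations_df)
```

The order matters here: the general rule is written first and the single exception is applied afterwards with `.loc`.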
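The original method is not reproduced here. As a much smaller sketch of the same idea, the two hypothetical helpers below (`build_graph_json` and `build_style`, names invented for this example) turn the tables into a Cytoscape-style JSON dictionary and a list of per-element style rules. The column names and sample values are assumptions, and the sketch stops short of creating the widget.

```python
import pandas as pd

NODE_STYLE_COLS = ["background-color"]  # assumed style columns on the stations table
EDGE_STYLE_COLS = ["line-color"]        # assumed style columns on the rails table


def build_graph_json(stations_df, rails_df):
    """Build a {'nodes': [...], 'edges': [...]} dict from the two tables."""
    nodes = [
        {"data": {"id": row["id"], "label": row["label"]}}
        for row in stations_df.to_dict("records")
    ]
    edges = [
        {"data": {"source": row["source"], "target": row["target"]}}
        for row in rails_df.to_dict("records")
    ]
    return {"nodes": nodes, "edges": edges}


def build_style(stations_df, rails_df):
    """Build one style rule per row, using Cytoscape attribute selectors."""
    style = []
    for row in stations_df.to_dict("records"):
        selector = 'node[id = "{}"]'.format(row["id"])
        style.append({"selector": selector,
                      "style": {col: row[col] for col in NODE_STYLE_COLS}})
    for row in rails_df.to_dict("records"):
        selector = 'edge[source = "{}"][target = "{}"]'.format(row["source"], row["target"])
        style.append({"selector": selector,
                      "style": {col: row[col] for col in EDGE_STYLE_COLS}})
    return style


# Hypothetical two-station example.
stations_df = pd.DataFrame({"id": ["BER", "MUN"], "label": ["Berlin", "Munich"],
                            "background-color": ["yellow", "blue"]})
rails_df = pd.DataFrame({"source": ["BER"], "target": ["MUN"], "line-color": ["gray"]})

graph_json = build_graph_json(stations_df, rails_df)
style = build_style(stations_df, rails_df)
```

In a notebook, the two results could then be handed to ipycytoscape, for instance with `widget.graph.add_graph_from_json(graph_json)` and `widget.set_style(style)` on a `CytoscapeWidget`. One rule per row is simple but verbose; for large graphs a class-based style sheet would likely scale better.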
Note: This article does not deal with positioning the nodes, so the positions of the cities do not correspond to reality.
But I started this article by saying I would show why working directly with pandas DataFrames is a good way to go.
Now assume that you want to change the colours of the graph in the following way:
Stations with more than 200000 passengers should be rendered red and the rest green.
The high-speed rail lines should be painted red (a line counts as high-speed when its speed is 300 km/h or more).
We can do that by operating on the DataFrames and passing the result again to the method defined above, which builds the ipycytoscape graph. Which CSS attributes are affected? For the nodes it is ‘background-color’ (which we already used) and for the lines it is ‘line-color’. Let’s see.
With only two lines we made the necessary changes, and the DataFrames now look as follows.
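A sketch of how these two rules could be written, one line per table. The sample values and the `passengers` and `speed_kmh` column names are assumptions; the text does not say which colour the slower lines get, so `gray` below is purely a placeholder.

```python
import numpy as np
import pandas as pd

# Hypothetical sample tables.
stations_df = pd.DataFrame({
    "id": ["BER", "MIL", "MUN"],
    "passengers": [400000, 150000, 200000],
})
rails_df = pd.DataFrame({
    "source": ["BER", "MUN"],
    "target": ["MUN", "MIL"],
    "speed_kmh": [300, 160],
})

# Nodes: strictly more than 200000 passengers -> red, otherwise green.
stations_df["background-color"] = np.where(stations_df["passengers"] > 200000, "red", "green")

# Edges: 300 km/h or more -> red; 'gray' for the rest is an assumption.
rails_df["line-color"] = np.where(rails_df["speed_kmh"] >= 300, "red", "gray")
```

Note the different comparisons: the station rule is strict (`>`), while the high-speed threshold is inclusive (`>=`).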
stations_df:
rails_df:
Changing appearance of the edges & nodes with the help of pandas
To further illustrate the power of this approach, I will add another layout change. Suppose a new safety regulation has been passed in parliament, and every station served by a high-speed train must have special fire-safety measures. The GUI should therefore show in violet all the stations where at least one high-speed line arrives, and in green those where that is not the case.
Stations where at least one high-speed line arrives
changing color of nodes
Let’s color the stations that have high-speed train connections.
Let’s change the color of the stations in the stations DataFrame.
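The two steps, finding the affected stations in the rails table and then recolouring them in the stations table, could look roughly like this. The sample data and column names are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical sample tables.
stations_df = pd.DataFrame({"id": ["BER", "MUN", "FRA", "MIL"]})
rails_df = pd.DataFrame({
    "source": ["BER", "MUN", "MUN"],
    "target": ["MUN", "FRA", "MIL"],
    "speed_kmh": [200, 300, 160],
})

# Step 1: keep only high-speed lines and collect both of their endpoints.
high_speed = rails_df[rails_df["speed_kmh"] >= 300]
high_speed_stations = set(high_speed["source"]) | set(high_speed["target"])

# Step 2: violet if the station touches a high-speed line, green otherwise.
stations_df["background-color"] = np.where(
    stations_df["id"].isin(high_speed_stations), "violet", "green"
)
```

Both `source` and `target` are collected because a line "arrives" at either end of an undirected rail connection.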
changing shapes of nodes
Next, the stations with more than 350000 passengers will be plotted as squares to differentiate them from the smaller ones.
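A possible one-line version of this rule is sketched below, with hypothetical sample values. Cytoscape's `shape` property accepts `rectangle`, which renders as a square when a node's width and height are equal; using `ellipse` for the remaining stations is an assumption.

```python
import numpy as np
import pandas as pd

# Hypothetical sample table.
stations_df = pd.DataFrame({
    "id": ["BER", "MUN", "MIL"],
    "passengers": [400000, 350000, 150000],
})

# Strictly more than 350000 passengers -> square-like rectangle,
# everything else keeps a round shape.
stations_df["shape"] = np.where(stations_df["passengers"] > 350000, "rectangle", "ellipse")
```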
changing the size of the nodes
We want to make the size of the nodes proportional to the number of passengers at each station.
In this way the graph will reflect more appropriately the (relative) weight of every station in the whole network.
If you search for how to normalize data, you will find that the preprocessing module of scikit-learn is often used. To avoid making things more complicated here, a simple formula will be used instead of pulling in yet another very heavy library.
formula = value / (max_value - min_value)
Again, we can create a column in our nodes DataFrame representing the score of every station.
We now have a normalized column for the size of the rail stations. If we assume a minimum size of 20px, we have to add 20 to the column.
The same can be done with the rail connections: the higher the speed, the thicker we want to paint the rail connection.
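The score and size columns can be sketched as follows, using the formula exactly as stated above. The sample values are hypothetical, and the `SCALE` factor, which stretches the score into a visible pixel range before the 20px offset is added, is an assumption of this sketch rather than something the text prescribes.

```python
import pandas as pd

# Hypothetical sample table.
stations_df = pd.DataFrame({
    "id": ["PAR", "MUN", "MIL"],
    "passengers": [500000, 300000, 100000],
})

passengers = stations_df["passengers"]

# score = value / (max_value - min_value), as in the formula above.
stations_df["score"] = passengers / (passengers.max() - passengers.min())

SCALE = 40     # assumed pixels per unit of score
MIN_SIZE = 20  # the 20px offset mentioned in the text

stations_df["size"] = stations_df["score"] * SCALE + MIN_SIZE
print(stations_df)
```

Because the formula does not subtract the minimum from the value, the smallest station ends up somewhat above 20px; subtracting it first (classic min-max scaling) would map the scores onto the range 0 to 1 instead.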
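For the edges, a similar sketch could compute a `width` column (the Cytoscape style property for edge thickness) from the speed. The sample speeds, the `SCALE` factor and the `MIN_WIDTH` offset are assumptions for illustration.

```python
import pandas as pd

# Hypothetical sample table.
rails_df = pd.DataFrame({
    "source": ["FRA", "BER", "MUN"],
    "target": ["PAR", "FRA", "MIL"],
    "speed_kmh": [320, 240, 160],
})

speed = rails_df["speed_kmh"]

# Same formula as for the nodes: value / (max_value - min_value).
rails_df["score"] = speed / (speed.max() - speed.min())

SCALE = 4      # assumed pixels per unit of score
MIN_WIDTH = 1  # assumed minimum offset in pixels

rails_df["width"] = rails_df["score"] * SCALE + MIN_WIDTH
print(rails_df)
```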
Conclusion
I have shown how to use pandas to lay out graphs with ipycytoscape in a more direct way, taking advantage of the powerful pandas API.
This approach allows data scientists to focus on pandas rather than on the ipycytoscape API, especially when they feel more comfortable with pandas, which is my case.
The next two necessary steps are manipulating the x, y coordinates of the nodes and adding interaction (ipywidgets).
Stay tuned