Episode 4: "Game of Ties" – Centrality measures battling it out
Updated: Jan 19
What's the rumpus :)
I'm Asaf Shapira and this is NETfrix.
Who are the bottlenecks in my organization? Who should we target for an advertising campaign? How to tell a story about data? Are rhetorical questions an essential part of network analysis? Let's find out.
Like most things in our world, the network also follows a Power Law distribution, that is, there are a few central nodes and most of the nodes are marginal. So, if there are only a few big hubs in the network, how shall we find these big needles in the haystack?
In order to find these nodes, we will need to use SNA (Social Network Analysis) algorithms, known as "Centrality Measures". These metrics allow us to find the centers of gravity in networks, which help us "control" the network, understand what is going on, dismantle it if necessary, etc.
Let's start our quest by addressing a common misconception. Often, in the search for key players, we intuitively look for the most active ones in the network. Although the activity of nodes is also distributed as a Power Law (a few are very active and the majority do little), the fact that a node is active does not necessarily make it central or influential.
Let's say I call someone 100 times a day. Our relationship, or edge weight, is 100. It definitely says that I am an active player and also a bit of a stalker, but it does not make me a hub. Conversely, if I call 20 people, or have 20 people call me, and even if I conduct only 2 calls with each of them, my relationship weight will only be 40, but I probably play a much more central role in the network than my creepy self.
It can be deduced from this example that the number of connections a node has is a significant aspect of centrality, and it does sound very intuitive. Therefore, it is no coincidence that this is the most popular centrality measure in the field of networks, and it is called: the Degree of a node. In all the examples given in previous episodes, this was the main metric we used.
Let's demonstrate this with a star-shaped network: a network comprised of one central node, with all the other nodes connected only to it.
In this network, the node in the middle has the highest Degree. All other nodes will receive a score of 1, the lowest, because they are connected to only one node, the central node.
In a directed network we can also measure the incoming links (In-Degree) and the outgoing links (Out-Degree).
If node X contacts 4 nodes and at the same time, 3 nodes contact it, then its Out-Degree is 4 and its In-Degree is 3.
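The numbers in this example can be checked with a few lines of code. A minimal sketch in plain Python, with a hypothetical edge list (the node names are made up):

```python
# Hypothetical toy example: node "X" contacts 4 nodes, and 3 nodes contact it.
edges = [
    ("X", "A"), ("X", "B"), ("X", "C"), ("X", "D"),  # X's outgoing links
    ("A", "X"), ("B", "X"), ("E", "X"),              # X's incoming links
]

def out_degree(node, edges):
    # Out-Degree: how many links leave the node.
    return sum(1 for src, _ in edges if src == node)

def in_degree(node, edges):
    # In-Degree: how many links point at the node.
    return sum(1 for _, dst in edges if dst == node)

print(out_degree("X", edges))  # 4
print(in_degree("X", edges))   # 3
```

In an undirected network, the plain Degree is simply the count of edges touching the node, with no in/out split.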
The rationale behind this metric is that the node is central, if it's linked to many players.
In school it would be the popular kid, in a computer network it would be the main server, in an organization it is expected to be the head of the bureau, or the main office.
There are probably hundreds of algorithms for identifying "hubs" in the network, but they can be summarized into three main categories that stem from different perceptions of what it means to be central in a network.
The first category, which we have already touched on, is the number of connections of the node.
The second category is to what degree does the node constitute a bottleneck or bridge in the network.
Being a bottleneck in the network does not necessarily mean having lots of links. It means that the node is located in the network in such a manner that we often have to go through it in order to travel from one part of the network to another.
The leading metric in this category is the Betweenness Centrality measure, meaning the node is located between two parts of the network and constitutes a bridge between them. The mathematical definition of this metric is the number of shortest paths in the network that pass through the node.
If we take our star-shaped network and add two more star-shaped networks, with only one node connecting the three, then this special node has a Degree of only 3, but it is the sole connector of the 3 networks, so its Betweenness score would be high.
The Betweenness score will usually be normalized between 0 and 1. For example, a node's score of 0.5 means that about half of the shortest routes in the network must pass through it. Such a score means that this node literally divides the network in two and any movement from one side of the network to the other must pass through it. In large networks this is a rather rare phenomenon and we will probably encounter much lower scores.
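Here is a small sanity check of that intuition on a toy version of the three-star network described above (a sketch in plain Python; the node names and star sizes are made up). Since the joined stars form a tree, every shortest path is unique, which lets us count betweenness with a simple distance identity instead of the full Brandes algorithm:

```python
from collections import deque

def bfs_dist(adj, src):
    # Hop distances from src to every reachable node.
    dist = {src: 0}
    q = deque([src])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return dist

# Three small stars (centers c0, c1, c2) joined only through one bridge node "b".
adj = {}
def add_edge(u, v):
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

for c in ("c0", "c1", "c2"):
    for i in range(4):
        add_edge(c, f"{c}_leaf{i}")
    add_edge("b", c)

nodes = list(adj)
dist = {u: bfs_dist(adj, u) for u in nodes}

def betweenness(v):
    # The graph is a tree, so each pair (s, t) has a unique shortest path,
    # and v lies on it exactly when d(s,t) == d(s,v) + d(v,t).
    n = len(nodes)
    on_path = sum(
        1
        for i, s in enumerate(nodes)
        for t in nodes[i + 1:]
        if s != v != t and dist[s][t] == dist[s][v] + dist[v][t]
    )
    return on_path / ((n - 1) * (n - 2) / 2)  # normalize to [0, 1]

print(round(betweenness("b"), 3))  # high: all cross-star traffic passes here
print(betweenness("c0_leaf0"))     # 0.0: a leaf sits on no one else's path
```

The bridge node scores around 0.71 here because every pair of nodes from different stars must route through it, while a leaf scores exactly zero.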
In organizational consulting, which uses ONA (Organization Network Analysis), these bridges (or bottlenecks) are marked as the filters through which new ideas flow into the organization, or are sometimes blocked. This is why in Organization Network Analysis, finding bottlenecks is important in order to find out where processes get stuck or may get stuck. Professor Rob Cross, who uses network analytics to understand organizations, used this metric to optimize creative-thinking teams in the organization he was consulting for.
In order to create these teams, the organization he was working with chose key people from each department, i.e., the ones with a high Degree, and put them together to create synergy and produce new ideas. The disadvantage of this method was that these people were very busy with the affairs of their own department (hence their high Degree) and strongly defended or promoted their department's interests. Members with high Betweenness, on the other hand, were exposed to more areas of the organization and were more open to promoting interdisciplinary ideas.
So, we are left with the third category with its own definition of centrality. If the first deals with the number of connections and the second deals with mediation between parts of the network, the third deals with location: to be located at the heart of the network is to be central.
The best-known metric in this category is the Closeness Centrality measure. That is, the node is central if it is closer to the other nodes.
Let's demonstrate it with a real-life example and examine a pupil that sits in the center of the class. Even if she or he does not have many friends and does not mediate between groups, the position in the center of the class allows the pupil to hear all the noise and gossip during the lesson. The pupil's location in the center allows information to trickle in his or her direction, thus making the pupil central in the classroom network.
What does it mean to be the closest to other nodes in a network? It does not refer to physical distance but to the number of nodes, or steps, that we must go through in order to reach our node.
The mathematical definition for Closeness is the node that has the lowest average distance from the other nodes.
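That definition translates directly into code. A minimal sketch, assuming a toy 7-node path network, where the middle node should win (the usual normalized form divides n - 1 by the sum of distances, so a lower average distance means a higher score):

```python
from collections import deque

def closeness(adj, v):
    # Normalized Closeness: (n - 1) divided by the sum of distances from v
    # to all other nodes, so the lowest average distance scores highest.
    dist, q = {v: 0}, deque([v])
    while q:
        u = q.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                q.append(w)
    return (len(adj) - 1) / sum(d for u, d in dist.items() if u != v)

# A toy 7-node path network: 0-1-2-3-4-5-6.
path = {i: {i - 1, i + 1} & set(range(7)) for i in range(7)}
scores = {v: closeness(path, v) for v in path}
print(max(scores, key=scores.get))  # 3: the middle seat hears all the gossip
```

Node 3 is two steps or less from most of the network, just like the pupil in the center of the class.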
A possible application of Closeness Centrality in the field of intelligence, for example, is source recruitment. The advantage of recruiting a source with a high Degree is clear: they have access to many places and people. The downside is that such a target is probably highly visible and will be difficult to approach. Their position in the organization may also make it difficult for them to switch sides. On the other hand, someone who does not necessarily have a lot of connections, but is located in the core of the organization, may be able to get to places of interest with lower visibility. Instead of recruiting the CO or their bureau chief, we will try to recruit the assistant who sits in the office next door. This potential source might have a low Degree but probably a high Closeness.
Sure, there are naïve or less experienced listeners thinking right now to themselves that the office cleaner should have been the obvious choice for recruitment, but as everyone knows, a good cleaner is hard to find.
Another example of the importance of Closeness can be found in the field of epidemiology, the study of diseases. In these Corona days, epidemiology takes center stage, with an emphasis on studying the spread of the virus and mapping patients' contacts. Covid-19 sure made it easy for SNA to gain popularity once again.
When we map the contact networks through which the virus has spread, then the node with the highest Closeness will mark for us patient zero.
A node with a high Closeness score can also be the target of cyber-attacks that seek a central node close to many parts of the network, making it easier to spread through it.
Ok, so it's time to summarize the three metrics, and we'll do so by applying them in a practical way, and in a cool way.
In the practical example, we will use Facebook's network, and let's just remind ourselves that it consists of approximately 2.5 billion active users:
Let’s say I have 500 friends on Facebook. This means that my Degree on Facebook is 500.
Suppose all my friends are in Israel. What does this say about my Betweenness score on the network?
Probably a low score, since I do not bridge between different areas in the network because all my links are concentrated in a specific place. But - if I had a friend in the US and a friend in Brazil and a friend in Japan and a friend in Africa, then my Betweenness score would probably jump higher, as I become a bridge between remote areas in the network.
So, what's my Closeness ? I have no idea. To find out, I will have to calculate all the distances between nodes in Facebook's network. This algorithm takes a long time to compute on large networks, but I believe sometimes it's worth it because it's one of the more interesting ones.
I guess it's time for full disclosure: I'm a sucker for Closeness centrality. I don't need a reason for it, but if I have to give one, it is both because it is the least intuitive metric and because it captures centrality at the network's scale, unlike the Degree centrality, for example, which by nature reflects a more localized aspect of centrality.
For the cool example we will use the "Game of Thrones" series dataset. The links or edges in the network were created based on which character appears with which character in a scene. To avoid spoilers for the five people who have not yet watched it and plan to watch it sometime, we will settle for analyzing only the first season.
So, let's do a little quiz: Who in the first season of "Game of Thrones", leads the centralities' scoreboard?
The leading character in the Degree centrality in Game of Thrones' first season is...
Tyrion Lannister.
This endearing character appears in many scenes, but that is not enough. In order to get a high Degree score, a character has to appear with a lot of different characters. In the first season, Tyrion travels throughout the kingdom and therefore makes lots of connections.
And now, the leader in the Betweenness centrality is...
Varys the Counselor.
Recall that high Betweenness means that the character connects or bridges between different parts of the network and constitutes a bottleneck, in this case, of information.
Varys's nickname is "the spider" and the spy network he weaves extends to the furthest parts of the land, bridging kingdoms and continents.
And last, the leader in the Closeness centrality is... they... are......
Ned Stark. The ultimate Protagonist with a capital P.
And here lies the reason why I like Closeness centrality so much. If we were to ask the honest and naive Ned who is the most central figure in the kingdom, Honest Ned would probably answer "What the hell do you mean? The King, of course!". But what he does not understand is that he is at the heart of the plot, as Closeness will testify, at least until he encounters a serious Betweenness problem towards the end, which we'll not elaborate on.
Well, so how many did you get right?
As you can see, the centrality measures stimulate us to tell a story about the data by making us explain to ourselves why the central nodes are central. The different logic behind different centralities helps us to fit the story to the data.
There are many more metrics and new ones are added from time to time, so let's just address, in short, a few more common ones:
Let's start with PageRank. As a teaser for its logic, consider this use case: making our small website point to a major, central page won't make ours a central website; in PageRank, score flows in along incoming links, not out.
PageRank is named after Larry Page (pun intended), one of the founders of Google, and it was used in the 1990s to rank web pages.
In simple terms, the idea behind PageRank is that every page that points or directs to another page, lends it a score. At the beginning of the analysis, each page or node has the same score, 1 divided by N, N being the number of nodes in the network.
The algorithm performs several iterations and in each of them nodes give part of their score to the nodes they are pointing to.
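The iteration just described can be sketched in a few lines (a toy example with a hypothetical four-page web; the damping factor is the standard refinement from the original algorithm, not part of the bare description above):

```python
# Hypothetical tiny web of four pages; each page lists the pages it links to.
links = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
    "d": ["c"],
}
n = len(links)
damping = 0.85                     # standard damping factor, a later refinement
rank = {v: 1 / n for v in links}   # every page starts at 1/N

for _ in range(50):                # a few dozen iterations suffice on a toy graph
    new = {v: (1 - damping) / n for v in links}
    for v, outs in links.items():
        for u in outs:
            new[u] += damping * rank[v] / len(outs)  # hand out part of my score
    rank = new

print(max(rank, key=rank.get))  # "c": everyone points at it, directly or not
```

Note that "c" wins even though "a" also has two incoming links: "c" is pointed at by more pages, including ones that themselves accumulate score.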
This algorithm has undergone changes and updates over the years to make it especially suitable for finding central pages on the Internet.
One significant change was increasing the score of a page when pointed to by a Seed page. Seeds are pages that serve as clear indicators of quality, such as sites with a GOV extension, or university sites that can help rank other nodes.
The PageRank algorithm stemmed from the first category of centralities we talked about which defines centrality as the amount of links that the node has, or in the case of PageRank, the amount of links that the node's neighbors have.
The common feature that all the centrality measures share is that they are distributed as a Power Law: in each metric, a few will get a high score, or rank, and the majority will get very low scores. How does this fact help us?
Usually, when we think of a central figure in an organization, what pops into mind, obviously, is the head of the organization. The head is the one who makes the important decisions and leads the organization. But what about the deputy? Some might say that the deputy is the one who really does the heavy lifting, is in touch with the employees, etc. All the above insights are the result of a scenario-driven perspective, not data. These are insights that we derive from our past experiences, from what we have been taught and from our intuitions. We know that a manager is the one who makes decisions and the deputy manager is the one in charge of daily activities. Still, have we not encountered in the past a manager who can't manage? Is the deputy manager always so dominant? And when they are on sick leave or just on holiday, what happens then? To see what's really going on, we'll need a data-driven approach, because I guess no one will say that the central figure in an organization is the safety officer. Right?
But this was exactly the case in a network analysis made by Barabási, the famous network researcher we mentioned in the previous episode. Barabási performed an analysis of a factory whose management wanted to understand why their messages did not trickle down to the workers on the assembly line. Instead of using the organization's organigram, Barabási analyzed the network of contacts of all the employees to understand where they get their information from. What he discovered was that the most central figure did not sit in management; rather, it was the safety officer.
It turned out that this guy walked around a lot in the factory and was very sociable, so he made lots of connections and so became a useful tool in disseminating information.
Contrary to the initial intuition of organizations to fire someone who is more central than the manager, Barabási actually offered to invite this guy to the head office for a cup of coffee, tell him what the management was planning, and let him convey the message.
Centrality measures also help us tell a story about data. Data storytelling is a major challenge for the data analyst. Anyone can look at the data. But to draw from it conclusions and insights is the real challenge.
As a former Intelligence officer, I was required to tell a story about the enemy. But what to do when no one tells me what to tell?
In that context, a very senior officer once gave me the following advice, roughly translated to: "Don't panic. The truth is only second to confidence." (Sounds better in Hebrew), which is the equivalent in English to "When in doubt – shout".
A more data-oriented tip I've got was to use centrality measures. Here's an example for such a story derived from centralities:
Suppose that in the network of an organization, which is made up of several divisions, the most central nodes are labeled as logistics. What story can we tell about the organization?
We'll need to ask ourselves: what makes someone central?
Using Degree Centrality, we can say, for example, that it's because many turn to it.
And what does it mean if a lot of people turn to logistics?
Here we can tell plausible stories that revolve around the idea that logistics is a center of gravity, as of the time of the analysis:
For example, that the organization is very dependent on logistics or that logistics is a bottleneck in the organization, or that they encountered a logistical problem or just maybe that there is a plan in the making for a surprise party for the VP of logistics.
All these scenarios are plausible, except one. No one ever organizes surprise parties for the VP of logistics.
But what is beautiful about the data-oriented method is that it is very easy to test our hypothesis. No need to go through all the data or test the whole network. It's enough to verify the hypothesis by a qualitative research of just the few that lead the Power Law. Because of the nature of this distribution, they are the few that tell the network's story and through them we can test to see if we got our story right.
To sum this up, Yuval Noah Harari, in his book "Sapiens: A Brief History of Humankind", explains that in order to sustain human society, human beings were organized on the basis of imaginary ideas. An organigram, for example, is an imaginary idea. It is used to decide who pays who and how much but it does not necessarily describe how the organization really works and there is nothing in it that will tell us what is happening in the organization at the present time.
To this end, we have the centrality measures. Contrary to organigrams, they are not fixed and can vary according to events. For example, once the logistical bottleneck we described earlier has been resolved, the organization's center of gravity may shift to the organization's management when planning a new strategy, or to another part of the organization that has just turned out to be a new bottleneck.
Pheewww… Hope I didn't panic, shout, or lack confidence. Now let's dive deeper into the subject at hand, because it cannot be that simple:
To tell the network's tale, it is not enough to check which nodes have the highest score, since many times our data is noisy. For example, in many networks there are nodes that are not "players" in the network but exist there for technical or other reasons which might not be relevant to our analysis. These nodes have a tendency to create fictitious centers of gravity or fictitious relationships between actual players. In an email network, this could be, for example, a spam email or an error email sent from the server. Sometimes these phenomena might be of interest to our analysis, but many times they aren't.
And sometimes, even if the major node in the network is a real player in the network, it will not necessarily be our focal point for the purpose of the study. It depends on the context, and as an example we'll use an American study of the "Arab Spring" revolution in 2011.
According to foreign sources, SNA is widely used by official American bodies, such as the military and the NSA, the National Security Agency.
During the campaigns in Iraq and Afghanistan, SNA even became part of the American military doctrine titled "Countering Threat Networks" and we will expand on this issue in the episode on intelligence and the network.
And so, in 2011, an American study was conducted on the Twitter network in Egypt to identify the leading factors in the "Arab Spring" revolution. Twitter and Facebook were the leading social networks in Egypt that allowed the masses to organize and coordinate demonstrations.
The Americans assumed that finding the centers of gravity on the network would make it possible to find out who was behind the events and leading them.
Much to their surprise, a close examination of the node with the highest Degree revealed that it was... Justin Bieber.
No disrespect, the guy's doing his thing and that's ok, but why Justin? What's his connection to the revolution?
This is because celebrities can have tens of millions of followers connecting to them, so they will almost always overshadow the rest of the network. It is enough that a celebrity tweets using a current hashtag, and they will easily take 1st place in the Degree score.
However, further research of the Egyptian network has shown that despite Justin Bieber's network centrality, the "echoes" to the content he tweeted were weaker than the "echoes" to the actual revolutionary leaders' messages. How can one see it?
As we have mentioned earlier, a high score in the Degree measure alone is not enough (especially in large networks, due to the Degree's local nature) and it is necessary to create context for it as well.
If the no. 1 law of the network is that the network is distributed as a Power Law, then in this case we will be required to use the no. 2 law of the network: networks congregate into communities. That is, the network consists of clusters, each of which has its own reason to congregate and its own centers of gravity. Understanding which community is relevant for our analysis will help us find the relevant center of gravity, but this will be covered in the next episode, dedicated to communities in the network, where we will also crack Justin Bieber's mysterious affair with Tahrir Square, where the masses in Egypt gathered during the revolution.
So, in the meantime, a few might say: What's the problem? Let's just ignore a node that has a high Out-Degree, meaning that all its edges are outgoing links. Intuition has it that such a node should be considered a "network spammer". Right?
First of all, it depends on the context of our study of the network. There will be use cases where such nodes serve us well, for example when we want to spread through the network. Also, on a social level, maybe this node has an important role in disseminating information? For example, in the field of advertising.
In this field, celebs are widely used on social networks, in an attempt to gain from their great popularity and the exposure that results from the multitude of edges (i.e. followers) connected to them.
Large sums exchange hands for a network hub to post a product on social media.
To find such hubs, also known as influencers, companies can use centrality measures, and some do.
So, let's analyze such a case from 2019, in which an Israeli fashion company, Castro, launched a major campaign for designer glasses in the United States using the mega-influencer Kim Kardashian.
Kim, as I like to call her, has about 145 million followers on Instagram, and Castro's assumption was that a campaign led by her would bring about 10% of her followers to Castro's website and then convert 10% of those visitors into making a purchase. That is, the expected sales were roughly 1.4 million pairs of glasses (145M × 10% × 10%).
The campaign was finally declared a failure after sales turned out to be two orders of magnitude smaller than estimated. Why's that?
In the summer of 2019, GQ magazine led an investigation that revealed that 44% of Kim's followers were actually fake users.
Sounds like a lot, and it might be the first reason that pops to mind when trying to explain the failure of the campaign. But even after cleaning the data, that still leaves more than 80 million genuine followers, which is relatively not bad and still places her high in the network's Power Law distribution. And if we follow the company's sales forecast calculations, at least 800,000 pairs of glasses should have been sold.
In practice, the number of people registered on the company's site was about 90,000 and sales were around the order of only ten thousand pairs.
So, what happened here?
There is a difference between advertising and influence. A high score in the Power Law does not necessarily guarantee an impact, but it does guarantee noticeability, and a Power Law result. For example, the roughly 90,000 visitors to the site amount to about 0.1% of Kim Kardashian's (real) followers, a classic Power Law tail (about 1 in 1,000).
And we'll dedicate a future episode to deal with network influence and advertisement.
OK, so now for the big question: Which Centrality Measure should we use? What is the best metric?
You thought I'd say Closeness, didn't you?
Seemingly, it depends on what you are looking for. But why seemingly? Because the centrality measures have another common feature:
Their results will be very similar.
But how can that be? Each centrality measure has a different mathematical formula that stems from a different logic of what it takes to be central.
So, let's try to figure this out by using our star-shaped network. The center node will have the highest Degree score. Easy.
We will apply Betweenness centrality to see which node is the most traveled through in order to get from any node to any other, and again we will get our ol' friend, the center node.
When we use Closeness centrality to see which node is closest to the center of the network – well. You got the idea - we get the same node again.
It is clear that in large networks the picture is a little more complex, but still, the correlation between the measures stays pretty strong.
There is no unequivocal answer, but it can be estimated that this correlation between centrality scores ranges from 70 to 90 percent or more.
But wait, on which kind of network are we talking about? A directed network, where we will also refer to the direction of the edges or links or to an undirected network, which assumes that all links are mutual?
A network is a network, and contrary to intuition, the differences between the results in a directed network and an undirected network are not so dramatic.
I will tread carefully here and say that we're probably going to see a very high correlation (over 90 percent) between Degree and Eigenvector Centrality for example, because they stem from the same logic or family (the quantitative family).
Even more cautiously I will say that as the network grows and expands, the correlation between Degree and Closeness may diminish. Why?
Let's try to picture the growth of a network as the Big Bang. This is not unreasonable, since network science and astronomy share similarities (and did we mention Power Law?).
Now imagine the galaxies of the network moving further apart from each other. Will the largest star with the most planets orbiting it, representing high Degree, necessarily be the closest to the center of the universe, representing high Closeness? The chances for a big star to be in the center becomes smaller because there are very few big stars but there are many small planets.
Okay, so we've realized that correlation between centralities is high, so why do we need so many of them?
There are 2 answers: the simple one and the good one.
Let's start with the simple one: No need. If the correlation is high, then let's just use the Degree centrality. It is intuitive, fast to compute, everyone does it, and it gives us an 80/20 or Pareto solution.
Now let's move on to the good answer: Correlation should not frustrate us. On the contrary. It is what helps us find what's interesting in our network's data. And what is interesting? The anomalies we find in the correlation.
Think about it. Let's analyze an imaginary network of a thousand nodes by using Degree and Closeness. Unsurprisingly, we'll find that in the TOP10, eight share high scores in both measures. Excellent - we found eight nodes that are very central. But what about the other two?
Let's say the first has a high Degree and a low Closeness score. Boring. Looks like a node that spams the furthest part of our network with useless edges.
The second one has a high Closeness and a low Degree. Voila! Interesting. Why?
Because this node managed, although it has only a few edges, to locate itself in the heart of the network. That is, even though the node has invested less energy, it is in a central position. Apparently, it has an interesting trait that we would like to understand.
Thus, comparing the measures helps us tell a story about the data.
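As a sketch of that comparison, here is a hypothetical toy network with two busy hubs joined only through a quiet middleman. The middleman loses on Degree but wins on Closeness, which is exactly the kind of anomaly worth investigating (all names and sizes are made up):

```python
from collections import deque

# Two department hubs (h1, h2), each with six contacts, joined only
# through a quiet middleman "m" with just two links.
adj = {}
def add_edge(u, v):
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

for hub in ("h1", "h2"):
    for i in range(6):
        add_edge(hub, f"{hub}_leaf{i}")
add_edge("m", "h1")
add_edge("m", "h2")

def closeness(v):
    dist, q = {v: 0}, deque([v])
    while q:
        u = q.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                q.append(w)
    return (len(adj) - 1) / sum(d for u, d in dist.items() if u != v)

degree = {v: len(adj[v]) for v in adj}
close = {v: closeness(v) for v in adj}

# "m" has a tiny Degree but the best Closeness: the anomaly worth a story.
print(degree["m"], max(degree, key=degree.get))  # 2, and one of the hubs
print(max(close, key=close.get))                 # m
```

With only two edges, "m" sits one step from both hubs and two steps from every leaf, so it beats the hubs themselves on Closeness.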
As you may have noticed, the Closeness measure can be used as a substitute for network visualization. For example: in a community with several high-Degree nodes, using the Closeness measure for comparison purposes, can indicate which of those nodes is at the "heart" of the network (and therefore more significant) compared to a node located at the "far ends" of the network and therefore less influential (even if ranked as hub using the Degree centrality).
I told you, nothing beats the coolness of Closeness.
And now, a word of caution:
A common mistake is to try and fuse centralities with each other (for example by multiplying them), or the so-called, blender-blunder method.
The idea behind the blender-blunder method is a simplistic view that if we cannot decide which centrality is best, we will just throw them all in the blender and see what comes out. Most often, the result will be a gooey green muck. Because there is a high correlation between centrality measures, such a multiplication will only strengthen the strong and weaken the weak, and even worse, we will lose the insights that a comparison between the measures can give us.
So, a quick recap:
Although Degree, Betweenness and Closeness stem from different logics about what it means to be central in a network, their scores are usually correlated. For qualitative research purposes, it's advised to look for the anomalies, meaning the nodes whose centralities do not correlate, because they might tell us something interesting.
So far we've discussed the biases we may create in the data, but what about the biases in the data itself?
Biases may be formed, for example, from an incomplete view of the network. In many cases, when there is limited access to the network data, we will sample it, and the partial results might tell us a biased story.
So first of all, don't panic. In the absence of information, even a partial answer is better than nothing. The conclusions we'll draw from the partial study might be local in nature rather than global, but I prefer that to giving up on the analysis altogether. Of course, the more information we gather on the network, the better our answers will be.
So, let's discuss network sampling for a moment.
One way to collect network data is by using a snowball sampling. What does this mean?
We select one or more nodes from which we access the network, and scrape or harvest the nodes connected to these seed nodes. We can then collect the nodes connected to the connected nodes, and so on. This method is also known as "circle harvesting": we collect or harvest the node's first circle, i.e. the friends connected to it, then the second circle, which is the friends-of-friends, and so on.
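A snowball sample can be sketched as a layer-by-layer BFS. In this plain-Python sketch, the `graph` dict stands in for whatever scraping call your data source actually provides, so it is purely hypothetical:

```python
# The `graph` dict is a stand-in for a real scraping call (hypothetical data).
graph = {
    "seed": ["a", "b"],
    "a": ["seed", "c"],
    "b": ["seed", "d"],
    "c": ["a", "e"],
    "d": ["b"],
    "e": ["c"],
}

def snowball(seeds, circles):
    # Harvest `circles` rings of neighbors around the seed nodes.
    seen = set(seeds)
    frontier = list(seeds)
    for _ in range(circles):
        nxt = []
        for v in frontier:
            for u in graph[v]:       # the "scraping" step: fetch v's friends
                if u not in seen:
                    seen.add(u)
                    nxt.append(u)
        frontier = nxt
    return seen

print(sorted(snowball(["seed"], 1)))  # seed plus its first circle
print(sorted(snowball(["seed"], 2)))  # adds the friends-of-friends
```

Each extra circle widens the sample, but whatever we stop at, the picture stays centered on the seeds, which is exactly the bias discussed next.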
So where do we stop? How can we know that we have collected enough? We will talk about this more in the episode on tips and best practice for network analysis. No spoilers.
Let's be content for now with what we have, and note that the partial collection of the snowball method may skew the centrality measures in favor of the seed nodes, meaning the nodes from which we started to expand the network.
The Closeness measure is particularly sensitive to this method, since the seed nodes are by definition at the core of the sampled network: they are the "patient zero" of our research.
So, what can be done?
Beyond the caution required with any centrality measure, in such a case it is advisable to compare the measures, as mentioned earlier. For example, find the nodes with high Closeness or Betweenness and low Degree (assuming these are not our seed nodes) and investigate them, or expand the network through them. Why? Because as noted earlier, these nodes achieved a central place in the network with less "effort" (i.e., fewer connections) despite the gaps in the data.
So, let's conclude:
There are three main methods or metrics to find hubs or key players in the network:
How many links or edges does a node have, which is the Degree centrality.
To what Degree does the node constitute a bottleneck or bridge in the network, which is the Betweenness centrality.
How close is the node to the network's core, which is the Closeness centrality.
Comparing the different centrality measures gives us additional insights.
Many times, in small networks, the human eye will suffice to find the central nodes by intuition, and meaningful insights can be gained just from looking at the network.
Larger networks will require the use of algorithms, meaning centrality measures, which are available in any network analysis application and in open-source libraries. But first and foremost, it's important to understand the logic behind them in order to use them wisely.
This episode was sponsored by Power Law without whom centrality measures just wouldn't be the same. Did you enjoy, and want to share? Have you suffered and you do not want to suffer alone?
Tell your friends or rate us here. Thank you! Much appreciated! The music is courtesy of Compile band. Check them out! See you in the next episode of NETfrix (: #Network_Science #SNA #Social_Network_Analysis #Graph_Theory #Data_Science #Social_Physics #Computer_Science #Statistics #Mathematics #Social_Science #Physics #Facebook #Podcast