World’s Biggest Data Breaches

Rarely does a week go by without a government agency or large company announcing a data breach. For example, while writing this post I learned that a large-scale cyber attack had hit nearly 100 countries and held thousands of computers to ransom within a single day. The risks, costs and threats are increasing. A data breach means that data was accessed by individuals who should not have been able to access it. It also means that the protection around that data failed. The data can be personal information, such as health records, email conversations, online transactions and banking records, or corporate data, which is most often customer information or hosted applications.

While exploring the different kinds of data breaches that have happened over the past years, I came across “World’s Biggest Data Breaches”, created by the Information Is Beautiful site. In this post, I discuss what I liked, what I didn’t like and how, based on the raw data, I would build the graphs to get better insights.

Things I like

  • Aesthetics: The interactive, dynamic visualization creates a rich experience that makes it easy for users to navigate and analyze the data according to their interests.
  • Validation: Clicking a bubble shows additional details about that breach for context, along with sources the user can check to validate the numbers and related information.
  • Multiple filters: Users can view data breaches by Organization, Method of Leak, Number of Records Stolen and Data Sensitivity.
  • Colors: Different colors distinguish leaks by year or by method of leak.

Things I didn’t like

  • Misleading bubbles on the graph: As shown in Image 1, tiny bubbles appear when you change the filter, and it is hard to tell whether they represent security breaches because they have no tooltip or label. If they do represent breaches, a tooltip is needed to identify the company name and related information.

    Image 1: Misleading small bubbles among bubbles for valid breaches, and identically sized bubbles for breaches of different sizes.
  • Inaccurate depiction of information: In Image 1, Slack and Uber have the same bubble size even though the number of records stolen differs by a factor of ten: 500,000 records were stolen from Slack but only 50,000 from Uber (see Image 4). This is clearly wrong and gives the user inaccurate information.
  • Poor choice of information representation: The data runs from 2004 to 2017, and the way it is laid out has several flaws. Bubbles span multiple years, which makes it hard to tell when a breach happened. Because the years are laid out vertically, year-over-year analysis is practically impossible.
  • Weak KPIs: The visualization can be used for exploratory analysis, but it makes no claim and offers no insights. The same data could support stronger KPIs, for example which types of attack have occurred most over the years, a year-over-year analysis of total attacks, and which year had the largest number of attacks.
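
To make the bubble-size complaint concrete, here is a minimal Python sketch of one common way to size bubbles so that area, not radius, is proportional to the number of records. This is my own illustration, not how the original site computes its bubbles; the function name `bubble_radius` and the 40-pixel maximum radius are assumptions. The two record counts are the ones quoted above.

```python
import math

def bubble_radius(records, max_records, max_radius=40.0):
    """Return a radius whose circle AREA is proportional to `records`.

    Area grows with radius squared, so the radius must grow with the
    square root of the value. `max_radius` (in pixels) is an arbitrary
    choice for this sketch.
    """
    if records < 0 or max_records <= 0:
        raise ValueError("records must be >= 0 and max_records > 0")
    return max_radius * math.sqrt(records / max_records)

# Record counts quoted in this post.
breaches = {"Slack": 500_000, "Uber": 50_000}
largest = max(breaches.values())

radii = {name: bubble_radius(count, largest) for name, count in breaches.items()}

# Squaring the radius ratio gives the area ratio between the two bubbles.
area_ratio = (radii["Slack"] / radii["Uber"]) ** 2

for name, radius in radii.items():
    print(f"{name}: radius {radius:.1f}px")
print(f"Slack/Uber area ratio: {area_ratio:.1f}")
```

With this scaling the Slack bubble covers ten times the area of the Uber bubble, matching the tenfold difference in records, instead of the two appearing equal.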

How I would improve it

Luckily, I found the dataset behind this visualization and decided to create a few graphs to show how it could be improved.

  • I fixed the misleading bubble sizes by using the correct aggregation and adding a size axis to the graph. Using the X axis for size and the Y axis for year helped differentiate the data by size and time. Below is a sample graph fixing what was wrong in Image 1.

    Image 2: Fixing the size of bubbles
  • To resolve the incomplete use of historical data, I used the same bubble format but added an axis for the size of the breach as well. The result is a much clearer and less confusing visualization, with similar filters, that still serves the purpose of exploratory analysis.

    Image 3: Exploratory analysis of the world’s biggest data breaches
  • I implemented one of the stronger KPIs: the number of data breaches per year, broken down by method of leak. This graph answers several questions: (1) whether the number of data breaches is increasing or decreasing year over year; (2) which type of leak has happened most often over the years; (3) which types of leak are increasing, which means we need more safeguards against them; and (4) which types of leak are decreasing, which suggests the existing safeguards are good enough. Using the “Method of Leak” filter, we can also see the trend for a specific type of breach.

    Image 4: Deriving better insights by identifying strong KPIs
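
The aggregation behind a KPI like this can be sketched in a few lines of Python. This is only an illustration of the idea, not the code behind my graphs: the rows below are made-up examples rather than entries from the real dataset, and the field names `year` and `method` and the helper names are my own assumptions.

```python
from collections import Counter

# Hypothetical rows for illustration only; NOT taken from the real dataset.
rows = [
    {"year": 2014, "method": "hacked"},
    {"year": 2014, "method": "lost device"},
    {"year": 2015, "method": "hacked"},
    {"year": 2015, "method": "hacked"},
    {"year": 2015, "method": "inside job"},
    {"year": 2016, "method": "hacked"},
]

def breaches_by_year_and_method(rows):
    """Count breaches for each (year, method of leak) pair."""
    return Counter((row["year"], row["method"]) for row in rows)

def breaches_per_year(rows):
    """Count all breaches in each year, regardless of method."""
    return Counter(row["year"] for row in rows)

def year_over_year_change(per_year):
    """Change in breach count versus the previous year present in the data."""
    years = sorted(per_year)
    return {
        curr: per_year[curr] - per_year[prev]
        for prev, curr in zip(years, years[1:])
    }

by_year_method = breaches_by_year_and_method(rows)
per_year = breaches_per_year(rows)
yoy = year_over_year_change(per_year)
busiest_year = max(per_year, key=per_year.get)

print("Breaches per year:", dict(sorted(per_year.items())))
print("Year-over-year change:", yoy)
print("Year with the most breaches:", busiest_year)
```

Counting per (year, method) pair gives the series for each method of leak, and the year-over-year differences show at a glance whether breaches are rising or falling.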

Conclusion

In data visualization, it is very important to choose the right graph and KPIs to gain useful insights. This is especially true for events like data breaches, which are critical to any company in terms of user security and public relations. When building a visualization, we should make sure that the information in the graph is rendered correctly and that the type of graph suits the information available in the dataset.

References:

  1. World’s biggest data breaches: http://www.informationisbeautiful.net/visualizations/worlds-biggest-data-breaches-hacks/
  2. Raw Dataset: https://docs.google.com/spreadsheets/d/1Je-YUdnhjQJO_13r8iTeRxpU2pBKuV6RVRHoYCgiMfg/edit#gid=3