An Atlas of Suffering: Barriers to the American Dream

While looking for data for an individual project, I came across this interesting project by students at UC Berkeley. We all know that different groups of Americans have unequal access to resources, but we rarely know how large the inequality is. The visualizations in the blog show areas of unequal access and which gender and ethnicity are affected more. The data used in the graphs is grouped by race (black/white) and by sex (male/female).

Let me explain how the radar (spider) chart works. The charts below show the degree to which one group suffers relative to the others: the larger the score on the circular chart, the greater the suffering. A score of 1 means that the group faces the largest barrier to success in that category, while a score of 0 means that the group is the least likely to suffer. For example, in the graph below, black men have the highest number of barriers to realizing the American dream, because their polygon is large and “the bigger the polygon, the more the barriers”. Their Life Expectancy and College Degree scores are 1, which means they have the lowest life expectancy and the least likelihood of earning a college degree among the compared groups, while their suicide score is around 0.32, which means they are less likely to commit suicide.

Barriers for black men - Radar Chart
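
The scoring described above can be sketched in a few lines. This is only my guess at how such scores might be produced, not the Berkeley students' actual method: I assume min-max scaling across the four groups, with the scale flipped for indicators where a higher raw value is good. The raw numbers and the names RAW_INDICATORS, HIGHER_IS_BETTER and barrier_scores are all hypothetical.

```python
# Hypothetical raw indicators per group. These numbers are placeholders for
# illustration only; they are NOT taken from the Berkeley project.
RAW_INDICATORS = {
    "Life Expectancy (years)": {
        "black men": 72.0, "white men": 76.0,
        "black women": 78.0, "white women": 81.0,
    },
    "Poverty (rate)": {
        "black men": 0.20, "white men": 0.09,
        "black women": 0.24, "white women": 0.11,
    },
}

# Assumption: for these indicators a higher raw value is GOOD, so the
# scaled value has to be flipped to become a barrier score.
HIGHER_IS_BETTER = {"Life Expectancy (years)"}


def barrier_scores(values, higher_is_better=False):
    """Min-max scale one indicator across groups to [0, 1]; 1 = largest barrier."""
    low, high = min(values.values()), max(values.values())
    span = high - low
    scores = {}
    for group, value in values.items():
        scaled = 0.0 if span == 0 else (value - low) / span
        scores[group] = 1.0 - scaled if higher_is_better else scaled
    return scores


# One score table per indicator, ready to be drawn as radar axes.
ALL_SCORES = {
    name: barrier_scores(values, name in HIGHER_IS_BETTER)
    for name, values in RAW_INDICATORS.items()
}
```

Under these assumptions, the group with the lowest life expectancy gets a score of 1 on that axis, which is why a large polygon always reads as "more barriers" regardless of whether the underlying indicator is positive or negative.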

Things I like about visualization:

  • Use of Radar Chart to show functionality: A radar chart is a good way of comparing multiple quantitative variables, which makes it useful for seeing which variables have similar values or whether any variable has outliers. Here it makes the visualization functional and shows how different groups suffer in different categories.
  • Use of multiple variables making the graphs truthful: The authors considered multiple variables to support their claims about which barriers each group faces and how much each group suffers in every category.
  • Use of unique colors resulting in beautiful visualization: Each group has a distinct color, so the polygons remain easy to tell apart when stacked on top of each other in the combined graph, and the visualization looks beautiful.
  • Insightful and Enlightening: The combined visualization shows the degree to which one group suffers compared to the others and how much suffering there is in each category. Based on this, we can take effective steps to reduce the suffering and lower the barriers to the American dream.
Combined Radar Chart

Things I didn’t like about the Visualization:

  • Comparison of positive and negative dimensions in the same manner: The radar chart displays many dimensions, for example poverty and life expectancy. But poverty is a negative factor while life expectancy is a positive one. If both are plotted on the same radar and both have the same value, e.g. 1, it reads as very good life expectancy and high poverty, when in fact it means very low life expectancy and high poverty.
  • Misleading information: The displayed score is on a scale of 0 to 1, where 1 means that the group has the largest barrier to success in that category. For example, for white men, the graph suggests that overdose and suicide are equal barriers. But is that true? A score only places a group relative to the other groups within one category, so equal scores in two categories need not mean equal underlying rates.

How the visualization can be improved:

  • A broad range of values: Instead of showing values from 0 to 1 in the radar chart, we should show the actual values, or values scaled to a range of 1 to 100, to clearly identify the differences in each category.
  • Additional chart to support the claim: A radar chart is not very good for comparing values across variables, so along with the radar chart we should show a bar graph that allows comparison within each category.

    Supporting chart for better insights
  • Renaming of the confusing categories: Some categories make it hard to tell whether a high value has a positive or a negative effect on the barrier, and they should be renamed. For example, “life expectancy” can be rewritten as “poor life expectancy”.
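
As a rough sketch of the improvements listed above, the snippet below rescales the 0 to 1 scores to a 1 to 100 range, renames the confusing axes, and prints the rows a supporting bar chart would use. The three scores are the ones quoted earlier for black men; the label “No College Degree” and the function names are my own assumptions, not part of the original project.

```python
# Assumption: display names that make every axis read as a barrier.
# Only "Poor Life Expectancy" comes from the text; the other is hypothetical.
BARRIER_LABELS = {
    "Life Expectancy": "Poor Life Expectancy",
    "College Degree": "No College Degree",
}


def to_display_scale(score, low=1.0, high=100.0):
    """Map a 0-1 barrier score linearly onto the [low, high] range."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be between 0 and 1")
    return round(low + score * (high - low), 2)


def bar_chart_rows(scores):
    """Return (label, scaled value) pairs, largest barrier first."""
    rows = [
        (BARRIER_LABELS.get(name, name), to_display_scale(score))
        for name, score in scores.items()
    ]
    return sorted(rows, key=lambda row: row[1], reverse=True)


# Scores for black men quoted in the text above.
BLACK_MEN = {"Life Expectancy": 1.0, "College Degree": 1.0, "Suicide": 0.32}

# A crude text bar chart: one '#' per five points on the 1-100 scale.
for label, value in bar_chart_rows(BLACK_MEN):
    print(f"{label:<22}{value:>7.2f}  {'#' * round(value / 5)}")
```

Rescaling alone does not add information, so showing the actual underlying values next to the bars, where they are available, would still be the better option.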

Conclusion: Radar charts are useful for seeing which variables score high or low within a dataset, making them well suited to comparing performance across multiple dimensions. But we should limit the number of polygons and variables, because multiple polygons in one radar chart make it cluttered, confusing and hard to read, and many variables create many axes, which complicates the chart further. So it’s good practice to keep radar charts simple, limit the number of variables used, and provide additional graphs alongside the radar chart.

References:

  1. Blog Reference: https://ikesmith.github.io/Priv_Git_Smith/index.html
  2. Good example of radar chart: https://www.tableau.com/about/blog/2015/7/use-radar-charts-compare-dimensions-over-several-metrics-41592
  3. Effective use of radar chart: http://www.msktc.org/lib/docs/KT_Toolkit/Charts_and_Graphs/Charts_and_Graphics_Radar_508c.pdf

 

 

World’s Biggest Data Breaches

Rarely does a week go by without a government agency or large company announcing a data breach. For example, while writing this blog I learned that a large-scale cyber attack had hit nearly 100 countries and held thousands of computers to ransom within a single day (click here to know more). The risks, costs and threats are increasing. A data breach means that data was accessed by individuals who should not have been able to access it. It also means that the protection of the data failed. The data can be personal information, such as health records, email conversations, online transactions and banking records, or corporate data, which is most often customer information or hosted applications.

While exploring the different kinds of data breaches that have happened over the past years, I came across “World’s Biggest Data Breaches”, created by the Information Is Beautiful site. In this post, I discuss what I liked, what I didn’t like and, based on the raw data, how I would create the graphs to get better insights.

Things I like

  • Aesthetics: The interactive and dynamic visualization creates a rich experience that makes it easier for users to navigate and analyze the data based on their interests.
  • Validation: When a bubble is clicked, the visualization provides additional details of the breach for context, along with sources from which the user can validate the numbers and related information.
  • Multiple filters: Users can view data breaches by Method of Leak, Number of Records Stolen, Data Sensitivity and Organization.
  • Colors: Different colors are used to distinguish leaks by year or by method of leak.

Things I didn’t like

  • Misleading bubbles on the graph: As shown in Image 1, tiny bubbles appear when you change the filter, and it is hard to say whether they represent security breaches because they show no tooltip or label. If those bubbles do represent security breaches, a tooltip is required to identify the company name and related information.

    Image 1: Misleading small bubbles among bubbles for valid breaches, and same-sized bubbles for breaches of different sizes.
  • Inaccurate depiction of information: In Image 1, Slack and Uber have the same bubble size even though they differ hugely in the number of records stolen: 500,000 records were stolen from Slack but only 50,000 from Uber (refer to Image 4). This is clearly wrong and conveys inaccurate information to the user.
  • Poor choice of information representation: The data is shown from 2004 to 2017 and the layout has multiple flaws. Bubbles lie across multiple years, which makes it hard to tell when a breach happened. Because the years are laid out vertically, it is impossible to do a year-over-year analysis.
  • Weak KPIs: This visualization can be used for exploratory analysis but does not make a claim or offer insights. The data could support stronger KPIs, for example which types of attack have occurred most over the years, a year-over-year analysis of total attacks, and which year had the maximum number of attacks.
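
To make the bubble-size complaint concrete, here is a minimal sketch of one common convention: scale the bubble so that its area, not its radius, is proportional to the number of records. I do not know how the original site sizes its bubbles; the function bubble_radius and the 40-pixel maximum are my own assumptions, while the Slack and Uber counts are the ones quoted above.

```python
import math


def bubble_radius(records, max_records, max_radius=40.0):
    """Radius whose circle AREA is proportional to the number of records."""
    if records < 0 or max_records <= 0:
        raise ValueError("records must be >= 0 and max_records > 0")
    return max_radius * math.sqrt(records / max_records)


# Record counts quoted in the text above.
BREACHES = {"Slack": 500_000, "Uber": 50_000}
largest = max(BREACHES.values())

# The largest breach gets max_radius; every other bubble shrinks with sqrt.
RADII = {name: bubble_radius(count, largest) for name, count in BREACHES.items()}
```

With this scaling, the Slack bubble covers ten times the area of the Uber bubble, matching the tenfold difference in records, so the two could never be drawn at the same size.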

How I would improve it

Luckily, I found the dataset for this visualization and decided to create a few graphs to show how it can be improved.

  • I fixed the misleading bubble sizes by using the correct aggregation and adding a size axis to the graph. Using the X axis for size and the Y axis for year helped differentiate the data by size and time. Below is a sample graph fixing what was wrong in Image 1.

    Image 2: Fixing the size of bubbles
  • To resolve the incomplete use of historical data, I used the same bubble format but added an axis for the size of the data as well. The result is a much clearer and less confusing visualization, with similar filters, that serves the purpose of exploratory analysis.

    Image 3: Exploratory Analysis of World’s biggest data breach
  • I implemented one strong KPI that shows the number of data breaches over the years by method of leak. We can gain many insights from this graph: (1) whether the number of data breaches is increasing or decreasing year over year, (2) which type of leak has happened most over the years, (3) which type of leak is increasing, which means we need more safeguards against it, and (4) which type of leak is decreasing, which suggests the existing safeguards are good enough. Using the “Method of Leak” filter, we can also see the trend for a specific type of breach.

    Image 4: Deriving better insights by identifying strong KPIs
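
The KPI behind Image 4 boils down to a simple aggregation. The sketch below shows the idea with a handful of made-up rows; the (year, method of leak) layout, the method names and the function names are assumptions for illustration and do not reproduce the actual spreadsheet.

```python
from collections import Counter, defaultdict

# Hypothetical (year, method_of_leak) rows; NOT the real dataset.
SAMPLE_ROWS = [
    (2014, "hacked"), (2014, "hacked"), (2014, "inside job"),
    (2015, "hacked"), (2015, "lost device"),
    (2016, "hacked"), (2016, "hacked"), (2016, "hacked"), (2016, "poor security"),
]


def breaches_per_year_by_method(rows):
    """Count breaches for every (year, method) pair, ordered by year."""
    counts = defaultdict(Counter)
    for year, method in rows:
        counts[year][method] += 1
    return {year: dict(by_method) for year, by_method in sorted(counts.items())}


def year_over_year_change(rows):
    """Change in total breaches versus the previous year present in the data."""
    totals = Counter(year for year, _ in rows)
    years = sorted(totals)
    return {
        year: totals[year] - totals[previous]
        for previous, year in zip(years, years[1:])
    }


BY_METHOD = breaches_per_year_by_method(SAMPLE_ROWS)
CHANGES = year_over_year_change(SAMPLE_ROWS)
```

Plotting BY_METHOD as one line per method of leak would answer the second to fourth questions, and CHANGES answers the first one directly.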

Conclusion

In data visualization, it is very important to choose the right graph and KPIs to gain useful insights, especially for events like data breaches, which are crucial to any company in terms of user security and public relations. While making a visualization, we should make sure that the information in the graph is rendered correctly and that the type of graph suits the information available in the dataset.

Reference:

  1. World’s biggest data breaches: http://www.informationisbeautiful.net/visualizations/worlds-biggest-data-breaches-hacks/
  2. Raw Dataset: https://docs.google.com/spreadsheets/d/1Je-YUdnhjQJO_13r8iTeRxpU2pBKuV6RVRHoYCgiMfg/edit#gid=3

Visualizing causes of death over age

According to a leading newspaper, of the 56.4 million deaths worldwide in 2015, more than half (54%) were due to the top 10 causes of death. Heart disease and stroke are the world’s biggest killers, accounting for a combined 15 million deaths in 2015. These diseases have remained the leading causes of death globally for the last 15 years.

The visualization below shows statistics on people who died between 2005 and 2014, with causes of death grouped into categories of disease. The graph shows the percentage of people who died of a given cause at a particular age. The data is also segregated by gender and ethnicity. The Centers for Disease Control and Prevention classifies causes of death into 113 causes, which are grouped into 20 categories of disease and external causes to make the chart less complex.

Causes of death over age

What I liked:

The stacked area chart shows data by gender and ethnicity (White, Asian, etc.). As shown in the image below, when you click on a color band/area in the graph, it displays the age and percentage of people affected by that disease.

Showing a specific disease by clicking on the area

This makes it easier to see the impact of a disease group as age progresses. The graph also has buttons to move to the next and previous sections, and a “show all” button to return to the normal graph. The gender and ethnicity filters are well thought out and provide insight into a targeted group.

What can be improved:

  1. As mentioned earlier, the chart shows data from 2005 to 2014, but when we look at the chart and hover over it, it is difficult to say what percentage of people died at what age from a specific disease group.
  2. The chart shows combined data for 2005 through 2014. An additional graph showing the trend over that period would be a good enhancement, indicating which types of disease are causing more deaths over time.
  3. Comparing the causes of death between different ethnicities or genders is not possible in this chart. To see the causes of death for men or for women, a filter has to be applied, so the two cannot be viewed side by side.

How to improve to derive better insights:

  1. A tooltip can be added to the existing graph to show the exact percentage and age for a specific cause. This will help users who are not only looking at trends but also seeking precise facts. The Texas Oil Rigs example (reference 3) shows how a tooltip can be used to extract precise information from a chart.
  2. Add functionality to compare trends between genders or ethnicities. This can be achieved either by adding a multi-select filter to the existing graph (reference 4) or by creating additional graphs that compare males with females and different ethnicities with each other.
  3. As mentioned in the second point of the section above, the graph does not show the trend over time. It would be a good idea to add a time-series animation covering percentage, age and year. The Wealth and Health of Nations example (reference 2) shows how time-series animation can supercharge analysis when two dimensions other than time are more important.
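
The comparison suggested in the second point could be prepared roughly as follows. This is a minimal sketch under my own assumptions: the death counts for one age are invented, and cause_shares and compare_groups are hypothetical helper names, not part of the original chart.

```python
def cause_shares(deaths_by_cause):
    """Convert death counts per cause into percentages of all deaths."""
    total = sum(deaths_by_cause.values())
    if total == 0:
        return {cause: 0.0 for cause in deaths_by_cause}
    return {cause: 100.0 * n / total for cause, n in deaths_by_cause.items()}


def compare_groups(group_a, group_b):
    """Per cause: (share in A, share in B, A minus B), in percentage points."""
    shares_a, shares_b = cause_shares(group_a), cause_shares(group_b)
    result = {}
    for cause in sorted(set(shares_a) | set(shares_b)):
        a, b = shares_a.get(cause, 0.0), shares_b.get(cause, 0.0)
        result[cause] = (a, b, a - b)
    return result


# Hypothetical death counts at a single age; NOT real CDC figures.
MEN = {"Circulatory": 40, "Cancer": 35, "External": 25}
WOMEN = {"Circulatory": 30, "Cancer": 50, "External": 20}

COMPARISON = compare_groups(MEN, WOMEN)
```

Each tuple holds exactly the numbers a tooltip would need, and repeating the calculation for every age gives two stacked profiles that can be drawn side by side.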

The stacked area chart is a good way to visualize this problem, but seen from a different perspective it can be improved in the ways mentioned above to give viewers better insights. Sometimes, looking at the same data from different perspectives can expose hidden facts in the data, just as adding trends over time would improve this visualization.

References:

  1. Causes of death: https://flowingdata.com/2016/01/05/causes-of-death/
  2. Wealth and Health of Nations: http://goo.gl/9nPEUC
  3. Using tooltips: https://public.tableau.com/en-us/s/gallery/texan-oil-rigs
  4. Using multi-select filters: https://public.tableau.com/en-us/s/gallery/ice-melting (the Years drop-down is multi-select)

Discover beer and say cheers!

– Ekta Ratanpara

I am very fond of beer and like to try out different kinds. While doing some research on which beer I should try next, I came across the “Beerviz” site, which was created by students at UC Berkeley.

Chord Graph showing similarity relations between beers

The site displays an interesting chord graph showing similarities between different brands and types of beer available throughout the world. It also has some graphs showing how the data is distributed and the top five beers by type, i.e. Dark, Medium and Light. While I loved a lot of the features incorporated in the visualization, a few factors are misleading as well. The image below shows the high-level analysis of beer popularity.

High-level analysis of beer popularity

I will try to summarize what I believe works well and what could be improved.

What works well:

  1. Choice of graph to display similarities between beers: A chord graph works well when inter-relationships between multiple types of data points need to be visualized. It makes it easy for the viewer to see relationships between different types of beer and their popularity.
  2. Categorization and Filters: The two-level categorization of beers is really helpful for narrowing down the exact kind and type of beer you want to explore and its similarities. The website asks the user to select the malt of the beer, and the types of beer are shown in the legend so the user can identify which color corresponds to which type. To narrow things down further, there are filters on beer attributes such as appearance, taste and aroma.
  3. Graphs showing high-level analysis of data: In addition to the similar beers in the chord graph, there are a few graphs showing ratings by attribute, popularity and top beers. These add further value to the overall analysis by giving the user instant choices and helping them explore the similarity graph.

What can be improved and how:

  1. Factors to decide popularity: Instead of only the number of ratings, a combination of the number of ratings and the average rating should be used so that the user can make an informed decision. Showing popularity based on either one alone has a couple of problems, which the xkcd comic sums up best (reference 3). For example, beer A has 10,000 ratings with an average of 1.5/5, and beer B has 5 ratings with an average of 4.9/5. If only the number of ratings is used, beer A looks better; if only the average rating is used, beer B looks better. Either measure alone can lead to an incorrect conclusion. Instead, I would use a weighted average with a smoothing factor, or require a minimum number of ratings, to reduce the misleading effects of a ‘5-star’ rating system.
  2. Top 5 beers based on a combination of the number of ratings and the average rating: In the “About the Data” section, the top 5 beer graphs are based on the number of ratings. As mentioned in point 1, the number of ratings cannot be the deciding factor for popularity, and the same alternative can be applied here as well.
  3. The size of the chord graph: Some of the names on the graph are not displayed in full and are cut off in the UI, which creates a negative user experience. This issue should have been caught and resolved during testing, or a drill-up/drill-down approach should be taken in which selecting a beer opens a new graph showing the relations of only that beer with other beers.
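
One way to implement the smoothing idea from point 1 is a Bayesian-style average, where every beer starts with a number of “phantom” ratings at a neutral score. This is only a sketch of that general technique, not something Beerviz does: the prior mean of 3.0, the prior weight of 50 and the function names are all assumptions that would need tuning on the real data.

```python
def smoothed_rating(average, count, prior_mean=3.0, prior_weight=50):
    """Pull the average toward prior_mean; the pull weakens as count grows."""
    return (prior_weight * prior_mean + count * average) / (prior_weight + count)


def rank_beers(beers, min_ratings=0):
    """Sort (name, average, count) tuples by smoothed rating, best first.

    Beers with fewer than min_ratings ratings are left out entirely.
    """
    eligible = [beer for beer in beers if beer[2] >= min_ratings]
    return sorted(
        eligible,
        key=lambda beer: smoothed_rating(beer[1], beer[2]),
        reverse=True,
    )


# The two beers from the example above: (name, average rating, number of ratings).
BEERS = [("A", 1.5, 10_000), ("B", 4.9, 5)]

for name, average, count in rank_beers(BEERS):
    print(name, round(smoothed_rating(average, count), 2))
```

With these assumed parameters, beer A stays close to its 1.5 average because 10,000 ratings outweigh the prior, while beer B is pulled from 4.9 down to roughly 3.2 because five ratings are weak evidence. Passing min_ratings=20 applies the alternative constraint and removes beer B from the ranking altogether.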

Overall, the visualization is quite attractive, but implementing the changes mentioned above could drastically increase the usability of the dataset and the information provided through the graphs.

Reference:

  1. Beerviz | Discover Beer & Say Cheers!
  2. Beerviz – Work Report 
  3. XKCD Comic on Problems with averaging star ratings