Word Cloud as a tool for visualization

Source:  https://www.census.gov/dataviz/visualizations/007/

Description: This word cloud presents the cities that have ever been listed among the 20 most populous cities in the country since 1790. The size of each city name reflects the number of times that city has ranked in the top 20.

What are the pros of using a word cloud as a tool for visualization?

1. A word cloud reveals the essentials: the key words pop out, and a reader can take them in at a glance.

2. Word clouds are quick and easy to make compared with other forms of visualization.

3. Word clouds are engaging: their visual representation of data tends to make an impact and generate interest among the audience.

What are the cons of using a word cloud as a tool for visualization?

1. One major drawback of a word cloud is that its display emphasizes the frequency of words, not necessarily their importance.

2. A word cloud generally distinguishes words by size according to their frequency of occurrence, but design details such as the white space between characters or the use of a bold font can make a word appear more or less important relative to others in the cloud. This can mislead the viewer.

Let’s move on to the critical analysis of this visualization:

1. Does the visualization fulfill its purpose? – Since the goal of the visualization is to show the number of times these cities ranked among the 20 most populous cities in the US since 1790, using the primary feature of the visualization, i.e., the font size, to convey that information makes sense. However, there should be a way to convey another important piece of information about the cities, namely their current population, to remove the confusion described in the rebuttal below. At the end, I describe some ways to address this.

2. Audience: I guess it is meant for the general public living in the USA. If it is meant to serve the US government or to support a survey such as the CPS, it defeats its purpose because of the weaknesses mentioned below.

3. Claim: This visualization claims to present the cities that ranked among the 20 most populous in the United States from 1790 to 2010.

4. Rebuttal includes the following points:

Misleading font sizes: The relative font sizes of different cities are counter-intuitive, as the following examples show:

1. A city with a smaller font size is more populous than cities with larger font sizes: Even though Los Angeles is the second most populous city in the US, its font size is much smaller than that of cities like Baltimore and Boston, whose populations are each about one-sixth of that of Los Angeles. The reason is that LA’s population grew recently, while the data presented is historical (1790-2010). Hence, the plot does not take into account the current (or recent) population numbers of these cities. This can make viewers think that Baltimore and Boston are much more populous than LA, even though the opposite is true. Hence, it cannot convince its viewers.

2. Cities with similar font sizes differ enormously in population: Likewise, the similar sizes of New York, Baltimore and Boston give the impression that the populations of these cities are comparable. However, New York is approximately twelve times more populous than either of the other two. Hence, the font sizes are completely uncoordinated with the current population numbers of these cities.

Hence, I would not categorize this visualization as truthful, because it is deceptive. It also cannot be counted as insightful or enlightening: because it ignores the details mentioned above, it neither provides any new information to the audience nor can it initiate any change.

What could be done better?

The visualization has two main features:

  1. Font size of the cities.
  2. Font color of the cities.

As can be seen from the visualization, both of the above features are used to convey the primary goal/statistic, which is the number of times these cities ranked among the 20 most populous cities in the US since 1790. Cities with a higher value of the primary statistic have a bigger font and a darker shade of green, and vice versa for cities with a lower value. For example, Los Angeles is smaller and lighter than Baltimore.

My suggestion:

Instead of using both features (font size and font color) to convey the same information, i.e., the primary statistic, use one of them to convey the recent population numbers of the cities. I would use font color to convey recent population numbers, using a visualization somewhat like Figure 1. In the figure, font colors are varying shades of red. The darker the font color, the more populous the city. The larger the font size, the greater the primary statistic for that city. Some quick observations from the figure:

1. Los Angeles has the lowest primary statistic; hence, it has the smallest font size.

2. Baltimore has the smallest population; hence, it has the lightest shade of red.

3. The populations of New York and Los Angeles are closer to each other than those of any other city pair; hence, their font colors are very similar. But since the two cities have the largest difference in their primary statistic values, their font sizes differ more than those of any other city pair.

The advantage of this visualization scheme is that it effectively ties the primary statistic of each city to its recent population numbers. This puts a check on the confusion arising from font sizes that are uncoordinated with current/recent population numbers. On the one hand, the viewer sees the contrast in the font sizes of New York and Los Angeles and infers the large difference in the corresponding primary statistics. At the same time, the viewer notes the striking similarity between the font colors of the two cities, which should prompt them to pause (and read the description) and infer the closeness of the two cities’ population numbers. All these changes, if implemented, can make the visualization more convincing, truthful, enlightening, and insightful.
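The two-channel idea can be sketched in a few lines of Python. This is only an illustration of the encoding, not how the Census graphic or my Figure 1 was built: the function names, the point sizes, the light-to-dark red endpoints and the city values are all assumptions chosen for the example (the numbers are placeholders, not Census figures).

```python
def normalize(value, lo, hi):
    """Map value linearly from [lo, hi] onto [0, 1]."""
    return 0.0 if hi == lo else (value - lo) / (hi - lo)


def font_size(times_in_top20, lo, hi, min_pt=12, max_pt=60):
    """Font size in points grows with the primary statistic."""
    return round(min_pt + normalize(times_in_top20, lo, hi) * (max_pt - min_pt))


def red_shade(population, lo, hi):
    """Hex colour from light red (small population) to dark red (large)."""
    light, dark = (255, 200, 200), (139, 0, 0)
    t = normalize(population, lo, hi)
    rgb = [round(a + t * (b - a)) for a, b in zip(light, dark)]
    return "#{:02x}{:02x}{:02x}".format(*rgb)


# Placeholder values for illustration only, not Census figures:
# (times ranked in the top 20, recent population in millions)
cities = {
    "New York": (23, 8.2),
    "Baltimore": (22, 0.6),
    "Boston": (22, 0.6),
    "Los Angeles": (10, 3.8),
}

counts = [c for c, _ in cities.values()]
pops = [p for _, p in cities.values()]

# Size encodes the primary statistic, colour encodes recent population.
styles = {
    name: (font_size(c, min(counts), max(counts)),
           red_shade(p, min(pops), max(pops)))
    for name, (c, p) in cities.items()
}

for name, (pt, colour) in styles.items():
    print(f"{name}: {pt}pt, {colour}")
```

Because each visual channel carries a different variable, a city can be small but dark (few appearances in the top 20, large population today) or large but light, which is exactly the distinction the original graphic hides.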

Conclusion:

Simple is not always better: When I noticed the weaknesses of this word cloud, I realized that simple is not always better. Although word clouds are easy to make and easy to interpret, mistakes such as using two features (size and color) to represent only one dimension may mislead the viewer.

 

 

 

 

Native vs foreign born Americans without a high school diploma.

Blog link: https://www.census.gov/dataviz/visualizations/035/

Introduction:

This visualization presents a comparative analysis of the percentage of people without a high school education, across different race groups, among native- and foreign-born Americans in various regions of the USA.

Evaluating Bar-graphs as a tool for visualization:

  1. Bar graphs are an excellent tool for showing comparisons among various segments. One thing I liked about this chart is its simplicity. It shows the comparison across four dimensions, namely nativity, gender, region and ethnicity, without making the visualization cluttered and complex.
  2. Bar graphs are generally accessible to a wide audience, and they allow a visual check on the accuracy and reasonableness of calculations. However, this visualization has certain weaknesses that make it difficult for its audience to interpret. No information is provided about how the percentages of the population are calculated.
  3. The use of different colors makes trends easier to highlight and interpret. However, both purple and blue are used in lighter shades, and a person with color blindness could easily confuse the categories.
  4. It is difficult to see the differences because too many graphs are plotted. It is especially difficult to compare two groups that are far apart.
  5. One of the most basic pieces of information missing from this visualization is the timeframe for which the bar graphs are plotted. I am unable to figure out whether the plot is a snapshot of these statistics at a particular time or an aggregate/average over a period of time.

Moving on to the critical analysis….

  1. Goal: This visualization clearly shows that the foreign-born population, irrespective of gender and ethnicity, lags behind native-born Americans by a substantial margin in attaining a high school education. However, various sources [1] suggest that the percentage of the foreign-born population attaining college-level education is higher than the corresponding percentage among native-born Americans, and immigrants in the USA are considered to play a vital role in its economic success. It is enlightening for viewers to learn that, despite this, foreign-born Americans lag behind in high school education.
  2. Audience: I am unable to identify the target audience for this visualization. It may not be sufficient to help any category of audience (native or foreign born) because it fails to expose key assumptions, causes, and the impact on the lives of people who have or have not attended high school. For example, the high percentage of people without a high school diploma may be due to poor living standards or other social conditions specific to the region/race. The visualization does not provide any insight into these details. Therefore, I would not categorize it as insightful.
  3. Claim: This visualization does not present any claim about native-born versus foreign-born Americans, their work-life success, or their earnings. As there is no explicit claim, there cannot be any warrant backing it.
  4. Rebuttal: As there is no explicit claim or argument, one cannot offer any counter-arguments. The only information evident from the graphs is that the overall share of people with a high school education is much higher among native-born than foreign-born Americans. If the designer wanted to conclude from this that natives are more successful, I could come up with counter-arguments. First, the data source for this visualization gives no information about population sizes: we don’t know the proportions of native-born Americans and immigrants at the time the graph was plotted. Second, the number of immigrants in the USA is increasing at a very rapid rate compared with the population of native-born Americans. So, I cannot categorize this as either insightful or convincing.
  5. Key performance indicator: The chart takes four dimensions into consideration: nativity, gender, region, and ethnicity. But it does not provide any quantifiable measure that could be used to gauge an indicator’s performance over time, for example, the performance of an ethnic group such as Hispanics or Blacks over time. It also leaves open whether the lack of education is widespread within a particular ethnic group or region.
  6. No information about any particular location in a region: The visualization provides statistics at the level of geographical regions (South, Midwest, Northeast and West), which is a very high level. To give better insight, the numbers should have been provided at the state/county level. The reason is that the population of certain races may be concentrated in a few states/counties of a region, and the high percentage of people without a high school diploma among those races may be due to fewer schools in those areas or other factors. Having statistics at a more geographically granular level (i.e., states/counties) would unearth such details and make the statistics less prone to losing them when numbers are added/averaged across the states/counties of a region.

What could have been done better:

  1. Firstly, using stacked bar graphs would have reduced the number of graphs. The most striking difference is between native- and foreign-born Americans. So, the nativity dimension can be combined using stacked bar graphs to reduce the number of bar graphs from sixteen to eight, as illustrated here. This would ease the comparison of percentages across native- and foreign-born Americans, as both statistics are now on the same bar. Fewer graphs would help users make sense of the information.
  2. A bubble chart could have been used to clearly show the percentage of the population among males and females in both categories. This would help identify whether any particular region and ethnicity is most prone to lower educational attainment.
  3. Use of multiple sources of data: To arrive at critical comparisons, data should be captured from multiple sources; this increases the credibility of the visualization.
  4. Aesthetics could have been improved by using clearly distinct colors rather than similar shades.

You can view some more of my redesigns here:

https://us-east-1.online.tableau.com/#/site/magarwalscuedu/workbooks/57227/views

 

References:

  1. http://www.breitbart.com/education/2016/03/31/census-foreign-born-adults-less-likely-high-school-degree-native-born-likely-advanced-degree/

 

 

Visualizations that make you dumb!

Introduction:

This visualization, Books That Make You Dumb, was featured on boston.com in 2008: http://archive.boston.com/bostonglobe/ideas/brainiac/2008/01/books_that_make.html

The author obtains the average SAT scores of different universities and also pulls the top 10 books that the students at these universities recommend. For example, if your SAT scores are low, you are likely to be admitted to a mid-tier university where your fellow students also follow content that is not very intellectually compelling.

Using this, he tries to identify which books are read by students in the low SAT score bucket and which are not. In doing so, the author takes an unconventional and interesting stab at tagging books by intellectual calibre, rather than the converse approach of tagging intellectual calibre by books (weird but interesting, yes!).

What is the author’s claim?

To be able to understand the visualization better, it is imperative to understand the question the author is trying to answer.

So, I went on to define the objective dimension:

What does this visualization do?

The visualization uses the average SAT score as a proxy for intellectual prowess and classifies books based on how many intellectuals are reading them.

Who is it targeted at?

The visualization was featured on boston.com and Gawker and was possibly targeted at the readers of these publications.

How does he do it?

He uses the average SAT scores from colleges and the top 10 books they recommend.

 

Analyzing the visualization from a subjective standpoint

For any visualization to be successful and serve its purpose well, we expect it to be truthful, functional, beautiful, insightful and enlightening.

Truthful-  So, there are a couple of things here –

                                Data + Assumptions–> Visualization 

Data – The visualizer pulls the data on average SAT scores and the top 10 recommended books for all colleges on Facebook. So, he is typically looking at these books from the perspective of 17-18 year olds.

The choice of books would have been very different if there were no age-group restriction. For example, Don Quixote is considered the greatest book of all time in the classic genre (based on http://thegreatestbooks.org/), but it is practically nowhere on the list. So, this list is heavily skewed in favor of the preferences of 17-18 year olds and is unlikely to convey anything to people from other age groups.

If it were to include other genres, the distribution of genres would also be very different, with classics constituting only 13% of the total (Source: https://ebookfriendly.com/most-popular-book-genres-infographic/).

Also, where did Shakespeare vanish? He might be the most famous author of all time (Source: https://www.smashinglists.com/ten-most-famous-authors-of-all-time/2/). But he definitely doesn’t seem to be on the list of many 17-year-olds!

Another point of concern is that while SAT scores are descriptive of the whole student population, the book recommendations are provided by a pool of ‘Active-On-Facebook’ students only.

Also, there seems to be a disconnect between the color coding on the graph and the genre in the underlying raw table. I wonder if some of the changes to the genre were made by the author. For example, Lolita is classified as ‘Erotica’ in the above visualization while the underlying data classifies it as a ‘Classic’. (Underlying data can be found here-

Assumptions – The author assumes that the SAT score (not EQ or IQ!) is a measure of intellectual capability.

Another assumption is that when people with high SAT scores (the smart and intellectual ones) read a book, it makes the book an intellectual one, which I find quite questionable!

Functional-  I would expect a functional chart to convey something or answer a question.

So, based on the author’s analysis, if I were to work out which books are read by “intellectuals”, the top 2 that catch my eye are One Hundred Years of Solitude and Lolita (really?!!).

Beautiful- The chart is very unwieldy and long, with font sizes that do not appeal to my eyes. Also, the title, “books that make you dumb”, is very misleading. It is just a catchy title and does not convey anything.

However, two commendable things are the choice of colors (which is soothing) and the fact that the author has color coded the books by genre based on data from LibraryThing.com.

Insightful- While the idea of relating books to intellectual ability is not new to the audience, how this plays out with college freshmen is! Their taste is clearly different from that of the broader group.

Enlightening- Calls for change? The above chart just describes the situation and does not include any call for action per se.

 

What would have made this visualization more rewarding?

The analysis behind this visualization has a lot of depth and there is much that can be said. So, I decided to re-create this visualization using the same underlying data to specifically answer some questions that I had.

(I used Beautiful Soup to fetch the data from the page and Tableau for the visualization.)

What are the most common genres that students of this age group like and endorse?

https://drive.google.com/open?id=0B0buBv_pWnS4YUV2SG1SdWQyX2c

Which genres have the highest raw SAT  associated with them?

https://drive.google.com/open?id=0B0buBv_pWnS4WmpOZGgyYl95UTg

Which genre contributes the most to the top 100 books ranked by SAT score?

https://drive.google.com/open?id=0B0buBv_pWnS4bGFwR2hwN19hdTA

Last but not least, which books are most endorsed by students?

https://drive.google.com/open?id=0B0buBv_pWnS4TGl2ZmYwSEZNOUU

Looks like Harry Potter closely followed by The Bible make the top 2!

I strongly believe in the power of focused dashboards and visualizations aimed at answering questions, rather than exploratory dashboards where the end user is left to his own imagination. After all, a visualization’s main goal is to help people understand what the data is telling them!

Finally, I created a metric that combines the number of schools that endorse the book (popularity) and the SAT score (the proxy metric for intellectual ability) to recommend the top 10 books in the dashboard below, with a call to action.

https://drive.google.com/open?id=0B0buBv_pWnS4d29vaWhXZHpyRkE
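One way such a combined metric could be built is sketched below. This is not necessarily the formula behind the dashboard: the min-max scaling, the equal weighting, the function names and the four placeholder books with their numbers are all assumptions made for illustration.

```python
def min_max(values):
    """Rescale a list of numbers to the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [0.0 if hi == lo else (v - lo) / (hi - lo) for v in values]


def combined_scores(books, weight_popularity=0.5):
    """Blend popularity and SAT score into one value per book.

    books maps a title to (number of endorsing schools, average SAT).
    weight_popularity is an assumed tuning knob, not a value from the post.
    """
    names = list(books)
    popularity = min_max([books[n][0] for n in names])
    sat = min_max([books[n][1] for n in names])
    w = weight_popularity
    return {n: w * p + (1 - w) * s for n, p, s in zip(names, popularity, sat)}


# Placeholder data, invented purely for illustration.
books = {
    "Book A": (40, 1050),
    "Book B": (10, 1250),
    "Book C": (25, 1150),
    "Book D": (5, 950),
}

scores = combined_scores(books)
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

Rescaling both inputs first matters because school counts and SAT scores live on very different scales; without it, whichever variable has the larger raw numbers would dominate the blend.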

 

 

The cost of healthy eating – A comparison

This is a graph from the New York Times, published in May 2009 to substantiate the claim that healthy food options were growing more expensive while junk food options were growing cheaper. It uses the change in the price of items relative to overall inflation as the measure to substantiate this claim.

http://www.nytimes.com/imagepages/2009/05/20/business/20leonhardt.graf01.ready.html

As we all know, a visualization’s beauty is in the eye of the consumer!

The following are some observations from the end user’s/consumer’s perspective.

Who is the end user of this visualization and what is the intent?

The end user of this visualization is the newspaper reader, to whom the author is trying to convey a trend in the food industry. The author uses the consumer price index as a proxy for cost and compares it with overall inflation. The author does a good job of conveying that healthy foods are growing more expensive more rapidly (steeper slope) than the unhealthy options, which are growing more expensive at a “less rapid” pace (shallower slope).

However, please do not let the visualization fool you into believing that beer is growing cheaper 😛 Even if the price of beer rises at 0.85 times the rate of inflation, it is still increasing, not falling!

The snippet at the top also states that the cost of unhealthy food has fallen in the last few years, which can be misleading.
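The beer point can be checked with a little arithmetic. The sketch below assumes the reading used above (the item’s price grows at a fixed fraction of the overall inflation rate); the 3% annual inflation and the 30-year horizon are hypothetical values, not figures from the chart.

```python
def price_paths(years, inflation, relative_rate):
    """Return (nominal, inflation-adjusted) price index, both starting at 100.

    relative_rate is the item's price growth as a fraction of overall
    inflation, e.g. 0.85 for the beer example.
    """
    item_rate = relative_rate * inflation
    nominal = 100 * (1 + item_rate) ** years
    overall = 100 * (1 + inflation) ** years
    real = 100 * nominal / overall  # price relative to the general price level
    return nominal, real


# Hypothetical inputs: 3% yearly inflation over 30 years.
nominal, real = price_paths(years=30, inflation=0.03, relative_rate=0.85)
print(f"shelf price index: {nominal:.1f}")
print(f"relative to inflation: {real:.1f}")
```

The shelf price index ends above 100 while the inflation-adjusted index ends below 100, so a line that slopes downward on a relative-to-inflation chart is fully compatible with a product that costs more in the shop every year.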

How does he do it?

The author uses trend lines to show the upward movement in the cost of healthy food options and the downward movement of the unhealthy options.

The author’s choice of graph to describe the year-over-year growth is good. However, his choice of the point of comparison, overall inflation in goods, is a little hard to grasp unless the reader takes the time to understand the metric.

What makes a good metric?

Since the consumer of this information is anybody who reads the news, the metric would be easier to assimilate if it were kept simple. So, rather than presenting values relative to overall inflation, the author could have used the absolute increase in prices as the metric.

For the same reason, the consumer may tend to perceive the downward slope as a drop in price rather than a less steep increase in price.

What could have been done better in the visualization?

Show the trend rather than plotting every value: Rather than plotting the full line for each item, the author could have focused on the overall trend, i.e., whether the cost is moving up or down. This would have conveyed the same meaning and been simpler to understand.

Consistency in what you are representing: Comparing fresh fruits and vegetables with specific items in the “unhealthy” list provides a useful comparison, but it’s clearly not an apples-to-apples comparison. Ideally, when comparing two objects, we need to make sure they are of the same kind. In this case, the comparison is between a basket of objects (fresh fruits and vegetables) and single objects like butter.

Also, the fresh fruits option is labelled with a percentage while the remaining items are not, which is inconsistent.

Choice of colors: The choice of colors goes a long way in creating certain associations in a person’s mind. Colors like green and shades of yellow are usually associated with positive things, while colors like bright shades of red are associated with caution and danger. The choice of colors used in the graph is consistent with what the author tries to prove with his argument.

Conclusion:

While the graph does a good job of making its point, on closer inspection it does not conclusively prove that healthy foods are getting more expensive and unhealthy foods are getting cheaper. The author could have done a better job simply by opting for a simpler metric and comparing similar objects.

 

 

 

 

 

 

 

 

 

Were you under the wrong impression as a kid?

One of the few things I remember from my Geography class is my teacher showing me different countries on a world map. But as I look back now, I feel I had a very distorted image of the world as a kid, and the reason is the misrepresentation of the world on a flat map.
The Mercator map projection, which we all commonly use and are aware of, converts circles of latitude and lines of longitude into straight lines perpendicular to each other. This distorts the apparent shape and size of countries, especially as you move away from the equator towards either pole.
Imagine a tube around the world:

pic1
To draw a flat map, countries are projected onto this tube. The poles, which would otherwise not touch the tube, are deliberately sketched onto it. Unrolling the tube gives a projection of the world onto the x-y plane that heavily distorts the y-axis.
When a kid sees this map, he tends to imagine the world and the size of each continent relative to the others as shown on the map, and so forms a flawed mental picture.

pic3
For instance, Greenland (which lies near the North Pole) appears approximately the same size as Africa, but in fact Greenland is only about as big as the Congo (which is just a small part of Africa). Moving Greenland to the equator (as shown in the diagram below) reveals that Africa is almost 14 times larger than Greenland.
image4
It was only after research and travelling that I learned the real shapes and sizes of the various continents, but there may be many students who leave school with such a wrong perception caused by poor visualization.
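The stretching can be put into numbers. The sketch below uses the standard spherical Mercator formulas: the vertical map coordinate is y = R·ln(tan(π/4 + φ/2)) for latitude φ, and areas are inflated by a factor of 1/cos²(φ). The function names and the sample latitudes are my own choices for illustration; 72°N is used as a latitude that runs through Greenland.

```python
import math


def mercator_y(lat_deg, radius=1.0):
    """Vertical map coordinate of a latitude on a spherical Mercator map."""
    phi = math.radians(lat_deg)
    return radius * math.log(math.tan(math.pi / 4 + phi / 2))


def area_inflation(lat_deg):
    """How many times larger an area appears at this latitude than at the equator."""
    return 1.0 / math.cos(math.radians(lat_deg)) ** 2


for lat in (0, 30, 60, 72):
    print(f"{lat:>2} deg: y = {mercator_y(lat):.3f}, "
          f"area inflated x{area_inflation(lat):.1f}")
```

The inflation factor is 1 at the equator and grows without bound towards the poles, which is why a high-latitude landmass like Greenland looks so much bigger than it really is next to equatorial Africa.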

References:
Google US. 2017. world map – Google Search. [ONLINE] Available at: www.google.com [Accessed 23 January 2017].
The Economist. 2017. Daily chart: Misleading maps and problematic projections | The Economist. [ONLINE] Available at: http://www.economist.com/blogs/graphicdetail/2016/12/daily-chart-1. [Accessed 23 January 2017].

 

Nokia – Microsoft acquisition. Worth it?

In the early 2000s, Nokia held the leading position in the mobile phone market. However, within a decade, Nokia was uprooted from the mainstream market for numerous reasons. In 2013, Microsoft acquired Nokia for $7.9 Billion to pair the Windows operating system with Nokia’s mobile hardware, and it was hoped that Nokia would regain its market position. However, as the visualization below shows, Microsoft spent billions on a sinking ship.

Microsoft-Nokia

The figure depicts Nokia mobile phone sales from 2010. When Microsoft announced the strategic partnership in 2011, global sales of Nokia mobile phones stood at 105M. By the time the partnership materialized in 2013, they had fallen drastically to 61M. Further, towards the completion of the acquisition in 2014, Nokia’s ship had already sunk, with global sales at a meagre 40M. This is a sales dip of 62% in just 3 years. Following this, Microsoft announced acquisition impairment charges of $7.6B.

Microsoft is the leader in desktop operating systems, but it completely failed to integrate its software with Nokia’s hardware. Hence, even as the market leader in desktop OS, Microsoft could not save Nokia’s sinking ship.

Source: https://www.statista.com/chart/4848/nokia-and-microsoft-mobile-phone-sales/
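The 62% figure follows directly from the sales numbers quoted above. A quick check in Python (the function name is just for illustration):

```python
def percent_drop(start, end):
    """Percentage decline from start to end."""
    return (start - end) / start * 100


# Nokia mobile phone sales in millions of units, as quoted from the chart.
sales_millions = {2011: 105, 2013: 61, 2014: 40}

drop_2011_2014 = percent_drop(sales_millions[2011], sales_millions[2014])
drop_2011_2013 = percent_drop(sales_millions[2011], sales_millions[2013])

print(f"2011 to 2014: {drop_2011_2014:.0f}% drop")
print(f"2011 to 2013: {drop_2011_2013:.0f}% drop")
```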

 

Twitter Sentiments Analysis

Today, businesses want to know what buyers say about their brand and how they feel about their products, and the best place to find out is social media, where users can express their opinions freely. Whether it is the launch of a new product or feedback on an existing one, user opinions or tweets are used to gather critical feedback for brand management and customer satisfaction.

Blog1

This dashboard is built on Twitter data about US airlines. The first graph shows which US airlines received the best to worst sentiment, as well as the most tweets. In this case, United Airlines has been tweeted about most positively. The second graph shows the distribution of the top reasons for negative sentiment across each airline. The third graph shows the most common reasons for negative sentiment, which are Customer Service and Late Flight. The line graph shows sentiment over time. The pie chart shows the distribution of tweets by sentiment. The word cloud of Most Common Negative Words further supports the reasons for the high frequency of negative sentiment in the tweets.

Hence, based on this dashboard, one can understand common reasons for customer dissatisfaction and take corrective measures.
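The tallies behind charts like these are simple counts. The sketch below shows the idea with Python’s standard library; the rows are invented placeholders (hypothetical airline names and labels), not the dataset behind the dashboard, and the tuple layout is an assumption.

```python
from collections import Counter

# Hypothetical rows: (airline, sentiment, reason for negative sentiment or None)
tweets = [
    ("Airline A", "negative", "Customer Service"),
    ("Airline A", "positive", None),
    ("Airline B", "negative", "Late Flight"),
    ("Airline B", "negative", "Customer Service"),
    ("Airline B", "neutral", None),
]

# Tweets per (airline, sentiment) pair: feeds a sentiment-by-airline chart.
sentiment_by_airline = Counter((airline, sentiment) for airline, sentiment, _ in tweets)

# Reasons attached to negative tweets: feeds a top-reasons chart.
negative_reasons = Counter(
    reason for _, sentiment, reason in tweets if sentiment == "negative"
)

top_reason, top_count = negative_reasons.most_common(1)[0]
print(sentiment_by_airline)
print(f"most common complaint: {top_reason} ({top_count} tweets)")
```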

Reference: http://www.heidislojewski.com/blog/