Word Cloud as a tool for visualization

Source:  https://www.census.gov/dataviz/visualizations/007/

Description: This word cloud shows every city that has ranked among the 20 most populous cities in the country at any census since 1790. The size of each city name reflects the number of times that city has been ranked in the top 20.
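The statistic behind the font sizes can be sketched in a few lines of Python. This is only an illustration of the counting idea, not the Census Bureau's method: the `top_lists` data and the `appearance_counts` helper are hypothetical, and the lists are shortened placeholders rather than real top-20 rankings.

```python
from collections import Counter

# Hypothetical placeholder rankings: census year -> cities listed that year.
# Real input would hold 20 cities for every census year from 1790 to 2010.
top_lists = {
    1790: ["New York", "Boston", "Baltimore"],
    1900: ["New York", "Boston", "Baltimore"],
    2010: ["New York", "Los Angeles"],
}

def appearance_counts(lists_by_year):
    """Count in how many census years each city appears."""
    counts = Counter()
    for cities in lists_by_year.values():
        counts.update(set(cities))  # set() guards against duplicates within a year
    return counts

counts = appearance_counts(top_lists)
print(counts.most_common())
```

A city that entered the list only recently gets a small count, and therefore a small font, no matter how large it is today.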

What are the pros of using word cloud as a tool for visualization:

1. A word cloud reveals the essentials: the key words stand out, so a reader can grasp them at a glance.

2. Word clouds are quick and easy to make compared with other forms of visualization.

3. Word clouds are very engaging because their visual representation of data tends to make an impact and generate interest among their audience.

What are the cons of using word cloud as a tool for visualization:

1. One of the major drawbacks of a word cloud is that its display emphasizes the frequency of words, not necessarily their importance.

2. A word cloud generally distinguishes words by size, which encodes their frequency of occurrence, but design details such as the white space between characters or the use of a bold font can make a word appear more or less important relative to others in the cloud. This can mislead the viewer.

Let’s move on to the critical analysis of this visualization:

1. Does the visualization fulfill its purpose? – Since the goal of the visualization is to show the number of times these cities ranked among the 20 most populous cities in the US since 1790, using its primary feature, the font size, to convey that information makes sense. However, there should also be a way to convey another important piece of information about the cities, their current population, to avoid the confusion described in the rebuttal below. At the end, I describe a way to address this.

2. Audience: I guess it is meant for the general public in the USA. If it is meant to serve the US government or to support a survey such as the CPS, it defeats its purpose because of the weaknesses mentioned below.

3. Claim: This visualization claims to present the top 20 most populous cities in the United States from 1790 to 2010.

4. Rebuttal includes the following points:

Misleading font sizes: The relative font sizes of different cities are counter-intuitive, as the following examples show:

1. A city with a smaller font size is more populous than cities with larger font sizes: Even though Los Angeles is the second most populous city in the US, its font size is much smaller than that of cities like Baltimore and Boston, whose populations are about one-sixth of that of Los Angeles. The reason is that LA’s population grew recently, while the data presented is historic (1790-2010) and counts appearances in the top 20 rather than residents. Hence, the plot does not take into account the current (or recent) population of these cities. This can make viewers think that Baltimore and Boston are much more populous than LA even though the opposite is true. Hence, it cannot convince its viewers.

2. Cities with similar font sizes differ enormously in population: Likewise, the similar sizes of New York, Baltimore and Boston give the impression that the populations of these cities are comparable. However, New York is approximately twelve times more populous than either of the other two. Hence, the font sizes are completely unrelated to the current populations of these cities.

Hence, I would not categorize this visualization as truthful, because it is deceptive. Nor can it be counted as insightful or enlightening: because it ignores the details mentioned above, it neither provides new information to the audience nor can it initiate any change.

What could be done better?

The visualization has two main features:

  1. Font size of the cities.
  2. Font color of the cities.

As can be seen from the visualization, both features are used to convey the primary statistic, which is the number of times each city ranked among the 20 most populous cities in the US since 1790. Cities with a higher value of the primary statistic have a bigger font and a darker shade of green, and cities with a lower value have a smaller font and a lighter shade. For example, Los Angeles is smaller and lighter than Baltimore.

My suggestion:

Instead of using both features (font size and font color) to convey the same information, i.e., the primary statistic, use one of them to convey the recent population of the cities. I would use font color to convey recent population, in a visualization somewhat like Figure 1. In the figure, font colors are varying shades of red. The darker the font color, the more populous the city. The larger the font size, the greater the primary statistic for that city. Some quick observations from the figure:

1. Los Angeles has the lowest primary statistic; hence, it has the smallest font size.

2. Baltimore has the least population; hence, it has the lightest shade of red.

3. The populations of New York and Los Angeles are closer to each other than those of any other city pair; hence, their font colors are very similar. But since the two cities have the largest difference in their primary statistic, their font sizes differ more than those of any other city pair.

The advantage of this scheme is that it ties the primary statistic of each city to its recent population. This keeps in check the confusion that arises when font sizes are unrelated to current or recent population. On the one hand, the viewer sees the contrast in font size between New York and Los Angeles and infers a large difference in the primary statistic. At the same time, the viewer notes the striking similarity between the font colors of the two cities, which should prompt them to pause (and read the description) and infer that the two populations are close. If implemented, these changes can make the visualization more convincing, truthful, enlightening, and insightful.
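The two-channel encoding suggested above can be sketched as follows. This is a minimal illustration under stated assumptions: the helper names (`scale`, `red_shade`, `encode`), the font-size range of 12 to 48 points, and the two end colors are all hypothetical choices, and the numbers in `cities` are placeholders picked only to mirror the ordering described for Figure 1, not census figures.

```python
LIGHT_RED = (255, 204, 204)  # assumed colour for the least populous city
DARK_RED = (139, 0, 0)       # assumed colour for the most populous city

def scale(value, lo, hi, out_lo, out_hi):
    """Linearly map value from [lo, hi] onto [out_lo, out_hi]."""
    if hi == lo:
        return out_lo
    return out_lo + (value - lo) * (out_hi - out_lo) / (hi - lo)

def red_shade(population, pop_min, pop_max):
    """Return a hex colour between LIGHT_RED and DARK_RED for a population."""
    t = scale(population, pop_min, pop_max, 0.0, 1.0)
    channels = [round(a + t * (b - a)) for a, b in zip(LIGHT_RED, DARK_RED)]
    return "#{:02x}{:02x}{:02x}".format(*channels)

def encode(cities):
    """Font size encodes times in the top 20; colour encodes recent population."""
    counts = [c["times_in_top_20"] for c in cities.values()]
    pops = [c["population"] for c in cities.values()]
    return {
        name: {
            "font_size": round(scale(c["times_in_top_20"],
                                     min(counts), max(counts), 12, 48)),
            "color": red_shade(c["population"], min(pops), max(pops)),
        }
        for name, c in cities.items()
    }

# Placeholder values in arbitrary units, for illustration only.
cities = {
    "New York": {"times_in_top_20": 23, "population": 100},
    "Baltimore": {"times_in_top_20": 20, "population": 10},
    "Los Angeles": {"times_in_top_20": 10, "population": 90},
}

styles = encode(cities)
for name, style in styles.items():
    print(name, style)
```

Because size and color now come from different variables, a small but dark name (Los Angeles in this placeholder data) signals a large city that entered the top 20 late.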

Conclusion:

Simple is not always better: When I noticed the weaknesses of this word cloud, I realized that simple is not always better. Word clouds are easy to make and easy to interpret, but mistakes such as using two features (size and color) to represent only one dimension may mislead the viewer.

 

 

 

 

Native vs foreign born Americans without a high school diploma.

Blog link: https://www.census.gov/dataviz/visualizations/035/

Introduction:

This visualization compares the percentage of people without a high school education among native-born and foreign-born Americans, broken down by race group and by region of the USA.

Evaluating Bar-graphs as a tool for visualization:

  1. Bar graphs are an excellent tool for comparing segments. One thing I liked about this chart is its simplicity. It compares four dimensions (nativity, gender, region and ethnicity) without making the visualization cluttered and complex.
  2. Bar graphs are generally accessible to a wide audience, and they offer a visual check on the accuracy and reasonableness of calculations. However, this visualization has certain weaknesses that make it difficult for its audience to interpret. No information is provided about how the population percentages are calculated.
  3. Using different colors makes trends easier to highlight and interpret. However, purple and blue each also appear in lighter shades, and a person with color blindness could easily confuse the categories.
  4. It is difficult to see the differences because too many graphs are plotted. Comparing two groups that sit far apart on the page is especially hard.
  5. One of the most basic requirements, the timeframe for which the bar graphs are plotted, is missing. I am unable to figure out whether the graph is a snapshot of these statistics at a particular time or an aggregate or average over a period of time.

Moving on to the critical analysis...

  1. Goal: This visualization clearly shows that the foreign-born population, irrespective of gender and ethnicity, lags behind native-born Americans by a substantial margin in attaining high school education. However, various sources [1] suggest that the percentage of the foreign-born population attaining college-level education is higher than the corresponding percentage among native-born Americans, and that immigrants in the USA are considered to play a vital role in its economic success. It is enlightening for viewers to learn that, despite this, foreign-born Americans lag behind in high school education.
  2. Audience: I am unable to identify the target audience for this visualization. It may not be sufficient to help any category of audience (native or foreign born) because it fails to expose key assumptions, causes, and the impact on the lives of people who have or have not attended high school. For example, the high percentage of people without a high school diploma may be due to poor living standards or other social conditions specific to the region or race. The visualization does not provide any insight into these details. Therefore, I would not categorize it as insightful.
  3. Claim: This visualization does not present any claim about native-born versus foreign-born Americans, their work-life success, or their earnings. As there is no explicit claim, there cannot be any warrant backing a claim.
  4. Rebuttal: As there is no explicit claim or argument, one cannot offer counter-arguments. The only evident information in the graphs is that the overall share of high school graduates among native-born Americans is much higher than among the foreign born. If the designer wants to conclude from this that natives are more successful, then I can offer counter-arguments. First, the data source for this visualization gives no information about population sizes. We do not know the proportions of native-born Americans and immigrants at the time this graph was plotted. Second, the number of immigrants in the USA is increasing very rapidly compared with the population of native-born Americans. So, I cannot categorize this as either insightful or convincing.
  5. Key performance indicator: The chart considers four dimensions: nativity, gender, region, and ethnicity. But it does not provide any quantifiable measure that can be used to gauge an indicator’s performance over time, for example the performance of an ethnic group such as Hispanics or Blacks over time. Nor does it show whether a lack of education is concentrated in a particular ethnic group or region.
  6. No information about any particular location in a region: The visualization provides statistics at the level of geographical regions (South, Midwest, Northeast and West), which is a very high level. To give better insight, the numbers should have been provided at the state or county level. The reason is that the population of certain races may be concentrated in a few states or counties of a region, and the high percentage of people without a high school diploma among those races may be due to fewer schools in those areas or to other factors. Statistics at a more geographically granular level (i.e., states or counties) would unearth such details and make them less prone to being lost when numbers are added or averaged across a region.

What could have been done better:

  1. Firstly, using stacked bar graphs would have reduced the number of graphs. The most striking difference is between native-born and foreign-born Americans. So, the nativity dimension can be combined using stacked bar graphs to reduce the number of bar graphs from sixteen to eight as illustrated here. This would ease the comparison of percentages across native-born and foreign-born Americans, as both statistics are now on the same bar. Fewer graphs would help users make sense of the information.
  2. A bubble chart could have been used to clearly show the percentage of males and females in both categories. This would help to identify whether any particular region or ethnicity is most prone to low educational attainment.
  3. Use of multiple sources of data: To support critical comparisons, data should be drawn from multiple sources, which increases the credibility of the visualization.
  4. Aesthetics could have been improved by using clearly distinct colors rather than similar shades.
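The reshaping behind the stacked-bar suggestion can be sketched as below. It assumes, purely for illustration, that the sixteen bars come from 4 regions x 2 genders x 2 nativity groups; the `rows` layout, the `stack_by_nativity` helper, and the percentages (10.0 and 30.0) are hypothetical placeholders, not values from the Census chart.

```python
from collections import defaultdict

REGIONS = ["Northeast", "Midwest", "South", "West"]

# Placeholder rows: (region, gender, nativity, percent without a diploma).
# The percentages are invented for illustration only.
rows = [
    (region, gender, nativity, 10.0 if nativity == "Native" else 30.0)
    for region in REGIONS
    for gender in ("Male", "Female")
    for nativity in ("Native", "Foreign born")
]

def stack_by_nativity(records):
    """Group rows so each (region, gender) bar holds one segment per nativity."""
    bars = defaultdict(dict)
    for region, gender, nativity, percent in records:
        bars[(region, gender)][nativity] = percent
    return dict(bars)

bars = stack_by_nativity(rows)
print(len(rows), "separate bars become", len(bars), "stacked bars")
```

One caveat with this design: the two segments have different denominators, so the total height of a stacked bar has no meaning of its own; only the segments should be read.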

You can view some more of my redesigns here:

https://us-east-1.online.tableau.com/#/site/magarwalscuedu/workbooks/57227/views

 

References:

  1. http://www.breitbart.com/education/2016/03/31/census-foreign-born-adults-less-likely-high-school-degree-native-born-likely-advanced-degree/

 

 

Visualization showcasing death rates from air pollution.

Dashboard – https://ourworldindata.org/

Description:

This area chart presents the death rates across the world caused by air pollution from three sources: indoor solid fuels, particulate matter, and ozone. The death rates shown are per hundred thousand people, from 1990 to 2015 in steps of five years.

What I like about this dashboard:

  1. An area chart is effective for visualizing the magnitudes of a connected series, because the space between the line and the x-axis is filled. So, a person can observe the change over time effectively. This is harder to see in other chart types such as line graphs.
  2. This dashboard presents both the absolute and the relative trend of the death rate, either for the world or for any country.
  3. It is interactive and includes a drop-down menu of countries. One can view the pattern for the entire world, or for any single country by selecting it from the drop-down menu.
  4. Another good feature is that one can view the death rate for any individual source of air pollution or for a combination of two or three, for example the death rate from ozone alone, or from particulate matter and solid fuels together.
  5. It also shows the death rate in absolute terms as well as relative to other regions.

Cons of visualization tool used:

  1. Data in one segment is hidden behind the data in another segment. When I viewed the death rate from ozone alone, the green area was thicker than it appeared when all three areas were enabled. This is undesirable as it does not convey the actual trend.
  2. Generally, to get the value of a point on a curve, we look at its y-coordinate. In this case, however, we need to subtract the lower y-coordinate of an area from its upper y-coordinate. This makes it difficult to judge the relative values of the three areas at first glance.
  3. For the absolute trend, as one moves from left to right in the chart, the y-coordinate of the green area (deaths from ozone) decreases, which gives the impression that the deaths from ozone decreased from 1990 to 2015. However, the deaths from ozone hardly change over the years, as reflected by the almost constant thickness of the green area. This is highly confusing. The confusion also arises because the green area is very thin, making it look somewhat like a line. Hence, a change in y-coordinate gives the impression of a change in magnitude.
  4. The shape of the entire chart (the group of three areas) depends on the order in which the three areas (green, red, blue) are stacked vertically. Hence, a change in order will significantly alter the shape of the chart. For example, if the green area, which is almost constant in thickness, is kept at the bottom, the entire chart will look more stable. This is undesirable because visualizations of the same data should look similar.
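The arithmetic behind points 2 and 3 can be sketched as follows. The helper names (`layer_boundaries`, `thickness`) and the death-rate numbers are hypothetical placeholders chosen for illustration, not values from the dashboard; the sketch assumes ozone is the top layer of the stack.

```python
from itertools import accumulate

def layer_boundaries(values_bottom_to_top):
    """Return (lower, upper) y-coordinates of each stacked layer at one x position."""
    uppers = list(accumulate(values_bottom_to_top))
    lowers = [0] + uppers[:-1]
    return list(zip(lowers, uppers))

def thickness(boundary):
    """A layer's actual value is its upper edge minus its lower edge."""
    lower, upper = boundary
    return upper - lower

# Placeholder rates per 100,000, ordered bottom to top:
# [indoor solid fuels, particulate matter, ozone]. Invented numbers.
early_year = [60, 50, 5]
late_year = [40, 45, 5]

ozone_early = layer_boundaries(early_year)[2]
ozone_late = layer_boundaries(late_year)[2]

print("ozone upper edge:", ozone_early[1], "->", ozone_late[1])
print("ozone thickness: ", thickness(ozone_early), "->", thickness(ozone_late))
```

In this placeholder data the ozone layer's upper edge drops while its thickness is unchanged, which is exactly the false impression of decline described in point 3. Reordering the input so that ozone comes first would pin its lower edge to zero and remove the effect.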

Let’s move on to the critical analysis:

  1. Does this visualization carry any goal, does it have any purpose? I believe that the visualization severely lacks a purpose and its goals are quite unclear. Hence, I would not categorize it as enlightening.
  2. Considering the domain, two things came to my mind: the audience and its needs.

Audience – I am unable to identify the target audience for this visualization. If it is intended for worldwide social or environmental agencies, I do not think the information provided is sufficient to fulfill their needs or to help them decrease the level of air pollution from any of the sources.

Let’s take an example:

Knowing the trend of the death rate from particulate matter over the years serves no purpose unless further information is provided about the types of particulate matter causing the deaths, such as whether they are man-made, natural, or both, and their respective shares of the deaths. There can be various types of particulates, such as those resulting from dust storms or volcanic eruptions, or chemicals such as oxides and nitric acid. Similarly, nearly half of the world’s population still relies on burning solid fuels such as wood, animal dung, crop residue and coal for day-to-day household needs. Therefore, to get a better picture and to identify the root causes of air pollution, death trends for these sub-types should have been included. Hence, I feel that the visualization is not insightful.

  1. Claim: This visualization does not make any claim, either about a particular air pollution source or about a region most adversely affected by a kind of air pollution. As there is no claim, there is no warrant that provides reasoning behind an argument.
  2. Rebuttal: Viewers cannot offer any counter-argument, as no arguments are presented in the visualization. If the designer thinks that this dashboard is sufficient for health or environmental organizations to work on death rates, my rebuttal would be that it is not, as is evident from the points listed in this blog post.
  3. The data is given in five-year steps, so one cannot get actual information about the intermediate years. Subjects like death rates require continuous data to analyze the situation across years.
  4. The authors have completely missed the connection between the sources of air pollution and death. That connection is disease. Air pollution leads to a person’s death through a disease; people do not die simply from inhaling harmful particles. Air pollution from any of the listed sources can result in lung failure, heart problems, etc. Hence, I do not find the numbers in the visualization “convincing”.

What could have been done better:

  1. Use of multiple sources of data: Death rate is a very sensitive subject, so a visualization designed from only one data source is less effective than one designed from multiple data sources that include root causes, effects and continuous information from 1990 to 2015. Continuous data would also help to make predictions for the coming years, which is not possible currently.
  2. This visualization does not give any comparative analysis of the effect of air pollution across regions. Bar graphs could have been used for this purpose.

Below are the links showcasing some similar visualizations in air pollution. I would not say that these visualizations are a perfect substitute or they address every weakness raised above, but it appears that they carry a purpose and can be helpful for fulfilling the goal of their audiences.

Redesign:

As this visualization does not carry any specific goal and any specific actions to meet that goal, it cannot be categorized in the category of visual confirmation. It can fit in a visual exploration quadrant though. I came upon some useful visualizations from the data provided in this chart.

Comparing similar visualizations:

  1. http://www.scoop.it/t/classroom-geography/p/4018472031/2014/03/27/infographic-deadly-air-pollution-where-and-how
  2. http://www.wri.org.cn/en/node/41165
  3. You can view my redesigned part here – https://docs.google.com/a/scu.edu/document/d/1X1XZyh1MgFW0B3VsvxKV1aehY2M391skNTudTDfAKg8/edit?usp=sharing
  4. Tableau public work: https://us-east-1.online.tableau.com/#/site/magarwalscuedu/workbooks/46729/views

 

 

Visualizing “Disasters” through the lens of interactive dashboards!

Mishita Agarwal

Dashboard: https://www.fema.gov/data-visualization-summary-disaster-declarations-and-grants

Introduction
This interactive dashboard presents visualizations of federally declared disasters in the United States since 1953. It also visualizes the disaster assistance and preparedness grants that the Federal Emergency Management Agency (FEMA) has released for these disasters since 2005. All the information is provided at the national as well as the state level.

What is most appealing...
Interactive dashboards are one of the best ways to present visualizations when a huge amount of data has to be shown across a wide span of regions.
The most appealing feature of this interactive dashboard is its user-friendly interface. It provides the right amount of information at every step without overcrowding the page, letting the viewer click for additional, more granular information. It enables the viewer to filter the information by category and sub-category with just one click.

The highly interactive nature of this dashboard makes it very easy to dig deep into the data and to observe the pattern of disasters in various states, which can become very complicated in static dashboards. Overall, this feature simplifies search to a great extent and makes for a very pleasant user experience.

The geographical map at the top is a very good way of presenting the states. One can select a state and view its disaster patterns. It also makes it easy to explore disasters by location: I could find the disaster patterns in the easternmost or northernmost state, which would be difficult if I had to select a state from a drop-down list.
The bar graph is an effective tool for visualizing the disaster types and the frequency of each type in a selected state. It makes the visualization even easier to use by providing information on declared disasters across the counties of a state.

As we select any state, the line graph of declared disasters across years is updated automatically. A line graph is a very effective medium for analyzing the temporal pattern of disasters.
Finally, when the viewer scrolls down, the grants provided in each disaster category are shown in a bar graph, through which one can easily analyze the amount spent in every category such as fire incidents, preparedness, etc.

But there is still a lot of room for improvement...
I think the data provided in the Excel sheets does not match the numbers in the visualization. For example, as per the Excel sheets, 2820 Fire incidents were reported nationally; however, the visualization shows only 989 Fire incidents. Similar differences exist for other categories as well.
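A check like the one described above can be sketched in Python. The `find_mismatches` helper and the dictionary layout are hypothetical; only the two Fire figures are taken from the text, and other categories would have to be filled in from the spreadsheet and the dashboard by hand.

```python
def find_mismatches(spreadsheet_counts, dashboard_counts):
    """Return {category: (spreadsheet, dashboard)} where the two sources disagree."""
    mismatches = {}
    for category in sorted(set(spreadsheet_counts) | set(dashboard_counts)):
        sheet = spreadsheet_counts.get(category)   # None if the category is missing
        shown = dashboard_counts.get(category)
        if sheet != shown:
            mismatches[category] = (sheet, shown)
    return mismatches

# The Fire figures are the ones quoted in the text above.
spreadsheet = {"Fire": 2820}
dashboard = {"Fire": 989}

result = find_mismatches(spreadsheet, dashboard)
for category, (sheet, shown) in result.items():
    print(f"{category}: spreadsheet={sheet}, dashboard={shown}")
```

A difference this large may also come from the two sources counting different things (for example, declarations versus affected counties), so a mismatch is a prompt to check definitions rather than proof of an error.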

Next, as state abbreviations are not shown on the map, a person who is not familiar with the map of the USA has to move the cursor around to find the desired state. Adding state abbreviations to the map would make the filtering process easier and quicker.

Also, I found that although each category of disaster is divided into sub-categories, which helps us learn more about the type of a disaster, the description of each sub-category is very complex and does not tell us the actual cause of the disaster. For example, the disaster “Fire” is further segmented into “Angel Fire”, “Eighty-two fire”, etc. It seems that these terms are tied to specific characteristics such as the place of the incident or the type of resource. But the visualization does not give a clear picture of the reasons behind a disaster; for example, fire incidents could be classified as “household fire”, “human-caused fire”, etc. Tying incidents to their causes would be very useful for working on the root causes of the incidents and preventing their occurrence in the future.

The year-wise chart becomes very cluttered when we keep clicking the “+” sign to go to a more granular level such as quarters and months. So, there should be a separate bar graph showing the month-wise pattern of disasters. This would help FEMA identify whether there is any relation between a type of disaster and a particular month. For example, a very high frequency of non-severe storms in a given month would alert the agencies to take immediate measures to prevent a major disaster.
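The month-wise aggregation proposed here can be sketched in Python. The `declarations` records and the `monthly_counts` helper are hypothetical and the dates are invented; real input would come from FEMA's declaration data.

```python
from collections import Counter
from datetime import date

# Hypothetical declaration records: (disaster type, declaration date).
declarations = [
    ("Severe Storm", date(2010, 4, 12)),
    ("Severe Storm", date(2011, 4, 30)),
    ("Fire", date(2011, 8, 3)),
    ("Severe Storm", date(2012, 5, 9)),
]

def monthly_counts(records, disaster_type):
    """Count declarations of one type per calendar month, pooled across all years."""
    counts = Counter(day.month for kind, day in records if kind == disaster_type)
    return [counts.get(month, 0) for month in range(1, 13)]  # index 0 = January

storms = monthly_counts(declarations, "Severe Storm")
print(storms)
```

The resulting list of twelve counts is exactly what a month-wise bar graph would plot, one bar per calendar month for the chosen disaster type.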

I also feel that the visualization could have been better if colors were used to depict certain properties. In general, colorful visualizations tend to be more attractive and effective than black-and-white ones. For example, the five types of grants could be shown in different colors on the bar graph that plots yearly total grants, with each bar containing five colored regions corresponding to the five grants and the length of each region proportional to the amount of the grant. That would help a viewer compare all five grants for a particular year on a single plot.

Though the disasters are listed from 1953, the grant data is available only since 2005. So, does this mean that FEMA, which was established around 34 years ago, started providing grants only in the last twelve years? This is important to know because there could have been several states that were highly vulnerable to disasters but did not receive a grant in any segment.

Conclusion:
Overall, this interactive dashboard is very useful and provides a complete story of the disasters and the grants. The icons used at the end effectively convey the message of supporting the community’s emergency management efforts. However, the weaknesses should be addressed to make it more effective, as this dashboard is on a government website and many people refer to the visualizations posted there.

“A disaster is a natural or man-made event that negatively affects life, property, livelihood or industry often resulting in permanent changes to human societies, ecosystems and environment.”

Screenshots to visualize some errors and plot of additional visualizations:

https://docs.google.com/a/scu.edu/document/d/1OoptzYIJp6VE1zDJtxiVI_TDjI6pBPl-j5Y3-UZKRNw/edit?usp=sharing