Google Scholar: Friend or Foe?

Homepage for Google Scholar

Scholars and other users have a vested interest in understanding the relative authority of the publications they write or cite as the basis of their research. And although the literature search, a common topic in library instruction and research seminars, can take place on a huge variety of discovery tools, researchers often rely on Google Scholar as a supporting or even central platform.

The massive popularity of Google Scholar is likely due to its simple interface, which carries the longtime prestige of Google's search engine; its enormous breadth, with a simple search yielding millions of results; its integration with other Google products such as Chrome and Google Books; and its citation metrics mechanism.

This last aspect of Google Scholar, which collects and reports data on the number of citations a given publication receives, represents the platform’s apparent ability to precisely calculate the research community’s interest in that publication. But, in the University Library’s work on the Illinois Experts (experts.illinois.edu) research and scholarship portal, we have encountered a number of circumstances in which Google Scholar has misrepresented U of I faculty members’ research.

Recent studies reveal that Google Scholar, despite its popularity and its massive reach, is not only often inaccurate in its reporting of citation metrics and title attribution, but also susceptible to deliberate manipulation. In 2010, Labbé described an experiment involving Ike Antkare (AKA "I can't care"), a fictitious researcher whose bibliography was manufactured from a mountain of self-referencing citations. After the purposely falsified publications went public, Google's bots crawled Antkare's 100 generated articles without distinguishing them from genuine research. As a result, Google Scholar reported Antkare as one of the most cited researchers in the world, with a higher h-index* than Einstein.
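
For readers who haven't met the h-index before, here is a minimal Python sketch (using made-up citation counts) of how the metric is computed; because it depends only on raw citation counts, a flood of manufactured self-citations can inflate it dramatically.

```python
def h_index(citations):
    """Return the h-index: the largest h such that the author has
    at least h papers with h or more citations each."""
    counts = sorted(citations, reverse=True)
    h = 0
    for rank, cites in enumerate(counts, start=1):
        if cites >= rank:
            h = rank
        else:
            break
    return h

# Made-up citation counts: five papers cited 10, 8, 5, 4, and 3 times
# give an h-index of 4 (four papers each cited at least four times).
print(h_index([10, 8, 5, 4, 3]))  # 4
```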

Ike Antkare “standing on the shoulders of giants” in Indiana University’s Scholarometer. Credit: Adapted from a screencap in Labbé (2010)

In 2014, Spanish researchers conducted a similar experiment: they created a fake scholar whose papers made hundreds of references to works written by the experimenters. After the papers were posted on a personal website, Google Scholar scraped the data, and the real researchers' profiles gained a combined 774 citations. In the hands of more nefarious users seeking to aggrandize their own careers or to alter scientific opinion, such practices could result in large-scale academic fraud.

For libraries, Google's kitchen-sink approach to data collection also results in confusing and inaccurate attributions. In our work supplementing the automated collection of publication data for faculty profiles on Illinois Experts, using CVs, publishers' sites, journal sites, databases, and Google Scholar, we frequently encounter researchers' names and works mischaracterized by Google's clumsy aggregation mechanisms. For example, Google Scholar's bots often read a scholar's name somewhere within a work that the scholar hasn't written (perhaps they were mentioned in the acknowledgements or in a citation) and simply attribute the work to them as author.

When it comes to people's careers and the sway of scientific opinion, such snowballing mistakes can be a recipe for large-scale misdirection. Though research shows that, in general, Google Scholar currently represents highly cited work well, weaknesses persist. Blind trust in any dominant proprietary platform is unwise, and using Google Scholar requires particularly careful judgment.

Read more on Google Scholar’s quality and reliability:

Brown, Christopher C. 2017. “Google Scholar.” The Charleston Advisor 19 (2): 31–34. https://doi.org/10.5260/chara.19.2.31.

Halevi, Gali, Henk Moed, and Judit Bar-Ilan. 2017. “Suitability of Google Scholar as a Source of Scientific Information and as a Source of Data for Scientific Evaluation—Review of the Literature.” Journal of Informetrics 11 (3): 823–34. https://doi.org/10.1016/j.joi.2017.06.005.

Labbé, Cyril. 2016. “L’histoire d’Ike Antkare et de Ses Amis Fouille de Textes et Systèmes d’information Scientifique.” Document Numérique 19 (1): 9–37. https://doi.org/10.3166/dn.19.1.9-37.

Lopez-Cozar, Emilio Delgado, Nicolas Robinson-Garcia, and Daniel Torres-Salinas. 2012. “Manipulating Google Scholar Citations and Google Scholar Metrics: Simple, Easy and Tempting.” arXiv:1212.0638 [cs], December. http://arxiv.org/abs/1212.0638.

Walker, Lizzy A., and Michelle Armstrong. 2014. “‘I Cannot Tell What the Dickens His Name Is’: Name Disambiguation in Institutional Repositories.” Journal of Librarianship and Scholarly Communication 2 (2). https://doi.org/10.7710/2162-3309.1095.

*Read the library’s LibGuide on bibliometrics for an explanation of the h-index and other standard research metrics: https://guides.library.illinois.edu/c.php?g=621441&p=4328607

Wikidata and Wikidata Human Gender Indicators (WHGI)

Wikipedia is a central player in online knowledge production and sharing. Since its founding in 2001, Wikipedia has been committed to open access and open editing, which has made it the most popular reference work on the web. Though students are still warned away from using Wikipedia as a source in their scholarship, it presents well-researched information in an accessible and ostensibly democratic way.

Most people know Wikipedia from its high ranking in internet searches and tend to use it for its encyclopedic value. The Wikimedia Foundation, which runs Wikipedia, has several other projects that seek to provide free access to knowledge. Among them are Wikimedia Commons, which offers freely licensed images and other media; Wikiversity, which offers free educational materials; and Wikidata, which provides structured data to support the other wikis.

The Wikidata logo

Wikidata provides structured data to support Wikipedia and other Wikimedia Foundation projects

Wikidata is a great tool for studying how Wikipedia is structured and what information is available through the online encyclopedia. Since it is presented as structured data, it can be analyzed quantitatively more easily than Wikipedia articles. This has led to many projects that allow users to explore the data through visualizations, queries, and other means. Wikidata offers a page of Tools for working with its data more quickly and efficiently, as well as Data Access instructions for how to use data from the site.
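
To get a feel for what that structured data looks like in practice, here is a minimal, unofficial Python sketch that asks the public Wikidata API for one item's "sex or gender" (property P21) statement; the item and property IDs are standard Wikidata identifiers, and projects like the WHGI project described below aggregate exactly this kind of statement across millions of biography entries.

```python
# A minimal, unofficial sketch: fetch one item's "sex or gender" (P21)
# statement from the public Wikidata API. Q42 is Douglas Adams.
import requests

API = "https://www.wikidata.org/w/api.php"
params = {
    "action": "wbgetentities",
    "ids": "Q42",
    "props": "claims",
    "format": "json",
}
entity = requests.get(API, params=params).json()["entities"]["Q42"]

# The value of each P21 claim is itself an item ID (e.g. Q6581097, "male").
gender_ids = [
    claim["mainsnak"]["datavalue"]["value"]["id"]
    for claim in entity["claims"].get("P21", [])
]
print(gender_ids)  # e.g. ['Q6581097']
```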

The home page for the Wikidata Human Gender Indicators project

An example of a project born out of Wikidata is the Wikidata Human Gender Indicators (WHGI) project. The project uses metadata from Wikidata entries about people to analyze trends in gender disparity over time and across cultures. It presents the raw data for download, along with charts and an article about what the researchers discovered while compiling the data. Some of the visualizations are confusing (perhaps the creators could benefit from reading our Lightning Review of Data Visualization for Success), but they succeed in conveying important trends: a clear bias toward articles about men, as well as an interesting phenomenon surrounding celebrities. Some regions have a better ratio of biographies about women to biographies about men because many articles are written about actresses and female musicians, which reflects cultural differences surrounding fame and gender.

Of course, like many data sources, Wikidata is not perfect. The creators of the WHGI project frequently found that entries lacked complete metadata about gender or nationality, which limited their ability to analyze trends on Wikipedia in those areas. Since Wikipedia and Wikidata are open to editing by anyone and are governed by community-agreed practices, it is important for Wikipedians to include more metadata in their entries so that researchers can use that data in new and exciting ways.

An animated gif of the Wikipedia logo bouncing like a ball

New Uses for Old Technology at the Arctic World Archive

In this era of rapid technological change, it is easy to fall into the mindset that the “big new thing” is always an improvement on the technology that came before it. Certainly this is often true, and here in the Scholarly Commons we are always seeking innovative new tools to help you out with your research. However, every now and then it’s nice to just slow down and take the time to appreciate the strengths and benefits of older technology that has largely fallen out of use.

A photo of the Arctic

There is perhaps no better example of this than the Arctic World Archive, a facility on the Norwegian archipelago of Svalbard. Opened in 2017, the Arctic World Archive seeks to preserve the world’s most important cultural, political, and literary works in a way that will ensure that no manner of catastrophe, man-made or otherwise, could destroy them.

If this is all sounding familiar, that's because you've probably heard of the Arctic World Archive's older sibling, the Svalbard Global Seed Vault. The better-known Global Seed Vault is an archive of seeds from around the world, meant to ensure that humanity could continue growing crops and producing food in the event of a catastrophe that wipes out plant life.

Indeed, the two archives have a lot in common. The World Archive is housed deep within a mountain in an abandoned coal mine that once served as the location of the seed vault, and it was founded to be for cultural heritage what the seed vault is for crops. But the Arctic World Archive's innovative use of old technology makes it an impressive site in its own right.

A photo of the Arctic

Perhaps the coolest (pun intended) aspect of the Arctic World Archive is that it does not require electricity to operate. Its extreme northern location (it is near the northernmost town of at least 1,000 people in the world) means that the temperature inside the facility stays very cold year-round. As any archivist or rare book librarian who brings a fleece jacket to work in the summer will happily tell you, colder temperatures are ideal for preserving documents, and the ability to store items in a very cold climate without electricity makes the World Archive well suited for sustainable, long-term storage.

But that's not all: in a real blast from the past, all information stored in this facility is kept on microfilm. Now, I know what you're thinking: "It's the 21st century, grandpa! No one uses microfilm anymore!"

It's true that only a small minority of people use microfilm nowadays, but it nevertheless offers distinct advantages that newer digital media can't match. For example, microfilm is rated to last at least 500 years without corruption, whereas digital files may not last anywhere near that long. Beyond that, the film format means the archive is entirely independent of the internet and would outlast any major catastrophe that disrupts part or all of our society's web infrastructure.

A photo of a seal

The Archive is still growing, but it is already home to film versions of Edvard Munch's The Scream, Dante's Divine Comedy, and an assortment of government documents from countries including Norway, Brazil, and the United States.

As it continues to grow, its importance as a place of safekeeping for the world’s cultural heritage will hopefully serve as a reminder that sometimes, older technology has upsides that new tech just can’t compete with.

Exploring Data Visualization #9

In this monthly series, I share a combination of cool data visualizations, useful tools and resources, and other visualization miscellany. The field of data visualization is full of experts who publish insights in books and on blogs, and I’ll be using this series to introduce you to a few of them. You can find previous posts by looking at the Exploring Data Visualization tag.

Map of election districts colored red or blue based on predicted 2018 midterm election outcome

This map breaks down likely outcomes of the 2018 Midterm elections by district.

Seniors at Montgomery Blair High School in Silver Spring, Maryland created the ORACLE of Blair 2018 House Election Forecast, a website that hosts visualizations predicting outcomes for the 2018 midterm elections. In addition to breakdowns of likely voting outcomes by state and district, the students compiled descriptions of how each district has voted historically and which stances matter for the current candidates. How well do these predictions match up with the results from Tuesday?

A chart showing price changes for 15 items from 1998 to 2018

This chart shows price changes over the last 20 years. It gives the impression that these price changes are always steady, but that isn’t the case for all products.

Lisa Rost at Datawrapper created a chart, building on the work of Olivier Ballou, that shows the change in the price of goods using the Consumer Price Index. She provides detailed coverage of how her chart is put together, and she makes clear what is missing from both her chart and Ballou's depending on which products are chosen for the graph. This behind-the-scenes information offers useful advice for reading and designing charts that are clear and informative.

An image showing a scale of scientific visualizations from figurative on the left to abstract on the right.

There are a lot of ways to make scientific research accessible through data visualization.

Visualization isn’t just charts and graphs—it’s all manner of visual objects that contribute information to a piece. Jen Christiansen, the Senior Graphics Editor at Scientific American, knows this well, and her blog post “Visualizing Science: Illustration and Beyond” on Scientific American covers some key elements of what it takes to make engaging and clear scientific graphics and visualizations. She shares lessons learned at all levels of creating visualizations, as well as covering a few ways to visualize uncertainty and the unknown.

I hope you enjoyed this data visualization news! If you have any data visualization questions, please feel free to email the Scholarly Commons.

An Introduction to Google MyMaps

Geographic information systems (GIS) are a fantastic way to visualize spatial data. As any student of geography will happily explain, a well-designed map can tell compelling stories with data which could not be expressed through any other format. Unfortunately, traditional GIS programs such as ArcGIS and QGIS are incredibly inaccessible to people who aren’t willing or able to take a class on the software or at least dedicate significant time to self-guided learning.

Luckily, there’s a lower-key option for some simple geospatial visualizations that’s free to use for anybody with a Google account. Google MyMaps cannot do most of the things that ArcMap can, but it’s really good at the small number of things it does set out to do. Best of all, it’s easy!

How easy, you ask? Well, just about as easy as filling out a spreadsheet! In fact, that's exactly where you should start. After logging into your Google Drive account, open a new spreadsheet in Sheets. For a functioning end product you'll want at least two columns: one for the name of the place you are identifying on the map, and one for its location. Column order doesn't matter here; you'll get the chance later to tell MyMaps which column does what. Locations can be as specific or as broad as you'd like. For example, you could input a location like "Canada" or "India," or you could input "1408 W. Gregory Drive, Urbana, IL 61801." The catch is that each location is represented by a marker indicating a single point. So if you choose a specific address, like the one above, the marker will indicate the location of that address, but if you choose a country or a state, you will end up with a marker located somewhere over the center of that area.

So, let’s say you want to make a map showing the locations of all of the libraries on the University of Illinois’ campus. Your spreadsheet would look something like this:

Sample spreadsheet
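
If you'd rather generate that file with a script than type it by hand, a few lines of Python will do. The rows below are illustrative placeholders (only the Main Library address comes from the example above), and MyMaps will accept the resulting CSV just as readily as a Google Sheet.

```python
# Writes a two-column CSV that Google MyMaps can import.
# The rows are illustrative placeholders; swap in your own
# place names and locations.
import csv

rows = [
    ("Name", "Location"),
    ("Main Library", "1408 W. Gregory Drive, Urbana, IL 61801"),
    ("Grainger Engineering Library", "Urbana, IL"),
    ("Funk ACES Library", "Urbana, IL"),
]

with open("campus_libraries.csv", "w", newline="") as f:
    csv.writer(f).writerows(rows)
```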

Once you've finished compiling your spreadsheet, it's time to actually make your map. You can access the Google MyMaps page by going to www.google.com/mymaps. From here, simply select "Create a New Map" and you'll be taken to a page that looks suspiciously similar to Google Maps. In the top left corner, where you might be used to typing in directions to the nearest Starbucks, there's a window that allows you to name your map and import a spreadsheet. Click on "Import," and navigate through Google Drive to wherever you saved your spreadsheet.

When you are asked to "Choose columns to position your placemarks," select whatever column you used for your locations. Then select the other column when you're prompted to "Choose a column to title your markers." Voilà! You have a map. Mine looks like this:

Michael's GoogleMyMap

At this point you may be thinking to yourself, “that’s great, but how useful can a bunch of points on a map really be?” That’s a great question! This ultra-simple geospatial visualization may not seem like much. But it actually has a range of uses. For one, this type of visualization is excellent at giving viewers a sense of how geographically concentrated a certain type of place is. As an example, say you were wondering whether it’s true that most of the best universities in the U.S. are located in the Northeast. Google MyMaps can help with that!

Map of best universities in the United States

This map, made using the same instructions detailed above, is based on the U.S. News and World Report's 2019 Best Universities ranking. Based on the map, it does in fact appear that more of the nation's top 25 universities are located in the northeastern part of the country than anywhere else, while the West (with the notable exception of California) is wholly underrepresented.

This is only the beginning of what Google MyMaps can do: play around with the options and you’ll soon learn how to color-code the points on your map, add labels, and even totally change the appearance of the underlying base map. Check back in a few weeks for another tutorial on some more advanced things you can do with Google MyMaps!

Try it yourself!

Analyze and Visualize Your Humanities Data with Palladio

How do you make sense of hundreds of years of handwritten scholarly correspondence? Humanists at Stanford University had the same question, and developed the project Mapping the Republic of Letters to answer it. The project maps scholarly social networks in a time when exchanging ideas meant waiting months for a letter to arrive from across the Atlantic, not mere seconds for a tweet to show up in your feed. The tools used in this project inspired the Humanities + Design lab at Stanford University to create a set of free tools specifically designed for historical data, which can be multi-dimensional and not suitable for analysis with statistical software. Enter Palladio!

To start mapping connections in Palladio, you first need some structured, tabular data; a spreadsheet saved as a CSV file, with data that is categorized and sorted, is sufficient. Once you have your data, just upload it and get analyzing. Palladio likes data about two types of things: people and places. The sample data Palladio provides is information about influential people who visited or were otherwise connected with the itty bitty country of Monaco. Read on for some cool things you can do with historical data.

Mapping

Use the Map feature to mark coordinates and the connections between them. Using the sample data the HD Lab provided, I created the map below, which shows birthplaces and arrival points. Hovering over a connection shows you the direction of the move. You can change the base map to a standard style like satellite or terrain, or even to bare land masses with none of the human-created geography, like roads or place names.

Map of Mediterranean sea and surrounding lands of Europe, red lines across map show movement, all end in Monaco

One person in our dataset was born in Galicia, and later arrived in Monaco.

But what if you want to combine this new-fangled spatial analysis with something actually historic? You're in luck! Palladio allows you to use other maps as bases, provided the map has been georeferenced (assigned coordinates based on the locations represented in the image). The New York Public Library's Map Warper offers a collection of georeferenced maps. Now you can show movement on a map that's actually from the time period you're studying!

Same red lines across map as above, but image of map itself is a historical map

The same birthplace to arrival point data, but now with an older map!

Network Graphs

Perhaps the connections you want to see, like those between people, don't belong on a map at all. This is where the Graph feature comes in. Graph allows you to create network visualizations based on different facets of your data. In general, network graphs display relationships between entities, and they work best when all of your nodes (the dots) are the same type of information. They are especially useful for showing connections between people, but our sample data doesn't have that information. Instead, we can visualize our people's occupations by gender.

network graph shows connections between peoples' occupations and their gender

Most occupations have both males and females, but only males are Monegasque, Author, Gambler, or Journalist, and only females are Aristocracy or Spouse.
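
If you'd like to recreate this kind of bipartite view outside Palladio, a rough sketch with the networkx Python library (my own stand-in, not something Palladio itself uses) shows the idea: each occupation value and each gender value becomes a node, and an edge records that at least one person in the table links the two.

```python
# A rough stand-in for Palladio's Graph view: a bipartite network
# linking occupation values to gender values. The records are made up.
import networkx as nx

people = [
    {"occupation": "Author", "gender": "male"},
    {"occupation": "Journalist", "gender": "male"},
    {"occupation": "Aristocrat", "gender": "female"},
    {"occupation": "Aristocrat", "gender": "male"},
]

G = nx.Graph()
for person in people:
    # The edge means "someone with this occupation has this gender".
    G.add_edge(person["occupation"], person["gender"])

print(list(G.edges()))
# e.g. [('Author', 'male'), ('male', 'Journalist'), ('male', 'Aristocrat'), ('Aristocrat', 'female')]
```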

The network graph makes it especially visible that there are some slight inconsistencies in the data; at least one person has “Aristocracy” as an occupation, while others have “Aristocrat.” Cleaning and standardizing your data is key! That sounds like a job for…OpenRefine!
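
Here is a small sketch (using pandas as a stand-in for what OpenRefine does interactively) of the kind of standardization that catches the "Aristocracy"/"Aristocrat" mismatch before the data ever reaches Palladio.

```python
# A tiny example of the standardization step that OpenRefine handles
# interactively, sketched here with pandas. The column values are made up.
import pandas as pd

df = pd.DataFrame({"Occupation": ["Aristocrat", "Aristocracy", "Author", " aristocrat"]})

# Map variant spellings onto one canonical label; leave anything unmapped as-is.
canonical = {"aristocracy": "Aristocrat", "aristocrat": "Aristocrat", "author": "Author"}
cleaned = df["Occupation"].str.strip().str.lower().map(canonical)
df["Occupation"] = cleaned.fillna(df["Occupation"])

print(df["Occupation"].tolist())  # ['Aristocrat', 'Aristocrat', 'Author', 'Aristocrat']
```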

Timelines

All of the tools in Palladio have the same Timeline functionality. This basically allows you to filter the data used in your visualization by a date, whether that’s birthdate, date of death, publication date, or whatever timey wimey stuff you have in your dataset. Other types of data can be filtered using the Facet function, right next to the Timeline. Play around with filtering, and watch your visualization change.

Try Palladio today! If you need more direction, check out this step-by-step tutorial by Miriam Posner. The tutorial is a few years old and the interface has changed slightly, so don't panic if the buttons look different!

Did you create something cool in Palladio? Post a comment below, or tell us about it on Twitter!

OASIS: The Search Tool for the Open Educational Resource Desert

Guest Post by Kaylen Dwyer

Open Educational Resources (OER) are teaching, learning, and research resources that reside in the public domain or have been released under an open license so they are free to access, use, remix, and share again.

Source: The Review Project. For more information about OER, the University of Illinois’ guide is available online.

Last year, the Common Knowledge blog discussed the cost of OER to professors and institutions in grants, time, sabbatical funding, and more. Yet professors felt that the main barriers between OER and the classroom were not these hidden costs, but rather a lack of awareness, the difficulty of finding texts to use, and the monumental task of evaluating the texts and tools they did find.

The U.S. Public Interest Research Group's study, "Fixing the Broken Textbook Market," found that many students choose not to buy their textbooks because of the cost, despite worrying about their grades, and that they felt they would benefit from open resources. Even as textbook costs have skyrocketed and faculty awareness of OER continues to increase, only 5.3% of classrooms are using open textbooks.

Enter OASIS (Openly Available Sources Integrated Search), a search tool recently developed and launched by SUNY Geneseo's Milne Library. OASIS addresses the main frustration expressed by faculty: how do I know what I'm looking for, or even what open resources are out there?

Oasis Logo Image

The easy-to-use interface and highly selective nature of OASIS are both evident from the front page. At the outset, users can start a search if they know what they’re looking for, or they can view the variety of OER source types available to them—textbooks, courses, interactive simulations, audiobooks, and learning objects are just a few of the tools one can look for.

Image of the options within Oasis for OER materials

Users can also refine their search by source, license, and whether or not the resource has been reviewed, which certainly helps those who need a text that has already been evaluated. At launch, there are over 150,000 items from 52 different sources, including Open NYS, CUNY, Open Textbooks, OER Services, and SUNY. And, as a way to increase awareness of the tool and of open resources generally, OASIS has also created a search widget that libraries and other institutions can embed on their webpages.

OASIS brings OER one step closer to the classroom, providing equal access and increasing the discoverability of open texts.

Check it out here!

Lightning Review: Optical Character Recognition: An Illustrated Guide to the Frontier

Picture of OCR Book

Stephen V. Rice, George Nagy, and Thomas A. Nartker's work on OCR, though written in 1999, remains a remarkably valuable bedrock text for diving into the technology. Though OCR systems have evolved, and continue to evolve, with each passing day, the study presented in their book still highlights some of the major issues one faces when performing optical character recognition: text set in an unusual typeface or marred by stray marks, print that is too heavy or too light. This text gives those interested in learning the general problems that arise in OCR a great guide to what they and their patrons might encounter.

The book opens with a quote from C-3PO and a discussion of how our collective sci-fi imagination believes technology will have "cognitive and linguistic abilities" that match, and perhaps even exceed, our own (Rice et al., 1999, p. 1).

C3PO Gif

The human eye is the most powerful character identifier in existence. As the authors note, "A seven year old child can identify characters with far greater accuracy than the leading OCR systems" (Rice et al., 1999, p. 165). I found this simple explanation helpful for the questions I get here in the Scholarly Commons from patrons who are confused about why their document, even after being run through OCR software, is not perfectly recognized. It is very easy, with our human eyes, to discern when a mark on a page is nothing of importance and when it is a letter. Ninety-nine percent character accuracy does not mean ninety-nine percent page accuracy.
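
A quick back-of-the-envelope calculation (with made-up page sizes, and assuming errors hit characters independently) shows why that last point matters:

```python
# Back-of-the-envelope: what 99% character accuracy can mean at the
# word and page level, assuming errors strike characters independently
# (a simplification, but it makes the point).
char_accuracy = 0.99
avg_word_length = 5       # characters per word (illustrative)
words_per_page = 400      # illustrative

word_accuracy = char_accuracy ** avg_word_length
expected_bad_words = words_per_page * (1 - word_accuracy)

print(f"Word-level accuracy: {word_accuracy:.1%}")                # ~95.1%
print(f"Misrecognized words per page: {expected_bad_words:.0f}")  # ~20
```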

Look with your special eyes Gif

In summary, even at almost two decades old, this work is a great starting point for anyone interested in understanding OCR technology.

Give it, and the many other fabulous books in our reference collection, a read!

Lightning Review: How to Use SPSS

“A nice step-by-step explanation!”

“Easy, not too advanced!”

“A great start!”

Real, live reviews of Brian C. Cronk's How to Use SPSS: A Step-By-Step Guide to Analysis and Interpretation from some of our patrons! This book, the tenth edition of a nine-chapter text published by Taylor and Francis, is rife with walkthroughs, images, and simple explanations that demystify the process of learning this statistical software. It also contains six appendixes, and our patrons sang its praises after a two-hour research session here in the Scholarly Commons!

SPSS, described on IBM's webpage as "the world's leading statistical software used to solve business and research problems by means of ad-hoc analysis, hypothesis testing, geospatial analysis and predictive analytics. Organizations use IBM SPSS Statistics to understand data, analyze trends, forecast and plan to validate assumptions and drive accurate conclusions," is one of many tools CITL Statistical Consulting uses on a day-to-day basis in assisting Scholarly Commons patrons. Schedule a consultation with them from 10 am to 2 pm, Monday through Thursday, for the rest of the summer!

We're thrilled to hear this 2018 title is a hit with the researchers we serve! Cronk's book, along with many other works on software, digital publishing, data analysis, and more, is part of our reference collection, which is free to use by anyone and everyone in the Scholarly Commons!

Using an Art Museum’s Open Data

*Original idea and piece by C. Berman; edited by Billy Tringali

As a former art history student, I'm incredibly interested in how the study of art history can be aided by the digital humanities. More and more museums have started allowing the public to access a portion of their data. When it comes to open data, museums seem to be lagging a bit behind other cultural heritage institutions, but many are providing great open data for humanists.

For art museums, the amount of data provided varies widely. Some museums go the extra mile and release much of their metadata to the public; others pick and choose aspects of their collections, such as the Museum of Modern Art's Exhibition and Staff Histories.

Many museums, especially those that collect modern and contemporary art, can have their hands tied by copyright laws when it comes to the data they present. A few of the data sets currently available from art museums are the Cooper Hewitt’s Collection Data, the Minneapolis Institute of Arts metadata, the Rijksmuseum API, the Tate Collection metadata, and the Getty Vocabularies.

The Metropolitan Museum of Art has recently released all images of the museum’s public domain works under a Creative Commons Zero license.

More museum data can be found here!