Europeana Common Culture webinar: Increasing (raw) data quality using OpenRefine
This webinar highlights how aggregators and cultural heritage institutions can use OpenRefine, a free, open source tool for handling messy data.
This webinar was hosted by Europeana in collaboration with Rob Davies, Cyprus University of Technology, as part of the Europeana Common Culture project. The speakers introduce the tool, discuss their own experience of working with OpenRefine, answer questions and offer their support to colleagues interested in using the tool in their workflows.
The webinar took place on 20 April 2020, from 13:00 to 14:00 CET.
Questions from participants
Is OpenRefine suitable for survey results? Is there a visualisation option for the data? (Tamara, NBS)
Cosmina answers at minute 47:42 in the recording: You have the tabular format, which you can browse, and you can use faceting for statistical purposes. I always try to export the data and look at the end result. There are various ways of looking at the data: faceting (and sub-faceting), clustering and exporting all offer very good opportunities to analyse it.
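To make the clustering Cosmina mentions more concrete, here is a minimal Python sketch of the general "key collision" idea: values that reduce to the same normalised key are treated as likely variants of one another. It is loosely modelled on the fingerprint method described in OpenRefine's clustering documentation and is not OpenRefine's own implementation. The function names and the sample creator names are made up for illustration.

```python
import string
import unicodedata
from collections import defaultdict


def fingerprint(value: str) -> str:
    """Build a key that ignores case, accents, punctuation and word order."""
    text = value.strip().lower()
    # Decompose accented characters and drop the combining marks.
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Remove ASCII punctuation, then split on whitespace.
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Sort and de-duplicate the tokens so word order does not matter.
    return " ".join(sorted(set(text.split())))


def cluster(values):
    """Group values by key and keep only groups with more than one spelling."""
    groups = defaultdict(set)
    for value in values:
        groups[fingerprint(value)].add(value)
    return [sorted(variants) for variants in groups.values() if len(variants) > 1]


# Hypothetical values from a "creator" column.
creators = [
    "Rembrandt van Rijn",
    "rembrandt van rijn.",
    "Van Rijn, Rembrandt",
    "Vermeer, Johannes",
]
clusters = cluster(creators)
```

In this sketch the three Rembrandt spellings end up in one cluster, while the single Vermeer value is left alone. Deciding which spelling to keep is still a human judgement.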
Could you repeat how to export data in EDM? (Laura / Hispana)
Cosmina recaps how to export data in EDM at minute 49:00 in the recording.
Would your data partners be interested in using this tool at source, so that you already receive cleaned data? Or are your data partners using it? (Adina, Europeana Foundation)
Cosmina answers at minute 52:00 in the recording: Not really. Data providers offer data over OAI-PMH; they program the interface for different formats and will not change it for our (Europeana aggregator) purposes. Others might be interested and, if resources are available, they might try it. Museums use their own export tools for their own formats. At the German Digital Library we try to unify the data from various providers (in various formats) under a common format, DDB-EDM.
Tom answers at minute 53:20: I’m the main user of OpenRefine and use it regularly. I am able to look at the data, have a conversation with the data provider and offer to change some values for them. Some data providers may not have the time or skills to do that themselves. It’s a powerful tool for seeing how something can be revised: once you are used to the faceting and cleaning functions, what seems a laborious exercise becomes quite simple. I’d be happy to offer training on this to whoever needs it, as I think this is a very good tool for aggregators.
If more aggregators are interested in working with OpenRefine, would you, Cosmina and Tom, be happy providing support or what is your view of organising support for more people working with OpenRefine? (Henning, Europeana Foundation)
Tom answers at minute 59:12 in the recording: I’d be happy to help. An OpenRefine user group on Basecamp is maybe an option, where Cosmina and I would be able to assist.
Cosmina answers at minute 1:00:10 in the recording: A hands-on webinar would also help, where everyone brings a dataset to work on together. We could use Basecamp to share recipes, commands, etc.
How do you make a source other than Wikidata, such as a thesaurus, available for reconciliation? (Lois Miller)
Cosmina answers at minute 55:22 in the recording: I use GND, the controlled vocabulary used by libraries in Germany. There is a list of extensions available on the site where you download OpenRefine. Cosmina shows how this works in practice at minute 56:30 in the recording.
Do you need a training period to get familiar with the tool? (Rob, CUT)
Tom answers at minute 1:02:00 in the recording: It’s mostly crash and burn: bite the bullet, install it, click around and see what comes out of it. There’s lots of information on GitHub and in the links we shared. But it would be good to have some sort of forum to discuss solutions together.
Cosmina answers at minute 1:03:53 in the recording: Crash and burn is indeed the way I learned it too. The biggest difficulty was exporting the data after cleaning and performing all the operations I needed, but after googling and reading forums I found what I was looking for. On Stack Overflow there is a subgroup where questions are answered really quickly.
Tom adds at minute 1:05:12: I had difficulties exporting data in EDM; with Cosmina’s help, I was able to learn that. At Europeana Sounds we use MINT as a mapping tool, but OpenRefine could do everything for you: it can be used from scratch, or to cover gaps in the mapping tool that you are using.
Cosmina adds at minute 1:06:23 in the recording: Exactly. We haven’t talked about the disadvantages of OpenRefine; it has its limits. The biggest set that I was able to handle in OpenRefine had 150,000 records (over 1 million lines), and the tool got very slow. What helped was splitting the sets into subsets.
Tom adds at minute 1:07:15: Another thing you can do is allocate more memory. Here’s how you can allocate more RAM to OpenRefine on your computer: https://github.com/OpenRefine/OpenRefine/wiki/FAQ-Allocate-More-Memory
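Splitting a large set into subsets, as Cosmina describes, could be done before import with a short script. The Python sketch below is one possible approach and not something shown in the webinar. It assumes a CSV export in which a row with a non-empty first cell starts a new record and rows with an empty first cell continue the previous one, so that a record spanning several lines is never cut in half. The function name, column names and sample data are hypothetical.

```python
import csv
import io


def split_records(rows, records_per_subset):
    """Yield lists of rows, never separating rows that belong to one record.

    Assumption: a non-empty first cell marks the start of a new record.
    """
    subset, records_in_subset = [], 0
    for row in rows:
        starts_record = bool(row) and row[0].strip() != ""
        # Close the current subset only at a record boundary.
        if starts_record and records_in_subset == records_per_subset:
            yield subset
            subset, records_in_subset = [], 0
        if starts_record:
            records_in_subset += 1
        subset.append(row)
    if subset:
        yield subset


# Hypothetical sample standing in for a large export: 3 records on 5 lines.
sample = io.StringIO(
    "id,title,subject\n"
    "1,Map,Cartography\n"
    ",,Cyprus\n"
    "2,Letter,Correspondence\n"
    "3,Photo,Portrait\n"
    ",,Studio\n"
)
reader = csv.reader(sample)
header = next(reader)  # keep the header so it can be repeated in every subset
subsets = list(split_records(reader, records_per_subset=2))
```

Each subset can then be written to its own file with the header row repeated at the top and loaded as a separate project. A sensible subset size depends on the memory available, so it is left as a parameter here.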
Why OpenRefine? Do you know of other tools out there that can do similar things? (Rob, CUT)
Cosmina answers at minute 1:09:06 in the recording: Other than MINT, I’m not aware of other tools.
Tom answers at minute 1:09:13 in the recording: It’s more about what people would be more comfortable using, like Python or other programming languages. Until now, I was using Google Sheets, Excel and MINT. For me the real benefit of OpenRefine is that it gives you an overview and a breakdown of values.
Cosmina adds at minute 1:10:50 in the recording: This answers Tamara’s question, because the statistical overview that facets give is actually what you would like to see. In my experience, smaller sets are messier than big sets that are automatically generated. From the smaller institutions we get messy data, sometimes in XLS or Word tables, and it is very difficult to put that into a structured format. Here OpenRefine helps a lot.
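For readers who have not seen a facet, the "statistical overview" it gives is essentially a count of each distinct value in a column. The Python sketch below imitates that idea outside OpenRefine. The function name, column names and sample rows are hypothetical, and the sketch says nothing about how OpenRefine computes facets internally.

```python
from collections import Counter


def text_facet(rows, column):
    """Count how often each distinct value occurs in one column, most frequent first."""
    counts = Counter(row.get(column, "") or "(blank)" for row in rows)
    return counts.most_common()


# Hypothetical records with the kind of inconsistencies a facet makes visible.
rows = [
    {"type": "IMAGE", "language": "de"},
    {"type": "image", "language": "de"},
    {"type": "TEXT", "language": ""},
    {"type": "IMAGE", "language": "en"},
]
type_facet = text_facet(rows, "type")
language_facet = text_facet(rows, "language")

for value, count in type_facet:
    print(f"{value}: {count}")
```

A breakdown like this shows at a glance that "IMAGE" and "image" are both in use and that one record has no language, which is the kind of overview Tom and Cosmina describe.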