English subtitles for clip: File:Wikidata Editing with OpenRefine - Part 1.webm
1 00:00:00,000 --> 00:00:05,333 Welcome to this tutorial series on using OpenRefine to import data into Wikidata.
2 00:00:05,333 --> 00:00:06,833 My name is Antonin.
3 00:00:06,850 --> 00:00:09,674 I'm going to walk you through the entire process
4 00:00:09,674 --> 00:00:11,489 of cleaning up the dataset,
5 00:00:11,489 --> 00:00:13,468 matching it with Wikidata items,
6 00:00:13,468 --> 00:00:17,601 and uploading the information as statements on these items.
7 00:00:17,612 --> 00:00:20,133 No previous knowledge of OpenRefine is necessary to follow this tutorial,
8 00:00:20,133 --> 00:00:23,333 but some familiarity with Wikidata will help.
9 00:00:24,078 --> 00:00:26,627 All the links necessary to follow the tutorial
10 00:00:26,627 --> 00:00:28,485 can be found in the description of the video.
11 00:00:28,485 --> 00:00:30,828 So let's get started!
12 00:00:30,828 --> 00:00:35,561 OpenRefine is free software that you can download from openrefine.org.
13 00:00:35,930 --> 00:00:40,330 Once you have installed it, it runs in your browser like this.
14 00:00:40,679 --> 00:00:43,363 In this tutorial, we are going to import data
15 00:00:43,363 --> 00:00:46,496 about shooting locations of films in Paris.
16 00:00:47,592 --> 00:00:49,947 The dataset we are going to work on is available
17 00:00:49,947 --> 00:00:52,947 on the Parisian open data portal
18 00:00:53,455 --> 00:00:55,962 and we can download it as a CSV file.
19 00:00:55,962 --> 00:00:58,501 We can just copy the URL of that file
20 00:00:58,501 --> 00:01:01,501 and paste it into OpenRefine.
21 00:01:01,794 --> 00:01:04,395 We now have a preview of the table
22 00:01:04,395 --> 00:01:06,604 and we are happy with this format
23 00:01:06,604 --> 00:01:10,004 so we give a name to the project and create it.
24 00:01:13,482 --> 00:01:15,824 The first step to import this data into Wikidata
25 00:01:15,824 --> 00:01:17,324 is to match the film names
26 00:01:17,324 --> 00:01:20,191 with the Wikidata items they correspond to.
27 00:01:20,766 --> 00:01:22,266 Click on the column that contains the names
28 00:01:22,266 --> 00:01:23,600 of the entities that you want to match
29 00:01:23,600 --> 00:01:26,667 and choose "Reconcile" -> "Start reconciling".
30 00:01:27,200 --> 00:01:30,200 Pick the Wikidata reconciliation service.
31 00:01:31,150 --> 00:01:33,100 OpenRefine tries to guess
32 00:01:33,100 --> 00:01:37,100 the type of entity these names correspond to.
33 00:01:37,100 --> 00:01:37,688 In our case,
34 00:01:37,688 --> 00:01:40,688 its best guess is "film"
35 00:01:40,953 --> 00:01:43,638 which looks appropriate.
36 00:01:43,638 --> 00:01:46,572 OpenRefine will only consider instances of that class
37 00:01:46,572 --> 00:01:48,488 or subclasses of it
38 00:01:48,488 --> 00:01:51,472 when looking for matches.
39 00:01:51,472 --> 00:01:54,302 OpenRefine also lets you match on other properties
40 00:01:54,302 --> 00:01:56,993 stored in other columns of the table.
41 00:01:56,993 --> 00:01:59,785 In our case, the "Réalisateur" column
42 00:01:59,785 --> 00:02:02,145 contains the name of the film director,
43 00:02:02,145 --> 00:02:05,021 which is very useful for disambiguation.
44 00:02:05,021 --> 00:02:07,594 So tick that column and select
45 00:02:07,594 --> 00:02:10,114 the Wikidata property it should be matched against.
46 00:02:10,114 --> 00:02:13,066 Click "Start reconciling"
47 00:02:13,066 --> 00:02:16,066 and wait for the process to complete.
48 00:02:26,998 --> 00:02:29,153 Now that reconciliation is done,
49 00:02:29,153 --> 00:02:30,803 some names have turned into blue links
50 00:02:30,803 --> 00:02:34,270 which point to the corresponding Wikidata items.
51 00:02:34,990 --> 00:02:36,969 Others were not matched,
52 00:02:36,969 --> 00:02:39,185 for instance because the director did not match
53 00:02:39,185 --> 00:02:42,185 in the case of this "Nadia" film.
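To make the reconciliation step above more concrete, here is a minimal sketch of the kind of query a reconciliation service receives for each row: the cell text, the entity type that restricts candidates, and an extra property taken from another column. The general shape (`query`, `type`, `properties` with `pid` and `v`) follows the reconciliation service query format, but the exact payload OpenRefine sends may differ. The "Titre" column name and the sample row are hypothetical; "Réalisateur" is the column named in the tutorial, Q11424 is the Wikidata class "film" and P57 is the Wikidata property "director".

```python
import json

FILM_TYPE = "Q11424"        # Wikidata class "film"
DIRECTOR_PROPERTY = "P57"   # Wikidata property "director"


def build_reconciliation_queries(rows):
    """Build one reconciliation query per table row, keyed q0, q1, ..."""
    queries = {}
    for index, row in enumerate(rows):
        queries[f"q{index}"] = {
            # Text of the cell being reconciled ("Titre" is a hypothetical column name)
            "query": row["Titre"],
            # Only instances of this class (or of its subclasses) are candidates
            "type": FILM_TYPE,
            # Extra column used for disambiguation
            "properties": [{"pid": DIRECTOR_PROPERTY, "v": row["Réalisateur"]}],
        }
    return queries


# Hypothetical sample row, not taken from the actual dataset
sample_rows = [{"Titre": "Nadia", "Réalisateur": "Jane Doe"}]
print(json.dumps(build_reconciliation_queries(sample_rows), indent=2, ensure_ascii=False))
```

If the director stored on Wikidata differs from the value in the table, the candidate scores lower, which is one way a row can end up unmatched.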
54 00:02:42,411 --> 00:02:44,042 Some other films were not matched
55 00:02:44,042 --> 00:02:47,698 because Wikidata does not know who their director is.
56 00:02:47,698 --> 00:02:49,116 If you have time,
57 00:02:49,116 --> 00:02:51,265 you can go through these unmatched cells
58 00:02:51,265 --> 00:02:53,290 and manually reconcile them.
59 00:02:53,290 --> 00:02:55,097 But you can also leave them as they are:
60 00:02:55,097 --> 00:02:58,430 these rows will just be ignored in the import.
61 00:03:00,100 --> 00:03:02,993 On the left-hand side, you can see two facets.
62 00:03:02,993 --> 00:03:04,530 These can be used to filter rows
63 00:03:04,530 --> 00:03:06,200 based on their matching status
64 00:03:06,200 --> 00:03:08,381 and matching score.
65 00:03:08,381 --> 00:03:10,896 You can select rows where matching succeeded
66 00:03:10,896 --> 00:03:13,896 by clicking on the "matched" status.
67 00:03:15,450 --> 00:03:17,200 It is important that you check
68 00:03:17,200 --> 00:03:19,500 the quality of these automated matches,
69 00:03:19,500 --> 00:03:21,250 and there are many ways to do this.
70 00:03:21,250 --> 00:03:23,250 In our case, the table contains
71 00:03:23,250 --> 00:03:25,000 the dates of the shootings
72 00:03:25,000 --> 00:03:26,700 so we can compare them
73 00:03:26,700 --> 00:03:28,774 to the release dates of the movies
74 00:03:28,774 --> 00:03:30,440 and check that they are consistent.
75 00:03:30,440 --> 00:03:32,855 Click on the reconciled column,
76 00:03:32,855 --> 00:03:36,000 pick "Edit column" -> "Add column from reconciled values"
77 00:03:36,000 --> 00:03:39,000 and select "publication date".
78 00:03:46,700 --> 00:03:49,050 We will now create a column
79 00:03:49,050 --> 00:03:50,650 that will contain the difference
80 00:03:50,650 --> 00:03:52,150 between the publication date
81 00:03:52,150 --> 00:03:54,350 and the end of shooting date.
82 00:03:57,278 --> 00:04:01,211 Pick "Edit column" -> "Add column based on this column".
83 00:04:02,498 --> 00:04:04,800 The language used for the expression here
84 00:04:04,800 --> 00:04:06,750 is called GREL.
85 00:04:06,750 --> 00:04:08,550 It is a simple language
86 00:04:08,550 --> 00:04:10,150 that you can learn on OpenRefine's wiki.
87 00:04:10,150 --> 00:04:12,065 You can also select other languages
88 00:04:12,065 --> 00:04:14,398 if you are more familiar with them.
89 00:04:14,750 --> 00:04:17,588 This expression will compute the difference
90 00:04:17,588 --> 00:04:19,150 between the two dates
91 00:04:19,150 --> 00:04:22,159 as a number of days.
92 00:04:22,159 --> 00:04:24,196 Give the new column a name
93 00:04:24,196 --> 00:04:27,196 and create the column.
94 00:04:31,079 --> 00:04:32,579 We can now create a numeric facet
95 00:04:32,579 --> 00:04:33,682 on our new column
96 00:04:33,682 --> 00:04:37,149 and inspect the distribution of the differences.
97 00:04:39,704 --> 00:04:42,124 Some of these differences are negative
98 00:04:42,124 --> 00:04:44,700 which suggests that we might have matched cells
99 00:04:44,700 --> 00:04:48,443 to movies that were released before the shooting.
100 00:04:48,443 --> 00:04:52,200 In fact, that's just because the release dates for them
101 00:04:52,200 --> 00:04:55,952 only have year precision on Wikidata.
102 00:04:57,041 --> 00:04:59,229 The maximum difference is less than two years
103 00:04:59,229 --> 00:05:00,643 which also makes sense,
104 00:05:00,643 --> 00:05:02,020 so we are confident
105 00:05:02,020 --> 00:05:05,020 that these matches are reliable.
106 00:05:08,515 --> 00:05:11,258 This is the end of the first part of the tutorial.
107 00:05:11,258 --> 00:05:13,315 In the next video, we are going to reconcile
108 00:05:13,315 --> 00:05:16,315 the locations of the shootings.
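The date check described above can be sketched outside OpenRefine as follows. This is not the GREL expression used in the video, only a Python illustration of the same arithmetic. All dates are hypothetical, and the sketch assumes that a release date with year precision is represented as January 1 of that year, which is how such a date can appear to fall before the end of shooting.

```python
from datetime import date


def days_between(publication, end_of_shooting):
    """Return the publication date minus the end of shooting date, in days."""
    return (publication - end_of_shooting).days


# Hypothetical end of shooting date for one row of the table
end_of_shooting = date(2016, 3, 15)

# Release date known to the day: the difference is positive
exact_release = date(2017, 2, 1)
print(days_between(exact_release, end_of_shooting))      # 323

# Release date known only to the year (assumed to be stored as January 1):
# the difference is negative although the film came out after the shooting
year_only_release = date(2016, 1, 1)
print(days_between(year_only_release, end_of_shooting))  # -74
```

A negative value therefore does not necessarily indicate a wrong match; it is worth checking the precision of the release date before rejecting the row.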