English subtitles for clip: File:Wikidata Editing with OpenRefine - Part 2.webm

1
00:00:00,000 --> 00:00:02,367
Welcome back to this tutorial

2
00:00:02,367 --> 00:00:04,700
on using OpenRefine to import data

3
00:00:04,700 --> 00:00:06,400
into Wikidata.

4
00:00:06,400 --> 00:00:08,200
In the previous video,

5
00:00:08,200 --> 00:00:10,750
we matched films against Wikidata items

6
00:00:10,750 --> 00:00:13,100
and checked the quality of these matches.

7
00:00:13,100 --> 00:00:15,050
For each of these films,

8
00:00:15,050 --> 00:00:17,300
we want to add the filming locations

9
00:00:17,300 --> 00:00:19,550
to the Wikidata items.

10
00:00:19,550 --> 00:00:22,150
This requires reconciling the locations as well.

11
00:00:22,150 --> 00:00:25,107
So let's do this.

12
00:00:25,107 --> 00:00:27,104
The locations we have in this dataset

13
00:00:27,104 --> 00:00:29,726
are given as street addresses.

14
00:00:29,726 --> 00:00:31,974
These particular addresses are unlikely

15
00:00:31,974 --> 00:00:33,939
to have a corresponding Wikidata item,

16
00:00:33,939 --> 00:00:36,939
but the streets they are in often have one.

17
00:00:38,665 --> 00:00:40,608
So we are first going to extract

18
00:00:40,608 --> 00:00:41,871
the street names from the addresses.

19
00:00:41,871 --> 00:00:44,000
We use a regular expression

20
00:00:44,000 --> 00:00:47,467
to remove any number at the beginning of the string.
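
[Note: the captions do not show the exact expression typed in the video. A plausible first attempt in GREL, OpenRefine's expression language, might look like
    value.replace(/^\s*[0-9]+\s*/, "")
though the expression actually used may differ.]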

21
00:00:51,504 --> 00:00:54,500
In the preview window, we notice that

22
00:00:55,089 --> 00:00:55,400
our regular expression did not catch

23
00:00:55,400 --> 00:00:57,800
the leading spaces.

24
00:00:57,800 --> 00:01:00,500
This is an indication that these strings

25
00:01:00,500 --> 00:01:03,050
contain non-standard space characters.

26
00:01:03,050 --> 00:01:04,550
They are likely to cause problems

27
00:01:04,550 --> 00:01:07,445
during reconciliation with Wikidata.

28
00:01:07,445 --> 00:01:08,892
So let's just copy these weird characters

29
00:01:08,892 --> 00:01:10,700
and get rid of them

30
00:01:10,700 --> 00:01:13,967
with a first call to replace.

31
00:01:15,479 --> 00:01:16,979
The first call to replace

32
00:01:16,979 --> 00:01:18,646
cleans up the whitespace;

33
00:01:19,302 --> 00:01:22,302
the second removes the street numbers.
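
[Note: a possible form of the final GREL expression, assuming the stray characters are Unicode space separators such as non-breaking spaces (in the video the copied character is pasted in literally):
    value.replace(/\p{Zs}/, " ").replace(/^\s*[0-9]+\s*/, "")
The first replace normalizes the whitespace; the second strips the leading street number.]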

34
00:01:30,550 --> 00:01:32,538
Pick a name for the column

35
00:01:32,750 --> 00:01:34,600
and create it.

36
00:01:37,050 --> 00:01:40,464
We can now reconcile these streets to Wikidata.

37
00:01:40,464 --> 00:01:43,531
Again, pick "Reconcile" -> "Start reconciling"

38
00:01:43,548 --> 00:01:45,525
and choose the Wikidata service.

39
00:01:48,250 --> 00:01:51,638
In this case, the "street" type is too narrow.

40
00:01:52,071 --> 00:01:54,735
Some locations are parks or bridges

41
00:01:54,735 --> 00:01:57,735
so we manually pick a broader type.

42
00:01:57,936 --> 00:01:59,859
Let's see what other information we could use

43
00:01:59,859 --> 00:02:02,200
to improve the matches.

44
00:02:02,200 --> 00:02:04,754
The postcode looks like a good fit

45
00:02:04,754 --> 00:02:07,300
but unfortunately postcodes are rarely

46
00:02:07,300 --> 00:02:10,300
added to street items.

47
00:02:10,600 --> 00:02:13,000
The last column contains the geographical

48
00:02:13,000 --> 00:02:14,535
coordinates of the locations,

49
00:02:14,535 --> 00:02:17,535
expressed as latitude, comma, longitude.

50
00:02:18,912 --> 00:02:22,379
We can match that to the coordinates of the streets.
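
[Note: in the reconciliation dialog this is typically done by mapping the coordinates column to the Wikidata property "coordinate location" (P625); the captions do not show the exact setting used here.]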

51
00:02:22,688 --> 00:02:25,218
The closer these geographical points are,

52
00:02:25,218 --> 00:02:28,218
the higher the matching score will be.

53
00:02:35,400 --> 00:02:37,949
Once reconciliation is done,

54
00:02:37,949 --> 00:02:39,902
we can inspect the matches.

55
00:02:39,902 --> 00:02:41,644
In this case, we can see that two streets

56
00:02:41,644 --> 00:02:43,034
with the same name

57
00:02:43,034 --> 00:02:45,073
got different matching scores,

58
00:02:45,073 --> 00:02:48,073
thanks to the matching on coordinates.

59
00:02:48,111 --> 00:02:51,111
The first one is the correct one.

60
00:02:52,600 --> 00:02:55,209
This cell was not matched automatically

61
00:02:55,209 --> 00:02:56,550
because the gap between the two scores

62
00:02:56,550 --> 00:02:58,300
is not big enough.

63
00:02:58,300 --> 00:03:00,550
I suspect there are more cases like this,

64
00:03:00,550 --> 00:03:02,900
so I am just going to filter the cells

65
00:03:02,900 --> 00:03:06,100
which were not matched

66
00:03:06,100 --> 00:03:10,431
but whose best candidate score is very high.
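
[Note: one way to build this filter is a custom GREL facet on the reconciled column, with an illustrative threshold:
    and(cell.recon.judgment == "none", cell.recon.best.score > 99)
OpenRefine's built-in reconciliation facets ("By judgment" and "Best candidate's score") can achieve the same thing; the captions do not show which approach the video uses.]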

67
00:03:12,050 --> 00:03:14,500
I'm also going to add a facet

68
00:03:14,500 --> 00:03:16,238
which computes the string similarity

69
00:03:16,238 --> 00:03:17,819
between the cell content

70
00:03:17,819 --> 00:03:20,202
and the name of the best match

71
00:03:20,202 --> 00:03:23,202
and restrict to the high quality matches.
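
[Note: the exact facet is not shown in the captions. A rough GREL equivalent compares the cell to its best candidate's name, e.g.
    value.fingerprint() == cell.recon.best.name.fingerprint()
OpenRefine also offers built-in reconciliation facets based on the best candidate's name, which serve the same purpose.]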

72
00:03:24,500 --> 00:03:26,127
Let's review these filtered rows

73
00:03:26,127 --> 00:03:27,860
and their best candidates.

74
00:03:48,209 --> 00:03:50,679
All these candidates are correct.

75
00:03:50,679 --> 00:03:52,469
So click "Reconcile" -> "Actions"

76
00:03:52,469 --> 00:03:55,469
-> "Match each cell to its best candidate"

77
00:03:57,015 --> 00:03:58,931
Obviously this operation should be used with care

78
00:03:58,931 --> 00:04:01,931
because it can introduce false positives.

79
00:04:03,650 --> 00:04:05,200
Let's now check the quality

80
00:04:05,200 --> 00:04:07,671
of the matched cells.

81
00:04:08,079 --> 00:04:09,582
For instance,

82
00:04:09,582 --> 00:04:11,100
we can fetch the administrative location

83
00:04:11,100 --> 00:04:15,669
of these streets.
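
[Note: fetching these values is likely done via the column menu -> "Edit column" -> "Add columns from reconciled values", pulling the Wikidata property "located in the administrative territorial entity" (P131); the captions do not show the exact property chosen.]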

84
00:04:23,676 --> 00:04:24,918
Once these locations are fetched,

85
00:04:24,918 --> 00:04:27,514
we can create a text facet on this column

86
00:04:27,514 --> 00:04:29,168
and sort the facet

87
00:04:29,168 --> 00:04:32,168
by decreasing number of occurrences.

88
00:04:39,550 --> 00:04:42,100
This gives us a broad overview

89
00:04:42,100 --> 00:04:47,241
of the most frequent values.

90
00:04:47,241 --> 00:04:47,800
We can review this list.

91
00:04:47,800 --> 00:04:50,050
All these locations are neighborhoods in Paris,

92
00:04:50,050 --> 00:04:52,947
which is consistent with the dataset.

93
00:05:03,250 --> 00:05:06,551
This is the end of the second part of this tutorial.

94
00:05:06,551 --> 00:05:08,500
In the next video, we are going to

95
00:05:08,500 --> 00:05:10,965
transform our table into statements

96
00:05:10,965 --> 00:05:12,832
and upload them to Wikidata.