User:Gaurav/DBpedia
Between May and August 2014, I worked on a Google Summer of Code project to incorporate metadata from the Commons into the structured content collected by DBpedia. I worked on a GitHub fork, using GitHub issues to track the project. This work has since been published in the proceedings of the 14th International Semantic Web Conference, Bethlehem, PA, USA (doi:10.1007/978-3-319-25010-6_17).
Deliverables
Completed
- Added support for the Wikimedia Commons to the DBpedia Extraction Framework
- Added support for files to the DBpedia Extraction Framework
- Created a FileTypeExtractor that extracts filenames, file extensions and MIME types as triples: #3, #11
- Extended the ontology for describing media classes used in the Commons, including: dbo:File, dbo:Image, dbo:StillImage, dbo:MovingImage and dbo:Sound: #10
- Designed and built new extractors
- Created a GalleryExtractor to extract galleries from pages and store them using the dbo:hasGalleryItem property: #26
- Created an ImageAnnotationExtractor that extracts annotations made with the ImageAnnotator gadget and stores them using the W3C Media Fragments recommendation: #31
- Created a CommonsKMLExtractor that extracts overlay.kml files stored on the Commons to provide overlays to Commons images: #22
- Made it easier to create and test mappings for Commons templates
- Added support for Commons templates that include other templates, such as {{Self}}: #16
- Improved the extractionSamples page on the mappings server to support any namespace, not just Main; for example, extractionSamples in the File namespace for {{Gray's Anatomy plate}}: #9
- Created a generic identifier property (dbo:identifiedBy) and a category for identifiers on the Mappings wiki: #14
- Incorporated and extended my mentor's code for adding prefixes and suffixes to a property to support ObjectProperties: #29
- Mapped some Commons templates to test the above, including:
- Automatically mapped around 360 license templates: #18, #20
- {{VN}}: #12
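As a rough illustration of two of the items above (file type triples and Media Fragments annotations), here is a minimal Python sketch. It is not the code from the DBpedia Extraction Framework: the base URI, the `http://example.org/ontology/` property names, the small MIME table and both function names are assumptions made for this example only. The one standard piece is the `#xywh=x,y,w,h` fragment syntax, which the W3C Media Fragments recommendation defines for addressing a rectangular pixel region.

```python
from urllib.parse import quote

# Illustrative subset only; the mapping used by the real extractor is not shown on this page.
MIME_TYPES = {
    "jpg": "image/jpeg",
    "jpeg": "image/jpeg",
    "png": "image/png",
    "svg": "image/svg+xml",
    "pdf": "application/pdf",
}

COMMONS_BASE = "http://commons.dbpedia.org/resource/"  # assumed resource base URI
EX = "http://example.org/ontology/"  # hypothetical property namespace


def file_type_triples(title):
    """Return (subject, predicate, object) triples for a 'File:...' page title."""
    subject = COMMONS_BASE + quote(title.replace(" ", "_"), safe=":")
    # Drop the namespace prefix ("File:") to get the bare file name.
    name = title.split(":", 1)[1] if ":" in title else title
    triples = [(subject, EX + "fileName", name)]
    if "." in name:
        ext = name.rsplit(".", 1)[1].lower()
        triples.append((subject, EX + "fileExtension", ext))
        mime = MIME_TYPES.get(ext)
        if mime is not None:
            triples.append((subject, EX + "mimeType", mime))
    return triples


def media_fragment_uri(file_uri, x, y, width, height):
    """Address a rectangular pixel region using a Media Fragments 'xywh' fragment."""
    return f"{file_uri}#xywh={x},{y},{width},{height}"
```

For example, `file_type_triples("File:Example image.JPG")` yields a file name triple, an extension triple with the value `jpg`, and a MIME type triple with the value `image/jpeg`, while a title with no extension yields only the file name triple.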
Abandoned
- Examining the file metadata dump for interesting metadata (there wasn't any): #21
- Handling disambiguation pages on the Commons (there are only ~5,000, so we decided not to work on this): #24
- Extracting image captions from the language Wikipedias (ran out of time): #27
- Writing tests for the FileTypeExtractor and LabelExtractor: #8, #25
- Creating template mappings for the ten most used templates (ran out of time): #7
- Proposing a new scheme for linking objects in DBpedia through URI-based identifiers to the rest of the Web of Things (ran out of time): #15
Other outputs
- A test dataset of 2,033 files that I used to test my code.
- The RDF output produced from this test dataset.
- A list of every major license type used in the Commons, how it can be identified, and how it can be represented in DBpedia: #34
Google Summer of Code Mentors
- Dimitris Kontokostas
- Andrea Di Menna
- Jim O'Regan