Czech Historical Named Entity Corpus v 1.0


General Information

Czech Historical Named Entity Corpus v 1.0 is a collection of annotated texts for named-entity recognition in historical Czech. It consists of Czech texts from the newspaper "Posel od Čerchova" from the second half of the 19th century. We defined the following basic NE types: Personal names, Institutions, Geographical names, Time expressions and Artifact names / Objects.

Each token is placed on a separate line containing four columns separated by a space. The first column is the token. The second column is reserved for the lemma, which is not specified in this corpus and is represented by an underscore. The third column gives the language of the token: most tokens are Czech ("CZ"), but some are German ("DE"), French ("FR") or Latin ("LA"). The fourth column gives the named-entity type.

We also use the "BIO" notation: the tag "B" ("beginning") marks the first word of a multiword entity, and the tag "I" ("internal") marks every following word of that entity. All tokens that are not part of a named entity are tagged "O" ("outside").
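The BIO scheme can be turned back into entity spans with a short routine. The following is only a minimal sketch, not the authors' tooling. It assumes that each tag has the form "B-<type>" or "I-<type>" (or plain "O"); the exact tag strings used in the corpus files are not given here, so the labels "PER" and "GEO" and the function name bio_to_spans are hypothetical.

```python
def bio_to_spans(tags):
    """Group BIO tags into (entity_type, start, end) spans; end is exclusive.

    Assumption: tags look like "B-<type>", "I-<type>" or "O".
    """
    spans = []
    start, current = None, None
    for i, tag in enumerate(tags):
        prefix, _, label = tag.partition("-")
        if prefix == "I" and current is not None and label == current:
            continue  # still inside the currently open entity
        if current is not None:
            # Any other tag closes the open entity.
            spans.append((current, start, i))
            start, current = None, None
        if prefix in ("B", "I") and label:
            # "B" opens a new entity; a stray "I" is leniently treated the same way.
            start, current = i, label
    if current is not None:
        spans.append((current, start, len(tags)))
    return spans


# Hypothetical tag sequence, for illustration only.
example_tags = ["B-PER", "I-PER", "O", "B-GEO"]
example_spans = bio_to_spans(example_tags)
```

Using half-open (start, end) indices means that tokens[start:end] yields exactly the words of one entity.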

Sentences are separated by an empty line.
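The file layout described above (four space-separated columns per token, an empty line between sentences) can be read with a few lines of Python. This is a sketch under stated assumptions rather than an official loader: the function name read_sentences, the sample lines and the tag strings "B-PER" / "I-PER" are invented for illustration, and tokens are assumed never to contain spaces.

```python
def read_sentences(lines):
    """Yield sentences as lists of (token, lemma, language, ne_tag) tuples."""
    sentence = []
    for raw in lines:
        line = raw.strip()
        if not line:
            # An empty line ends the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        parts = line.split()
        if len(parts) != 4:
            raise ValueError(f"expected 4 columns, got {len(parts)}: {line!r}")
        sentence.append(tuple(parts))
    if sentence:
        # The last sentence may not be followed by an empty line.
        yield sentence


# Made-up sample in the described format (not taken from the corpus).
SAMPLE = """Jan _ CZ B-PER
Novák _ CZ I-PER
přišel _ CZ O

Danke _ DE O
"""

sentences = list(read_sentences(SAMPLE.splitlines()))
```

To read a real file, an open file object can be passed instead of SAMPLE.splitlines(), for example one opened with encoding="utf-8".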

License

This dataset is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License. It is freely available for research purposes only; commercial use in any form is strictly excluded.

Download

For further information about this corpus, please see the paper below:

  • H. Hubkova, P. Kral, E. Pettersson, Czech Historical Named Entity Corpus v 1.0, 13th Edition of the Language Resources and Evaluation Conference (LREC 2020), Marseille, France, 11-16 May 2020, pp. 4460-4467, European Language Resources Association (ELRA), ISBN: 979-10-95546-34-4.
  • Please cite this paper if you use this corpus in your experiments.

    If you have additional questions or comments related to this corpus, please do not hesitate to contact the authors: Helena Hubkova (hhubkova@kiv.zcu.cz) or Pavel Kral (pkral@kiv.zcu.cz).