In my thesis, I use 7 different datasets, summarized in the table below.
Dataset | # of Documents | Vocabulary | Average Length | # of Classes | Description | Download |
---|---|---|---|---|---|---|
AG's News | 127,000 | 63,008 | 26 | 4 | The original source is here. The version I use is from the paper "Character-level Convolutional Networks for Text Classification" [1]. | Google Drive |
Yelp Review Polarity | 598,000 | 221,333 | 68 | 2 | The original source is here. The version I use is from the paper "Character-level Convolutional Networks for Text Classification" [1]. | Google Drive |
IMDB | 100,000 | 137,570 | 119 | 2 | Movie reviews from IMDB, constructed by Stanford. Includes 50,000 unlabeled and 50,000 labeled documents. | Stanford |
Full Movie Review | 2,000 | 38,737 | 351 | 2 | Cited in the paper "Baselines and Bigrams: Simple, Good Sentiment and Topic Classification" [2]. | Cornell |
20 News Group | 1,985 | 24,930 | 140 | 2 | Contains news posts from 20 different newsgroups; in my thesis I use only 2 of the groups. | Download |
arXiv | 240,000 | 161,068 | 98 | 2 | Contains the title and abstract of papers from the CS and non-CS categories of arXiv. | see details below |
arXiv Long | 200,000 | 1,599,799 | 748 | 2 | Contains the title, abstract, and introduction of papers from the CS and non-CS categories of arXiv. | see details below |
We constructed the arXiv dataset ourselves. The official arXiv website provides access to both the full-text data (PDF or LaTeX format) and the metadata. For the metadata, the provided OAI-PMH API returns the title, abstract, and authors of each paper; the following code harvests the arXiv metadata.
```python
import datetime
import time
import urllib.error
import urllib.request
import xml.etree.ElementTree as ET

import pandas as pd

OAI = "{http://www.openarchives.org/OAI/2.0/}"
ARXIV = "{http://arxiv.org/OAI/arXiv/}"

# Here I harvest the cs set; the set name can also be math, physics,
# econ, eess, stat, q-bio, or q-fin.
def harvest(arxiv="cs"):
    data_list = []
    base_url = "http://export.arxiv.org/oai2?verb=ListRecords&"
    # Declare the date range of the harvest.
    url = (base_url + "from=1991-01-01&until=2018-08-31&" +
           "metadataPrefix=arXiv&set=%s" % arxiv)
    while True:
        print("fetching", url)
        try:
            response = urllib.request.urlopen(url)
        except urllib.error.HTTPError as e:
            if e.code == 503:
                # The API throttles clients; wait as instructed and retry.
                to = int(e.hdrs.get("retry-after", 30))
                print("Got 503. Retrying after {0:d} seconds.".format(to))
                time.sleep(to)
                continue
            else:
                raise
        xml = response.read()
        root = ET.fromstring(xml)
        for record in root.find(OAI + "ListRecords").findall(OAI + "record"):
            data = []
            meta = record.find(OAI + "metadata")
            info = meta.find(ARXIV + "arXiv")
            created = info.find(ARXIV + "created").text
            created = datetime.datetime.strptime(created, "%Y-%m-%d")
            categories = info.find(ARXIV + "categories").text
            authors = info.find(ARXIV + "authors")
            authorList = []
            for author in authors.findall(ARXIV + "author"):
                keyname = ""
                forenames = ""
                if author.find(ARXIV + "keyname") is not None:
                    keyname = author.find(ARXIV + "keyname").text
                if author.find(ARXIV + "forenames") is not None:
                    forenames = author.find(ARXIV + "forenames").text
                authorList.append(keyname + " " + forenames)
            # If there is more than one DOI, use the first one; the second
            # (if it exists at all) often refers to an erratum or similar.
            doi = info.find(ARXIV + "doi")
            if doi is not None:
                doi = doi.text.split()[0]
            paper_id = "'" + str(info.find(ARXIV + "id").text) + "'"
            data.append(info.find(ARXIV + "title").text)
            data.append(info.find(ARXIV + "abstract").text.strip())
            data.append(authorList)
            data.append(categories.split())
            data.append(created)
            data.append(paper_id)
            data.append(categories.split()[0])
            data.append(doi)
            data_list.append(data)
        # The API returns the records in chunks of up to 1000 articles.
        # The presence of a resumptionToken tells us there is more to fetch.
        token = root.find(OAI + "ListRecords").find(OAI + "resumptionToken")
        if token is None or token.text is None:
            break
        url = base_url + "resumptionToken=%s" % token.text
    df = pd.DataFrame(data_list,
                      columns=("title", "abstract", "authors", "categories",
                               "created", "id", "main_area", "doi"))
    return df
```
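For example, to harvest the CS metadata and save it to disk (the file name here is just an illustration; harvesting a large set can take hours because the API returns at most 1000 records per request):

```python
# Harvest the metadata of all CS papers and save them as CSV.
df_cs = harvest("cs")
df_cs.to_csv("cs_meta.csv", index=False)
```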
The code above returns a pandas DataFrame; below is an example of its first few rows.
```python
df.head()
```
| | title | abstract | authors | categories | created | id | main_area | doi |
|---|---|---|---|---|---|---|---|---|
| 0 | Sparsity-certifying Graph Decompositions | We describe a new algorithm, the (k,ℓ)-pe... | [Streinu Ileana, Theran Louis] | [math.CO, cs.CG] | 2007-03-30 | '0704.0002' | math.CO | None |
| 1 | A limit relation for entropy and channel capac... | In a quantum mechanical model, Diosi, Feldmann... | [Csiszar I., Hiai F., Petz D.] | [quant-ph, cs.IT, math.IT] | 2007-04-01 | '0704.0046' | quant-ph | 10.1063/1.2779138 |
| 2 | Intelligent location of simultaneously active ... | The intelligent acoustic emission locator is d... | [Kosel T., Grabec I.] | [cs.NE, cs.AI] | 2007-04-01 | '0704.0047' | cs.NE | None |
By changing the `arxiv` parameter of the function above, we can harvest the papers from every category; after removing duplicates, we obtain our final dataset, as sketched below. There are 1,402,997 papers overall; the number of papers per category is shown in the table that follows.
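A minimal sketch of this step, assuming the `harvest` function above (the set names follow the comment in the code; de-duplicating on the `id` column is one way to drop papers that are cross-listed in several sets):

```python
import pandas as pd

# The arXiv sets exposed by the OAI interface.
categories = ["cs", "math", "physics", "econ", "eess", "stat", "q-bio", "q-fin"]

# Harvest every set and concatenate the results into one DataFrame.
frames = [harvest(cat) for cat in categories]
all_papers = pd.concat(frames, ignore_index=True)

# A cross-listed paper is returned once per set it belongs to, so we
# de-duplicate on the arXiv id to obtain the final dataset.
all_papers = all_papers.drop_duplicates(subset="id").reset_index(drop=True)
print(len(all_papers))  # 1,402,997 papers in our harvest
```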
Category | # of Documents |
---|---|
Physics | 894,814 |
Math | 314,003 |
CS | 151,321 |
Statistics (stat) | 19,615 |
Quantitative Biology (q-bio) | 15,285 |
Quantitative Finance (q-fin) | 5,965 |
Electrical Engineering and Systems Science (eess) | 1,696 |
Economics (econ) | 278 |
The data is now available on the s1 server at /home/tian118/arxiv_meta_new; all files are in CSV format.
We also downloaded all the papers in LaTeX format, and our group member Zeeshan has extracted the introduction section of some of the papers; these are stored on the nfs://137.207.234.79 server under Datasets/Arxiv2019/. We then match the introductions to the metadata above through the paper id, which gives us the arXiv Long data (a sketch of this merge follows). There are 1,065,259 documents in the arXiv Long dataset.
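A minimal sketch of that merge, assuming the metadata and the extracted introductions have been loaded as DataFrames (both file names below are hypothetical; only the id-based join reflects how the dataset is built):

```python
import pandas as pd

# Hypothetical file names; the real metadata lives under
# /home/tian118/arxiv_meta_new and the introductions under Datasets/Arxiv2019/.
meta = pd.read_csv("arxiv_meta.csv")
intros = pd.read_csv("arxiv_introductions.csv")  # columns: id, introduction

# The harvester stored ids wrapped in quotes (e.g. '0704.0002'),
# so strip them before joining.
meta["id"] = meta["id"].str.strip("'")

# Inner join on the paper id: only papers whose introduction was
# successfully extracted end up in arXiv Long.
arxiv_long = meta.merge(intros, on="id", how="inner")
```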