Overview


In my thesis, I use 7 different datasets.

| Dataset | # of Documents | Vocabulary | Average Length | # of Classes | Description | Download |
|---|---|---|---|---|---|---|
| AG's News | 127,000 | 63,008 | 26 | 4 | The original source is here. The version I use is from the paper "Character-level Convolutional Networks for Text Classification" [1]. | Google Drive |
| Yelp Review Polarity | 598,000 | 221,333 | 68 | 2 | The original source is here. The version I use is from the paper "Character-level Convolutional Networks for Text Classification" [1]. | Google Drive |
| IMDB | 100,000 | 137,570 | 119 | 2 | Movie reviews from IMDB, constructed by Stanford. Includes 50,000 unlabeled and 50,000 labeled documents. | Stanford |
| Full Movie Review | 2,000 | 38,737 | 351 | 2 | Cited in the paper "Baselines and Bigrams: Simple, Good Sentiment and Topic Classification" [2]. | Cornell |
| 20 News Group | 1,985 | 24,930 | 140 | 2 | Includes news from 20 different groups; in my thesis I use only 2 of them. | Download |
| arXiv | 240,000 | 161,068 | 98 | 2 | Includes the title and abstract of papers from the CS and non-CS categories of arXiv. | see details below |
| arXiv Long | 200,000 | 1,599,799 | 748 | 2 | Includes the title, abstract, and introduction of papers from the CS and non-CS categories of arXiv. | see details below |

Details of arXiv Datasets


We construct the arXiv datasets ourselves. The official arXiv website provides access to both the full-text data (in PDF or LaTeX format) and the metadata.

Metadata

For the metadata, we can use the provided OAI-PMH API to retrieve the title, abstract, and authors of each paper. The following code harvests the arXiv metadata.

import datetime
import time
import urllib.error
import urllib.request
import xml.etree.ElementTree as ET

import pandas as pd

# XML namespaces used by the OAI-PMH envelope and the arXiv metadata format
OAI = "{http://www.openarchives.org/OAI/2.0/}"
ARXIV = "{http://arxiv.org/OAI/arXiv/}"

# here we harvest the cs set; the set can also be math, physics, econ,
# eess, stat, q-bio, or q-fin
def harvest(arxiv="cs"):
    
    
    data_list = []
    
    base_url = "http://export.arxiv.org/oai2?verb=ListRecords&"
    
    # declare the date range of the harvest
    url = (base_url +
           "from=1991-01-01&until=2018-08-31&" +
           "metadataPrefix=arXiv&set=%s"%arxiv)
    
    
    while True:
        print ("fetching", url)
        try:
            response = urllib.request.urlopen(url)
            
        except urllib.error.HTTPError as e:
            if e.code == 503:
                to = int(e.hdrs.get("retry-after", 30))
                print ("Got 503. Retrying after {0:d} seconds.".format(to))

                time.sleep(to)
                continue
                
            else:
                raise
            
        xml = response.read()

        root = ET.fromstring(xml)
        

        for record in root.find(OAI+'ListRecords').findall(OAI+"record"):
            
            data = []
            
            arxiv_id = record.find(OAI+'header').find(OAI+'identifier')
            meta = record.find(OAI+'metadata')
            info = meta.find(ARXIV+"arXiv")
            created = info.find(ARXIV+"created").text
            created = datetime.datetime.strptime(created, "%Y-%m-%d")
            categories = info.find(ARXIV+"categories").text
            authors = info.find(ARXIV + "authors")
            authorList = []
            
            for author in authors.findall(ARXIV + "author"):
                keyname = ""
                forenames = ""

                if author.find(ARXIV + "keyname") is not None:
                    keyname = author.find(ARXIV + "keyname").text
                if author.find(ARXIV + "forenames") is not None:
                    forenames = author.find(ARXIV + "forenames").text

                # surname first, then forenames, matching the example output below
                full_name = keyname + " " + forenames

                authorList.append(full_name)
                

            # if there is more than one DOI, use the first one;
            # the second one (if it exists at all) often refers
            # to an erratum or similar
            doi = info.find(ARXIV+"doi")
            if doi is not None:
                doi = doi.text.split()[0]
                
            
            # wrap the arXiv id in quotes; this is why the ids appear quoted in the example below
            paper_id = "'" + str(info.find(ARXIV + "id").text) + "'"
                                
            data.append(info.find(ARXIV+"title").text)
            data.append(info.find(ARXIV+"abstract").text.strip())
            data.append(authorList)
            data.append(categories.split())
            data.append(created)
            data.append(paper_id)
            data.append(categories.split()[0])
            data.append(doi)
                        
            data_list.append(data)

        # The list of articles returned by the API comes in chunks of
        # 1000 articles. The presence of a resumptionToken tells us that
        # there is more to be fetched.
        token = root.find(OAI+'ListRecords').find(OAI+"resumptionToken")
        if token is None or token.text is None:
            break

        else:
            url = base_url + "resumptionToken=%s"%(token.text)
    
    
    df = pd.DataFrame(data_list, columns=("title", "abstract", "authors", "categories", "created", "id", "main_area", "doi"))        
    return df
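
For example, a typical call harvests one set and keeps a local copy of the result (the file name here is only illustrative):

df = harvest(arxiv="cs")
df.to_csv("cs_meta.csv", index=False)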
  

The code above returns a pandas DataFrame; an example of its contents is shown below.

  df.head()
  
| | title | abstract | authors | categories | created | id | main_area | doi |
|---|---|---|---|---|---|---|---|---|
| 0 | Sparsity-certifying Graph Decompositions | We describe a new algorithm, the (k,ℓ)-pe... | [Streinu Ileana, Theran Louis] | [math.CO, cs.CG] | 2007-03-30 | '0704.0002' | math.CO | None |
| 1 | A limit relation for entropy and channel capac... | In a quantum mechanical model, Diosi, Feldmann... | [Csiszar I., Hiai F., Petz D.] | [quant-ph, cs.IT, math.IT] | 2007-04-01 | '0704.0046' | quant-ph | 10.1063/1.2779138 |
| 2 | Intelligent location of simultaneously active ... | The intelligent acoustic emission locator is d... | [Kosel T., Grabec I.] | [cs.NE, cs.AI] | 2007-04-01 | '0704.0047' | cs.NE | None |

By changing the parameter of the function above, we can harvest the papers from every category; after removing duplicates we obtain our final dataset. There are 1,402,997 papers overall, and the number of papers per category is listed in the table further below.
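
A minimal sketch of this harvest-and-deduplicate step (assuming one call to harvest per set named in the code comment above; the exact bookkeeping in the thesis may differ):

# harvest every top-level arXiv set and merge the results
sets = ["cs", "math", "physics", "econ", "eess", "stat", "q-bio", "q-fin"]
frames = [harvest(arxiv=s) for s in sets]
all_papers = pd.concat(frames, ignore_index=True)

# cross-listed papers are returned once per set, so drop duplicates by arXiv id
all_papers = all_papers.drop_duplicates(subset="id").reset_index(drop=True)
print(len(all_papers))  # 1,402,997 papers overall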

Category    # of documents
Physics 894,814
hep-ph 102,464
astro-ph 94,228
hep-th 81,691
quant-ph 65,303
gr-qc 42,356
cond-mat.mes-hall 41,993
cond-mat.mtrl-sci 35,524
cond-mat.str-el 33,005
cond-mat.stat-mech 30,511
astro-ph.CO 27,357
astro-ph.SR 27,019
nucl-th 26,084
cond-mat.supr-con 23,770
astro-ph.GA 22,764
astro-ph.HE 20,785
cond-mat.soft 17,846
hep-ex 17,762
physics.optics 14,976
hep-lat 14,575
cond-mat 11,356
astro-ph.EP 10,842
astro-ph.IM 9,466
nucl-ex 8,764
cond-mat.dis-nn 8,735
cond-mat.quant-gas 8,556
physics.flu-dyn 8,307
physics.atom-ph 8,147
physics.ins-det 7,935
physics.gen-ph 7,115
physics.soc-ph 6,802
physics.plasm-ph 6,167
cond-mat.other 6,155
physics.chem-ph 5,333
physics.acc-ph 4,019
physics.bio-ph 3,881
physics.comp-ph 3,481
nlin.CD 3,427
physics.class-ph 3,176
nlin.SI 2,634
physics.data-an 2,362
nlin.PS 1,957
physics.hist-ph 1,911
physics.geo-ph 1,833
physics.ed-ph 1,717
physics.med-ph 1,577
physics.ao-ph 1,575
physics.app-ph 1,397
nlin.AO 1,248
physics.space-ph 1,113
physics.atm-clus 915
physics.pop-ph 847
chao-dyn 578
solv-int 360
nlin.CG 244
adap-org 184
mtrl-th 165
chem-ph 129
patt-sol 114
supr-con 69
atom-ph 68
acc-phys 46
comp-gas 42
plasm-ph 28
ao-sci 13
bayes-an 11
Math 314,003
math.AP 24,562
math.CO 24,243
math-ph 23,635
math.PR 23,480
math.AG 22,887
math.DG 18,756
math.NT 18,474
math.DS 13,006
math.OC 11,854
math.FA 11,425
math.NA 11,299
math.GT 10,173
math.CA 9,717
math.RT 9,429
math.GR 8,520
math.ST 8,242
math.QA 6,816
math.RA 6,587
math.CV 6,543
math.OA 5,631
math.LO 5,500
math.AT 5,367
math.AC 5,186
math.MG 3,697
math.SG 3,165
math.SP 3,084
math.CT 2,108
math.GM 2,076
math.GN 1,897
math.KT 1,718
math.HO 1,658
alg-geom 1,209
q-alg 1,177
dg-ga 562
funct-an 320
CS 151,321
cs.IT 20,499
cs.CV 16,723
cs.LG 10,054
cs.AI 8,222
cs.NI 7,806
cs.DS 7,453
cs.CL 7,023
cs.CR 6,118
cs.LO 5,565
cs.DC 5,234
cs.SY 4,076
cs.SI 4,050
cs.SE 3,946
cs.GT 3,194
cs.CY 3,178
cs.CC 3,058
cs.RO 3,020
cs.DM 2,933
cs.DB 2,772
cs.IR 2,550
cs.NE 2,462
cs.PL 2,306
cs.CG 2,236
cs.HC 1,876
cs.DL 1,602
cs.OH 1,600
cs.FL 1,467
cs.CE 1,426
cs.SD 961
cs.NA 955
cs.MM 921
cmp-lg 894
cs.AR 816
cs.MA 767
cs.SC 758
cs.ET 737
cs.GR 665
cs.MS 554
cs.PF 534
cs.OS 242
cs.GL 68
Statistics(stat) 19,615
stat.ME 7,001
stat.ML 6,561
stat.AP 3,926
stat.CO 1,844
stat.OT 283
Quantitative Biology(q-bio) 15,285
q-bio.PE 3,916
q-bio.NC 2,833
q-bio.QM 2,212
q-bio.BM 1,624
q-bio.MN 1,526
q-bio.GN 1,064
q-bio.CB 615
q-bio.TO 575
q-bio.SC 491
q-bio.OT 429
Quantitative Finance(q-fin) 5,965
q-fin.ST 960
q-fin.GN 948
q-fin.PR 870
q-fin.RM 620
q-fin.PM 577
q-fin.MF 569
q-fin.CP 527
q-fin.TR 509
q-fin.EC 385
Electrical Engineering and Systems Science(eess) 1,696
eess.SP 1,278
eess.IV 246
eess.AS 172
Economics(econ) 278
econ.EM 232
econ.GN 35
econ.TH 11

The data is now available on the s1 server at /home/tian118/arxiv_meta_new. All files are in CSV format.


arXiv Long

We also download all the papers in LaTeX format, and our group member Zeeshan has extracted the introduction section of some of the papers; these are stored on the nfs://137.207.234.79 server under Datasets/Arxiv2019/.
We then match the introductions with the metadata above through the paper id to obtain our arXiv Long data. There are 1,065,259 documents in the arXiv Long dataset.
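
The matching itself can be sketched as follows, assuming the extracted introductions are available as a CSV with an id column and an introduction column (file and column names here are illustrative, not the actual layout on the server):

import pandas as pd

# load the harvested metadata and the extracted introductions (file names illustrative)
meta = pd.read_csv("arxiv_meta.csv")
intros = pd.read_csv("arxiv_introductions.csv")

# the harvested ids were stored with surrounding quotes, so strip them before joining
meta["id"] = meta["id"].astype(str).str.strip("'")
intros["id"] = intros["id"].astype(str)

# an inner join on the paper id keeps only the papers that have an extracted introduction
arxiv_long = meta.merge(intros, on="id", how="inner")

# each arXiv Long document consists of the title, abstract, and introduction
arxiv_long["text"] = arxiv_long["title"] + " " + arxiv_long["abstract"] + " " + arxiv_long["introduction"]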