Long delay in classification with naive Bayes in R

computeras13 · 1 April 2014

I have the following code written in R, where I essentially want to classify some reviews using naive Bayes.

To be honest, the R tutorials confused me more than they helped, so I thought I would share my problem in case anyone has an idea of what I am doing wrong and why I get this strange behaviour.

My problem is that when the "predict" command runs, it never finishes (at least not in a reasonable amount of time)...

I am also including a second snippet that I ran. It does not execute the two lines that follow the comment "# Append class vector in the last column of data frame (This might take a bit)"; instead of what the main script does from that point on, it runs what is shown in the second piece of code. In that case it works (though still somewhat slowly), but my results are not that good. So I ask: is there a mistake in the logic with which I handle the data in R?

# Suppose that the txt_sentoken directory exists in the same directory
# as this script. Example:
# - Current Directory
# | - classifier.R
# | - txt_sentoken
#   | - pos
#   | | - cv980_10953.txt
#   | | - ...
#   | - neg
#     | - cv490_18986.txt
#     | - ...

# Import the required libraries
library("tm")
library("SnowballC")
library("e1071")

# Function to perform preprocessing on the data
preProcess <- function(corp) {
  x <- corp
  x <- tm_map(x, content_transformer(tolower))  # tm >= 0.6 needs content_transformer for base functions
  x <- tm_map(x, removePunctuation)
  x <- tm_map(x, removeNumbers)
  x <- tm_map(x, removeWords, stopwords("english"))
  x <- tm_map(x, stemDocument)
  x <- tm_map(x, stripWhitespace)
  return(x)
}

# Create the data directory paths
print("Reading data from filesystem...")
pos_dir <- paste(getwd(), "txt_sentoken/pos", sep="/")
neg_dir <- paste(getwd(), "txt_sentoken/neg", sep="/")

# Read data from their directories
pos <- Corpus(DirSource(pos_dir), readerControl=list(language="english"))
neg <- Corpus(DirSource(neg_dir), readerControl=list(language="english"))

# Create training and testing corpora
print("Creating training and testing corpuses...")
split.percentage      <- 0.75
split.pos.size        <- length(pos)
split.neg.size        <- length(neg)
split.pos.train.size  <- floor(split.pos.size * split.percentage)
split.neg.train.size  <- floor(split.neg.size * split.percentage)
split.pos.test.size   <- split.pos.size - split.pos.train.size
split.neg.test.size   <- split.neg.size - split.neg.train.size
corpus.train          <- c(pos[1:split.pos.train.size], neg[1:split.neg.train.size])
corpus.test           <- c(pos[(split.pos.train.size + 1) : split.pos.size], neg[(split.neg.train.size + 1) : split.neg.size])

# Perform the preprocessing
print("Pre-processing corpuses...")
corpus.train <- preProcess(corpus.train)
corpus.test  <- preProcess(corpus.test)

# Create the Document Term Matrix
print("Creating document term matrices...")
corpus.train.dtm <- DocumentTermMatrix(corpus.train, control=list(wordLengths=c(2, Inf)))
corpus.test.dtm  <- DocumentTermMatrix(corpus.test, control=list(wordLengths=c(2, Inf)))

# Convert the document term matrices to dense matrices
print("Creating data matrices...")
corpus.train.df <- as.matrix(corpus.train.dtm)
corpus.test.df  <- as.matrix(corpus.test.dtm)

# Generate vector with class values
print("Creating and appending class information...")
class.train <- c(rep("pos", split.pos.train.size), rep("neg", split.neg.train.size))
class.test  <- c(rep("pos", split.pos.test.size), rep("neg", split.neg.test.size))

# Append class vector in the last column of data frame (This might take a bit)
corpus.train.df <- cbind(corpus.train.df, class.train)
corpus.test.df  <- cbind(corpus.test.df, class.test)

# Train classifier
print("Training classifier...")
classifier <- naiveBayes(corpus.train.df[, 1 : (ncol(corpus.train.df) - 1)], corpus.train.df[, ncol(corpus.train.df)])

# Evaluate Classifier
print("Evaluating...")
corpus.predictions <- predict(classifier, corpus.test.df[, (-1 * ncol(corpus.test.df))])
corpus.results     <- table(corpus.predictions, corpus.test.df[,ncol(corpus.test.df)])
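One thing worth checking in the script above: `cbind()` on a numeric matrix and a character vector returns a character matrix, so every term count is turned into a string. As far as I can tell, that would make `naiveBayes` treat each term column as categorical rather than numeric, which may be part of why `predict` is so slow. The sketch below uses a toy matrix with made-up values (not the review data) just to show the coercion.

```r
# Toy document-term counts (hypothetical values, not the real review data)
counts <- matrix(c(2, 0, 1,
                   0, 3, 1),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(c("doc1", "doc2"), c("good", "bad", "film")))
labels <- c("pos", "neg")

# Appending a character vector coerces the whole matrix to character
combined <- cbind(counts, labels)
print(typeof(counts))    # "double"
print(typeof(combined))  # "character"

# Keeping the labels in a separate factor leaves the counts numeric
label.factor <- factor(labels)
print(is.numeric(counts))
```

Passing the numeric matrix and the label factor as two separate arguments, as the second snippet does, avoids this coercion.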

# Second snippet: run instead of the two cbind lines and everything after them
classifier <- naiveBayes(corpus.train.df, as.factor(class.train))
corpus.predictions <- predict(classifier, corpus.test.df)
table(corpus.predictions, class.test)
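Both snippets build the training and test matrices independently, so their columns (terms) need not match, and that could contribute to the weak results. Below is a minimal base-R sketch of re-expressing test counts over the training vocabulary. The helper `align_to_vocab` and the toy matrices are hypothetical, introduced only for illustration; in tm itself, I believe the `dictionary` entry of the `control` list of `DocumentTermMatrix` serves the same purpose.

```r
# Toy matrices with different vocabularies (hypothetical values)
train <- matrix(c(1, 0, 2,
                  0, 1, 1),
                nrow = 2, byrow = TRUE,
                dimnames = list(NULL, c("bad", "film", "good")))
test <- matrix(c(1, 4,
                 0, 2),
               nrow = 2, byrow = TRUE,
               dimnames = list(NULL, c("good", "plot")))

# Re-express a count matrix over a given vocabulary:
# shared terms are copied, vocabulary terms missing from m become 0,
# and terms of m outside the vocabulary ("plot" here) are dropped.
align_to_vocab <- function(m, vocab) {
  out <- matrix(0, nrow = nrow(m), ncol = length(vocab),
                dimnames = list(rownames(m), vocab))
  shared <- intersect(colnames(m), vocab)
  out[, shared] <- m[, shared, drop = FALSE]
  out
}

test.aligned <- align_to_vocab(test, colnames(train))
print(test.aligned)
```

After alignment, the test matrix has exactly the columns the classifier was trained on, in the same order.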
