How about
string[] result = Regex.Split(value, "\r\n\.,; ");
I can't tell you how many times I've run into problems splitting strings in various languages. Whether it's splitting a string on all whitespace to get only the text, splitting on commas and periods, or breaking a string up by line breaks, I've done it all. Sometimes the language I'm working in has an easy way to do what I need, but often it doesn't. I end up hacking together something that works for my particular situation and being satisfied enough to move on.
This has happened so many times that I finally decided to put together a simple method that does it all. It takes a string, splits it on a set of delimiters the caller provides, stores the pieces in a caller-supplied array, and returns the number of pieces. I haven't yet tested it against existing algorithms to see whether my solution is slower and by how much, but I'll get to that another time. On to the solution:
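#include <cstring>

// Declared up front so splitString can call it before its definition.
bool splitterContains(char toCheck, char* splitters, int numSplitters);

// Splits stringToSplit on any character in splitters and stores each token
// in toModify. Tokens are allocated with new[]; the caller must delete[] them.
int splitString(char* stringToSplit, char* splitters, int numSplitters, char** toModify) {
    int size = 0;
    int curIndex = 0;
    bool parsingWord = false;
    // A token can never be longer than the input, so this is a safe buffer size.
    size_t bufferSize = strlen(stringToSplit) + 1;
    char* curWord = new char[bufferSize];
    for (int i = 0; stringToSplit[i] != '\0'; i++) {
        if (splitterContains(stringToSplit[i], splitters, numSplitters)) {
            if (parsingWord) {
                curWord[curIndex] = '\0'; // terminate the finished token
                toModify[size++] = curWord;
                curWord = new char[bufferSize];
                curIndex = 0;
            }
            parsingWord = false;
        }
        else {
            parsingWord = true;
            curWord[curIndex++] = stringToSplit[i];
        }
    }
    if (parsingWord) {
        curWord[curIndex] = '\0';
        toModify[size++] = curWord;
    }
    else
        delete[] curWord; // the last buffer was never used
    return size;
}

// Returns true if toCheck matches one of the delimiter characters.
bool splitterContains(char toCheck, char* splitters, int numSplitters) {
    for (int i = 0; i < numSplitters; i++)
        if (toCheck == splitters[i])
            return true;
    return false;
}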
The input parameters are as follows:
char* stringToSplit
-Pointer to the null-terminated string to split
char* splitters
-Pointer to the array of delimiter characters
int numSplitters
-The number of entries in splitters
char** toModify
-A pointer to an array of char pointers that receives the pieces. An example of initialization for this parameter would be char* toModify[16]; the array must be large enough to hold every piece.
return value
-Returns the number of pieces stored in toModify after the algorithm has run.
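For readers who would rather not manage the output array by hand, the same scanning idea can be sketched with standard containers. This is only an illustration, not the method from the post: the name splitToVector and its std::string parameters are assumptions made for the example.

```cpp
#include <string>
#include <vector>

// Hypothetical variant of splitString: the same scanning loop, but the pieces
// are collected in a std::vector<std::string>, so nothing is freed by hand
// and the caller does not have to guess an output array size.
std::vector<std::string> splitToVector(const std::string& input,
                                       const std::string& splitters) {
    std::vector<std::string> tokens;
    std::string current;
    for (char c : input) {
        if (splitters.find(c) != std::string::npos) {
            // Delimiter: close the current piece, skipping empty ones.
            if (!current.empty()) {
                tokens.push_back(current);
                current.clear();
            }
        } else {
            current += c;
        }
    }
    // Keep a trailing piece that was not followed by a delimiter.
    if (!current.empty())
        tokens.push_back(current);
    return tokens;
}
```

Calling splitToVector("This,,.....;is,another;test.string", ",.;") should give the same five pieces as the second sample in this post. The trade-off is extra copying into std::string objects, which may matter if raw speed is the goal.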
The method may be a tad dirty for now, but this is my initial solution. I'll do some benchmarking soon and try to make my algorithm as fast as or faster than the current library algorithms.
Following is some sample input/output for the program. Enjoy!
char testString[] = "This  \tis \t my\t\ttest   string"; //Full of tabs and spaces between words
char* result[16];
char splitters[] = {' ', '\t'};
int size = splitString(testString, splitters, 2, result);
for (int i = 0; i < size; i++)
cout << result[i] << endl;
Output:
This
is
my
test
string
char testString[] = "This,,.....;is,another;test.string";
char* result[16];
char splitters[] = {',', '.',';'};
int size = splitString(testString, splitters, 3, result);
for (int i = 0; i < size; i++)
cout << result[i] << endl;
Output:
This
is
another
test
string
Download the source here
The r n and period within the quotes should have a backslash before them (the blog removed the slashes).
Funny you should mention that - I actually did some benchmarking originally, using regular expressions to split on any whitespace in a string. Running on around 350,000 strings, it took around 3 seconds. Using this self-written method, it took roughly half as long.
I know that most if not all languages have support for doing something like this. I just felt like seeing how well an algorithm like this would compare to the actual libraries in use.
By the way, thanks for pointing out the slash issue. That's my own algorithm messing up. I'll get that formatting issue taken care of too so line breaks are actually line breaks.
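For anyone who wants to try a comparison like this themselves, here is a rough sketch of a regex-based whitespace split in C++ along with a small timing helper. It is not the benchmark described above: regexSplit and secondsFor are names made up for this example, and the numbers you get will depend on your compiler, standard library and input.

```cpp
#include <chrono>
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: splits input wherever delims matches. Passing -1 as
// the submatch index asks the iterator for the text between matches.
std::vector<std::string> regexSplit(const std::string& input, const std::regex& delims) {
    std::vector<std::string> tokens;
    std::sregex_token_iterator it(input.begin(), input.end(), delims, -1), end;
    for (; it != end; ++it)
        if (it->length() > 0)   // a leading delimiter produces one empty piece
            tokens.push_back(it->str());
    return tokens;
}

// Hypothetical helper: returns the seconds spent calling split(input)
// rounds times, measured with a monotonic clock.
template <typename Fn>
double secondsFor(Fn split, const std::string& input, int rounds) {
    volatile std::size_t sink = 0;  // keeps the compiler from discarding the work
    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < rounds; i++)
        sink = sink + split(input).size();
    std::chrono::duration<double> elapsed = std::chrono::steady_clock::now() - start;
    return elapsed.count();
}
```

With a std::regex built from "\\s+", a call such as secondsFor([&](const std::string& s) { return regexSplit(s, ws); }, text, 350000) gives one side of the comparison. Wrapping the hand-written splitter in a similar lambda gives the other.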