|
Post by sdlbasic on Sept 29, 2016 9:39:07 GMT -6
where should i put those tokens like rcbasic? should they be in an enum, class, or array?
enum { number, text, identifier, add, sub, mult, div, mod }
|
|
|
Post by n00b on Sept 29, 2016 14:24:00 GMT -6
Is this going to be in c++ or c? Do you have a template that you want me to follow, or should I just base it on the uploaded code I had for the calculator?

I just need to be able to feed an expression to a tokenizer function and have it fill a vector with the tokens. This is kind of a template to start with:

vector<string> tokens;

bool rc_tokenize(string src_line)
{
    //read a line of code and break it down into tokens and store the tokens in the tokens vector
    //return true if successful and false on failure
}

I want to avoid using static arrays as much as possible to eliminate the chance of getting the memory leaks that I was getting with previous versions. For this reason it will be C++. I know you could implement something like this in C, but the fact that C++ already has these features in the standard library means we don't have to re-invent the wheel. You don't have to follow the template I outlined, and if you have something better than using a vector for tokens, feel free to use it. It's just really important that the same tokenizer can be used for the interpreter and the compiler; that way we can guarantee 100% compatibility across both.

Also, I realize I forgot to put a token in the previous list for stuff like commas. Those obviously need to be tokenized too. Something as simple as <comma> would work. For any other characters that need to be recognized, just make up a token and let me know what it is so I can document it for when I am parsing a rule in the compiler that accepts that particular token.

Also, while not necessary right now, the tokenizer will eventually need to be able to tell the difference between recognized keywords and stuff it would consider identifiers for variables and functions. Each keyword should have its own token. It's something you should have a plan to implement later even if you don't implement it right now. For now I just want to be able to parse a complex mathematical expression to test parsing rules. For this, the tokenizer needs to recognize every character that could be used in the standard arithmetic order of operations. That includes the "^" character for exponents.
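As a rough illustration of that template, here is a minimal sketch of what rc_tokenize could look like for plain arithmetic. This is purely illustrative, not RCBasic's actual tokenizer; it only handles integers and a few single-character operators.

```cpp
#include <cctype>
#include <string>
#include <vector>

std::vector<std::string> tokens; // filled by rc_tokenize, shared by interpreter and compiler

// Read a line of code, break it into tokens, and append them to the global
// tokens vector. Returns true on success, false on an unrecognized character.
bool rc_tokenize(const std::string &src_line) {
    std::string::size_type x = 0;
    while (x < src_line.length()) {
        char c = src_line[x];
        if (std::isspace(static_cast<unsigned char>(c))) {
            ++x;                                  // skip whitespace
        } else if (std::isdigit(static_cast<unsigned char>(c))) {
            std::string num = "<num>";            // collect a whole run of digits
            while (x < src_line.length() &&
                   std::isdigit(static_cast<unsigned char>(src_line[x])))
                num.push_back(src_line[x++]);
            tokens.push_back(num);
        }
        else if (c == '+') { tokens.push_back("<add>");   ++x; }
        else if (c == '-') { tokens.push_back("<sub>");   ++x; }
        else if (c == '*') { tokens.push_back("<mul>");   ++x; }
        else if (c == '/') { tokens.push_back("<div>");   ++x; }
        else if (c == '^') { tokens.push_back("<pow>");   ++x; }
        else if (c == ',') { tokens.push_back("<comma>"); ++x; }
        else return false;                        // unrecognized character
    }
    return true;
}
```

Feeding it "2 + 34" would leave the vector holding <num>2, <add>, <num>34.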
|
|
|
Post by n00b on Sept 29, 2016 14:54:01 GMT -6
where should i put those tokens like rcbasic? should they be in an enum, class, or array? enum { number, text, identifier, add, sub, mult, div, mod }

You can look at my previous post for where to store the tokens. The tokenizer library will also need a function to remove a token from a certain position within the array, and a function that can place a token in a certain position in the array. For example, let's say we had this expression:

2*(4+5)

In this expression, we would generate these tokens:

<num>2 | <mul> | <par> | <num>4 | <add> | <num>5 | </par>

Since we want to do the math that is in parentheses first, for the compiler I would probably write this to the vm assembler file:

mov m0 4    --- store 4 in m0
add m0 5    --- add 5 to m0 and store the result in m0

So now m0 has the result of the math in the parentheses. To finish parsing the rest of the expression I would need to replace all the tokens for the math I just did with m0. So my tokens vector should look like this:

<num>2 | <mul> | m0

If you could just make a function that could replace a range of tokens with a new token, that would be extremely helpful as well.

Right now I am working on new opcodes for the parser. The actual text opcodes that get passed to the vm assembler probably won't have to change, but I will need different binary variations of the opcodes for the different types of arguments they can accept. I am also thinking about implementing user defined types for 3.0. Do you think that would be a good feature to add, or would it just overcomplicate the language? In the end I just want it to be as accessible to new programmers as possible.
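A range-replacement helper like the one requested here could be sketched with vector::erase and vector::insert. The name replace_tokens and its exact signature are made up for illustration, not part of the agreed interface.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Replace the tokens in positions [first, last] (inclusive) with a single
// new token, e.g. collapsing "<par> <num>4 <add> <num>5 </par>" into "m0".
// Returns false without touching the vector if the range is invalid.
bool replace_tokens(std::vector<std::string> &toks,
                    std::size_t first, std::size_t last,
                    const std::string &replacement) {
    if (first > last || last >= toks.size())
        return false;
    toks.erase(toks.begin() + first, toks.begin() + last + 1); // drop the old range
    toks.insert(toks.begin() + first, replacement);            // put the new token in its place
    return true;
}
```

With the 2*(4+5) example, replacing positions 2 through 6 with m0 leaves the vector holding <num>2 | <mul> | m0.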
|
|
|
Post by sdlbasic on Sept 30, 2016 4:48:16 GMT -6
Well, I put something quick together to see where to start or where to go. I was a bit lost here, so I decided to learn a bit of c++ (way different from c) and put this piece of code together. Take a look and see where we can go from here.
#include <iostream>
#include <string>
#include <cctype>

void tokens(const std::string &data);

int main(void)
{
    std::string line;

    while (std::getline(std::cin, line)) {
        tokens(line);
    }
    return 0;
}

void tokens(const std::string &data)
{
    std::string::size_type x = 0;
    while (x < data.length()) {
        char c = data[x];
        if (std::isspace(c)) {
            x++;
        } else if (std::isalnum(c)) {
            std::cout << "<number>" << c << "\n";
            x++;
        } else {  // anything that is neither alphanumeric nor whitespace
            std::cout << "<operator>" << c << "\n";
            x++;
        }
    }
}
|
|
|
Post by sdlbasic on Sept 30, 2016 4:52:01 GMT -6
In this piece of code I accept a string from stdin and then parse it. What I did was compare the length from x = 0 over all the data; I used string::size_type for x so the compiler won't scream that we are comparing ints with unsigned ints. After that I started to fetch a char at a time, and the rest you can see.
|
|
|
Post by sdlbasic on Sept 30, 2016 9:20:03 GMT -6
I was looking through some code again and I think that we could change this function:
int rc_isSpecialCharacter(std::string sline)
{
    if (sline.compare(" ") == 0)       { return 1; }
    else if (sline.compare("\t") == 0) { return 1; }
    else if (sline.compare("\n") == 0) { return 1; }
    else if (sline.compare("\r") == 0) { return 1; }
    else if (sline.compare("\b") == 0) { return 1; }
    else if (sline.compare("`") == 0)  { return 1; }
    else if (sline.compare("~") == 0)  { return 1; }
    else if (sline.compare("!") == 0)  { return 1; }
    else if (sline.compare("@") == 0)  { return 1; }
    else if (sline.compare("#") == 0)  { return 1; }
    else if (sline.compare("%") == 0)  { return 1; }
    else if (sline.compare("^") == 0)  { return 1; }
    else if (sline.compare("&") == 0)  { return 1; }
    else if (sline.compare("*") == 0)  { return 1; }
    else if (sline.compare("(") == 0)  { return 1; }
    else if (sline.compare(")") == 0)  { return 1; }
    else if (sline.compare("-") == 0)  { return 1; }
    else if (sline.compare("=") == 0)  { return 1; }
    else if (sline.compare("+") == 0)  { return 1; }
    else if (sline.compare(",") == 0)  { return 1; }
    else if (sline.compare(".") == 0)  { return 1; }
    else if (sline.compare("/") == 0)  { return 1; }
    else if (sline.compare("<") == 0)  { return 1; }
    else if (sline.compare(">") == 0)  { return 1; }
    else if (sline.compare("?") == 0)  { return 1; }
    else if (sline.compare(";") == 0)  { return 1; }
    else if (sline.compare(":") == 0)  { return 1; }
    else if (sline.compare("'") == 0)  { return 1; }
    else if (sline.compare("\"") == 0) { return 1; }
    else if (sline.compare("[") == 0)  { return 1; }
    else if (sline.compare("]") == 0)  { return 1; }
    else if (sline.compare("\\") == 0) { return 1; }
    else if (sline.compare("{") == 0)  { return 1; }
    else if (sline.compare("}") == 0)  { return 1; }
    else if (sline.compare("|") == 0)  { return 1; }
    else                               { return 0; }
}
with something like:
int rc_isSpecialCharacter(char token)
{
    const char specialChars[] = {'!', '[', ']'};
    int size_of_keys = (sizeof(specialChars) / sizeof(specialChars[0]));
    int i = 0;

    while (i < size_of_keys) {
        if (token == specialChars[i])
            return 1;
        i++;
    }
    return 0;
}
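Another option along the same lines, assuming the whole set is kept as one string literal, is to lean on strchr from <cstring>. This is just a sketch of the idea, not something anyone has committed to; the character set below mirrors the long if-else version above.

```cpp
#include <cstring>

// Returns 1 if token appears in the special-character set, 0 otherwise.
// The '\0' guard matters: strchr would otherwise match the terminator itself.
int rc_isSpecialCharacter(char token) {
    static const char specialChars[] = " \t\n\r\b`~!@#%^&*()-=+,./<>?;:'\"[]\\{}|";
    return token != '\0' && std::strchr(specialChars, token) != nullptr;
}
```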
|
|
|
Post by n00b on Sept 30, 2016 22:13:17 GMT -6
For your token function, the output still needs to go to a global vector that can be parsed by the interpreter or compiler. Also, why don't you just use namespace std? It would save you a lot of typing. I haven't had time to work on the opcodes in the past few days because I was on the overnight shift at my job for the past 5 days, but I am back on dayshift, so I should be able to work on it some tomorrow. As soon as I finish the doc I will post it. I want to finish the doc before I start on the parser so that I have a good reference to work with. The tokenizer seems to be going pretty well so far. As soon as you are able to break a line down into all the tokens I listed, I will be able to test it out against the parser. Thanks again for the help.
|
|
|
Post by sdlbasic on Oct 3, 2016 7:02:16 GMT -6
I made a few changes but I still have some issues. Is this what you are looking for?
/*
<num> number
<string> text
<id> identifier
<par> </par>
<curly> </curly>
<square> </square>
<add> <sub> <mul> <div> <mod>
<equal> <greater_equal> <less_equal> <not_equal> <greater> <less>
<and> <or> <xor> <not>
*/
#include <iostream>
#include <string>
#include <cctype>
#include <vector>

using namespace std;

vector<string> token;
void tokens(const std::string &data);

int main(void)
{
    string line;

    while (getline(cin, line)) {
        tokens(line);
    }
    return 0;
}
void tokens(const std::string &data)
{
    string::size_type x = 0;

    while (x < data.length()) {
        //re-read data[x] each pass; the old while(ch == ' ') x++; never updated ch and looped forever
        while (x < data.length() && data[x] == ' ')
            x++;
        if (x >= data.length())
            break;
        char ch = data[x];

        switch (ch) {
            case '+': x++; token.push_back("<add>"); break;
            case '-': x++; token.push_back("<sub>"); break;
            case '*': x++; token.push_back("<mul>"); break;
            case '/': x++; token.push_back("<div>"); break;
            case '%': x++; token.push_back("<mod>"); break; //this break was missing
            default:  x++; break;
        }
    }
}
Is this what you need to go to the vector? If not please post some code on how to push values into the vector.
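For reference, appending to a std::vector is just push_back; a tiny self-contained example (demo_push is a made-up name for illustration only):

```cpp
#include <string>
#include <vector>

// push_back appends one element to the end of a vector, growing it as needed.
std::vector<std::string> demo_push() {
    std::vector<std::string> token;
    token.push_back("<num>2");  // token[0]
    token.push_back("<add>");   // token[1]
    token.push_back("<num>3");  // token[2]
    return token;
}
```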
|
|
|
Post by n00b on Oct 3, 2016 22:47:20 GMT -6
This is what I need. However, the MOD keyword should be used for the modulus operator. Even though this function is still not complete, it seems to be more efficient and cleaner code than the mess I came up with.
|
|
|
Post by sdlbasic on Oct 4, 2016 8:49:30 GMT -6
Well, I made some improvements; see the code:
#include <iostream>
#include <string>
#include <cctype>
#include <vector>

using namespace std;

vector<string> token;
void tokens(const std::string &data);
int inc(string::size_type &, int);
int iswhite(int);

int main(void)
{
    string line;

    while (getline(cin, line)) {
        tokens(line);
    }
    return 0;
}
void tokens(const std::string &data)
{
    string::size_type x = 0;
    char temp;

    while (x < data.length()) {
        //re-read data[x] each pass; while(iswhite(ch)) inc(x, 1); never updated ch and looped forever
        while (x < data.length() && iswhite(data[x]))
            inc(x, 1);
        if (x >= data.length())
            break;
        char ch = data[x];

        switch (ch) {
            case '+': inc(x, 1); token.push_back("<add>"); cout << "<add>" << endl; break;
            case '-': inc(x, 1); token.push_back("<sub>"); cout << "<sub>" << endl; break;
            case '*': inc(x, 1); token.push_back("<mul>"); cout << "<mul>" << endl; break;
            case '/': inc(x, 1); token.push_back("<div>"); break;
            case '%': inc(x, 1); token.push_back("<mod>"); break; //break was missing, so '%' fell through to '^'
            case '^': inc(x, 1); token.push_back("<pow>"); break;
            case '(': inc(x, 1); token.push_back("<par>"); cout << "<par>" << endl; break;
            case ')': inc(x, 1); token.push_back("</par>"); cout << "</par>" << endl; break;
            case '.': inc(x, 1); token.push_back("<dot>"); break;
            case '=': inc(x, 1); token.push_back("<equal>"); cout << "<equal>" << endl; break;
            case '>':
                temp = data[x + 1];
                if (temp == '=') {
                    inc(x, 2);
                    token.push_back("<greater_equal>");
                    cout << "<greater_equal>" << endl;
                } else {
                    inc(x, 1);
                    token.push_back("<greater>");
                    cout << "<greater>" << endl;
                }
                break;
            case '<':
                temp = data[x + 1];
                if (temp == '=') {
                    inc(x, 2);
                    token.push_back("<less_equal>");
                    cout << "<less_equal>" << endl;
                } else if (temp == '>') {
                    inc(x, 2);
                    token.push_back("<not_equal>");
                    cout << "<not_equal>" << endl;
                } else {
                    inc(x, 1);
                    token.push_back("<less>");
                    cout << "<less>" << endl;
                }
                break;
            case '{': inc(x, 1); token.push_back("<curly>"); break;
            case '}': inc(x, 1); token.push_back("</curly>"); break;
            case '[': inc(x, 1); token.push_back("<square>"); break;
            case ']': inc(x, 1); token.push_back("</square>"); break;
            case '"':
                temp = data[inc(x, 1)];
                while (x < data.length() && temp != '\"' && temp != '\r') { //bounds check added for unterminated strings
                    temp = data[inc(x, 1)];
                }
                token.push_back("<string>");
                cout << "<string>" << endl;
                inc(x, 1);
                break;
            default:
                if (isdigit(ch)) {
                    do {
                        temp = data[inc(x, 1)];
                    } while (isdigit(temp) || temp == '.');
                    token.push_back("<num>");
                    cout << "<num>" << endl;
                } else {
                    inc(x, 1); //skip unrecognized characters so the loop can't get stuck
                }
                break;
        }
    }
}
int inc(string::size_type &x, int by)
{
    return x = x + by;
}

int iswhite(int c)
{
    return (c == ' ' || c == '\t');
}
I made some improvements since the last code, but this is not going too fast because I work and pick this up in my spare time. I also need to know how you want me to implement mod='%', not='!', and='&&', etc. Are %, && and ! ok?
Thanks.
|
|
|
Post by n00b on Oct 5, 2016 2:52:43 GMT -6
The MOD keyword will act as the mod operator. Same thing for AND, NOT, OR, and XOR. Also, I understand that you can't devote much time, as I do this in my spare time as well. I am not even a professional programmer, and all I know about programming I taught myself. Projects like this are something I do for fun and to learn. I appreciate any help you can give.
|
|
|
Post by sdlbasic on Oct 10, 2016 9:28:15 GMT -6
I made some minor changes; it has almost every token now. I added some helper functions but then changed them for the built-in ones; despite that, I left them here for future reference, as they can be useful.
#include <iostream>
#include <string>
#include <cctype>
#include <vector>

using namespace std;

vector<string> token;
void tokens(const std::string &data);
int inc(string::size_type &, int);
int iswhite(int);
bool isLetter(char c);
bool isDigit(char ch);

int main(void)
{
    string line;

    while (getline(cin, line)) {
        tokens(line);
    }
    return 0;
}
void tokens(const std::string &data)
{
    string::size_type x = 0;

    while (x < data.length()) {
        char ch = data[x];
        while (x < data.length() && isspace(ch))
            ch = data[inc(x, 1)];
        if (x >= data.length())
            break;

        switch (ch) {
            case '+': inc(x, 1); token.push_back("<add>"); cout << "<add>" << endl; break;
            case '-': inc(x, 1); token.push_back("<sub>"); cout << "<sub>" << endl; break;
            case '*': inc(x, 1); token.push_back("<mul>"); cout << "<mul>" << endl; break;
            case '/': inc(x, 1); token.push_back("<div>"); break;
            case '%': inc(x, 1); token.push_back("<mod>"); break; //break was missing, so '%' fell through to '^'
            case '^': inc(x, 1); token.push_back("<pow>"); break;
            case '(': inc(x, 1); token.push_back("<par>"); cout << "<par>" << endl; break;
            case ')': inc(x, 1); token.push_back("</par>"); cout << "</par>" << endl; break;
            case '.': inc(x, 1); token.push_back("<dot>"); break;
            case '=': inc(x, 1); token.push_back("<equal>"); cout << "<equal>" << endl; break;
            case '>':
                ch = data[x + 1];
                if (ch == '=') {
                    inc(x, 2);
                    token.push_back("<greater_equal>");
                    cout << "<greater_equal>" << endl;
                } else {
                    inc(x, 1);
                    token.push_back("<greater>");
                    cout << "<greater>" << endl;
                }
                break;
            case '<':
                ch = data[x + 1];
                if (ch == '=') {
                    inc(x, 2);
                    token.push_back("<less_equal>");
                    cout << "<less_equal>" << endl;
                } else if (ch == '>') {
                    inc(x, 2);
                    token.push_back("<not_equal>");
                    cout << "<not_equal>" << endl;
                } else {
                    inc(x, 1);
                    token.push_back("<less>");
                    cout << "<less>" << endl;
                }
                break;
            case '{': inc(x, 1); token.push_back("<curly>"); break;
            case '}': inc(x, 1); token.push_back("</curly>"); break;
            case '[': inc(x, 1); token.push_back("<square>"); break;
            case ']': inc(x, 1); token.push_back("</square>"); break;
            case '\"':
                ch = data[inc(x, 1)];
                while (x < data.length() && ch != '\"' && ch != '\r' && ch != '\n') {
                    ch = data[inc(x, 1)];
                }
                if (ch == '\"') { //only emit a token for a properly closed string; the old data[x - 1] check looked at the wrong character
                    token.push_back("<string>");
                    cout << "<string>" << endl;
                }
                inc(x, 1);
                break;
            default:
                if (isdigit(ch)) {
                    do {
                        ch = data[inc(x, 1)];
                    } while (isdigit(ch) || ch == '.');
                    token.push_back("<num>");
                    cout << "<num>" << endl;
                } else if (isalpha(ch) || ch == '_') {
                    do {
                        ch = data[inc(x, 1)];
                    } while (isalnum(ch) || ch == '_');
                    token.push_back("<id>"); //this push_back was missing; only the cout was here
                    cout << "<id>" << endl;
                } else {
                    inc(x, 1); //skip unrecognized characters so the loop can't get stuck
                }
                break;
        }
    }
}
int inc(string::size_type &x, int by)
{
    return x = x + by;
}

int iswhite(int c)
{
    return (c == ' ' || c == '\t');
}

bool isLetter(char c)
{
    return ((c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z'));
}

bool isDigit(char c)
{
    return (c >= '0' && c <= '9');
}
|
|
|
Post by n00b on Oct 10, 2016 21:42:58 GMT -6
I added some comments to your code and made a few modifications. The main change I made was that I took out all the cout statements in the tokens function and instead created a new function called output_tokens that will output all the tokens read to the console. I would like to keep all debugging in separate function calls from the actual project code so we can choose when and where we want to show debug info. I also added a new variable to the tokens function called s_data. s_data just stores the token for a number, identifier, or string so that it can all be put in the vector at once. Originally you had it so that a number would create a <num> token, but it wasn't storing the number in the token. Now the token will include the <num>, <id>, or <string> tag along with the data it refers to. Also, I split this into multiple files; the name of each file is above its piece of code in this post. I have committed this to github, but the changes have not posted at the time of this post.
file: tokenizer.h
#ifndef TOKENIZER_H_INCLUDED
#define TOKENIZER_H_INCLUDED

#include <iostream>
#include <string>
#include <cctype>
#include <vector>
#include <stdexcept> //for out_of_range, used by output_tokens

using namespace std;

vector<string> token;                 //stores tokens for the current source line
void tokens(const std::string &data); //reads current source line and fills token vector
int inc(string::size_type &, int);    //advances the current read position and returns it
int iswhite(int);                     //returns whether the current character is a whitespace
bool isLetter(char c);                //returns whether or not the current character is a letter
bool isDigit(char ch);                //returns whether or not the current character is a digit
void output_tokens();                 //outputs the last set of tokens generated
void tokens(const std::string &data)
{
    string::size_type x = 0;

    while (x < data.length()) {
        char ch = data[x];

        //s_data will hold a number, identifier, or string
        string s_data = "";

        while (x < data.length() && isspace(ch))
            ch = data[inc(x, 1)];
        if (x >= data.length())
            break;

        switch (ch) {
            case '+': inc(x, 1); token.push_back("<add>"); break;
            case '-': inc(x, 1); token.push_back("<sub>"); break;
            case '*': inc(x, 1); token.push_back("<mul>"); break;
            case '/': inc(x, 1); token.push_back("<div>"); break;
            case '%': inc(x, 1); token.push_back("<mod>"); break; //break was missing, so '%' fell through to '^'
            case '^': inc(x, 1); token.push_back("<pow>"); break;
            case '(': inc(x, 1); token.push_back("<par>"); break;
            case ')': inc(x, 1); token.push_back("</par>"); break;
            case '.': inc(x, 1); token.push_back("<dot>"); break;
            case '=': inc(x, 1); token.push_back("<equal>"); break;
            case '>':
                ch = data[x + 1];
                if (ch == '=') {
                    inc(x, 2);
                    token.push_back("<greater_equal>");
                } else {
                    inc(x, 1);
                    token.push_back("<greater>");
                }
                break;
            case '<':
                ch = data[x + 1];
                if (ch == '=') {
                    inc(x, 2);
                    token.push_back("<less_equal>");
                } else if (ch == '>') {
                    inc(x, 2);
                    token.push_back("<not_equal>");
                } else {
                    inc(x, 1);
                    token.push_back("<less>");
                }
                break;
            case '{': inc(x, 1); token.push_back("<curly>"); break;
            case '}': inc(x, 1); token.push_back("</curly>"); break;
            case '[': inc(x, 1); token.push_back("<square>"); break;
            case ']': inc(x, 1); token.push_back("</square>"); break;
            case '\"':
                s_data = "<string>";
                ch = data[inc(x, 1)];
                while (x < data.length() && ch != '\"' && ch != '\r' && ch != '\n') {
                    s_data.push_back(ch);
                    ch = data[inc(x, 1)];
                }
                if (ch == '\"') //only store properly closed strings; the old data[x - 1] check looked at the wrong character
                    token.push_back(s_data);
                inc(x, 1);
                break;
            default:
                if (isdigit(ch)) {
                    s_data = "<num>";
                    do {
                        s_data.push_back(ch);
                        ch = data[inc(x, 1)];
                    } while (isdigit(ch) || ch == '.');
                    token.push_back(s_data);
                } else if (isalpha(ch) || ch == '_') {
                    s_data = "<id>";
                    do {
                        s_data.push_back(ch);
                        ch = data[inc(x, 1)];
                    } while (isalnum(ch) || ch == '_');
                    token.push_back(s_data);
                } else {
                    inc(x, 1); //skip unrecognized characters so the loop can't get stuck
                }
                break;
        }
    }
}
int inc(string::size_type &x, int by)
{
    return x = x + by;
}

int iswhite(int c)
{
    return (c == ' ' || c == '\t');
}

bool isLetter(char c)
{
    return ((c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z'));
}

bool isDigit(char c)
{
    return (c >= '0' && c <= '9');
}

void output_tokens()
{
    for (size_t i = 0; i < token.size(); i++) { //size_t avoids the signed/unsigned comparison warning
        try {
            cout << token.at(i) << endl;
        } catch (out_of_range &e) {
            cout << "Token Out of Range Error: " << e.what() << endl;
        }
    }
}
#endif // TOKENIZER_H_INCLUDED
file: main.cpp
#include <iostream>
#include "tokenizer.h"

using namespace std;

int main(void)
{
    string line;

    while (getline(cin, line)) {
        tokens(line);
        output_tokens();
    }
    return 0;
}
|
|
|
Post by sdlbasic on Oct 11, 2016 3:49:58 GMT -6
Thanks for the clean up. Anyway, this was only a draft, not the final code, but it's always a good idea to separate the files and start as soon as possible even if it is a draft, because in some cases this will most likely never get replaced :-), so thanks for the heads up. I was not storing the data, but the code for it was there; I was not sure what you wanted me to do about that and was going to ask in the near future. Github doesn't have a commit yet, just a readme file. Since we are starting to organize the code into files, I took a step forward and created a new file called tokenizer.cpp, put all the code there, and left only the declarations in the header file. See :-).
tokenizer.h
#ifndef TOKENIZER_H_INCLUDED
#define TOKENIZER_H_INCLUDED

#include <iostream>
#include <string>
#include <cctype>
#include <vector>
#include <stdexcept> //for out_of_range, used by output_tokens

//declared extern here and defined once in tokenizer.cpp; otherwise every
//file that includes this header gets its own definition and the link fails
extern std::vector<std::string> token;  //stores tokens for the current source line
void tokens(const std::string &data);   //reads current source line and fills token vector
int inc(std::string::size_type &, int); //advances the current read position and returns it
int iswhite(int);                       //returns whether the current character is a whitespace
bool isLetter(char c);                  //returns whether or not the current character is a letter
bool isDigit(char ch);                  //returns whether or not the current character is a digit
void output_tokens();                   //outputs the last set of tokens generated

#endif // TOKENIZER_H_INCLUDED

tokenizer.cpp

#include "tokenizer.h"

using namespace std;

vector<string> token; //the single definition of the global token vector

void tokens(const std::string &data)
{
    string::size_type x = 0;

    while (x < data.length()) {
        char ch = data[x];

        //s_data will hold a number, identifier, or string
        string s_data = "";

        while (x < data.length() && isspace(ch))
            ch = data[inc(x, 1)];
        if (x >= data.length())
            break;

        switch (ch) {
            case '+': inc(x, 1); token.push_back("<add>"); break;
            case '-': inc(x, 1); token.push_back("<sub>"); break;
            case '*': inc(x, 1); token.push_back("<mul>"); break;
            case '/': inc(x, 1); token.push_back("<div>"); break;
            case '%': inc(x, 1); token.push_back("<mod>"); break; //break was missing, so '%' fell through to '^'
            case '^': inc(x, 1); token.push_back("<pow>"); break;
            case '(': inc(x, 1); token.push_back("<par>"); break;
            case ')': inc(x, 1); token.push_back("</par>"); break;
            case '.': inc(x, 1); token.push_back("<dot>"); break;
            case '=': inc(x, 1); token.push_back("<equal>"); break;
            case '>':
                ch = data[x + 1];
                if (ch == '=') {
                    inc(x, 2);
                    token.push_back("<greater_equal>");
                } else {
                    inc(x, 1);
                    token.push_back("<greater>");
                }
                break;
            case '<':
                ch = data[x + 1];
                if (ch == '=') {
                    inc(x, 2);
                    token.push_back("<less_equal>");
                } else if (ch == '>') {
                    inc(x, 2);
                    token.push_back("<not_equal>");
                } else {
                    inc(x, 1);
                    token.push_back("<less>");
                }
                break;
            case '{': inc(x, 1); token.push_back("<curly>"); break;
            case '}': inc(x, 1); token.push_back("</curly>"); break;
            case '[': inc(x, 1); token.push_back("<square>"); break;
            case ']': inc(x, 1); token.push_back("</square>"); break;
            case '\"':
                s_data = "<string>";
                ch = data[inc(x, 1)];
                while (x < data.length() && ch != '\"' && ch != '\r' && ch != '\n') {
                    s_data.push_back(ch);
                    ch = data[inc(x, 1)];
                }
                if (ch == '\"') //only store properly closed strings
                    token.push_back(s_data);
                inc(x, 1);
                break;
            default:
                if (isdigit(ch)) {
                    s_data = "<num>";
                    do {
                        s_data.push_back(ch);
                        ch = data[inc(x, 1)];
                    } while (isdigit(ch) || ch == '.');
                    token.push_back(s_data);
                } else if (isalpha(ch) || ch == '_') {
                    s_data = "<id>";
                    do {
                        s_data.push_back(ch);
                        ch = data[inc(x, 1)];
                    } while (isalnum(ch) || ch == '_');
                    token.push_back(s_data);
                } else {
                    inc(x, 1); //skip unrecognized characters so the loop can't get stuck
                }
                break;
        }
    }
}

int inc(string::size_type &x, int by)
{
    return x = x + by;
}

int iswhite(int c)
{
    return (c == ' ' || c == '\t');
}

bool isLetter(char c)
{
    return ((c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z'));
}

bool isDigit(char c)
{
    return (c >= '0' && c <= '9');
}

void output_tokens()
{
    for (size_t i = 0; i < token.size(); i++) {
        try {
            cout << token.at(i) << endl;
        } catch (out_of_range &e) {
            cout << "Token Out of Range Error: " << e.what() << endl;
        }
    }
}
main.cpp
#include "tokenizer.h"

using namespace std;

int main(void)
{
    string line;

    while (getline(cin, line)) {
        tokens(line);
        output_tokens();
    }
    return 0;
}
|
|
|
Post by sdlbasic on Oct 13, 2016 11:08:09 GMT -6
I have more code to add, here it goes.
bool iskeyWord(string s);

string keyWords[] = {"AND", "OR", "NOT", "XOR", "MOD"};

bool iskeyWord(string s)
{
    int len = sizeof(keyWords) / sizeof(keyWords[0]);
    int i;

    for (i = 0; i < len; i++) {
        if (s == keyWords[i]) //was "keyWord[i]", a typo that wouldn't compile
            return true;
    }
    return false;
}
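If the keyword table grows, a std::unordered_set would give constant-time lookup and drop the manual sizeof arithmetic. This is just an alternative sketch for comparison, not something anyone has agreed to switch to:

```cpp
#include <string>
#include <unordered_set>

// Keyword lookup via a hash set; extend the initializer list as new
// keywords are added to the language. Lookup is case-sensitive here.
bool iskeyWord(const std::string &s) {
    static const std::unordered_set<std::string> keyWords =
        {"AND", "OR", "NOT", "XOR", "MOD"};
    return keyWords.count(s) != 0;
}
```

Note that this version, like the array one, is case-sensitive; if RCBasic keywords are meant to be case-insensitive, the input would need to be uppercased before the lookup.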
|
|