Version of C# StringBuilder to allow for strings larger than 2 billion characters
In C#, on 64-bit Windows with .NET 4.5 (or later), enabling gcAllowVeryLargeObjects in the App.config file allows for objects larger than two gigabytes. That's cool, but unfortunately the maximum number of elements that C# allows in an array is still limited to about 2^31 = 2.15 billion. Testing confirmed this.
To overcome this, Microsoft recommends (in Option B of its documentation) creating the arrays natively. Problem is, we'd need to use unsafe code, and as far as I know, Unicode won't be supported, at least not easily.
So I ended up creating my own BigStringBuilder class. It's a list where each list element (or page) is a char array (type List<char[]>).
Provided you're using 64-bit Windows, you can now easily surpass the 2-billion-character element limit. I managed to test creating a giant string around 32 gigabytes large (I needed to increase virtual memory in the OS first, otherwise I could only get around 7 GB on my 8 GB RAM PC). I'm sure it handles more than 32 GB easily. In theory, it should be able to handle around 1,000,000,000 * 1,000,000,000 chars, or one quintillion characters, which should be enough for anyone!
I also added some common functions such as fileSave(), length(), substring(), replace(), etc. Like StringBuilder, in-place character writing (mutability) and instant truncation are possible.
Speed-wise, some quick tests show that it's not significantly slower than a StringBuilder when appending (I found it was 33% slower in one test). I got similar performance with a 2D jagged char array (char[][]) instead of List<char[]>, but Lists are simpler to work with, so I stuck with that.
I'm looking for advice to potentially speed up performance, particularly for the append function, and to read or write faster via the indexer (public char this[long n] { ... }).
// A simplified version specially for Stack Overflow / Code Review
public class BigStringBuilder
{
    List<char[]> c = new List<char[]>();
    private int pagedepth;
    private long pagesize;
    private long mpagesize; // https://stackoverflow.com/questions/11040646/faster-modulus-in-c-c
    private int currentPage = 0;
    private int currentPosInPage = 0;

    public BigStringBuilder(int pagedepth = 12) { // pagesize is 2^pagedepth (since it must be a power of 2 for a fast indexer)
        this.pagedepth = pagedepth;
        pagesize = (long)Math.Pow(2, pagedepth);
        mpagesize = pagesize - 1;
        c.Add(new char[pagesize]);
    }
    // Indexer for this class, so you can use convenient square-bracket indexing to address char elements within the array!
    public char this[long n] {
        get { return c[(int)(n >> pagedepth)][n & mpagesize]; }
        set { c[(int)(n >> pagedepth)][n & mpagesize] = value; }
    }
    public string[] returnPagesForTestingPurposes() {
        string[] s = new string[currentPage + 1];
        for (int i = 0; i < currentPage + 1; i++) s[i] = new string(c[i]);
        return s;
    }
    public void clear() {
        c = new List<char[]>();
        c.Add(new char[pagesize]);
        currentPage = 0;
        currentPosInPage = 0;
    }
    // See: https://stackoverflow.com/questions/373365/how-do-i-write-out-a-text-file-in-c-sharp-with-a-code-page-other-than-utf-8/373372
    public void fileSave(string path) {
        StreamWriter sw = File.CreateText(path);
        for (int i = 0; i < currentPage; i++) sw.Write(new string(c[i]));
        sw.Write(new string(c[currentPage], 0, currentPosInPage));
        sw.Close();
    }
    public void fileOpen(string path) {
        clear();
        StreamReader sw = new StreamReader(path);
        int len = 0;
        while ((len = sw.ReadBlock(c[currentPage], 0, (int)pagesize)) != 0) {
            if (!sw.EndOfStream) {
                currentPage++;
                if (currentPage == c.Count) c.Add(new char[pagesize]);
            }
            else {
                currentPosInPage = len;
                break;
            }
        }
        sw.Close();
    }
    public long length() {
        return (long)currentPage * (long)pagesize + (long)currentPosInPage;
    }
    public string ToString(long max = 2000000000) {
        if (length() < max) return substring(0, length());
        else return substring(0, max);
    }
    public string substring(long x, long y) {
        StringBuilder sb = new StringBuilder();
        for (long n = x; n < y; n++) sb.Append(c[(int)(n >> pagedepth)][n & mpagesize]); //8s
        return sb.ToString();
    }
    public bool match(string find, long start = 0) {
        //if (s.Length > length()) return false;
        for (int i = 0; i < find.Length; i++) if (i + start == find.Length || this[start + i] != find[i]) return false;
        return true;
    }
    public void replace(string s, long pos) {
        for (int i = 0; i < s.Length; i++) {
            c[(int)(pos >> pagedepth)][pos & mpagesize] = s[i];
            pos++;
        }
    }
    // Simple implementation of an append() function. Testing shows this to be about
    // as fast or faster than the more sophisticated Append2() function further below
    // despite its simplicity:
    public void Append(string s)
    {
        for (int i = 0; i < s.Length; i++)
        {
            c[currentPage][currentPosInPage] = s[i];
            currentPosInPage++;
            if (currentPosInPage == pagesize)
            {
                currentPosInPage = 0;
                currentPage++;
                if (currentPage == c.Count) c.Add(new char[pagesize]);
            }
        }
    }
    // This method is a more sophisticated version of the Append() function above.
    // Surprisingly, in real-world testing, it doesn't seem to be any faster.
    public void Append2(string s)
    {
        if (currentPosInPage + s.Length <= pagesize)
        {
            // append s entirely to the current page
            for (int i = 0; i < s.Length; i++)
            {
                c[currentPage][currentPosInPage] = s[i];
                currentPosInPage++;
            }
        }
        else
        {
            int stringpos;
            int topup = (int)pagesize - currentPosInPage;
            // finish off the current page with a substring of s
            for (int i = 0; i < topup; i++)
            {
                c[currentPage][currentPosInPage] = s[i];
                currentPosInPage++;
            }
            currentPage++;
            currentPosInPage = 0;
            stringpos = topup;
            int remainingPagesToFill = (s.Length - topup) >> pagedepth; // we want the floor here
            // fill up complete pages if necessary:
            if (remainingPagesToFill > 0)
            {
                for (int i = 0; i < remainingPagesToFill; i++)
                {
                    if (currentPage == c.Count) c.Add(new char[pagesize]);
                    for (int j = 0; j < pagesize; j++)
                    {
                        c[currentPage][j] = s[stringpos];
                        stringpos++;
                    }
                    currentPage++;
                }
            }
            // finish off the remainder of string s on a new page:
            if (currentPage == c.Count) c.Add(new char[pagesize]);
            for (int i = stringpos; i < s.Length; i++)
            {
                c[currentPage][currentPosInPage] = s[i];
                currentPosInPage++;
            }
        }
    }
}
Tags: c#, performance, strings, pagination
Comments:

– t3chb0t (Mar 7 at 15:22): So what kind of crazy stuff does one need these large data types for?

– t3chb0t (Mar 7 at 16:02): OK... but isn't streaming it easier and faster than loading the entire file into memory? It screams: the XY problem. Your users are not responsible for you wasting RAM :-P

– t3chb0t (Mar 7 at 16:08): The question you should be asking is how you can convert this giant CSV more efficiently rather than brute-forcing it into your RAM.

– t3chb0t (Mar 7 at 16:21): Oh boy... this sounds like you're pushing JSON over CSV... this is even scarier than I thought. This entire concept seems to be pretty odd :-| Why don't you do the filtering on the fly? Read, filter, write...? Anyway, have fun with this monster solution ;-]

– Pieter Witvoet (Mar 7 at 16:32): @DanW: it still sounds like treating the input as one giant string is not the most efficient approach. If you really can't process it in a streaming fashion, then did you look into specialized data structures such as ropes, gap buffers, piece tables, that sort of stuff?
2 Answers
Answer 1
List<char[]> c = new List<char[]>();
private int pagedepth;
private long pagesize;
private long mpagesize; // https://stackoverflow.com/questions/11040646/faster-modulus-in-c-c
private int currentPage = 0;
private int currentPosInPage = 0;
Some of these names are rather cryptic. I'm not sure why c isn't private. And surely some of the fields should be readonly?
pagesize = (long)Math.Pow(2, pagedepth);
IMO it's better style to use 1L << pagedepth.
public char this[long n] {
get { return c[(int)(n >> pagedepth)][n & mpagesize]; }
set { c[(int)(n >> pagedepth)][n & mpagesize] = value; }
}
Shouldn't this have bounds checks?
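For example (a sketch; the exception type here is my choice, not from the original code):

public char this[long n] {
    get {
        if (n < 0 || n >= length()) throw new IndexOutOfRangeException();
        return c[(int)(n >> pagedepth)][n & mpagesize];
    }
    set {
        if (n < 0 || n >= length()) throw new IndexOutOfRangeException();
        c[(int)(n >> pagedepth)][n & mpagesize] = value;
    }
}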
public string[] returnPagesForTestingPurposes() {
    string[] s = new string[currentPage + 1];
    for (int i = 0; i < currentPage + 1; i++) s[i] = new string(c[i]);
    return s;
}
There's no need for this to be public: you can make it internal and give your unit test project access with [assembly:InternalsVisibleTo]. Also, since it's for testing purposes, it could probably be marked [System.Diagnostics.Conditional("DEBUG")].
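For example (the test-project name here is hypothetical):

// In the class library, e.g. in AssemblyInfo.cs:
[assembly: System.Runtime.CompilerServices.InternalsVisibleTo("BigStringBuilder.Tests")]

// The helper can then drop from public to internal:
internal string[] returnPagesForTestingPurposes() { /* body unchanged */ }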
public void clear() {
    c = new List<char[]>();
    c.Add(new char[pagesize]);
In C# it's conventional for method names to start with an upper case letter.
There's no need to throw quite as much to the garbage collector. Consider as an alternative:
var page0 = c[0];
c.Clear();
c.Add(page0);
// See: https://stackoverflow.com/questions/373365/how-do-i-write-out-a-text-file-in-c-sharp-with-a-code-page-other-than-utf-8/373372
Why? I don't think it sheds any light on the following method.
public void fileSave(string path) {
StreamWriter sw = File.CreateText(path);
for (int i = 0; i < currentPage; i++) sw.Write(new string(c[i]));
sw.Write(new string(c[currentPage], 0, currentPosInPage));
sw.Close();
}
Missing some disposal: I'd use a using statement.
new string(char[]) copies the entire array to ensure that the string is immutable. That's completely unnecessary here: StreamWriter has a Write(char[], int, int) method.
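Combining both points, fileSave could be written along these lines (a sketch):

public void fileSave(string path) {
    using (StreamWriter sw = File.CreateText(path)) {
        // Write each full page directly from the char[] pages, no intermediate strings.
        for (int i = 0; i < currentPage; i++) sw.Write(c[i], 0, (int)pagesize);
        sw.Write(c[currentPage], 0, currentPosInPage);
    }
}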
public void fileOpen(string path) {
clear();
Yikes! That should be mentioned in the method documentation.
StreamReader sw = new StreamReader(path);
int len = 0;
while ((len = sw.ReadBlock(c[currentPage], 0, (int)pagesize)) != 0){
if (!sw.EndOfStream) {
currentPage++;
if (currentPage == c.Count) c.Add(new char[pagesize]);
}
else {
currentPosInPage = len;
break;
I think this can give rise to inconsistencies. Other methods seem to assume that if the length of the BigStringBuilder is an exact multiple of pagesize then currentPosInPage == 0 and c[currentPage] is empty, but this can give you currentPosInPage == pagesize and c[currentPage] full.
This method is also missing disposal.
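A version that disposes the reader and keeps that invariant (my reading of the intended invariant, matching what Append() does when a page fills up):

public void fileOpen(string path) {
    clear();
    using (StreamReader sr = new StreamReader(path)) {
        int len;
        while ((len = sr.ReadBlock(c[currentPage], 0, (int)pagesize)) != 0) {
            currentPosInPage = len;
            if (currentPosInPage == pagesize) {
                // Roll over to a fresh, empty page, as Append() does when a page fills up.
                currentPage++;
                currentPosInPage = 0;
                if (currentPage == c.Count) c.Add(new char[pagesize]);
            }
        }
    }
}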
public long length() {
return (long)currentPage * (long)pagesize + (long)currentPosInPage;
}
Why is this a method rather than a property? Why use multiplication rather than <<?
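For example, a property using the same shift the indexer already relies on:

public long Length {
    get { return ((long)currentPage << pagedepth) + currentPosInPage; }
}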
public string substring(long x, long y) {
StringBuilder sb = new StringBuilder();
for (long n = x; n < y; n++) sb.Append(c[(int)(n >> pagedepth)][n & mpagesize]); //8s
What is 8s? Why append one character at a time? StringBuilder also has a method which takes (char[], int, int).
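A page-wise version might look like this (a sketch; it copies whole runs within each page instead of single chars):

public string substring(long x, long y) {
    // The result must fit in a string anyway, so (int)(y - x) is safe for valid calls.
    StringBuilder sb = new StringBuilder((int)(y - x));
    long n = x;
    while (n < y) {
        int page = (int)(n >> pagedepth);
        int offset = (int)(n & mpagesize);
        int count = (int)Math.Min(pagesize - offset, y - n);
        sb.Append(c[page], offset, count);
        n += count;
    }
    return sb.ToString();
}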
public bool match(string find, long start = 0) {
//if (s.Length > length()) return false;
for (int i = 0; i < find.Length; i++) if (i + start == find.Length || this[start + i] != find[i]) return false;
return true;
}
What does this method do? The name implies something regexy, but there's no regex in sight. The implementation looks like StartsWith (by default - the offset complicates it).
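A bounds-safe, renamed rewrite of the apparent intent (a sketch; note that the original early-exit condition i + start == find.Length compares an index into the builder against the search string's length, which looks like a bug):

public bool StartsWith(string find, long start = 0) {
    if (start < 0 || start + find.Length > length()) return false;
    for (int i = 0; i < find.Length; i++)
        if (this[start + i] != find[i]) return false;
    return true;
}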
public void replace(string s, long pos) {
for (int i = 0; i < s.Length; i++) {
c[(int)(pos >> pagedepth)][pos & mpagesize] = s[i];
pos++;
}
}
Bounds checks?
// This method is a more sophisticated version of the Append() function above.
// Surprisingly, in real-world testing, it doesn't seem to be any faster.
I'm not surprised. It's still copying character by character. It's almost certainly faster to use string.CopyTo (thanks to Pieter Witvoet for mentioning this method) or ReadOnlySpan<char>.CopyTo.
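For example, a block-copy Append along those lines (a sketch, not benchmarked here):

public void Append(string s) {
    int done = 0;
    while (done < s.Length) {
        int room = (int)pagesize - currentPosInPage;
        int count = Math.Min(room, s.Length - done);
        // string.CopyTo(sourceIndex, destination, destinationIndex, count) copies a block of chars.
        s.CopyTo(done, c[currentPage], currentPosInPage, count);
        done += count;
        currentPosInPage += count;
        if (currentPosInPage == pagesize) {
            currentPosInPage = 0;
            currentPage++;
            if (currentPage == c.Count) c.Add(new char[pagesize]);
        }
    }
}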
Comments:

– Dan W (Mar 7 at 16:26): Thanks! Added responses to my main post if you want to look.

– Pieter Witvoet (Mar 7 at 16:34): Regarding the last point, there's also string.CopyTo.

– BurnsBA (Mar 7 at 19:23): C# class instance members are private by default, so c is private. But you're right that it is inconsistent to not explicitly declare it private like the other fields are. docs.microsoft.com/en-us/dotnet/csharp/language-reference/…

– Pieter Witvoet (Mar 8 at 10:21): @DanW: using translates to a try/finally statement that ensures disposal even when an exception is thrown, so it's almost always the better approach.

– Dan W (Mar 8 at 13:33): By the way, for your info, appending using string.CopyTo is around 3.2x as fast as my original code! Code is: s.CopyTo(0, c[currentPage], currentPosInPage, s.Length); currentPosInPage += s.Length;. Just for fun, I also tried: Array.Copy(s.ToCharArray(), 0, c[currentPage], currentPosInPage, s.Length); currentPosInPage += s.Length;, and that was not quite as good (2.1x faster), but still better than my original code.
Answer 2
Design considerations
In the comments you mentioned that you're going to analyze and manipulate this string. If so, that could be a problem, because replacing, inserting and removing text requires moving all subsequent characters to make room or to fill the gap. With this much data, you can expect that to have a major impact on performance. That is why I mentioned data structures like ropes, gap buffers and piece tables: those are often used in text editors to efficiently support these kinds of operations.
But I think that would be solving the wrong problem. From what I understand so far, the rest of your code is built around a 2D array of strings - a table, essentially - that can be converted to a variety of formats. Because it's an array, you need to know the number of columns and rows up-front, so you need to parse the input before you can allocate the array, and then parse it again to fill the array. This BigStringBuilder class allows you to read the whole file into memory, so you don't have to read it from disk twice.
Why not use dynamically resizable arrays (List<T>) instead? That would allow you to read the data directly into your final data structure with only a single parsing pass. If that requires a lot of work, then you're probably suffering from a lack of abstraction. Instead of passing a raw string[,] around, wrap it in a class. That allows you to swap the internal 2D array for a more suitable data structure without having to modify all the code that uses this table.
For example, a List<List<string>> as internal data structure lets you add variable-length rows on-the-fly, so you only need a single parsing pass. That also allows you to read the input file in a streaming fashion instead of having to read it fully into memory first, as sketched below.
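A minimal sketch of that single-pass, streaming parse (the file name and the comma delimiter are assumptions for illustration):

var rows = new List<string[]>();
using (var reader = new StreamReader("input.csv")) // hypothetical input file
{
    string line;
    while ((line = reader.ReadLine()) != null)
    {
        // Naive split on ',' for illustration only; a real CSV parser would handle quoting.
        rows.Add(line.Split(','));
    }
}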
If modifying existing code that relies on a 2D string array sounds like a lot of work, consider emulating the 'interface' of a 2D array. With a class that provides a string this[int row, int column] indexer and a GetLength(int dimension) method, chances are that you only need to change the type of a bunch of parameters and variables.
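A minimal facade along those lines might look like this (a sketch; the class and method names are mine, not from the original code):

public class StringTable
{
    private readonly List<string[]> rows = new List<string[]>();

    public void AddRow(string[] row) { rows.Add(row); }

    // Mimics the 2D-array indexer, so call sites barely change.
    public string this[int row, int column]
    {
        get { return rows[row][column]; }
        set { rows[row][column] = value; }
    }

    // Mimics Array.GetLength(dimension): 0 = rows, 1 = columns.
    public int GetLength(int dimension)
    {
        if (dimension == 0) return rows.Count;
        return rows.Count == 0 ? 0 : rows[0].Length;
    }
}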
Other notes
- Putting if and for bodies on the same line makes control flow difficult to follow.
- fileSave and fileOpen are not very flexible. Letting the caller pass in a Stream or TextReader/TextWriter would make them more reusable. That also gives callers control over things like encoding and buffer size. Additionally, overloads like CopyToStream(Stream toStream, long offset, long count) and CopyFromStream(Stream fromStream, long count) are probably also a good idea.
- c[(int)(n >> pagedepth)][n & mpagesize] is duplicated several times in the code. Use this[n] instead.
- There's very little documentation, especially for infrastructure-level code like this. I don't know about you, but I tend to forget things about code I wrote a while ago, so documentation that explains why something works the way it does, or how something is intended to be used, is quite useful.
- I'd recommend using more self-descriptive names, and aiming for consistency with the standard types you're 'emulating':
  - this[n] -> this[index]
  - substring(x, y) -> Substring(startIndex, length)
  - replace(s, pos) -> Overwrite(value, startIndex) (I find replace confusing because it's so different from what string.Replace does)
- Regarding camelCase vs PascalCase, I don't see why it's important to distinguish between standard library code and your own, but to each their own I guess. But why the inconsistency in ToString and Append?
- I'd argue that correctness is more important than performance. Leaving out bounds checks should be a last resort, and only when profiling shows that they're actually a bottleneck. Chances are that it's possible to improve the API so you don't need to access individual characters so often.
- I agree with Peter that fileOpen clearing the string-builder is unexpected behavior, in a negative way. Certainly the caller can call clear before fileOpen if that's the behavior they want? Otherwise, I would clearly document this.
- length being a method is inconsistent with StringBuilder and other collection types. A little bit of calculation in a property should be fine.
- If you're using returnPagesForTestingPurposes for automated testing then you're testing implementation details, which is a bad idea. If you're using it for debugging, then why not inspect the data with a debugger instead, or use tracepoints?
Comments:

– Dan W (Mar 10 at 7:50): I manipulate the string so that in-place replacements are either equal in length or shorter, so that reading can continue whilst the same BigStringBuilder is being operated on. Your idea of using List<T> is interesting regardless. Honestly, a part of me would love to experiment and refactor the program like you say to compare speed, but time is limited. FYI, string[,] is poor, only allowing for 2^28 elements in total; string[][] allows essentially around 2^56 elements, far better.

– Dan W (Mar 10 at 7:51): Quote: "c[(int)(n >> pagedepth)][n & mpagesize] is duplicated several times in the code": Sadly, I found function calling slower than writing the code directly, possibly due to the compiler not inlining it. Quote: "But why the inconsistency in ToString and Append?": I have less excuse with Append(), but ToString was due to overriding the default ToString method. Quote: "Leaving out bounds checks should be a last resort": Either way, it'll crash and I can easily solve the bug. Besides, the program has already been heavily tested, so safety is less of a concern now.

– Pieter Witvoet (Mar 11 at 8:48): So you're using a jagged array then? In that case you could parse the input directly into a List<string[]> and convert it to a string[][] afterwards, so you don't need to modify any of the other code at all. About inlining, you could try marking the indexer's getter and setter with MethodImplOptions.AggressiveInlining. Note that replace can be optimized by using string.CopyTo. About bounds checks, if 'it'll crash' means that you're expecting it to throw an exception anyway, then that's not the case when accessing unused characters in the last page.

– Dan W (Mar 11 at 11:18): One other factor to consider is when the user changes settings after the results are displayed. Reading it all out from HDD into a 2D array would need a second HDD read after the user adjusts a single setting, as it would need to be reparsed. Regarding AggressiveInlining, it seems to do the trick, but requires .NET 4.5, and I fear many of my users are still using .NET 3.5/4. Pretty astonishing that the C# team didn't fix the issue until .NET 4.5+.
2 Answers
2
active
oldest
votes
2 Answers
2
active
oldest
votes
active
oldest
votes
active
oldest
votes
$begingroup$
List<char> c = new List<char>();
private int pagedepth;
private long pagesize;
private long mpagesize; // https://stackoverflow.com/questions/11040646/faster-modulus-in-c-c
private int currentPage = 0;
private int currentPosInPage = 0;
Some of these names are rather cryptic. I'm not sure why c
isn't private. And surely some of the fields should be readonly
?
pagesize = (long)Math.Pow(2, pagedepth);
IMO it's better style to use 1L << pagedepth
.
public char this[long n] {
get { return c[(int)(n >> pagedepth)][n & mpagesize]; }
set { c[(int)(n >> pagedepth)][n & mpagesize] = value; }
}
Shouldn't this have bounds checks?
public string returnPagesForTestingPurposes() {
string s = new string[currentPage + 1];
for (int i = 0; i < currentPage + 1; i++) s[i] = new string(c[i]);
return s;
}
There's no need for this to be public: you can make it internal
and give your unit test project access with [assembly:InternalsVisibleTo]
. Also, since it's for testing purposes, it could probably be marked [System.Diagnostics.Conditional("DEBUG")]
.
public void clear() {
c = new List<char>();
c.Add(new char[pagesize]);
In C# it's conventional for method names to start with an upper case letter.
There's no need to throw quite as much to the garbage collector. Consider as an alternative:
var page0 = c[0];
c.Clear();
c.Add(page0);
// See: https://stackoverflow.com/questions/373365/how-do-i-write-out-a-text-file-in-c-sharp-with-a-code-page-other-than-utf-8/373372
Why? I don't think it sheds any light on the following method.
public void fileSave(string path) {
StreamWriter sw = File.CreateText(path);
for (int i = 0; i < currentPage; i++) sw.Write(new string(c[i]));
sw.Write(new string(c[currentPage], 0, currentPosInPage));
sw.Close();
}
Missing some disposal: I'd use a using
statement.
new string(char)
copies the entire array to ensure that the string is immutable. That's completely unnecessary here: StreamWriter
has a method Write(char, int, int)
.
public void fileOpen(string path) {
clear();
Yikes! That should be mentioned in the method documentation.
StreamReader sw = new StreamReader(path);
int len = 0;
while ((len = sw.ReadBlock(c[currentPage], 0, (int)pagesize)) != 0){
if (!sw.EndOfStream) {
currentPage++;
if (currentPage == c.Count) c.Add(new char[pagesize]);
}
else {
currentPosInPage = len;
break;
I think this can give rise to inconsistencies. Other methods seem to assume that if the length of the BigStringBuilder
is an exact multiple of pagesize
then currentPosInPage == 0
and c[currentPage]
is empty, but this can give you currentPosInPage == pagesize
and c[currentPage]
is full.
This method is also missing disposal.
public long length() {
return (long)currentPage * (long)pagesize + (long)currentPosInPage;
}
Why is this a method rather than a property? Why use multiplication rather than <<
?
public string substring(long x, long y) {
StringBuilder sb = new StringBuilder();
for (long n = x; n < y; n++) sb.Append(c[(int)(n >> pagedepth)][n & mpagesize]); //8s
What is 8s
? Why append one character at a time? StringBuilder
also has a method which takes (char, int, int)
.
public bool match(string find, long start = 0) {
//if (s.Length > length()) return false;
for (int i = 0; i < find.Length; i++) if (i + start == find.Length || this[start + i] != find[i]) return false;
return true;
}
What does this method do? The name implies something regexy, but there's no regex in sight. The implementation looks like StartsWith
(by default - the offset complicates it).
public void replace(string s, long pos) {
for (int i = 0; i < s.Length; i++) {
c[(int)(pos >> pagedepth)][pos & mpagesize] = s[i];
pos++;
}
}
Bounds checks?
// This method is a more sophisticated version of the Append() function above.
// Surprisingly, in real-world testing, it doesn't seem to be any faster.
I'm not surprised. It's still copying character by character. It's almost certainly faster to use string.CopyTo
(thanks to Pieter Witvoet for mentioning this method) or ReadOnlySpan.CopyTo
.
$endgroup$
$begingroup$
Thanks! Added responses to my main post if you want to look.
$endgroup$
– Dan W
Mar 7 at 16:26
2
$begingroup$
Regarding the last point, there's alsostring.CopyTo
.
$endgroup$
– Pieter Witvoet
Mar 7 at 16:34
$begingroup$
c# class instance members are private by default, soc
is private. But you're right that it is inconsistent to not explicitly declare it private like the other fields are. docs.microsoft.com/en-us/dotnet/csharp/language-reference/…
$endgroup$
– BurnsBA
Mar 7 at 19:23
3
$begingroup$
@DanW:using
translates to atry/finally
statement that ensures disposal even when an exception is thrown, so it's almost always the better approach.
$endgroup$
– Pieter Witvoet
Mar 8 at 10:21
1
$begingroup$
By the way, for your info, appending usingstring.CopyTo
is around 3.2x as fast as my original code! Code is:s.CopyTo(0, c[currentPage], currentPosInPage, s.Length); currentPosInPage += s.Length;
. Just for fun, I also tried:Array.Copy(s.ToCharArray(), 0, c[currentPage], currentPosInPage, s.Length); currentPosInPage += s.Length;
, and that was not quite as good (2.1x faster), but still better than my original code.
$endgroup$
– Dan W
Mar 8 at 13:33
|
show 3 more comments
$begingroup$
List<char> c = new List<char>();
private int pagedepth;
private long pagesize;
private long mpagesize; // https://stackoverflow.com/questions/11040646/faster-modulus-in-c-c
private int currentPage = 0;
private int currentPosInPage = 0;
Some of these names are rather cryptic. I'm not sure why c
isn't private. And surely some of the fields should be readonly
?
pagesize = (long)Math.Pow(2, pagedepth);
IMO it's better style to use 1L << pagedepth
.
public char this[long n] {
get { return c[(int)(n >> pagedepth)][n & mpagesize]; }
set { c[(int)(n >> pagedepth)][n & mpagesize] = value; }
}
Shouldn't this have bounds checks?
public string returnPagesForTestingPurposes() {
string s = new string[currentPage + 1];
for (int i = 0; i < currentPage + 1; i++) s[i] = new string(c[i]);
return s;
}
There's no need for this to be public: you can make it internal
and give your unit test project access with [assembly:InternalsVisibleTo]
. Also, since it's for testing purposes, it could probably be marked [System.Diagnostics.Conditional("DEBUG")]
.
public void clear() {
c = new List<char>();
c.Add(new char[pagesize]);
In C# it's conventional for method names to start with an upper case letter.
There's no need to throw quite as much to the garbage collector. Consider as an alternative:
var page0 = c[0];
c.Clear();
c.Add(page0);
// See: https://stackoverflow.com/questions/373365/how-do-i-write-out-a-text-file-in-c-sharp-with-a-code-page-other-than-utf-8/373372
Why? I don't think it sheds any light on the following method.
public void fileSave(string path) {
StreamWriter sw = File.CreateText(path);
for (int i = 0; i < currentPage; i++) sw.Write(new string(c[i]));
sw.Write(new string(c[currentPage], 0, currentPosInPage));
sw.Close();
}
Missing some disposal: I'd use a using
statement.
new string(char)
copies the entire array to ensure that the string is immutable. That's completely unnecessary here: StreamWriter
has a method Write(char, int, int)
.
public void fileOpen(string path) {
clear();
Yikes! That should be mentioned in the method documentation.
StreamReader sw = new StreamReader(path);
int len = 0;
while ((len = sw.ReadBlock(c[currentPage], 0, (int)pagesize)) != 0){
if (!sw.EndOfStream) {
currentPage++;
if (currentPage == c.Count) c.Add(new char[pagesize]);
}
else {
currentPosInPage = len;
break;
I think this can give rise to inconsistencies. Other methods seem to assume that if the length of the BigStringBuilder
is an exact multiple of pagesize
then currentPosInPage == 0
and c[currentPage]
is empty, but this can give you currentPosInPage == pagesize
and c[currentPage]
is full.
This method is also missing disposal.
public long length() {
return (long)currentPage * (long)pagesize + (long)currentPosInPage;
}
Why is this a method rather than a property? Why use multiplication rather than <<
?
public string substring(long x, long y) {
StringBuilder sb = new StringBuilder();
for (long n = x; n < y; n++) sb.Append(c[(int)(n >> pagedepth)][n & mpagesize]); //8s
What is 8s
? Why append one character at a time? StringBuilder
also has a method which takes (char, int, int)
.
public bool match(string find, long start = 0) {
//if (s.Length > length()) return false;
for (int i = 0; i < find.Length; i++) if (i + start == find.Length || this[start + i] != find[i]) return false;
return true;
}
What does this method do? The name implies something regexy, but there's no regex in sight. The implementation looks like StartsWith
(by default - the offset complicates it).
public void replace(string s, long pos) {
for (int i = 0; i < s.Length; i++) {
c[(int)(pos >> pagedepth)][pos & mpagesize] = s[i];
pos++;
}
}
Bounds checks?
// This method is a more sophisticated version of the Append() function above.
// Surprisingly, in real-world testing, it doesn't seem to be any faster.
I'm not surprised. It's still copying character by character. It's almost certainly faster to use string.CopyTo
(thanks to Pieter Witvoet for mentioning this method) or ReadOnlySpan.CopyTo
.
$endgroup$
$begingroup$
Thanks! Added responses to my main post if you want to look.
$endgroup$
– Dan W
Mar 7 at 16:26
2
$begingroup$
Regarding the last point, there's alsostring.CopyTo
.
$endgroup$
– Pieter Witvoet
Mar 7 at 16:34
$begingroup$
c# class instance members are private by default, soc
is private. But you're right that it is inconsistent to not explicitly declare it private like the other fields are. docs.microsoft.com/en-us/dotnet/csharp/language-reference/…
$endgroup$
– BurnsBA
Mar 7 at 19:23
3
$begingroup$
@DanW:using
translates to atry/finally
statement that ensures disposal even when an exception is thrown, so it's almost always the better approach.
$endgroup$
– Pieter Witvoet
Mar 8 at 10:21
1
$begingroup$
By the way, for your info, appending usingstring.CopyTo
is around 3.2x as fast as my original code! Code is:s.CopyTo(0, c[currentPage], currentPosInPage, s.Length); currentPosInPage += s.Length;
. Just for fun, I also tried:Array.Copy(s.ToCharArray(), 0, c[currentPage], currentPosInPage, s.Length); currentPosInPage += s.Length;
, and that was not quite as good (2.1x faster), but still better than my original code.
$endgroup$
– Dan W
Mar 8 at 13:33
$begingroup$
Design considerations
In the comments you mentioned that you're going to analyze and manipulate this string. If so, that could be a problem, because replacing, inserting and removing text requires moving all subsequent characters to make room or to fill the gap. With this much data, you can expect that to have a major impact on performance. That is why I mentioned data structures like ropes, gap buffers and piece tables: those are often used in text editors to efficiently support these kinds of operations.
But I think that would be solving the wrong problem. From what I understand so far, the rest of your code is built around a 2D array of strings - a table, essentially - that can be converted to a variety of formats. Because it's an array, you need to know the number of columns and rows up-front, so you need to parse the input before you can allocate the array, and then parse it again to fill the array. This BigStringBuilder
class allows you to read the whole file into memory, so you don't have to read it from disk twice.
Why not use dynamically resizable arrays (List<T>
) instead? That would allow you to read the data directly into your final data structure with only a single parsing pass. If that requires a lot of work, then you're probably suffering from a lack of abstraction. Instead of passing a raw string[,]
around, wrap it in a class. That allows you to swap the internal 2D array for a more suitable data structure without having to modify all the code that uses this table.
For example, a List<List<string>>
as internal data structure lets you add variable-length rows on-the-fly, so you only need a single parsing pass. That also allows you to read the input file in a streaming fashion instead of having to read it fully into memory first.
If modifying existing code that relies on a 2D string array sounds like a lot of work, consider emulating the 'interface' of a 2D array. With a class that provides a string this[int row, int column]
indexer and a GetLength(int dimension)
method, chances are that you only need to change the type of a bunch of parameters and variables.
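A bare-bones sketch of what that emulation could look like (the class and member names here are illustrative, not from the reviewed code):
using System.Collections.Generic;

// Emulates the surface of a string[,] while allowing variable-length rows.
public class StringTable {
    private readonly List<List<string>> rows = new List<List<string>>();

    public string this[int row, int column] {
        get { return rows[row][column]; }
        set { rows[row][column] = value; }
    }

    // Mirrors string[,].GetLength: dimension 0 = rows, 1 = columns.
    public int GetLength(int dimension) {
        if (dimension == 0) return rows.Count;
        return rows.Count == 0 ? 0 : rows[0].Count;
    }

    public void AddRow(IEnumerable<string> cells) { rows.Add(new List<string>(cells)); }
}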
Other notes
- Putting if and for bodies on the same line makes control flow difficult to follow.
- fileSave and fileOpen are not very flexible. Letting the caller pass in a Stream or TextReader/TextWriter would make them more reusable. That also gives callers control over things like encoding and buffer size. Additionally, overloads like CopyToStream(Stream toStream, long offset, long count) and CopyFromStream(Stream fromStream, long count) are probably also a good idea. (See the sketch after this list.)
- c[(int)(n >> pagedepth)][n & mpagesize] is duplicated several times in the code. Use this[n] instead.
- There's very little documentation, especially for infrastructure-level code like this. I don't know about you, but I tend to forget things about code I wrote a while ago, so documentation that explains why something works the way it does, or how something is intended to be used, is quite useful.
- I'd recommend using more self-descriptive names, and aiming for consistency with the standard types you're 'emulating':
  - this[n] -> this[index]
  - substring(x, y) -> Substring(startIndex, length)
  - replace(s, pos) -> Overwrite(value, startIndex) (I find replace confusing because it's so different from what string.Replace does)
- Regarding camelCase vs PascalCase, I don't see why it's important to distinguish between standard library code and your own, but to each their own, I guess. But why the inconsistency in ToString and Append?
- I'd argue that correctness is more important than performance. Leaving out bounds checks should be a last resort, and only when profiling shows that they're actually a bottleneck. Chances are that it's possible to improve the API so you don't need to access individual characters so often.
- I agree with Peter that fileOpen clearing the string-builder is unexpected behavior, in a negative way. Surely the caller can call clear before fileOpen if that's the behavior they want? Otherwise, I would clearly document this.
- length being a method is inconsistent with StringBuilder and other collection types. A little bit of calculation in a property should be fine.
- If you're using returnPagesForTestingPurposes for automated testing then you're testing implementation details, which is a bad idea. If you're using it for debugging, then why not inspect the data with a debugger instead, or use tracepoints?
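To sketch the fileSave/fileOpen point from the list above (WriteTo is a hypothetical method name; the field names are from the reviewed class):
// The caller now controls the destination, encoding, buffering and disposal.
public void WriteTo(System.IO.TextWriter writer) {
    for (int i = 0; i < currentPage; i++)
        writer.Write(c[i], 0, (int)pagesize);          // full pages
    writer.Write(c[currentPage], 0, currentPosInPage); // partial last page
}

// fileSave becomes a thin convenience wrapper:
public void fileSave(string path) {
    using (var sw = System.IO.File.CreateText(path)) WriteTo(sw);
}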
$endgroup$
answered Mar 8 at 19:29
– Pieter Witvoet
$begingroup$
I manipulate the string so that in-place replacements are either equal in length, or shorter, so that reading can continue whilst the same BigStringBuilder is being operated on. Your idea of using List<T> is interesting regardless. Honestly, a part of me would love to experiment and refactor the program like you say to compare speed, but time is limited. FYI, string[,] is poor, only allowing for 2^28 elements in total. string[][] allows essentially around 2^56 elements, far better.
$endgroup$
– Dan W
Mar 10 at 7:50
$begingroup$
Quote: "c[(int)(n >> pagedepth)][n & mpagesize] is duplicated several times in the code": Sadly, I found function calls slower than writing the code directly, possibly due to the compiler not inlining them. Quote: "But why the inconsistency in ToString and Append?": I have less excuse with Append(), but ToString was due to overriding the default ToString method. Quote: "Leaving out bounds checks should be a last resort": Either way, it'll crash and I can easily solve the bug. Besides, the program has already been heavily tested, so safety is less of a concern now.
$endgroup$
– Dan W
Mar 10 at 7:51
$begingroup$
So you're using a jagged array then? In that case you could parse the input directly into a List<string> and convert it to a string afterwards, so you don't need to modify any of the other code at all. About inlining, you could try marking the indexer's getter and setter with MethodImplOptions.AggressiveInlining. Note that replace can be optimized by using string.CopyTo. About bounds checks, if 'it'll crash' means that you're expecting it to throw an exception anyway, then that's not the case when accessing unused characters in the last page.
$endgroup$
– Pieter Witvoet
Mar 11 at 8:48
$begingroup$
One other factor to consider is when the user changes settings after the results are displayed. Reading it all out from HDD into a 2D array would need a second HDD read after the user adjusts a single setting, as it would need to be reparsed. Regarding AggressiveInlining, it seems to do the trick, but requires .NET 4.5, and I fear many of my users are still using .NET 3.5/4. Pretty astonishing that the C# team didn't fix the issue until .NET 4.5+.
$endgroup$
– Dan W
Mar 11 at 11:18
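For reference, the inlining hint suggested in the comments above would look something like this on the indexer (requires using System.Runtime.CompilerServices and, as noted, .NET 4.5+):
public char this[long n] {
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    get { return c[(int)(n >> pagedepth)][n & mpagesize]; }
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    set { c[(int)(n >> pagedepth)][n & mpagesize] = value; }
}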
$begingroup$
So what kind of crazy stuff does one need such large data types for?
$endgroup$
– t3chb0t
Mar 7 at 15:22
$begingroup$
OK... but isn't streaming it easier and faster than loading the entire file into memory? It screams the XY problem. Your users are not responsible for you wasting RAM :-P
$endgroup$
– t3chb0t
Mar 7 at 16:02
$begingroup$
The question you should be asking is how you can convert this giant CSV more efficiently rather than brute-forcing it into your RAM.
$endgroup$
– t3chb0t
Mar 7 at 16:08
$begingroup$
Oh boy... this sounds like you're pushing JSON over CSV... this is even scarier than I thought. This entire concept seems to be pretty odd :-| Why don't you do the filtering on the fly? Read, filter, write...? Anyway, have fun with this monster solution ;-]
$endgroup$
– t3chb0t
Mar 7 at 16:21
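The read-filter-write loop described above could be as simple as this sketch (inputPath, outputPath, KeepLine and Transform are all hypothetical stand-ins for the actual paths and filtering logic):
using (var reader = new System.IO.StreamReader(inputPath))
using (var writer = new System.IO.StreamWriter(outputPath)) {
    string line;
    while ((line = reader.ReadLine()) != null) {
        if (KeepLine(line))                    // hypothetical predicate
            writer.WriteLine(Transform(line)); // hypothetical transform
    }
}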
$begingroup$
@DanW: it still sounds like treating the input as one giant string is not the most efficient approach. If you really can't process it in a streaming fashion, then did you look into specialized data structures such as ropes, gap buffers, piece tables, that sort of stuff?
$endgroup$
– Pieter Witvoet
Mar 7 at 16:32