Version of C# StringBuilder to allow for strings larger than 2 billion characters
In C#, on 64-bit Windows with .NET 4.5 (or later), enabling gcAllowVeryLargeObjects in the App.config file allows for objects larger than two gigabytes. That's cool, but unfortunately the maximum number of elements that C# allows in an array is still limited to about 2^31 = 2.15 billion. Testing confirmed this.
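For reference, the setting itself is a one-liner in App.config (this is the documented .NET runtime element, shown here for completeness):

    <configuration>
      <runtime>
        <!-- Lifts the 2 GB per-object limit on 64-bit; the ~2^31 element cap per array still applies -->
        <gcAllowVeryLargeObjects enabled="true" />
      </runtime>
    </configuration>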



To overcome this, Microsoft recommends (in Option B) creating the arrays natively. The problem is that this requires unsafe code, and as far as I know, Unicode won't be supported, at least not easily.



So I ended up creating my own BigStringBuilder class in the end. It's a list where each list element (or page) is a char array (type List<char[]>).



Provided you're using 64-bit Windows, you can now easily surpass the 2-billion character element limit. I tested creating a giant string around 32 gigabytes in size (I needed to increase virtual memory in the OS first, otherwise I could only get around 7 GB on my 8 GB RAM PC). I'm sure it handles more than 32 GB easily. In theory, it should be able to handle around 1,000,000,000 * 1,000,000,000 chars, or one quintillion characters, which should be enough for anyone!



I also added some common functions such as fileSave(), length(), substring(), replace(), etc. As with StringBuilder, in-place character writing (mutability) and instant truncation are possible.



Speed-wise, some quick tests show that it's not significantly slower than a StringBuilder when appending (about 33% slower in one test). I got similar performance with a jagged char array (char[][]) instead of List<char[]>, but Lists are simpler to work with, so I stuck with that.



I'm looking for advice to improve performance, particularly for the append function, and for faster reads and writes via the indexer (public char this[long n] { ... }).



// A simplified version specially for StackOverflow / Codereview
public class BigStringBuilder
{
    List<char[]> c = new List<char[]>();
    private int pagedepth;
    private long pagesize;
    private long mpagesize; // https://stackoverflow.com/questions/11040646/faster-modulus-in-c-c
    private int currentPage = 0;
    private int currentPosInPage = 0;

    public BigStringBuilder(int pagedepth = 12) { // pagesize is 2^pagedepth (since it must be a power of 2 for a fast indexer)
        this.pagedepth = pagedepth;
        pagesize = (long)Math.Pow(2, pagedepth);
        mpagesize = pagesize - 1;
        c.Add(new char[pagesize]);
    }

    // Indexer for this class, so you can use convenient square bracket indexing to address char elements within the array!!
    public char this[long n] {
        get { return c[(int)(n >> pagedepth)][n & mpagesize]; }
        set { c[(int)(n >> pagedepth)][n & mpagesize] = value; }
    }

    public string[] returnPagesForTestingPurposes() {
        string[] s = new string[currentPage + 1];
        for (int i = 0; i < currentPage + 1; i++) s[i] = new string(c[i]);
        return s;
    }

    public void clear() {
        c = new List<char[]>();
        c.Add(new char[pagesize]);
        currentPage = 0;
        currentPosInPage = 0;
    }

    // See: https://stackoverflow.com/questions/373365/how-do-i-write-out-a-text-file-in-c-sharp-with-a-code-page-other-than-utf-8/373372
    public void fileSave(string path) {
        StreamWriter sw = File.CreateText(path);
        for (int i = 0; i < currentPage; i++) sw.Write(new string(c[i]));
        sw.Write(new string(c[currentPage], 0, currentPosInPage));
        sw.Close();
    }

    public void fileOpen(string path) {
        clear();
        StreamReader sw = new StreamReader(path);
        int len = 0;
        while ((len = sw.ReadBlock(c[currentPage], 0, (int)pagesize)) != 0) {
            if (!sw.EndOfStream) {
                currentPage++;
                if (currentPage == c.Count) c.Add(new char[pagesize]);
            }
            else {
                currentPosInPage = len;
                break;
            }
        }
        sw.Close();
    }

    public long length() {
        return (long)currentPage * (long)pagesize + (long)currentPosInPage;
    }

    public string ToString(long max = 2000000000) {
        if (length() < max) return substring(0, length());
        else return substring(0, max);
    }

    public string substring(long x, long y) {
        StringBuilder sb = new StringBuilder();
        for (long n = x; n < y; n++) sb.Append(c[(int)(n >> pagedepth)][n & mpagesize]); //8s
        return sb.ToString();
    }

    public bool match(string find, long start = 0) {
        //if (s.Length > length()) return false;
        for (int i = 0; i < find.Length; i++) if (i + start == find.Length || this[start + i] != find[i]) return false;
        return true;
    }

    public void replace(string s, long pos) {
        for (int i = 0; i < s.Length; i++) {
            c[(int)(pos >> pagedepth)][pos & mpagesize] = s[i];
            pos++;
        }
    }

    // Simple implementation of an append() function. Testing shows this to be about
    // as fast or faster than the more sophisticated Append2() function further below
    // despite its simplicity:
    public void Append(string s)
    {
        for (int i = 0; i < s.Length; i++)
        {
            c[currentPage][currentPosInPage] = s[i];
            currentPosInPage++;
            if (currentPosInPage == pagesize)
            {
                currentPosInPage = 0;
                currentPage++;
                if (currentPage == c.Count) c.Add(new char[pagesize]);
            }
        }
    }

    // This method is a more sophisticated version of the Append() function above.
    // Surprisingly, in real-world testing, it doesn't seem to be any faster.
    public void Append2(string s)
    {
        if (currentPosInPage + s.Length <= pagesize)
        {
            // append s entirely to current page
            for (int i = 0; i < s.Length; i++)
            {
                c[currentPage][currentPosInPage] = s[i];
                currentPosInPage++;
            }
        }
        else
        {
            int stringpos;
            int topup = (int)pagesize - currentPosInPage;
            // Finish off current page with substring of s
            for (int i = 0; i < topup; i++)
            {
                c[currentPage][currentPosInPage] = s[i];
                currentPosInPage++;
            }
            currentPage++;
            currentPosInPage = 0;
            stringpos = topup;
            int remainingPagesToFill = (s.Length - topup) >> pagedepth; // We want the floor here
            // fill up complete pages if necessary:
            if (remainingPagesToFill > 0)
            {
                for (int i = 0; i < remainingPagesToFill; i++)
                {
                    if (currentPage == c.Count) c.Add(new char[pagesize]);
                    for (int j = 0; j < pagesize; j++)
                    {
                        c[currentPage][j] = s[stringpos];
                        stringpos++;
                    }
                    currentPage++;
                }
            }
            // finish off remainder of string s on new page:
            if (currentPage == c.Count) c.Add(new char[pagesize]);
            for (int i = stringpos; i < s.Length; i++)
            {
                c[currentPage][currentPosInPage] = s[i];
                currentPosInPage++;
            }
        }
    }
}
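
For illustration, a minimal usage sketch (the values and the file path are arbitrary):

    var bsb = new BigStringBuilder(20);     // pages of 2^20 = 1,048,576 chars each
    bsb.Append("hello, ");
    bsb.Append("world");
    bsb[0] = 'H';                           // in-place mutation through the long indexer
    Console.WriteLine(bsb.length());        // 12
    Console.WriteLine(bsb.substring(0, 5)); // "Hello"
    bsb.fileSave(@"C:\temp\big.txt");       // hypothetical output path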

  • So what kind of crazy stuff does one need such large data types for? – t3chb0t, Mar 7 at 15:22

  • ok... but isn't streaming it easier and faster than loading the entire file into memory? It screams: the XY Problem. Your users are not responsible for you wasting RAM :-P – t3chb0t, Mar 7 at 16:02

  • The question you should be asking is how you can convert this giant CSV more efficiently rather than brute-forcing it into your RAM. – t3chb0t, Mar 7 at 16:08

  • oh boy... this sounds like you're pushing json over csv... this is even more scary than I thought. This entire concept seems to be pretty odd :-| Why don't you do the filtering on the fly? Read, filter, write...? Anyway, have fun with this monster solution ;-] – t3chb0t, Mar 7 at 16:21

  • @DanW: it still sounds like treating the input as one giant string is not the most efficient approach. If you really can't process it in a streaming fashion, then did you look into specialized data structures such as ropes, gap buffers, piece tables, that sort of stuff? – Pieter Witvoet, Mar 7 at 16:32

Tags: c# performance strings pagination

asked Mar 7 at 13:05 by Dan W · edited Mar 7 at 20:06 by Simon Forsberg

2 Answers
    List<char[]> c = new List<char[]>();
    private int pagedepth;
    private long pagesize;
    private long mpagesize; // https://stackoverflow.com/questions/11040646/faster-modulus-in-c-c
    private int currentPage = 0;
    private int currentPosInPage = 0;



Some of these names are rather cryptic. I'm not sure why c isn't private. And surely some of the fields should be readonly?






        pagesize = (long)Math.Pow(2, pagedepth);



IMO it's better style to use 1L << pagedepth.






    public char this[long n] {
        get { return c[(int)(n >> pagedepth)][n & mpagesize]; }
        set { c[(int)(n >> pagedepth)][n & mpagesize] = value; }
    }



Shouldn't this have bounds checks?
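
A minimal sketch of a guarded indexer (the exception type is my choice, not something from the original code):

    public char this[long n] {
        get {
            if (n < 0 || n >= length()) throw new ArgumentOutOfRangeException(nameof(n));
            return c[(int)(n >> pagedepth)][n & mpagesize];
        }
        set {
            if (n < 0 || n >= length()) throw new ArgumentOutOfRangeException(nameof(n));
            c[(int)(n >> pagedepth)][n & mpagesize] = value;
        }
    }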






    public string[] returnPagesForTestingPurposes() {
        string[] s = new string[currentPage + 1];
        for (int i = 0; i < currentPage + 1; i++) s[i] = new string(c[i]);
        return s;
    }



There's no need for this to be public: you can make it internal and give your unit test project access with [assembly:InternalsVisibleTo]. Also, since it's for testing purposes, it could probably be marked [System.Diagnostics.Conditional("DEBUG")].
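
A sketch of how that could look ("MyProject.Tests" is a placeholder for the test assembly's actual name):

    // In the main project, e.g. in AssemblyInfo.cs:
    using System.Runtime.CompilerServices;

    [assembly: InternalsVisibleTo("MyProject.Tests")]

The helper itself can then be declared internal string[] returnPagesForTestingPurposes() instead of public.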






    public void clear() {
        c = new List<char[]>();
        c.Add(new char[pagesize]);



In C# it's conventional for method names to start with an upper case letter.



There's no need to throw quite as much to the garbage collector. Consider as an alternative:



var page0 = c[0];
c.Clear();
c.Add(page0);





    // See: https://stackoverflow.com/questions/373365/how-do-i-write-out-a-text-file-in-c-sharp-with-a-code-page-other-than-utf-8/373372



Why? I don't think it sheds any light on the following method.




    public void fileSave(string path)   {
StreamWriter sw = File.CreateText(path);
for (int i = 0; i < currentPage; i++) sw.Write(new string(c[i]));
sw.Write(new string(c[currentPage], 0, currentPosInPage));
sw.Close();
}



Missing some disposal: I'd use a using statement.



new string(char[]) copies the entire array to ensure that the string is immutable. That's completely unnecessary here: StreamWriter has a method Write(char[], int, int).
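
Combining both points, a possible sketch:

    public void fileSave(string path) {
        using (StreamWriter sw = File.CreateText(path)) {
            // Write each page directly from its char[]; no intermediate string copies.
            for (int i = 0; i < currentPage; i++) sw.Write(c[i], 0, (int)pagesize);
            sw.Write(c[currentPage], 0, currentPosInPage);
        }
    }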






    public void fileOpen(string path)   {
clear();



Yikes! That should be mentioned in the method documentation.




        StreamReader sw = new StreamReader(path);
int len = 0;
while ((len = sw.ReadBlock(c[currentPage], 0, (int)pagesize)) != 0){
if (!sw.EndOfStream) {
currentPage++;
if (currentPage == c.Count) c.Add(new char[pagesize]);
}
else {
currentPosInPage = len;
break;



I think this can give rise to inconsistencies. Other methods seem to assume that if the length of the BigStringBuilder is an exact multiple of pagesize then currentPosInPage == 0 and c[currentPage] is empty, but this can give you currentPosInPage == pagesize and c[currentPage] is full.



This method is also missing disposal.
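
A sketch that addresses both issues, assuming the invariant that a full page always rolls over to a fresh, empty page (ReadBlock only returns fewer than pagesize chars at the end of the stream, so a short read means we're done):

    public void fileOpen(string path) {
        clear();
        using (StreamReader sr = new StreamReader(path)) {
            int len;
            while ((len = sr.ReadBlock(c[currentPage], 0, (int)pagesize)) > 0) {
                currentPosInPage = len;
                if (currentPosInPage == pagesize) {
                    // Full page: roll over so currentPosInPage == pagesize never leaks out.
                    currentPage++;
                    currentPosInPage = 0;
                    if (currentPage == c.Count) c.Add(new char[pagesize]);
                }
            }
        }
    }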






    public long length()    {
return (long)currentPage * (long)pagesize + (long)currentPosInPage;
}



Why is this a method rather than a property? Why use multiplication rather than <<?
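
In other words, something like (expression-bodied syntax needs C# 6; a plain getter works the same on older compilers):

    public long Length => ((long)currentPage << pagedepth) + currentPosInPage;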






    public string substring(long x, long y) {
StringBuilder sb = new StringBuilder();
for (long n = x; n < y; n++) sb.Append(c[(int)(n >> pagedepth)][n & mpagesize]); //8s



What is 8s? Why append one character at a time? StringBuilder also has a method which takes (char[], int, int).
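
A sketch of a page-at-a-time version using that overload (same exclusive-end convention as the original):

    public string substring(long start, long end) {
        // Any single .NET string holds at most ~2^31 chars, so the cast is safe for valid calls.
        StringBuilder sb = new StringBuilder((int)(end - start));
        long n = start;
        while (n < end) {
            int offset = (int)(n & mpagesize);
            int count = (int)Math.Min(pagesize - offset, end - n);
            sb.Append(c[(int)(n >> pagedepth)], offset, count); // bulk copy, one run per page
            n += count;
        }
        return sb.ToString();
    }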






    public bool match(string find, long start = 0)  {
//if (s.Length > length()) return false;
for (int i = 0; i < find.Length; i++) if (i + start == find.Length || this[start + i] != find[i]) return false;
return true;
}



What does this method do? The name implies something regexy, but there's no regex in sight. The implementation looks like StartsWith (by default - the offset complicates it).






    public void replace(string s, long pos) {
for (int i = 0; i < s.Length; i++) {
c[(int)(pos >> pagedepth)][pos & mpagesize] = s[i];
pos++;
}
}



Bounds checks?






    // This method is a more sophisticated version of the Append() function above.
// Surprisingly, in real-world testing, it doesn't seem to be any faster.



I'm not surprised. It's still copying character by character. It's almost certainly faster to use string.CopyTo (thanks to Pieter Witvoet for mentioning this method) or ReadOnlySpan<char>.CopyTo.
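
A sketch of what a block-copying Append could look like with string.CopyTo:

    public void Append(string s) {
        int copied = 0;
        while (copied < s.Length) {
            int room = (int)pagesize - currentPosInPage;
            int count = Math.Min(room, s.Length - copied);
            // Copy a whole chunk of s into the current page in one call.
            s.CopyTo(copied, c[currentPage], currentPosInPage, count);
            copied += count;
            currentPosInPage += count;
            if (currentPosInPage == pagesize) {
                currentPosInPage = 0;
                currentPage++;
                if (currentPage == c.Count) c.Add(new char[pagesize]);
            }
        }
    }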
  • Thanks! Added responses to my main post if you want to look. – Dan W, Mar 7 at 16:26

  • Regarding the last point, there's also string.CopyTo. – Pieter Witvoet, Mar 7 at 16:34

  • C# class instance members are private by default, so c is private. But you're right that it is inconsistent to not explicitly declare it private like the other fields are. docs.microsoft.com/en-us/dotnet/csharp/language-reference/… – BurnsBA, Mar 7 at 19:23

  • @DanW: using translates to a try/finally statement that ensures disposal even when an exception is thrown, so it's almost always the better approach. – Pieter Witvoet, Mar 8 at 10:21

  • By the way, for your info, appending using string.CopyTo is around 3.2x as fast as my original code! Code is: s.CopyTo(0, c[currentPage], currentPosInPage, s.Length); currentPosInPage += s.Length;. Just for fun, I also tried: Array.Copy(s.ToCharArray(), 0, c[currentPage], currentPosInPage, s.Length); currentPosInPage += s.Length;, and that was not quite as good (2.1x faster), but still better than my original code. – Dan W, Mar 8 at 13:33

Design considerations



In the comments you mentioned that you're going to analyze and manipulate this string. If so, that could be a problem, because replacing, inserting and removing text requires moving all subsequent characters to make room or to fill the gap. With this much data, you can expect that to have a major impact on performance. That is why I mentioned data structures like ropes, gap buffers and piece tables: those are often used in text editors to efficiently support these kinds of operations.



But I think that would be solving the wrong problem. From what I understand so far, the rest of your code is built around a 2D array of strings - a table, essentially - that can be converted to a variety of formats. Because it's an array, you need to know the number of columns and rows up-front, so you need to parse the input before you can allocate the array, and then parse it again to fill the array. This BigStringBuilder class allows you to read the whole file into memory, so you don't have to read it from disk twice.



Why not use dynamically resizable arrays (List<T>) instead? That would allow you to read the data directly into your final data structure with only a single parsing pass. If that requires a lot of work, then you're probably suffering from a lack of abstraction. Instead of passing a raw string[,] around, wrap it in a class. That allows you to swap the internal 2D array for a more suitable data structure without having to modify all the code that uses this table.



For example, a List<List<string>> as internal data structure lets you add variable-length rows on-the-fly, so you only need a single parsing pass. That also allows you to read the input file in a streaming fashion instead of having to read it fully into memory first.



If modifying existing code that relies on a 2D string array sounds like a lot of work, consider emulating the 'interface' of a 2D array. With a class that provides a string this[int row, int column] indexer and a GetLength(int dimension) method, chances are that you only need to change the type of a bunch of parameters and variables.
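
For example, a rough sketch of such a wrapper (all names here are illustrative, not from the original code):

    public class StringTable {
        private readonly List<List<string>> rows = new List<List<string>>();

        // Mimics the string[,] indexer so call sites barely change.
        public string this[int row, int column] {
            get { return rows[row][column]; }
            set { rows[row][column] = value; }
        }

        // Mimics Array.GetLength(dimension): 0 = rows, 1 = columns.
        public int GetLength(int dimension) {
            if (dimension == 0) return rows.Count;
            return rows.Count == 0 ? 0 : rows[0].Count;
        }

        // Rows can be appended on the fly during a single parsing pass.
        public void AddRow(List<string> row) { rows.Add(row); }
    }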





Other notes




  • Putting if and for bodies on the same line makes control flow difficult to follow.


  • fileSave and fileOpen are not very flexible. Letting the caller pass in a Stream or TextReader/TextWriter would make them more reusable (see the sketch after this list). That also gives callers control over things like encoding and buffer size. Additionally, overloads like CopyToStream(Stream toStream, long offset, long count) and CopyFromStream(Stream fromStream, long count) are probably also a good idea.


  • c[(int)(n >> pagedepth)][n & mpagesize] is duplicated several times in the code. Use this[n] instead.

  • There's very little documentation, especially for infrastructure-level code like this. I don't know about you, but I tend to forget things about code I wrote a while ago, so documentation that explains why something works the way it does, or how something is intended to be used, is quite useful.

  • I'd recommend using more self-descriptive names, and aiming for consistency with the standard types you're 'emulating':



    • this[n] -> this[index]


    • substring(x, y) -> Substring(startIndex, length)


    • replace(s, pos) -> Overwrite(value, startIndex) (I find replace confusing because it's so different from what string.Replace does)



  • Regarding camelCase vs PascalCase, I don't see why it's important to distinguish between standard library code and your own, but to each their own I guess. But why the inconsistency in ToString and Append?

  • I'd argue that correctness is more important than performance. Leaving out bounds checks should be a last resort, and only when profiling shows that they're actually a bottleneck. Chances are that it's possible to improve the API so you don't need to access individual characters so often.

  • I agree with Peter that fileOpen clearing the string-builder is unexpected behavior, in a negative way. Surely the caller can call clear before fileOpen themselves if that's the behavior they want? Otherwise, I would clearly document this.


  • length being a method is inconsistent with StringBuilder and other collection types. A little bit of calculation in a property should be fine.

  • If you're using returnPagesForTestingPurposes for automated testing then you're testing implementation details, which is a bad idea. If you're using it for debugging, then why not inspect the data with a debugger instead, or use tracepoints?
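
Regarding the fileSave/fileOpen flexibility point above, a sketch of a writer-based overload (WriteTo is my name for it, not from the original code):

    // The caller controls encoding, buffering and the lifetime of the writer.
    public void WriteTo(TextWriter writer) {
        for (int i = 0; i < currentPage; i++) writer.Write(c[i], 0, (int)pagesize);
        writer.Write(c[currentPage], 0, currentPosInPage);
    }

    // The path-based version becomes a thin convenience wrapper:
    public void fileSave(string path) {
        using (StreamWriter sw = File.CreateText(path)) WriteTo(sw);
    }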
  • I manipulate the string so that in-place replacements are either equal in length or shorter, so that reading can continue whilst the same BigStringBuilder is being operated on. Your idea of using List<T> is interesting regardless. Honestly, part of me would love to experiment and refactor the program like you say to compare speed, but time is limited. FYI, string[,] is poor, only allowing 2^28 elements in total; string[][] allows essentially around 2^56 elements, far better. – Dan W, Mar 10 at 7:50

  • Quote: "c[(int)(n >> pagedepth)][n & mpagesize] is duplicated several times in the code": Sadly, I found function calling slower than writing the code directly, possibly due to the compiler not inlining it. Quote: "But why the inconsistency in ToString and Append?": I have less excuse with Append(), but ToString was due to overriding the default ToString method. Quote: "Leaving out bounds checks should be a last resort": Either way, it'll crash and I can easily solve the bug. Besides, the program has already been heavily tested, so safety is less of a concern now. – Dan W, Mar 10 at 7:51

  • So you're using a jagged array then? In that case you could parse the input directly into a List<string[]> and convert it to a string[][] afterwards, so you don't need to modify any of the other code at all. About inlining, you could try marking the indexer's getter and setter with MethodImplOptions.AggressiveInlining. Note that replace can be optimized by using string.CopyTo. About bounds checks, if 'it'll crash' means that you're expecting it to throw an exception anyway, then that's not the case when accessing unused characters in the last page. – Pieter Witvoet, Mar 11 at 8:48

  • One other factor to consider is when the user changes settings after the results are displayed. Reading it all from the HDD into a 2D array would need a second HDD read after the user adjusts a single setting, as it would have to be reparsed. Regarding AggressiveInlining, it seems to do the trick, but requires .NET 4.5, and I fear many of my users are still using .NET 3.5/4. Pretty astonishing that the C# team didn't fix the issue until .NET 4.5+. – Dan W, Mar 11 at 11:18



Your Answer





StackExchange.ifUsing("editor", function () {
return StackExchange.using("mathjaxEditing", function () {
StackExchange.MarkdownEditor.creationCallbacks.add(function (editor, postfix) {
StackExchange.mathjaxEditing.prepareWmdForMathJax(editor, postfix, [["\$", "\$"]]);
});
});
}, "mathjax-editing");

StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");

StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "196"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});

function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: false,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: null,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});


}
});














draft saved

draft discarded


















StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fcodereview.stackexchange.com%2fquestions%2f214917%2fversion-of-c-stringbuilder-to-allow-for-strings-larger-than-2-billion-character%23new-answer', 'question_page');
}
);

Post as a guest















Required, but never shown

























2 Answers
2






active

oldest

votes








2 Answers
2






active

oldest

votes









active

oldest

votes






active

oldest

votes









9












$begingroup$


    List<char> c = new List<char>();
private int pagedepth;
private long pagesize;
private long mpagesize; // https://stackoverflow.com/questions/11040646/faster-modulus-in-c-c
private int currentPage = 0;
private int currentPosInPage = 0;



Some of these names are rather cryptic. I'm not sure why c isn't private. And surely some of the fields should be readonly?






        pagesize = (long)Math.Pow(2, pagedepth);



IMO it's better style to use 1L << pagedepth.






    public char this[long n]    {
get { return c[(int)(n >> pagedepth)][n & mpagesize]; }
set { c[(int)(n >> pagedepth)][n & mpagesize] = value; }
}



Shouldn't this have bounds checks?






    public string returnPagesForTestingPurposes() {
string s = new string[currentPage + 1];
for (int i = 0; i < currentPage + 1; i++) s[i] = new string(c[i]);
return s;
}



There's no need for this to be public: you can make it internal and give your unit test project access with [assembly:InternalsVisibleTo]. Also, since it's for testing purposes, it could probably be marked [System.Diagnostics.Conditional("DEBUG")].






    public void clear() {
c = new List<char>();
c.Add(new char[pagesize]);



In C# it's conventional for method names to start with an upper case letter.



There's no need to throw quite as much to the garbage collector. Consider as an alternative:



var page0 = c[0];
c.Clear();
c.Add(page0);





    // See: https://stackoverflow.com/questions/373365/how-do-i-write-out-a-text-file-in-c-sharp-with-a-code-page-other-than-utf-8/373372



Why? I don't think it sheds any light on the following method.




    public void fileSave(string path)   {
StreamWriter sw = File.CreateText(path);
for (int i = 0; i < currentPage; i++) sw.Write(new string(c[i]));
sw.Write(new string(c[currentPage], 0, currentPosInPage));
sw.Close();
}



Missing some disposal: I'd use a using statement.



new string(char) copies the entire array to ensure that the string is immutable. That's completely unnecessary here: StreamWriter has a method Write(char, int, int).






    public void fileOpen(string path)   {
clear();



Yikes! That should be mentioned in the method documentation.




        StreamReader sw = new StreamReader(path);
int len = 0;
while ((len = sw.ReadBlock(c[currentPage], 0, (int)pagesize)) != 0){
if (!sw.EndOfStream) {
currentPage++;
if (currentPage == c.Count) c.Add(new char[pagesize]);
}
else {
currentPosInPage = len;
break;



I think this can give rise to inconsistencies. Other methods seem to assume that if the length of the BigStringBuilder is an exact multiple of pagesize then currentPosInPage == 0 and c[currentPage] is empty, but this can give you currentPosInPage == pagesize and c[currentPage] is full.



This method is also missing disposal.






    public long length()    {
return (long)currentPage * (long)pagesize + (long)currentPosInPage;
}



Why is this a method rather than a property? Why use multiplication rather than <<?






    public string substring(long x, long y) {
StringBuilder sb = new StringBuilder();
for (long n = x; n < y; n++) sb.Append(c[(int)(n >> pagedepth)][n & mpagesize]); //8s



What is 8s? Why append one character at a time? StringBuilder also has a method which takes (char, int, int).






    public bool match(string find, long start = 0)  {
//if (s.Length > length()) return false;
for (int i = 0; i < find.Length; i++) if (i + start == find.Length || this[start + i] != find[i]) return false;
return true;
}



What does this method do? The name implies something regexy, but there's no regex in sight. The implementation looks like StartsWith (by default - the offset complicates it).






    public void replace(string s, long pos) {
for (int i = 0; i < s.Length; i++) {
c[(int)(pos >> pagedepth)][pos & mpagesize] = s[i];
pos++;
}
}



Bounds checks?






    // This method is a more sophisticated version of the Append() function above.
// Surprisingly, in real-world testing, it doesn't seem to be any faster.



I'm not surprised. It's still copying character by character. It's almost certainly faster to use string.CopyTo (thanks to Pieter Witvoet for mentioning this method) or ReadOnlySpan.CopyTo.






share|improve this answer











$endgroup$













  • $begingroup$
    Thanks! Added responses to my main post if you want to look.
    $endgroup$
    – Dan W
    Mar 7 at 16:26






  • 2




    $begingroup$
    Regarding the last point, there's also string.CopyTo.
    $endgroup$
    – Pieter Witvoet
    Mar 7 at 16:34










  • $begingroup$
    c# class instance members are private by default, so c is private. But you're right that it is inconsistent to not explicitly declare it private like the other fields are. docs.microsoft.com/en-us/dotnet/csharp/language-reference/…
    $endgroup$
    – BurnsBA
    Mar 7 at 19:23






  • 3




    $begingroup$
    @DanW: using translates to a try/finally statement that ensures disposal even when an exception is thrown, so it's almost always the better approach.
    $endgroup$
    – Pieter Witvoet
    Mar 8 at 10:21






  • 1




    $begingroup$
    By the way, for your info, appending using string.CopyTo is around 3.2x as fast as my original code! Code is: s.CopyTo(0, c[currentPage], currentPosInPage, s.Length); currentPosInPage += s.Length;. Just for fun, I also tried: Array.Copy(s.ToCharArray(), 0, c[currentPage], currentPosInPage, s.Length); currentPosInPage += s.Length;, and that was not quite as good (2.1x faster), but still better than my original code.
    $endgroup$
    – Dan W
    Mar 8 at 13:33


















9












$begingroup$


    List<char> c = new List<char>();
private int pagedepth;
private long pagesize;
private long mpagesize; // https://stackoverflow.com/questions/11040646/faster-modulus-in-c-c
private int currentPage = 0;
private int currentPosInPage = 0;



Some of these names are rather cryptic. I'm not sure why c isn't private. And surely some of the fields should be readonly?






        pagesize = (long)Math.Pow(2, pagedepth);



IMO it's better style to use 1L << pagedepth.






    public char this[long n]    {
get { return c[(int)(n >> pagedepth)][n & mpagesize]; }
set { c[(int)(n >> pagedepth)][n & mpagesize] = value; }
}



Shouldn't this have bounds checks?






    public string returnPagesForTestingPurposes() {
string s = new string[currentPage + 1];
for (int i = 0; i < currentPage + 1; i++) s[i] = new string(c[i]);
return s;
}



There's no need for this to be public: you can make it internal and give your unit test project access with [assembly:InternalsVisibleTo]. Also, since it's for testing purposes, it could probably be marked [System.Diagnostics.Conditional("DEBUG")].






    public void clear() {
c = new List<char>();
c.Add(new char[pagesize]);



In C# it's conventional for method names to start with an upper case letter.



There's no need to throw quite as much to the garbage collector. Consider as an alternative:



var page0 = c[0];
c.Clear();
c.Add(page0);





    // See: https://stackoverflow.com/questions/373365/how-do-i-write-out-a-text-file-in-c-sharp-with-a-code-page-other-than-utf-8/373372



Why? I don't think it sheds any light on the following method.




    public void fileSave(string path)   {
StreamWriter sw = File.CreateText(path);
for (int i = 0; i < currentPage; i++) sw.Write(new string(c[i]));
sw.Write(new string(c[currentPage], 0, currentPosInPage));
sw.Close();
}



Missing some disposal: I'd use a using statement.



new string(char) copies the entire array to ensure that the string is immutable. That's completely unnecessary here: StreamWriter has a method Write(char, int, int).






    public void fileOpen(string path)   {
clear();



Yikes! That should be mentioned in the method documentation.




        StreamReader sw = new StreamReader(path);
int len = 0;
while ((len = sw.ReadBlock(c[currentPage], 0, (int)pagesize)) != 0){
if (!sw.EndOfStream) {
currentPage++;
if (currentPage == c.Count) c.Add(new char[pagesize]);
}
else {
currentPosInPage = len;
break;



I think this can give rise to inconsistencies. Other methods seem to assume that if the length of the BigStringBuilder is an exact multiple of pagesize then currentPosInPage == 0 and c[currentPage] is empty, but this can give you currentPosInPage == pagesize and c[currentPage] is full.



This method is also missing disposal.






    public long length()    {
return (long)currentPage * (long)pagesize + (long)currentPosInPage;
}



Why is this a method rather than a property? Why use multiplication rather than <<?






    public string substring(long x, long y) {
StringBuilder sb = new StringBuilder();
for (long n = x; n < y; n++) sb.Append(c[(int)(n >> pagedepth)][n & mpagesize]); //8s



What is 8s? Why append one character at a time? StringBuilder also has a method which takes (char, int, int).






    public bool match(string find, long start = 0) {
        //if (s.Length > length()) return false;
        for (int i = 0; i < find.Length; i++) if (i + start == find.Length || this[start + i] != find[i]) return false;
        return true;
    }



What does this method do? The name implies something regexy, but there's no regex in sight. The implementation looks like StartsWith (by default - the offset complicates it).






    public void replace(string s, long pos) {
        for (int i = 0; i < s.Length; i++) {
            c[(int)(pos >> pagedepth)][pos & mpagesize] = s[i];
            pos++;
        }
    }



Bounds checks?






    // This method is a more sophisticated version of the Append() function above.
// Surprisingly, in real-world testing, it doesn't seem to be any faster.



I'm not surprised. It's still copying character by character. It's almost certainly faster to use string.CopyTo (thanks to Pieter Witvoet for mentioning this method) or ReadOnlySpan<char>.CopyTo.
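
A sketch of what a block-copying Append could look like with string.CopyTo, splitting the input across page boundaries (same fields and invariants as the original class):

public void Append(string s) {
    int copied = 0;
    while (copied < s.Length) {
        // copy as much as fits into the current page
        int room = (int)(pagesize - currentPosInPage);
        int count = Math.Min(room, s.Length - copied);
        s.CopyTo(copied, c[currentPage], currentPosInPage, count);
        copied += count;
        currentPosInPage += count;
        if (currentPosInPage == pagesize) {
            // page full: move to the next page, allocating it if needed
            currentPage++;
            currentPosInPage = 0;
            if (currentPage == c.Count) c.Add(new char[pagesize]);
        }
    }
}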

















$endgroup$

answered Mar 7 at 15:12, edited Mar 8 at 9:14
– Peter Taylor

  • $begingroup$
    Thanks! Added responses to my main post if you want to look.
    $endgroup$
    – Dan W
    Mar 7 at 16:26






  • 2




    $begingroup$
    Regarding the last point, there's also string.CopyTo.
    $endgroup$
    – Pieter Witvoet
    Mar 7 at 16:34










  • $begingroup$
C# class instance members are private by default, so c is private. But you're right that it is inconsistent to not explicitly declare it private like the other fields are. docs.microsoft.com/en-us/dotnet/csharp/language-reference/…
    $endgroup$
    – BurnsBA
    Mar 7 at 19:23






  • 3




    $begingroup$
    @DanW: using translates to a try/finally statement that ensures disposal even when an exception is thrown, so it's almost always the better approach.
    $endgroup$
    – Pieter Witvoet
    Mar 8 at 10:21






  • 1




    $begingroup$
    By the way, for your info, appending using string.CopyTo is around 3.2x as fast as my original code! Code is: s.CopyTo(0, c[currentPage], currentPosInPage, s.Length); currentPosInPage += s.Length;. Just for fun, I also tried: Array.Copy(s.ToCharArray(), 0, c[currentPage], currentPosInPage, s.Length); currentPosInPage += s.Length;, and that was not quite as good (2.1x faster), but still better than my original code.
    $endgroup$
    – Dan W
    Mar 8 at 13:33
















5












$begingroup$

Design considerations



In the comments you mentioned that you're going to analyze and manipulate this string. If so, that could be a problem, because replacing, inserting and removing text requires moving all subsequent characters to make room or to fill the gap. With this much data, you can expect that to have a major impact on performance. That is why I mentioned data structures like ropes, gap buffers and piece tables: those are often used in text editors to efficiently support these kinds of operations.



But I think that would be solving the wrong problem. From what I understand so far, the rest of your code is built around a 2D array of strings - a table, essentially - that can be converted to a variety of formats. Because it's an array, you need to know the number of columns and rows up-front, so you need to parse the input before you can allocate the array, and then parse it again to fill the array. This BigStringBuilder class allows you to read the whole file into memory, so you don't have to read it from disk twice.



Why not use dynamically resizable arrays (List<T>) instead? That would allow you to read the data directly into your final data structure with only a single parsing pass. If that requires a lot of work, then you're probably suffering from a lack of abstraction. Instead of passing a raw string[,] around, wrap it in a class. That allows you to swap the internal 2D array for a more suitable data structure without having to modify all the code that uses this table.



For example, a List<List<string>> as internal data structure lets you add variable-length rows on-the-fly, so you only need a single parsing pass. That also allows you to read the input file in a streaming fashion instead of having to read it fully into memory first.



If modifying existing code that relies on a 2D string array sounds like a lot of work, consider emulating the 'interface' of a 2D array. With a class that provides a string this[int row, int column] indexer and a GetLength(int dimension) method, chances are that you only need to change the type of a bunch of parameters and variables.
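
A sketch of such a wrapper (hypothetical names; GetLength(1) assumes rectangular data):

public class StringTable {
    private readonly List<List<string>> rows = new List<List<string>>();

    public string this[int row, int column] {
        get { return rows[row][column]; }
        set { rows[row][column] = value; }
    }

    public int GetLength(int dimension) {
        // dimension 0 = rows, dimension 1 = columns (assumes all rows have equal length)
        if (dimension == 0) return rows.Count;
        return rows.Count == 0 ? 0 : rows[0].Count;
    }

    public void AddRow(List<string> row) { rows.Add(row); }
}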





Other notes




  • Putting if and for bodies on the same line makes control flow difficult to follow.


  • fileSave and fileOpen are not very flexible. Letting the caller pass in a Stream or TextReader/TextWriter would make them more reusable (see the sketch after this list). That also gives callers control over things like encoding and buffer size. Additionally, overloads like CopyToStream(Stream toStream, long offset, long count) and CopyFromStream(Stream fromStream, long count) are probably also a good idea.


  • c[(int)(n >> pagedepth)][n & mpagesize] is duplicated several times in the code. Use this[n] instead.

  • There's very little documentation, especially for infrastructure-level code like this. I don't know about you, but I tend to forget things about code I wrote a while ago, so documentation that explains why something works the way it does, or how something is intended to be used, is quite useful.

  • I'd recommend using more self-descriptive names, and aiming for consistency with the standard types you're 'emulating':



    • this[n] -> this[index]


    • substring(x, y) -> Substring(startIndex, length)


    • replace(s, pos) -> Overwrite(value, startIndex) (I find replace confusing because it's so different from what string.Replace does)



  • Regarding camelCase vs PascalCase, I don't see why it's important to distinguish between standard library code and your own, but to each their own I guess. But why the inconsistency in ToString and Append?

  • I'd argue that correctness is more important than performance. Leaving out bounds checks should be a last resort, and only when profiling shows that they're actually a bottleneck. Chances are that it's possible to improve the API so you don't need to access individual characters so often.

  • I agree with Peter that fileOpen clearing the string-builder is unexpected behavior, in a negative way. Surely the caller can call clear before fileOpen themselves if that's the behavior they want? Otherwise, I would clearly document this.


  • length being a method is inconsistent with StringBuilder and other collection types. A little bit of calculation in a property should be fine.

  • If you're using returnPagesForTestingPurposes for automated testing then you're testing implementation details, which is a bad idea. If you're using it for debugging, then why not inspect the data with a debugger instead, or use tracepoints?
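
As promised above, a sketch of the writer-based variant (names are illustrative, not from the original code):

// Let the caller decide where the text goes and how it's encoded.
public void WriteTo(TextWriter writer) {
    for (int i = 0; i < currentPage; i++) writer.Write(c[i], 0, (int)pagesize);
    writer.Write(c[currentPage], 0, currentPosInPage);
}

// fileSave then becomes a thin convenience wrapper.
public void fileSave(string path) {
    using (StreamWriter writer = File.CreateText(path)) WriteTo(writer);
}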















$endgroup$

answered Mar 8 at 19:29
– Pieter Witvoet

  • $begingroup$
    I manipulate the string so that in-place replacements are either equal in length, or shorter, so that reading can continue whilst the same BigStringBuilder is being operated on. Your idea of using List<T> is interesting regardless. Honestly, a part of me would love to experiment and refactor the program like you say to compare speed, but time is limited. FYI, string[,] is poor, only allowing for 2^28 elements in total. string[][] allows essentially around 2^56 elements, far better.
    $endgroup$
    – Dan W
    Mar 10 at 7:50










  • $begingroup$
    Quote: "c[(int)(n >> pagedepth)][n & mpagesize] is duplicated several times in the code": Sadly, I found function calling slower than writing the code direct, possibly due to the compiler not inlining it. Quote: "But why the inconsistency in ToString and Append?": I have less excuse with Append(), but ToString was due to overriding the default ToString method. Quote: "Leaving out bounds checks should be a last resort": Either way, it'll crash and I can easily solve the bug. Besides, the program has already been heavily tested, so safety is less of a concern now.
    $endgroup$
    – Dan W
    Mar 10 at 7:51










  • $begingroup$
    So you're using a jagged array then? In that case you could parse the input directly into a List<string[]> and convert it to a string[][] afterwards, so you don't need to modify any of the other code at all. About inlining, you could try marking the indexer's getter and setter with MethodImplOptions.AggressiveInlining. Note that replace can be optimized by using string.CopyTo. About bounds checks, if 'it'll crash' means that you're expecting it to throw an exception anyway, then that's not the case when accessing unused characters in the last page.
    $endgroup$
    – Pieter Witvoet
    Mar 11 at 8:48










  • $begingroup$
    One other factor to consider is when the user changes settings after the results are displayed. Reading it all out from the HDD into a 2D array would need a second HDD read after the user adjusts a single setting, as it would need to be reparsed. Regarding AggressiveInlining, it seems to do the trick, but requires .NET 4.5, and I fear many of my users are still using .NET 3.5/4. Pretty astonishing that the C# team didn't fix the issue until .NET 4.5+.
    $endgroup$
    – Dan W
    Mar 11 at 11:18

















