Cfrgtkky

Question

I have taken a cumulative sum of data that runs from 198x to 2016, and it is now in the form:

State   Year    Month   Value
TN      1987    1       24410.0
TN      1987    2       24410.0
TN      1987    3       24410.0
TN      1987    4       24410.0
.
.
TN      1996    1       24410.0
TN      1996    2       24410.0
TN      1996    3       24410.0
TN      1996    4       24410.0
TN      1996    5       37109.0
TN      1996    6       37109.0
TN      1996    7       37109.0
TN      1996    8       37109.0
TN      1996    9       37109.0
TN      1996    10      37109.0
TN      1996    11      37109.0
TN      1996    12      37109.0
TN      2016    1       49808.0
TN      2016    2       49808.0

The data actually skips from 1996 to 2016 (in the case of TN; the gaps vary from state to state). I need a general method to fill all the holes in the data, because some years simply don't exist (2010-2015), and I want the filled output to run all the way to 2018.

I want each missing row to be filled with the most recent preceding value, to get output that looks like:

TN      1996    4       24410.0
TN      1996    5       37109.0
TN      1996    6       37109.0
.
.
TN      2010    1       37109.0
TN      2010    2       37109.0
TN      2010    3       37109.0
.
.
TN      2016    1       37109.0
TN      2016    2       37109.0
.
.
TN      2016    11      49808.0
TN      2016    12      49808.0
.
.
TN      2017    1       49808.0
TN      2017    2       49808.0
TN      2017    3       49808.0
TN      2017    4       49808.0
.
.
TN      2018    1       49808.0
TN      2018    2       49808.0

Have you tried any methods yet? Look at fillna() (pandas.pydata.org/pandas-docs/stable/generated/…) — Nov 13 at 17:36

kon_u 1616 · Answer 1 · 2018-11-13 18:00:02Z

How about DataFrame.interpolate in pandas? It fills in missing values according to one of several interpolation methods.

See section 'interpolate' here: https://pandas.pydata.org/pandas-docs/stable/missing_data.html

There is also an existing example in a previous post: Pandas interpolate() backwards in dataframe
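To make the suggestion concrete, here is a minimal sketch comparing interpolate() with a plain forward fill. The series `s` is hypothetical (it only borrows the three values from the question). The point it illustrates is that the default linear method ramps between known points, while the step-like output the question asks for corresponds to a forward fill.

```python
import numpy as np
import pandas as pd

# Hypothetical cumulative series with gaps (NaN) between known values.
s = pd.Series([24410.0, np.nan, np.nan, 37109.0, np.nan, 49808.0])

linear = s.interpolate()   # default method='linear': ramps between known points
stepped = s.ffill()        # forward fill: repeats the last known value

print(pd.DataFrame({"raw": s, "linear": linear, "ffill": stepped}))
```

For a cumulative total that should stay flat until the next observation, the `ffill` column is probably the one that matches the desired output.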

score 0 · Answer 2 · 2018-11-13 18:04:31Z

You can create a dataframe with the missing months and then merge your result with it:

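# Month starts from the first year in the data up to (but excluding) the end date.
dates = pd.date_range(start='1/1/%d' % df['Year'].min(),
                      end='1/08/%d' % df['Year'].max(),
                      freq='MS', closed='left')  # pandas >= 2.0: use inclusive='left'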



>> dates



DatetimeIndex(['1987-02-01', '1987-03-01', '1987-04-01', '1987-05-01',
               '1987-06-01', '1987-07-01', '1987-08-01', '1987-09-01',
               '1987-10-01', '1987-11-01',
               ...
               '2015-04-01', '2015-05-01', '2015-06-01', '2015-07-01',
               '2015-08-01', '2015-09-01', '2015-10-01', '2015-11-01',
               '2015-12-01', '2016-01-01'],
              dtype='datetime64[ns]', length=348, freq='MS')

Then you can create a dataframe with all the months:

all_months = pd.DataFrame.from_records((dates.year, dates.month),
      index=['Year', 'Month']).T.sort_values(by=['Year', 'Month'])
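As a side note, the same Year/Month frame can be built directly from a dict of columns, which avoids the transpose. This is only a sketch: `first_year` and `last_year` are hypothetical stand-ins for `df['Year'].min()` and `df['Year'].max()`, and the range is assumed to cover whole calendar years.

```python
import pandas as pd

# Hypothetical bounds; in the answer they come from df['Year'].min() / .max().
first_year, last_year = 1987, 2016

# One timestamp per month start, January of the first year through December of the last.
dates = pd.date_range(start=f"{first_year}-01-01", end=f"{last_year}-12-01", freq="MS")

# date_range is already in calendar order, so no extra sort is needed.
all_months = pd.DataFrame({"Year": dates.year, "Month": dates.month})
print(all_months.head())
```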

And then merge it with the original dataframe and forward-fill it:

df.merge(all_months, how='right').ffill()



    State    Year  Month    Value
0      TN  1987.0    1.0  24410.0
1      TN  1987.0    2.0  24410.0
2      TN  1987.0    3.0  24410.0
3      TN  1987.0    4.0  24410.0
4      TN  1996.0    1.0  24410.0
5      TN  1996.0    2.0  24410.0
6      TN  1996.0    3.0  24410.0
7      TN  1996.0    4.0  24410.0
8      TN  1996.0    5.0  37109.0
9      TN  1996.0    6.0  37109.0
10     TN  1996.0    7.0  37109.0
11     TN  1996.0    8.0  37109.0
12     TN  1996.0    9.0  37109.0
13     TN  1996.0   10.0  37109.0
14     TN  1996.0   11.0  37109.0
15     TN  1996.0   12.0  37109.0
16     TN  2016.0    1.0  49808.0
17     TN  1987.0    5.0  49808.0
18     TN  1987.0    6.0  49808.0
19     TN  1987.0    7.0  49808.0
20     TN  1987.0    8.0  49808.0
21     TN  1987.0    9.0  49808.0
22     TN  1987.0   10.0  49808.0
23     TN  1987.0   11.0  49808.0
24     TN  1987.0   12.0  49808.0
25     TN  1988.0    1.0  49808.0
26     TN  1988.0    2.0  49808.0
27     TN  1988.0    3.0  49808.0
28     TN  1988.0    4.0  49808.0
29     TN  1988.0    5.0  49808.0
..    ...     ...    ...      ...
319    TN  2013.0    7.0  49808.0
320    TN  2013.0    8.0  49808.0
321    TN  2013.0    9.0  49808.0
322    TN  2013.0   10.0  49808.0
323    TN  2013.0   11.0  49808.0
324    TN  2013.0   12.0  49808.0
325    TN  2014.0    1.0  49808.0
326    TN  2014.0    2.0  49808.0
327    TN  2014.0    3.0  49808.0
328    TN  2014.0    4.0  49808.0
329    TN  2014.0    5.0  49808.0
330    TN  2014.0    6.0  49808.0
331    TN  2014.0    7.0  49808.0
332    TN  2014.0    8.0  49808.0
333    TN  2014.0    9.0  49808.0
334    TN  2014.0   10.0  49808.0
335    TN  2014.0   11.0  49808.0
336    TN  2014.0   12.0  49808.0
337    TN  2015.0    1.0  49808.0
338    TN  2015.0    2.0  49808.0
339    TN  2015.0    3.0  49808.0
340    TN  2015.0    4.0  49808.0
341    TN  2015.0    5.0  49808.0
342    TN  2015.0    6.0  49808.0
343    TN  2015.0    7.0  49808.0
344    TN  2015.0    8.0  49808.0
345    TN  2015.0    9.0  49808.0
346    TN  2015.0   10.0  49808.0
347    TN  2015.0   11.0  49808.0
348    TN  2015.0   12.0  49808.0
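Note that in the output above the unmatched months are appended after the original rows, so the forward fill copies 49808.0 into months as early as 1987. One way to avoid this is to put the rows in calendar order before filling, and to fill within each state separately. The following is a sketch under stated assumptions rather than the answer's own method: the two-state sample frame (including the "KY" rows), the 2018 end year taken from the question, and the use of reindex on a full (State, Year, Month) grid in place of merge are all illustrative choices.

```python
import pandas as pd

# Small hypothetical sample in the question's layout, with two states.
df = pd.DataFrame({
    "State": ["TN", "TN", "TN", "KY", "KY"],
    "Year":  [1996, 1996, 2016, 1996, 2017],
    "Month": [4, 5, 1, 1, 6],
    "Value": [24410.0, 37109.0, 49808.0, 100.0, 250.0],
})

# Every (State, Year, Month) combination from the first observed year through 2018.
grid = pd.MultiIndex.from_product(
    [sorted(df["State"].unique()), range(df["Year"].min(), 2019), range(1, 13)],
    names=["State", "Year", "Month"],
)

filled = (
    df.set_index(["State", "Year", "Month"])
      .reindex(grid)          # missing months appear as NaN, already in calendar order
      .reset_index()
)
# Forward-fill within each state so values never leak from one state to the next.
filled["Value"] = filled.groupby("State")["Value"].ffill()

print(filled[(filled["State"] == "TN") & (filled["Year"] == 2018)])
```

Months before a state's first observation stay NaN here, since there is no earlier value to carry forward.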

Using pandas.resample

Another solution is to index the data by date and then resample it:

df['Day'] = 1



df1 = df.assign(date=lambda x: pd.to_datetime(x[['Year', 'Month', 'Day']])).set_index('date')



>> df1



           State    Year  Month    Value  Day
date
1987-01-01    TN  1987.0    1.0  24410.0    1
1987-02-01    TN  1987.0    2.0  24410.0    1
1987-03-01    TN  1987.0    3.0  24410.0    1
1987-04-01    TN  1987.0    4.0  24410.0    1
1996-01-01    TN  1996.0    1.0  24410.0    1
1996-02-01    TN  1996.0    2.0  24410.0    1
1996-03-01    TN  1996.0    3.0  24410.0    1
1996-04-01    TN  1996.0    4.0  24410.0    1
1996-05-01    TN  1996.0    5.0  37109.0    1
1996-06-01    TN  1996.0    6.0  37109.0    1
1996-07-01    TN  1996.0    7.0  37109.0    1
1996-08-01    TN  1996.0    8.0  37109.0    1
1996-09-01    TN  1996.0    9.0  37109.0    1
1996-10-01    TN  1996.0   10.0  37109.0    1
1996-11-01    TN  1996.0   11.0  37109.0    1
1996-12-01    TN  1996.0   12.0  37109.0    1
2016-01-01    TN  2016.0    1.0  49808.0    1
2016-02-01    TN  2016.0    2.0  49808.0    1

Then you can resample it by month by doing:

    res = df1.resample('M').first().ffill()  # pandas >= 2.2: use 'ME' instead of 'M'



    >> res 



               State    Year  Month    Value  Day
    date
    1987-01-31    TN  1987.0    1.0  24410.0  1.0
    1987-02-28    TN  1987.0    2.0  24410.0  1.0
    1987-03-31    TN  1987.0    3.0  24410.0  1.0
    1987-04-30    TN  1987.0    4.0  24410.0  1.0
    1987-05-31    TN  1987.0    4.0  24410.0  1.0
    1987-06-30    TN  1987.0    4.0  24410.0  1.0
    1987-07-31    TN  1987.0    4.0  24410.0  1.0
    1987-08-31    TN  1987.0    4.0  24410.0  1.0
    1987-09-30    TN  1987.0    4.0  24410.0  1.0
    1987-10-31    TN  1987.0    4.0  24410.0  1.0
    1987-11-30    TN  1987.0    4.0  24410.0  1.0
    1987-12-31    TN  1987.0    4.0  24410.0  1.0
    1988-01-31    TN  1987.0    4.0  24410.0  1.0
    1988-02-29    TN  1987.0    4.0  24410.0  1.0
    1988-03-31    TN  1987.0    4.0  24410.0  1.0
    1988-04-30    TN  1987.0    4.0  24410.0  1.0
    1988-05-31    TN  1987.0    4.0  24410.0  1.0
    1988-06-30    TN  1987.0    4.0  24410.0  1.0
    1988-07-31    TN  1987.0    4.0  24410.0  1.0
    1988-08-31    TN  1987.0    4.0  24410.0  1.0
    1988-09-30    TN  1987.0    4.0  24410.0  1.0
    1988-10-31    TN  1987.0    4.0  24410.0  1.0
    1988-11-30    TN  1987.0    4.0  24410.0  1.0
    1988-12-31    TN  1987.0    4.0  24410.0  1.0
    1989-01-31    TN  1987.0    4.0  24410.0  1.0
    1989-02-28    TN  1987.0    4.0  24410.0  1.0
    1989-03-31    TN  1987.0    4.0  24410.0  1.0
    1989-04-30    TN  1987.0    4.0  24410.0  1.0
    1989-05-31    TN  1987.0    4.0  24410.0  1.0
    1989-06-30    TN  1987.0    4.0  24410.0  1.0
    ...          ...     ...    ...      ...  ...
    2013-09-30    TN  1996.0   12.0  37109.0  1.0
    2013-10-31    TN  1996.0   12.0  37109.0  1.0
    2013-11-30    TN  1996.0   12.0  37109.0  1.0
    2013-12-31    TN  1996.0   12.0  37109.0  1.0
    2014-01-31    TN  1996.0   12.0  37109.0  1.0
    2014-02-28    TN  1996.0   12.0  37109.0  1.0
    2014-03-31    TN  1996.0   12.0  37109.0  1.0
    2014-04-30    TN  1996.0   12.0  37109.0  1.0
    2014-05-31    TN  1996.0   12.0  37109.0  1.0
    2014-06-30    TN  1996.0   12.0  37109.0  1.0
    2014-07-31    TN  1996.0   12.0  37109.0  1.0
    2014-08-31    TN  1996.0   12.0  37109.0  1.0
    2014-09-30    TN  1996.0   12.0  37109.0  1.0
    2014-10-31    TN  1996.0   12.0  37109.0  1.0
    2014-11-30    TN  1996.0   12.0  37109.0  1.0
    2014-12-31    TN  1996.0   12.0  37109.0  1.0
    2015-01-31    TN  1996.0   12.0  37109.0  1.0
    2015-02-28    TN  1996.0   12.0  37109.0  1.0
    2015-03-31    TN  1996.0   12.0  37109.0  1.0
    2015-04-30    TN  1996.0   12.0  37109.0  1.0
    2015-05-31    TN  1996.0   12.0  37109.0  1.0
    2015-06-30    TN  1996.0   12.0  37109.0  1.0
    2015-07-31    TN  1996.0   12.0  37109.0  1.0
    2015-08-31    TN  1996.0   12.0  37109.0  1.0
    2015-09-30    TN  1996.0   12.0  37109.0  1.0
    2015-10-31    TN  1996.0   12.0  37109.0  1.0
    2015-11-30    TN  1996.0   12.0  37109.0  1.0
    2015-12-31    TN  1996.0   12.0  37109.0  1.0
    2016-01-31    TN  2016.0    1.0  49808.0  1.0
    2016-02-29    TN  2016.0    2.0  49808.0  1.0

You can get the original structure by doing:

>> res.reset_index(drop=True).drop(['Day'], axis=1).head(9)



        State    Year  Month    Value
    0      TN  1987.0    1.0  24410.0
    1      TN  1987.0    2.0  24410.0
    2      TN  1987.0    3.0  24410.0
    3      TN  1987.0    4.0  24410.0
    4      TN  1987.0    4.0  24410.0
    5      TN  1987.0    4.0  24410.0
    6      TN  1987.0    4.0  24410.0
    7      TN  1987.0    4.0  24410.0
    8      TN  1987.0    4.0  24410.0
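In the output above the Year and Month columns are forward-filled along with Value, so the filled rows keep repeating 1987 and 4 instead of showing their own date. A possible variation is to fill only State and Value and then derive Year and Month from the date index. The sketch below makes several assumptions: the three-row sample frame is hypothetical, the end date of December 2018 is taken from the question, and reindex over a month-start range is used in place of resample so the index can extend past the last observation.

```python
import pandas as pd

# Hypothetical single-state sample in the question's layout.
df = pd.DataFrame({
    "State": ["TN", "TN", "TN"],
    "Year":  [1996, 1996, 2016],
    "Month": [4, 5, 1],
    "Value": [24410.0, 37109.0, 49808.0],
})

# Build a month-start DatetimeIndex from the Year and Month columns.
idx = pd.to_datetime(df[["Year", "Month"]].assign(Day=1))
series = df.set_index(idx)[["State", "Value"]]

# Target index: every month start from the first observation through December 2018.
full = pd.date_range(series.index.min(), "2018-12-01", freq="MS")
res = series.reindex(full).ffill()

# Derive Year and Month from the index so they stay correct on the filled rows.
res["Year"] = res.index.year
res["Month"] = res.index.month
res = res.reset_index(drop=True)[["State", "Year", "Month", "Value"]]
print(res.tail())
```

With several states, the same steps would need to run once per state (for example inside a groupby on State) so that one state's values are not carried into another's.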


Filling in past and future data from partial data in Python